JP2010170352A - Pattern recognition dictionary creation device and program

Info

Publication number: JP2010170352A
Application number: JP2009012503A
Authority: JP
Inventors: Yutaka Katsuyama; 裕勝山; Akihiro Minagawa; 明洋皆川; Yoshinobu Hotta; 悦伸堀田
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 2009-01-23
Filing date: 2009-01-23
Publication date: 2010-08-05
Anticipated expiration: 2029-01-23
Also published as: JP5343579B2

Abstract

PROBLEM TO BE SOLVED: To provide a pattern recognition dictionary creation device that improves the recognition accuracy of image patterns even when the storage capacity available for the dictionary is small.

SOLUTION: The pattern recognition dictionary creation device includes: a feature vector generation part 111 that generates feature vectors by calculating feature values of an input image; a grouping part 114 that compares the feature vectors of the respective images and gathers highly similar images into a group; a principal component analysis part 113 that generates, based on the feature vectors generated by the feature vector generation part, a plurality of feature data containing information on the pixel distribution of the images; and a dictionary generating/registering part 115 that, when creating a dictionary for pattern recognition, causes a storage device 23 to store data such that the amount of feature data for images contained in a group is larger than the amount of feature data for images not contained in any group.

COPYRIGHT: (C)2010, JPO&INPIT

Description

本発明は、パターン認識辞書作成装置及びプログラムに関する。 The present invention relates to a pattern recognition dictionary creation device and a program.

近年、携帯電話、ＰＤＡ（Personal Digital Assistants）、カーナビゲーション装置等のメモリの容量が小さい装置で文字認識に代表されるパターン認識が行われるようになってきた。メモリ容量の小さい装置では、パターン認識に用いる辞書の容量と、パターン認識の精度とが課題となる。すなわち、認識精度をあまり低下させず、辞書容量が大幅に削減できるパターン認識辞書が求められている。 In recent years, pattern recognition typified by character recognition has been performed by devices having a small memory capacity such as mobile phones, PDAs (Personal Digital Assistants), and car navigation devices. In an apparatus having a small memory capacity, the capacity of a dictionary used for pattern recognition and the accuracy of pattern recognition become problems. That is, there is a need for a pattern recognition dictionary that can significantly reduce the dictionary capacity without significantly reducing recognition accuracy.

例えば、文字認識であれば、まず、各文字の標準的な文字パターンに基づいて求めた特徴ベクトルを認識用辞書に登録しておく（特徴ベクトルについては後述する）。そして、入力した文字パターンに基づいて求めた特徴ベクトルと、認識用辞書に登録された各特徴ベクトルとのベクトル間距離を計算し、ベクトル間距離が最も近い文字を認識結果とする手法が通常の手法である。
また、距離計算の方法として、１次識別手法と２次識別手法とがある。一次識別手法は、認識用辞書に登録された各文字の特徴ベクトルと、入力した文字パターンに基づいて求めた特徴ベクトルとのユークリッド距離を求める方法である。また、２次識別手法では、マハラノビス距離、擬似ベイズ距離識別等の手法を用いて、文字パターンの画素分布を表す固有値、固有ベクトルを求め、これらの値を認識用辞書に登録しておく。そして、入力した文字パターンから求めた固有値、固有ベクトルと、認識用辞書に登録された固有値、固有ベクトルとを比較して、入力した文字パターンを認識する方法である（例えば、特許文献１参照）。 For example, in character recognition, a feature vector obtained from a standard character pattern of each character is first registered in a recognition dictionary (feature vectors are described later). The usual method then calculates the inter-vector distance between the feature vector obtained from the input character pattern and each feature vector registered in the recognition dictionary, and takes the character with the smallest inter-vector distance as the recognition result.
Further, as a distance calculation method, there are a primary identification method and a secondary identification method. The primary identification method is a method for obtaining the Euclidean distance between the feature vector of each character registered in the recognition dictionary and the feature vector obtained based on the inputted character pattern. In the secondary identification method, eigenvalues and eigenvectors representing the pixel distribution of the character pattern are obtained using techniques such as Mahalanobis distance and pseudo Bayes distance identification, and these values are registered in the recognition dictionary. The eigenvalue and eigenvector obtained from the input character pattern are compared with the eigenvalue and eigenvector registered in the recognition dictionary to recognize the input character pattern (see, for example, Patent Document 1).
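The primary identification method described above can be sketched as a nearest-neighbor search under Euclidean distance. This is a minimal illustration only: the function names `euclidean_distance` and `recognize`, the dictionary layout, and the three-dimensional toy vectors are assumptions made for this example and do not come from the specification.

```python
import math

def euclidean_distance(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def recognize(input_vector, dictionary):
    """Return the character code whose registered vector is closest to the input.

    `dictionary` maps a character code to its registered feature vector
    (hypothetical layout, chosen for this sketch).
    """
    return min(dictionary,
               key=lambda code: euclidean_distance(input_vector, dictionary[code]))

# Toy dictionary with made-up 3-dimensional feature vectors.
toy_dictionary = {
    "A": [1.0, 0.0, 0.0],
    "B": [0.0, 1.0, 0.0],
    "C": [0.0, 0.0, 1.0],
}
result = recognize([0.9, 0.2, 0.1], toy_dictionary)
print(result)
```

Because only one vector per character is stored, the dictionary stays small, which is the trade-off against accuracy that the following paragraphs discuss.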

特開２００３−６７７４３号公報 JP 2003-67743 A

１次識別手法では、辞書として各文字コードの特徴ベクトルを登録しておけばよいので辞書容量は小さいが、パターン認識の精度が低いという課題がある。また、２次識別手法では、パターン認識の精度が高いが、文字パターンの画素分布を表す固有値、固有ベクトルを辞書に登録しなければならないため大きな辞書容量が必要となった。従って、メモリ容量の小さな装置では、パターン認識の精度を犠牲にして１次識別手法を使用していた。 In the primary identification method, since the feature vector of each character code has only to be registered as a dictionary, the dictionary capacity is small, but there is a problem that the accuracy of pattern recognition is low. The secondary identification method has high pattern recognition accuracy, but requires a large dictionary capacity because eigenvalues and eigenvectors representing pixel distributions of character patterns must be registered in the dictionary. Therefore, in a device with a small memory capacity, the primary identification method is used at the expense of pattern recognition accuracy.
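To give a rough feel for the capacity gap between the two methods, the following sketch counts the numeric values each dictionary would hold. All figures (4000 characters, 200 dimensions, 10 eigenpairs per character) are hypothetical, and storing a mean vector alongside the eigenvalues and eigenvectors in the secondary dictionary is an assumption of this example, not something the text above states.

```python
def primary_dictionary_values(num_characters, dimension):
    """Primary method: one feature vector of `dimension` values per character."""
    return num_characters * dimension

def secondary_dictionary_values(num_characters, dimension, num_eigen):
    """Secondary method: per character, one mean vector (assumed) plus
    `num_eigen` eigenvalues and `num_eigen` eigenvectors of `dimension` values."""
    per_character = dimension + num_eigen * (1 + dimension)
    return num_characters * per_character

# Hypothetical sizes chosen only for illustration.
primary = primary_dictionary_values(4000, 200)
secondary = secondary_dictionary_values(4000, 200, 10)
print(primary, secondary, secondary / primary)
```

Under these assumed sizes the secondary dictionary holds roughly eleven times as many values as the primary one.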

本発明は上記事情に鑑みてなされたものであり、辞書を記憶する記憶容量が小さな装置であっても、画像パターンの認識精度を向上させることができるパターン認識辞書作成装置及びプログラムを提供することを目的とする。 The present invention has been made in view of the above circumstances, and its object is to provide a pattern recognition dictionary creation device and program that can improve the recognition accuracy of image patterns even in a device with a small storage capacity for the dictionary.

本明細書に開示のパターン認識辞書作成装置は、入力画像の画像パターンに基づく特徴量を求めて、該特徴量を要素とする特徴ベクトルを生成する特徴ベクトル生成手段と、各画像の特徴ベクトル同士を比較して類似度の高い画像を抽出し、前記類似度の高い画像をグループとしてまとめるグループ化手段と、前記特徴ベクトル生成手段で生成された特徴ベクトルに基づいて、画像の画素分布の情報を含む複数の特徴データを生成する特徴データ生成手段と、各画像を識別する識別情報と、該各画像の特徴データとを記憶手段に記憶させてパターン認識用辞書を作成する場合に、前記グループに含まれない画像の特徴データのデータ量よりも前記グループに含まれる画像の特徴データのデータ量が多くなるように前記パターン認識用辞書を作成し、該作成したパターン認識用辞書を前記記憶手段に記憶させる認識用辞書作成手段とを備えている。 The pattern recognition dictionary creation device disclosed in this specification comprises: feature vector generation means for obtaining feature amounts based on the image pattern of an input image and generating a feature vector having the feature amounts as elements; grouping means for comparing the feature vectors of the images with one another, extracting images with high similarity, and gathering the highly similar images into a group; feature data generation means for generating, based on the feature vectors generated by the feature vector generation means, a plurality of feature data containing information on the pixel distribution of the images; and recognition dictionary creation means for, when creating a pattern recognition dictionary by storing identification information identifying each image together with the feature data of each image in storage means, creating the pattern recognition dictionary so that the amount of feature data for images included in a group is larger than the amount of feature data for images not included in any group, and storing the created pattern recognition dictionary in the storage means.

本明細書に開示のパターン認識辞書作成装置によれば、パターン認識用辞書を記憶する記憶容量が小さな装置であっても、画像パターンの認識精度を向上させる認識辞書を作成することができる。 According to the pattern recognition dictionary creation device disclosed in this specification, a recognition dictionary that improves the recognition accuracy of an image pattern can be created even with a device having a small storage capacity for storing a pattern recognition dictionary.

パターン認識辞書作成装置の構成を示す図である。 It is a diagram showing the configuration of the pattern recognition dictionary creation device.
辞書作成部の構成を示す図である。 It is a diagram showing the configuration of the dictionary creation unit.
グループ化について説明するために用意された文字と、その平均ベクトルとを示す図である。 It is a diagram showing the characters prepared to explain grouping, and their average vectors.
（Ａ）は、各文字の平均ベクトルに基づいてベクトル空間に文字を配置した様子を示す図であり、（Ｂ）は、１回目の階層的クラスタリングの結果を示す図である。 (A) is a diagram showing the characters arranged in the vector space based on the average vector of each character, and (B) is a diagram showing the result of the first round of hierarchical clustering.
（Ａ）は、２回目の階層的クラスタリングの結果を示す図であり、（Ｂ）は、３回目の階層的クラスタリングの結果を示す図である。 (A) is a diagram showing the result of the second round of hierarchical clustering, and (B) is a diagram showing the result of the third round.
（Ａ）は、４回目の階層的クラスタリングの結果を示す図であり、（Ｂ）は、５回目の階層的クラスタリングの結果を示す図である。 (A) is a diagram showing the result of the fourth round of hierarchical clustering, and (B) is a diagram showing the result of the fifth round.
６回目の階層的クラスタリングの結果を示す図である。 It is a diagram showing the result of the sixth round of hierarchical clustering.
インデックステーブルの構成の一例を示す図である。 It is a diagram showing an example of the structure of the index table.
コードブックテーブルの構成の一例を示す図である。 It is a diagram showing an example of the structure of the codebook table.
（Ａ）は、固有ベクトルの要素列のうち、部分一致するサブベクトルを抽出する様子を示す図であり、(Ｂ)は、抽出したサブベクトルにインデックス番号を付与してコードブックテーブルに登録する例を示す図であり、（Ｃ）は、固有ベクトルの要素列を、誤差を許容して部分一致するサブベクトルを抽出する様子を示す図であり、（Ｄ）は、抽出したサブベクトルにインデックス番号を付与してコードブックテーブルに登録する例を示す図である。 (A) is a diagram showing how partially matching subvectors are extracted from the element sequences of the eigenvectors, (B) is a diagram showing an example in which index numbers are assigned to the extracted subvectors and the subvectors are registered in the codebook table, (C) is a diagram showing how subvectors that partially match within an allowed error are extracted from the element sequences of the eigenvectors, and (D) is a diagram showing an example in which index numbers are assigned to those extracted subvectors and the subvectors are registered in the codebook table.
辞書作成部の処理手順を示すフローチャートである。 It is a flowchart showing the processing procedure of the dictionary creation unit.
グループ化の他の方法を説明するための図である。 It is a diagram for explaining another grouping method.

以下、添付図面を参照しながら本発明の好適な実施例を説明する。 Hereinafter, preferred embodiments of the present invention will be described with reference to the accompanying drawings.

本実施例のパターン認識辞書作成装置１は、図１に示すように辞書作成部１０と、スキャナ装置２１と、操作部２２と、記憶装置２３と、表示装置２４とを備えている。 As shown in FIG. 1, the pattern recognition dictionary creation device 1 of this embodiment includes a dictionary creation unit 10, a scanner device 21, an operation unit 22, a storage device 23, and a display device 24.

スキャナ装置２１は、用紙に印刷された文字等の画像をパターン画像データとして読み込む。読み込んだパターン画像データは、ＣＰＵ（Central Processing Unit）１１の制御により記憶装置２３に保存される。 The scanner device 21 reads an image such as characters printed on paper as pattern image data. The read pattern image data is stored in the storage device 23 under the control of a CPU (Central Processing Unit) 11.

操作部２２は、ユーザの指示を受け付ける操作入力受付部であり、スキャナ装置２１の動作開始の指示や、辞書作成部１０で作成した辞書を表示装置２４に表示させる指示等を受け付ける。 The operation unit 22 is an operation input reception unit that receives a user instruction, and receives an instruction to start the operation of the scanner device 21, an instruction to display the dictionary created by the dictionary creation unit 10 on the display device 24, and the like.

記憶装置２３には、スキャナ装置２１で読み込まれたパターン画像データや、辞書作成部１０で作成されたパターン認識用辞書等が記憶される。 The storage device 23 stores pattern image data read by the scanner device 21, a pattern recognition dictionary created by the dictionary creation unit 10, and the like.

表示装置２４は、ＣＰＵ１１の制御に従って、辞書作成部１０で作成したパターン認識用辞書のデータを表示させる。 The display device 24 displays the data of the pattern recognition dictionary created by the dictionary creation unit 10 under the control of the CPU 11.

辞書作成部１０は、図１に示すようにＣＰＵ１１と、ＲＯＭ（Read Only Memory）１２と、ＲＡＭ（Random Access Memory）１３と、入出力インターフェース１４と、グラフィックインターフェース１５と、ネットワークインターフェース１６とを備えている。 As shown in FIG. 1, the dictionary creation unit 10 includes a CPU 11, a ROM (Read Only Memory) 12, a RAM (Random Access Memory) 13, an input/output interface 14, a graphic interface 15, and a network interface 16.

ＲＯＭ１２には、ＣＰＵ１１を制御するプログラムが記録されている。ＣＰＵ１１は、ＲＯＭ１２に記録されたプログラムを読み込んで、読み込んだプログラムに従った演算を行う。ＣＰＵ１１などのハードウェアと、ＲＯＭ１２に格納されたプログラムとの協働によって実現される辞書作成部１０の機能ブロックについては図２を参照しながら後ほど説明する。また、ＲＡＭ１３には、ＣＰＵ１１による演算途中のデータや、演算後のデータが記録される。例えば、スキャナ装置２１により読み込まれ、記憶装置２３に保存されていたパターン画像データを、ＣＰＵ１１の制御により読み出してＲＡＭ１３に格納する。
なお、プログラムについては、必ずしもＲＯＭ１２に記憶させておく必要はなく、例えば、コンピュータで読み込み可能なフレキシブルディスク（ＦＤ）、ＤＶＤ（Digital Versatile Disc）、ＤＶＤ−ＲＡＭ、ＣＤ−ＲＯＭ（Compact Disc Read Only Memory）、ＣＤ−Ｒ（Recordable）／ＲＷ（ReWritable）、光磁気ディスク、ＩＣカードなどの可搬記憶媒体、またはコンピュータに備えられるＨＤＤなどの記憶媒体、さらには公衆回線、インターネット、ＬＡＮ、ＷＡＮなどを介してコンピュータに接続される他のコンピュータ（またはサーバ）などにプログラムを記憶させておき、コンピュータがこれからプログラムを読み出して実行するようにしてもよい。あるいは公衆回線、インターネット、ＬＡＮ、ＷＡＮなどを介して他のコンピュータ（またはサーバ）からプログラムを可搬記憶媒体や記憶媒体に格納し、コンピュータがこれからプログラムを読み出して実行するようにしてもよい。 The ROM 12 stores a program for controlling the CPU 11. The CPU 11 reads a program recorded in the ROM 12 and performs a calculation according to the read program. A functional block of the dictionary creation unit 10 realized by cooperation of hardware such as the CPU 11 and a program stored in the ROM 12 will be described later with reference to FIG. Further, the RAM 13 records data being calculated by the CPU 11 and data after the calculation. For example, the pattern image data read by the scanner device 21 and stored in the storage device 23 is read by the control of the CPU 11 and stored in the RAM 13.
The program does not necessarily have to be stored in the ROM 12. For example, it may be stored on a computer-readable portable storage medium such as a flexible disk (FD), DVD (Digital Versatile Disc), DVD-RAM, CD-ROM (Compact Disc Read Only Memory), CD-R (Recordable)/RW (ReWritable), magneto-optical disk, or IC card; on a storage medium such as an HDD provided in the computer; or on another computer (or server) connected to the computer via a public line, the Internet, a LAN, a WAN, or the like, and the computer may read the program from there and execute it. Alternatively, the program may be transferred from another computer (or server) via a public line, the Internet, a LAN, a WAN, or the like and stored in a portable storage medium or storage medium, and the computer may read the program from there and execute it.

入出力インターフェース１４は、信号の入出力部である。入出力インターフェース１４には、スキャナ装置２１で読み取られたパターン画像データが入力される。入出力インターフェース１４は、入力したパターン画像データをＣＰＵ１１の制御により記憶装置２３に出力する。また、入出力インターフェース１４は、ＣＰＵ１１の制御によって記憶装置２３から読み出されたパターン認識用辞書を、例えば外部記憶装置３０等に出力する。 The input / output interface 14 is a signal input / output unit. The pattern image data read by the scanner device 21 is input to the input / output interface 14. The input / output interface 14 outputs the input pattern image data to the storage device 23 under the control of the CPU 11. The input / output interface 14 outputs the pattern recognition dictionary read from the storage device 23 under the control of the CPU 11 to, for example, the external storage device 30.

グラフィックインターフェース１５は、ＣＰＵ１１で処理された画像を表示装置２４に表示させるためのインターフェースであり、表示装置２４に表示させるためにグラフィックデータを波形電気信号に変換する。 The graphic interface 15 is an interface for displaying an image processed by the CPU 11 on the display device 24, and converts graphic data into a waveform electrical signal for display on the display device 24.

ネットワークインターフェース１６は、ネットワークに接続する。例えば、ＣＰＵ１１の制御に従って記憶装置２３から読み出されたパターン認識用辞書をネットワークを介して他の装置に転送する。 The network interface 16 is connected to the network. For example, the pattern recognition dictionary read from the storage device 23 is transferred to another device via the network under the control of the CPU 11.

次に、図２を参照しながらＣＰＵ１１などのハードウェアと、ＲＯＭ１２に格納されたプログラムとの協働によって実現される辞書作成部１０の機能ブロックについて説明する。
プログラム制御されたＣＰＵ１１によって実現される機能部１１０は、特徴ベクトル生成部１１１と、平均ベクトル算出部１１２と、主成分分析部１１３と、グループ化部１１４と、辞書生成・登録部１１５とを備えている。また、辞書生成・登録部１１５は、グループ化文字情報登録部１１６と、非グループ化文字情報登録部１１７と、テーブル格納部１１８とを備えている。以下、各ブロックについて説明する。 Next, functional blocks of the dictionary creation unit 10 realized by cooperation of hardware such as the CPU 11 and a program stored in the ROM 12 will be described with reference to FIG.
The function unit 110 realized by the program-controlled CPU 11 includes a feature vector generation unit 111, an average vector calculation unit 112, a principal component analysis unit 113, a grouping unit 114, and a dictionary generation/registration unit 115. The dictionary generation/registration unit 115 includes a grouped character information registration unit 116, an ungrouped character information registration unit 117, and a table storage unit 118. Each block is described below.

まず、特徴ベクトル生成部１１１について説明する。
スキャナ装置２１でスキャンされ、記憶装置２３に保存されているパターン画像データは、ＣＰＵ１１の制御によて記憶装置２３から読み出されＲＡＭ１３に保存される。
特徴ベクトル生成部１１１は、ＲＡＭ１３に保存されているパターン画像データをＲＡＭ１３から取得する。
なお、以下では用紙に印刷された文字をスキャナ装置２１で読み込んだ文字画像を用いてパターン認識用辞書を作成する場合を例に説明するが、本実施例は文字に限定されるものではない。例えば、人間の顔画像や、声をマイクで入力した音声データであってもよい。
特徴ベクトル生成部１１１は、文字画像の文字パターンに基づく特徴量を求めて、特徴量を要素とする特徴ベクトルを生成する。具体的には、例えば、
「孫寧, 田原透, 阿曽弘具, 木村正行著「方向線素特徴量を用いた高精度文字認識」電子情報通信学会論文誌(D-II), J74-D-II, 3, pp.330-339 (1991-2)」
等の文献に開示された方法を用いて、文字画像をＤ次元（Ｄは任意の自然数）の特徴ベクトルに変換する。また、記憶装置２３には、同一の文字について、フォントの異なる文字や手書き文字などの複数の学習サンプルが収集されている。特徴ベクトル生成部１１１は、ＲＡＭ１３からこれら複数の学習サンプルの文字画像を読み込み、読み込んだ文字画像の特徴ベクトルをそれぞれ作成する。なお、本実施例では、パターン画像データとしてスキャナ装置２１で読み取ったパターン画像データを例に説明するが、パターン認識辞書作成装置１自身で作成したパターン画像データであってもよい。その他に、他の装置で生成したパターン画像データをネットワークを介して入力したものを使用してもよい
式（１）には、同一文字のＮ（Ｎは任意の自然数）個の学習サンプルＸ^１，Ｘ^２，Ｘ^３，・・・，Ｘ^Ｎの特徴ベクトルをそれぞれ示す。 First, the feature vector generation unit 111 will be described.
The pattern image data scanned by the scanner device 21 and stored in the storage device 23 is read from the storage device 23 and stored in the RAM 13 under the control of the CPU 11.
The feature vector generation unit 111 acquires pattern image data stored in the RAM 13 from the RAM 13.
In the following description, a case where a pattern recognition dictionary is created using a character image obtained by reading characters printed on paper with the scanner device 21 will be described as an example. However, the present embodiment is not limited to characters. For example, it may be a human face image or voice data obtained by inputting a voice with a microphone.
The feature vector generation unit 111 obtains feature amounts based on the character pattern of the character image, and generates a feature vector having the feature amounts as elements. Specifically, the character image is converted into a D-dimensional feature vector (D is an arbitrary natural number) using, for example, the method disclosed in literature such as the following:
Nao Son, Toru Tahara, Hiroki Aso, Masayuki Kimura, "Highly accurate character recognition using directional element features", IEICE Transactions (D-II), J74-D-II, 3, pp. 330-339 (1991-2).
The storage device 23 holds, for each character, a plurality of learning samples such as the same character in different fonts and in handwriting. The feature vector generation unit 111 reads the character images of these learning samples from the RAM 13 and creates a feature vector for each character image read. In this embodiment, pattern image data read by the scanner device 21 is used as an example, but pattern image data created by the pattern recognition dictionary creation device 1 itself may also be used. Pattern image data generated by another device and input via a network may be used as well.
Equation (1) shows the feature vectors of N learning samples X^1, X^2, X^3, ..., X^N of the same character (N is an arbitrary natural number).
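Equation (1) itself did not survive in this text. A reconstruction consistent with the element notation used later in the description (x^1_1, x^2_1, ..., x^N_1 for the first elements of the N samples, and D dimensions per vector) would be the following; the layout is an assumption, only the symbols are taken from the surrounding text.

```latex
\begin{equation}
\begin{aligned}
X^{1} &= \left(x^{1}_{1},\, x^{1}_{2},\, \ldots,\, x^{1}_{D}\right) \\
X^{2} &= \left(x^{2}_{1},\, x^{2}_{2},\, \ldots,\, x^{2}_{D}\right) \\
      &\;\;\vdots \\
X^{N} &= \left(x^{N}_{1},\, x^{N}_{2},\, \ldots,\, x^{N}_{D}\right)
\end{aligned}
\tag{1}
\end{equation}
```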

特徴ベクトル生成部１１１は、上述した処理を複数の文字ごとにそれぞれ行う。特徴ベクトル生成部１１１は、特徴ベクトルを生成すると、文字ごとに複数収集した学習サンプルの特徴ベクトルに文字を識別する文字コードを割り付けて記憶装置２３に保存する。また、特徴ベクトル生成部１１１は、文字を識別する文字コードと、文字ごとに複数収集した学習サンプルの特徴ベクトルとを平均ベクトル算出部１１２と、主成分分析部１１３とに出力する。

The feature vector generation unit 111 performs the above processing for each of a plurality of characters. After generating the feature vectors, it assigns a character code identifying the character to the feature vectors of the learning samples collected for that character, and stores them in the storage device 23. The feature vector generation unit 111 also outputs the character code identifying each character and the feature vectors of the learning samples collected for that character to the average vector calculation unit 112 and the principal component analysis unit 113.

平均ベクトル算出部１１２は、特徴ベクトル生成部１１１から取得した同じ文字の特徴ベクトルを用いて、平均ベクトルを文字ごとに算出する。
平均ベクトルは、以下に示す式（２）によって求められる。 The average vector calculation unit 112 calculates the average vector for each character using the feature vector of the same character acquired from the feature vector generation unit 111.
The average vector is obtained by the following equation (2).

また、式（２）に従って求めた平均ベクトルを式（３）のように表す。

The average vector obtained according to Equation (2) is expressed as in Equation (3).
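Equations (2) and (3) are missing from this text. Based on the verbal description (element-wise sum over the N learning samples divided by N), they can be reconstructed as below. The symbol for the average vector is not recoverable from the text, so the bar notation used here is an assumption.

```latex
\begin{equation}
\bar{X} = \frac{1}{N} \sum_{k=1}^{N} X^{k} \tag{2}
\end{equation}
\begin{equation}
\bar{X} = \left(\bar{x}_{1},\, \bar{x}_{2},\, \ldots,\, \bar{x}_{D}\right),
\qquad
\bar{x}_{i} = \frac{1}{N} \sum_{k=1}^{N} x^{k}_{i} \tag{3}
\end{equation}
```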

平均ベクトル算出部１１２は、式（２）に示すように各特徴ベクトルの同一次元のベクトル要素同士を加算し、これを学習サンプル数Ｎで除算して平均ベクトルを求める。
例えば、特徴ベクトルの１次元のベクトル要素であれば、上述した式（１）に示すｘ^１ _１，ｘ^２ _１，ｘ^３ _１，・・・，ｘ^Ｎ _１の値を加算して、これをＮで除算した値が、式（３）に示す平均ベクトルの１次元のベクトル要素となる。
平均ベクトル算出部１１２は、文字ごとに平均ベクトルを算出し、算出した平均ベクトルを文字コードに対応付けて記憶装置２３に保存する。また、平均ベクトル算出部１１２は、文字ごとに求めた平均ベクトルを主成分分析部１１３と、グループ化部１１４とに出力する。

The average vector calculation unit 112 adds together the vector elements of the same dimension across the feature vectors, as shown in Equation (2), and divides the sum by the number of learning samples N to obtain the average vector.
For example, for the first vector element of the feature vectors, the values x^1_1, x^2_1, x^3_1, ..., x^N_1 shown in Equation (1) above are added together, and the sum divided by N becomes the first vector element of the average vector shown in Equation (3).
The average vector calculation unit 112 calculates an average vector for each character, and stores the calculated average vector in the storage device 23 in association with the character code. The average vector calculation unit 112 outputs the average vector obtained for each character to the principal component analysis unit 113 and the grouping unit 114.
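The per-character averaging described above can be sketched as follows. The function name `average_vector` and the two-dimensional sample values are made up for illustration; real feature vectors would have D dimensions.

```python
def average_vector(samples):
    """Element-wise mean of N feature vectors of equal length."""
    n = len(samples)
    dim = len(samples[0])
    # For each dimension, add the elements of all samples and divide by N.
    return [sum(sample[d] for sample in samples) / n for d in range(dim)]

# Four made-up learning samples of one character, 2 dimensions each.
samples = [[1.0, 1.0], [3.0, 5.0], [5.0, 3.0], [7.0, 7.0]]
mean = average_vector(samples)
print(mean)
```

In a full implementation this would be called once per character code, with the result stored alongside that code.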

次に、主成分分析部１１３について説明する。主成分分析部１１３は、特徴ベクトル生成部１１１で生成された複数の学習サンプルの特徴ベクトルを文字ごとに取得する。また、主成分分析部１１３は、平均ベクトル算出部１１２で文字ごとに求められた平均ベクトルを取得する。
主成分分析部１１３は、取得した情報を用いてパターン画像データの画素分布を表す固有値、固有ベクトルを各文字ごとに生成する。まず、主成分分析部１１３は、平均ベクトルと特徴ベクトルとから分散共分散行列を求める。分散共分散行列Σを式（４）に示す。

Next, the principal component analysis unit 113 will be described. The principal component analysis unit 113 acquires feature vectors of a plurality of learning samples generated by the feature vector generation unit 111 for each character. In addition, the principal component analysis unit 113 acquires the average vector obtained for each character by the average vector calculation unit 112.
The principal component analysis unit 113 uses the acquired information to generate eigenvalues and eigenvectors representing the pixel distribution of the pattern image data for each character. First, the principal component analysis unit 113 obtains a variance-covariance matrix from the average vector and the feature vector. The variance-covariance matrix Σ is shown in Equation (4).

分散共分散行列Σの共分散値σ_ｉｊは、以下に示す式（５）によって算出される。 The covariance value σ_ij of the variance-covariance matrix Σ is calculated by Equation (5) below.
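Equations (4) and (5) are missing from this text. A reconstruction that follows the verbal description (the average, over the N learning samples, of the product of the deviations of elements i and j from the corresponding elements of the average vector) is given below. Writing the elements of the average vector as x-bar, and dividing by N rather than N - 1, are assumptions based on the word "average" in the description.

```latex
\begin{equation}
\Sigma =
\begin{pmatrix}
\sigma_{11} & \sigma_{12} & \cdots & \sigma_{1D} \\
\sigma_{21} & \sigma_{22} & \cdots & \sigma_{2D} \\
\vdots      & \vdots      & \ddots & \vdots      \\
\sigma_{D1} & \sigma_{D2} & \cdots & \sigma_{DD}
\end{pmatrix} \tag{4}
\end{equation}
\begin{equation}
\sigma_{ij} = \frac{1}{N} \sum_{k=1}^{N}
\left(x^{k}_{i} - \bar{x}_{i}\right)\left(x^{k}_{j} - \bar{x}_{j}\right) \tag{5}
\end{equation}
```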

主成分分析部１１３は、各学習サンプルの特徴ベクトルの要素ｉと平均ベクトルの要素ｉとの差と、特徴ベクトルの要素ｊと平均ベクトルの要素ｊとの差とを掛け合わせて平均を求めたものを、要素ｉと要素ｊの共分散値として算出する。
なお、ｉとｊは、１〜Ｄの任意の自然数である。

For each learning sample, the principal component analysis unit 113 multiplies the difference between element i of the feature vector and element i of the average vector by the difference between element j of the feature vector and element j of the average vector, and takes the average of these products over the learning samples as the covariance value of elements i and j.
Note that i and j are arbitrary natural numbers from 1 to D.
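The covariance calculation just described can be sketched in a few lines. The function name `covariance_matrix` and the sample values are illustrative assumptions; the division by N follows the description's wording ("average") rather than the N - 1 convention used for unbiased estimates.

```python
def covariance_matrix(samples, mean):
    """Variance-covariance matrix of the samples around the given mean vector."""
    n = len(samples)
    dim = len(mean)
    sigma = [[0.0] * dim for _ in range(dim)]
    for i in range(dim):
        for j in range(dim):
            # Average of (deviation in element i) * (deviation in element j).
            sigma[i][j] = sum(
                (s[i] - mean[i]) * (s[j] - mean[j]) for s in samples
            ) / n
    return sigma

# Made-up 2-dimensional learning samples and their mean.
samples = [[1.0, 1.0], [3.0, 5.0], [5.0, 3.0], [7.0, 7.0]]
mean = [4.0, 4.0]
sigma = covariance_matrix(samples, mean)
print(sigma)
```

The resulting matrix is symmetric, which is what allows the eigenvalue calculation in the next step to yield mutually orthogonal eigenvectors.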

次に、主成分分析部１１３は、求めた分散共分散行列Σを対象に固有値計算を行い、分散共分散行列Σから固有値、固有ベクトルを求める。
固有値は、λｉ（ｉ＝１，２，・・・，Ｄ）のＤ個の値で、λ１＞λ２＞・・・＞λＤとなる。
固有ベクトルは、Φｉ＝（Φ^ｉ _１，Φ^ｉ _２，・・・，Φ^ｉ _Ｄ）（ｉ＝１，２，・・・，Ｄ）となる。
固有ベクトルは、互いに直交する大きさ１のベクトルとなる。
また、固有値λｉは、固有ベクトルΦｉで示される方向で見たときの複数の学習サンプルの標準偏差値を表す。そのため、複数個の固有値、固有ベクトルで複数の学習サンプルの全体分布を近似的に表現できる。 Next, the principal component analysis unit 113 performs eigenvalue calculation on the obtained variance-covariance matrix Σ, and obtains eigenvalues and eigenvectors from the variance-covariance matrix Σ.
The eigenvalues are D values λi (i = 1, 2, ..., D), with λ1 > λ2 > ... > λD.
The eigenvectors are Φi = (Φ^i_1, Φ^i_2, ..., Φ^i_D) (i = 1, 2, ..., D).
The eigenvectors are mutually orthogonal vectors of magnitude 1.
Further, the eigenvalue λi represents a standard deviation value of a plurality of learning samples when viewed in the direction indicated by the eigenvector Φi. Therefore, the entire distribution of a plurality of learning samples can be approximately expressed by a plurality of eigenvalues and eigenvectors.

なお、固有値、固有ベクトルは、Ｄ行Ｄ列の分散共分散行列Σから通常Ｄ個ずつ取得できる。しかし、一般的に値の小さい固有値は、学習サンプルの分布形状にあまり大きな影響を与えないので、固有値の大きな上位Ｍ個だけで学習サンプルの分布を十分に表現できる。
従って、
固有値は、λｉ（ｉ＝１，２，・・・，Ｍ）
固有ベクトルは、Φ^ｉ＝（Φ^ｉ _１，Φ^ｉ _２，・・・，Φ^ｉ _Ｄ）（ｉ＝１，２，・・・，Ｍ）（Ｍ＜Ｄ）となる。なお、Ｍは、Ｄよりも小さい任意の自然数である。
主成分分析部１１３は、固有値、固有ベクトルを文字ごとに求め、求めた固有値、固有ベクトルを文字コードと共に記憶装置２３に保存する。また、主成分分析部１１３は、求めた固有値、固有ベクトルを辞書生成・登録部１１５のグループ化文字情報登録部１１６と、非グループ化文字情報登録部１１７とに出力する。 Note that D eigenvalues and D eigenvectors can normally be obtained from the D-by-D variance-covariance matrix Σ. However, eigenvalues with small values generally have little influence on the shape of the learning sample distribution, so the distribution can be represented adequately by only the M largest eigenvalues and their eigenvectors.
Therefore,
the eigenvalues are λi (i = 1, 2, ..., M), and
the eigenvectors are Φ^i = (Φ^i_1, Φ^i_2, ..., Φ^i_D) (i = 1, 2, ..., M) (M < D). Note that M is an arbitrary natural number smaller than D.
The principal component analysis unit 113 obtains eigenvalues and eigenvectors for each character and stores the obtained eigenvalues and eigenvectors in the storage device 23 together with the character code. The principal component analysis unit 113 outputs the obtained eigenvalues and eigenvectors to the grouped character information registration unit 116 and the ungrouped character information registration unit 117 of the dictionary generation / registration unit 115.
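One way to obtain the M largest eigenvalues and their eigenvectors from a variance-covariance matrix is sketched below. The specification does not say which eigenvalue algorithm is used; this example swaps in power iteration with deflation purely because it is short and needs no external library. The function names, the iteration count, and the start vector are all assumptions. Strictly speaking, each eigenvalue of a covariance matrix is the variance of the samples along its eigenvector, so the standard deviation in that direction is its square root.

```python
import math

def mat_vec(matrix, vector):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def normalize(vector):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]

def top_eigenpairs(sigma, m, iterations=500):
    """Approximate the m largest eigenpairs of a symmetric positive
    semi-definite matrix by power iteration with deflation."""
    dim = len(sigma)
    work = [row[:] for row in sigma]  # copy so the input is left untouched
    pairs = []
    for _pair in range(m):
        # Non-uniform start vector; assumed not orthogonal to the
        # dominant eigenvector of the current working matrix.
        vec = normalize([1.0 / (d + 1) for d in range(dim)])
        for _step in range(iterations):
            nxt = mat_vec(work, vec)
            if all(x == 0.0 for x in nxt):
                break
            vec = normalize(nxt)
        # Rayleigh quotient gives the eigenvalue for the unit vector found.
        value = sum(v * w for v, w in zip(vec, mat_vec(work, vec)))
        pairs.append((value, vec))
        # Deflation: remove the found component so the next pass
        # converges to the next-largest eigenvalue.
        for i in range(dim):
            for j in range(dim):
                work[i][j] -= value * vec[i] * vec[j]
    return pairs

sigma = [[5.0, 4.0], [4.0, 5.0]]  # made-up 2-by-2 covariance matrix
pairs = top_eigenpairs(sigma, m=2)
for value, vec in pairs:
    print(value, vec)
```

Keeping only the first M pairs returned by such a routine corresponds to the truncation to M < D described above.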

次に、グループ化部１１４の処理について説明する。
グループ化部１１４は、平均ベクトル算出部１１２から平均ベクトルを文字ごとに取得する。グループ化部１１４は、取得した各文字の平均ベクトルに基づいて階層的クラスタリングを行い、文字を、平均ベクトルの類似する文字が存在する文字と、平均ベクトルの類似する文字が存在しない文字とに分ける。グループ化部１１４は、平均ベクトルの類似する文字が存在する文字を、同じグループに振り分ける。 Next, processing of the grouping unit 114 will be described.
The grouping unit 114 acquires the average vector of each character from the average vector calculation unit 112. It performs hierarchical clustering based on these average vectors and divides the characters into those for which another character with a similar average vector exists and those for which no such character exists. The grouping unit 114 assigns characters that have similar average vectors to the same group.

階層的クラスタリングについて図３〜図７を参照しながら具体的に説明する。
図３には、「大」、「犬」、「太」、「木」、「鳥」、「烏」、「土」、「士」、「±」、「亜」、「腕」、「右」、「日」の１３個の文字と、その平均ベクトルのベクトル要素とを示す。なお、図３に示す例では、平均ベクトルの次元を２０次元（すなわち、２０個のベクトル要素）としているが、平均ベクトルの次元は２０次元に限定されるものではない。また、図３に示す１３個の文字は、「大、犬、太、木」と、「鳥、烏」と、「土、士、±」の３つの類似文字グループと、類似する文字のない、それ以外の４つの文字とに分けられる。 Hierarchical clustering will be specifically described with reference to FIGS. 3 to 7.
FIG. 3 shows 13 characters, 「大」, 「犬」, 「太」, 「木」, 「鳥」, 「烏」, 「土」, 「士」, 「±」, 「亜」, 「腕」, 「右」, and 「日」, together with the vector elements of their average vectors. In the example shown in FIG. 3, the average vector has 20 dimensions (that is, 20 vector elements), but the dimension of the average vector is not limited to 20. The 13 characters shown in FIG. 3 divide into three groups of similar characters, 「大, 犬, 太, 木」, 「鳥, 烏」, and 「土, 士, ±」, and the remaining four characters, which have no similar characters.

グループ化部１１４は、これら１３個の文字を対象として階層的クラスタリングを行う。図４（Ａ）には、１３個の文字を２次元のベクトル空間に配置した様子を示す。なお、平均ベクトルは、実際には２０次元のベクトルであるが、簡略化して２次元で表示している。
また、ベクトル空間を規定する軸１は、平均ベクトルの２０個あるベクトル要素のうち第１ベクトル要素の成分だけが値を持つ軸である。また、軸２は、２０個のベクトル要素のうち第２ベクトル要素だけが値を持つ軸である。
グループ化部１１４は、１３個の平均ベクトルのベクトル間距離をそれぞれ求める。グループ化部１１４は、求めたベクトル間距離のうち、距離の最も近い２つのベクトルをグループとして統合する。図４（Ｂ）に１回目の階層的クラスタリングの結果を示す。図４（Ｂ）に示す例では、グループ化部１１４は、「土」と「±」のベクトル間距離が最も近いと判断し、これらの文字を同一グループに分類する。グループ化部１１４は、同一グループに分類した文字の平均ベクトル同士の平均を平均ベクトルとして求め、求めた平均ベクトルをグループの代表ベクトルに設定する。 The grouping unit 114 performs hierarchical clustering on these 13 characters. FIG. 4A shows a state in which 13 characters are arranged in a two-dimensional vector space. The average vector is actually a 20-dimensional vector, but is simplified and displayed in two dimensions.
The axis 1 that defines the vector space is an axis in which only the component of the first vector element has a value among the 20 vector elements of the average vector. An axis 2 is an axis in which only the second vector element has a value among the 20 vector elements.
The grouping unit 114 obtains the inter-vector distances between the 13 average vectors, and merges the two vectors with the smallest inter-vector distance into a group. FIG. 4B shows the result of the first round of hierarchical clustering. In the example shown in FIG. 4B, the grouping unit 114 determines that the inter-vector distance between 「土」 and 「±」 is the smallest, and classifies these characters into the same group. The grouping unit 114 then averages the average vectors of the characters classified into the same group, and sets the result as the representative vector of the group.

次に、グループ化部１１４は、１１文字の平均ベクトルと、設定したグループの代表ベクトルとについてベクトル間距離をそれぞれ求める。グループ化部１１４は、求めたベクトル間距離のうち、距離の最も近い２つのベクトルをグループとして統合する。図５（Ａ）に２回目の階層的クラスタリングの結果を示す。
図５（Ａ）に示す例では、グループ化部１１４は、「土」と「±」のグループを示す代表ベクトルと、「士」の平均ベクトルとのベクトル間距離が最も近いと判定し、「土」と「±」のグループに「士」を統合する。
グループ化部１１４は、同一グループに分類した文字「土」、「±」、「士」の平均ベクトルの平均である平均ベクトルを求め、求めた平均ベクトルをグループの代表ベクトルとする。なお、ここでは、先に求めた「土」と「±」の代表ベクトルと、「士」の平均ベクトルとの平均を平均ベクトルとして求め、求めた平均ベクトルをグループの代表ベクトルとしてもよい。 Next, the grouping unit 114 obtains an inter-vector distance for the average vector of 11 characters and the representative vector of the set group. The grouping unit 114 integrates, as a group, two vectors having the closest distances among the obtained distances between vectors. FIG. 5A shows the result of the second hierarchical clustering.
In the example illustrated in FIG. 5A, the grouping unit 114 determines that the inter-vector distance between the representative vector of the group of 「土」 and 「±」 and the average vector of 「士」 is the smallest, and merges 「士」 into the group of 「土」 and 「±」.
The grouping unit 114 obtains the average of the average vectors of the characters 「土」, 「±」, and 「士」 classified into the same group, and uses it as the representative vector of the group. Alternatively, the average of the previously obtained representative vector of 「土」 and 「±」 and the average vector of 「士」 may be used as the representative vector of the group.

以下、同様の手順で、グループ化部１１４は階層的クラスタリングを行っていく。図５（Ｂ）に３回目の階層的クラスタリングの結果を示し、図６（Ａ）に４回目の階層的クラスタリングの結果を示す。また、図６（Ｂ）に５回目の階層的クラスタリングの結果を示し、図７に６回目の階層的クラスタリングの結果を示す。 Thereafter, the grouping unit 114 performs hierarchical clustering in the same procedure. FIG. 5B shows the result of the third hierarchical clustering, and FIG. 6A shows the result of the fourth hierarchical clustering. FIG. 6B shows the result of the fifth hierarchical clustering, and FIG. 7 shows the result of the sixth hierarchical clustering.

グループ化部１１４は、上述したように平均ベクトルと平均ベクトル、又は代表ベクトルと平均ベクトルとのベクトル間距離を求めて、ベクトル間距離の最も近い文字を同一グループに分類していく。また、ベクトル間距離には予めしきい値が設定してあり、グループ化部１１４は、求めたベクトル間距離がこのしきい値よりも大きくなると、同一グループへの分類を終了させる。図４〜図７に示す例では、６回の階層的クラスタリングで、３つのグループと、類似する文字のない４つの文字とに分類される。 As described above, the grouping unit 114 obtains the distance between vectors of the average vector and the average vector, or the representative vector and the average vector, and classifies the characters having the closest vector distance into the same group. In addition, a threshold value is set in advance for the inter-vector distance, and the grouping unit 114 ends the classification into the same group when the calculated inter-vector distance becomes larger than the threshold value. In the example shown in FIGS. 4 to 7, classification is made into three groups and four characters having no similar characters by six times of hierarchical clustering.
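The threshold-terminated clustering described above can be sketched as follows. The two-dimensional coordinates, the threshold value, and the function names are made up for illustration; the characters are borrowed from FIG. 3 only as labels. The representative vector of a group is taken as the mean of its members' average vectors, as in the description. One assumption beyond the text: this sketch also allows two existing groups to merge if their representative vectors happen to be the closest pair.

```python
import math

def distance(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mean_of(vectors):
    """Element-wise mean, used as the representative vector of a group."""
    n = len(vectors)
    return [sum(v[d] for v in vectors) / n for d in range(len(vectors[0]))]

def group_similar(average_vectors, threshold):
    """Merge the closest pair repeatedly until the smallest distance
    exceeds the threshold. Returns (groups, ungrouped labels)."""
    clusters = [[label] for label in average_vectors]
    while len(clusters) > 1:
        reps = [mean_of([average_vectors[label] for label in c]) for c in clusters]
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = distance(reps[i], reps[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        if best[0] > threshold:
            break  # nothing left that is similar enough to merge
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    groups = [c for c in clusters if len(c) > 1]
    ungrouped = [c[0] for c in clusters if len(c) == 1]
    return groups, ungrouped

# Made-up 2-D average vectors; real ones would have D dimensions.
toy_vectors = {
    "土": [0.0, 0.0], "±": [0.1, 0.0], "士": [0.0, 0.12],
    "鳥": [5.0, 5.0], "烏": [5.15, 5.0],
    "亜": [10.0, 0.0], "日": [0.0, 10.0],
}
groups, ungrouped = group_similar(toy_vectors, threshold=1.0)
print(groups, ungrouped)
```

With these toy coordinates the first merge joins 「土」 and 「±」 and the second adds 「士」, mirroring the first two rounds in FIGS. 4B and 5A; the choice of threshold decides how many characters end up ungrouped.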

The grouping unit 114 assigns identification information identifying each group to the classified group, and stores it in the storage device 23 together with the character codes classified into that group. The grouping unit 114 also adds, to each character code that was not grouped, identification information indicating that the character was not grouped, and stores it in the storage device 23.
The grouping unit 114 also outputs the character codes classified into groups, together with the identification information indicating each group, to the grouped character information registration unit 116 and the ungrouped character information registration unit 117. Likewise, the grouping unit 114 outputs each character code that was not classified into a group, together with identification information indicating that it was not classified, to the grouped character information registration unit 116 and the ungrouped character information registration unit 117.

Next, the dictionary creation / registration unit 115 will be described. The dictionary creation / registration unit 115 includes a grouped character information registration unit 116, an ungrouped character information registration unit 117, and a table storage unit 118.
The grouped character information registration unit 116 registers, in the index table, the eigenvalues and eigenvectors of the characters that the grouping unit 114 has determined to have similar characters and has classified into groups. An example of the index table is shown in FIG. 8. As shown in FIG. 8, the index table includes a character code identifying a character, an index flag, and the eigenvalues and eigenvectors of the character indicated by the character code. Note that the index table and the codebook table together correspond to the aforementioned pattern recognition dictionary.
Characters classified into groups have similar characters and therefore require a large amount of information for character recognition, so all of the eigenvalues and eigenvectors generated by the principal component analysis unit 113 are registered in the index table. As described above, each eigenvector generated by the principal component analysis unit 113 is a D-dimensional vector having a plurality of vector elements. The grouped character information registration unit 116 registers all the vector elements of the D-dimensional eigenvector in the corresponding character column of the index table as the eigenvector for character recognition. Note that “0” is recorded as the index flag for a character whose vector elements are all registered in the index table.

The ungrouped character information registration unit 117 registers, in the index table and the codebook table, the eigenvalues and eigenvectors of the characters that the grouping unit 114 has determined to have no similar characters and has not classified into any group.
Characters not classified into groups have no similar characters and do not require much information for character recognition, so it is not necessary to register every vector element of the eigenvectors generated by the principal component analysis unit 113 exactly in the index table. The ungrouped character information registration unit 117 therefore extracts, from the vector elements of the eigenvector of an ungrouped character, an element sequence that partially matches the vector elements of the eigenvector of another character. It assigns the extracted element sequence an index number identifying that sequence, and records the assigned index number in the eigenvector recording column of the corresponding character in the index table. In addition, to manage the association between index numbers and extracted element sequences, the ungrouped character information registration unit 117 records each index number and its corresponding element sequence in the codebook table (sub-dictionary). An example of the codebook table is shown in FIG. 9.

The processing of the ungrouped character information registration unit 117 will be specifically described.
The eigenvector generated by the principal component analysis unit 113 is a D-dimensional vector as described above, and is composed of D vector elements.
$\Phi^{i} = (\Phi^{i}_{1}, \Phi^{i}_{2}, \ldots, \Phi^{i}_{D})$
From these D elements, an element sequence that partially matches the elements of an eigenvector of another ungrouped character code is extracted (hereinafter, this element sequence is referred to as a subvector). For example, in the example shown in FIG. 10A, the eigenvector with character code “0x8148” and eigenvalue “12.5” and the eigenvector with character code “0x7652” and eigenvalue “25.3” share a partially matching subvector.
The ungrouped character information registration unit 117 assigns an index number to this partially matching subvector (21, 11, 45, 62). When it records the eigenvector with character code “0x8148” and eigenvalue “12.5” in the index table, it records the assigned index number in the corresponding column of the index table. Similarly, when it records the eigenvector with character code “0x7652” and eigenvalue “25.3” in the index table, it records the assigned index number in the corresponding column. The ungrouped character information registration unit 117 also registers the partially matching subvector, together with the index number identifying it, in the codebook table (see FIG. 10B).
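As an illustration of replacing shared subvectors with index numbers, the sketch below splits each eigenvector into fixed-width chunks and assigns one index number per distinct chunk. The fixed chunk alignment, the function name `build_codebook`, and every vector element other than the subvector (21, 11, 45, 62) quoted in the text are assumptions made for the example.

```python
def build_codebook(eigenvectors, width=4):
    """Replace fixed-width subvectors with index numbers.

    eigenvectors: dict mapping a character code to a list of vector elements.
    Returns (index_table, codebook): index_table maps each character code to
    a list of index numbers; codebook maps each subvector to its index number.
    """
    codebook = {}      # subvector tuple -> index number
    index_table = {}   # character code -> list of index numbers
    for code, vec in eigenvectors.items():
        row = []
        for start in range(0, len(vec), width):
            sub = tuple(vec[start:start + width])
            if sub not in codebook:
                codebook[sub] = len(codebook)  # next unused index number
            row.append(codebook[sub])
        index_table[code] = row
    return index_table, codebook


# Hypothetical eigenvectors that share the subvector (21, 11, 45, 62).
index_table, codebook = build_codebook({
    0x8148: [21, 11, 45, 62, 5, 6, 7, 8],
    0x7652: [1, 2, 3, 4, 21, 11, 45, 62],
})
```

Both characters end up referring to the same index number for the shared subvector, so its four elements are stored only once in the codebook.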

The ungrouped character information registration unit 117 also detects partially matching subvectors while tolerating some error.
For example, in the example shown in FIG. 10C, the subvector “83, 55, 36, 22” of character code “0x5241” with eigenvalue “10.1” and the subvector “83, 55, 32, 22” of character code “0x3698” with eigenvalue “9.6” do not match exactly, but are similar.
A threshold for the inter-vector distance is set in the ungrouped character information registration unit 117. The ungrouped character information registration unit 117 assigns the same index number to subvectors that do not match exactly but whose inter-vector distance is at or below the threshold.
In this case, only one of “83, 55, 36, 22” and “83, 55, 32, 22” can be registered in the codebook table, and the ungrouped character information registration unit 117 selects and registers the subvector that appears more often. For example, if “83, 55, 36, 22” appears five times and “83, 55, 32, 22” appears three times, “83, 55, 36, 22” is registered in the codebook table as the subvector. An example of registration in the codebook table is shown in FIG. 10D. In this way, the ungrouped character information registration unit 117 replaces all vector elements of the eigenvectors of characters not classified into groups with index numbers, and records only the assigned index numbers in the eigenvector recording column of the corresponding character in the index table.
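The error-tolerant matching and the "more frequent subvector wins" rule can be sketched as follows. The function name `merge_similar_subvectors`, the Euclidean distance, and the threshold value used in the usage lines are assumptions; the text only states that a threshold on the inter-vector distance exists.

```python
import math
from collections import Counter

def merge_similar_subvectors(subvectors, threshold):
    """Map each observed subvector to the representative stored in the codebook.

    subvectors: list of tuples, one entry per occurrence.
    Returns a dict: subvector -> representative subvector.
    """
    counts = Counter(subvectors)
    representatives = []
    mapping = {}
    # Visit subvectors from most to least frequent so that the more frequent
    # member of a similar pair becomes the registered representative.
    for sub, _ in counts.most_common():
        for rep in representatives:
            if math.dist(sub, rep) <= threshold:
                mapping[sub] = rep
                break
        else:
            representatives.append(sub)
            mapping[sub] = sub
    return mapping


# Occurrence counts taken from the example in the text; the threshold is hypothetical.
occurrences = [(83, 55, 36, 22)] * 5 + [(83, 55, 32, 22)] * 3
mapping = merge_similar_subvectors(occurrences, threshold=5)
```

With these inputs both subvectors map to (83, 55, 36, 22), the one that occurs five times.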

In this way, the amount of information registered in the index table can be reduced for characters that are not classified into groups, that is, characters that have no similar characters. In the codebook table shown in FIG. 9, subvectors of four dimensions (sequences of four elements) are registered as the unit. However, the number of elements of a subvector registered in the codebook table is not limited to four; subvectors with more elements, for example five or six, may be registered as the unit. Subvectors with differing numbers of elements can also be registered in the codebook table.

The table storage unit 118 stores, in the storage device 23, the index table and the codebook table created by the grouped character information registration unit 116 and the ungrouped character information registration unit 117.

次に、図１１に示すフローチャートを参照しながら辞書作成部１０の処理フローを説明する。
まず、辞書作成部１０は、パターン認識用辞書を作成する文字のパターン画像データをＲＡＭ１３から取得する（ステップＳ１）。パターン画像データは、文字ごとに学習サンプルとして複数用意されている。 Next, the processing flow of the dictionary creation unit 10 will be described with reference to the flowchart shown in FIG.
First, the dictionary creation unit 10 acquires pattern image data of characters for creating a pattern recognition dictionary from the RAM 13 (step S1). A plurality of pattern image data are prepared as learning samples for each character.

次に、辞書作成部１０は、学習サンプルとして複数用意されたパターン画像データの特徴ベクトルを特徴ベクトル生成部１１１で作成し（ステップＳ２）、それらの平均ベクトルを求める（ステップＳ３）。平均ベクトルは、文字ごとに算出される。 Next, the dictionary creation unit 10 creates feature vectors of pattern image data prepared as a plurality of learning samples in the feature vector generation unit 111 (step S2), and obtains an average vector thereof (step S3). The average vector is calculated for each character.

次に、辞書作成部１０は、算出した平均ベクトルを用いて主成分分析部１１３で主成分分析を行い、固有値、固有ベクトルを求める（ステップＳ４）。固有ベクトルは、固有値ごとに求められ、複数のベクトル要素を有している。 Next, the dictionary creation unit 10 performs principal component analysis in the principal component analysis unit 113 using the calculated average vector to obtain eigenvalues and eigenvectors (step S4). The eigenvector is obtained for each eigenvalue and has a plurality of vector elements.

次に、辞書作成部１０は、グループ化部１１４でパターン認識用辞書を作成する文字を、類似する文字が存在する文字と、類似する文字が存在しない文字とに分ける（ステップＳ５）。グループ化部１１４は、類似する文字同士を同じグループに分類し、グループを識別する情報と、グループに属する文字の文字コードとを記憶装置２３に保存する。また、グループ化部１１４は、類似する文字が存在しない文字の場合に、グループに属さない文字であることを示す情報を文字の文字コードと共に記憶装置２３に保存する。 Next, the dictionary creation unit 10 divides the characters for which the grouping unit 114 creates the pattern recognition dictionary into characters that have similar characters and characters that have no similar characters (step S5). The grouping unit 114 classifies similar characters into the same group, and stores the information for identifying the group and the character codes of the characters belonging to the group in the storage device 23. Further, the grouping unit 114 stores, in the storage device 23, information indicating that the character does not belong to the group together with the character code of the character when the character does not have a similar character.

Next, the dictionary creation unit 10 creates the pattern recognition dictionary by having the dictionary generation / registration unit 115 register the eigenvalues and eigenvectors of the characters in the index table and the codebook table (step S6).
At this time, the grouped character information registration unit 116 registers all the eigenvalues and eigenvectors of the characters classified into groups in the index table. The ungrouped character information registration unit 117 detects, among the vector elements of the eigenvectors of characters not classified into groups, subvectors that partially match the vector elements of the eigenvectors of other characters. In doing so, it also extracts subvectors whose inter-vector distance is within the permitted error as partially matching subvectors. The ungrouped character information registration unit 117 assigns an index number to each extracted subvector. It then registers the index number in the eigenvector registration column of the corresponding character in the index table, and registers the extracted subvector and its index number in the codebook table to manage the correspondence. In this way, the ungrouped character information registration unit 117 replaces all vector elements of the eigenvectors of characters not classified into groups with index numbers, and records only the assigned index numbers in the eigenvector recording column of the corresponding character in the index table.

The dictionary creation unit 10 causes the table storage unit 118 to store, in the storage device 23, the index table and the codebook table created by the grouped character information registration unit 116 and the ungrouped character information registration unit 117 (step S7).

このように本実施例は、類似する文字のある文字の場合には、文字認識用に多くの情報を必要とすることから、この文字の固有値、固有ベクトルをすべてインデックステーブルに登録する。
また、類似する文字が存在しない文字の場合、類似する文字がある文字に比較して、文字認識用の情報を多く必要とはしない。このため、固有ベクトルのベクトル要素のうち、ベクトル間距離に誤差を許容して他の固有ベクトルのベクトル要素列と部分一致するサブベクトルを抽出する。そして、抽出したサブベクトルに代えてサブベクトルを識別するインデックス番号をインデックステーブルに登録する。これにより、インデックステーブルに登録するデータ量を削減することができる。また、類似する文字については、２次識別方式の固有値、固有ベクトルをそのままパターン認識用辞書に登録するため、文字の認識精度を向上させることができる。 As described above, in the present embodiment, in the case of a character having a similar character, a lot of information is required for character recognition. Therefore, all the eigenvalues and eigenvectors of this character are registered in the index table.
In addition, in the case of a character having no similar character, a large amount of character recognition information is not required as compared with a character having a similar character. For this reason, out of vector elements of eigenvectors, subvectors that partially match a vector element sequence of other eigenvectors are extracted while allowing an error in the distance between vectors. Then, an index number for identifying the subvector is registered in the index table instead of the extracted subvector. Thereby, the amount of data registered in the index table can be reduced. For similar characters, the eigenvalues and eigenvectors of the secondary identification method are directly registered in the pattern recognition dictionary, so that the character recognition accuracy can be improved.

For example, assume that there are 4,000 characters and that each character has six eigenvectors (one eigenvector being 200 bytes).
If the eigenvalues and eigenvectors of all characters are registered in the index table without grouping, the data amount of the eigenvectors is 4,000 × 6 × 200 = 4,800,000 bytes.
If each eigenvalue is an 8-byte floating-point value, the data amount of the eigenvalues is 4,000 × 6 × 8 = 192,000 bytes. The total data amount of the eigenvalues and eigenvectors is therefore 4,992,000 bytes.

これに対して、本実施例では、固有ベクトルについては、４，０００×６個のベクトルを階層的クラスタリングして、１，０００個のグループを作成したとする。すると、１文字の固有ベクトルは、６個のサブベクトルで表現できるので、１文字当たり６×２バイトとなる（１つのサブベクトルは、２バイトで表現される）。従って全体で４，０００×６×２＝４８，０００バイトとなる。また、インデックステーブルのデータ量は、１，０００×８＝８，０００バイトとなる。さらに、固有値、固有ベクトルの合計は、３０４，０００バイトとなり、元のサイズの６％程度でよい。 On the other hand, in this embodiment, as for eigenvectors, it is assumed that 1,000 groups are created by hierarchically clustering 4,000 × 6 vectors. Then, since the eigenvector of one character can be expressed by 6 subvectors, it becomes 6 × 2 bytes per character (one subvector is expressed by 2 bytes). Therefore, the total is 4,000 × 6 × 2 = 48,000 bytes. The data amount of the index table is 1,000 × 8 = 8,000 bytes. Furthermore, the sum of the eigenvalues and eigenvectors is 304,000 bytes, which may be about 6% of the original size.
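The figures quoted in this size comparison can be checked with a few lines of arithmetic. The sketch below only reproduces numbers stated in the text; the 304,000-byte total is taken as given, since the text does not itemize how it is composed, and the variable names are chosen for the example.

```python
chars, vectors_per_char = 4000, 6
eigenvector_bytes, eigenvalue_bytes = 200, 8

# Baseline: every eigenvector and eigenvalue stored in full.
baseline_vectors = chars * vectors_per_char * eigenvector_bytes   # 4,800,000
baseline_values = chars * vectors_per_char * eigenvalue_bytes     # 192,000
baseline_total = baseline_vectors + baseline_values               # 4,992,000

# Compressed form: each eigenvector replaced by a 2-byte reference.
index_bytes = chars * vectors_per_char * 2                        # 48,000

stated_total = 304_000  # total quoted in the text, not re-derived here
ratio = stated_total / baseline_total

print(f"baseline total: {baseline_total:,} bytes")
print(f"compressed / baseline: {ratio:.1%}")
```

The ratio comes out at roughly 6%, consistent with the reduction claimed in the text.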

次に、作成したインデックステーブルと、コードブックテーブルとを使用して、パターン画像認識を行う場合について説明する。
例えば、文字パターンの画像を入力し、入力した文字パターンの文字認識を行うとする。まず、入力した文字パターンの特徴ベクトルを生成して主成分分析を行い、入力した文字パターンの固有値、固有ベクトルを作成する。
次に、標本として登録された文字の固有値、固有ベクトルをインデックステーブルとコードブックテーブルから読み出して入力した文字パターンとの文字認識を行う。このとき、インデックステーブルのインデックスフラグに「１」が記録された文字の固有ベクトルを再現する場合は、インデックステーブルの該当文字欄に記録されたインデックス番号をまず取り出す。そして、該当するインデックス番号のサブベクトルをコードブックテーブルを参照して取り出す。取り出したサブベクトルを、インデックステーブルに記録されたインデックス番号順に並べてＤ次元の固有ベクトルを再現する。そして、再現した固有ベクトルとインデックステーブルに登録された固有値とに基づいて、入力した文字パターンと、標本として登録された文字とが一致するか否かを判定する。 Next, a case where pattern image recognition is performed using the created index table and codebook table will be described.
For example, assume that an image of a character pattern is input and character recognition of the input character pattern is performed. First, a feature vector of an input character pattern is generated and principal component analysis is performed to generate eigenvalues and eigenvectors of the input character pattern.
Next, character recognition is performed on the character pattern input by reading the eigenvalue and eigenvector of the character registered as a sample from the index table and codebook table. At this time, when reproducing the eigenvector of the character having “1” recorded in the index flag of the index table, the index number recorded in the corresponding character column of the index table is first extracted. Then, the subvector of the corresponding index number is extracted with reference to the code book table. The extracted subvectors are arranged in the order of the index numbers recorded in the index table to reproduce a D-dimensional eigenvector. Then, based on the reproduced eigenvector and the eigenvalue registered in the index table, it is determined whether or not the input character pattern matches the character registered as the sample.
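The reconstruction step at recognition time can be sketched as below. The entry layout (a dict with `flag` and `data` keys), the function name `restore_eigenvector`, and the sample values are assumptions for illustration; the text specifies only that flag “0” marks fully stored vectors and flag “1” marks vectors stored as index numbers.

```python
def restore_eigenvector(entry, codebook):
    """Rebuild a D-dimensional eigenvector from an index-table entry.

    entry: dict with "flag" (0 = full vector stored, 1 = index numbers stored)
           and "data" (the vector elements, or the list of index numbers).
    codebook: dict mapping an index number to its subvector.
    """
    if entry["flag"] == 0:
        return list(entry["data"])
    restored = []
    # Concatenate the subvectors in the order the index numbers were recorded.
    for index_number in entry["data"]:
        restored.extend(codebook[index_number])
    return restored


# Hypothetical codebook and index-table entries.
codebook = {0: (21, 11, 45, 62), 1: (5, 6, 7, 8)}
compressed = {"flag": 1, "data": [1, 0]}
full = {"flag": 0, "data": [9, 9, 9, 9]}
```

Here `restore_eigenvector(compressed, codebook)` yields the eight elements of subvector 1 followed by subvector 0, while the fully stored entry is returned unchanged.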

[Modification]
Another method of hierarchical clustering will be described.
The grouping unit 114 first arranges the acquired average vectors of the characters in a vector space. If, for example, each average vector is a 20-dimensional vector having 20 vector elements, the vector space is a space represented by 20 axes. For simplicity, FIG. 12 shows a two-dimensional vector space with axis 1 and axis 2. Axis 1 is the axis on which only the first of the 20 elements of the average vector has a value, and axis 2 is the axis on which only the second of the 20 elements has a value.

Next, the grouping unit 114 focuses on one average vector and obtains its coordinate value on each axis. The grouping unit 114 then obtains, on each axis, a range extended by a fixed amount to the left and right of that coordinate value, finds the characters that fall within the range, and takes them as the character set for that axis.
In the example shown in FIG. 12, focusing on the character “大”, the four characters “太, 犬, 大, 木” lie on axis 1 within the range of size 7 centered on “大”. Similarly, the six characters “犬, 亜, 大, 木, 太, 腕” lie on axis 2 within the range of size 7 centered on “大”. The grouping unit 114 computes the logical product (intersection) of the character sets obtained on each axis, and thereby detects the characters located near “大” on all axes. In the example shown in FIG. 12, taking the intersection of the character set obtained from axis 1 and the character set obtained from axis 2 yields “太, 犬, 大, 木”. The grouping unit 114 treats the resulting character set as a similar character set. The grouping unit 114 performs this processing for every character to obtain a similar character set for each one. It then sets as groups those similar character sets that match each other exactly. The grouping unit 114 outputs the resulting group information to the storage device 23, the grouped character information registration unit 116, and the ungrouped character information registration unit 117.
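The axis-by-axis range test and the intersection of the per-axis character sets can be sketched as follows. The coordinates, the `half_width` parameter, and the reading of "exactly matching sets" as "the set is identical for every one of its members" are assumptions made for this example; the text does not give coordinates or state whether 7 is the full or half width of the range.

```python
def similar_sets(mean_vectors, half_width):
    """For each character, collect the characters lying within `half_width`
    of it on every axis (intersection of the per-axis character sets)."""
    result = {}
    for label, vec in mean_vectors.items():
        near = set(mean_vectors)
        for axis, center in enumerate(vec):
            on_axis = {other for other, v in mean_vectors.items()
                       if abs(v[axis] - center) <= half_width}
            near &= on_axis
        result[label] = frozenset(near)
    return result


def groups_from_similar_sets(sets):
    """Keep sets of two or more characters that are identical for every member."""
    groups = set()
    for members in sets.values():
        if len(members) > 1 and all(sets[m] == members for m in members):
            groups.add(members)
    return groups


# Hypothetical two-dimensional coordinates for the characters of FIG. 12.
vectors = {"大": (10, 10), "太": (11, 10), "犬": (10, 12),
           "木": (12, 11), "亜": (30, 11), "腕": (40, 12)}
sets = similar_sets(vectors, half_width=3)
groups = groups_from_similar_sets(sets)
```

With these made-up coordinates, axis 1 alone keeps four characters near “大” and axis 2 keeps all six, so the intersection is the four-character set, matching the shape of the example in the text.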

上述した実施例は、本発明の好適な実施の例である。但し、これに限定されるものではなく、本発明の要旨を逸脱しない範囲内において種々変形実施可能である。 The embodiment described above is a preferred embodiment of the present invention. However, the present invention is not limited to this, and various modifications can be made without departing from the scope of the present invention.

(Appendix)
(Appendix 1)
A pattern recognition dictionary creation device comprising:
feature vector generating means for obtaining feature amounts based on an image pattern of an input image and generating a feature vector having the feature amounts as elements;
grouping means for comparing the feature vectors of respective images to extract images having high similarity, and collecting the images having high similarity into a group;
feature data generating means for generating, based on the feature vector generated by the feature vector generating means, a plurality of feature data including information on the pixel distribution of the image; and
recognition dictionary creating means for, when creating a pattern recognition dictionary by storing identification information identifying each image and the feature data of each image in storage means, creating the pattern recognition dictionary so that the data amount of the feature data of images included in the group is larger than the data amount of the feature data of images not included in the group, and storing the created pattern recognition dictionary in the storage means.
(Appendix 2)
The pattern recognition dictionary creation device according to appendix 1, wherein the recognition dictionary creating means, when creating the pattern recognition dictionary for an image not included in the group, extracts feature data common to a plurality of feature data of other images, registers identification information identifying the extracted plurality of feature data in the pattern recognition dictionary in place of the extracted plurality of feature data, and creates a sub-dictionary that associates the extracted plurality of feature data with the identification information identifying them.
(Appendix 3)
The pattern recognition dictionary creation device according to appendix 1, wherein the feature data generating means performs principal component analysis based on the feature vector and generates an eigenvalue indicating the pixel distribution of the image and an eigenvector having a plurality of vector elements.
(Appendix 4)
The pattern recognition dictionary creation device according to appendix 3, wherein the recognition dictionary creating means, when creating the pattern recognition dictionary for an image not included in the group, extracts, from the vector elements of the eigenvector of the image, at least one of a vector element sequence that partially matches the vector elements of an eigenvector of another image and a vector element sequence whose vector elements differ from them by a predetermined value or less, stores identification information identifying the extracted vector element sequence in the storage means, and creates a sub-dictionary that associates the extracted vector element sequence with the identification information.
(Appendix 5)
A program causing a computer to function as:
means for obtaining feature amounts based on an image pattern of an input image and generating a feature vector having the feature amounts as elements;
means for comparing the feature vectors of respective images to extract images having high similarity, and collecting the images having high similarity into a group;
means for generating, based on the generated feature vector, a plurality of feature data including information on the pixel distribution of the image; and
means for, when creating a pattern recognition dictionary by storing identification information identifying each image and the feature data of each image in storage means, creating the pattern recognition dictionary so that the data amount of the feature data of images included in the group is larger than the data amount of the feature data of images not included in the group, and storing the created pattern recognition dictionary in the storage means.
(Appendix 6)
The program according to appendix 5, wherein the means for storing the pattern recognition dictionary in the storage means, when creating the pattern recognition dictionary for an image not included in the group, extracts feature data common to a plurality of feature data of other images, registers identification information identifying the extracted plurality of feature data in the pattern recognition dictionary in place of the extracted plurality of feature data, and creates a sub-dictionary that associates the extracted plurality of feature data with the identification information identifying them.
(Appendix 7)
The program according to appendix 5, wherein the means for generating the feature data performs principal component analysis based on the feature vector and generates an eigenvalue indicating the pixel distribution of the image and an eigenvector having a plurality of vector elements.
(Appendix 8)
The program according to appendix 7, wherein the means for storing the pattern recognition dictionary in the storage means, when creating the pattern recognition dictionary for an image not included in the group, extracts, from the vector elements of the eigenvector of the image, at least one of a vector element sequence that partially matches the vector elements of an eigenvector of another image and a vector element sequence whose vector elements differ from them by a predetermined value or less, stores identification information identifying the extracted vector element sequence in the storage means, and creates a sub-dictionary that associates the extracted vector element sequence with the identification information.

DESCRIPTION OF SYMBOLS
1 Pattern recognition dictionary creation device
10 Dictionary creation part
21 Scanner device
22 Operation part
23 Storage device
24 Display device
30 External storage device
111 Feature vector generation part
112 Average vector calculation part
113 Principal component analysis part
114 Grouping part
115 Dictionary generation/registration part
116 Grouped character information registration part
117 Ungrouped character information registration part
118 Table storage part

Claims

A pattern recognition dictionary creation device comprising:
feature vector generating means for obtaining feature amounts based on an image pattern of an input image and generating a feature vector having the feature amounts as elements;
grouping means for comparing the feature vectors of the respective images to extract images having high similarity, and collecting the images having high similarity as a group;
feature data generating means for generating a plurality of feature data including pixel distribution information of the image based on the feature vector generated by the feature vector generating means; and
recognition dictionary creating means for creating, when identification information for identifying each image and the feature data of each image are stored in storage means to create a pattern recognition dictionary, the pattern recognition dictionary so that the data amount of the feature data of images included in the group is larger than the data amount of the feature data of images not included in the group, and for storing the created pattern recognition dictionary in the storage means.
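The allocation rule of claim 1 can be sketched as follows. This is a simplified illustration under stated assumptions, not the claimed implementation: similarity is taken to be Euclidean distance below a hypothetical `threshold`, each image's feature data is assumed to be an already ordered list, and `n_grouped` and `n_ungrouped` are hypothetical counts of feature data items kept for grouped and ungrouped images.

```python
import math


def group_similar(features, threshold):
    """Return the labels whose feature vector lies within `threshold`
    (Euclidean distance) of at least one other image's feature vector."""
    labels = list(features)
    grouped = set()
    for i, a in enumerate(labels):
        for b in labels[i + 1:]:
            if math.dist(features[a], features[b]) <= threshold:
                grouped.update((a, b))
    return grouped


def build_dictionary(features, feature_data, threshold, n_grouped, n_ungrouped):
    """Keep more feature data for images that belong to a similarity group.

    `features` maps label -> feature vector; `feature_data` maps
    label -> ordered list of feature data items (most important first).
    """
    grouped = group_similar(features, threshold)
    dictionary = {}
    for label, data in feature_data.items():
        keep = n_grouped if label in grouped else n_ungrouped
        dictionary[label] = data[:keep]
    return dictionary
```

The rationale suggested by the claim is that easily confused images need more detail to be told apart, while images with no close neighbours can be recognized from less data, which keeps the total dictionary size down.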

The pattern recognition dictionary creation device according to claim 1, wherein, when creating the pattern recognition dictionary for images not included in the group, the recognition dictionary creating means extracts feature data common to a plurality of feature data of other images, registers identification information for identifying the plurality of extracted feature data in the pattern recognition dictionary in place of the plurality of extracted feature data, and creates a sub-dictionary that associates the plurality of extracted feature data with the identification information for identifying the plurality of extracted feature data.

The pattern recognition dictionary creation device according to claim 1, wherein the feature data generating means performs principal component analysis based on the feature vector and generates an eigenvalue indicating the pixel distribution of the image and an eigenvector having a plurality of vector elements.
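As an illustration of the principal component analysis named in claim 3, the sketch below computes the covariance matrix of several sample feature vectors and estimates its largest eigenvalue and the corresponding eigenvector by power iteration. Power iteration is a technique chosen here for brevity; the claim does not say how the eigen-decomposition is performed. The function names and the assumption that several sample vectors are available per image category are hypothetical.

```python
import math


def covariance(samples):
    """Covariance matrix (population form) of equal-length feature vectors."""
    n, d = len(samples), len(samples[0])
    mean = [sum(s[j] for s in samples) / n for j in range(d)]
    centered = [[s[j] - mean[j] for j in range(d)] for s in samples]
    return [
        [sum(c[i] * c[j] for c in centered) / n for j in range(d)]
        for i in range(d)
    ]


def top_eigenpair(matrix, iterations=200):
    """Estimate the largest eigenvalue and its unit eigenvector.

    Assumption: the uniform start vector is not orthogonal to the
    dominant eigenvector; if it is, the iteration cannot find it.
    """
    d = len(matrix)
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break  # v lies in the null space; keep the current estimate
        v = [x / norm for x in w]
    # Rayleigh quotient v^T M v gives the eigenvalue for a unit vector v.
    eigenvalue = sum(
        v[i] * sum(matrix[i][j] * v[j] for j in range(d)) for i in range(d)
    )
    return eigenvalue, v
```

Further eigenpairs could be obtained by deflating the matrix and repeating the iteration; how many of them are kept for each image is what the allocation in claim 1 controls.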

The pattern recognition dictionary creation device according to claim 3, wherein, when creating the pattern recognition dictionary for images not included in the group, the recognition dictionary creating means extracts, from the vector elements of the eigenvector of an image, at least one of a vector element sequence that partially matches vector elements of an eigenvector of another image and a vector element sequence whose difference from those vector elements is equal to or less than a predetermined value, stores identification information for identifying the extracted vector element sequence in the storage means, and creates a sub-dictionary that associates the extracted vector element sequence with the identification information.

A program causing a computer to function as:
means for obtaining feature amounts based on an image pattern of an input image and generating a feature vector having the feature amounts as elements;
means for comparing the feature vectors of the respective images to extract images having high similarity, and collecting the images having high similarity as a group;
means for generating a plurality of feature data including pixel distribution information of the image based on the generated feature vector; and
means for creating, when identification information for identifying each image and the feature data of each image are stored in storage means to create a pattern recognition dictionary, the pattern recognition dictionary so that the data amount of the feature data of images included in the group is larger than the data amount of the feature data of images not included in the group, and for storing the created pattern recognition dictionary in the storage means.