JP2012164008A

JP2012164008A - Handwritten word recognition device and model learning device for handwritten word recognition

Info

Publication number: JP2012164008A
Application number: JP2011021728A
Authority: JP
Inventors: Tomoyuki Hamamura (浜村倫行); Bumpei Irie (入江文平); Shigeki Sagayama (嵯峨山茂樹); Junki Ono (小野順貴); Takuya Nishimoto (西本卓也)
Original assignee: Toshiba Corp; University of Tokyo NUC
Current assignee: Toshiba Corp; University of Tokyo NUC
Priority date: 2011-02-03
Filing date: 2011-02-03
Publication date: 2012-08-30
Anticipated expiration: 2031-02-03
Also published as: JP5524102B2

Abstract

PROBLEM TO BE SOLVED: To provide a handwritten word recognition device that greatly improves the recognition accuracy of handwritten words by taking the environment into account, even when completely different letterforms are mixed within one class.
SOLUTION: A handwritten word recognition device, which recognizes handwritten words in images of mail pieces in a mail processing system, allows a plurality of Gaussian distributions to be assigned to one state. As a result, even when completely different letterforms are mixed within one class, the recognition accuracy of handwritten words is greatly improved while taking into account the environment, such as character deformation caused by the characters on either side.

Description

Embodiments of the present invention relate to, for example, a handwritten word recognition device that recognizes handwritten words in an image of a mail piece in a mail processing system, and to a model learning device for handwritten word recognition used in that handwritten word recognition device.

Hidden Markov models (HMMs) are commonly used in speech recognition, for example, and HMM-based methods have also been proposed for handwritten character recognition.

In speech recognition, a phoneme is distorted depending on its neighboring phonemes (called its "environment"), so it is known to be effective to build a separate phoneme model for each environment. When there are many environments, however, the number of training patterns per phoneme model decreases, which leads to overfitting. A common remedy is to cluster the environments and share one model among all environments in the same cluster, which increases the number of training patterns per model and prevents overfitting.

Applying this method to handwritten character recognition is expected to improve recognition performance. In handwritten character recognition, however, completely different letterforms, such as block and cursive, are mixed within one class. The environment clustering method above does not assume this situation, so it is not effective here.

Take the lowercase English letter "r" as a concrete example. Block-style and cursive "r" have completely different shapes, so they are expected to form two widely separated distributions, as shown in FIG. 8. Differences between environments are expected to cause only small shifts within each distribution. In this example, the character to the left of "r" is taken as the environment. There are five environments, a to e, written a-r to e-r. If the data are clustered while ignoring the environment, a split that reflects the letterform is possible, as shown by C and D in FIG. 9. If the data are split by environment only, however, each cluster spans both letterforms, as shown by A and B in FIG. 9. The estimation accuracy is then significantly lower than in the former case.
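The FIG. 9 argument can be made concrete with a small Python sketch. All feature values below are invented 1-D numbers and are not data from this document. The environment-only grouping {a-r, b-r, c-r} vs {d-r, e-r} is a hypothetical stand-in for split A/B. The script measures the total within-cluster scatter (sum of squared deviations from each cluster mean) for both splits. The letterform split should come out far smaller, because each environment-only cluster straddles both letterforms.

```python
# Hypothetical 1-D features for the letter "r" (invented, not from the patent).
# Each sample: (left-hand environment, letterform, feature value).
# Block-style samples sit near 0-2 and cursive ones near 10; the environment
# only shifts a sample slightly within its letterform cluster.
SAMPLES = [
    ("a", "block", 0.0), ("a", "block", 0.2), ("a", "cursive", 10.0),
    ("b", "block", 0.4), ("b", "block", 0.5), ("b", "cursive", 10.3),
    ("c", "block", 0.8), ("c", "block", 0.9), ("c", "cursive", 10.1),
    ("d", "block", 1.5), ("d", "block", 1.6), ("d", "cursive", 10.4),
    ("e", "block", 1.8), ("e", "block", 1.9), ("e", "cursive", 10.2),
]


def scatter(values):
    """Sum of squared deviations from the mean of one cluster."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def total_scatter(cluster_of):
    """Total within-cluster scatter when samples are grouped by cluster_of(sample)."""
    clusters = {}
    for sample in SAMPLES:
        clusters.setdefault(cluster_of(sample), []).append(sample[2])
    return sum(scatter(vals) for vals in clusters.values())


# Split on the environment only: {a-r, b-r, c-r} vs {d-r, e-r} (cf. A/B in FIG. 9).
by_environment = total_scatter(lambda s: "abc" if s[0] in "abc" else "de")
# Split on the letterform, ignoring the environment (cf. C/D in FIG. 9).
by_letterform = total_scatter(lambda s: s[1])

print(f"environment split: {by_environment:.2f}")
print(f"letterform split:  {by_letterform:.2f}")
```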

H. Bunke, S. Bengio, A. Vinciarelli, "Offline recognition of unconstrained handwritten texts using HMMs and statistical language models," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 26, no. 6, June 2004, pp. 709-720.
Junichi Takami, Shigeki Sagayama, "Automatic generation of hidden Markov networks by successive state splitting," IEICE Transactions D-II, vol. J76-D-II, no. 10, pp. 2155-2164, Oct. 1993.

Accordingly, an object of the present invention is to provide a handwritten word recognition device and a model learning device for handwritten word recognition. Both can significantly improve the recognition accuracy of handwritten words by taking the environment into account, even when completely different letterforms are mixed within one class.

The handwritten word recognition device according to the embodiment comprises the following means. Image capturing means captures an image containing a handwritten word on a recording medium. Word extraction means extracts a word image from the image captured by the image capturing means. First feature extraction means extracts features from the word image extracted by the word extraction means. Model storage means stores a character model for each character. Each character model consists of a model population and an environment-specific character model for each environment. The model population and each environment-specific character model each consist of a plurality of states. Each state of the model population consists of G (a natural number of at least 2) Gaussian distributions. Each state of each environment-specific character model consists of M (a natural number of at least 2, M ≤ G) Gaussian distributions, corresponding to one of the combinations of M Gaussian distributions selected from the G Gaussian distributions constituting the corresponding state of the model population. Finally, model matching means matches the features extracted by the first feature extraction means against each character model stored in the model storage means and outputs the result as the recognition result.

FIG. 1 is a block diagram schematically showing the configuration of the handwritten word recognition device according to the embodiment.
FIG. 2 shows an example of an input image.
FIG. 3 shows an example of extracted word candidates.
FIG. 4 is a block diagram schematically showing the device configuration during handwritten word recognition.
FIG. 5 is a block diagram schematically showing the device configuration during model learning.
FIG. 6 illustrates the model according to the embodiment.
FIG. 7 illustrates an example of environment splitting using GMMs according to the embodiment.
FIG. 8 illustrates the data distribution when different letterforms are mixed.
FIG. 9 illustrates examples of splitting the data distribution by environment.

Before the embodiment is described, an outline of it is given.
One way to solve the conventional problem described above is to model the distribution of each environment as a Gaussian mixture model (GMM) rather than a single Gaussian, and to share the individual Gaussians among environments. This is explained with reference to FIG. 7, using an example similar to that of FIGS. 8 and 9. The distributions of the two letterforms differ greatly, so it is natural to split by letterform first. In FIG. 7 the data are therefore split into a left part and a right part. The left part (block-style "r") contains many patterns, so it can be split further by environment. In FIG. 7 it is divided into group E (lower), containing the three environments a-r, b-r and c-r, and group F (upper), containing the two environments d-r and e-r.

The right part (cursive "r"), on the other hand, has too few patterns to be split further and forms a single group G. As a result, the distribution representing a-r, for example, is a GMM composed of groups E and G. The distribution representing d-r is a GMM composed of groups F and G.

Accordingly, this embodiment considers a model with G Gaussians in total, in which each environment is represented by a GMM that uses M of them. The model is estimated by maximum likelihood with the EM algorithm.
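The Python sketch below shows this shared-Gaussian representation for a single state of "r", assuming 1-D features. The group labels E, F and G from FIG. 7 are reused as names for a pool of G = 3 Gaussians. Each environment is a GMM that selects M = 2 of them with its own weights. The means, variances and weights (`POOL`, `ENVIRONMENTS`) are invented for illustration. The point is that a-r and d-r both reference the same cursive Gaussian G rather than each estimating its own copy.

```python
import math

# Hypothetical shared pool of G = 3 one-dimensional Gaussians for one state of
# "r", named after the groups in FIG. 7: E and F cover block-style variants,
# G covers cursive. Values are (mean, variance) and are invented.
POOL = {"E": (0.5, 0.1), "F": (1.7, 0.1), "G": (10.2, 0.2)}

# Each environment is a GMM over M = 2 pooled Gaussians with its own weights.
ENVIRONMENTS = {
    "a-r": {"E": 0.8, "G": 0.2},
    "d-r": {"F": 0.8, "G": 0.2},
}


def gauss_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)


def env_log_likelihood(env, xs):
    """Log-likelihood of features xs under the environment's GMM. The GMM only
    references Gaussians in POOL; it does not own copies of them."""
    mixture = ENVIRONMENTS[env]
    total = 0.0
    for x in xs:
        p = sum(w * gauss_pdf(x, *POOL[g]) for g, w in mixture.items())
        total += math.log(p)
    return total


samples_after_a = [0.4, 0.6, 10.1]  # two block-style samples, one cursive
for env in ENVIRONMENTS:
    print(env, round(env_log_likelihood(env, samples_after_a), 2))
```

The two environments share the cursive Gaussian, so cursive samples from either environment contribute to its estimate. This is what allows the sparsely populated cursive group to stay unsplit.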

Next, a method for jointly optimizing environment clustering and GMM (Gaussian mixture model) estimation, as applied to this embodiment, is described.

Embodiments will now be described with reference to the drawings.
FIG. 1 schematically shows the configuration of the handwritten word recognition device according to this embodiment. The device comprises:
an image input unit 11 serving as image capturing means, which captures an image containing handwritten words on a recording medium;
a word extraction unit 12 serving as word extraction means, which extracts word images from the image captured by the image input unit 11;
a video coding system (VCS) 13 of the mail processing system;
a word image storage unit 14 serving as word image storage means, which stores word images whose correct answers have been taught via the video coding system 13, as word images for learning;
a feature extraction unit 15 serving as the first and second feature extraction means, which extracts features from the word images stored in the word image storage unit 14 or extracted by the word extraction unit 12;
a model storage unit 16 serving as model storage means, which stores a character model for each character;
a model matching unit 17 serving as model matching means, which matches the features extracted by the feature extraction unit 15 against each character model stored in the model storage unit 16;
a first probability calculation unit 18 serving as first probability calculation means, which calculates the posterior probability that the features extracted by the feature extraction unit 15 are emitted from each state of each character model in the model storage unit 16;
a second probability calculation unit 19 serving as second probability calculation means, which calculates the posterior probability that each environment corresponds to each combination;
a third probability calculation unit 20 serving as third probability calculation means, which calculates, on condition that each environment corresponds to each combination, the posterior probability that the features extracted by the feature extraction unit 15 are emitted from each Gaussian distribution;
a Gaussian parameter update unit 21 serving as Gaussian parameter update means, which calculates Gaussian distribution parameters from the probabilities calculated by the first, second and third probability calculation units 18, 19 and 20 and from the features extracted by the feature extraction unit 15, and updates the Gaussian distribution parameters stored in the model storage unit 16 based on the result; and
a weight parameter update unit 22 serving as weight parameter update means, which calculates weight parameters for model learning from the probabilities calculated by the second and third probability calculation units 19 and 20, and updates the weight parameters stored in the model storage unit 16 based on the result.

Each unit is described in detail below.
The image input unit 11 inputs an image containing handwritten words (English words in this example) written on a mail piece, as shown in FIG. 2. It consists of, for example, a video camera or a similar device.

The word extraction unit 12 extracts word candidates (word images) by applying known image processing to the image input by the image input unit 11. FIG. 3 shows examples of word candidates extracted from the image of FIG. 2.

The video coding system 13 displays on a display unit, for example, the image of a mail piece whose address information (words) could not be recognized by a mail sorting machine (not shown). An operator then enters the correct answer for each unrecognized word as coding work.
The word image storage unit 14 stores each word image together with the correct answer (word) entered via the video coding system 13.

The feature extraction unit 15 extracts features from the word images extracted by the word extraction unit 12 or stored in the word image storage unit 14. A plurality of features is extracted from each word. Various feature extraction methods have been proposed. For example, the method disclosed in the following document can be used.

J. A. Rodriguez, F. Perronnin, "Local gradient histogram features for word spotting in unconstrained handwritten documents," International Conference on Frontiers in Handwriting Recognition (ICFHR 2008), July 2008.
The model storage unit 16 stores character models 31, 31, ..., one for each character. Each character model 31 consists of, for example, one model population 32 and L (two or more) environment-specific character models 33_1 to 33_L, one per environment. The model population 32 and each of the environment-specific character models 33_1 to 33_L consist of N (two or more) states. Each state of the environment-specific character models 33_1 to 33_L consists of M (two or more) Gaussian distributions (Gaussians). Each state of the model population 32 consists of G (M ≤ G) Gaussians. The M Gaussians in each state of the environment-specific character models 33_1 to 33_L correspond to one of the combinations of M Gaussians selected from the G Gaussians in the corresponding state of the model population 32.
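The storage layout just described can be sketched as Python dataclasses. The class and field names (`PopulationState`, `EnvironmentState`, `CharacterModel`, `validate`) are hypothetical, and features are 1-D for brevity. The check encodes only the constraints stated here: each environment-specific state selects M distinct Gaussians out of the G in the matching population state, with 2 ≤ M ≤ G. Storing indices instead of copies of the Gaussians is a design assumption that keeps the sharing explicit.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Gaussian = Tuple[float, float]  # (mean, variance); 1-D to keep the sketch short


@dataclass
class PopulationState:
    """One state of the model population: the G shared Gaussians."""
    gaussians: List[Gaussian]


@dataclass
class EnvironmentState:
    """One state of an environment-specific model: M of the G Gaussians."""
    indices: Tuple[int, ...]    # positions in PopulationState.gaussians
    weights: Tuple[float, ...]  # mixture weight of each selected Gaussian


@dataclass
class CharacterModel:
    population: List[PopulationState]                # N states
    environments: Dict[str, List[EnvironmentState]]  # L environments, N states each

    def validate(self, m: int) -> None:
        """Raise ValueError unless every environment state picks M distinct
        Gaussians (2 <= M <= G) from the matching population state."""
        n = len(self.population)
        for env, states in self.environments.items():
            if len(states) != n:
                raise ValueError(f"{env}: expected {n} states, got {len(states)}")
            for pop, st in zip(self.population, states):
                g = len(pop.gaussians)
                if not 2 <= m <= g:
                    raise ValueError(f"need 2 <= M <= G, got M={m}, G={g}")
                if len(st.indices) != m or len(set(st.indices)) != m:
                    raise ValueError(f"{env}: must select exactly {m} distinct Gaussians")
                if not all(0 <= i < g for i in st.indices):
                    raise ValueError(f"{env}: index out of range for G={g}")
                if len(st.weights) != m or abs(sum(st.weights) - 1.0) > 1e-9:
                    raise ValueError(f"{env}: need {m} weights summing to 1")


# Hypothetical character "r": N = 2 states, G = 3, M = 2, two environments.
pop = [PopulationState([(0.5, 0.1), (1.7, 0.1), (10.2, 0.2)]) for _ in range(2)]
r_model = CharacterModel(
    population=pop,
    environments={
        "a-r": [EnvironmentState((0, 2), (0.8, 0.2))] * 2,
        "d-r": [EnvironmentState((1, 2), (0.8, 0.2))] * 2,
    },
)
r_model.validate(m=2)  # passes silently
```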

The model matching unit 17 matches the features extracted by the feature extraction unit 15 against each character model 31 stored in the model storage unit 16. It outputs the result with the highest matching score as the recognition result.

The first probability calculation unit 18 calculates the posterior probability that the features extracted by the feature extraction unit 15 are emitted from each state of each character model 31 in the model storage unit 16. This can be calculated, for example, with the formulas given in "Machine Learning for Audio, Image and Video Analysis: Theory and Applications," Springer, 2008.

The second probability calculation unit 19 calculates the posterior probability that each environment corresponds to each combination (each GMM). The third probability calculation unit 20 calculates, on condition that each environment corresponds to each combination (each GMM), the posterior probability that the features extracted by the feature extraction unit 15 are emitted from each Gaussian distribution.

The Gaussian parameter update unit 21 calculates Gaussian distribution parameters from the probabilities calculated by the first, second and third probability calculation units 18, 19 and 20 and from the features extracted by the feature extraction unit 15. It then updates the Gaussian distribution parameters stored in the model storage unit 16 based on the result.

The weight parameter update unit 22 calculates weight parameters for model learning from the probabilities calculated by the second and third probability calculation units 19 and 20. It then updates the weight parameters stored in the model storage unit 16 based on the result.

Next, the handwritten word recognition process in the above configuration is described.
During handwritten word recognition, the device configuration of FIG. 1 reduces to that shown in FIG. 4. The video coding system 13, the word image storage unit 14, the first, second and third probability calculation units 18, 19 and 20, the Gaussian parameter update unit 21 and the weight parameter update unit 22 are not used.

First, the image input unit 11 inputs an image containing handwritten words on a mail piece. FIG. 2 shows an example of the input image. Next, the word extraction unit 12 extracts word candidates (word images) by applying known image processing to the image input by the image input unit 11. FIG. 3 shows examples of word candidates extracted from the image of FIG. 2.

Next, the feature extraction unit 15 extracts features from the word images extracted by the word extraction unit 12. A plurality of features is extracted from each word. The model matching unit 17 then matches the features extracted by the feature extraction unit 15 against each character model 31 stored in the model storage unit 16 and outputs the result with the highest matching score as the recognition result.
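The matching step is not specified here beyond "highest matching score". The Python sketch below is an assumed stand-in. It scores each candidate word with a Viterbi-style alignment of the features to the concatenated left-to-right states of its characters, then picks the best word from a lexicon. The character models (`CHAR_MODELS`) are invented single-Gaussian states. Environment-specific model selection is omitted, and transition probabilities are treated as uniform.

```python
import math
from typing import Dict, List, Sequence, Tuple

# Hypothetical character models: each character is a left-to-right chain of
# states with a single 1-D Gaussian (mean, variance). In the embodiment each
# state would instead be an environment-specific GMM.
CHAR_MODELS: Dict[str, List[Tuple[float, float]]] = {
    "a": [(0.0, 1.0), (1.0, 1.0)],
    "b": [(4.0, 1.0), (5.0, 1.0)],
    "c": [(8.0, 1.0), (9.0, 1.0)],
}


def log_gauss(x: float, mean: float, var: float) -> float:
    """Log density of a 1-D Gaussian."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)


def viterbi_score(features: Sequence[float], word: str) -> float:
    """Best log score over monotonic alignments of the features to the word's
    concatenated states. Each frame stays in its state or advances by one,
    starting in the first state and ending in the last. Transition
    probabilities are omitted (treated as uniform) for brevity."""
    states = [s for ch in word for s in CHAR_MODELS[ch]]
    n = len(states)
    if len(features) < n:
        return -math.inf
    prev = [-math.inf] * n
    prev[0] = log_gauss(features[0], *states[0])
    for x in features[1:]:
        cur = [-math.inf] * n
        for j in range(n):
            best = max(prev[j], prev[j - 1] if j > 0 else -math.inf)
            if best > -math.inf:
                cur[j] = best + log_gauss(x, *states[j])
        prev = cur
    return prev[-1]


def recognize(features: Sequence[float], lexicon: Sequence[str]) -> str:
    """Return the lexicon word with the highest matching score."""
    return max(lexicon, key=lambda w: viterbi_score(features, w))


print(recognize([0.1, 0.9, 4.2, 5.1], ["ab", "ac", "ba"]))  # prints "ab"
```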

Next, the model learning process is described.
During model learning, the device configuration of FIG. 1 reduces to that shown in FIG. 5. The image input unit 11, the word extraction unit 12 and the model matching unit 17 are not used.

First, correct answers are taught via the video coding system 13 for word images that the mail sorting machine (not shown) could not recognize. These word images are stored in the word image storage unit 14. The feature extraction unit 15 extracts features from the word images stored in the word image storage unit 14. A plurality of features is extracted from each word.

Next, the first probability calculation unit 18 calculates the posterior probability that each feature extracted by the feature extraction unit 15 is emitted from each state of each character model 31 in the model storage unit 16. It then assigns each feature to the state with the highest posterior probability. In this way, each feature is assigned a specific character model, a specific environment and a specific state.
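State posteriors given the features are commonly computed with the forward-backward algorithm. This document only points to a textbook for the formulas, so the Python sketch below is an assumed stand-in. It uses the standard scaled forward-backward recursion, followed by the hard assignment to the most probable state described above. The emission likelihoods `B` are given directly; in practice they would come from the states' GMMs. The two-state example is invented.

```python
from typing import List, Sequence


def state_posteriors(pi: Sequence[float], A: Sequence[Sequence[float]],
                     B: Sequence[Sequence[float]]) -> List[List[float]]:
    """gamma[t][j] = P(state j at frame t | all frames), computed with the
    scaled forward-backward algorithm.
    pi[j]: initial probability, A[i][j]: transition probability i -> j,
    B[t][j]: likelihood of frame t's feature under state j."""
    T, N = len(B), len(pi)
    alpha, scale = [], []
    for t in range(T):
        if t == 0:
            a = [pi[j] * B[0][j] for j in range(N)]
        else:
            a = [B[t][j] * sum(alpha[-1][i] * A[i][j] for i in range(N))
                 for j in range(N)]
        c = sum(a)
        alpha.append([v / c for v in a])  # normalise each frame to avoid underflow
        scale.append(c)
    beta = [[1.0] * N for _ in range(T)]
    for t in range(T - 2, -1, -1):
        beta[t] = [sum(A[i][j] * B[t + 1][j] * beta[t + 1][j] for j in range(N))
                   / scale[t + 1] for i in range(N)]
    gamma = []
    for t in range(T):
        g = [alpha[t][j] * beta[t][j] for j in range(N)]
        s = sum(g)
        gamma.append([v / s for v in g])
    return gamma


def hard_assign(gamma: List[List[float]]) -> List[int]:
    """Assign each frame to the state with the largest posterior."""
    return [max(range(len(g)), key=g.__getitem__) for g in gamma]


# Two-state left-to-right model: frames 0-1 resemble state 0, frames 2-3 state 1.
pi = [1.0, 0.0]
A = [[0.5, 0.5], [0.0, 1.0]]
B = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.1, 0.9]]
print(hard_assign(state_posteriors(pi, A, B)))  # prints [0, 0, 1, 1]
```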

Here, the features assigned to environment l correspond to the elements x_{l1}, ..., x_{lN_l} of x_l in the "simultaneous optimization method of environment clustering and GMM estimation" described above. Note that the "data" in that method correspond to the "features" in this embodiment.

Next, the second probability calculation unit 19 calculates the posterior probability that each environment corresponds to each combination (each GMM). That is, it evaluates Equation (93) of the "simultaneous optimization method of environment clustering and GMM estimation" described above.

Next, the third probability calculation unit 20 calculates, on condition that each environment corresponds to each combination (each GMM), the posterior probability that each feature extracted by the feature extraction unit 15 is emitted from each Gaussian distribution. That is, it evaluates Equation (97) of the "simultaneous optimization method of environment clustering and GMM estimation" described above.
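Equations (93) and (97) are not reproduced in this text. The following Python sketch therefore reconstructs one possible E-step under stated assumptions. For one state, all features of an environment are drawn from a single combination c (an M-subset of the G pooled Gaussians), chosen with prior `prior[c]`. Each feature is then drawn from that combination's GMM with weights `weights[c]`. Under this assumed model, `p_combo` plays the role of the second probability and `p_gauss` the role of the third. Features are 1-D, and every `prior[c]` must be positive. The function name and the model are hypothetical; they are not this document's own formulas.

```python
import math
from itertools import combinations
from typing import Dict, List, Sequence, Tuple


def gauss_pdf(x: float, mean: float, var: float) -> float:
    """Density of a 1-D Gaussian."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)


def e_step(pool: Sequence[Tuple[float, float]], m: int, prior: Sequence[float],
           weights: Sequence[Sequence[float]], env_features: Dict[str, List[float]]):
    """Assumed E-step for one state.
    pool: the G shared Gaussians (mean, variance). The combinations c are all
    M-subsets of range(G) in itertools order; prior[c] and weights[c][k] are the
    combination prior and the k-th mixture weight inside combination c.
    Returns (combos, p_combo, p_gauss):
      p_combo[env][c]       = P(env uses combination c | all features of env)
      p_gauss[env][c][n][k] = P(feature n came from k-th Gaussian of c | env uses c)
    """
    combos = list(combinations(range(len(pool)), m))
    p_combo, p_gauss = {}, {}
    for env, xs in env_features.items():
        log_joint, resp = [], []
        for c, idx in enumerate(combos):
            ll, r_c = math.log(prior[c]), []
            for x in xs:
                terms = [weights[c][k] * gauss_pdf(x, *pool[g]) for k, g in enumerate(idx)]
                s = max(sum(terms), 1e-300)  # guard against underflow
                ll += math.log(s)
                r_c.append([t / s for t in terms])
            log_joint.append(ll)
            resp.append(r_c)
        top = max(log_joint)  # log-sum-exp normalisation over combinations
        z = sum(math.exp(v - top) for v in log_joint)
        p_combo[env] = [math.exp(v - top) / z for v in log_joint]
        p_gauss[env] = resp
    return combos, p_combo, p_gauss


pool = [(0.0, 1.0), (10.0, 1.0), (20.0, 1.0)]  # G = 3, invented parameters
combos, p_combo, p_gauss = e_step(pool, 2, [1 / 3] * 3, [[0.5, 0.5]] * 3,
                                  {"a-r": [0.1, 9.9, -0.2, 10.3]})
print(combos[p_combo["a-r"].index(max(p_combo["a-r"]))])  # prints (0, 1)
```

The environment posterior is computed in the log domain because it multiplies the likelihoods of all of an environment's features, which would otherwise underflow quickly.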

Next, the Gaussian parameter update unit 21 calculates Gaussian distribution parameters from the probabilities calculated by the first, second and third probability calculation units 18, 19 and 20 and from the features extracted by the feature extraction unit 15. It then updates the Gaussian distribution parameters stored in the model storage unit 16 based on the result.

That is, it evaluates Equation (78) of the "simultaneous optimization method of environment clustering and GMM estimation" described above and updates the Gaussian distribution parameters (means and covariance matrices) stored in the model storage unit 16.

Next, the weight parameter update unit 22 calculates weight parameters for model learning from the probabilities calculated by the second and third probability calculation units 19 and 20. It then updates the weight parameters stored in the model storage unit 16 based on the result.

That is, it evaluates Equations (83) and (88) of the "simultaneous optimization method of environment clustering and GMM estimation" described above and updates the weight parameters for model learning of the model population 32 stored in the model storage unit 16.
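Equations (78), (83) and (88) are not reproduced in this text either. The Python sketch below covers only an assumed model: each environment uses one M-of-G combination chosen with a prior, and each of its 1-D features comes from that combination's GMM. Under that model the standard EM updates are weighted maximum-likelihood estimates:
- the mean and variance of each pooled Gaussian are accumulated from every combination that contains it, with weight P(c | x_l) · P(k | x_ln, c);
- the combination prior is the average of P(c | x_l) over environments;
- each combination's mixture weights are its normalised occupancies.

The sketch implements those updates. The posteriors are passed in with the layout `p_combo[env][c]` and `p_gauss[env][c][n][k]`. The function name `m_step`, the variance floor and the rule of leaving unused Gaussians unchanged are assumptions.

```python
from typing import Dict, List, Sequence, Tuple


def m_step(old_pool: Sequence[Tuple[float, float]],
           combos: Sequence[Tuple[int, ...]],
           env_features: Dict[str, List[float]],
           p_combo: Dict[str, List[float]],
           p_gauss: Dict[str, List[List[List[float]]]],
           var_floor: float = 1e-3):
    """Assumed M-step for one state, given the E-step posteriors.
    Returns (new_pool, prior, weights)."""
    G = len(old_pool)
    occ, first, second = [0.0] * G, [0.0] * G, [0.0] * G  # per-Gaussian statistics
    prior_num = [0.0] * len(combos)
    w_num = [[0.0] * len(idx) for idx in combos]
    w_den = [0.0] * len(combos)
    for env, xs in env_features.items():
        for c, idx in enumerate(combos):
            pc = p_combo[env][c]
            prior_num[c] += pc
            w_den[c] += pc * len(xs)
            for n, x in enumerate(xs):
                for k, g in enumerate(idx):
                    r = pc * p_gauss[env][c][n][k]  # P(c | x_l) * P(k | x_ln, c)
                    w_num[c][k] += r
                    occ[g] += r
                    first[g] += r * x
                    second[g] += r * x * x
    new_pool = []
    for g in range(G):
        if occ[g] < 1e-12:  # Gaussian received no data: keep previous parameters
            new_pool.append(old_pool[g])
            continue
        mean = first[g] / occ[g]
        var = max(second[g] / occ[g] - mean * mean, var_floor)
        new_pool.append((mean, var))
    prior = [v / len(env_features) for v in prior_num]
    weights = [[v / w_den[c] for v in w_num[c]] if w_den[c] > 0
               else [1.0 / len(combos[c])] * len(combos[c])
               for c in range(len(combos))]
    return new_pool, prior, weights


# Invented example: one environment, one combination (0, 1), hard posteriors.
new_pool, prior, weights = m_step(
    old_pool=[(0.0, 1.0), (5.0, 1.0)],
    combos=[(0, 1)],
    env_features={"a-r": [0.0, 2.0, 10.0, 12.0]},
    p_combo={"a-r": [1.0]},
    p_gauss={"a-r": [[[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]]},
)
print(new_pool)  # prints [(1.0, 1.0), (11.0, 1.0)]
```

Sharing appears in the accumulation step: a Gaussian that belongs to several combinations receives statistics from all of them. This is how sparsely observed letterform variants can still be estimated robustly.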

In the above embodiment, the word image storage unit 14 stores, as word images for learning, word images that could not be recognized and whose correct answers were taught via the video coding system 13. The source is not limited to the video coding system 13: word images whose correct answers were entered via another word image input device may also be stored as word images for learning. Furthermore, successfully recognized word images can also be used for learning, not only unrecognized ones, because their correct answers are known.

In the above embodiment, a single feature extraction unit 15 extracts features both from the word images stored in the word image storage unit 14 and from the word images extracted by the word extraction unit 12. Alternatively, a dedicated feature extraction unit may be provided for each of the word image storage unit 14 and the word extraction unit 12.

Further, in the above embodiment, the first probability calculation unit 18 assigns each feature to the state with the highest posterior probability. Alternatively, the features need not be assigned to a single state; the posterior probabilities may instead be used as weights in the subsequent calculations.

As described above, the embodiment allows a plurality of Gaussian distributions to be assigned to one state. Therefore, even when completely different letterforms are mixed within one class, the recognition accuracy of handwritten words can be significantly improved while taking into account the environment, such as character deformation caused by the characters on either side.

Several embodiments of the present invention have been described. These embodiments are presented as examples and are not intended to limit the scope of the invention. They can be implemented in various other forms, and various omissions, replacements and changes can be made without departing from the gist of the invention. These embodiments and their modifications are included in the scope and gist of the invention, and also in the invention described in the claims and its equivalents.

DESCRIPTION OF SYMBOLS
11 ... image input unit (image capturing means)
12 ... word extraction unit (word extraction means)
13 ... video coding system (VCS)
14 ... word image storage unit (word image storage means)
15 ... feature extraction unit (feature extraction means)
16 ... model storage unit (model storage means)
17 ... model matching unit (model matching means)
18 ... first probability calculation unit (first probability calculation means)
19 ... second probability calculation unit (second probability calculation means)
20 ... third probability calculation unit (third probability calculation means)
21 ... Gaussian parameter update unit (Gaussian parameter update means)
22 ... weight parameter update unit (weight parameter update means)
31 ... character model
32 ... model population
33_1 to 33_L ... environment-specific character models

Claims

1. A handwritten word recognition apparatus comprising:
image capturing means for capturing an image including a handwritten word on a recording medium;
word extraction means for extracting a word image from the image captured by the image capturing means;
first feature extraction means for extracting features from the word image extracted by the word extraction means;
model storage means for storing a character model for each character, wherein each character model is composed of a model population and an environment-specific character model for each environment; the model population and each environment-specific character model are each composed of a plurality of states; each state of the model population is composed of G Gaussian distributions (G being a natural number of 2 or more); and each state of each environment-specific character model is composed of M Gaussian distributions (M being a natural number of 2 or more, M ≦ G) corresponding to one of the combinations of M distributions selected from the G Gaussian distributions constituting the corresponding state of the model population; and
model matching means for performing a matching process between the features extracted by the first feature extraction means and each character model stored in the model storage means, and outputting the result as a recognition result.
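Claim 1 ties each environment-specific state to a subset of the Gaussians of the corresponding population state, but leaves the matching procedure abstract. The following is a minimal Python sketch under stated assumptions, not the patented implementation: features are 1-D scalars rather than feature vectors, a selected subset reuses the population weights renormalized over the subset, and matching is a plain left-to-right Viterbi pass with transition probabilities omitted. The names PopulationState, subsets, log_emission, viterbi_score and match_character are hypothetical.

```python
import math
from itertools import combinations


def log_gauss(x, mean, var):
    """Log density of the 1-D Gaussian N(mean, var) at x."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)


class PopulationState:
    """One state of a model population: G >= 2 Gaussians with mixture weights (hypothetical layout)."""

    def __init__(self, means, variances, weights):
        if not len(means) == len(variances) == len(weights) >= 2:
            raise ValueError("a population state needs G >= 2 Gaussians")
        self.means = list(means)
        self.variances = list(variances)
        self.weights = list(weights)

    def subsets(self, M):
        """All M-of-G index subsets; each can serve as an environment-specific state."""
        if not 2 <= M <= len(self.means):
            raise ValueError("need 2 <= M <= G")
        return list(combinations(range(len(self.means)), M))

    def log_emission(self, x, combo):
        """Log output density of x for the environment-specific state `combo`.

        Assumption: the selected Gaussians reuse the population weights,
        renormalized over the subset. Uses log-sum-exp for stability.
        """
        total_w = sum(self.weights[g] for g in combo)
        terms = [math.log(self.weights[g] / total_w)
                 + log_gauss(x, self.means[g], self.variances[g]) for g in combo]
        top = max(terms)
        return top + math.log(sum(math.exp(t - top) for t in terms))


def viterbi_score(xs, states, combos):
    """Best log score of frames xs through a left-to-right chain of states.

    combos[j] is the Gaussian subset used by state j in one environment-specific
    model. Each frame stays in the current state or advances by one state;
    transition probabilities are omitted for brevity.
    """
    neg = float("-inf")
    prev = [neg] * len(states)
    prev[0] = states[0].log_emission(xs[0], combos[0])
    for x in xs[1:]:
        cur = [neg] * len(states)
        for j, state in enumerate(states):
            best = max(prev[j], prev[j - 1] if j > 0 else neg)
            if best > neg:
                cur[j] = best + state.log_emission(x, combos[j])
        prev = cur
    return prev[-1]  # must end in the last state


def match_character(xs, states, env_models):
    """Score xs against every environment-specific model of one character and keep the best."""
    return max(viterbi_score(xs, states, combos) for combos in env_models)
```

With G = 3 and M = 2, each population state has C(3, 2) = 3 candidate subsets, (0, 1), (0, 2) and (1, 2). The sketch enumerates them on demand rather than storing separate Gaussians per environment, which mirrors the claim's point that environment-specific states share the population's distributions.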

2. A handwritten word recognition model learning device used in the handwritten word recognition apparatus according to claim 1, comprising:
word image storage means for storing word images for learning;
second feature extraction means for extracting features from the word images stored by the word image storage means;
first probability calculation means for calculating the posterior probability that the features extracted by the second feature extraction means are generated from each state of each character model in the model storage means;
second probability calculation means for calculating the posterior probability that each environment corresponds to each of the combinations;
third probability calculation means for calculating the posterior probability that the features extracted by the second feature extraction means are generated from each Gaussian distribution, on the condition that each environment corresponds to the respective combination;
Gaussian parameter updating means for calculating Gaussian distribution parameters from the probabilities calculated by the first, second, and third probability calculation means and the features extracted by the second feature extraction means, and updating the Gaussian distribution parameters stored in the model storage means based on the calculation result; and
weight parameter updating means for calculating weight parameters for model learning from the probabilities calculated by the second and third probability calculation means, and updating the model-learning weight parameters of the model population stored in the model storage means based on the calculation result.
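Claim 2 states which posterior probabilities feed each update but not the update formulas. The sketch below re-estimates one population state in the style of Baum-Welch for Gaussian mixtures, under these assumptions: 1-D features; state posteriors γ_t supplied by an external forward-backward pass (standing in for the first probability); a uniform environment prior with P(e | X) ∝ ∏_t p(x_t | e)^{γ_t} (standing in for the second probability); and P(g | x_t, e) as the third probability. Unlike the claim, which derives the weights from the second and third probabilities alone, this sketch reuses the γ-weighted soft counts for the weights as a simplification. The function names are hypothetical.

```python
import math
from itertools import combinations


def gauss_pdf(x, mean, var):
    """Density of the 1-D Gaussian N(mean, var) at x."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def subset_components(x, combo, means, variances, weights):
    """Weighted densities of the Gaussians in `combo`, weights renormalized over the subset."""
    total_w = sum(weights[g] for g in combo)
    return {g: weights[g] / total_w * gauss_pdf(x, means[g], variances[g]) for g in combo}


def reestimate_state(xs, gammas, means, variances, weights, M, var_floor=1e-3):
    """One EM-style update of a single model-population state (hypothetical sketch).

    xs     -- 1-D feature values of the training frames
    gammas -- first probability: posterior that each frame was generated by this state
    M      -- Gaussians per environment-specific state (2 <= M <= G)
    """
    G = len(means)
    envs = list(combinations(range(G), M))

    # Second probability: environment posterior under a uniform prior,
    # P(e | X) proportional to prod_t p(x_t | e) ** gamma_t, computed in the log domain.
    log_scores = []
    for env in envs:
        score = 0.0
        for x, gm in zip(xs, gammas):
            density = sum(subset_components(x, env, means, variances, weights).values())
            score += gm * math.log(max(density, 1e-300))
        log_scores.append(score)
    top = max(log_scores)
    unnorm = [math.exp(s - top) for s in log_scores]
    z = sum(unnorm)
    env_post = [u / z for u in unnorm]

    # Soft counts: frame t adds gamma_t * P(e | X) * P(g | x_t, e) to component g.
    occ, s1, s2 = [0.0] * G, [0.0] * G, [0.0] * G
    for x, gm in zip(xs, gammas):
        for env, pe in zip(envs, env_post):
            comps = subset_components(x, env, means, variances, weights)
            denom = sum(comps.values())
            if denom <= 0.0:
                continue
            for g, c in comps.items():
                r = gm * pe * (c / denom)  # c / denom is the third probability P(g | x_t, e)
                occ[g] += r
                s1[g] += r * x
                s2[g] += r * x * x

    new_means, new_vars = [], []
    for g in range(G):
        if occ[g] > 0.0:
            mu = s1[g] / occ[g]
            new_means.append(mu)
            new_vars.append(max(s2[g] / occ[g] - mu * mu, var_floor))
        else:  # component received no soft counts: keep its old parameters
            new_means.append(means[g])
            new_vars.append(variances[g])
    total = sum(occ)
    new_weights = [o / total for o in occ] if total > 0.0 else list(weights)
    return new_means, new_vars, new_weights, env_post
```

Each Gaussian is shared by several environments, so its statistics are pooled over every subset that contains it, weighted by how plausible that environment is. This pooling is what lets a class containing very different writing styles keep several distributions per state.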

3. The handwritten word recognition model learning device according to claim 2, wherein the word images for learning stored in the word image storage means are word images labeled with correct answers obtained from a video coding system in a mail processing system.

4. A handwritten word recognition apparatus comprising:
image capturing means for capturing an image including a handwritten word on a recording medium;
word extraction means for extracting a word image from the image captured by the image capturing means;
first feature extraction means for extracting features from the word image extracted by the word extraction means;
model storage means for storing a character model for each character, wherein each character model is composed of a model population and an environment-specific character model for each environment; the model population and each environment-specific character model are each composed of a plurality of states; each state of the model population is composed of G Gaussian distributions (G being a natural number of 2 or more); and each state of each environment-specific character model is composed of M Gaussian distributions (M being a natural number of 2 or more, M ≦ G) corresponding to one of the combinations of M distributions selected from the G Gaussian distributions constituting the corresponding state of the model population;
model matching means for performing a matching process between the features extracted by the first feature extraction means and each character model stored in the model storage means, and outputting the result as a recognition result;
word image storage means for storing word images for learning;
second feature extraction means for extracting features from the word images stored by the word image storage means;
first probability calculation means for calculating the posterior probability that the features extracted by the second feature extraction means are generated from each state of each character model in the model storage means;
second probability calculation means for calculating the posterior probability that each environment corresponds to each of the combinations;
third probability calculation means for calculating the posterior probability that the features extracted by the second feature extraction means are generated from each Gaussian distribution, on the condition that each environment corresponds to the respective combination;
Gaussian parameter updating means for calculating Gaussian distribution parameters from the probabilities calculated by the first, second, and third probability calculation means and the features extracted by the second feature extraction means, and updating the Gaussian distribution parameters stored in the model storage means based on the calculation result; and
weight parameter updating means for calculating weight parameters for model learning from the probabilities calculated by the second and third probability calculation means, and updating the model-learning weight parameters of the model population stored in the model storage means based on the calculation result.

5. The handwritten word recognition apparatus according to claim 4, wherein the word images for learning stored in the word image storage means are word images labeled with correct answers obtained from a video coding system in a mail processing system.