JP2011164124A - Acoustic model parameter learning method based on linear classification model and device, method and device for creating finite state converter with phoneme weighting, and program therefor

Info

Publication number: JP2011164124A
Application number: JP2010023141A
Authority: JP
Inventors: Takanobu Oba; 隆伸大庭; Takaaki Hori; 貴明堀
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2010-02-04
Filing date: 2010-02-04
Publication date: 2011-08-25
Anticipated expiration: 2030-02-04
Also published as: JP5385810B2

Abstract

PROBLEM TO BE SOLVED: To express an acoustic model and a language model as the same model.

SOLUTION: The acoustic model parameter learning method includes a model parameter initializing process and a model parameter updating process. In the model parameter initializing process, a model parameter for calculating a recognition score is initialized. In the model parameter updating process, a feature quantity vector is input, and an objective function based on accumulation of an inner product value of the feature quantity vector and the model parameter is given from outside. The model parameter which maximizes the objective function is calculated by updating the initialized model parameter to output a partial model parameter composed of the predetermined number of frames corresponding to each phoneme.

COPYRIGHT: (C)2011, JPO&INPIT

Description

The present invention relates to a method and apparatus for creating an acoustic model with a linear classification model, which has a simpler structure than an HMM (Hidden Markov Model) and can also express an N-gram language model; to a method and apparatus for generating a phoneme-weighted finite state transducer that uses the acoustic model so created to associate speech signals with phonemes; and to programs therefor.

Among the components of a standard speech recognition system, the principal ones are the acoustic model and the language model. An acoustic model associates speech signals with phonemes. HMMs and models derived from them are widely used as acoustic models. An HMM accounts for stretching and compression in time, and has been widely used to model physical phenomena whose duration varies, such as speech signals.

N-gram language models are widely used as language models. Their representation differs from that of acoustic models. FIG. 9 shows an example in which the search space for word sequences is built from an acoustic model (HMM) and a language model (Reference 1). As the figure shows, the search space in speech recognition is represented as a network in which acoustic models are embedded in the phoneme sequences determined by the language model.

Acoustic models and language models are generally trained independently, but in recent years several attempts at global optimization have been made. However, in addition to the complexity of the HMM itself, the parameters of a language model expressed in a different form must be adjusted at the same time, which makes the speech recognition system complicated.

Takaaki Hori, Hajime Tsukada, "State-of-the-Art of Speech Information Processing 3: Speech Recognition by Weighted Finite State Transducers", Information Processing Society of Japan magazine "Information Processing", Vol. 45, No. 10, pp. 1020-1026 (2004.10).
Akio Ando, "Real-time Speech Recognition", The Institute of Electronics, Information and Communication Engineers.
Annette J Dobson, Adrian G Barnett, "An Introduction to Generalized Linear Models", CRC Press. (Japanese edition translated by 田中豊, 森川敏彦, 山内竹春, 富田誠, 「一般化線形モデル入門」, Kyoritsu Shuppan Co., Ltd.)
Koby Crammer, Ofer Dekel, Joseph Keshet, Shai Shalev-Shwartz, Yoram Singer, "Online passive aggressive algorithms", Journal of Machine Learning Research, Vol. 7, pp. 531-583, 2006.
Taku Kudo, Kaoru Yamamoto, Yuji Matsumoto, "Japanese Morphological Analysis Using Conditional Random Fields", Information Processing Society of Japan, Special Interest Group on Natural Language Processing, SIGNL-161, 2004.
Hal Daume III, Daniel Marcu, "Learning as search optimization: Approximate large margin methods for structured prediction", International Conference on Machine Learning, pp. 169-176, 2005.

If the acoustic model and the language model were expressed by the same kind of model, the system as a whole would be easier to optimize and the process could be expected to become simpler. However, conventional linear classification models were not designed to output, as a time series, scores that follow feature quantities input as a time series. Consequently, an acoustic model expressed as a linear classification model could not be used for continuous speech recognition.

The present invention has been made in view of this problem. Its object is to provide an acoustic model learning method and apparatus that can express the acoustic model and the language model in a unified framework, a phoneme-weighted finite state transducer generation method and apparatus that use the acoustic model created by that method to associate speech signals with phonemes, and programs therefor.

The acoustic model parameter learning method of the present invention includes a model parameter initialization process and a model parameter update process. The model parameter initialization process initializes, for each phoneme, partial model parameters consisting of a predetermined number of frames, which are used to obtain a recognition score. The model parameter update process takes feature vectors as input, is given from outside an objective function based on the accumulation of the inner products of the feature vectors and the partial model parameters, obtains the model parameters that maximize the objective function by updating the initialized model parameters, and outputs the partial model parameters corresponding to each phoneme.

The phoneme-weighted finite state transducer generation method of the present invention includes an initial state setting process, an intermediate state setting and arranging process, and a final state setting process. The initial state setting process takes the partial model parameters corresponding to a phoneme as input and sets an initial state that outputs the phoneme and the score corresponding to the first frame. The intermediate state setting and arranging process sets and arranges intermediate states, each of which has a state transition whose score is a function defined as the inner product of a model parameter W_{p,i} constituting the partial model parameters and the input feature vector, and a state transition that outputs score 0 without input. The final state setting process sets a final state that has a self-transition outputting score 0 and that, without input, outputs score 0 and transitions to the end state.

According to the acoustic model parameter learning method of the present invention, the acoustic model can be expressed in the same form as the language model, which makes it easier to optimize the speech recognition system as a whole. According to the phoneme-weighted finite state transducer generation method of the present invention, the acoustic model created by the acoustic model parameter learning method is described in the form of a weighted finite state transducer. An acoustic model in this form enables fast and accurate speech recognition.

FIG. 1: Diagram showing an example functional configuration of the acoustic model parameter learning apparatus 100 of the present invention.
FIG. 2: Diagram showing the operation flow of the acoustic model parameter learning apparatus 100.
FIG. 3: Diagram showing an example functional configuration of the model parameter update unit 14.
FIG. 4: Diagram showing the operation flow of the model parameter update unit 14.
FIG. 5: Diagram showing an example functional configuration of the phoneme-weighted finite state transducer generation apparatus 200 of the present invention.
FIG. 6: Diagram showing the operation flow of the phoneme-weighted finite state transducer generation apparatus 200.
FIG. 7: Diagram showing an example of a state transition model created by the phoneme-weighted finite state transducer generation apparatus 200.
FIG. 8: Diagram showing a simple functional configuration of a speech recognition apparatus 300 that uses the state transition models created by the phoneme-weighted finite state transducer generation apparatus 200.
FIG. 9: Diagram conceptually showing an example of the search space of speech recognition using a conventional acoustic model.

Embodiments of the present invention are described below with reference to the drawings. The same elements in different drawings carry the same reference numerals, and their description is not repeated. Before the embodiments, the global optimization of the acoustic model and the language model in the present invention is explained.

[Global optimization of the acoustic model and the language model]
Let C and L denote the acoustic model and the language model, respectively, and let M_C and N_L denote the functions governed by their model parameters. M_C is the function that returns the acoustic score, and N_L is the function that returns the language score.

To optimize the acoustic model and the language model globally, one finds the C and L that maximize an objective function O(M_C, N_L) prepared in advance. If, for example, the optimal model parameters are searched for with a gradient method that takes the slope of the objective function into account, Equations (1) and (2) must be computed.

If M_C and N_L are entirely different functions, two kinds of models must be trained. This is why a conventional acoustic model cannot be handled in a unified way with the language model.

By contrast, if both the acoustic model and the language model are linear classification models, the functions that compute the acoustic score and the language score during speech recognition can be expressed by Equations (3) and (4).

Here T denotes transposition, A' is the feature vector used for acoustic model training, and A'' is the feature vector used for language model training. Writing A for the feature vector formed by concatenating A' and A'', the speech recognition score is W^T A, where W is the concatenation of C and L.

The objective function that unifies the acoustic model and the language model can therefore be written O(W^T A), so the model parameter learning apparatus can be consolidated into a single unit.
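The unification can be checked numerically. The following is a minimal sketch, assuming Equations (3) and (4) have the form M_C = C^T A' and N_L = L^T A'' that the surrounding text implies; the helper `dot` and all numeric values are hypothetical and serve only as illustration.

```python
def dot(u, v):
    """Inner product of two equal-length sequences."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return sum(a * b for a, b in zip(u, v))

# Hypothetical toy values, for illustration only.
C = [0.5, -1.0]                # acoustic model parameters
L = [2.0, 0.25, 1.0]           # language model parameters
A_acoustic = [1.0, 3.0]        # A'  (acoustic features)
A_language = [0.0, 4.0, -1.0]  # A'' (language features)

acoustic_score = dot(C, A_acoustic)   # assumed form of Equation (3)
language_score = dot(L, A_language)   # assumed form of Equation (4)

W = C + L                      # list concatenation: W stacks C and L
A = A_acoustic + A_language    # A stacks A' and A''
unified_score = dot(W, A)      # W^T A

print(acoustic_score, language_score, unified_score)
```

Because the inner product of concatenated vectors is the sum of the inner products of the parts, a single parameter vector W reproduces the sum of the two scores.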

The present invention divides the feature vector A, which is ordinarily computed across a plurality of time frames, frame by frame, and accumulates the inner product of each divided feature vector and the model parameters frame by frame. In this way, scores corresponding to feature vectors input as a time series can be generated as a time series.

FIG. 1 shows an example functional configuration of the acoustic model parameter learning apparatus 100 of the present invention, and FIG. 2 shows its operation flow. The acoustic model parameter learning apparatus 100 comprises a model parameter initialization unit 12 and a model parameter update unit 14. The functions of these units are realized when a predetermined program is loaded into a computer composed of, for example, a ROM, a RAM, and a CPU, and the CPU executes that program.

The model parameter initialization unit 12 initializes the model parameters used to obtain a recognition score (step S12). The model parameter update unit 14 takes the feature vectors X_{t+1}^{t+n} as input, obtains the model parameters that maximize the objective function O_W by updating the model parameters initialized by the model parameter initialization unit 12, and outputs, for each phoneme p, partial model parameters consisting of a predetermined number of frames (step S14). The correct typesetting of X_{t+1}^{t+n} is the one shown in FIG. 1. X_{t+1}^{t+n} consists of the feature vectors obtained in each time frame from t+1 to t+n, whereas the feature vector obtained in the single frame at time t is written X_t.

The feature vectors X_{t+1}^{t+n} are input from a correct-answer database 10, which stores the feature vectors X_{t+1}^{t+n} of the correct phoneme p. The objective function O_W is a function based on the accumulation of the inner products of the feature vectors X_{t+1}^{t+n} and the model parameters, and is given from outside. With this configuration, the acoustic model parameter learning apparatus 100 generates an acoustic model in the same form of expression as the language model.

[Linear classification model]
A linear classification model is a classification model based on an element function G_W that satisfies the constraint shown in Equation (5).

Possible element functions G_W include, for example, Equations (6) and (7).

The objective function O_W given to the model parameter update unit 14 is an accumulation of element functions G_W. A linear classification model means that recognition is performed by the expression on the right-hand side of Equation (5), while the model is trained with the objective function O_W. Because the objective function O_W may be nonlinear in the model parameters W, a more accurate model can be expected than when training uses the linear inner product W_p^T A directly.

In the present invention, the accumulation of the inner products of the feature vectors and the model parameters (Equation (8)) is computed and used as the score of phoneme p.

Here n is the number of time frames to be considered. If the input phoneme signal extends beyond frame t+n, the frames after t+n are ignored, so the predetermined number n is set sufficiently large, for example to the maximum found in the training data. In Equation (8), if the input speech signal ends before frame t+n, the feature vectors X_{t+m+1}^{t+n} after the end of the input are treated as zero vectors, where t+m (< t+n) is the last frame of the input signal.
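The accumulation and the handling of short and long inputs can be sketched as follows. This is a minimal illustration, assuming Equation (8) is the sum over i = 1..n of W_{p,i}^T X_{t+i} as the text describes; the function names, the two-phoneme model table, and all numbers are hypothetical.

```python
def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

def frame_scores(W_p, frames):
    """Running per-frame scores for one phoneme.

    W_p    : list of n weight vectors W_{p,1} .. W_{p,n}
    frames : list of feature vectors X_{t+1}, X_{t+2}, ... (any length)
    Frames beyond n are ignored. Missing frames behave like zero
    vectors: they would add 0, so they are simply skipped.
    """
    total = 0.0
    running = []
    for w, x in zip(W_p, frames):   # zip stops at the shorter sequence
        total += dot(w, x)
        running.append(total)
    return running

def phoneme_score(W_p, frames):
    """Accumulated score of one phoneme over the whole segment."""
    scores = frame_scores(W_p, frames)
    return scores[-1] if scores else 0.0

def recognize(models, frames):
    """Pick the phoneme whose accumulated score is highest."""
    return max(models, key=lambda p: phoneme_score(models[p], frames))

# Hypothetical models with n = 3 frames and 2-dimensional features.
models = {
    "a": [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    "i": [[-1.0, 0.0], [0.0, -1.0], [0.5, 0.5]],
}
short_input = [[2.0, 1.0], [0.0, 3.0]]                          # m = 2 < n
long_input = [[2.0, 1.0], [0.0, 3.0], [1.0, 1.0], [9.0, 9.0]]   # 4 > n

print(frame_scores(models["a"], short_input))
print(phoneme_score(models["a"], long_input))
print(recognize(models, short_input))
```

Returning the running totals, not only the final sum, mirrors the point of the invention: a score is available after every incoming frame.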

The full set of model parameters to be estimated by training is expressed by Equation (9).

The predetermined number n is set to a value such as 20.

FIG. 3 shows a more concrete example of the functional configuration of the model parameter update unit 14, whose operation is now described in more detail. FIG. 4 shows its operation flow. The unit comprises a gradient calculation means 140, a gradient evaluation means 142, and a parameter update means 144. The gradient calculation means 140 computes the gradient of the objective function O_W by supplying the feature vectors X_{t+1}^{t+n} and the model parameters updated by the parameter update means 144 to a partial derivative function given from outside (step S140).

The gradient evaluation means 142 instructs the parameter update means 144 to update the parameters until the gradient of the objective function O_W has increased monotonically to an extreme value, obtains the gradient of the objective function O_W computed with the current parameters, and repeats this operation until the objective function O_W is judged to have converged (the loop over the "No" branch of steps S142 to S145). The parameter update means 144 updates the parameters according to the control signal from the gradient evaluation means 142 (step S144). When the gradient reaches the extreme value (that is, when convergence has been judged), the parameters at that point are output as the model parameters (step S146).
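The compute, evaluate, update loop of FIG. 4 can be sketched as plain gradient ascent. This is an illustration under stated assumptions, not the patented procedure itself: the convergence test (largest gradient component below a tolerance), the fixed step size, and the toy concave objective are all hypothetical choices, and the step labels in the comments are only a loose correspondence.

```python
def maximize(objective_grad, w_init, step=0.1, tol=1e-8, max_iter=10_000):
    """Gradient ascent on an externally supplied gradient function.

    objective_grad : function mapping parameters w to the gradient of
                     the objective at w (the "partial derivative
                     function given from outside")
    w_init         : initialized model parameters
    """
    w = list(w_init)
    for _ in range(max_iter):
        g = objective_grad(w)                          # compute gradient (cf. S140)
        if max(abs(gi) for gi in g) < tol:             # converged? (cf. S142, S145)
            break
        w = [wi + step * gi for wi, gi in zip(w, g)]   # update (cf. S144)
    return w                                           # output parameters (cf. S146)

# Toy objective O(w) = -sum((w_i - target_i)^2), maximized at w = target.
target = [1.0, -2.0]

def toy_grad(w):
    return [-2.0 * (wi - ti) for wi, ti in zip(w, target)]

w_opt = maximize(toy_grad, [0.0, 0.0])
print(w_opt)
```

Keeping the gradient function as an argument reflects the text's statement that the partial derivative function is supplied from outside, so the same loop can serve different objective functions.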

As the partial derivative function, for example, the function shown in Equation (10) is used, which is obtained by partially differentiating the objective function O_W with respect to each element of W.

[Phoneme-weighted finite state transducer generation apparatus]
FIG. 5 shows an example functional configuration of the phoneme-weighted finite state transducer generation apparatus 200 of the present invention, and FIG. 6 shows its operation flow. The phoneme-weighted finite state transducer generation apparatus 200 comprises an initial state setting unit 20, an intermediate state setting array unit 22, a final state setting unit 24, and a control unit 26.

A weighted finite state transducer is an extension of the finite automaton, widely known as a model of a state transition machine, to the conversion of input and output sequences. A concrete example is given later. The phoneme-weighted finite state transducer generation apparatus 200 takes as input, from the model parameters output by the acoustic model parameter learning apparatus 100, the partial model parameters {W_{p,i} | 1 ≦ i ≦ n} corresponding to a phoneme p (hereinafter abbreviated {W_{p,i}}), and generates the phoneme-weighted finite state transducer for that phoneme p.

The control unit 26 stores the input partial model parameters {W_{p,i}} in a memory or the like (not shown), and then initializes the loop variable i and related variables (i = 1) (step S260).

The initial state setting unit 20 sets a state transition that, when a feature vector is input, takes as its output weight a function defined as the inner product of the first model parameter W_{p,1} of the partial model parameters {W_{p,i}} and the input feature vector, and takes the phoneme p as its output symbol (step S20). The control unit 26 then updates i (step S261).

For each i, the intermediate state setting array unit 22 sets a state transition whose output weight is a function defined as the inner product of the model parameter W_{p,i} constituting the partial model parameters {W_{p,i}} and the input feature vector, and a state transition that has no output signal (its output symbol is ε, the symbol meaning that nothing is output). The former leads to the next state and the latter to the end state. This processing is repeated until i reaches the predetermined number n+1 (the loop over the "No" branch of steps S22 to S262). As a result, the states corresponding to the n model parameters W_{p,i}, counting the initial state, are arranged in sequence.

The final state setting unit 24 sets, on the (n+1)-th state, a self-transition that outputs weight 0 when a feature vector is input and that has no output signal (step S24).

FIG. 7 shows an example of a weighted finite state transducer generated by the above process. Circles in the figure denote states; state 1 is the initial state, and the state drawn as a double circle is the end state. κ is the symbol for a transition taken when a feature vector is given as input to the transducer, μ is the symbol for a transition that is always taken, and ε is the symbol meaning that nothing is output.

With this weighted finite state transducer, a score is computed for every frame through the inner product of each element of the feature vector sequence and the corresponding model parameter.
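The construction and use of such a transducer can be sketched in a few lines. This is one reading of the description of FIG. 7, not a reproduction of the figure: the arc tuple layout, the label names `FRAME` (κ), `ALWAYS` (μ), and `EPS` (ε), the state numbering, and the toy weights are all hypothetical.

```python
EPS = "<eps>"        # output symbol: emit nothing (epsilon)
FRAME = "<frame>"    # input label: consume one feature vector (kappa)
ALWAYS = "<always>"  # input label: taken without input (mu)

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def build_phoneme_wfst(phoneme, W_p):
    """Return (arcs, start_state, end_state) for one phoneme.

    arcs maps a state to a list of
    (input_label, output_symbol, weight_vector_or_None, next_state).
    A weight vector w gives the arc score dot(w, x); None gives score 0.
    States 1..n carry W_{p,1}..W_{p,n}; state n+1 is the last state.
    """
    n = len(W_p)
    end = n + 2
    arcs = {}
    for i, w in enumerate(W_p, start=1):
        out = phoneme if i == 1 else EPS           # only the first arc emits p
        arcs[i] = [(FRAME, out, w, i + 1)]
        if i > 1:
            arcs[i].append((ALWAYS, EPS, None, end))  # input ended early
    arcs[n + 1] = [(FRAME, EPS, None, n + 1),      # extra frames add score 0
                   (ALWAYS, EPS, None, end)]
    arcs[end] = []
    return arcs, 1, end

def run(wfst, frames):
    """Feed frames through the transducer; return (output, score, accepted)."""
    arcs, state, end = wfst
    score, output = 0.0, []
    for x in frames:
        _, out, w, nxt = next(a for a in arcs[state] if a[0] == FRAME)
        score += dot(w, x) if w is not None else 0.0
        if out != EPS:
            output.append(out)
        state = nxt
    for label, _, _, nxt in arcs[state]:           # input exhausted
        if label == ALWAYS:
            state = nxt
            break
    return output, score, state == end

# Hypothetical partial model parameters for phoneme "a" (n = 3).
wfst_a = build_phoneme_wfst("a", [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
print(run(wfst_a, [[2.0, 1.0], [0.0, 3.0]]))
print(run(wfst_a, [[2.0, 1.0], [0.0, 3.0], [1.0, 1.0], [9.0, 9.0]]))
```

The zero-score exit arcs play the role of the zero-vector padding for short inputs, and the zero-score self-loop on the last state plays the role of ignoring frames beyond n.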

A speech recognition apparatus can be built with the weighted finite state transducers generated by the phoneme-weighted finite state transducer generation apparatus 200. FIG. 8 shows a simple example of the functional configuration of a speech recognition apparatus 300 that uses weighted finite state transducers.

The speech recognition apparatus 300 comprises a WFST database 32 and a speech recognition unit 30. The WFST database 32 stores the weighted finite state transducers of a plurality of phonemes. The speech recognition unit 30 takes the feature vectors X_t to X_{t+n} as input, applies them to the weighted finite state transducers to obtain scores, and performs speech recognition.

[Confirmation experiment]
To confirm the usefulness of the acoustic model based on the linear classification model of the present invention, its performance was compared with that of a conventional HMM acoustic model. The experiment used isolated phoneme recognition, the task of determining only the phoneme labels when the phoneme boundaries are given.

The training data were 150 academic lectures from the Corpus of Spontaneous Japanese (CSJ). The model parameters of the acoustic model of the present invention were estimated with the PA (Passive Aggressive) algorithm, a fast online margin-maximizing learning method. The evaluation data were 10 academic lectures. The feature vectors were the common 39-dimensional set: 12-dimensional MFCC plus log power, together with their Δ and ΔΔ.
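For orientation, a single update of a multiclass Passive Aggressive learner can be sketched as below. The source names only the PA algorithm; the specific variant (basic PA without a slack term), the margin of 1, the choice of the highest-scoring wrong phoneme as rival, and the flattening of the n per-frame vectors into one feature vector are assumptions made for this illustration.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def pa_update(weights, x, gold):
    """One multiclass Passive-Aggressive step (basic PA, no slack term).

    weights : dict phoneme -> weight vector (entries are replaced)
    x       : feature vector, e.g. the n per-frame vectors concatenated
              and zero-padded
    gold    : correct phoneme label
    Returns the hinge loss suffered before the update.
    """
    rival = max((p for p in weights if p != gold),
                key=lambda p: dot(weights[p], x))
    margin = dot(weights[gold], x) - dot(weights[rival], x)
    loss = max(0.0, 1.0 - margin)
    sq_norm = dot(x, x)
    if loss > 0.0 and sq_norm > 0.0:
        tau = loss / (2.0 * sq_norm)   # two weight vectors move, hence 2
        weights[gold] = [w + tau * xi for w, xi in zip(weights[gold], x)]
        weights[rival] = [w - tau * xi for w, xi in zip(weights[rival], x)]
    return loss

# Hypothetical two-phoneme model with 2-dimensional features.
weights = {"a": [0.0, 0.0], "i": [0.0, 0.0]}
first_loss = pa_update(weights, [1.0, 2.0], "a")
second_loss = pa_update(weights, [1.0, 2.0], "a")
print(first_loss, second_loss, weights)
```

The update is "passive" when the margin is already at least 1 (nothing changes) and "aggressive" otherwise, moving the weights just far enough that the same example would then satisfy the margin.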

The accuracy of the conventional HMM acoustic model was 60.1%, against 59.9% with the acoustic model of the present invention, so nearly equal accuracy was obtained. This result confirms that the acoustic model of the present invention has the capability to replace the conventional HMM acoustic model.

When the processing means of the above apparatuses are realized by a computer, the processing content of the functions each apparatus should have is described by a program. Executing this program on the computer realizes the processing means of each apparatus on the computer.

The program describing this processing content can be recorded on a computer-readable recording medium. Any computer-readable recording medium may be used, for example a magnetic recording device, an optical disc, a magneto-optical recording medium, or a semiconductor memory. Specifically, for example, a hard disk drive, a flexible disk, or a magnetic tape can be used as the magnetic recording device; a DVD (Digital Versatile Disc), DVD-RAM (Random Access Memory), CD-ROM (Compact Disc Read Only Memory), or CD-R (Recordable)/RW (ReWritable) as the optical disc; an MO (Magneto Optical disc) as the magneto-optical recording medium; and an EEPROM (Electrically Erasable Programmable Read-Only Memory) as the semiconductor memory.

The program is distributed, for example, by selling, transferring, or lending a portable recording medium such as a DVD or CD-ROM on which the program is recorded. The program may also be distributed by storing it in a storage device of a server computer and transferring it from the server computer to other computers over a network.

The functional units of each apparatus may be configured by executing a predetermined program on a computer, or at least part of the processing content may be realized in hardware.

Claims

An acoustic model parameter learning method comprising:
a model parameter initialization process of initializing partial model parameters, each consisting of a predetermined number of frames and corresponding to a phoneme, for obtaining a recognition score; and
a model parameter update process of taking phoneme feature vectors as input, being given from outside an objective function based on the accumulation of the inner products of the feature vectors and the model parameters, obtaining the model parameters that maximize the objective function by updating the initialized model parameters, and outputting the partial model parameters corresponding to each phoneme.

The acoustic model parameter learning method according to claim 1,
wherein the model parameter update process comprises:
a gradient calculation step of calculating the gradient of the objective function from the feature vectors and the model parameters being updated;
a gradient evaluation step of evaluating the gradient of the objective function until it has increased monotonically to an extreme value, and outputting partial model parameters consisting of a predetermined number of frames corresponding to each phoneme; and
a parameter update step of updating the model parameters based on the evaluation result of the gradient evaluation step.

A phoneme-weighted finite state transducer generation method comprising:
an initial state setting process in which an initial state setting unit takes as input partial model parameters consisting of a predetermined number of frames corresponding to a phoneme, and sets an initial state that outputs the phoneme and the score corresponding to the first frame;
an intermediate state setting and arranging process in which an intermediate state setting array unit sets and arranges intermediate states, each having a state transition whose score is a function defined as the inner product of a model parameter W_{p,i} constituting the partial model parameters and the input feature vector, and a state transition that outputs score 0 without input; and
a final state setting process in which a final state setting unit sets a final state that has a self-transition outputting score 0 and that, without input, outputs score 0 and transitions to an end state.

An acoustic model parameter learning apparatus comprising:
a model parameter initialization unit that initializes partial model parameters, each consisting of a predetermined number of frames and corresponding to a phoneme, for obtaining a recognition score; and
a model parameter update unit that takes phoneme feature vectors as input, is given from outside an objective function based on the accumulation of the inner products of the feature vectors and the model parameters, calculates the model parameters that maximize the objective function, and outputs the partial model parameters corresponding to each phoneme.

The acoustic model parameter learning apparatus according to claim 4,
wherein the model parameter update unit comprises:
a gradient calculation means for calculating the gradient of the objective function from the feature vectors and the model parameters being updated;
a gradient evaluation means for evaluating the gradient of the objective function until it has increased monotonically to an extreme value, and outputting partial model parameters consisting of a predetermined number of frames corresponding to each phoneme; and
a parameter update means for updating the model parameters based on the evaluation result of the gradient evaluation means.

A phoneme-weighted finite state transducer generation apparatus that takes as input partial model parameters consisting of a predetermined number of frames corresponding to a phoneme, comprising:
an initial state setting unit that sets an initial state which outputs the phoneme and the score corresponding to the first frame;
an intermediate state setting array unit that sets and arranges intermediate states, each having a state transition whose score is a function defined as the inner product of a model parameter W_{p,i} constituting the partial model parameters and the input feature vector, and a state transition that outputs score 0 without input; and
a final state setting unit that sets a final state which has a self-transition outputting score 0 and which, without input, outputs score 0 and transitions to an end state.

A program for causing a computer to execute the functions of the units of the acoustic model parameter learning apparatus or of the phoneme-weighted finite state transducer generation apparatus according to any one of claims 4 to 7.