JP2009063773A

JP2009063773A - Speech feature learning device and speech recognition device, and method, program and recording medium thereof

Info

Publication number: JP2009063773A
Application number: JP2007230795A
Authority: JP
Inventors: Yasuhiro Minami; 泰浩南
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2007-09-05
Filing date: 2007-09-05
Publication date: 2009-03-26
Anticipated expiration: 2027-09-05
Also published as: JP4901657B2

Abstract

PROBLEM TO BE SOLVED: To improve recognition accuracy even when the number of distributions in a Gaussian mixture distribution increases. SOLUTION: A static feature quantity and a dynamic feature quantity are extracted from input speech (2, S202). A trajectory x_t is synthesized (104, S204) using a linear dynamic system and dictionary data stored in a dictionary database. The trajectory x_t is substituted into a maximization function stored in a maximization function database 112, and a likelihood function P(Y_{1:T}) is generated (108, S206). The likelihood function P(Y_{1:T}) is approximated to generate an approximate likelihood function P'(Y_{1:T}) (1164, S207). An approximation error e' is calculated by approximating the error e between the approximate likelihood function P'(Y_{1:T}) and the likelihood function P(Y_{1:T}). The approximate likelihood function P'(Y_{1:T}) and the approximation error e' are added to generate a corrected likelihood function P''(Y_{1:T}), from which the likelihood is calculated. COPYRIGHT: (C)2009, JPO&INPIT

Description

The present invention relates to, for example, a speech feature learning device and a speech recognition device that use a hidden Markov model, and to their methods, programs, and recording media.

Many speech recognition techniques use acoustic models, a typical example being the HMM (Hidden Markov Model). In conventional HMM-based speech recognition, the trajectory used as the reference when calculating the likelihood (the time series of HMM mean values) is not smooth at HMM state transitions, which degrades recognition accuracy. The methods of Non-Patent Document 1 (Prior Art 1), Non-Patent Document 2 (Prior Art 2), and Non-Patent Document 3 (Prior Art 3) use the relationship between the static and dynamic feature quantities of the input speech to synthesize a smooth trajectory and thereby improve recognition accuracy. However, because these methods assume that each HMM state has only one Gaussian distribution, they are difficult to extend to Gaussian-mixture HMMs, in which each state has multiple Gaussian distributions. Non-Patent Document 4 (Prior Art 4) addresses Gaussian-mixture HMMs. Its method uses the Viterbi algorithm with the HMM to find the state sequence and distribution sequence that maximize the likelihood of the HMM for the input speech, synthesizes a trajectory for that state sequence and distribution sequence, and calculates the likelihood between the input speech and the trajectory. Because this method obtains the trajectory of only one distribution sequence, recognition accuracy degrades as the number of distributions in the Gaussian mixture increases. Patent Document 1 (Prior Art 5) therefore describes a method that uses a switching linear dynamic system to realize trajectory synthesis for Gaussian-mixture HMMs. The speech feature learning device and speech recognition device described in Patent Document 1 are briefly described below.

FIG. 1 shows an example functional configuration of the speech feature learning device 5 of Prior Art 5, FIG. 2 shows an example functional configuration of the speech recognition device 11 of Prior Art 5, and FIG. 3 shows the configurations of the speech feature learning device 5 and the speech recognition device 11 combined. First, the learning mode of the speech feature learning device 5 is described with reference to FIGS. 1 and 3. Among the arrows between components in FIG. 3, broken arrows indicate the flow of information in the learning mode, and solid arrows indicate the flow of information in the recognition mode.

First, the feature quantity extraction unit 2 extracts static and dynamic feature quantities from the training speech. Here, dynamic feature quantities are parameters (Δy, ΔΔy) that represent the temporal change of the static feature quantity (y), such as its first-derivative component (Δ component) and second-derivative component (ΔΔ component). The acoustic model learning unit 4 learns an acoustic model from the static and dynamic feature quantities. Specifically, the acoustic model learning unit 4 calculates, for example, the mean and variance of the static feature quantity as the feature pattern and, as dynamic feature quantities, the mean Δu_SK and variance Δσ_SK^2 of the first derivative of the static feature quantity and the mean ΔΔu_SK and variance ΔΔσ_SK^2 of its second derivative. The HMM database 6 stores, for each Gaussian distribution in each HMM state, the mean M_SK = [u_SK, Δu_SK, ΔΔu_SK] and the variance [σ_SK^2, Δσ_SK^2, ΔΔσ_SK^2]. When these are stored, the number of HMM states and the number of mixture components (the number of Gaussian distributions whose sum represents the mixture in each state) are also determined. Here, S is the HMM state number and K is the Gaussian distribution number.

The model conversion unit 8 converts the acoustic model into a switching linear dynamic system using the relationship between the static and dynamic feature quantities. This relationship is stored in the feature quantity relational expression database 10. Here, a linear dynamic system consists of a state equation and an observation equation such as the following.

$$M_{SK} = C X^{t} + W_{SK}$$
$$X^{t+1} = A X^{t} + N_{t} \qquad (1)$$

As noted above, S is the HMM state number and K is the Gaussian distribution number. X^t collects five frames of the trajectory time series and can be written X^t = [x_{t+2} x_{t+1} x_t x_{t-1} x_{t-2}]^Q, where x_t is the trajectory at time t and "^Q" denotes transposition. The matrices C, A, M, and N are described in detail in "Best Mode for Carrying Out the Invention". As equation (1) shows, in a Gaussian mixture distribution in which one state consists of S states, each Gaussian mixture distribution is converted one at a time into a state equation and an observation equation, creating S linear dynamic systems. A switching linear dynamic system is a model that has multiple state and observation equations and in which the active linear dynamic system switches over time. The trajectory synthesis unit 14 obtains a trajectory with a switching Kalman filter, using the switching linear dynamic system stored in the switching linear dynamic system database 12 and the static and dynamic feature quantities from the feature quantity extraction unit 2. In this processing, the switching Kalman filter first runs an individual Kalman filter for each pair of state and observation equations of the form shown in equation (1) in the switching linear dynamic system database.

This is illustrated in FIG. 4, which shows an example in which the maximum value of K in the switching linear dynamic system is 2, that is, the system has two sets of state and observation equations. Let t denote time. In FIG. 4, the Gaussian mixture distribution of the trajectory up to time t-1 is assumed to have been obtained already; at this point the trajectory distribution is the sum of two Gaussian distributions, whose means and variances are written (mean, variance) = (x^1_{t-1|t-1}, V^1_{t-1|t-1}) and (x^2_{t-1|t-1}, V^2_{t-1|t-1}). In forward processing, the trajectory distribution at t is calculated from each Gaussian distribution at t-1. This is done by running the Kalman filters 1051 and 1052 on each Gaussian distribution at t-1 with the state and observation equations for K = 1 and K = 2, respectively. As FIG. 4 shows, this operation produces four trajectories from two. Repeating it for T time steps, however, would ultimately require 2^T trajectories. To prevent this, the switching filter uses the integrators 1053 and 1054 to merge the four trajectories back into two. Performing this operation step by step yields a trajectory close to the training speech. Finally, backward processing is performed to smooth the trajectory. Specifically, weights are set so that the likelihood of the training speech y, Δy, ΔΔy, calculated from the means u^ij, Δu^ij, ΔΔu^ij and variances Σ^ij, ΔΣ^ij, ΔΔΣ^ij of each distribution in each HMM state, becomes high, and these weights are applied. In this way, a trajectory close to the training speech is finally obtained. Introducing the trajectory requires a new variance to be calculated; the variance calculation unit 20 calculates it with the EM algorithm and stores it in the variance database 22.
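The text does not say how the integrators 1053 and 1054 merge trajectories. The sketch below assumes moment matching, a common way to collapse a Gaussian mixture into a single Gaussian in switching Kalman filters, and restricts it to one dimension. The function name `merge_gaussians` and the example numbers are hypothetical. It illustrates why merging averages the hypotheses: two trajectories that disagree collapse to a mean that lies between them, with an inflated variance.

```python
def merge_gaussians(weights, means, variances):
    """Collapse weighted 1-D Gaussians into one Gaussian by moment matching.

    Assumption: this is one common merging rule, not necessarily the rule
    used by the integrators described in the text.
    """
    total = sum(weights)
    w = [wi / total for wi in weights]  # normalise the weights
    mean = sum(wi * mi for wi, mi in zip(w, means))
    # Merged variance = within-component variance + spread of the means.
    var = sum(wi * (vi + (mi - mean) ** 2)
              for wi, mi, vi in zip(w, means, variances))
    return mean, var


# Two equally weighted hypotheses that disagree about the trajectory value.
merged_mean, merged_var = merge_gaussians([0.5, 0.5], [-1.0, 1.0], [0.1, 0.1])
print(merged_mean, merged_var)
```

In this example the merged mean matches neither hypothesis, which is the averaging effect the later discussion of Prior Art 5 identifies as a cause of degraded accuracy.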

Next, the recognition mode of Prior Art 5 is described with reference to FIGS. 2 and 3. The feature quantity extraction unit 100 extracts the static feature quantity (y) and dynamic feature quantities (Δy, ΔΔy) of the input speech. The trajectory synthesis unit 14 refers to the switching linear dynamic system stored in the switching linear dynamic system database 105 and generates trajectories from the phoneme, word, or sentence candidates stored in the dictionary. Multiple trajectories are synthesized through the forward and backward processing of the switching Kalman filter. If necessary, the model conversion unit 8 converts the acoustic model in the HMM database 6 into a switching linear dynamic system. If the linear dynamic system has already been obtained, the HMM database 6 and the model conversion unit 8 may be omitted.

The likelihood calculation unit 18 refers to the variance values stored in the variance database 22 and, each time a trajectory is produced, obtains the likelihood between that trajectory and the input speech. The phoneme sequence with the highest likelihood is taken as the speech recognition result. In this way, Prior Art 5 replaces the likelihood-maximization formulation of trajectory synthesis used in earlier trajectory-based methods with a formulation based on a switching linear dynamic system, with a view to extension to Gaussian mixture distributions.

Non-Patent Document 1: Y. Minami, E. McDermott, A. Nakamura, and S. Katagiri, "A recognition method using synthesis-scoring that incorporates direct relations between static and dynamic feature vector time series," Workshop for Consistent & Reliable Acoustic Cues for Sound Analysis, 2001.
Non-Patent Document 2: Y. Minami, E. McDermott, A. Nakamura, and S. Katagiri, "A recognition method with parametric trajectory synthesized using direct relations between static and dynamic feature vector time series," Proc. ICASSP, pp. 957-960, 2002.
Non-Patent Document 3: Y. Minami, E. McDermott, A. Nakamura, and S. Katagiri, "A Theoretical Analysis of Speech Recognition based on Feature Trajectory Models," in Proc. ICSLP, vol. I, 2004.
Non-Patent Document 4: Y. Minami, E. McDermott, A. Nakamura, and S. Katagiri, "A recognition method with parametric generated from mixture distribution HMMs," Proc. ICASSP, pp. 124-127, 2003.
Patent Document 1: JP 2006-171020 A

In the processing of the trajectory synthesis unit 14 of Prior Art 5, the integrators 1053 and 1054 merge the trajectories as shown in FIG. 4, which averages them and so degrades recognition accuracy. Prior Art 4, as described above, obtains the trajectory of only one distribution sequence, so recognition accuracy degrades as the number of distributions in the Gaussian mixture increases. The present invention is intended to solve the problem of Prior Art 4. That is, an object of the present invention is to provide a speech feature learning device, a speech recognition device, methods and programs therefor, and a recording medium therefor, in which recognition accuracy does not degrade even when the number of distributions in the Gaussian mixture increases.

The speech feature learning device of the present invention includes a feature quantity extraction unit, an acoustic model learning unit, an HMM database, a model conversion unit, a linear dynamic system database, a maximum likelihood calculation unit, a trajectory synthesis unit, a likelihood function maximization unit, and a maximization function database. The feature quantity extraction unit extracts static and dynamic feature quantities from the training speech. The acoustic model learning unit learns an acoustic model from the static and dynamic feature quantities. The HMM database stores the acoustic model. The model conversion unit converts the acoustic model into a linear dynamic system using the relationship between the static and dynamic feature quantities. The linear dynamic system database stores the linear dynamic system. The maximum likelihood calculation unit uses the acoustic model stored in the HMM database to find the state sequence and distribution sequence of the acoustic model that maximize the likelihood of the training speech for that acoustic model. The trajectory synthesis unit synthesizes a trajectory using the linear dynamic system, the state sequence, and the distribution sequence. The likelihood function maximization unit obtains a maximization function that maximizes a likelihood function composed of the static and dynamic feature quantities, the trajectory, and the variance determined by the state sequence and distribution sequence. The maximization function database stores the maximization function.

The speech recognition device of the present invention includes a feature quantity extraction unit, an HMM database, a linear dynamic system database, a trajectory synthesis unit, a likelihood function generation unit, and a likelihood calculation unit. The likelihood function calculation unit includes correction means for correcting the likelihood. The feature quantity extraction unit extracts static and dynamic feature quantities from the input speech. The HMM database stores the acoustic model. The linear dynamic system database stores the linear dynamic system. The trajectory synthesis unit synthesizes trajectories using the linear dynamic system stored in the linear dynamic system database and the dictionary data stored in the dictionary database. The likelihood function generation unit substitutes the trajectory into the maximization function stored in the maximization function database and generates either a likelihood function that is the probability density of the input speech given the trajectory, or a likelihood function that takes the maximum likelihood value given the trajectory. The likelihood calculation unit calculates the likelihood by substituting the static and dynamic feature quantities into the likelihood function.

With the above configuration, even when the number of distributions in the Gaussian mixture increases, recognition accuracy can be improved while smooth trajectories are still generated at HMM state transitions.

The best mode for carrying out the invention is described below. Components with the same function and steps that perform the same processing are given the same reference numbers, and duplicate description is omitted. FIG. 5 shows an example functional configuration of the speech feature learning device 200, FIG. 6 shows an example functional configuration of the speech recognition device 300, and FIG. 7 shows the configurations of the speech feature learning device 200 and the speech recognition device 300 combined. FIG. 8 shows the main processing flow of the speech feature learning device 200, and FIG. 9 shows the main processing flow of the speech recognition device 300. First, the learning mode of the speech feature learning device 200 is described with reference to FIGS. 5, 7, and 8.

[Learning mode]
First, when training speech is input, the feature quantity extraction unit 2 extracts static and dynamic feature quantities (step S102). Here, dynamic feature quantities are parameters (Δy, ΔΔy) that represent the temporal change of the static feature quantity (y), such as its first-derivative component (Δ component) and second-derivative component (ΔΔ component).
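As a rough illustration of step S102, the sketch below derives Δ and ΔΔ components from a one-dimensional static feature sequence. It assumes a simple central difference with clamped edges, and a ΔΔ computed as the Δ of the Δ sequence. The text does not specify the differentiation rule, so the function names `delta` and `extract_features` and the rule itself are hypothetical choices.

```python
def delta(seq):
    """Central-difference approximation of the time derivative.

    Assumption: edges are handled by clamping the index, so the first and
    last values use a one-sided difference scaled by 1/2.
    """
    n = len(seq)
    out = []
    for t in range(n):
        prev = seq[max(t - 1, 0)]
        nxt = seq[min(t + 1, n - 1)]
        out.append((nxt - prev) / 2.0)
    return out


def extract_features(y):
    """Return one (y_t, delta y_t, delta-delta y_t) tuple per frame."""
    dy = delta(y)
    ddy = delta(dy)  # second derivative taken as the delta of the delta
    return list(zip(y, dy, ddy))


# Static feature y_t = t^2, whose slope is 2t and curvature is 2.
static = [0.0, 1.0, 4.0, 9.0, 16.0, 25.0]
features = extract_features(static)
print(features[2])
print(features[3])
```

Away from the edges, the interior frames recover the slope and curvature of the quadratic exactly; the edge frames are distorted by the clamping.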

The acoustic model learning unit 4 learns an acoustic model from the static and dynamic feature quantities (step S104). Specifically, the acoustic model learning unit 4 calculates, for example, the mean and variance of the static feature quantity as the feature pattern and, as dynamic feature quantities, the mean Δu_SK and variance Δσ_SK^2 of the first derivative of the static feature quantity and the mean ΔΔu_SK and variance ΔΔσ_SK^2 of its second derivative. The HMM database 6 stores, for each Gaussian distribution in each HMM state, the mean M_SK = [u_SK, Δu_SK, ΔΔu_SK] and the variance [σ_SK^2, Δσ_SK^2, ΔΔσ_SK^2] (step S106). When these are stored, the number of HMM states and the number of mixture components (the number of Gaussian distributions whose sum represents the mixture in each state) are also determined. Here, S is the HMM state number and K is the Gaussian distribution number.

The model conversion unit 8 converts the acoustic model into a linear dynamic system using the relationship between the static and dynamic feature quantities (step S108). This relationship is stored in advance in the feature quantity relational expression database 10. An example of a linear dynamic system is described below.

Each state and each distribution of the acoustic model is preferably converted into a state equation and an observation equation of the following form, because this reduces the amount of computation needed for trajectory synthesis by the trajectory synthesis unit 104 described below. These state and observation equations are the same as equation (1) above.

$$M_{SK} = C X^{t} + W_{SK}$$
$$X^{t+1} = A X^{t} + N_{t} \qquad (1)$$

Here, t denotes time, and X^t collects five frames of the trajectory time series and can be written X^t = [x_{t+2} x_{t+1} x_t x_{t-1} x_{t-2}]^Q, where x_t is the trajectory at time t and "^Q" denotes transposition. For the matrices A and C, the following matrices are used here.

[Equation image: matrices A and C]

Here, θ is a large positive value.

W_SK is a random variable that follows a Gaussian distribution with mean [0 0 0]^Q and variance Σ_SK = diag[σ_SK^2 Δσ_SK^2 ΔΔσ_SK^2], where diag is a function that creates a diagonal matrix with the bracketed values as its diagonal elements. The HMM mean values are substituted into M_SK as follows.

$$M_{SK} = [\,u_{SK} \;\; \Delta u_{SK} \;\; \Delta\Delta u_{SK}\,]^{Q}$$

The matrix A expresses x_{t+2} = x_{t+2}, x_{t+1} = x_{t+1}, x_t = x_t, and x_{t-1} = x_{t-1}; that is, in this operation nothing changes other than through the effect of noise. The matrix A is not limited to this form. Any matrix that approximately realizes the first and second derivatives can be used as the matrix C. The number of columns of C is the number of frames defined for X^t, and the number of rows need only equal the derivative order of the dynamic feature quantities plus one (2 + 1 in this example). The first row of C computes the static feature quantity, the second row the Δ feature quantity, and the third row the ΔΔ feature quantity.
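The actual matrices A and C appear only as an equation image that is not reproduced in this text, so the sketch below uses hypothetical stand-ins that merely satisfy the stated constraints. `C` has three rows (static, Δ, ΔΔ) and five columns (one per frame of X^t) and uses central differences. `A` shifts the window by one frame and leaves the newest element to the noise term N_t. Both matrices and the helper `matvec` are assumptions for illustration, not the patent's values.

```python
def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector (list)."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


# Window layout: X^t = [x_{t+2}, x_{t+1}, x_t, x_{t-1}, x_{t-2}].
# Hypothetical C: row 1 picks x_t, row 2 is (x_{t+1} - x_{t-1}) / 2,
# row 3 is x_{t+1} - 2 x_t + x_{t-1}.
C = [
    [0.0, 0.0, 1.0, 0.0, 0.0],
    [0.0, 0.5, 0.0, -0.5, 0.0],
    [0.0, 1.0, -2.0, 1.0, 0.0],
]

# Hypothetical A: a pure shift. Rows 2 to 5 copy x_{t+2} .. x_{t-1} into
# their new positions; row 1 (the unseen x_{t+3}) is left at zero and would
# be driven entirely by the noise term N_t.
A = [
    [0.0, 0.0, 0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 1.0, 0.0],
]

# Trajectory x_s = s^2 around t = 3, i.e. [x_5, x_4, x_3, x_2, x_1].
X_t = [25.0, 16.0, 9.0, 4.0, 1.0]
observation_mean = matvec(C, X_t)  # noise-free part of M_SK = C X^t + W_SK
X_next = matvec(A, X_t)            # noise-free part of X^{t+1} = A X^t + N_t
print(observation_mean)
print(X_next)
```

With these stand-ins, `C` recovers the value, slope, and curvature of the quadratic at t = 3, and `A` moves each retained frame one position along the window.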

The maximum likelihood calculation unit 106 uses the acoustic model stored in the HMM database 6 to find the state sequence and distribution sequence of the acoustic model that maximize the likelihood of the training speech for that acoustic model (step S110). The Viterbi algorithm, for example, can be used for this maximization.
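Step S110 can be sketched as a log-domain Viterbi search that returns both the best state sequence and, for each frame, the best Gaussian within the chosen state. The sketch assumes one-dimensional observations and a toy two-state model. The function names, the model parameters, and the observation values are all hypothetical, and the patent's own implementation is not specified beyond naming the Viterbi algorithm.

```python
import math

NEG_INF = float("-inf")


def log_gauss(y, mean, var):
    """Log density of a 1-D Gaussian."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (y - mean) ** 2 / var)


def viterbi_states_and_components(obs, log_init, log_trans, mixtures):
    """Best joint (state, component) path.

    mixtures[s] is a list of (log_weight, mean, var) tuples for state s.
    """
    n_states = len(mixtures)

    def best_component(s, y):
        scores = [lw + log_gauss(y, m, v) for lw, m, v in mixtures[s]]
        k = max(range(len(scores)), key=scores.__getitem__)
        return scores[k], k

    score, comp, back = [], [], []
    for t, y in enumerate(obs):
        emis = [best_component(s, y) for s in range(n_states)]
        comp.append([k for _, k in emis])
        if t == 0:
            score.append([log_init[s] + emis[s][0] for s in range(n_states)])
            back.append([None] * n_states)
            continue
        row, ptr = [], []
        for s in range(n_states):
            prev = max(range(n_states),
                       key=lambda p: score[t - 1][p] + log_trans[p][s])
            row.append(score[t - 1][prev] + log_trans[prev][s] + emis[s][0])
            ptr.append(prev)
        score.append(row)
        back.append(ptr)

    # Backtrack from the best final state.
    states = [max(range(n_states), key=score[-1].__getitem__)]
    for t in range(len(obs) - 1, 0, -1):
        states.append(back[t][states[-1]])
    states.reverse()
    components = [comp[t][s] for t, s in enumerate(states)]
    return states, components


# Toy left-to-right model: 2 states, 2 Gaussians per state.
half = math.log(0.5)
mixtures = [
    [(half, 0.0, 1.0), (half, 2.0, 1.0)],
    [(half, 8.0, 1.0), (half, 10.0, 1.0)],
]
log_init = [0.0, NEG_INF]
log_trans = [[half, half], [NEG_INF, 0.0]]
states, components = viterbi_states_and_components(
    [0.1, 2.2, 7.9, 10.3], log_init, log_trans, mixtures)
print(states, components)
```

Choosing the best component independently per frame is valid here because, given the state, the component choice at each frame does not affect the other frames.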

The trajectory synthesis unit 104 synthesizes the trajectory x_t using the linear dynamic system (state and observation equations) from the linear dynamic system database 102 and the state sequence and distribution sequence from the maximum likelihood calculation unit 106 (step S112). Specifically, the state sequence and distribution sequence from the maximum likelihood calculation unit 106 are substituted into the state and observation equations, and the resulting equations are solved for x_t.

The likelihood function maximization unit 110 then obtains a maximization function that maximizes a likelihood function composed of the static and dynamic feature quantities, the trajectory, and the variance determined by the state sequence and distribution sequence (step S114). In detail, for the trajectory x_t, the likelihood function P(y_t) at time t can be expressed by the following equation (2).

[Equation image: equation (2)]

This likelihood function P(y_t) is used when the trajectory x_t was generated from the linear dynamic model whose acoustic-model state and distribution are S_t and K_t. The likelihood function is switched according to which linear dynamic model generated the trajectory, and a maximization function is learned that maximizes the product of these likelihood functions over time t. Because running the EM algorithm to learn this function is computationally expensive, an approximate calculation is used instead, based on the state sequence and distribution sequence from the maximum likelihood calculation unit 106. The trajectories of all training data are calculated from this state sequence and distribution sequence, and a maximization function that maximizes the likelihood function P(y_t) is obtained. The maximization function is then stored in the maximization function database 112 (step S116). In the maximization function, y_t, x_t, Δy_t, Δx_t, ΔΔy_t, and ΔΔx_t are variables, and

[Equation image: constant terms]

are constants.

[Recognition mode]
Next, the recognition mode of the speech recognition device 300 is described with reference to FIGS. 6, 7, and 9. The feature quantity extraction unit 2 extracts the static feature quantity y_t and the dynamic feature quantities (Δy_t, ΔΔy_t) of the input speech (step S202). The trajectory synthesis unit 104 refers to the linear dynamic system from the linear dynamic system database 102 and synthesizes all possible trajectories from the phoneme, word, or sentence candidates stored in the dictionary database 16 (step S204). If necessary, the model conversion unit 8 may convert the acoustic model stored in the HMM database 6 into a linear dynamic system and store it in the linear dynamic system database. If the linear dynamic system database 102 has already been obtained, the model conversion unit 8 may be omitted.

The likelihood function generation unit 108 substitutes the trajectory from the trajectory synthesis unit 104 into the maximization function stored in the maximization function database 112, and generates either the likelihood function P_1(Y_{1:T}), which is the probability density of the input speech given the trajectory, or the likelihood function P_2(Y_{1:T}), which takes the maximum value of the likelihood over the trajectories (step S206). P_1(Y_{1:T}) and P_2(Y_{1:T}) can be defined, for example, as follows.
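Equations (3) and (3') are not reproduced in this text. One form consistent with the symbol definitions given for them (a sum or a maximum over all state and distribution sequences) is sketched below. The exact expressions, the placement of "^" marks, and the assignment of the labels (3) and (3') to P_1 and P_2 are assumptions.

```latex
\begin{align}
P_1(Y_{1:T}) &= \sum_{S_{1:T},\,K_{1:T}}
  P\bigl(Y_{1:T} \mid X_{1:T}(S_{1:T},K_{1:T})\bigr)\,P(S_{1:T},K_{1:T}) \tag{3}\\
P_2(Y_{1:T}) &= \max_{S_{1:T},\,K_{1:T}}
  P\bigl(Y_{1:T} \mid X_{1:T}(S_{1:T},K_{1:T})\bigr)\,P(S_{1:T},K_{1:T}) \tag{3'}
\end{align}
```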

Here, max A denotes the maximum value of A; P(A|B) is the conditional probability of A given B; 1:T denotes 1, ..., T; S_{1:T} is the state sequence of the HMM; K_{1:T} is the distribution sequence of the HMM; X_{1:T}(S_{1:T}, K_{1:T}) is the trajectory sequence synthesized from this state sequence and distribution sequence; and P(S_{1:T}, K_{1:T}) is the state transition probability and distribution selection probability evaluated over the state and distribution sequences, for which the same values as in the HMM are used. Symbols with "^" are values for the trajectory, and symbols without "^" are values for the HMM. In equations rendered as images, "^" is placed directly above the symbol, whereas in equations written as text it is placed to the upper right of the symbol; the two notations are equivalent. Y_t and Y_{1:T} are defined by the following equations.
Y_t = [y_t, Δy_t, ΔΔy_t]^Q
Y_{1:T} = Y_1, Y_2, ..., Y_T

In the following description, P_1(Y_{1:T}) and P_2(Y_{1:T}) are collectively referred to as P(Y_{1:T}). When the number of HMM states is small (for example, one state), the likelihood calculation unit 114 calculates the likelihood by substituting the static and dynamic features from the feature extraction unit 2 into the likelihood function of equation (3) or (3'). The model that maximizes the likelihood is taken as the final recognition result.
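The difference between summing over every sequence (P_1) and keeping only the best one (P_2) can be illustrated with a toy enumeration. This is a minimal sketch under strong assumptions, not the patent's procedure: features are scalar, each state has a single Gaussian, and the "trajectory" of a sequence is simply its per-frame state means rather than a trajectory synthesized by the linear dynamic system. The function name `exhaustive_likelihoods` and all numbers are hypothetical.

```python
import itertools
import math


def gauss(y, mean, var):
    """Density of a scalar Gaussian N(mean, var) at y."""
    return math.exp(-0.5 * (y - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def exhaustive_likelihoods(obs, means, var, init, trans):
    """Return (sum over all state sequences, max over all state sequences).

    Assumption: the trajectory of a sequence is its per-frame state mean.
    """
    n_states = len(means)
    total, best = 0.0, 0.0
    for seq in itertools.product(range(n_states), repeat=len(obs)):
        # Prior P(S_{1:T}) from initial and transition probabilities.
        like = init[seq[0]]
        for a, b in zip(seq, seq[1:]):
            like *= trans[a][b]
        # Likelihood of the observations given the stand-in trajectory.
        for y, s in zip(obs, seq):
            like *= gauss(y, means[s], var)
        total += like
        best = max(best, like)
    return total, best


obs = [0.1, 0.9, 1.1]
p1, p2 = exhaustive_likelihoods(
    obs, means=[0.0, 1.0], var=0.25,
    init=[0.6, 0.4], trans=[[0.7, 0.3], [0.2, 0.8]],
)
print(p1, p2)
```

The loop visits n_states ** T sequences, which is why this direct evaluation is only practical for very small models.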

[Modification 1]
Next, the speech recognition apparatus 300-1, which is Modification 1 of the speech recognition apparatus 300, is described. When the number of HMM states is large, equation (3) requires generating every possible trajectory and comparing it with the input speech, which is not feasible.

Therefore, in the speech recognition apparatus 300-1, the likelihood calculation unit 114 includes approximating means 1164 for approximating the likelihood function. The likelihood function P(Y_{1:T}) is input to the approximating means 1164 and approximated to obtain the approximate likelihood function P'(Y_{1:T}) (step S207). The approximation method is as follows. First, the state sequence S^_{1:T} and the distribution sequence K^_{1:T} that maximize the likelihood of the acoustic model are obtained in advance by the Viterbi algorithm of the acoustic model. The way of obtaining the state sequence S^_{1:T} and the distribution sequence K^_{1:T} can be expressed by the following equation (4).
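The Viterbi search for the best state sequence and distribution sequence can be sketched as follows. This is a minimal illustration, not the patent's equation (4): it assumes scalar features, works in the log domain, and chooses the best mixture component per state and frame before running the usual dynamic programming over states. The function name and the toy parameters are hypothetical.

```python
import math


def log_gauss(y, mean, var):
    """Log density of a scalar Gaussian N(mean, var) at y."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (y - mean) ** 2 / var)


def viterbi_states_and_mixtures(obs, log_init, log_trans, weights, means, variances):
    """Return (state sequence, mixture sequence, best log score)."""
    n = len(log_init)

    def best_component(y, s):
        # Best mixture component of state s for observation y.
        scores = [math.log(w) + log_gauss(y, m, v)
                  for w, m, v in zip(weights[s], means[s], variances[s])]
        k = max(range(len(scores)), key=scores.__getitem__)
        return k, scores[k]

    comp = [[best_component(y, s) for s in range(n)] for y in obs]
    delta = [log_init[s] + comp[0][s][1] for s in range(n)]
    back = []
    for t in range(1, len(obs)):
        new_delta, ptr = [], []
        for s in range(n):
            prev = max(range(n), key=lambda r: delta[r] + log_trans[r][s])
            ptr.append(prev)
            new_delta.append(delta[prev] + log_trans[prev][s] + comp[t][s][1])
        delta = new_delta
        back.append(ptr)
    # Backtrack from the best final state.
    s = max(range(n), key=delta.__getitem__)
    best_score = delta[s]
    states = [s]
    for ptr in reversed(back):
        s = ptr[s]
        states.append(s)
    states.reverse()
    mixtures = [comp[t][st][0] for t, st in enumerate(states)]
    return states, mixtures, best_score


obs = [0.0, 0.6, 2.0, 2.6]
states, mixtures, score = viterbi_states_and_mixtures(
    obs,
    log_init=[math.log(0.9), math.log(0.1)],
    log_trans=[[math.log(0.8), math.log(0.2)], [math.log(0.1), math.log(0.9)]],
    weights=[[0.5, 0.5], [0.5, 0.5]],
    means=[[0.0, 0.5], [2.0, 2.5]],
    variances=[[0.1, 0.1], [0.1, 0.1]],
)
print(states, mixtures)
```

Selecting the component inside each state first keeps the search over states only, which is a common simplification for mixture HMMs.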

The trajectory x^_{1:T} in equation (5) is preferably obtained by implementing a Kalman filter from the state equation and observation equation of equation (1) above, using the linear dynamic model, because this reduces the amount of computation. Using this trajectory x^_{1:T}, the approximate likelihood function P'(Y_{1:T}) is obtained as in equation (6) below.
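How a Kalman filter turns a piecewise-constant mean sequence into a smooth trajectory can be sketched in one dimension. This is a generic scalar filter under stated assumptions, not the state and observation equations of equation (1): the hidden trajectory is modeled as a random walk with process variance `q`, and the mean and variance of the Gaussian selected at each frame are treated as the observation and its noise. The names `kalman_trajectory`, `q`, `x0`, and `p0` are hypothetical.

```python
def kalman_trajectory(means, variances, q=0.05, x0=0.0, p0=1.0):
    """Filter a sequence of per-frame Gaussian means into a smooth trajectory.

    Assumed model: x_t = x_{t-1} + w_t with Var(w_t) = q, and the frame mean
    is observed as x_t plus noise with the frame variance.
    """
    x, p = x0, p0
    trajectory = []
    for m, r in zip(means, variances):
        p = p + q              # predict: random-walk state, variance grows
        k = p / (p + r)        # Kalman gain
        x = x + k * (m - x)    # update with the frame mean as observation
        p = (1.0 - k) * p      # posterior variance
        trajectory.append(x)
    return trajectory


# A step in the means (a state transition) becomes a gradual rise.
traj = kalman_trajectory([0.0, 0.0, 0.0, 1.0, 1.0, 1.0], [0.1] * 6)
print(traj)
```

Each frame costs a constant number of operations, so the whole trajectory is obtained in time linear in the number of frames.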

The likelihood calculation unit 114 then calculates the likelihood by substituting the static and dynamic features from the feature extraction unit 2 into the approximate likelihood function P'(Y_{1:T}). The model that maximizes the likelihood is taken as the final recognition result.

This approximate likelihood function P'(Y_{1:T}) is almost the same as the one described in Non-Patent Document 4, but differs from Prior Art 4 in that a Kalman filter based on the state equation and observation equation is used to obtain the trajectory of equation (5).

[Modification 2]
Next, the speech recognition apparatus 300-2, which is Modification 2 of the speech recognition apparatus 300, is described. As shown below, the approximate likelihood function P'(Y_{1:T}) ignores the likelihood of every trajectory other than the one for the maximum-likelihood state sequence and distribution sequence of the HMM, so high-accuracy recognition is not possible. The likelihood calculation unit 114 is therefore further provided with correction means 116. An example of the functional configuration of the correction means 116 and related parts is shown in FIG. 10. The correction means 116 consists of error approximating means 1168 and adding means 1170. The error approximating means 1168 generates an approximate error e' by approximating, with the likelihood of the HMM, the error e between the approximate likelihood function P'(Y_{1:T}) and the likelihood function P(Y_{1:T}). The error e, defined as e = |P(Y_{1:T}) − P'(Y_{1:T})|, can be expressed by the following equation (7).

However, the error e cannot be obtained directly. The error approximating means 1168 therefore approximates the error e using the likelihood of the HMM and generates the approximate error e'. The approximate error e' is given by the following equation (8).

The expression in the second line of equation (8) is obtained using, for example, the Viterbi algorithm of the HMM.

The adding means 1170 adds the approximate likelihood function P'(Y_{1:T}) from the approximating means 1164 and the approximate error e' to obtain the corrected likelihood function P''(Y_{1:T}) (step S208). That is, the following equation (9) is computed.

The adding means 1170 outputs the corrected likelihood function P''(Y_{1:T}). The likelihood calculation unit 114 then uses the static and dynamic features from the feature extraction unit 2 to obtain the likelihood, that is, the value of the corrected likelihood function P''(Y_{1:T}) of equation (9) (step S210), and takes the model that maximizes the likelihood as the final recognition result.

In this way, the correction means 116 corrects the likelihood function P(Y_{1:T}) to obtain the corrected likelihood function P''(Y_{1:T}). As a result, the likelihood of trajectories other than the one for the maximum-likelihood state sequence and distribution sequence of the HMM is also taken into account. The speech recognition apparatus 300 of this embodiment can therefore achieve higher recognition accuracy than Prior Art 4 and 5 even when the number of Gaussian mixture components increases.
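The correction step, adding an HMM-based error estimate to the single-path likelihood, can be sketched as follows. Equation (8) is not reproduced in this text, so the stand-in used here is an assumption: e' is taken as the HMM likelihood summed over all state sequences (forward algorithm) minus the HMM likelihood of the best sequence (Viterbi). The function name, the toy model, and the placeholder value of P' are hypothetical.

```python
import math


def gauss(y, mean, var):
    """Density of a scalar Gaussian N(mean, var) at y."""
    return math.exp(-0.5 * (y - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def forward_and_viterbi(obs, init, trans, means, var):
    """Return (likelihood over all state sequences, likelihood of the best one)."""
    n = len(init)
    alpha = [init[s] * gauss(obs[0], means[s], var) for s in range(n)]
    delta = list(alpha)
    for y in obs[1:]:
        alpha = [sum(alpha[r] * trans[r][s] for r in range(n)) * gauss(y, means[s], var)
                 for s in range(n)]
        delta = [max(delta[r] * trans[r][s] for r in range(n)) * gauss(y, means[s], var)
                 for s in range(n)]
    return sum(alpha), max(delta)


obs = [0.1, 0.4, 0.9, 1.1]
total, best = forward_and_viterbi(
    obs, init=[0.6, 0.4], trans=[[0.7, 0.3], [0.2, 0.8]], means=[0.0, 1.0], var=0.25,
)

# Assumed stand-in for equation (8): likelihood mass outside the best path.
e_approx = total - best

# Hypothetical value of the approximate likelihood P' for the same input.
p_approx = 1.0e-3

# Equation (9) as described in the text: P'' = P' + e'.
p_corrected = p_approx + e_approx
print(e_approx, p_corrected)
```

With more mixture components per state, more sequences share the probability mass, so the term that the single-path approximation drops tends to grow.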

[Experimental results]
Next, experimental results comparing the speech recognition apparatus of the present invention, the speech recognition apparatus of Prior Art 4, and a speech recognition apparatus using an HMM are described. The experiments performed speaker-independent, task-independent recognition. The sampling rate was 16 kHz, the frame shift was 10 ms, and context-dependent HMMs with two or three Gaussian distributions per state were trained. All male speakers in the JNAS data were used as learning data, amounting to 20,078 sentences. As evaluation data, utterances of 100 city names by 75 male speakers, analyzed under the same conditions as the learning data, were used. The total number of utterances is 7,198. In a recognition experiment under a single condition, performance may improve by chance because of error. To rule this out, several experiments were run with different numbers of HMM states. FIG. 11 shows the word error rate when the number of Gaussian mixture components is 2, and FIG. 12 shows the word error rate when it is 3. With 2 mixture components, three HMM sizes were used (2589, 1992, and 1611 states); with 3 mixture components, one size was used (1992 states). FIGS. 11 and 12 show that, for every number of HMM states, the speech recognition apparatus of the present invention achieves higher recognition accuracy than both the speech recognition apparatus of Prior Art 4 and the speech recognition apparatus using an HMM.

FIG. 1 shows an example of the functional configuration of a conventional speech feature learning apparatus.
FIG. 2 shows an example of the functional configuration of a conventional speech recognition apparatus.
FIG. 3 shows the conventional speech feature learning apparatus and speech recognition apparatus combined.
FIG. 4 shows the conventional flow for obtaining a trajectory.
FIG. 5 shows an example of the functional configuration of the speech feature learning apparatus of the embodiment.
FIG. 6 shows an example of the functional configuration of the speech recognition apparatus of the embodiment.
FIG. 7 shows the speech feature learning apparatus and speech recognition apparatus of the embodiment combined.
FIG. 8 shows the main processing flow of the speech feature learning apparatus of the embodiment.
FIG. 9 shows the main processing flow of the speech recognition apparatus of the embodiment.
FIG. 10 shows an example of the functional configuration of the correction means 116.
FIG. 11 shows the experimental results when the number of Gaussian mixture components is 2.
FIG. 12 shows the experimental results when the number of Gaussian mixture components is 3.

Claims

A feature quantity extraction unit that extracts static feature quantities and dynamic feature quantities from learning speech;
An acoustic model learning unit that learns an acoustic model from the static feature amount and the dynamic feature amount;
An HMM database storing the acoustic model;
A model conversion unit that converts the acoustic model into a linear dynamic system using the relationship between the static feature quantity and the dynamic feature quantity;
A linear dynamic system database storing the linear dynamic system;
A maximum likelihood calculation unit for obtaining a state sequence and a distribution sequence of the acoustic model so as to maximize the likelihood of the learning speech for the acoustic model, using the acoustic model stored in the HMM database;
A trajectory synthesis unit that synthesizes a trajectory using the linear dynamic system, the state series, and the distribution series;
A likelihood function maximization unit that obtains a maximization function that maximizes a likelihood function consisting of the static feature quantity, the dynamic feature quantity, the trajectory, and the variance of the state series and the distribution series;
A maximization function database for storing the maximization function;
A speech feature learning apparatus comprising:

A feature quantity extraction unit that extracts static feature quantities and dynamic feature quantities from input speech;
An HMM database for storing acoustic models;
A linear dynamic system database storing the linear dynamic system;
A trajectory synthesis unit that synthesizes a trajectory using a linear dynamic system stored in the linear dynamic system database and dictionary data stored in the dictionary database;
A likelihood function generation unit that substitutes the trajectory into a maximization function stored in the maximization function database to generate a likelihood function that is the probability density of the input speech with respect to the input of the trajectory, or a likelihood function that takes the maximum value of the likelihood with respect to the input of the trajectory;
A likelihood calculating unit that calculates the likelihood by substituting the static feature amount and the dynamic feature amount into the likelihood function;
A speech recognition apparatus comprising:

The speech recognition device according to claim 2,
wherein the likelihood calculation unit has approximating means for generating an approximate likelihood function by approximating the likelihood function using the likelihood of the acoustic model, and calculates the likelihood by substituting the static feature amount and the dynamic feature amount into the approximate likelihood function.

The speech recognition device according to claim 3,
wherein the likelihood calculation unit further has correction means for generating an approximate error by approximating the error between the approximate likelihood function and the likelihood function using the likelihood of the acoustic model, and for generating a corrected likelihood function by adding the approximate likelihood function and the approximate error, and calculates the likelihood by substituting the static feature amount and the dynamic feature amount into the corrected likelihood function.

The speech recognition apparatus according to any one of claims 2 to 4,
wherein the trajectory synthesis unit synthesizes the trajectory using a state equation and an observation equation.

The speech recognition apparatus according to any one of claims 2 to 5,
wherein the trajectory synthesis unit synthesizes the trajectory using a Kalman filter.

A feature extraction process for extracting static features and dynamic features from learning speech;
An acoustic model learning process in which an acoustic model is learned from the static feature quantity and the dynamic feature quantity and stored in an HMM database;
A model conversion process in which, using the relationship between the static feature quantity and the dynamic feature quantity, the acoustic model is converted into a linear dynamic system and stored in a linear dynamic system database;
A maximum likelihood calculation process for obtaining a state sequence and a distribution sequence of the acoustic model so as to maximize the likelihood of the learning speech for the acoustic model using the acoustic model stored in the HMM database;
A trajectory synthesis process for synthesizing a trajectory using the linear dynamic system, the state sequence, and the distribution sequence;
A likelihood function maximization process in which a maximization function that maximizes a likelihood function consisting of the static feature quantity, the dynamic feature quantity, the trajectory, and the variance of the state series and the distribution series is obtained and stored in a maximization function database;
A speech feature learning method characterized by comprising:

A feature extraction process for extracting static and dynamic features from input speech;
A trajectory synthesis process for synthesizing a trajectory using a linear dynamic system stored in the linear dynamic system database and dictionary data stored in the dictionary database;
A likelihood function generation process of substituting the trajectory into the maximization function stored in the maximization function database to generate a likelihood function that is the probability density of the input speech with respect to the input of the trajectory, or a likelihood function that takes the maximum value of the likelihood with respect to the input of the trajectory;
A speech recognition method comprising: a likelihood calculation step of calculating a likelihood by substituting the static feature amount and the dynamic feature amount into the likelihood function.

A speech feature learning program and a speech recognition program for causing a computer to execute each process of the speech feature learning device and the speech recognition device according to claim 1.

A computer-readable recording medium on which the speech feature learning program and the speech recognition program according to claim 9 are recorded.