JP5647159B2

JP5647159B2 - Prior distribution calculation device, speech recognition device, prior distribution calculation method, speech recognition method, program

Info

Publication number: JP5647159B2
Application number: JP2012041441A
Authority: JP
Inventors: ソンジュンハム; 小川　厚徳; 藤本　雅清; 堀　貴明; 中村　篤
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2012-02-28
Filing date: 2012-02-28
Publication date: 2014-12-24
Anticipated expiration: 2032-02-28
Also published as: JP2013178343A

Description

The present invention relates to a prior distribution calculation device, a prior distribution calculation method, and a program that generate a prior distribution used in common by the feature space and the acoustic model space, and to a speech recognition device, a speech recognition method, and a program that use this prior distribution.

Adaptation techniques have been developed to counter the adverse effects of the various sources of variability (for example, speaker, noise, communication channel, and microphone) that affect the input signal in speech recognition. Model-based adaptation in particular is widely used, because a linear transformation by a transformation matrix can adapt all parameters of the acoustic model.

Two model-based adaptation techniques of the linear-transformation type are well known: Unconstrained Maximum Likelihood Linear Regression (UMLLR; hereinafter MLLR) (Non-Patent Document 1) and Constrained Maximum Likelihood Linear Regression (CMLLR) (Non-Patent Document 2). The former adapts in the model space and the latter in the feature space. Because CMLLR can be expressed as a transformation in the feature space, it is also called feature-space MLLR (fMLLR, feature-space maximum likelihood linear regression). This technique is particularly effective for Speaker Adaptive Training (SAT) (Non-Patent Document 3) and has the advantage of reducing memory usage and computation.

However, transformation matrix estimation methods that use no prior distribution, such as the MLLR described above, cannot produce reliable estimates when little adaptation data is available. The recognition rate then drops, and in some cases recognition fails altogether. Methods that use a prior distribution have been proposed to solve this problem.

Representative methods that use a prior distribution are Maximum A Posteriori Linear Regression (MAPLR) (Non-Patent Document 4), Structural MAPLR (SMAPLR) (Non-Patent Document 5), and feature-space MAPLR (fMAPLR) (Non-Patent Document 6). MAPLR and SMAPLR adapt in the acoustic model space, and fMAPLR adapts in the feature space. In each method, the prior distribution is the distribution of the transformation matrices of the speakers contained in the training data, estimated with that method.

Conventional SAT based on MLLR requires a large amount of memory and computation. The reason is that transformation matrices are estimated for the nodes that the adaptation data selects from the tree structure commonly used in MLLR. The amount of training data is generally far larger than the amount of test data, so many nodes are selected from the tree and the number of transformation matrices to estimate per speaker increases. Furthermore, because MLLR uses different transformation matrices for the means and the variances, it needs twice the computation and memory of CMLLR.

Non-Patent Document 1: Leggetter, C. and Woodland, P.C. Maximum likelihood linear regression for speaker adaptation of continuous density hidden Markov models. Computer Speech and Language, 9(2):171-185, 1995.

Non-Patent Document 2: Gales, M.J.F. Maximum likelihood linear transformations for HMM-based speech recognition. Computer Speech and Language, 12:75-98, 1998.

Non-Patent Document 3: Anastasakos, T., McDonough, J. and Makhoul, J. Speaker adaptive training: A maximum likelihood approach to speaker normalization. Proc. of ICASSP, pages 1043-1046, 1997.

Non-Patent Document 4: Siohan, O., Chesta, C. and Lee, C.H. Joint maximum a posteriori adaptation of transformation and HMM parameters. IEEE Trans. on Speech and Audio Processing, 9(4):417-428, 2001.

Non-Patent Document 5: Siohan, O., Myrvoll, T.A. and Lee, C.H. Structural maximum a posteriori linear regression for fast HMM adaptation. Computer Speech & Language, 16(1):5-24, 2002.

Non-Patent Document 6: Lei, X., Hamaker, J. and He, X. Robust feature space adaptation for telephony speech recognition. Proc. of INTERSPEECH, pages 773-776, 2006.

Performing speaker adaptation by combining adaptation in the model space with adaptation in the feature space improves recognition performance over adaptation in only the model space or only the feature space when a large amount of adaptation data is available. Even with this combination, however, reliable estimation is not possible with little adaptation data unless a prior distribution is used. On the other hand, if a prior distribution is used for both the model-space and the feature-space adaptation, a separate prior distribution must be computed for each, which increases the amount of computation.

An object of the present invention is therefore to provide a prior distribution calculation device, a speech recognition device, a prior distribution calculation method, a speech recognition method, and a program that generate a prior distribution usable in common for both model-space and feature-space adaptation and thereby reduce the computation required for the prior distribution.

The prior distribution calculation device of the present invention includes a feature vector extraction unit, a first transformation matrix estimation unit, a feature vector transformation unit, an MLE acoustic model training unit, a second transformation matrix estimation unit, and a prior distribution calculation unit.

The feature vector extraction unit extracts a feature vector for each speaker from the input speech of a plurality of speakers. The first transformation matrix estimation unit estimates a first transformation matrix for each speaker by feature-space maximum likelihood linear regression, using the feature vectors and an initial acoustic model trained in advance on the data of all speakers. The feature vector transformation unit transforms each speaker's feature vectors using that speaker's first transformation matrix. The MLE acoustic model training unit trains an acoustic model by the maximum likelihood method using the feature vectors transformed by the feature vector transformation unit. The second transformation matrix estimation unit estimates a second transformation matrix for each speaker by feature-space maximum likelihood linear regression, using the feature vectors extracted by the feature vector extraction unit and the acoustic model trained by the MLE acoustic model training unit. The prior distribution calculation unit computes a multivariate normal distribution of matrices from the second transformation matrices and, taking this distribution as the prior distribution, outputs the hyperparameters of the prior distribution.

The prior distribution calculation device of the present invention generates a prior distribution that can be used in common for both model-space and feature-space adaptation, which reduces the computation required for the prior distribution.

FIG. 1 is a block diagram showing the configuration of the prior distribution calculation device of Embodiment 1.
FIG. 2 is a flowchart showing the operation of the prior distribution calculation device of Embodiment 1.
FIG. 3 is a block diagram showing the configuration of the speech recognition device of Embodiment 2.
FIG. 4 is a flowchart showing the operation of the speech recognition device of Embodiment 2.
FIG. 5 is a block diagram showing the configuration of the speech recognition device of Modification 1.
FIG. 6 is a flowchart showing the operation of the speech recognition device of Modification 1.

Embodiments of the present invention are described in detail below. Components having the same function are given the same reference numeral, and duplicate description is omitted.

The prior distribution calculation device of Embodiment 1 is described in detail below with reference to FIGS. 1 and 2. FIG. 1 is a block diagram showing the configuration of the prior distribution calculation device 1 of this embodiment. FIG. 2 is a flowchart showing its operation. The prior distribution calculation device 1 includes a feature vector extraction unit 10, a first transformation matrix estimation unit 20, a feature vector transformation unit 30, an MLE acoustic model training unit 40, a normalized acoustic model storage unit 50, a second transformation matrix estimation unit 60, a prior distribution calculation unit 70, and an initial acoustic model storage unit 80. The first transformation matrix estimation unit 20 includes statistic G calculation means 21, statistic k calculation means 22, transformation matrix estimation means 23, and iterative training means 24. The MLE acoustic model training unit 40 includes mean update means 41 and variance update means 42. The prior distribution calculation unit 70 includes parameter C calculation means 71 and parameter V calculation means 72. The initial acoustic model storage unit 80 stores in advance, as the initial acoustic model, an acoustic model trained on the data of all speakers.

The processing is first outlined in three sections. The specific processing of each component is described at the end of the corresponding section.

<1. Estimation of the transformation matrix (processing of the first transformation matrix estimation unit 20)>
The prior distribution calculation device 1 of this embodiment estimates a transformation matrix (first transformation matrix) for each speaker by fMLLR (feature-space maximum likelihood linear regression), based on an initial acoustic model trained in advance on the data of all speakers and on the feature vectors obtained from each speaker's input speech. First, the feature vector extracted from the input speech is denoted o(t), the N-dimensional feature vector of the t-th frame. The device transforms this feature vector o(t) into the feature vector o(t) hat using a transformation matrix. The transformed feature vector o(t) hat is as follows.
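The equation itself is not reproduced in this text. A form consistent with the unit transformation matrix [0 I] that the description uses later, and with the usual fMLLR notation, is sketched below. The symbols A (the N×N linear part) and b (the bias), and the bias-first layout of the extended vector ξ(t), are assumptions.

```latex
\hat{o}(t) = A\,o(t) + b = W\,\xi(t), \qquad
W = \begin{bmatrix} b & A \end{bmatrix} \in \mathbb{R}^{N \times (N+1)}, \qquad
\xi(t) = \begin{bmatrix} 1 \\ o(t) \end{bmatrix}
```

With W = [0 I] this gives an o(t) hat equal to o(t), so the features are left unchanged.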

The Q function for estimating the transformation matrix is defined as follows.

The optimization of W (the estimation of α) is described in detail in Non-Patent Document 2. The i-th row of the transformation matrix W is obtained as follows.

The i-th-dimension statistics G^(i) and k^(i) used to estimate the transformation matrix are computed from the extended feature vector ξ(t) of the input speech and from the i-th-dimension mean μ_i^(u) and variance σ_i^(u) of the u-th Gaussian of the mixture, as in the following equations.
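The equations are not reproduced in this text. Assuming diagonal-covariance Gaussians, the forms given in Non-Patent Document 2 can be written as below. The symbols γ_u(t) (posterior probability of Gaussian u at frame t), β (total occupancy) and p_i (the extended cofactor row vector of the linear part of W, with a zero in the bias position) are notation assumed here rather than taken from the patent, and the variance is written explicitly as σ squared.

```latex
G^{(i)} = \sum_{u} \frac{1}{\sigma_i^{(u)2}} \sum_{t} \gamma_u(t)\,\xi(t)\,\xi(t)^{\mathsf T},
\qquad
k^{(i)} = \sum_{u} \frac{\mu_i^{(u)}}{\sigma_i^{(u)2}} \sum_{t} \gamma_u(t)\,\xi(t)^{\mathsf T}

w_i = \left(\alpha\,p_i + k^{(i)}\right) G^{(i)-1},
\qquad
\alpha^2\, p_i G^{(i)-1} p_i^{\mathsf T} + \alpha\, p_i G^{(i)-1} k^{(i)\mathsf T} - \beta = 0,
\qquad
\beta = \sum_{u}\sum_{t} \gamma_u(t)
```

Of the two roots α, the one that gives the larger value of the auxiliary function is kept.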

After the row-by-row estimation of the transformation matrix, iterative training is performed by the maximum likelihood method using the following equation.

In this embodiment, the first transformation matrix estimation unit 20 executes the processing of Section 1 described above. First, the feature vector extraction unit 10 extracts N-dimensional feature vectors o(t) from the input speech of S speakers (speaker 1, ..., speaker S, where S is an integer of 2 or more) (S10). As noted above, the initial acoustic model storage unit 80 stores in advance, as the initial acoustic model, an acoustic model trained on the data of all speakers. The statistic G calculation means 21 computes the statistic G by Equation (4) (SS21). The statistic k calculation means 22 computes the statistic k by Equation (5) (SS22). The transformation matrix estimation means 23 then estimates the transformation matrix by Equation (3) (SS23). Sub-steps SS21 to SS23 are repeated for every value of the dimension (row) i, which yields the transformation matrix W. Next, the iterative training means 24 performs iterative training of the transformation matrix by the maximum likelihood method using Equation (6) (SS24). Steps S10 and S20 above yield the transformation matrices of speakers 1 to S.
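Sub-steps SS21 to SS24 can be sketched as follows for a single speaker. This is a minimal illustration rather than the patent's implementation: it assumes diagonal-covariance Gaussians, a hard frame-to-Gaussian alignment in place of posteriors, the row update of Non-Patent Document 2, and the layout W = [b A]. All function names, variable names, and the toy data are hypothetical.

```python
import math

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def det(m):
    """Determinant by Laplace expansion (adequate for the tiny matrices here)."""
    if not m:
        return 1.0
    return sum((-1) ** j * m[0][j] * det([r[:j] + r[j + 1:] for r in m[1:]])
               for j in range(len(m)))

def solve(g, rhs):
    """Solve g x = rhs by Gauss-Jordan elimination with partial pivoting."""
    n = len(g)
    aug = [list(g[r]) + [rhs[r]] for r in range(n)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[piv] = aug[piv], aug[c]
        for r in range(n):
            if r != c:
                f = aug[r][c] / aug[c][c]
                aug[r] = [x - f * y for x, y in zip(aug[r], aug[c])]
    return [aug[r][n] / aug[r][r] for r in range(n)]

def estimate_fmllr(frames, means, variances, n_iter=5):
    """frames: list of (o, u); o is a feature vector, u the aligned Gaussian index."""
    n = len(frames[0][0])
    beta = float(len(frames))  # total occupancy under hard alignment
    w = [[0.0] + [float(i == j) for j in range(n)] for i in range(n)]  # unit transform [0 I]
    G = [[[0.0] * (n + 1) for _ in range(n + 1)] for _ in range(n)]
    k = [[0.0] * (n + 1) for _ in range(n)]
    for o, u in frames:  # SS21 / SS22: accumulate G^(i) and k^(i)
        xi = [1.0] + list(o)
        for i in range(n):
            inv_var = 1.0 / variances[u][i]
            for r in range(n + 1):
                k[i][r] += means[u][i] * inv_var * xi[r]
                for c in range(n + 1):
                    G[i][r][c] += inv_var * xi[r] * xi[c]
    for _ in range(n_iter):  # repeated row sweeps (a stand-in for SS24)
        for i in range(n):   # SS23: update row i with the other rows fixed
            A = [row[1:] for row in w]
            cof = [(-1) ** (i + j) * det([r[:j] + r[j + 1:]
                                          for ri, r in enumerate(A) if ri != i])
                   for j in range(n)]
            p = [0.0] + cof  # extended cofactor row vector
            gp, gk = solve(G[i], p), solve(G[i], k[i])
            qa, qb = dot(p, gp), dot(p, gk)
            disc = math.sqrt(qb * qb + 4.0 * qa * beta)
            roots = [(-qb + disc) / (2.0 * qa), (-qb - disc) / (2.0 * qa)]
            alpha = max(roots, key=lambda al: beta * math.log(abs(al * qa + qb))
                        - 0.5 * al * al * qa)
            w[i] = [alpha * x + y for x, y in zip(gp, gk)]
    return w

def log_likelihood(frames, means, variances, w):
    """Log-likelihood of the transformed frames, including the log|A| Jacobian term."""
    total = len(frames) * math.log(abs(det([row[1:] for row in w])))
    for o, u in frames:
        xi = [1.0] + list(o)
        for i, row in enumerate(w):
            d = dot(row, xi) - means[u][i]
            total -= 0.5 * (math.log(2.0 * math.pi * variances[u][i])
                            + d * d / variances[u][i])
    return total

# Toy data: two Gaussians, frames offset from the model means.
means = [[0.0, 0.0], [3.0, 3.0]]
variances = [[1.0, 1.0], [1.0, 1.0]]
frames = [([1.0, 0.5], 0), ([1.5, 0.2], 0), ([0.8, 1.1], 0), ([1.2, 0.9], 0),
          ([4.2, 3.6], 1), ([3.9, 4.1], 1), ([4.5, 3.9], 1), ([4.1, 4.4], 1)]
W = estimate_fmllr(frames, means, variances)
```

Because each row update maximizes the auxiliary function with the other rows held fixed, the log-likelihood under a fixed alignment cannot decrease from one sweep to the next. The sketch needs enough frames for each G^(i) to be invertible.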

<2. Training of the acoustic model (processing of the feature vector transformation unit 30 and the MLE acoustic model training unit 40)>
The prior distribution calculation device 1 of this embodiment transforms each speaker's feature vectors using the transformation matrices of speakers 1 to S, and trains the acoustic model by the maximum likelihood method using the transformed feature vectors (training data). The Q function for acoustic model training (SAT) is defined as follows.

The transformed feature vector o^(s)(t) hat is computed by the following equation.

The processing of Equation (7) differs from conventional maximum likelihood estimation (MLE) only in the feature vectors. That is, o^(s)(t) hat is used in place of the original feature vector o^(s)(t), and training otherwise proceeds as in conventional MLE.

The update equations for the mean and the variance are as follows.

In this embodiment, of the processing of Section 2 described above, the feature vector transformation unit 30 executes the processing of Equation (8) and the MLE acoustic model training unit 40 executes the rest. The feature vector transformation unit 30 transforms the feature vectors generated by the feature vector extraction unit 10 by Equation (8), using the transformation matrices estimated by the first transformation matrix estimation unit 20 (S30). Next, training of the acoustic model by MLE (maximum likelihood estimation) is repeated, updating the means and variances at each iteration, until the likelihood converges. The mean update means 41 computes the means from the transformed feature vectors using Equation (9) (SS41). The variance update means 42 computes the variances from the transformed feature vectors using Equation (10) (SS42). The resulting acoustic model, trained on features transformed with each speaker's transformation matrix and therefore normalized for speaker variation, is stored in the normalized acoustic model storage unit 50 for the subsequent processing (S50).
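Steps S30, SS41 and SS42 can be sketched as one update pass. This is a minimal illustration under assumptions: Equations (8) to (10) are not reproduced in this text, so the sketch uses the standard occupancy-weighted mean and variance, diagonal covariances, and the layout W = [b A]. The function names and the data layout are hypothetical.

```python
def apply_transform(w, o):
    """Return W xi with xi = [1, o], i.e. A o + b for W = [b A]."""
    xi = [1.0] + list(o)
    return [sum(wc * x for wc, x in zip(row, xi)) for row in w]

def sat_mle_update(frames_by_speaker, transforms, n_gauss):
    """One mean/variance update from speaker-normalized features.

    frames_by_speaker[s] is a list of (o, gamma) pairs, where gamma[u] is the
    posterior of Gaussian u for that frame; transforms[s] is speaker s's W.
    Every Gaussian is assumed to receive non-zero occupancy.
    """
    dim = len(transforms[0])
    occ = [0.0] * n_gauss
    s1 = [[0.0] * dim for _ in range(n_gauss)]
    s2 = [[0.0] * dim for _ in range(n_gauss)]
    for s, frames in enumerate(frames_by_speaker):
        for o, gamma in frames:
            o_hat = apply_transform(transforms[s], o)  # S30
            for u in range(n_gauss):
                occ[u] += gamma[u]
                for i in range(dim):
                    s1[u][i] += gamma[u] * o_hat[i]
                    s2[u][i] += gamma[u] * o_hat[i] ** 2
    # SS41: occupancy-weighted means of the transformed features
    means = [[s1[u][i] / occ[u] for i in range(dim)] for u in range(n_gauss)]
    # SS42: occupancy-weighted variances around the new means
    variances = [[s2[u][i] / occ[u] - means[u][i] ** 2 for i in range(dim)]
                 for u in range(n_gauss)]
    return means, variances
```

In a full training loop the posteriors would be recomputed from the updated model and the pass repeated until the likelihood converges, as the text describes.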

<3. Calculation of the prior distribution (processing of the second transformation matrix estimation unit 60 and the prior distribution calculation unit 70)>
Using the acoustic model trained by the MLE acoustic model training unit 40, the prior distribution calculation device 1 of this embodiment obtains a transformation matrix (second transformation matrix) for each speaker by fMLLR, as in Section 1. It then obtains the prior distribution (the mean and variance for each dimension) of the resulting per-speaker transformation matrices (second transformation matrices).

First, for the prior distribution calculation, the same processing as in the first transformation matrix estimation unit 20 (Equations (3) to (6)) is performed again by fMLLR, using the acoustic model trained by the MLE acoustic model training unit 40, to obtain each speaker's transformation matrix (second transformation matrix). The prior distribution is computed using the multivariate normal distribution of matrices described in Non-Patent Document 4. This multivariate normal distribution is defined as follows.
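The density itself is not reproduced in this text. For orientation, the textbook matrix-variate normal density for an N×(N+1) matrix W is shown below. The symbols M (mean matrix), Ω (N×N row covariance) and Ψ ((N+1)×(N+1) column covariance) are generic notation assumed here. How they map onto the patent's hyperparameters C and V cannot be recovered from this text.

```latex
p(W) = (2\pi)^{-\frac{N(N+1)}{2}}\,
       |\Omega|^{-\frac{N+1}{2}}\,
       |\Psi|^{-\frac{N}{2}}\,
       \exp\!\left\{ -\tfrac12\,\operatorname{tr}\!\left[
           \Omega^{-1}\,(W - M)\,\Psi^{-1}\,(W - M)^{\mathsf T}
       \right] \right\}
```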

Here the variance hyperparameter is assumed to be the identity matrix, as in Non-Patent Document 4. With S denoting the total number of speakers, the hyperparameters C and V are obtained from the S transformation matrices by the following equations.

In this embodiment, of the processing of Section 3 described above, the second transformation matrix estimation unit 60 executes the transformation matrix estimation and the prior distribution calculation unit 70 executes the processing of Equation (12). The second transformation matrix estimation unit 60 estimates the transformation matrices (second transformation matrices) based on Equations (3) to (6), using the acoustic model trained by the MLE acoustic model training unit 40. This processing is the same as that of the first transformation matrix estimation unit 20. Next, under the assumption that the second transformation matrices follow the multivariate normal distribution of matrices, the parameter C calculation means 71 computes the hyperparameter C from the second transformation matrices by Equation (12) (SS71). The parameter V calculation means 72 computes the hyperparameter V from the second transformation matrices by Equation (12) (SS72).
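The description states that the prior consists of a mean and a variance for each dimension of the second transformation matrices. A minimal sketch of such element-wise statistics over the S per-speaker matrices follows. The function name, the use of the population variance, and any correspondence between the two outputs and the hyperparameters C and V are assumptions, because Equation (12) is not reproduced in this text.

```python
def prior_statistics(transforms):
    """Element-wise mean and population variance over per-speaker matrices."""
    s = len(transforms)
    rows, cols = len(transforms[0]), len(transforms[0][0])
    mean = [[sum(w[i][j] for w in transforms) / s for j in range(cols)]
            for i in range(rows)]
    var = [[sum((w[i][j] - mean[i][j]) ** 2 for w in transforms) / s
            for j in range(cols)]
           for i in range(rows)]
    return mean, var
```

The same two outputs are computed once and can then be stored for reuse, which is the point of sharing one prior between the two adaptation spaces.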

As described above, the prior distribution calculation device 1 of this embodiment generates a prior distribution that can be used in common for both model-space and feature-space adaptation, which reduces the computation required for the prior distribution.

The speech recognition device of Embodiment 2 is described in detail below with reference to FIGS. 3 and 4. FIG. 3 is a block diagram showing the configuration of the speech recognition device 100 of this embodiment. FIG. 4 is a flowchart showing its operation. The speech recognition device 100 is characterized in that it adapts the feature space and the model space simultaneously, using in common the prior distribution obtained in advance by the prior distribution calculation device 1 of Embodiment 1. The speech recognition device 100 includes a feature vector extraction unit 110, a feature vector storage unit 115, a feature vector transformation unit 120, an initial transformation matrix storage unit 125, a speech recognition unit 130, a recognition data storage unit 140, a recognition result storage unit 145, a feature-space statistic calculation unit 150, a feature-space transformation matrix estimation unit 155, a tree structure determination unit 160, a model-space statistic calculation unit 170, a model-space transformation matrix estimation unit 175, an acoustic model update unit 180, and a prior distribution storage unit 190. The recognition data storage unit 140 holds an acoustic model 141, a language model 142, and a word dictionary 143. The model-space statistic calculation unit 170 includes statistic G tilde calculation means 171, statistic k tilde calculation means 172, and smoothed statistic calculation means 173. The prior distribution storage unit 190 stores in advance the hyperparameters C and V of the prior distribution generated by the method described in Embodiment 1.

The processing of the speech recognition device 100 of this embodiment is outlined first, followed by the specific processing of each component.

<4. Simultaneous adaptation of the feature space and the acoustic model space with a shared prior distribution>
The speech recognition device 100 of this embodiment recognizes the input speech and computes statistics based on the recognition result (unsupervised adaptation). The prior distribution obtained in advance by the method of Embodiment 1 is reflected in these statistics. From the computed statistics, the device estimates the transformation matrices in the feature space and in the model space. It then updates the N-dimensional feature vectors with the estimated feature-space transformation matrix and the acoustic model with the estimated model-space transformation matrix, and performs recognition again.

The Q function under the ML criterion, without a prior distribution, is defined as follows.

The model-space transformation considers only the means; the variances are adapted in the feature space. Because it is difficult to optimize Equation (13) directly across the two different spaces, the simultaneous optimization is carried out here by optimizing in the feature space and in the model space in turn.

First, if the model-space transformation matrix W_r^M is set to the unit transformation matrix [0_n^T I_{n×n}] and no prior distribution is assumed in the model space, the Q function in the feature space using the prior distribution is as follows.

The i-th row of the transformation matrix in the feature space can be estimated by the following equation.

Here, the statistics are computed using the prior distribution as follows.

The statistics G^(i) hat and k^(i) hat denote the smoothed versions of G^(i) and k^(i), respectively. G^(i) and k^(i) are computed using Equations (4) and (5).
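The smoothing equations are not reproduced in this text. As an illustration of how a prior can enter such statistics, suppose (an assumption made here, not the patent's stated form) that row i of W has an independent Gaussian prior with mean row vector c_i and covariance Σ_i. Adding the log-prior to the row-wise auxiliary function and collecting the quadratic and linear terms in w_i gives:

```latex
-\tfrac12\, w_i G^{(i)} w_i^{\mathsf T} + w_i k^{(i)\mathsf T}
-\tfrac12\,(w_i - c_i)\,\Sigma_i^{-1}\,(w_i - c_i)^{\mathsf T}
= -\tfrac12\, w_i \hat{G}^{(i)} w_i^{\mathsf T} + w_i \hat{k}^{(i)\mathsf T} + \text{const}

\hat{G}^{(i)} = G^{(i)} + \Sigma_i^{-1}, \qquad
\hat{k}^{(i)} = k^{(i)} + c_i\,\Sigma_i^{-1}
```

Under this assumption the row update keeps its ML form with the smoothed statistics substituted. With little adaptation data the prior terms dominate and the estimate stays near c_i; with much data G^(i) and k^(i) dominate and the estimate approaches the ML solution.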

Next, the Q function in the model space using the prior distribution is as follows.

The hyperparameters V tilde and C tilde of the prior distribution in the model space are defined as follows.

The i-th row of the transformation matrix W_r^M of the r-th regression class is defined by the following equation.

The smoothed statistics G^(i) bar and k^(i) bar are computed using the following equations.

The model-space statistics G tilde and k tilde are computed using the following equations.

Using the resulting transformation matrix, the means of the acoustic model are updated as in the following equation.
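Equation (23) is not reproduced in this text. The sketch below shows the usual MLLR-style mean update, in which each Gaussian mean is extended with a leading 1 and multiplied by the transformation matrix of its regression class. It assumes the layout W_r^M = [b A]; the function name and the `class_of` mapping from Gaussian index to class index are hypothetical. Variances are left untouched, in line with the statement that the model-space transformation considers only the means.

```python
def update_means(means, class_of, transforms):
    """Apply the model-space transform of each Gaussian's regression class to its mean."""
    updated = []
    for u, mu in enumerate(means):
        w = transforms[class_of[u]]
        xi = [1.0] + list(mu)  # extended mean vector [1, mu]
        updated.append([sum(wc * x for wc, x in zip(row, xi)) for row in w])
    return updated
```

With the unit transformation matrix [0 I] for a class, the means of that class are returned unchanged.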

At recognition (test) time, the feature vectors of the input speech are transformed with the transformation matrix obtained in the feature space, and recognition is performed with the acoustic model updated by the model-space transformation matrix.

In this embodiment, the components of the speech recognition apparatus 100 execute the processing of Section 4 described above. First, the feature vector extraction unit 110 extracts N-dimensional feature vectors from the input speech signal (S110). Next, the feature vector storage unit 115 stores the N-dimensional feature vectors (S115). The stored N-dimensional feature vectors are used for feature vector conversion in both steps S120-1 and S120-2 described below. The feature vector conversion unit 120 converts the feature vectors using a transformation matrix (S120-1). The initial value of the transformation matrix is stored in the initial transformation matrix storage unit 125, and this initial transformation matrix is used the first time the feature vector conversion unit 120 operates (in step S120-1). Because the initial transformation matrix is a unit transformation matrix (the bias is all zeros and the rotation matrix is the identity matrix), the feature vector o(t) before conversion and the converted feature vector o(t) hat are identical. Next, the speech recognition unit 130 performs speech recognition using the acoustic model 141, the language model 142, and the word dictionary 143 stored in the recognition data storage unit 140, and generates a speech recognition result from the converted feature vectors (S130-1). The recognition result storage unit 145 stores the generated speech recognition result (S145-1).

Next, the feature space statistic calculation unit 150 calculates the statistics G hat and k hat using equation (16) (S150). The feature space transformation matrix estimation unit 155 estimates the feature space transformation matrix using equation (15) (S155). Next, the feature vector conversion unit 120 converts the feature vectors stored in the feature vector storage unit 115 using the feature space transformation matrix estimated by the feature space transformation matrix estimation unit 155 (S120-2). As in step S130-1, the speech recognition unit 130 generates a speech recognition result from the feature vectors converted in step S120-2 (S130-2). The recognition result storage unit 145 stores the generated speech recognition result (S145-2).

Next, the tree structure determination unit 160 classifies the speech recognition result into a tree structure using equation (18) and determines the hyperparameters C tilde and V tilde (S160). Next, the statistic G tilde calculation means 171 calculates the statistic G tilde by equation (21) (SS171). Next, the statistic k tilde calculation means calculates the statistic k tilde by equation (22) (SS172). The smoothed statistic calculation means 173 uses the statistic G tilde, the statistic k tilde, and the hyperparameters C tilde and V tilde to calculate the smoothed statistics G bar and k bar by equation (20) (SS173). The model space transformation matrix estimation unit 175 estimates the model space transformation matrix by equation (19) using the statistics G bar and k bar (S175). The acoustic model update unit 180 calculates the means of the acoustic model by equation (23) using the estimated model space transformation matrix, and updates the acoustic model 141 (S180).
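The two conversion passes and the final mean update can be illustrated with a minimal sketch. It assumes that a transformation is an affine map A x + b, with A the rotation matrix and b the bias, and that the mean update of equation (23) has the standard MLLR form (mean mapped to A mean + b); equations (15) to (23) are not reproduced in this passage, so the estimation steps themselves are not shown. All function names are hypothetical and chosen only for illustration.

```python
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def unit_transform(n: int) -> Tuple[Matrix, Vector]:
    """Unit transformation matrix: identity rotation and all-zero bias."""
    rotation = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    bias = [0.0] * n
    return rotation, bias


def apply_affine(rotation: Matrix, bias: Vector, x: Vector) -> Vector:
    """Return rotation @ x + bias for one N-dimensional vector."""
    return [
        sum(r_ij * x_j for r_ij, x_j in zip(row, x)) + b_i
        for row, b_i in zip(rotation, bias)
    ]


def convert_features(rotation: Matrix, bias: Vector,
                     features: List[Vector]) -> List[Vector]:
    """Feature-space conversion (S120-1 / S120-2), applied to the stored vectors."""
    return [apply_affine(rotation, bias, o_t) for o_t in features]


def update_means(rotation: Matrix, bias: Vector,
                 means: List[Vector]) -> List[Vector]:
    """Model-space update (S180), assuming the standard MLLR mean transform."""
    return [apply_affine(rotation, bias, mu) for mu in means]
```

With the unit transformation, `convert_features` returns vectors equal to its input, which matches the statement that o(t) and o(t) hat are identical in the first pass. In the second pass the same stored vectors are converted again with the estimated feature space matrix, rather than the output of the first pass being converted a second time.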

As described above, the speech recognition apparatus 100 of this embodiment adapts the feature space and the model space simultaneously using a common prior distribution determined in advance. In addition to the effect shared with Embodiment 1 of reducing the amount of computation needed for the prior distribution, the recognition rate improves when the amount of adaptation data is small because the prior distribution is used, and it improves when the amount of adaptation data is large because speaker adaptation combines the feature space and the model space. The recognition rate therefore improves regardless of the amount of adaptation data.

[Modification 1]
A speech recognition apparatus according to Modification 1, which is a modification of the speech recognition apparatus 100 of Embodiment 2, is described below with reference to FIG. 5 and FIG. 6. FIG. 5 is a block diagram showing the configuration of the speech recognition apparatus 100' of this modification. FIG. 6 is a flowchart showing the operation of the speech recognition apparatus 100' of this modification. As in Embodiment 2, the speech recognition apparatus 100' of this modification adapts the feature space and the model space simultaneously, using in common the prior distribution obtained in advance by the method of Embodiment 1. The speech recognition apparatus 100' of this modification includes a feature vector extraction unit 110, a feature vector storage unit 115, a feature vector conversion unit 120, an initial transformation matrix storage unit 125, a speech recognition unit 130, a recognition data storage unit 140, a recognition result storage unit 145, a feature space statistic calculation unit 150, a feature space transformation matrix estimation unit 155, a tree structure determination unit 160', a model space statistic calculation unit 170, a model space transformation matrix estimation unit 175, an acoustic model update unit 180, and a prior distribution storage unit 190. The components other than the tree structure determination unit 160' operate in the same way as the components with the same numbers in the speech recognition apparatus 100 of Embodiment 2, so their description is omitted.

Accordingly, steps S110 to S145-1 are executed in the same manner as in Embodiment 2. Next, the tree structure determination unit 160' uses equation (18)' to classify the first speech recognition result into a tree structure and determines the hyperparameters C tilde and V tilde (S160'-1). Equation (18)' is shown below.

Thereafter, steps S150 to S145-2 are executed in the same manner as in Embodiment 2. Next, the tree structure determination unit 160' uses equation (18) to classify the second speech recognition result into a tree structure and determines the hyperparameters C tilde and V tilde (S160'-2). Steps S170, S175, and S180 are then executed as in Embodiment 2.
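The difference in step order between Embodiment 2 and Modification 1 can be written out as a short sketch. The step labels come from the description; the list representation and the function name `modification_1_steps` are assumptions made only for illustration, and S170 stands for the sub-steps SS171 to SS173.

```python
# Step order of Embodiment 2 as described in the text.
EMBODIMENT_2_STEPS = [
    "S110", "S115", "S120-1", "S130-1", "S145-1",
    "S150", "S155", "S120-2", "S130-2", "S145-2",
    "S160", "S170", "S175", "S180",
]


def modification_1_steps(base_steps):
    """Derive the Modification 1 order from the Embodiment 2 order.

    S160'-1 is inserted right after the first recognition result is stored
    (S145-1), and the original S160 becomes S160'-2.
    """
    steps = []
    for step in base_steps:
        if step == "S160":
            steps.append("S160'-2")
        else:
            steps.append(step)
            if step == "S145-1":
                steps.append("S160'-1")
    return steps
```

Calling `modification_1_steps(EMBODIMENT_2_STEPS)` yields a sequence in which the tree structure is determined twice, once before the feature space statistics are calculated and once before the model space statistics are calculated.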

As described above, the speech recognition apparatus 100' of this modification also performs speaker adaptation using the tree structure in the feature space, so that when the amount of adaptation data is large, the recognition rate improves further than in Embodiment 2.

The various processes described above need not be executed in time series in the order described; they may also be executed in parallel or individually, depending on the processing capability of the apparatus that executes them or as needed. It goes without saying that other modifications may be made as appropriate without departing from the spirit of the present invention.

When the above-described configuration is realized by a computer, the processing content of the functions that each apparatus should have is described by a program. The processing functions are realized on the computer by executing this program on the computer.

The program describing this processing content can be recorded on a computer-readable recording medium. The computer-readable recording medium may be of any kind, for example a magnetic recording device, an optical disc, a magneto-optical recording medium, or a semiconductor memory.

The program is distributed, for example, by selling, transferring, or lending a portable recording medium such as a DVD or CD-ROM on which the program is recorded. The program may also be distributed by storing it in a storage device of a server computer and transferring it from the server computer to other computers via a network.

A computer that executes such a program first stores, for example, the program recorded on the portable recording medium or the program transferred from the server computer in its own storage device. When executing the processing, the computer reads the program stored in its own recording medium and executes processing according to the read program. As another form of execution, the computer may read the program directly from the portable recording medium and execute processing according to it, or it may execute processing according to the received program each time the program is transferred to it from the server computer. The above-described processing may also be executed by a so-called ASP (Application Service Provider) type service, in which the program is not transferred from the server computer to the computer and the processing functions are realized only through execution instructions and acquisition of results. The program in this embodiment includes information that is provided for processing by an electronic computer and is equivalent to a program (such as data that is not a direct command to the computer but has the property of defining the computer's processing).

In this embodiment, the apparatus is configured by executing a predetermined program on a computer; however, at least part of the processing content may be realized in hardware.

Claims

A prior distribution calculation device comprising:
a feature vector extraction unit that extracts feature vectors for each speaker from input speech of a plurality of speakers;
a first transformation matrix estimation unit that estimates a first transformation matrix for each speaker by feature space maximum likelihood linear regression using the feature vectors and an initial acoustic model learned in advance from the data of all speakers;
a feature vector conversion unit that converts the feature vectors of the corresponding speaker using the first transformation matrix for each speaker;
an MLE acoustic model learning unit that learns an acoustic model by the maximum likelihood method using the feature vectors converted by the feature vector conversion unit;
a second transformation matrix estimation unit that estimates a second transformation matrix for each speaker by feature space maximum likelihood linear regression using the feature vectors extracted by the feature vector extraction unit and the acoustic model learned by the MLE acoustic model learning unit; and
a unit that calculates a multivariate normal distribution of a matrix using the second transformation matrix, sets the multivariate normal distribution as a prior distribution, and outputs the hyperparameters of the prior distribution.

A speech recognition apparatus comprising:
a feature vector extraction unit that extracts a feature vector from input speech;
a feature vector conversion unit that converts the feature vector using an initial transformation matrix consisting of a unit transformation matrix, or using a feature space transformation matrix;
a recognition data storage unit that stores an acoustic model;
a speech recognition unit that performs speech recognition using the acoustic model and the feature vector converted by the feature vector conversion unit;
a feature space statistic calculation unit that calculates, using hyperparameters of a prior distribution, statistics used for estimating the feature space transformation matrix;
a feature space transformation matrix estimation unit that estimates the feature space transformation matrix using the statistics calculated by the feature space statistic calculation unit;
a model space statistic calculation unit that calculates, using the hyperparameters of the prior distribution, statistics used for estimating a model space transformation matrix;
a model space transformation matrix estimation unit that estimates the model space transformation matrix using the statistics calculated by the model space statistic calculation unit; and
an acoustic model update unit that updates the acoustic model using the estimated model space transformation matrix,
wherein the prior distribution used in common by the feature space statistic calculation unit and the model space statistic calculation unit is a multivariate normal distribution of a matrix calculated using a second transformation matrix, the second transformation matrix being obtained by: extracting feature vectors for each speaker from input speech of a plurality of speakers; estimating a first transformation matrix for each speaker by feature space maximum likelihood linear regression using the feature vectors and an initial acoustic model learned in advance from the data of all speakers; converting the feature vectors of the corresponding speaker using the first transformation matrix for each speaker; learning an acoustic model by the maximum likelihood method using the feature vectors converted with the first transformation matrix; and estimating the second transformation matrix for each speaker by feature space maximum likelihood linear regression using the feature vectors before conversion by the first transformation matrix and the learned acoustic model.

A prior distribution calculation method comprising:
a feature vector extraction step of extracting feature vectors for each speaker from input speech of a plurality of speakers;
a first transformation matrix estimation step of estimating a first transformation matrix for each speaker by feature space maximum likelihood linear regression using the feature vectors and an initial acoustic model learned in advance from the data of all speakers;
a feature vector conversion step of converting the feature vectors of the corresponding speaker using the first transformation matrix for each speaker;
an MLE acoustic model learning step of learning an acoustic model by the maximum likelihood method using the feature vectors converted in the feature vector conversion step;
a second transformation matrix estimation step of estimating a second transformation matrix for each speaker by feature space maximum likelihood linear regression using the feature vectors extracted in the feature vector extraction step and the acoustic model learned in the MLE acoustic model learning step; and
a step of calculating a multivariate normal distribution of a matrix using the second transformation matrix, setting the multivariate normal distribution as a prior distribution, and outputting the hyperparameters of the prior distribution.

A speech recognition method comprising:
a feature vector extraction step of extracting a feature vector from input speech;
a feature vector conversion step of converting the feature vector using an initial transformation matrix consisting of a unit transformation matrix, or using a feature space transformation matrix;
a speech recognition step of performing speech recognition using an acoustic model and the feature vector converted in the feature vector conversion step;
a feature space statistic calculation step of calculating, using hyperparameters of a prior distribution, statistics used for estimating the feature space transformation matrix;
a feature space transformation matrix estimation step of estimating the feature space transformation matrix using the statistics calculated in the feature space statistic calculation step;
a model space statistic calculation step of calculating, using the hyperparameters of the prior distribution, statistics used for estimating a model space transformation matrix;
a model space transformation matrix estimation step of estimating the model space transformation matrix using the statistics calculated in the model space statistic calculation step; and
an acoustic model update step of updating the acoustic model using the estimated model space transformation matrix,
wherein the prior distribution used in common in the feature space statistic calculation step and the model space statistic calculation step is a multivariate normal distribution of a matrix calculated using a second transformation matrix, the second transformation matrix being obtained by: extracting feature vectors for each speaker from input speech of a plurality of speakers; estimating a first transformation matrix for each speaker by feature space maximum likelihood linear regression using the feature vectors and an initial acoustic model learned in advance from the data of all speakers; converting the feature vectors of the corresponding speaker using the first transformation matrix for each speaker; learning an acoustic model by the maximum likelihood method using the feature vectors converted with the first transformation matrix; and estimating the second transformation matrix for each speaker by feature space maximum likelihood linear regression using the feature vectors before conversion by the first transformation matrix and the learned acoustic model.

A program for causing a computer to function as the apparatus according to claim 1.