JP2009145499A

JP2009145499A - Voice parameter learning apparatus and method therefor, voice recognition apparatus and voice recognition method using them, and their program and recording medium

Info

Publication number: JP2009145499A
Application number: JP2007321201A
Authority: JP
Inventors: Marc Delcroix; マークデルクロア; Shinji Watabe; 晋治渡部; Tomohiro Nakatani; 智広中谷
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2007-12-12
Filing date: 2007-12-12
Publication date: 2009-07-02
Anticipated expiration: 2027-12-12
Also published as: JP4960845B2

Abstract

PROBLEM TO BE SOLVED: To provide a speech parameter learning apparatus that does not depend on a specific speech enhancement method. SOLUTION: The speech parameter learning apparatus includes a speech preprocessing section for adaptation, an acoustic model storage section, an adaptation parameter generating section, a speech preprocessing section for recognition, and a dynamic variance correcting section. The adaptation parameter generating section generates, as parameters for variance correction, a dynamic variance adaptation parameter that depends on the frame and a static variance adaptation parameter that does not. The speech preprocessing section for recognition generates a speech feature for each frame of an observed speech signal, together with an uncertainty representing the variability of that feature. The dynamic variance correcting section receives the uncertainty of the speech feature, the adaptation parameters, and the acoustic model, and outputs for each frame the variance of the Gaussian distribution corrected with the adaptation parameters. COPYRIGHT: (C)2009, JPO&INPIT

Description

The present invention relates to a speech parameter learning method and apparatus for suppressing the speech distortion that arises when speech preprocessing such as noise suppression or dereverberation is applied, to a speech recognition apparatus and speech recognition method using them, and to programs and recording media for them.

In speech recognition, the observed speech signal is heavily distorted by external factors such as noise and reverberation, and speech recognizers perform poorly on such heavily distorted speech. A speech preprocessing unit can alleviate the distortion through noise suppression, dereverberation, and the like. Even after preprocessing, however, distortion remains: both distortion newly introduced by the preprocessing unit and residual distortion that it failed to remove. A commonly used countermeasure is to correct the variance parameters of the Gaussian distributions in the acoustic model used for recognition. This method is disclosed in Non-Patent Document 1. FIG. 9 shows the functional configuration of a conventional speech recognition apparatus based on this method and FIG. 10 shows its operation flow; both are described briefly below.

The speech recognition apparatus 200 includes a speech preprocessing unit 90, an acoustic model storage unit 92, a dynamic variance correction unit 94, a recognition acoustic model storage unit 96, a recognition unit 97, a pronunciation dictionary model storage unit 98, and a language model storage unit 99.

The speech preprocessing unit 90 reads the observed speech signal o(t) (step S90) and outputs, for each frame, a speech feature x̂_t estimated by a speech enhancement technique such as noise suppression or dereverberation (the hat belongs over x, as shown in the figures and equations). As noted above, however, the speech preprocessing unit 90 cannot remove the distortion completely, so there is a large mismatch between the estimated speech feature x̂_t and the clean speech features used to build the acoustic model. This mismatch is a major cause of degraded recognition performance. The estimated feature x̂_t is therefore assumed to be the sum of the clean speech feature x_t and a difference b_t (Equation (1)).

The difference b_t is assumed to follow a zero-mean Gaussian distribution, as shown in Equation (2).

Here, Σ_{x̂_t} is the variance of the speech feature. That is, the speech preprocessing unit 90 outputs the feature variance Σ_{x̂_t} together with the estimated speech feature x̂_t (step S91). In GMM-based speech enhancement, Σ_{x̂_t} is derived from the variance parameters of the Gaussian mixture model of clean speech.

The dynamic variance correction unit 94 reads the variance parameters Σ_{n,m} of the acoustic model stored in the acoustic model storage unit 92 (n is the HMM state, m the mixture component) (step S92) and corrects them using the feature variance Σ_{x̂_t} output by the speech preprocessing unit 90 (step S94). The acoustic model is described here. It is usually represented by a hidden Markov model (HMM) whose output distributions are Gaussian mixtures. The probability of emitting the speech feature x_t in HMM state n is given by Equation (3).

Here, m indexes the Gaussian mixture components and M is the number of mixture components per state. p(m) is the mixture weight. μ_{n,m} and Σ_{n,m} are the mean and the covariance matrix of the Gaussian for HMM state n and mixture component m. Acoustic models usually treat the covariance matrix as diagonal. In what follows, the diagonal elements of the covariance matrix are therefore sometimes written as σ²_{n,m,i} (the squared standard deviation), where i indexes the feature dimension.
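The output probability described above can be sketched as follows, assuming diagonal covariances as the text suggests. The function names and the toy parameter values are illustrative assumptions and do not come from the patent.

```python
import math

def log_gaussian_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian evaluated at x."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - mu) ** 2 / v)
        for xi, mu, v in zip(x, mean, var)
    )

def log_output_prob(x, weights, means, variances):
    """log p(x | n) for one HMM state whose output is a Gaussian mixture."""
    terms = [
        math.log(w) + log_gaussian_diag(x, mu, var)
        for w, mu, var in zip(weights, means, variances)
    ]
    peak = max(terms)  # log-sum-exp for numerical stability
    return peak + math.log(sum(math.exp(t - peak) for t in terms))

# Hypothetical state with M = 2 components over two-dimensional features.
weights = [0.6, 0.4]                   # p(m)
means = [[0.0, 0.0], [2.0, -1.0]]      # mu_{n,m}
variances = [[1.0, 1.0], [0.5, 2.0]]   # diagonal of Sigma_{n,m}
score = log_output_prob([0.0, 0.0], weights, means, variances)
```

Working in the log domain is a design choice of this sketch; it avoids underflow when per-frame scores are later accumulated over many frames.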

In general, the acoustic model parameters are trained on clean speech, so there is a mismatch between, for example, the mean parameters μ_{n,m} obtained from that data and the speech features x̂_t estimated by the speech preprocessing unit 90. To reduce this mismatch, the dynamic variance correction unit 94 corrects the variance parameters Σ_{n,m} of the acoustic model to fit the speech feature x̂_t. For this purpose, starting from the output probability p(x_t|n) of the acoustic model in HMM state n, one considers the joint probability of x_t and the difference b_t between x_t and x̂_t and marginalizes (integrates) over b_t. This yields, theoretically, the output probability shown in Equation (4).

Here it is assumed that p(b_t|n) ≈ p(b_t). Accordingly, the dynamic variance correction unit 94 obtains an output distribution for the estimated speech feature x̂_t by dynamically correcting, frame by frame, the variance parameters Σ_{n,m} of the acoustic model with the feature variance Σ_{x̂_t}, as shown in Equation (5).
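The effect of the marginalization can be checked numerically in one dimension. The sketch below assumes a single Gaussian component and a zero-mean Gaussian difference, and compares a trapezoid-rule integral with the closed form in which the feature variance is simply added to the model variance (the corrected variance Σ_{n,m} + Σ_{x̂_t} mentioned in the text). The function names and numbers are illustrative assumptions.

```python
import math

def gaussian(x, mean, var):
    """Density of a one-dimensional Gaussian."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def marginal_by_integration(x_hat, mean, var, var_b, steps=20000, span=12.0):
    """Integrate N(x; mean, var) * N(x_hat - x; 0, var_b) over x (trapezoid rule)."""
    lo = mean - span * math.sqrt(var)
    hi = mean + span * math.sqrt(var)
    h = (hi - lo) / steps
    total = 0.0
    for k in range(steps + 1):
        x = lo + k * h
        weight = 0.5 if k in (0, steps) else 1.0
        total += weight * gaussian(x, mean, var) * gaussian(x_hat - x, 0.0, var_b)
    return total * h

# Hypothetical values: model mean and variance, feature variance, observed feature.
mean, var, var_b, x_hat = 1.0, 0.8, 0.3, 2.1
numeric = marginal_by_integration(x_hat, mean, var, var_b)
closed_form = gaussian(x_hat, mean, var + var_b)  # variance-added Gaussian
```

The two values agree to numerical precision, which is why the correction reduces to adding the feature variance to each Gaussian's variance at every frame.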

The corrected output distributions are stored in the recognition acoustic model storage unit 96.
The recognition unit 97 takes the feature set X = [x̂_1, ..., x̂_t, ...] from the speech preprocessing unit 90 and outputs the speech recognition result W as shown in Equation (6), using the acoustic model p(X|n), the pronunciation dictionary model p(n|W) stored in the pronunciation dictionary model storage unit 98, and the language model p(W) stored in the language model storage unit 99 (step S97).

The score of the acoustic model p(X|n) for the feature set is obtained by accumulating the per-frame acoustic scores derived from the output probability p(x_t|n), using DP matching (dynamic programming) or a similar technique.
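One common way to accumulate per-frame scores by dynamic programming is the Viterbi recursion. The text only says "DP matching (dynamic programming) or the like", so the choice of Viterbi, the function name, and the toy two-state model below are assumptions made for illustration.

```python
import math

NEG_INF = float("-inf")

def viterbi_score(frame_scores, log_trans, log_init):
    """Best-path log score.

    frame_scores[t][n] is the log acoustic score of frame t in state n,
    log_trans[p][n] the log transition probability from state p to state n,
    log_init[n] the log probability of starting in state n.
    """
    n_states = len(log_init)
    best = [log_init[n] + frame_scores[0][n] for n in range(n_states)]
    for scores in frame_scores[1:]:
        best = [
            max(best[p] + log_trans[p][n] for p in range(n_states)) + scores[n]
            for n in range(n_states)
        ]
    return max(best)

# Hypothetical left-to-right model with two states and three frames.
log_trans = [[math.log(0.5), math.log(0.5)], [NEG_INF, 0.0]]
log_init = [0.0, NEG_INF]
frame_scores = [[-1.0, -5.0], [-4.0, -1.0], [-6.0, -1.0]]
total = viterbi_score(frame_scores, log_trans, log_init)
```

In a recognizer the per-frame entries of `frame_scores` would come from the corrected output distributions; here they are fixed numbers so the recursion can be followed by hand.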

The per-frame acoustic score derived from the output probability p(x_t|n) can be computed as shown in Equation (7) from the estimated speech feature x̂_t output by the speech preprocessing unit 90, the corrected variance Σ_{n,m} + Σ_{x̂_t} obtained by the dynamic variance correction unit 94, and the remaining acoustic model parameters.

Through the above operations, speech recognition is achieved that suppresses the speech distortion arising when speech preprocessing such as noise suppression or dereverberation is applied.
Deng, L., Droppo, J. and Acero, A., "Dynamic compensation of HMM variances using the feature enhancement uncertainty computed from a parametric model of speech distortion," IEEE Trans. SAP, vol. 13, no. 3, pp. 412-421, 2005.

With the above method, however, the speech preprocessing unit 90 must generate the feature variance Σ_{x̂_t} used by the dynamic variance correction unit 94. The speech preprocessing unit 90 uses a speech enhancement technique based on a Gaussian mixture model of clean speech and derives Σ_{x̂_t} from the variance parameters of that model. With many other speech enhancement techniques, such as spectral subtraction, speech separation (BSS), and Wiener filtering, it is difficult to output the feature variance directly, so the above method is hard to apply. In other words, the conventional method lacks generality because it requires a specific speech enhancement technique.

Dynamic variance correction that does not depend on the enhancement technique is also possible, by approximating the feature variance with the squared error between the speech feature u_t of the observed speech signal and the speech feature x̂_t estimated by the speech preprocessing unit. Strictly speaking, however, the feature variance needed for dynamic variance correction is the squared error between the clean speech feature x_t and the estimated feature x̂_t, so this approximation lowers the accuracy of the correction and degrades performance.
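The approximation described here can be sketched as a per-frame squared difference between the observed and enhanced features. Computing it separately for each feature dimension is an assumption of this sketch, as are the function name and the sample values.

```python
def squared_error_uncertainty(observed, enhanced):
    """Per-dimension (u_t - x_hat_t)^2 for every frame t."""
    return [
        [(u - x) ** 2 for u, x in zip(u_t, x_t)]
        for u_t, x_t in zip(observed, enhanced)
    ]

# Hypothetical features: two frames, two dimensions each.
observed = [[1.0, 2.0], [0.5, -1.0]]   # u_t from the observed signal
enhanced = [[0.5, 2.0], [0.5, 1.0]]    # x_hat_t after preprocessing
uncertainty = squared_error_uncertainty(observed, enhanced)
```

Because the clean feature x_t is unknown at recognition time, this quantity measures how much the preprocessing changed the signal rather than how far the estimate is from clean speech, which is the weakness the text points out.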

The present invention has been made in view of these points. Its object is to provide a speech parameter learning apparatus and method that can obtain an appropriate acoustic model whatever feature variance is used, a speech recognition apparatus and speech recognition method using them, and programs and recording media for them.

The speech parameter learning apparatus according to the present invention comprises a speech preprocessing unit for adaptation, an acoustic model storage unit, an adaptation parameter generation unit, a speech preprocessing unit for recognition, and a dynamic variance correction unit. The speech preprocessing unit for adaptation takes an observed speech signal as input and generates, for each frame, the speech feature of an enhanced speech signal and an uncertainty representing the variability of that feature. The acoustic model storage unit stores an acoustic model. The adaptation parameter generation unit takes as input the set of enhanced speech features, the set of uncertainties, the acoustic model, and a teacher signal, and generates, as adaptation parameters for correcting the variances of the Gaussian distributions in the acoustic model, a frame-dependent dynamic variance adaptation parameter and a frame-independent static variance adaptation parameter. The speech preprocessing unit for recognition generates, for each frame of an observed speech signal, a speech feature and an uncertainty representing the variability of that feature. The dynamic variance correction unit takes as input the feature uncertainty, the adaptation parameters, and the acoustic model, and outputs for each frame the variances of the acoustic model's Gaussian distributions corrected by the adaptation parameters.

The speech recognition apparatus according to the present invention comprises the speech parameter learning apparatus described above and a recognition unit. The recognition unit takes as input the speech features output by the speech parameter learning apparatus and the Gaussian variances of the acoustic model corrected by it, and outputs a word sequence.

In the speech parameter learning apparatus of the present invention, the adaptation parameter generation unit generates from the observed speech signal, as parameters for correcting the variances of the acoustic model, a frame-dependent dynamic variance parameter and a frame-independent static variance parameter. Because the correction parameters can be generated without using a Gaussian mixture method in the speech enhancement stage, the apparatus is general enough to work with any speech enhancement technique. A speech recognition apparatus using it can likewise achieve high recognition performance with speech distortion suppressed, without depending on a specific speech enhancement technique.

Embodiments of the present invention are described below with reference to the drawings. Identical elements in different drawings carry the same reference numerals, and their description is not repeated.

[Basic idea of the invention]
Before the embodiments are described, the basic idea behind the method of correcting the feature variance in this invention is explained. As shown in Equation (8), the invention expresses the corrected variance Σ′_{n,m,t} of the acoustic model as a combination of a dynamic component matrix Σ^D that depends on the frame t and a static component matrix Σ^S that does not.

Here the operator (+) denotes a binary operation on the matrices Σ^S and Σ^D, such as a sum, a product, or a combination of such operations.

The feature variance is used to correct the variance Σ_{n,m} of the acoustic model. As shown in Equation (9), it suffices to find a function f that takes the feature variance e_t and Σ_{n,m} as arguments and outputs the corrected variance.

When the feature variance Σ_{x̂_t} described in the background section is added directly to the acoustic model variance, that is, Σ_{n,m} + Σ_{x̂_t} (Equation (5)), adequate performance cannot be obtained unless Σ_{x̂_t} is estimated accurately. This approach also restricts the choice of speech enhancement technique. In this invention, therefore, as shown in Equation (10), the acoustic model variance is expressed as a combination of a dynamic component matrix Σ^D(e_t), which depends on the uncertainty of the speech feature in each frame t, and a static component matrix Σ^S, which does not. The uncertainty e_t may be a scalar, a vector, or a matrix. Candidate scalar uncertainties include binary values or confidence scores output during speech preprocessing such as speech enhancement or voice activity detection (VAD), and confidence scores computed by running speech recognition. A vector uncertainty can be obtained by computing a scalar uncertainty for each feature dimension, and a matrix uncertainty can be computed from a covariance matrix or an autocorrelation matrix.

To estimate the acoustic model variance Σ′_{n,m,t} at a frame t, it is useful to have not only the uncertainty e_t of that frame but also the set e of feature uncertainties over frames including t, the set x̂ of speech features, and the acoustic model Ψ. Using these, the acoustic model variance is expressed as shown in Equation (11).

The acoustic model variance Σ′_{n,m,t} depends on data sets such as the set of enhanced speech features over a finite interval from t′ to t″ that contains t, for example x̂ = {x̂_{t′}, ..., x̂_t, ..., x̂_{t″}}, and the set of their uncertainties, for example e = {e_{t′}, ..., e_t, ..., e_{t″}}. It can therefore be determined accurately by learning.
Suitable functional forms for Equation (11) are given next. In general, the more complex the functional form, the more training data and training time it requires, but the more accurately it can be learned. Conversely, the simpler the form, the less data and time it needs, but its accuracy is generally lower than that of a complex form. The forms below should therefore be chosen according to practical conditions such as the amount of training data and the training time available. Since the parameters are assumed to be estimated by learning, arguments such as e and x̂ are omitted below for brevity. As simple forms of the binary operation (+) in Equation (11), the product form of Equation (12) and the sum form of Equation (13) can be considered.

By analogy with Equation (5), the sum form of Equation (13) is a theoretically and practically reasonable choice. The sum form is therefore used in the rest of the description.

Assuming that Σ^S depends on the variance of the acoustic model, the acoustic model variance Σ′_{n,m,t} can be expressed as shown in Equation (14).

Here, arbitrary functions, such as matrix polynomials, can be used as the functional forms of Σ^S and Σ^D. The simplest forms are given by Equations (15) and (16).

These are the transformation formulas for the variance when the features undergo a linear transformation.
Here A, B, C, and D are square matrices of the feature dimension; they are variables distinct from any A to D used elsewhere. The matrices may take any form (symmetric, block, band, or a scalar multiple of the identity matrix). In what follows, the variance bias terms are ignored (B = 0, D = 0) and A and C are taken to be diagonal. Writing the i-th diagonal elements of A and C as √λ_i and √α_i, the diagonal elements of the acoustic model variance Σ′_{n,m,t} are given by Equation (17). In other words, the variance of the acoustic model can be expressed parametrically.
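The step from the matrix form to the per-dimension form can be checked with a small example. Equations (15) and (16) are not reproduced in this text; the sketch assumes the usual variance transformation under a linear map, A Σ Aᵀ with the bias term set to zero, and a diagonal A whose entries are √λ_i. The helper names and values are illustrative.

```python
import math

def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]

def transpose(a):
    return [list(row) for row in zip(*a)]

def diag(values):
    """Square matrix with the given values on the diagonal."""
    n = len(values)
    return [[values[i] if i == j else 0.0 for j in range(n)] for i in range(n)]

lam = [4.0, 0.25]     # hypothetical lambda_i
sigma2 = [1.5, 2.0]   # hypothetical model variances sigma^2_i
A = diag([math.sqrt(v) for v in lam])
transformed = matmul(matmul(A, diag(sigma2)), transpose(A))
diagonal = [transformed[i][i] for i in range(len(lam))]
```

Under these assumptions each diagonal entry equals λ_i times the original variance, so a diagonal A reduces the matrix expression to one scale factor per feature dimension. The same argument applies to C and √α_i for the dynamic component.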

Here σ²_{n,m,i} is the (i, i) diagonal element of the covariance matrix of the Gaussian for state n and mixture component m in the acoustic model. The parameters to be estimated by learning are then α and λ. Note that setting α = 0 gives the conventional static variance correction method, while setting α = const and λ_i = 1 gives the conventional dynamic variance correction method. The method of this invention thus subsumes both conventional methods. An embodiment of the speech parameter learning apparatus based on this idea is described next.
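The two special cases can be illustrated with a small function. Equation (17) is not reproduced in this text; the sketch assumes the sum form λ_i σ²_{n,m,i} + α_i e_{t,i}, which is consistent with the special cases stated above. The function name, the vector-valued uncertainty, and the numbers are assumptions.

```python
def corrected_variance(sigma2, e_t, lam, alpha):
    """Per-dimension lam_i * sigma2_i + alpha_i * e_t_i (assumed form of Equation (17))."""
    return [l * s + a * e for s, e, l, a in zip(sigma2, e_t, lam, alpha)]

sigma2 = [1.0, 2.0]   # model variances for one (state, mixture) pair
e_t = [0.5, 0.1]      # uncertainty of the current frame

# alpha = 0: only the static scaling remains (static variance correction).
static_only = corrected_variance(sigma2, e_t, lam=[1.2, 0.8], alpha=[0.0, 0.0])
# lambda_i = 1, alpha constant: the uncertainty is added to the model variance.
dynamic_only = corrected_variance(sigma2, e_t, lam=[1.0, 1.0], alpha=[1.0, 1.0])
# General case: both components contribute.
combined = corrected_variance(sigma2, e_t, lam=[1.2, 0.8], alpha=[0.5, 0.5])
```

Only the dynamic term changes from frame to frame, because e_t is the only frame-dependent input; λ and α stay fixed once they have been estimated.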

FIG. 1 shows a schematic functional configuration of Embodiment 1 of the speech parameter learning apparatus of the present invention. The speech parameter learning apparatus 100 comprises a speech preprocessing unit for adaptation 2, an acoustic model storage unit 4, an adaptation parameter generation unit 6, a speech preprocessing unit for recognition 8, and a dynamic variance correction unit 10. FIG. 2 shows its operation flow. The speech parameter learning apparatus 100 of this example is realized by loading a predetermined program into a computer comprising, for example, a ROM, a RAM, and a CPU, and having the CPU execute the program.

The speech parameter learning apparatus 100 estimates the parameters α and λ described above. The observed speech signal input to the speech preprocessing unit for adaptation 2 and the speech preprocessing unit for recognition 8 is, for example, a sequence of discrete values sampled at 8 kHz with 16-bit quantization. Both units process these values in frames of, for example, 240 samples.
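With the example figures above, one frame of 240 samples at 8 kHz spans 30 ms. The sketch below groups samples into frames; the text does not state a frame shift, so the non-overlapping layout and the dropped incomplete tail are assumptions, as is the function name.

```python
def split_into_frames(samples, frame_length=240):
    """Group samples into consecutive non-overlapping frames, dropping any incomplete tail."""
    n_frames = len(samples) // frame_length
    return [samples[k * frame_length:(k + 1) * frame_length] for k in range(n_frames)]

SAMPLE_RATE_HZ = 8000
frame_ms = 1000.0 * 240 / SAMPLE_RATE_HZ  # duration of one frame in milliseconds

signal = list(range(1000))  # placeholder standing in for 16-bit samples
frames = split_into_frames(signal)
```

Practical front ends often use overlapping frames; that would only change the slicing step, not the per-frame processing that follows.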

The speech preprocessing unit for adaptation 2 generates a set of enhanced speech features {x̂_{t′}, ..., x̂_t, ..., x̂_{t″}}, obtained by enhancing the speech features of each frame of the observed speech signal o(t), and a set of uncertainties {e_{t′}, ..., e_t, ..., e_{t″}} representing the variability of the enhanced features (step S2, FIG. 2). The adaptation parameter generation unit 6 takes as input the set of enhanced speech features, the set of uncertainties, the acoustic model stored in the acoustic model storage unit 4, and a teacher signal, and generates adaptation parameters for correcting the Gaussian distributions in the acoustic model (step S6). Adaptation parameter generation consists of two processes: a static variance adaptation process that generates the frame-independent static variance adaptation parameter λ (step S62) and a dynamic variance adaptation process that generates the frame-dependent dynamic variance adaptation parameter α (step S66). The two processes may be performed in either order.

The speech preprocessing unit for recognition 8 generates, for each frame of the observed speech signal o(t), a speech feature x̂_t and an uncertainty e_t representing the variability of that feature (step S8). In this example it performs the same processing as the speech preprocessing unit for adaptation 2. The dynamic variance correction unit 10 takes as input the adaptation parameters α and λ, the uncertainty e_t, and the acoustic model stored in the acoustic model storage unit 4, and outputs for each frame the variance Σ′_{n,m,t} obtained by correcting the Gaussian variance Σ_{n,m} of the acoustic model with α and λ (step S10).

The speech preprocessing unit for adaptation 2, the adaptation parameter generation unit 6, and the dynamic variance correction unit 10 constitute an adaptation parameter learning unit. Learning of the parametrically expressed variance parameters of the acoustic model is described here.
Learning generally requires a teacher signal. As the teacher signal (hereinafter called a label), label information for each frame is needed. Labels may be word information, phoneme information, HMM state information, and so on. If the observed speech signal is already labeled, those labels are used as they are. Otherwise, labels may be assigned using, for example, a speech recognizer or a voice activity detector (not shown).

Learning is a method of generating acoustic model parameters from speech data, labels, and the like; its output is a new acoustic model, which the speech recognition apparatus then uses for recognition. In this example, adaptation is used for the dynamic correction. Adaptation also generates parameters from speech data, labels, and the like, but unlike learning its output is a set of adaptation parameters. The adaptation parameter generation unit 6 consists of static variance adaptation means 62 and dynamic variance adaptation means 66. It takes as input the set of enhanced speech features, the set of their uncertainties, the labels, and the acoustic model, and computes adaptation parameters for variance correction such as α and λ in Equation (17).

As the learning criterion, likelihood maximization is adopted, for example. Maximum likelihood learning estimates the parameters so as to maximize the likelihood that the acoustic model stored in the acoustic model storage unit 4 outputs the training data. Bayesian learning, whose criterion is maximization of the posterior probability, may be used instead; in that case, an appropriate conjugate prior or non-informative prior must be set for each parameter. Discriminative learning based on a discriminative criterion such as the speech recognition rate is another option. Any of these criteria yields a cost function whose arguments are the parameters.

The parameters that optimize the cost function obtained from the learning criterion are then estimated. Possible optimization methods include numerical methods such as steepest descent, neural networks, sampling methods such as Markov chain Monte Carlo, and genetic algorithms. This embodiment is described using the expectation-maximization (EM) algorithm.

Rather than maximizing the likelihood directly, the EM algorithm finds the parameters that maximize the auxiliary function Q(θ|θ′) defined by equation (18).

θ is the set of parameters for variance correction, specifically α and λ. X is the sequence of clean speech features, T is the number of frames, θ′ is the estimate from the previous iteration, and θ is the parameter set being estimated in the current iteration.

Because the auxiliary function Q(θ|θ′) increases and decreases together with the likelihood, the θ that maximizes equation (18) is a locally optimal solution. Here, B is the sequence of difference features, S is the set of all HMM state sequences, C is the set of all mixture component sequences, and N is the number of HMM states. The auxiliary function Q(θ|θ′) resembles that of the conventional stochastic matching method; the difference is the presence of the dynamic correction term, that is, the logarithm of the output distribution of the difference vector b_t in the fourth line of equation (18).

In the expectation step (E-step), a data assignment method for hidden variables, such as the forward-backward algorithm or the Viterbi algorithm, is used to calculate the occupancy posterior probability assigned to the state sequence and the mixture component sequence for each frame. From these values, various statistics such as first-order statistics are obtained by taking expectations.

In the maximization step (M-step), the parameter θ̂ given by equation (19), which maximizes equation (18), is obtained from the statistics computed in the E-step.

The adaptation parameters α and λ depend on each other, so optimizing both simultaneously is difficult. The adaptive parameter generation unit 6 therefore estimates the static variance parameter λ and the dynamic variance parameter α separately. FIG. 3 shows a more specific functional configuration of the adaptive speech preprocessing unit 2 and the adaptive parameter generation unit 6, with which the speech parameter learning device 100 is described in more detail. FIG. 4 shows the operation flow.

The adaptive speech preprocessing unit 2 comprises a speech enhancement unit 20, a feature calculation unit 21, an enhanced speech feature calculation unit 22, and an uncertainty calculation unit 23. The speech enhancement unit 20 generates an enhanced speech signal ô(t) in which the speech features of each frame of the input observed speech signal o(t) are enhanced (step S2a). The feature calculation unit 21 calculates the feature u_t of each frame of the observed speech signal o(t) (step S2b). The enhanced speech feature calculation unit 22 calculates the speech features x̂_t of the enhanced speech signal as the set of enhanced speech features {x̂_{t′}, …, x̂_t, …, x̂_{t″}} (step S2c). The uncertainty calculation unit 23 takes as input the enhanced speech feature x̂_t and the feature u_t of the observed speech signal o(t) for each frame, calculates the uncertainty e_t = (x̂_t − u_t)^2, which represents the variation of the enhanced speech feature, and outputs the set of uncertainties, for example {e_{t′}, …, e_t, …, e_{t″}} (step S2d). Both sets are input to the adaptive parameter generation unit 6.
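The uncertainty calculation of step S2d can be sketched in a few lines. The following Python is a minimal illustration, assuming that features are stored as per-frame lists of floats; the function name `uncertainty` and the toy values are hypothetical and do not come from the source.

```python
def uncertainty(enhanced, observed):
    """Per-frame, per-dimension uncertainty e_t = (x_hat_t - u_t)^2."""
    if len(enhanced) != len(observed):
        raise ValueError("feature sequences must have the same number of frames")
    return [
        [(x - u) ** 2 for x, u in zip(x_t, u_t)]
        for x_t, u_t in zip(enhanced, observed)
    ]


# Toy data: 2 frames, 3 feature dimensions (values are made up for illustration).
x_hat = [[1.0, 2.0, 3.0], [0.5, 0.0, -1.0]]
u = [[1.5, 2.0, 1.0], [0.5, 1.0, -3.0]]
e = uncertainty(x_hat, u)
```

A dimension in which enhancement changed nothing gets zero uncertainty, while a dimension that enhancement altered strongly gets a large one.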

The adaptive parameter generation unit 6 comprises an occupancy probability calculation unit 64, a clean speech variance calculation unit 62a, a scaling factor λ calculation unit 62b, a squared difference calculation unit 66a, and a scaling factor α calculation unit 66b.

The occupancy probability calculation unit 64 receives the set of enhanced speech features {x̂_{t′}, …, x̂_t, …, x̂_{t″}}, the set of uncertainties {e_{t′}, …, e_t, …, e_{t″}}, the labels, and the acoustic model in the acoustic model storage unit 4, and calculates the occupancy probability γ_t(n, m) of HMM state n and mixture component m (step S60). This occupancy probability γ_t(n, m) can be calculated in the E-step of the EM algorithm by a data assignment method such as the forward-backward algorithm or the Viterbi algorithm.
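As a rough illustration of how occupancy probabilities arise from the forward-backward algorithm, the sketch below computes state-level posteriors for a toy HMM. It is a simplification under stated assumptions: the per-frame emission likelihoods are taken as precomputed inputs, and the mixture-component index m and the uncertainty-dependent variances of the source are omitted. The function name and the toy model are hypothetical.

```python
def state_occupancy(init, trans, emit):
    """Forward-backward posteriors gamma[t][n] = P(state n at frame t | all frames).

    init[n]     : initial state probability
    trans[i][j] : transition probability from state i to state j
    emit[t][n]  : likelihood of frame t under state n (assumed precomputed)
    """
    T, N = len(emit), len(init)
    # Forward pass, rescaled per frame to avoid numerical underflow.
    alpha = []
    for t in range(T):
        if t == 0:
            a = [init[n] * emit[0][n] for n in range(N)]
        else:
            a = [
                emit[t][n] * sum(alpha[t - 1][i] * trans[i][n] for i in range(N))
                for n in range(N)
            ]
        s = sum(a)
        alpha.append([v / s for v in a])
    # Backward pass, also rescaled per frame.
    beta = [[1.0] * N for _ in range(T)]
    for t in range(T - 2, -1, -1):
        b = [
            sum(trans[n][j] * emit[t + 1][j] * beta[t + 1][j] for j in range(N))
            for n in range(N)
        ]
        s = sum(b)
        beta[t] = [v / s for v in b]
    # Combine and normalize so that each frame's posteriors sum to one.
    gamma = []
    for t in range(T):
        g = [alpha[t][n] * beta[t][n] for n in range(N)]
        s = sum(g)
        gamma.append([v / s for v in g])
    return gamma


# Toy left-to-right HMM with 2 states and 3 frames (all numbers are made up).
init = [1.0, 0.0]
trans = [[0.7, 0.3], [0.0, 1.0]]
emit = [[0.9, 0.1], [0.5, 0.5], [0.1, 0.9]]
gamma = state_occupancy(init, trans, emit)
```

Because the per-frame scaling factors cancel in the final normalization, the result equals the unscaled posterior while staying numerically stable on long utterances.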

The clean speech variance calculation unit 62a takes as input the set of enhanced speech features {x̂_{t′}, …, x̂_t, …, x̂_{t″}}, the set of uncertainties {e_{t′}, …, e_t, …, e_{t″}}, and the acoustic model in the acoustic model storage unit 4, and calculates the estimate of the clean speech variance A{x_{t,i}, x̂_t, n, m, Ψ, α′, λ′}.

The scaling factor λ calculation unit 62b takes as input the estimate of the clean speech variance A{x_{t,i}, x̂_t, n, m, Ψ, α′, λ′} and the occupancy probability γ_t(n, m) and, with α held constant, updates the scaling factor λ_i for each feature dimension i in the M-step of the EM algorithm as shown in equation (20) (step S62).

Here,

The clean speech variance calculation unit 62a and the scaling factor λ calculation unit 62b constitute the static variance adaptation means 62.

The squared difference calculation unit 66a takes as input the set of enhanced speech features {x̂_{t′}, …, x̂_t, …, x̂_{t″}}, the set of uncertainties {e_{t′}, …, e_t, …, e_{t″}}, and the acoustic model in the acoustic model storage unit 4, and calculates the expected value E{b_{t,i}^2 | x̂_t, n, m, Ψ, α′, λ′} of the squared difference b_t^2 between the speech feature x̂_t and the clean speech feature x_t.

The scaling factor α calculation unit 66b, with λ held constant, updates the scaling factor α_i for each feature dimension i as shown in equation (23) (step S66). Equation (23) is obtained by substituting equations (17) and (2) into equation (18) with λ held constant and maximizing with respect to α_i.

Here,

From equation (23), the scaling factor α_i can be interpreted as the ratio of the expected squared difference to the uncertainty e_{t,i}, with the expectation taken over all training data, all HMM states, and all mixture components. The squared difference calculation unit 66a and the scaling factor α calculation unit 66b constitute the dynamic variance adaptation means 66.
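The verbal interpretation above can be turned into a small numerical sketch. Equation (23) itself is not reproduced in this text, so the Python below only follows the stated interpretation: an occupancy-weighted average of the ratio between the expected squared difference and the uncertainty, for one feature dimension. The function name, the flattened (state, component) indexing, and the variance floor are assumptions made for illustration.

```python
def alpha_ratio_estimate(gamma, expected_sq_diff, uncertainty, floor=1e-10):
    """Occupancy-weighted average of E{b^2} / e for one feature dimension.

    gamma[t][k]            : occupancy probability of (state, component) pair k at frame t
    expected_sq_diff[t][k] : expected squared difference for pair k at frame t
    uncertainty[t]         : uncertainty e_t of this dimension at frame t
    """
    num = 0.0
    den = 0.0
    for g_t, b2_t, e_t in zip(gamma, expected_sq_diff, uncertainty):
        e_safe = max(e_t, floor)  # guard against division by zero (added assumption)
        for g, b2 in zip(g_t, b2_t):
            num += g * b2 / e_safe
            den += g
    return num / den


# Toy statistics: 2 frames, 2 (state, component) pairs (values are made up).
gamma = [[0.5, 0.5], [1.0, 0.0]]
b2 = [[2.0, 4.0], [1.0, 9.0]]
e = [2.0, 0.5]
alpha_i = alpha_ratio_estimate(gamma, b2, e)
```

A value of α_i above one would indicate that the raw uncertainty underestimates the actual mismatch in that dimension, and a value below one that it overestimates it.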

The variance dynamic correction unit 10 takes as input the scaling factors α_i and λ_i, the acoustic model stored in the acoustic model storage unit 4, and the per-frame uncertainty e_t supplied by the recognition speech preprocessing unit 8, and outputs the corrected variance Σ′_{n,m,t} of the Gaussian distributions of the acoustic model. For example, when Σ′_{n,m,t} is a diagonal matrix, each diagonal component can be calculated by equation (26).
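To make the idea of a frame-dependent variance concrete, the sketch below corrects a diagonal variance and evaluates a Gaussian log-likelihood with it. Equation (26) is not reproduced in this text, so the additive form used here (static factor λ_i scaling the model variance plus dynamic factor α_i scaling the uncertainty) is an assumption made purely for illustration and may differ from the actual equation. All function names and numbers are hypothetical.

```python
import math


def corrected_variance(model_var, e_t, alpha, lam):
    """Assumed form: var'[i] = lam[i] * model_var[i] + alpha[i] * e_t[i]."""
    return [l * s + a * e for s, e, a, l in zip(model_var, e_t, alpha, lam)]


def diag_gauss_loglik(x, mean, var):
    """Log-likelihood of x under a Gaussian with diagonal covariance."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


# Toy example: one Gaussian, two feature dimensions (values are made up).
model_var = [1.0, 2.0]
e_t = [0.5, 0.0]      # frame uncertainty: only dimension 0 is unreliable
alpha = [2.0, 2.0]
lam = [1.0, 0.5]
var_t = corrected_variance(model_var, e_t, alpha, lam)

# A distorted frame is penalized less once the variance is widened.
ll_plain = diag_gauss_loglik([3.0], [0.0], [1.0])
ll_wide = diag_gauss_loglik([3.0], [0.0], [4.0])
```

Under this assumed form, frames with high uncertainty widen the Gaussian and therefore contribute less sharply to the decoding score, while λ_i applies the same rescaling to every frame.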

[Application example]
The speech recognition apparatus 150 can be built using the speech parameter learning device 100 described above. FIG. 5 shows an example of the functional configuration of the speech recognition apparatus 150, and FIG. 6 shows its operation flow. The speech recognition apparatus 150 is obtained by replacing the speech preprocessing unit 90, the acoustic model storage unit 92, and the variance dynamic correction unit 94 of the conventional speech recognition apparatus 200 described in the background art with the speech parameter learning device 100. The rest of the configuration is the same as in the speech recognition apparatus 200. For each frame, the speech parameter learning device 100 performs the operation described above and outputs the speech feature x̂_t of that frame of the observed speech signal, the variance Σ′_{n,m,t} of the Gaussian distributions of the acoustic model corrected with the adaptation parameters, and the mean parameter μ_{n,m} of the acoustic model (step S10, FIG. 6). The recognition unit 74 operates in the same way as in the speech recognition apparatus 200 already described and outputs a word string W using the corrected variance Σ′_{n,m,t} (step S97). In other words, the speech recognition apparatus 150 achieves speech recognition with suppressed speech distortion without depending on any specific speech enhancement method. As described below, it also achieves high recognition performance.

Because the speech feature x̂_t, the variance Σ′_{n,m,t} of the Gaussian distributions of the acoustic model corrected with the adaptation parameters, and the mean parameter μ_{n,m} are output for each frame, the acoustic model storage unit 96 for speech recognition need not be provided.

[Simulation results]
The word error rate (WER) of a speech recognition apparatus using the speech parameter learning device of the present invention was evaluated. A recently proposed blind dereverberation method was used as the speech enhancement method. The TI-Digit continuous digit recognition task was used as the recognition task. The acoustic model was a word model: a speaker-independent acoustic model with 16 states per word and 3 Gaussians per state was built from clean speech. The sampling frequency was 8 kHz. The speech features were 12 MFCCs and the 0th cepstral coefficient together with their delta and acceleration components, giving a 39-dimensional feature vector every 10 ms. Cepstral mean normalization (CMN) was applied to the speech features.
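The feature configuration above can be checked with a little arithmetic, and CMN itself is a one-line operation. The sketch below is illustrative only: it assumes that the mean is taken over the frames of a single utterance (the source does not say over what span CMN is computed), and the function names and toy frames are hypothetical.

```python
def feature_dimension(n_mfcc=12, include_c0=True, n_streams=3):
    """Static coefficients times the number of streams (static, delta, acceleration)."""
    static = n_mfcc + (1 if include_c0 else 0)
    return static * n_streams


def cepstral_mean_normalize(frames):
    """Subtract the per-dimension mean over all frames (assumed: one utterance)."""
    n = len(frames)
    dim = len(frames[0])
    mean = [sum(f[d] for f in frames) / n for d in range(dim)]
    return [[f[d] - mean[d] for d in range(dim)] for f in frames]


dim = feature_dimension()                    # (12 + 1) * 3
frame_shift_samples = 8000 * 10 // 1000      # 10 ms at 8 kHz

# Toy 2-frame, 2-dimensional "utterance" (values are made up).
normalized = cepstral_mean_normalize([[1.0, 2.0], [3.0, 6.0]])
```

After normalization every dimension has zero mean over the utterance, which removes a constant channel offset in the cepstral domain.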

Reverberant speech was generated by convolving clean speech with a room's transfer characteristics, using a transfer function measured in a room with a reverberation time of 0.5 seconds. The clean speech was taken from the TI-Digit clean set. The test data consisted of 561 utterances spoken by 104 male and female speakers. The average utterance length was 6 seconds.
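Generating reverberant speech by convolution can be illustrated with a direct time-domain implementation. This is a minimal sketch in pure Python; the short signal and the three-tap impulse response are made-up stand-ins for real speech and a measured room response, and a practical implementation would normally use an FFT-based routine instead.

```python
def convolve(signal, impulse_response):
    """Full linear convolution of a signal with a room impulse response."""
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(impulse_response):
            out[i + j] += s * h  # each input sample launches a scaled copy of the response
    return out


clean = [1.0, 0.0, 0.0, 2.0]   # toy "clean speech" samples
room = [1.0, 0.5, 0.25]        # toy decaying impulse response
reverberant = convolve(clean, room)
```

The output is longer than the input by the length of the impulse response minus one sample, which corresponds to the reverberant tail that continues after the speech has ended.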

FIG. 7 shows the recognition results in terms of word error rate. The word error rates for clean speech, reverberant speech, dereverberated speech, variance dynamic correction (without adaptation), and variance dynamic correction (oracle) are compared. Here, the oracle is the ideal case in which the feature variance needed for the variance dynamic correction is computed from the features of the clean speech and of the dereverberated speech. As FIG. 7 shows, dereverberation improves the word error rate slightly, but a large gap remains relative to the result for clean speech. Conventional variance dynamic correction improves recognition performance considerably, yet there is still a large gap relative to the oracle. The goal of the present invention is to bring recognition performance close to the oracle value.
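For readers unfamiliar with the metric, the word error rate is conventionally the word-level edit distance between the reference and the recognized word string, divided by the number of reference words. The sketch below shows that standard definition; the source does not state which scoring tool was used, and the function name and example strings are hypothetical.

```python
def word_error_rate(reference, hypothesis):
    """(substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,             # deletion of a reference word
                cur[j - 1] + 1,          # insertion of a hypothesis word
                prev[j - 1] + (r != h),  # match or substitution
            ))
        prev = cur
    return prev[-1] / len(ref)


wer = word_error_rate("one two three four", "one three four five")
```

In this example one word is deleted and one inserted, so two errors are counted against four reference words.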

Using adaptation data from unspecified speakers makes it possible to adapt to the enhanced speech data rather than to the speaker. The adaptation data consisted of 520 utterances spoken by the same speakers as the test data. To examine the effect of the number of utterances, between 2 and 512 utterances were drawn at random from the adaptation data and used for adaptation. FIG. 8 shows the word error rates obtained with static variance adaptation (SVA), dynamic variance adaptation (DVA), and SDVA, the method of the present invention. The horizontal axis is the number of utterances and the vertical axis is the word error rate (WER). Recognition performance converges sufficiently with as few as about two utterances. Static variance adaptation improves the word error rate from 31% (FIG. 7) to 15.2%, and dynamic variance adaptation likewise improves it to about 15.5%. SDVA, which applies dynamic and static variance adaptation together, improves the word error rate by about a further 2%. As a result, the error rate was reduced to less than about half that of the dereverberated speech (31.0%) shown in FIG. 7. To improve recognition further, the variance adaptation method of the present invention was combined with adaptation of the mean parameters by MLLR (Maximum Likelihood Linear Regression), yielding a word error rate of 5%. This 5% word error rate is close to the result for clean speech (1.2%). The speech parameter learning device of the present invention thus improves the word error rate.
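The size of these improvements is easier to compare as relative error reductions. The short calculation below uses only figures reported above (31.0%, 15.2%, and 5%); the helper name `relative_reduction` is hypothetical, and no relative figure is computed for SDVA alone because its exact error rate is not stated.

```python
def relative_reduction(baseline_wer, new_wer):
    """Fraction of the baseline error removed; 0.5 means the error was halved."""
    return (baseline_wer - new_wer) / baseline_wer


dereverberated = 31.0  # WER (%) after dereverberation only, as reported
sva = 15.2             # WER (%) with static variance adaptation, as reported
with_mllr = 5.0        # WER (%) when combined with MLLR mean adaptation, as reported

sva_gain = relative_reduction(dereverberated, sva)
mllr_gain = relative_reduction(dereverberated, with_mllr)
```

Static variance adaptation alone removes roughly half of the errors that remain after dereverberation, and the combination with MLLR removes more than four fifths of them.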

The adaptation method described above focuses on the variance parameters, but it can also be combined with adaptation methods for other parameters such as the mean parameters, the state transition probabilities, and the mixture weights.

The apparatus and method of the present invention are not limited to the embodiments described above and may be modified as appropriate without departing from the spirit of the invention. The processes described for the apparatus and method need not be executed in time series in the order described; they may also be executed in parallel or individually, depending on the processing capability of the apparatus executing them or as required.

When the processing means of the above apparatus are implemented by a computer, the processing of the functions that each apparatus should have is described by a program. Executing this program on the computer implements the processing means of each apparatus on the computer.

The program describing this processing can be recorded on a computer-readable recording medium. Any computer-readable recording medium may be used, for example a magnetic recording device, an optical disc, a magneto-optical recording medium, or a semiconductor memory. Specifically, a hard disk drive, a flexible disk, or magnetic tape may be used as the magnetic recording device; a DVD (Digital Versatile Disc), DVD-RAM (Random Access Memory), CD-ROM (Compact Disc Read Only Memory), or CD-R (Recordable)/RW (ReWritable) as the optical disc; an MO (Magneto-Optical disc) as the magneto-optical recording medium; and an EEP-ROM (Electronically Erasable and Programmable-Read Only Memory) as the semiconductor memory.

The program is distributed, for example, by selling, transferring, or lending a portable recording medium such as a DVD or CD-ROM on which it is recorded. The program may also be distributed by storing it in the storage device of a server computer and transferring it from the server computer to other computers over a network.

Each means may be implemented by executing a predetermined program on a computer, or at least part of the processing may be implemented in hardware.

FIG. 1 shows an example of the functional configuration of the speech parameter learning device 100 of the present invention.
FIG. 2 shows the operation flow of the speech parameter learning device 100.
FIG. 3 shows a more specific example of the functional configuration of the adaptive speech preprocessing unit 2 and the adaptive parameter generation unit 6.
FIG. 4 shows the operation flow of FIG. 3.
FIG. 5 shows an example of the functional configuration of the speech recognition apparatus 150 using the speech parameter learning device 100.
FIG. 6 shows the operation flow of the speech recognition apparatus 150.
FIG. 7 shows the recognition results evaluated by word error rate.
FIG. 8 shows the word error rates obtained with static variance adaptation (SVA), dynamic variance adaptation (DVA), and SDVA, the method of the present invention.
FIG. 9 shows an example of the functional configuration of the conventional speech recognition apparatus 200.
FIG. 10 shows the operation flow of the speech recognition apparatus 200.

Claims

An adaptive speech preprocessing unit that takes an observed speech signal as input and generates a set of enhanced speech features, in which the speech features of each frame of the observed speech signal are enhanced, and a set of uncertainties representing the variation of the enhanced speech features;
An acoustic model storage unit storing an acoustic model;
An adaptive parameter generation unit that takes as input the set of enhanced speech features, the set of uncertainties, the acoustic model, and a teacher signal, and generates, as adaptive parameters for variance correction of the Gaussian distributions in the acoustic model, a dynamic variance adaptive parameter that depends on the frame and a static variance adaptive parameter that is independent of the frame;
A recognition speech preprocessing unit that takes the observed speech signal as input and generates a speech feature for each frame of the observed speech signal and an uncertainty representing the variation of the speech feature;
A variance dynamic correction unit that takes as input the uncertainty of the speech feature, the adaptive parameters, and the acoustic model, and outputs, for each frame, the variance of the Gaussian distributions of the acoustic model corrected with the adaptive parameters;
A speech parameter learning apparatus comprising the above units.

The speech parameter learning device according to claim 1, wherein
The adaptive speech preprocessing unit comprises:
A speech enhancement unit that generates an enhanced speech signal in which the speech features of each frame of the input observed speech signal are enhanced;
A feature calculation unit that calculates a feature for each frame of the observed speech signal;
An enhanced speech feature calculation unit that calculates an enhanced speech feature for each frame of the enhanced speech signal and generates the set of enhanced speech features; and
An uncertainty calculation unit that calculates, from the enhanced speech feature of the enhanced speech signal and the feature of the observed speech signal, an uncertainty representing the variation of the enhanced speech feature, and generates the set of uncertainties of the enhanced speech features; and
The adaptive parameter generation unit comprises:
An occupancy probability calculation unit that takes as input the set of enhanced speech features, the set of uncertainties of the enhanced speech features, the acoustic model, and the teacher signal, and calculates the occupancy probability of HMM state n and mixture component m;
A clean speech variance calculation unit that takes as input the set of enhanced speech features, the set of uncertainties of the enhanced speech features, and the acoustic model, and calculates the variance of clean speech;
A scaling factor λ calculation unit that takes as input the variance of the clean speech and the occupancy probability, and calculates a scaling factor λ as the static variance adaptive parameter;
A squared difference calculation unit that takes as input the set of enhanced speech features, the set of uncertainties of the enhanced speech features, and the acoustic model, and calculates the expected value of the squared difference between the clean speech feature and the speech feature; and
A scaling factor α calculation unit that takes as input the occupancy probability and the squared difference, and generates a scaling factor α as the dynamic variance adaptive parameter;
A speech parameter learning apparatus characterized by the above.

The speech parameter learning device according to claim 1 or 2; and
A recognition unit that takes as input the speech feature output by the speech parameter learning device and the variance of the Gaussian distributions of the acoustic model corrected by the speech parameter learning device, and outputs a word string;
A speech recognition apparatus comprising the above.

An adaptive speech preprocessing process in which an adaptive speech preprocessing unit takes an observed speech signal as input and generates a set of enhanced speech features, in which the speech features of each frame of the observed speech signal are enhanced, and a set of uncertainties representing the variation of the enhanced speech features;
An adaptive parameter generation process in which an adaptive parameter generation unit takes as input the set of enhanced speech features, the set of uncertainties, an acoustic model, and a teacher signal, and generates, as adaptive parameters for variance correction, a dynamic variance adaptive parameter that depends on the frame and a static variance adaptive parameter that is independent of the frame;
A recognition speech preprocessing process in which a recognition speech preprocessing unit takes the observed speech signal as input and generates a speech feature for each frame of the observed speech signal and an uncertainty representing the variation of the speech feature;
A variance dynamic correction process in which a variance dynamic correction unit takes as input the uncertainty of the speech feature, the adaptive parameters, and the acoustic model, and outputs, for each frame, the variance of the Gaussian distributions of the acoustic model corrected with the adaptive parameters;
A speech parameter learning method including the above processes.

The speech parameter learning method according to claim 4, wherein
The adaptive speech preprocessing process includes:
A speech enhancement process in which a speech enhancement unit generates an enhanced speech signal in which the speech features of each frame of the input observed speech signal are enhanced;
A feature calculation process in which a feature calculation unit calculates a feature for each frame of the observed speech signal;
An enhanced speech feature calculation process in which an enhanced speech feature calculation unit calculates an enhanced speech feature for each frame of the enhanced speech signal and generates the set of enhanced speech features; and
An uncertainty calculation process in which an uncertainty calculation unit calculates, from the enhanced speech feature of the enhanced speech signal and the feature of the observed speech signal, an uncertainty representing the variation of the enhanced speech feature, and generates the set of uncertainties of the enhanced speech features; and
The adaptive parameter generation process includes:
An occupancy probability calculation process in which an occupancy probability calculation unit takes as input the set of enhanced speech features, the set of uncertainties of the enhanced speech features, the acoustic model, and the teacher signal, and calculates the occupancy probability of HMM state n and mixture component m;
A clean speech variance calculation process in which a clean speech variance calculation unit takes as input the set of enhanced speech features, the set of uncertainties of the enhanced speech features, and the acoustic model, and calculates the variance of clean speech;
A scaling factor λ calculation process in which a scaling factor λ calculation unit calculates a scaling factor λ from the variance of the clean speech and the occupancy probability;
A squared difference calculation process in which a squared difference calculation unit takes as input the set of enhanced speech features, the set of uncertainties of the enhanced speech features, and the acoustic model, and calculates the expected value of the squared difference between the clean speech feature and the speech feature; and
A scaling factor α calculation process in which a scaling factor α calculation unit takes as input the occupancy probability, the uncertainty, and the squared difference, and generates the dynamic variance adaptive parameter;
A speech parameter learning method characterized by the above.

The speech parameter learning method according to claim 4 or 5,
A recognition process in which the recognition unit receives the speech feature amount generated by the speech parameter learning method and the variance of the Gaussian distribution of the corrected acoustic model, and outputs a word string;
A speech recognition method comprising:

A program for causing a computer to function as the speech parameter learning device according to claim 1.

A program for causing a computer to function as the speech recognition apparatus according to claim 3.

A computer-readable recording medium on which the program according to claim 7 is recorded.