JP5749186B2

JP5749186B2 - Acoustic model adaptation device, speech recognition device, method and program thereof

Info

Publication number: JP5749186B2
Application number: JP2012022908A
Authority: JP
Inventors: 太一浅見; 哲小橋川; 浩和政瀧
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2012-02-06
Filing date: 2012-02-06
Publication date: 2015-07-15
Anticipated expiration: 2032-02-06
Also published as: JP2013160930A

Description

The present invention relates to an acoustic model adaptation device that performs unsupervised acoustic model adaptation, a speech recognition device that uses the resulting acoustic model, methods for both, and a program.

When an acoustic model used for speech recognition is updated, the model parameters are optimized so that the model holds for as many of the examples in the training data as possible. This process is called "acoustic model adaptation" and generally uses, as training (adaptation) data, speech files together with correct text representing the utterance content of each file. Acoustic model adaptation is broadly divided into supervised adaptation, in which the correct text is obtained by a human transcribing the speech file, and unsupervised adaptation, in which it is obtained as the speech recognition result for the speech file. Because unsupervised adaptation does not require manually prepared correct text, it keeps the development cost of a speech recognition system low.

For example, Non-Patent Document 1 discloses a technique for unsupervised acoustic model adaptation in which the speech accumulated in the system is classified into several classes based on acoustic similarity, unsupervised acoustic model adaptation is performed on each class to generate an acoustic model per class, and, when speech is input to the system, the acoustic model of the appropriate class is selected and used for recognition. This yields higher speech recognition accuracy than ordinary unsupervised acoustic model adaptation.

In that technique, the accumulated speech is first divided into classes at random, a speech GMM (Gaussian Mixture Model) is trained for each class, and each speech item is reassigned to the class whose GMM gives the maximum likelihood. This is repeated to classify the speech into classes, and unsupervised acoustic model adaptation is then performed for each class. At recognition time, the likelihood of the input speech is calculated with the speech GMM of each class, and the acoustic model of the class with the maximum likelihood is selected for speech recognition.

Non-Patent Document 1: Sato et al., "Acoustic model adaptation by selective learning based on two-stage clustering," IEICE Transactions D-II, Vol. J85-D-II, No. 2, pp. 174-183, 2002.

In the conventional method disclosed in Non-Patent Document 1, the likelihood calculated with a speech GMM is used as the similarity between speech items. In training a speech GMM, the types of phonemes uttered in the speech are not considered; all phonemes are treated alike, and a single set of speech GMM parameters is determined. Consequently, when the distance between speech GMMs is used as the similarity between speech items, phonemes that differ when examined phoneme by phoneme can be rated as highly similar, and greatly differing phonemes may end up in the same class. Even if acoustic model adaptation is performed on such a class, it is difficult to set parameters that correctly recognize the greatly differing phonemes (fitting one makes the other unrecognizable), so the effect of adaptation is reduced.

The present invention has been made in view of this problem. Its object is to provide an acoustic model adaptation device that performs unsupervised acoustic model adaptation by evaluating similarity phoneme by phoneme so that highly similar speech is classified into the same class, together with a speech recognition device, methods for both, and a program.

The acoustic model adaptation device of the present invention comprises a speech recognition unit, a phoneme error tendency vector generation unit, a clustering unit, a base acoustic model, a base acoustic model adaptation unit, and a post-adaptation acoustic model recording unit. The speech recognition unit takes as input a speech group consisting of a plurality of speech items, and outputs each speech item together with the speech recognition result text obtained by recognizing it with the base acoustic model. The phoneme error tendency vector generation unit extracts acoustic features from the input speech frame by frame; obtains the output probability of each frame's acoustic features for all states of all phonemes in the base acoustic model; divides the maximum of these output probabilities by their sum over the frame to obtain the posterior probability of the first-ranked phoneme; averages these first-ranked posterior probabilities per phoneme over each speech item to obtain phoneme posterior probabilities; arranges the phoneme posterior probabilities of the speech item into a posterior probability vector; and subtracts from it a previously obtained posterior probability vector for the entire speech group to generate the phoneme error tendency vector of that speech item. The clustering unit takes as input the triples of speech, speech recognition result text, and phoneme error tendency vector; classifies the triples into a predetermined number of clusters using the similarity between phoneme error tendency vectors as the measure; obtains for each cluster a centroid, which is the mean of the phoneme error tendency vectors in that cluster; and outputs the clusters and centroids. The base acoustic model adaptation unit takes the clusters and centroids as input and, based on the speech and speech recognition result text in each cluster, generates a post-adaptation acoustic model by adapting the base acoustic model for each cluster. The post-adaptation acoustic model recording unit records the post-adaptation acoustic model for each cluster.

The speech recognition device of the present invention comprises a post-adaptation acoustic model recording unit, a phoneme error tendency vector generation unit, a nearest-centroid selection unit, an applied acoustic model, and a speech recognition unit. The post-adaptation acoustic model recording unit records, as pairs, the post-adaptation acoustic models generated for the plurality of clusters by the acoustic model adaptation device described above and the centroids of those clusters. The phoneme error tendency vector generation unit extracts acoustic features from the speech to be recognized frame by frame; obtains the output probability of each frame's acoustic features for all states of all phonemes in the base acoustic model; divides the maximum of these output probabilities by their sum over the frame to obtain the posterior probability of the first-ranked phoneme; averages these first-ranked posterior probabilities per phoneme over the speech item to obtain phoneme posterior probabilities; arranges them into a posterior probability vector; and subtracts from it the previously obtained posterior probability vector for the entire speech group, supplied from outside, to generate a recognition-time phoneme error tendency vector. The nearest-centroid selection unit selects the post-adaptation acoustic model whose paired centroid has the greatest similarity to the recognition-time phoneme error tendency vector, and outputs it as the applied acoustic model. The speech recognition unit recognizes the speech to be recognized using the applied acoustic model and outputs the speech recognition result text.
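The nearest-centroid selection just described can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the invention: the function names `cosine` and `select_model`, the model labels, and the toy two-dimensional vectors are hypothetical, and cosine similarity is used only because the description later names it as one example of a similarity measure.

```python
import math

def cosine(v1, v2):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(v1, v2))
    n1 = math.sqrt(sum(a * a for a in v1))
    n2 = math.sqrt(sum(b * b for b in v2))
    return dot / (n1 * n2) if n1 and n2 else 0.0

def select_model(vector, recorded):
    """Return the model whose paired centroid is most similar to `vector`.

    recorded: list of (centroid, model) pairs, as held by the recording unit.
    """
    return max(recorded, key=lambda pair: cosine(vector, pair[0]))[1]

# Toy data (hypothetical): two clusters, each with a centroid and a model label.
recorded = [([0.15, -0.10], "am_1"), ([-0.15, 0.125], "am_2")]
applied = select_model([-0.05, 0.02], recorded)
print(applied)
```

In a real system the vectors would have one dimension per phoneme and the second element of each pair would be the adapted model itself rather than a label.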

The acoustic model adaptation device of the present invention obtains, for each speech item, a phoneme error tendency vector that reflects per-phoneme tendencies and has as many dimensions as there are phonemes, and classifies each speech item and its speech recognition result text into a plurality of clusters based on the similarity between these vectors. Each cluster therefore contains speech with similar phoneme error tendencies. Because the base acoustic model is adapted without supervision using speech with similar phoneme error tendencies and its recognition result text, the adaptation effect of the acoustic model can be increased.

The speech recognition device of the present invention obtains the recognition-time phoneme error tendency vector of the speech to be recognized, selects as the applied acoustic model the post-adaptation acoustic model, produced by the acoustic model adaptation device of the present invention, whose centroid (a phoneme error tendency vector) is highly similar to that recognition-time vector, and performs speech recognition with the applied acoustic model. The recognition accuracy can therefore be increased.

A diagram showing an example of the functional configuration of the acoustic model adaptation device 100 of the present invention.
A diagram showing the operation flow of the acoustic model adaptation device 100.
A diagram showing a more specific example of the functional configuration of the phoneme error tendency vector generation unit 20.
A diagram schematically showing the relationship between the per-frame acoustic features O_it and the output probabilities.
A diagram schematically showing the relationship between the posterior probability of the first-ranked phoneme and the posterior probability of each phoneme.
A diagram showing an example of the functional configuration of the speech recognition device 200 of the present invention.
A diagram showing the operation flow of the speech recognition device 200.
A diagram showing the results of an evaluation experiment.

Embodiments of the present invention are described below with reference to the drawings. Identical elements in multiple drawings are given the same reference numerals, and their description is not repeated.

FIG. 1 shows an example of the functional configuration of the acoustic model adaptation device 100 of the present invention, and FIG. 2 shows its operation flow. The acoustic model adaptation device 100 comprises a speech recognition unit 10, a phoneme error tendency vector generation unit 20, a clustering unit 30, a base acoustic model 40, a base acoustic model adaptation unit 50, and a post-adaptation acoustic model recording unit 60. The device is realized by loading a predetermined program into a computer composed of, for example, a ROM, a RAM, and a CPU, and having the CPU execute that program.

The speech recognition unit 10 takes as input a speech group consisting of a plurality of speech items, and outputs each speech item together with the speech recognition result text obtained by recognizing it with the base acoustic model 40 (step S10). The speech is given as a digital signal, and recognition is performed frame by frame, where one frame is a predetermined number of digital speech samples (for example, 20 ms). The speech recognition result text is output paired with its speech. The speech recognition unit 10 is implemented with conventional technology.

The phoneme error tendency vector generation unit 20 extracts acoustic features from the digital speech signal frame by frame; obtains the output probability of each frame's acoustic features for all states of all phonemes in the base acoustic model 40; divides the maximum of these output probabilities by their sum over the frame to obtain the posterior probability of the first-ranked phoneme; averages these first-ranked posterior probabilities per phoneme over each speech item to obtain phoneme posterior probabilities; arranges them into a posterior probability vector; and subtracts from it the previously obtained posterior probability vector for the entire speech group to generate the phoneme error tendency vector of that speech item (step S20). Each of the speech items making up the speech group is, for example, a recording of a telephone conversation of several minutes, a recording of a lecture of about one hour, or the audio track of a video. The posterior probability vector for the entire speech group is the posterior probability vector obtained by applying, to the whole speech group that is input in order to build the acoustic models, the processing that the phoneme error tendency vector generation unit 20 performs up to the generation of a posterior probability vector; it is obtained in advance. It may be supplied from outside or recorded in advance in the phoneme error tendency vector generation unit 20. The operation of the phoneme error tendency vector generation unit 20, including the posterior probability vector for the entire speech group, is described in detail later.

The clustering unit 30 takes as input the whole set of triples, each consisting of the speech recognition result text and speech output by the speech recognition unit 10 and the phoneme error tendency vector of that speech output by the phoneme error tendency vector generation unit 20. Using the similarity between phoneme error tendency vectors as the measure, it classifies the triples into a predetermined number (k) of clusters, obtains for each cluster the centroid, which is the mean of the phoneme error tendency vectors in that cluster, and outputs the clusters and centroids (step S30). Each cluster is a subset of the triples of speech, speech recognition result text, and phoneme error tendency vector. The number of classes k is the number of clusters into which the whole set is divided, for example k = 48. It may be supplied to the clustering unit 30 from outside or set in the clustering unit 30 in advance. As a guideline, k is set to roughly 50 to 100.

The base acoustic model adaptation unit 50 takes as input the clusters and centroids output by the clustering unit 30 and, based on the speech and speech recognition result text in each cluster, generates a post-adaptation acoustic model by adapting the base acoustic model 40 for each cluster. It outputs each post-adaptation acoustic model paired with the centroid of its cluster (step S50). The base acoustic model adaptation unit 50 consists of k acoustic model adaptation units 50_1 to 50_k, one for each of the k clusters.

The post-adaptation acoustic model recording unit 60 records the pair of post-adaptation acoustic model and centroid for each cluster (step S60). The post-adaptation acoustic models 1 to k are recorded separately as post-adaptation acoustic models 60_1 to 60_k.

Operating in this way, the base acoustic model 40 is converted into a post-adaptation acoustic model for each subset (cluster) of speech with similar phoneme error tendencies, through unsupervised acoustic model adaptation using that speech and its speech recognition result text. The acoustic model adaptation device 100 can therefore achieve a greater adaptation effect than a conventional acoustic model adaptation device.

[Phoneme error tendency vector generation unit]
The function of each part of the acoustic model adaptation device 100 is now described in more detail. FIG. 3 shows an example of the functional configuration of the phoneme error tendency vector generation unit 20. The phoneme error tendency vector generation unit 20 comprises acoustic feature extraction means 21, output probability calculation means 22, output probability sum calculation means 23, maximum-output-probability phoneme posterior probability calculation means 24, phoneme posterior probability calculation means 25, posterior probability vector generation means 26, and phoneme error tendency vector generation means 27.

The acoustic feature extraction means 21 extracts, frame by frame, the acoustic features O_it of each speech item U_i among the plurality of speech items U_all (U_1, U_2, ..., U_n). Here t is the frame number within speech U_i. The acoustic features O_it are calculated by, for example, mel-frequency cepstral coefficient (MFCC) analysis. They may instead be obtained from the speech recognition unit 10, in which case the acoustic feature extraction means 21 is unnecessary.

For each frame, the output probability calculation means 22 calculates the output probability of that frame's acoustic features O_it for all states of the monophone models of all phonemes in the base acoustic model 40. The output probability of the acoustic features O_it of the t-th frame is written P(O_it|s_j), where s_j is the j-th state of the monophone HMMs in the base acoustic model 40.

FIG. 4 schematically shows the relationship between the per-frame acoustic features O_it and the output probabilities. Acoustic features O_it are extracted from the beginning of speech U_i for each frame of, for example, 20 ms. For the eighth frame t_i8, the output probability of its acoustic features O_i8 is calculated for all states of the monophone models of all phonemes in the base acoustic model 40. "All phonemes" means the phonemes corresponding to romanized notation such as /a/, /i/, /u/, ..., /N/; there are, for example, 26 of them. "All states of the monophone models" means the first through third states of each of /a/ to /N/; with 26 phonemes and 3 states each, there are 78 states. The output probability calculation means 22 calculates the output probabilities P(O_i8|a_1) to P(O_i8|N_3) for all of these states; in the state notation, these are P(O_i8|s_1) to P(O_i8|s_78).

The output probability sum calculation means 23 calculates the sum P(O_it) of the output probabilities of each frame using equation (1).

For each frame, the maximum-output-probability phoneme posterior probability calculation means 24 takes ^p to be the phoneme with the maximum output probability in that frame, and calculates the value obtained by dividing that maximum output probability (equation (2)) by the sum P(O_it) of the frame's output probabilities as the posterior probability P(^p|O_it) of the frame's first-ranked phoneme (equation (3)). FIG. 4 shows an example in which the second state of phoneme /i/ gives the maximum output probability.
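Equations (1) to (3) are not reproduced in this text. From the definitions above they can plausibly be reconstructed as follows, where j runs over all states of all monophone models; the exact notation of the original equations is an assumption.

```latex
\begin{align}
P(O_{it}) &= \sum_{j} P(O_{it} \mid s_j) \tag{1} \\
P(O_{it} \mid \hat{p}) &= \max_{j} P(O_{it} \mid s_j) \tag{2} \\
P(\hat{p} \mid O_{it}) &= \frac{P(O_{it} \mid \hat{p})}{P(O_{it})} \tag{3}
\end{align}
```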

The numerator of equation (3) may be the output probability of the single state that gives the maximum output probability, or the sum of the output probabilities of the first through third states of that phoneme.
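As a rough illustration of this per-frame calculation, the sketch below takes the single-state variant of the numerator. The function name `top_phoneme_posterior`, the two-phoneme inventory, and the probability values are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def top_phoneme_posterior(state_probs, state_to_phoneme):
    """Return (first-ranked phoneme, its posterior probability) for one frame.

    state_probs: output probabilities P(O_it | s_j), one per HMM state.
    state_to_phoneme: phoneme label of each state, in the same order.
    """
    total = sum(state_probs)  # sum of the frame's output probabilities
    best = max(range(len(state_probs)), key=state_probs.__getitem__)
    # Single-state numerator: the maximum output probability itself.
    return state_to_phoneme[best], state_probs[best] / total

# Toy frame (hypothetical): two phonemes with three states each.
phoneme, posterior = top_phoneme_posterior(
    [0.01, 0.02, 0.01, 0.05, 0.30, 0.11],
    ["a", "a", "a", "i", "i", "i"],
)
print(phoneme, round(posterior, 3))
```

Here the second state of /i/ has the largest output probability, so /i/ is the first-ranked phoneme and its posterior is 0.30 divided by the frame total of 0.50.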

The phoneme posterior probability calculation means 25 obtains, for each speech item U_i, the per-phoneme average over frames of the first-ranked phoneme posterior probabilities, and takes it as the posterior probability of each phoneme. FIG. 5 schematically shows the relationship between the first-ranked phoneme posterior probabilities and the per-phoneme posterior probabilities. It shows the case where the first-ranked phoneme in frames t_8, t_24, and t_n of speech U_i is /i/. The first-ranked posterior probabilities for other phonemes are omitted from the figure.

Specifically, the phoneme posterior probability calculation means 25 sums, over all frames in speech U_i, the posterior probabilities of the frames in which phoneme /i/ ranks first, and divides the sum by the number of such frames to obtain the phoneme posterior probability of /i/. The phoneme posterior probabilities of the other phonemes are calculated in the same way.

The posterior probability vector generation means 26 arranges the phoneme posterior probabilities of the individual phonemes to generate the posterior probability vector. They may be arranged, for example, in ascending lexicographic order of phoneme name. When speech U_i is short, few phonemes appear, so the posterior probability vector may be sparse. However, because a long unit such as one call in a telephone conversation is treated as one piece of adaptation data, as noted above, phoneme posterior probabilities are in most cases available for all phonemes. If there are, for example, 26 phonemes, the posterior probability vector is a 26-dimensional vector.
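The averaging and ordering steps can be sketched together as below. The function name `posterior_vector` and the toy frame data are hypothetical, and assigning 0.0 to a phoneme that never ranks first is an assumption: the description only notes that the vector may be sparse in that case.

```python
from collections import defaultdict

def posterior_vector(frames, phonemes):
    """Build the posterior probability vector of one speech item.

    frames: (first-ranked phoneme, posterior) pairs, one per frame.
    phonemes: the full phoneme inventory.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for phoneme, posterior in frames:
        sums[phoneme] += posterior
        counts[phoneme] += 1
    # Average over the frames where each phoneme ranked first, with the
    # elements in lexicographic order of phoneme name. A phoneme that never
    # ranked first gets 0.0 (assumption).
    return [sums[p] / counts[p] if counts[p] else 0.0 for p in sorted(phonemes)]

# Toy speech item (hypothetical): /i/ ranks first in two frames, /a/ in one.
vec = posterior_vector([("i", 0.6), ("a", 0.5), ("i", 0.8)], ["u", "i", "a"])
print(vec)
```

The elements correspond to /a/, /i/, /u/ in that order, so the /i/ entry is the mean of 0.6 and 0.8.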

The phoneme error tendency vector generation means 27 generates the phoneme error tendency vector of each speech item U_i by subtracting the previously obtained posterior probability vector for the entire speech group, supplied from outside, from the posterior probability vector output by the posterior probability vector generation means 26. The posterior probability vector for the entire speech group is generated by applying, to all the speech U_i input to the acoustic model adaptation device 100, the processing that the phoneme error tendency vector generation unit 20 performs up to the generation of a posterior probability vector. Because the same base acoustic model 40 is used for both, the posterior probability vector and the posterior probability vector for the entire speech group have the same number of dimensions.

In other words, the phoneme error tendency vector generation means 27 normalizes the posterior probability of each phoneme in each speech item U_i by the corresponding posterior probability over all the speech used to build the acoustic models. The value of each element of the resulting phoneme error tendency vector is an error tendency score that represents the recognition error tendency of that phoneme. For a given speech item U_i, a positive error tendency score means that the phoneme is recognized more reliably than usual, and a negative score means that the phoneme is more likely than usual to be misrecognized.
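A minimal sketch of this subtraction and of how the sign of each score is read follows. The function name `error_tendency_vector` and the three-phoneme vectors are hypothetical values for illustration only.

```python
def error_tendency_vector(speech_vec, group_vec):
    """Subtract the group-wide posterior vector from one speech item's vector."""
    if len(speech_vec) != len(group_vec):
        raise ValueError("both vectors must come from the same phoneme inventory")
    return [s - g for s, g in zip(speech_vec, group_vec)]

# Toy vectors (hypothetical), with elements ordered /a/, /i/, /u/.
scores = error_tendency_vector([0.5, 0.7, 0.4], [0.6, 0.6, 0.4])
for name, score in zip(["a", "i", "u"], scores):
    if score > 1e-12:
        reading = "recognized more reliably than usual"
    elif score < -1e-12:
        reading = "more error-prone than usual"
    else:
        reading = "about average"
    print(name, round(score, 3), reading)
```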

[Clustering unit]
The clustering unit 30 takes as input the triples consisting of the speech recognition result text and speech U_i output by the speech recognition unit 10 and the phoneme error tendency vector of that speech U_i output by the phoneme error tendency vector generation unit 20. Using the similarity between phoneme error tendency vectors as the measure, it classifies the triples into a predetermined number (k) of clusters, obtains for each cluster the centroid, which is the mean of the phoneme error tendency vectors in that cluster, and outputs the clusters and centroids.

Classification into the k clusters uses the similarity between vectors as the measure. Cosine similarity, for example, can be used. The cosine similarity S(V_1, V_2) between two vectors V_1 and V_2 with the same number of dimensions is calculated by equation (4).

Here V_1・V_2 is the inner product of vectors V_1 and V_2, and |V_1| and |V_2| are the norms of V_1 and V_2, respectively.
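Equation (4) is not reproduced in this text. Given the terms just defined, it is presumably the standard cosine similarity:

```latex
\begin{equation}
S(V_1, V_2) = \frac{V_1 \cdot V_2}{|V_1|\,|V_2|} \tag{4}
\end{equation}
```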

The clustering unit 30 takes the phoneme error tendency vectors of the speech items U_i as V_1 and V_2, calculates the cosine similarity between speech items with equation (4), and uses those values to classify each speech item U_i into one of the k clusters. It then averages the phoneme error tendency vectors within each cluster to obtain that cluster's centroid. The classification of the speech items U_i into k clusters can use, for example, the well-known k-means method, which is relatively fast even on large-scale data. Because the classification process itself is not the main part of the present invention, a detailed description is omitted.
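Since the description leaves the classification procedure open, the following is only one possible sketch: a k-means-style loop that assigns each vector to the centroid with the highest cosine similarity and recomputes each centroid as the mean of its members. The function names, the deterministic initialization from the first k vectors, the iteration limit, and the toy data are all assumptions.

```python
import math

def cosine(v1, v2):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(v1, v2))
    n1 = math.sqrt(sum(a * a for a in v1))
    n2 = math.sqrt(sum(b * b for b in v2))
    return dot / (n1 * n2) if n1 and n2 else 0.0

def cluster(vectors, k, iterations=20):
    """Return (cluster index per vector, list of k centroids)."""
    centroids = [list(v) for v in vectors[:k]]  # assumed initialization
    assignment = []
    for _ in range(iterations):
        new_assignment = [
            max(range(k), key=lambda c: cosine(v, centroids[c])) for v in vectors
        ]
        if new_assignment == assignment:
            break  # assignments are stable, so the centroids are too
        assignment = new_assignment
        for c in range(k):
            members = [v for v, a in zip(vectors, assignment) if a == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids

# Toy phoneme error tendency vectors (hypothetical, two phonemes only).
vectors = [[0.1, -0.1], [-0.1, 0.1], [0.2, -0.1], [-0.2, 0.15]]
assignment, centroids = cluster(vectors, k=2)
print(assignment, centroids)
```

In practice each vector would stay attached to its speech and recognition result text, so that the resulting clusters can be passed on for adaptation as triples.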

[Base acoustic model adaptation unit]
The base acoustic model adaptation unit 50 consists of acoustic model adaptation units 50_1 to 50_k, one for each of the k clusters. Each of the acoustic model adaptation units 50_1 to 50_k receives as input one of the k clusters, each of which is a subset of the group of triples of speech, speech recognition result text, and phoneme error tendency vector. Based on the speech U_i and the speech recognition result text included in its cluster, each unit adapts the base acoustic model 40 to that cluster without supervision to generate a post-adaptation acoustic model, and outputs the pair of the post-adaptation acoustic model and the centroid of the cluster.

The acoustic model adaptation algorithm is conventional; for example, the method described in the reference (J.-L. Gauvain and C.-H. Lee, "Maximum a Posteriori Estimation for Multivariate Gaussian Mixture Observations of Markov Chains," IEEE Trans. on Speech and Audio Processing, 2(2), pp. 291-298, 1994) is used.
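The description only cites the reference for the adaptation algorithm, so the following is a hedged sketch of one well-known piece of MAP adaptation, the mean update of a single Gaussian, and not the procedure used in the patent. The function name `map_adapt_mean`, the prior weight `tau`, and the assumption that per-frame occupancies for this Gaussian are already available (for example from an alignment against the recognition result text) are illustrative.

```python
import numpy as np

def map_adapt_mean(prior_mean, frames, occupancies, tau=10.0):
    """MAP update of one Gaussian mean.

    prior_mean  : mean of the Gaussian in the base acoustic model
    frames      : adaptation feature vectors, shape (T, D)
    occupancies : weight of each frame for this Gaussian, shape (T,)
    tau         : hypothetical prior weight; larger values stay closer
                  to the base model
    """
    prior_mean = np.asarray(prior_mean, dtype=float)
    frames = np.asarray(frames, dtype=float)
    gamma = np.asarray(occupancies, dtype=float)
    # Occupancy-weighted sum of the adaptation frames.
    weighted_sum = (gamma[:, None] * frames).sum(axis=0)
    # Interpolate between the prior mean and the data according to tau.
    return (tau * prior_mean + weighted_sum) / (tau + gamma.sum())

adapted = map_adapt_mean([0.0, 0.0], [[2.0, 2.0], [2.0, 2.0]], [1.0, 1.0], tau=2.0)
```

With little adaptation data the updated mean stays near the base model, and with more data it moves toward the data mean, which is why this kind of update suits per-cluster adaptation on small subsets.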

The post-adaptation acoustic model obtained by adapting the base acoustic model 40 to each cluster without supervision is recorded, paired with the centroid, in the post-adaptation acoustic model recording unit 60 for each cluster. Because each post-adaptation acoustic model results from unsupervised acoustic model adaptation over a group of highly similar speech units, the adaptation is highly effective.

Next, the speech recognition device 200, which performs speech recognition processing using the post-adaptation acoustic models generated by the acoustic model adaptation device 100 of the present invention, is described.

[Speech recognition device]
FIG. 6 shows an example of the functional configuration of the speech recognition device 200 of the present invention, and FIG. 7 shows its operation flow. The speech recognition device 200 comprises the post-adaptation acoustic model recording unit 60, a recognition-time phoneme error tendency vector generation unit 210, the base acoustic model 40, a nearest centroid selection unit 220, an applied acoustic model 230, and a speech recognition unit 240. As the reference numerals indicate, the post-adaptation acoustic model recording unit 60 and the base acoustic model 40 are the same as those of the acoustic model adaptation device 100. The acoustic model adaptation device 100 is realized by loading a predetermined program into a computer composed of, for example, a ROM, a RAM, and a CPU, and having the CPU execute the program.

The post-adaptation acoustic model recording unit 60 records the plurality of clusters, each holding the post-adaptation acoustic model generated by the acoustic model adaptation device 100 described above, paired with their centroids.

The recognition-time phoneme error tendency vector generation unit 210 extracts acoustic features from the recognition target speech frame by frame and obtains the output probability of each frame's acoustic features for all states of all phonemes included in the base acoustic model. It divides the maximum of these output probabilities by the sum of the frame's output probabilities to obtain the posterior probability of the first-ranked phoneme, averages these posterior probabilities per phoneme over the speech unit to obtain phoneme posterior probabilities, and arranges the phoneme posterior probabilities for the speech unit into a posterior probability vector. It then subtracts from this vector the previously obtained whole-speech-group posterior probability vector of the entire speech group, generating the recognition-time phoneme error tendency vector of the speech (step S210). This processing is the same as that performed by the phoneme error tendency vector generation unit 20 of the acoustic model adaptation device 100; the only difference is that the input to the recognition-time phoneme error tendency vector generation unit 210 is the recognition target speech.
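The vector computation in step S210 can be sketched as below, assuming the per-frame output probabilities have already been computed by some acoustic model front end. The names `posterior_vector` and `error_tendency_vector`, the `state_to_phoneme` mapping, and the choice to use 0 for a phoneme that is never first-ranked in the speech are assumptions of this example and are not specified in the description.

```python
import numpy as np

def posterior_vector(output_probs, state_to_phoneme, n_phonemes):
    """Average first-ranked-phoneme posterior per phoneme for one speech.

    output_probs     : array (frames, states) of output probabilities
    state_to_phoneme : array (states,) giving the phoneme index of each state
    """
    probs = np.asarray(output_probs, dtype=float)
    state_to_phoneme = np.asarray(state_to_phoneme)
    # Maximum output probability divided by the sum over all states.
    top_posterior = probs.max(axis=1) / probs.sum(axis=1)
    # Phoneme that owns the best-scoring state in each frame.
    top_phoneme = state_to_phoneme[probs.argmax(axis=1)]
    sums = np.bincount(top_phoneme, weights=top_posterior, minlength=n_phonemes)
    counts = np.bincount(top_phoneme, minlength=n_phonemes)
    # Assumption: phonemes never ranked first in this speech get 0.
    return np.divide(sums, counts, out=np.zeros(n_phonemes), where=counts > 0)

def error_tendency_vector(output_probs, state_to_phoneme, n_phonemes, group_vector):
    """Posterior probability vector minus the whole-speech-group vector."""
    vec = posterior_vector(output_probs, state_to_phoneme, n_phonemes)
    return vec - np.asarray(group_vector, dtype=float)

# Three frames, three states; states 0 and 1 belong to phoneme 0, state 2 to phoneme 1.
probs = [[3.0, 1.0, 1.0], [1.0, 1.0, 8.0], [2.0, 1.0, 1.0]]
vec = posterior_vector(probs, [0, 0, 1], n_phonemes=2)
tendency = error_tendency_vector(probs, [0, 0, 1], 2, group_vector=[0.5, 0.9])
```

Here phoneme 0 is first-ranked in two frames with posteriors 0.6 and 0.5, and phoneme 1 in one frame with posterior 0.8, so the posterior probability vector is (0.55, 0.8) before the group vector is subtracted.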

The nearest centroid selection unit 220 selects, as the applied acoustic model, the post-adaptation acoustic model whose paired centroid, among the centroids recorded with the plurality of clusters in the post-adaptation acoustic model recording unit 60, has the greatest similarity to the recognition-time phoneme error tendency vector, and outputs it (step S220). Each centroid recorded with the clusters in the post-adaptation acoustic model recording unit 60 is the average of the phoneme error tendency vectors in its cluster. The nearest centroid selection unit 220 evaluates the similarity between each centroid (itself a phoneme error tendency vector) and the recognition-time phoneme error tendency vector using the same cosine similarity (Expression (4)) as the clustering unit 30 of the acoustic model adaptation device 100, selects the post-adaptation acoustic model with the greatest similarity, and outputs it as the applied acoustic model 230.
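Step S220 amounts to an argmax over cosine similarities, as in the following sketch. The function name `select_model` and the use of string identifiers to stand in for stored post-adaptation acoustic models are assumptions of this example; nonzero vectors are assumed.

```python
import numpy as np

def select_model(query_vector, centroids, models):
    """Return the model paired with the centroid most similar to the query.

    query_vector : recognition-time phoneme error tendency vector
    centroids    : array (k, D), one centroid per cluster
    models       : sequence of k model handles, in the same order
    """
    q = np.asarray(query_vector, dtype=float)
    c = np.asarray(centroids, dtype=float)
    # Cosine similarity between the query and every centroid at once.
    sims = (c @ q) / (np.linalg.norm(c, axis=1) * np.linalg.norm(q))
    best = int(np.argmax(sims))
    return models[best], float(sims[best])

chosen, similarity = select_model(
    [0.2, 0.9],
    centroids=[[1.0, 0.0], [0.0, 1.0]],
    models=["am_cluster_0", "am_cluster_1"],  # hypothetical model identifiers
)
```

Only k similarity values are computed per input speech, which is far cheaper than scoring the speech against every post-adaptation acoustic model.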

The speech recognition unit 240 performs speech recognition processing on the recognition target speech based on the applied acoustic model 230 selected by the nearest centroid selection unit 220, and outputs the speech recognition result text (step S240).

With the speech recognition device 200, speech recognition processing uses an acoustic model adapted to speech whose phoneme error tendency is similar to that of the recognition target speech, so the recognition rate can be improved. In addition, whereas the conventional method selects an appropriate acoustic model by calculating the likelihood of the speech under each of a plurality of acoustic models, the method of the present invention evaluates the phoneme error tendency of each speech by generating a single recognition-time phoneme error tendency vector from a single base acoustic model, so an appropriate acoustic model can be selected quickly. As a result, the time lag between the input of speech and the output of the speech recognition result can be reduced.

[Evaluation experiment]
An evaluation experiment was conducted to confirm the effects of the acoustic model adaptation method and the speech recognition method of the present invention. The results are described below.

First, the experimental conditions are described. The speech data for the evaluation experiment consisted of 391 telephone conversation calls by 24 speakers, with 12 to 22 calls recorded per speaker. Two data sets were created: one collecting half of each speaker's calls (196 calls) and one collecting the remaining calls (196 calls). When one data set was recognized, the other was used as adaptation data, and evaluation used the average recognition rate of the two experiments.

The adaptation data was clustered with the k-means method, and the number of clusters was 48. MAP adaptation (reference: Gauvain et al., IEEE Trans. SAP, 2(2), 291-298, 1994) was used for unsupervised adaptation of the acoustic models.

Under these conditions, the recognition rates of the following four conditions were compared. The results are shown in FIG. 8, where the horizontal axis is the speaker ID and the vertical axis is the recognition rate. In FIG. 8, ■ is the recognition rate using the base acoustic model as is (baseline); × is the recognition rate using a single adapted model created from all the adaptation data without clustering (one cluster); ● is the recognition rate using an acoustic model selected from the clustered models (this invention); and ▲ is the recognition rate using acoustic models adapted after manually classifying the adaptation data by speaker (speaker-known condition).

Compared with one cluster (×), this invention (●) improved the recognition rate for 23 of the 24 speakers. The recognition rate for the remaining speaker (ID = 12) was almost the same, confirming the effectiveness of clustering and model selection based on the phoneme error tendency. As FIG. 8 also shows, the recognition rate of this invention (●) was close to that of the speaker-known condition (▲), which is considered the ideal condition. The average recognition rates over the 24 speakers were 84.55% for this invention and 84.50% for the speaker-known condition, confirming that the method of this invention achieves recognition performance equivalent to the speaker-known condition.

As described above, the speech recognition rate can be improved by the acoustic model adaptation device 100, which clusters the adaptation data according to the similarity of phoneme error tendencies in the speech data and adapts the base acoustic model without supervision to each resulting cluster to generate post-adaptation acoustic models, and by the speech recognition device 200, which selects and uses one of those post-adaptation acoustic models.

The processes described for the above devices and methods need not be executed in time series in the order described; they may also be executed in parallel or individually, depending on the processing capability of the device executing them or as needed.

When the processing means of the above devices are realized by a computer, the processing content of the functions that each device should have is described by a program. Executing this program on the computer realizes the processing means of each device on the computer.

The program describing this processing content can be recorded on a computer-readable recording medium. The computer-readable recording medium may be of any kind, for example a magnetic recording device, an optical disc, a magneto-optical recording medium, or a semiconductor memory. Specifically, a hard disk device, a flexible disk, a magnetic tape, or the like can be used as the magnetic recording device; a DVD (Digital Versatile Disc), DVD-RAM (Random Access Memory), CD-ROM (Compact Disc Read Only Memory), CD-R (Recordable)/RW (ReWritable), or the like as the optical disc; an MO (Magneto Optical disc) or the like as the magneto-optical recording medium; and an EEP-ROM (Electronically Erasable and Programmable-Read Only Memory) or the like as the semiconductor memory.

The program is distributed, for example, by selling, transferring, or lending a portable recording medium such as a DVD or CD-ROM on which the program is recorded. The program may also be distributed by storing it in a recording device of a server computer and transferring it from the server computer to other computers over a network.

Each means may be configured by executing a predetermined program on a computer, or at least part of the processing content may be realized in hardware.

Claims

An acoustic model adaptation device comprising:
a speech recognition unit that receives as input a speech group consisting of a plurality of speeches, performs speech recognition processing on each speech based on a base acoustic model, and outputs the resulting speech recognition result text together with the speech;
a phoneme error tendency vector generation unit that extracts acoustic features from the speech frame by frame, obtains the output probability of each frame's acoustic features for all states of all phonemes included in the base acoustic model, divides the maximum of the output probabilities by the sum of the frame's output probabilities to obtain the posterior probability of the first-ranked phoneme, averages the posterior probabilities of the first-ranked phoneme per phoneme over the speech unit to obtain phoneme posterior probabilities, arranges the phoneme posterior probabilities for the speech unit into a posterior probability vector, and subtracts from the posterior probability vector a previously obtained whole-speech-group posterior probability vector of the entire speech group to generate a phoneme error tendency vector of the speech;
a clustering unit that receives as input a group of triples each consisting of the speech, the speech recognition result text, and the phoneme error tendency vector, classifies the triples into a predetermined number of clusters using the similarity between phoneme error tendency vectors as the measure, obtains for each cluster a centroid that is the average vector of the phoneme error tendency vectors in the cluster, and outputs the clusters and the centroids;
a base acoustic model adaptation unit that receives the clusters and the centroids as input and, based on the speech and the speech recognition result text included in each cluster, generates for each cluster a post-adaptation acoustic model by adapting the base acoustic model; and
a post-adaptation acoustic model recording unit that records the post-adaptation acoustic model for each cluster.

The acoustic model adaptation device according to claim 1, wherein the phoneme error tendency vector generation unit comprises:
acoustic feature extraction means for receiving the speech as input, extracting acoustic features frame by frame, and outputting the acoustic features;
output probability calculation means for calculating the output probability of the acoustic features of each frame for all states of all monophone models of the base acoustic model;
output probability sum calculation means for calculating, for each frame, the sum of the output probabilities calculated for all the monophone models;
first-ranked phoneme posterior probability calculation means for taking, for each frame, the maximum output probability as the output probability of the first-ranked phoneme and dividing it by the sum of the output probabilities over all the states to obtain the posterior probability of the first-ranked phoneme;
phoneme posterior probability calculation means for calculating, for each phoneme, a phoneme posterior probability by dividing the sum over all frames of the posterior probabilities of the first-ranked phoneme by the number of frames in which that phoneme was the first-ranked phoneme;
posterior probability vector generation means for generating the posterior probability vector by arranging the phoneme posterior probabilities for the speech unit; and
phoneme error tendency vector generation means for subtracting from the posterior probability vector the whole-speech-group posterior probability vector of the entire speech group, supplied in advance from outside, to generate the phoneme error tendency vector of the speech.

A speech recognition device comprising:
a post-adaptation acoustic model recording unit that records a plurality of clusters, each including the post-adaptation acoustic model generated by the acoustic model adaptation device according to claim 1, paired with the centroids;
a recognition-time phoneme error tendency vector generation unit that extracts acoustic features from recognition target speech frame by frame, obtains the output probability of each frame's acoustic features for all states of all phonemes included in the base acoustic model, divides the maximum of the output probabilities by the sum of the frame's output probabilities to obtain the posterior probability of the first-ranked phoneme, averages the posterior probabilities of the first-ranked phoneme per phoneme over the speech unit to obtain phoneme posterior probabilities, arranges the phoneme posterior probabilities for the speech unit into a posterior probability vector, and subtracts from the posterior probability vector the whole-speech-group posterior probability vector of the entire speech group, supplied in advance from outside, to generate a recognition-time phoneme error tendency vector;
a nearest centroid selection unit that selects, as an applied acoustic model, the post-adaptation acoustic model whose paired centroid, among those recorded with the plurality of clusters, has the greatest similarity to the recognition-time phoneme error tendency vector, and outputs it; and
a speech recognition unit that performs speech recognition processing on the recognition target speech based on the applied acoustic model and outputs a speech recognition result text.

An acoustic model adaptation method comprising:
a speech recognition process of receiving as input a speech group consisting of a plurality of speeches, performing speech recognition processing on each speech based on a base acoustic model, and outputting the resulting speech recognition result text together with the speech;
a phoneme error tendency vector generation process of extracting acoustic features from the speech frame by frame, obtaining the output probability of each frame's acoustic features for all states of all phonemes included in the base acoustic model, dividing the maximum of the output probabilities by the sum of the frame's output probabilities to obtain the posterior probability of the first-ranked phoneme, averaging the posterior probabilities of the first-ranked phoneme per phoneme over the speech unit to obtain phoneme posterior probabilities, arranging the phoneme posterior probabilities for the speech unit into a posterior probability vector, and subtracting from the posterior probability vector a previously obtained whole-speech-group posterior probability vector of the entire speech group to generate a phoneme error tendency vector of the speech;
a clustering process of receiving as input a group of triples each consisting of the speech, the speech recognition result text, and the phoneme error tendency vector, classifying the triples into a predetermined number of clusters using the similarity between phoneme error tendency vectors as the measure, obtaining for each cluster a centroid that is the average vector of the phoneme error tendency vectors in the cluster, and outputting the clusters and the centroids;
a base acoustic model adaptation process of receiving the clusters and the centroids as input and, based on the speech and the speech recognition result text included in each cluster, generating for each cluster a post-adaptation acoustic model by adapting the base acoustic model; and
a post-adaptation acoustic model recording process of recording the post-adaptation acoustic model for each cluster.

A speech recognition method comprising:
a recognition-time phoneme error tendency vector generation process of extracting acoustic features from recognition target speech frame by frame, obtaining the output probability of each frame's acoustic features for all states of all phonemes included in the base acoustic model, dividing the maximum of the output probabilities by the sum of the frame's output probabilities to obtain the posterior probability of the first-ranked phoneme, averaging the posterior probabilities of the first-ranked phoneme per phoneme over the speech unit to obtain phoneme posterior probabilities, arranging the phoneme posterior probabilities for the speech unit into a posterior probability vector, and subtracting from the posterior probability vector the whole-speech-group posterior probability vector of the entire speech group, supplied in advance from outside, to generate a recognition-time phoneme error tendency vector;
a nearest centroid selection process of selecting, as an applied acoustic model, the post-adaptation acoustic model whose paired centroid, among those recorded with the plurality of clusters in the post-adaptation acoustic model recording unit according to claim 1 or 2, has the greatest similarity to the recognition-time phoneme error tendency vector, and outputting it; and
a speech recognition process of performing speech recognition processing on the recognition target speech based on the applied acoustic model and outputting a speech recognition result text.

A program for causing a computer to function as the acoustic model adaptation device according to claim 1 or the speech recognition device according to claim 3.