JP4964194B2

JP4964194B2 - Speech recognition model creation device and method thereof, speech recognition device and method thereof, program and recording medium thereof

Info

Publication number: JP4964194B2
Application number: JP2008178572A
Authority: JP
Inventors: 晋治渡部; 貴明堀; 篤中村
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2008-07-09
Filing date: 2008-07-09
Publication date: 2012-06-27
Anticipated expiration: 2028-07-09
Also published as: JP2010019941A

Description

The present invention relates to a speech recognition model creation apparatus and method for efficiently training speech recognition models, to a speech recognition apparatus and speech recognition method using that method, and to a program and a recording medium.

A speech recognition apparatus computes the likelihood between a sequence of acoustic feature vectors, obtained by analyzing an input speech signal, and an acoustic model of speech. Subject to linguistic constraints such as the vocabulary to be recognized and a language model expressing how readily words connect and the rules governing them, it outputs the word string with the highest likelihood as the recognition result. The characteristics of the speech to be recognized generally vary greatly with external conditions such as the speaker, the language, and noise. To recognize speech with such diverse characteristics, speech recognition methods that use a plurality of speech recognition models have been widely studied.

For example, Non-Patent Document 1 targets speech recognition in an acoustic environment where English and German are mixed, and shows an example in which preparing two speech recognition models, one for each language, achieves speech recognition that is robust to language switching. Non-Patent Document 2 targets speech recognition of mixed speakers in a meeting with many participants, and shows an example in which preparing a plurality of speech recognition models achieves speech recognition that is robust to speaker switching. In the example of Non-Patent Document 2, the performance of the speech recognition apparatus is improved by performing adaptive learning separately for each speaker's speech recognition model. Adaptive learning means adapting the limited speech recognition models stored in a speech recognition apparatus to acoustic characteristics that vary with the speaker and environment in actual use.

FIG. 9 shows an example of the functional configuration of a conventional speech recognition apparatus 900 provided with a plurality of acoustic models, and its operation is briefly described below. The speech recognition apparatus 900 includes a speech recognition model 90, an A/D conversion unit 91, a feature extraction unit 92, a speech recognition unit 93, and an adaptive learning unit 94.

The speech recognition model 90 is a set of speech recognition models corresponding to, for example, a plurality of languages or a plurality of speakers. For example, a first speech recognition model 901 for one speaker consists of a first acoustic model memory 901a, a first language model memory 901b, and a first utterance dictionary model memory 901c. A second speech recognition model 902 for another speaker likewise consists of a second acoustic model memory 902a, a second language model memory 902b, and a second utterance dictionary model memory 902c.

The A/D conversion unit 91 converts the input analog speech signal into a discrete digital signal at a sampling frequency of, for example, 16 kHz. The feature extraction unit 92 treats, for example, 320 samples of the discretized speech signal as one frame (20 ms) and extracts a feature vector from the speech signal of each frame. The feature vector is extracted by, for example, mel-frequency cepstral coefficient (MFCC) analysis. The speech recognition unit 93 consists of a score calculation unit 931 and a word string search unit 932. The score calculation unit 931 takes as input the feature vector and the language model and acoustic model from the speech recognition model 901, and calculates a score for the feature vector. The word string search unit 932 searches the utterance dictionary model memory 901c for the word string with the maximum score and outputs it as the recognition result. The adaptive learning unit 94 performs adaptation separately for the first speech recognition model 901 and the second speech recognition model 902, using the word string output by the word string search unit 932 as the teacher signal.

Non-Patent Document 1: Z. Wang, U. Topkara, T. Schultz, and A. Waibel, "Towards universal speech recognition," in Proc. ICMI 2002, 2002.
Non-Patent Document 2: Ryuta Takuma, Koji Iwano, and Sadaoki Furui, "Study on a parallel-processing meeting speech recognition system using sequential speaker adaptation," Proceedings of the Spring Meeting of the Acoustical Society of Japan, pp. 105-106, 2002.
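The framing performed by the feature extraction unit 92 can be illustrated with a minimal Python sketch. It uses the 16 kHz sampling frequency and 320-sample frame length given in the description; the function name `split_into_frames`, the use of non-overlapping frames, and the synthetic test tone are assumptions made for illustration, and the MFCC analysis itself is not shown.

```python
import numpy as np

SAMPLE_RATE_HZ = 16_000   # sampling frequency given in the description
FRAME_LEN = 320           # samples per frame given in the description

def split_into_frames(signal: np.ndarray, frame_len: int = FRAME_LEN) -> np.ndarray:
    """Cut a 1-D signal into consecutive, non-overlapping frames.

    Trailing samples that do not fill a whole frame are dropped.
    Non-overlapping frames are an assumption made for simplicity.
    """
    n_frames = len(signal) // frame_len
    return signal[: n_frames * frame_len].reshape(n_frames, frame_len)

# One second of a synthetic 440 Hz tone standing in for digitized speech.
t = np.arange(SAMPLE_RATE_HZ) / SAMPLE_RATE_HZ
frames = split_into_frames(np.sin(2 * np.pi * 440.0 * t))

# Frame duration in milliseconds: 320 samples at 16 kHz.
frame_ms = 1000.0 * FRAME_LEN / SAMPLE_RATE_HZ
```

Each row of `frames` would then be passed to a feature analysis such as MFCC to obtain one feature vector per frame.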

In conventional adaptive learning of a plurality of speech recognition models, adaptive learning is performed independently for each model, so the adaptation data is divided among the models and the amount of data allotted to each is smaller than in adaptive learning of a single model. As a result, the reduced amount of data limits the effect of adaptive learning.

The present invention has been made in view of this problem, and its object is to provide a speech recognition model creation apparatus and method capable of efficiently performing adaptive learning of a plurality of speech recognition models, a speech recognition apparatus and speech recognition method using that method, a program, and a recording medium for the program.

The speech recognition model creation apparatus of the present invention comprises an initial value speech recognition model recording unit, a likelihood calculation unit, a model update unit, and an updated speech recognition model recording unit. The initial value speech recognition model recording unit records an initial value speech recognition model, which is a single vector formed by concatenating the vectors that respectively represent the parameters of a plurality of speech recognition models, the plurality of speech recognition models corresponding respectively to a plurality of sound sources. The likelihood calculation unit takes as input the feature vectors of a speech signal and a set of state sequences, the latter being the result of recognizing the speech signal on the basis of state probability transitions obtained by applying a union operation to a plurality of speech recognition networks that correspond respectively to the speech recognition models, and calculates the likelihood of each state for each frame. The model update unit takes the likelihoods and the feature vectors as input and generates an updated speech recognition model by updating the initial value speech recognition model. The updated speech recognition model recording unit records the updated speech recognition model.

The speech recognition model creation apparatus of the present invention handles an initial value speech recognition model containing a plurality of speech recognition models as a single vector. It updates the initial value speech recognition model using speech recognition results obtained on the basis of state probability transitions formed by combining the plurality of speech recognition models. Because the plurality of speech recognition models can thus be learned together, a sufficient adaptive learning effect can be obtained even with a small amount of speech data.

Embodiments of the present invention are described below with reference to the drawings. Identical elements in different drawings are given the same reference numerals, and their description is not repeated.

[Basic idea of the present invention]
The basic idea of the speech recognition model creation method of the present invention is described first. The statistical speech recognition methods in wide use today use a probabilistic model to express the speech recognition process as the occurrence probability (likelihood function) of speech data and words (or phonemes, or HMMs (Hidden Markov Models)), and estimate the parameters for speech recognition using statistical criteria such as posterior probability maximization or likelihood maximization. The speech recognition model creation method of the present invention is the same in that it uses such statistical criteria.

The present invention differs from the conventional method in that it treats an initial value speech recognition model containing a plurality of speech recognition models as a single vector, and adapts that model using speech recognition results obtained on the basis of state probability transitions formed by combining the plurality of speech recognition models. Here, a speech recognition result is a speech symbol sequence such as a word sequence, a phoneme sequence, or an HMM state sequence; these are collectively called state sequences. A state sequence may be expressed not only as a single sequence but also as a set, such as the set of the n highest-scoring state sequences (n-best) or a sub-network of the speech recognition network such as a lattice. These are collectively called a set of state sequences.

Assume that the time series O = {O_{t=1}, O_{t=2}, ...} of acoustic feature vectors output by the feature extraction unit 92 is divided into a subset O_{e=1} = {O_{e=1,t=1}, O_{e=1,t=2}, ...} output by one sound source A and a subset O_{e=2} = {O_{e=2,t=1}, O_{e=2,t=2}, ...} output by another sound source B. Here, the speech recognition model corresponding to sound source A is denoted e = 1 and the one corresponding to sound source B is denoted e = 2. Two sound sources are used here for ease of explanation, but the same applies to three or more. For sound source A, let the time series of feature vectors be O_{e=1}, the speech recognition model be Θ_{e=1}, and the hidden variables be Z_{e=1} = {Z_{e=1,t=1}, Z_{e=1,t=2}, ...}. A hidden variable is a variable for which it cannot be observed which object it belongs to. For a speech recognition model using HMMs, the hidden variable Z_{e=1} represents the ID of the HMM state at each frame time. The complete-data likelihood function can then be expressed by equation (1).

Similarly, for the speech recognition model e = 2, the complete-data likelihood function can be expressed by equation (2).

Thus, if the subsets O_{e=1} and O_{e=2} of the time series are given in advance for each model, their likelihood functions can be given independently. In general, however, it is not known whether the speech data to be recognized comes from sound source A or sound source B. The present invention therefore newly introduces a hidden variable U_t = {Z_{e=1,t}, Z_{e=2,t}} that indicates which of sound source A and sound source B produces the speech at each frame time t. As a result, the overall hidden variable consists of Z_{e=1}, Z_{e=2}, and U = {U_{t=1}, U_{t=2}, ...}, as shown in equation (3).

In practice, Z does not take an arbitrary HMM state sequence at each time; rather, it takes a state sequence on a scored speech recognition network that reflects pronunciation rules (the pronunciation dictionary model) and how readily words connect (the language model). The speech recognition network (a time series of state probability transitions) commonly used in speech recognition is built by composing three networks: the HMM (H), the dictionary (L), and the grammar (G). The speech recognition network N is expressed by equation (4), which applies the composition operation to these networks.

Here, ○ denotes the composition operation and * denotes a loop of the network.

For a two-speaker dialogue in a single language, the dictionary and grammar models are the same, and it suffices to prepare time series of state probability transitions that differ only in the HMM network. Assuming that switching occurs between sentences, the speech recognition network N in this case can be built by joining two speech recognition networks as shown in equation (5).

Here, U(+) is a binary operation on networks that uses the union operation to join two networks so that their start and end points coincide. (+) denotes the union operation; it stands in for the symbol used in the equation, whose notation is the correct one. FIG. 1 conceptually shows the union operation. N_1 is a network with 14 states and 27 arcs, and N_2 is a network with 8 states and 8 arcs. Applying the union operation to N_1 and N_2 makes their start and end points coincide, so that the two networks are described in parallel. The operation in equation (5) means offering two alternatives, the speech recognition networks (H_{e=1}·L·G) and (H_{e=2}·L·G). Which network is selected is decided during the search by choosing the one with the higher score, such as the posterior probability or the likelihood. Thus, by searching the combined network expressed by equation (5), the speech recognition result and the set of state sequences used for the learning described later can be obtained even when a plurality of sound sources are mixed.
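The union operation can be sketched in Python as follows. The `Network` class, its field names, and the choice to merge the start and end states of the two networks into shared states are assumptions made for illustration; arc scores are omitted, and this is not the representation used in the patent.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Network:
    """A toy network. States are integers 0..n_states-1; arcs are
    (source, target, label) tuples. All names here are hypothetical."""
    n_states: int
    start: int
    end: int
    arcs: tuple

def union(n1: Network, n2: Network) -> Network:
    """Place n1 and n2 in parallel so that they share one start state
    and one end state. Assumes n2.start != n2.end."""
    # The start and end of n2 are identified with those of n1.
    mapping = {n2.start: n1.start, n2.end: n1.end}
    # Every other state of n2 receives a fresh number after n1's states.
    next_id = n1.n_states
    for s in range(n2.n_states):
        if s not in mapping:
            mapping[s] = next_id
            next_id += 1
    arcs2 = tuple((mapping[a], mapping[b], label) for a, b, label in n2.arcs)
    return Network(next_id, n1.start, n1.end, n1.arcs + arcs2)

# Two small example networks (far smaller than N_1 and N_2 in FIG. 1).
net_a = Network(3, 0, 2, ((0, 1, "a"), (1, 1, "a"), (1, 2, "b")))
net_b = Network(4, 0, 3, ((0, 1, "x"), (1, 2, "y"), (2, 3, "z")))
combined = union(net_a, net_b)
```

Under this merging convention the result has the states of both networks minus the two shared ones and all arcs of both, so a search over `combined` can follow either branch from the common start to the common end.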

The present invention provides a speech recognition network database that stores a speech recognition network formed by combining a plurality of speech recognition models as shown in equation (5). It also provides, as the initial value of the speech recognition model, the initial value speech recognition model μ^0 shown in equation (6). The mean vector μ of the speech recognition model updated by adaptive learning is likewise handled as a single vector of the same form as equation (6).
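Handling the mean vectors of several models as one vector can be illustrated with a short Python sketch. The array sizes and the helper names `to_supervector` and `from_supervector` are hypothetical, and the ordering of the entries is an assumption, since equation (6) is not reproduced in this text.

```python
import numpy as np

# Hypothetical sizes: each model has 2 Gaussians with 3-dimensional means.
means_e1 = np.array([[0.0, 1.0, 2.0], [3.0, 4.0, 5.0]])        # model e = 1
means_e2 = np.array([[10.0, 11.0, 12.0], [13.0, 14.0, 15.0]])  # model e = 2

def to_supervector(*models: np.ndarray) -> np.ndarray:
    """Concatenate the mean vectors of all models into one long vector."""
    return np.concatenate([m.ravel() for m in models])

def from_supervector(vec: np.ndarray, shapes: list) -> list:
    """Split the long vector back into one array of means per model."""
    out, pos = [], 0
    for shape in shapes:
        size = int(np.prod(shape))
        out.append(vec[pos:pos + size].reshape(shape))
        pos += size
    return out

mu0 = to_supervector(means_e1, means_e2)
back_e1, back_e2 = from_supervector(mu0, [means_e1.shape, means_e2.shape])
```

Because the round trip is lossless, an update computed on the single vector can be written back to the individual models afterwards.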

Equation (6) shows only the mean vectors, but the covariance matrices Σ, the mixture weights w, and the state transition probabilities a are likewise gathered into vectors, so that the plurality of speech recognition models is handled as a single vector. Using the initial value speech recognition model Θ, in which the plurality of speech recognition models is handled as a single vector, together with the hidden variable Z of equation (3), the likelihood function can be expressed by equation (7).

Thus, by using the synthesized network Z and the model parameters Θ, the present invention can realize speech recognition in a plurality of environments with a likelihood function of the same form as that of the single speech recognition models expressed by equations (1) and (2). In other words, by using equation (7), speech recognition in a plurality of environments can be realized without changing the speech recognition decoder (speech recognition unit).

The speech recognition model creation method of the present invention generates the updated speech recognition model Θ￣ used for speech recognition, shown in equation (8), from the likelihood function of equation (7) and the initial value speech recognition model Θ in which the plurality of speech recognition models is handled as a single vector.

The initial value speech recognition model Θ and the updated speech recognition model Θ￣ are expressed parametrically using a function F(·) and are related by the relation parameter φ. Because the speech recognition model creation method of the present invention can thus learn a plurality of speech recognition models at once as a single vector, a sufficient adaptive learning effect can be obtained even with a small amount of speech data.

FIG. 2 shows an example of the functional configuration of the speech recognition model creation apparatus 100 of the present invention and of a speech recognition apparatus 200 that includes it as a component. FIG. 3 shows the operation flow of the speech recognition model creation apparatus 100. The operation of the speech recognition model creation apparatus 100 is described with reference to FIGS. 2 and 3.

The speech recognition model creation apparatus 100 includes an initial value speech recognition model recording unit 10, a model update unit 12, a likelihood calculation unit 13, an updated speech recognition model recording unit 14, and a control unit 16. The speech recognition model creation apparatus 100 and the speech recognition apparatus 200 are realized by loading a predetermined program into a computer comprising, for example, a ROM, a RAM, and a CPU, and having the CPU execute the program.

The initial value speech recognition model recording unit 10 records an initial value speech recognition model containing a plurality of speech recognition models. The likelihood calculation unit 13 takes as input a set of state sequences recognized on the basis of state probability transitions formed by combining the plurality of speech recognition models, and calculates the likelihood of each state for each frame (step S13).

The relationship among frames, states, Gaussian distributions, and state probability transitions is now described. The phoneme models that make up a speech recognition model are built from states such as the one shown in FIG. 4. Each state i is expressed as a Gaussian mixture distribution M_i. The mixture M_i consists of, for example, three normal distributions N(μ_{i1}, Σ_{i1}), N(μ_{i2}, Σ_{i2}), and N(μ_{i3}, Σ_{i3}).
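The output probability of a state modeled as a mixture of three normal distributions can be sketched in Python as follows. The function names, the example weights, means, and covariances are all hypothetical; the mixture weights correspond to the w that the description introduces alongside the means and covariances.

```python
import numpy as np

def gaussian_pdf(x: np.ndarray, mean: np.ndarray, cov: np.ndarray) -> float:
    """Density of a multivariate normal distribution N(mean, cov) at x."""
    d = len(mean)
    diff = x - mean
    norm = np.sqrt((2.0 * np.pi) ** d * np.linalg.det(cov))
    return float(np.exp(-0.5 * diff @ np.linalg.solve(cov, diff)) / norm)

def state_output_prob(x, weights, means, covs) -> float:
    """Output probability of one state: weighted sum of its Gaussians."""
    return sum(w * gaussian_pdf(x, m, c) for w, m, c in zip(weights, means, covs))

# A hypothetical state i with three 2-dimensional Gaussians.
weights = [0.5, 0.3, 0.2]
means = [np.array([0.0, 0.0]), np.array([1.0, 0.0]), np.array([0.0, 2.0])]
covs = [np.eye(2), np.eye(2), np.eye(2)]

p = state_output_prob(np.array([0.0, 0.0]), weights, means, covs)
```

Values of this kind, evaluated for every state and every frame, are the emission scores that the likelihood calculation works with.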

A phoneme model is built as a probability chain of several to a dozen or so states i. FIG. 5 shows, as an example, a conceptual diagram of a phoneme model consisting of three states. The example in FIG. 5 is what is called a left-to-right HMM: three states i_1 (first state), i_2 (second state), and i_3 (third state) are arranged in a row, and the probability chain of states (state transitions) consists of the self-transitions a_11, a_22, and a_33 and the transitions to the next state a_12, a_23, and a_34. FIG. 6 shows the time-series relationship between state i and frame t. The horizontal axis is elapsed time, expressed as the frame number. The vertical axis is the state i at each frame. Each state i consists of a Gaussian mixture as shown in FIG. 4. A filled circle (●) marks the maximum likelihood state, that is, the state whose output probability score is highest within a frame. The maximum likelihood states arranged in time order form the maximum likelihood state sequence, which is output as the speech recognition result.

The likelihood calculation unit 13 obtains the likelihood p(O, Z_t = i | Θ^) of each state i by, for example, the forward-backward algorithm. The likelihood p(O, Z_t = i | Θ^) of each state i can be calculated by equation (9) using the forward coefficient α and the backward coefficient β. The likelihood p and the feature vectors O need not be recomputed by the likelihood calculation unit 13; the values obtained beforehand by the speech recognition apparatus may be stored and read out as needed.

The forward coefficient α and the backward coefficient β are calculated by equations (10) and (11) through the iterative computation of the maximum likelihood estimation method (EM algorithm).

Here, k is the index of a Gaussian distribution that makes up a state. a_ij is the state transition probability from state i to state j, w_jk is the mixture weight of Gaussian distribution k in state j, and N denotes a Gaussian distribution with mean vector μ_jk and covariance matrix Σ_jk. The hat (^) in equations (10) and (11) indicates that each of these parameters is the value estimated in the previous step of the iterative computation of the expectation-maximization method.
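The forward and backward recursions can be sketched in Python as follows. Equations (9) to (11) are not reproduced in this text, so the sketch uses the standard unscaled forward-backward recursion and assumes that the per-frame output probabilities of each state (the mixture-weighted Gaussian values) have already been computed into the array `emit`. The function name, the transition matrix, and the example values are hypothetical.

```python
import numpy as np

def forward_backward(init: np.ndarray, trans: np.ndarray, emit: np.ndarray):
    """Standard (unscaled) forward-backward recursion.

    init:  (S,)   initial state probabilities
    trans: (S, S) trans[i, j] = a_ij, transition probability from i to j
    emit:  (T, S) emit[t, j] = output probability of frame t in state j
    Returns alpha and beta, each of shape (T, S).
    """
    n_frames, n_states = emit.shape
    alpha = np.zeros((n_frames, n_states))
    beta = np.ones((n_frames, n_states))
    alpha[0] = init * emit[0]
    for t in range(1, n_frames):
        alpha[t] = (alpha[t - 1] @ trans) * emit[t]
    for t in range(n_frames - 2, -1, -1):
        beta[t] = trans @ (emit[t + 1] * beta[t + 1])
    return alpha, beta

# Hypothetical three-state left-to-right model observed over four frames.
init = np.array([1.0, 0.0, 0.0])
trans = np.array([[0.6, 0.4, 0.0],
                  [0.0, 0.7, 0.3],
                  [0.0, 0.0, 1.0]])
emit = np.array([[0.9, 0.1, 0.1],
                 [0.5, 0.4, 0.1],
                 [0.1, 0.6, 0.3],
                 [0.1, 0.2, 0.7]])

alpha, beta = forward_backward(init, trans, emit)
# Joint likelihood of the observations and of being in state i at frame t.
joint = alpha * beta
```

Every row of `joint` sums to the total likelihood of the observation sequence, which is a convenient consistency check. For long utterances a practical implementation would work with scaled or log-domain values to avoid underflow.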

The model update unit 12 takes the likelihoods as input and generates an updated speech recognition model by updating the initial value speech recognition model as a single vector (step S12). The updated speech recognition model recording unit 14 records the updated speech recognition model (step S14). The model update unit 12 and the updated speech recognition model recording unit 14 continue operating until the control unit 16 issues a signal instructing them to stop (No in step S16).

Because the model update unit 12 thus performs adaptive learning while treating the initial value speech recognition model containing a plurality of speech recognition models as a single vector, a sufficient adaptive learning effect can be obtained even with a small amount of speech data. FIG. 7 shows a detailed example of the functional configuration of the model update unit 12 of the speech recognition model creation apparatus 100, which is described in more detail below.

The model update unit 12 includes a posterior probability calculation unit 121, a relation parameter generation unit 122, and an update model generation unit 123. The posterior probability calculation unit 121 obtains the posterior probability of state i at frame time t by the calculation of equation (12) (step S121, FIG. 3). The posterior probability is the likelihood of each state i (equation (9)) normalized by the sum of the state likelihoods within the frame.
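The normalization described for the posterior probability can be sketched in a few lines of Python. The function name `state_posteriors` and the example likelihood values are hypothetical; the input is assumed to hold, for each frame, the likelihood of every state as obtained from equation (9).

```python
import numpy as np

def state_posteriors(joint: np.ndarray) -> np.ndarray:
    """Normalize per-frame state likelihoods so each frame sums to one.

    joint: (T, S) array, joint[t, i] = likelihood of state i at frame t.
    """
    return joint / joint.sum(axis=1, keepdims=True)

# Hypothetical likelihoods for two frames and three states.
joint = np.array([[0.02, 0.06, 0.02],
                  [0.01, 0.03, 0.06]])
post = state_posteriors(joint)
```

Each row of `post` is a probability distribution over the states for that frame.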

Among the parameters of a speech recognition model, the ones that contribute most to recognition performance are the mean vectors of the Gaussian distributions. The following description therefore deals with adaptive learning of the mean vectors. Focusing on the mean vectors of the speech recognition model, the auxiliary function Q can be rewritten in the concrete form shown in equation (13).

Here, ζ_{e,k,t} is the posterior probability assigned at frame time t to Gaussian distribution k of the speech recognition model e corresponding to sound source A. This per-Gaussian posterior probability ζ_{e,k,t} is calculated for each Gaussian distribution k by the posterior probability calculation unit 121 in the same way as the posterior probability of each state i.

The auxiliary function Q of equation (13) can be expressed by equation (14).

Here, ′ denotes matrix transposition. ζ_{e,k} and m_{e,k} are sufficient statistics given by equations (15) and (16), respectively.
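Equations (15) and (16) are not reproduced in this text. The Python sketch below assumes the forms usually used for such statistics: ζ_{e,k} as the sum over frames of ζ_{e,k,t}, and m_{e,k} as the sum over frames of ζ_{e,k,t} times the feature vector O_t. Both forms, the function name, and the example data are assumptions made for illustration.

```python
import numpy as np

def sufficient_stats(post: np.ndarray, obs: np.ndarray):
    """Accumulate statistics for the K Gaussians of one model e.

    post: (T, K) per-frame posteriors of each Gaussian
    obs:  (T, D) feature vectors
    Returns the occupancy counts (K,) and first-order sums (K, D).
    """
    occupancy = post.sum(axis=0)   # assumed form of zeta_{e,k}
    first_order = post.T @ obs     # assumed form of m_{e,k}
    return occupancy, first_order

# Hypothetical data: three frames, two Gaussians, 2-dimensional features.
post = np.array([[1.0, 0.0],
                 [0.5, 0.5],
                 [0.0, 1.0]])
obs = np.array([[1.0, 2.0],
                [3.0, 4.0],
                [5.0, 6.0]])
occupancy, first_order = sufficient_stats(post, obs)
```

Once these two quantities are accumulated, the individual frames are no longer needed for the mean update, which is what makes them sufficient statistics.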

Furthermore, the auxiliary function Q of equation (14) can be expressed by equation (17).

Here, μ is the single vector into which the plurality of speech recognition models is gathered, as shown in equation (18).

Furthermore, equations (19) and (20) hold. Thus the auxiliary function Q of the plurality of speech recognition models can be expressed as a quadratic form in the mean vector μ of all the speech recognition models (the first term on the right-hand side of equation (17)), so a stable solution is obtained. The adaptive learning of this embodiment assumes the linear transformation shown in equation (21) between the mean vector μ^0 of the initial value speech recognition model and the μ to be estimated.

Here, B = (A, b) and ξ = ((μ^0)′, 1)′. The off-diagonal components of the matrix A account for the correlations between the parameters of the different speech recognition models.
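The definitions B = (A, b) and ξ = ((μ^0)′, 1)′ can be checked numerically with a short Python sketch. It assumes that equation (21), which is not reproduced in this text, has the form μ = Bξ, so that it equals Aμ^0 + b; the numerical values are hypothetical.

```python
import numpy as np

# Hypothetical 4-dimensional initial mean vector (two models, 2 values each).
mu0 = np.array([1.0, 2.0, 3.0, 4.0])

# A starts as the identity; one off-diagonal entry couples a parameter of
# the first model to a parameter of the second model.
A = np.eye(4)
A[0, 2] = 0.5
b = np.array([0.1, 0.0, -0.1, 0.0])

# B = (A, b): append b as an extra column. xi = ((mu0)', 1)': append 1.
B = np.hstack([A, b[:, None]])
xi = np.append(mu0, 1.0)

# Assumed form of the transformation: mu = B xi, identical to A mu0 + b.
mu = B @ xi
```

Writing the transformation with the extended vector ξ lets A and b be estimated together as the single matrix B.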

The relation parameter generation unit 122 substitutes equation (21) into equation (17) and takes the argmax with respect to B, thereby estimating the parameters A and b from the adaptation data by maximum likelihood estimation (step S122). The parameters A and b correspond to φ shown in equations (4) and (17).

However, since A and b are huge (several hundred thousand dimensions or more), estimating them from the adaptation data alone leaves too little data and causes overfitting. To resolve this, the matrix A is partitioned into blocks and its off-diagonal elements are approximated as zero. Partitioning b into blocks in the same way allows the transformation of Equation (21) to be rewritten as Equation (22).

That is, each mean vector μ_{e,k} is transformed by A_{e,k} and b_{e,k}. The number of parameters to be estimated can be reduced further by sharing A and b among multiple mean vectors. To do this, the set of mean vectors is clustered in advance, and clusters containing multiple mean vectors are chosen according to the amount of data. In this way, A and b can be estimated efficiently with few parameters.
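The sharing of A and b within a cluster can be illustrated with a minimal NumPy sketch. It is not the patent's estimator: it assumes identity covariances, so that maximum likelihood reduces to weighted least squares, and the function names and argument layout (`mu0`, `zeta`, `m`) are hypothetical. `zeta` and `m` stand in for per-Gaussian occupancy counts and first-order statistics in the spirit of ζ_{e,k} and m_{e,k}.

```python
import numpy as np

def estimate_shared_transform(mu0, zeta, m):
    """Estimate one affine transform (A, b) shared by a cluster of Gaussians.

    mu0  : (K, D) initial mean vectors of the K Gaussians in the cluster
    zeta : (K,)   occupancy counts (zeroth-order statistics)
    m    : (K, D) first-order statistics (posterior-weighted feature sums)

    Assumption: identity covariances, so the estimate minimises
    sum_k zeta_k * ||m_k / zeta_k - B xi_k||^2 with xi_k = (mu0_k', 1)'.
    """
    K, D = mu0.shape
    xi = np.hstack([mu0, np.ones((K, 1))])   # (K, D+1); row k is xi_k'
    G = (xi * zeta[:, None]).T @ xi          # sum_k zeta_k xi_k xi_k'
    Z = m.T @ xi                             # sum_k m_k xi_k'
    B = np.linalg.solve(G, Z.T).T            # solves B G = Z (G is symmetric)
    return B[:, :D], B[:, D]                 # A is (D, D), b is (D,)

def apply_transform(mu0, A, b):
    """Update every mean in the cluster: mu_k = A mu0_k + b."""
    return mu0 @ A.T + b
```

A cluster needs at least D + 1 Gaussians in general position for `G` to be invertible, which is one reason the cluster size is tied to the amount of data.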

For the clustering of the mean vector set used to reduce the parameters of A and b, a Gaussian-sharing tree may be used; such trees are common in maximum likelihood linear regression, a representative acoustic model adaptation method, and in similar methods. A Gaussian-sharing tree represents a set of Gaussian distributions with a tree structure in which single Gaussian distributions are the leaves and sets of them are the nodes. An inter-distribution distance such as the Euclidean distance determines which Gaussian distributions are grouped into one set. In a binary tree, for example, two Gaussian distributions that are close in inter-distribution distance are represented as one node. There are two ways to construct a Gaussian-sharing tree for multiple acoustic models:

(1) A sharing tree is constructed independently for each environment-dependent acoustic model before merging, using the inter-distribution distance, and the trees are merged by preparing a common parent node whose child nodes are their root nodes. In this case the regression matrices are shared within the same speaker, so a sharing structure that uses speaker information is built.

(2) A sharing tree is constructed by clustering the acoustic model obtained by merging the multiple models, using the inter-distribution distance. In this case the regression matrices are shared across speakers among Gaussian distributions that are close in inter-distribution distance. That is, speaker information is not considered directly, and phonetically similar Gaussian distributions are expected to share a matrix.
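The two constructions can be contrasted with a small sketch. It assumes a greedy agglomerative binary tree that merges the two nodes whose centroids are closest in Euclidean distance; the merging rule, the function names, and the string leaf labels are illustrative assumptions, not details given in the text.

```python
import numpy as np

def build_tree(means, labels):
    """Greedy binary tree over Gaussian mean vectors.

    A leaf is a string label; an internal node is a tuple (left, right).
    The two nodes with the closest centroids are merged until one root remains.
    """
    # Each entry: (subtree, centroid, number of Gaussians under the subtree)
    nodes = [(label, np.asarray(mean, dtype=float), 1)
             for label, mean in zip(labels, means)]
    while len(nodes) > 1:
        best = None
        for i in range(len(nodes)):
            for j in range(i + 1, len(nodes)):
                d = np.linalg.norm(nodes[i][1] - nodes[j][1])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        (ti, ci, ni), (tj, cj, nj) = nodes[i], nodes[j]
        merged = ((ti, tj), (ni * ci + nj * cj) / (ni + nj), ni + nj)
        nodes = [n for k, n in enumerate(nodes) if k not in (i, j)] + [merged]
    return nodes[0][0]

def tree_per_model(models):
    """Construction (1): one tree per model, roots joined under a common parent."""
    roots = [build_tree(means, [f"{name}:{k}" for k in range(len(means))])
             for name, means in models.items()]
    return tuple(roots)

def tree_pooled(models):
    """Construction (2): pool the Gaussians of all models, then cluster."""
    means, labels = [], []
    for name, model_means in models.items():
        for k, mean in enumerate(model_means):
            means.append(mean)
            labels.append(f"{name}:{k}")
    return build_tree(means, labels)
```

With construction (1), every subtree below the common parent contains Gaussians of a single model; with construction (2), a subtree may mix Gaussians from different models whenever their means are close.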

In the simulation described later, the two kinds of sharing tree were combined: an adaptation experiment using (2) above was performed first, and the resulting model was then used as the initial model for an adaptation experiment using (1) above.

The update model generation unit 123 takes as input the parameters A and b from the relation parameter generation unit 122 and the initial value speech recognition model μ⁰ recorded in the initial value speech recognition model recording unit 10, evaluates Equation (21), and updates the speech recognition model (step S123).

As described above, the speech recognition model creation device 100 of Embodiment 1 treats an initial value speech recognition model containing a plurality of speech recognition models as a single vector, and updates it using speech recognition results obtained from state probability transitions composed of combinations of the plurality of speech recognition models. Because the plurality of speech recognition models can be learned jointly, a sufficient adaptive learning effect is obtained even with a small amount of speech data.

[Speech recognition device]
The speech recognition model creation device 100 described in Embodiment 1 can be used in a speech recognition device. FIG. 7 shows an example of the functional configuration of a speech recognition device 200 that uses the speech recognition model creation device 100, and FIG. 8 shows its operation flow. The speech recognition device 200 comprises the speech recognition model creation device 100, a speech recognition network database 22, an A/D conversion unit 91, a feature amount extraction unit 92, a score calculation unit 931, and a speech recognition network selection unit 201. The A/D conversion unit 91, the feature amount extraction unit 92, and the score calculation unit 931 are the same as those of the speech recognition device 900 described in the related art. Accordingly, only the speech recognition network database 22 and the speech recognition network selection unit 201 are described below.

The speech recognition network database 22 records state probability transitions composed of combinations of a plurality of speech recognition models; that is, it records the speech recognition network containing the plurality of speech recognition models shown in Equation (5) and FIG. 1. Equation (5) represents the joining of speech recognition networks for a same-language, two-speaker dialogue environment. In an environment where the words and the grammar themselves differ, as in multilingual speech recognition, a network is prepared for each and the speech recognition network database 22 is constructed as shown in Equation (23). Equation (23) covers transitions between utterances; for transitions between words, the database can be constructed by Equation (24).

In this way, diverse acoustic environment models, such as inter-utterance (or inter-word) transition models for multiple speakers of the same language or for multilingual environments, can be built with composition and union operations on networks, and these can be carried out efficiently using existing algorithms for weighted finite-state transducers (WFSTs; a speech recognition decoder that uses them is called a WFST-based decoder). In a WFST-based decoder, the acoustic model handles only the IDs of HMM states and the parameter values of the Gaussian mixture model contained in each state. Multiple acoustic models can therefore be merged by adding each model's HMM state IDs and the corresponding Gaussian mixture model parameter values to the merged acoustic model. Care is needed to avoid duplicate ID numbers when doing so, and the corresponding HMM state IDs in the WFST must be changed accordingly.
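The ID bookkeeping described above can be sketched in plain Python. This does not use any real WFST library: the dictionary-based acoustic model, the arc tuple layout `(src, dst, hmm_state_id, weight)`, and both function names are hypothetical stand-ins chosen for illustration.

```python
def merge_acoustic_models(models):
    """Merge per-environment acoustic models into one table of HMM states.

    models: list of dicts, each mapping a local HMM state ID to its GMM
            parameters (any object).
    Returns the merged table and, for each model, a map from local ID to
    the new, non-overlapping ID in the merged table.
    """
    merged, id_maps = {}, []
    next_id = 0
    for model in models:
        id_map = {}
        for local_id in sorted(model):
            id_map[local_id] = next_id       # assign a fresh, unique ID
            merged[next_id] = model[local_id]
            next_id += 1
        id_maps.append(id_map)
    return merged, id_maps

def relabel_arcs(arcs, id_map):
    """Rewrite the HMM state ID carried by each arc of one model's network."""
    return [(src, dst, id_map[state_id], weight)
            for src, dst, state_id, weight in arcs]
```

Relabelling the arcs of each network with the same map that was used for its acoustic model keeps the network and the merged model consistent before the networks are combined.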

The speech recognition network selection unit 201 selects from the speech recognition network database 22 the state sequence, or set of state sequences, of the speech recognition network whose state probability transitions maximize the score that the score calculation unit 931 calculates from the acoustic features and the updated speech recognition model produced by the speech recognition model creation device 100, and outputs it as the speech recognition result (step S201). The set of state sequences of the speech recognition result is also input to the likelihood calculation unit 13 of the speech recognition model creation device 100, where it serves as the teacher signal for adaptive learning.

Together with the set of state sequences, the speech recognition network selection unit 201 may also output, as environment information, the type e of the speech recognition model that constitutes the selected speech recognition network. For example, if the speech recognition network database 22 records two kinds of speech recognition networks, Japanese (e = 1) and English (e = 2), the type e is output as well. This has the effect of also revealing the environment in which speech is being recognized.

[Simulation results]
A simulation was performed to confirm the effectiveness of the speech recognition model creation method of the present invention. As the plurality of acoustic environments, two gender-dependent acoustic models (male and female) were prepared. The speech recognition conditions were a sampling frequency of 16 kHz, 16-bit quantization, a Hamming window, a frame length of 25 ms, and a frame shift of 10 ms. The language model was a trigram model (14 years of newspaper articles), and the vocabulary size was 700,000 words.
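To make the framing conditions concrete, the following sketch converts them to sample counts and cuts a signal into Hamming-windowed frames. The constants come from the conditions above; the function name and the choice to drop a trailing partial frame are assumptions made for illustration.

```python
import numpy as np

SAMPLE_RATE_HZ = 16_000
FRAME_LENGTH_MS = 25
FRAME_SHIFT_MS = 10

def frame_signal(signal):
    """Split a 1-D signal into Hamming-windowed frames (partial tail dropped)."""
    frame_len = SAMPLE_RATE_HZ * FRAME_LENGTH_MS // 1000    # 400 samples
    frame_shift = SAMPLE_RATE_HZ * FRAME_SHIFT_MS // 1000   # 160 samples
    signal = np.asarray(signal, dtype=float)
    if len(signal) < frame_len:
        return np.empty((0, frame_len))
    n_frames = 1 + (len(signal) - frame_len) // frame_shift
    frames = np.stack([signal[i * frame_shift: i * frame_shift + frame_len]
                       for i in range(n_frames)])
    return frames * np.hamming(frame_len)
```

Under these assumptions, one second of audio (16,000 samples) yields 98 frames of 400 samples each.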

The word accuracy of the method of the present invention was compared with that of two conventional methods: one using a single gender-independent acoustic model and one using multiple acoustic models. The results are shown in Table 1.

Adaptive learning according to the present invention gave the best word accuracy, 85.5%, improving recognition performance by 1% over the conventional adaptive learning method using multiple models. Compared with single-model adaptation, word accuracy improved by as much as 3%. Thus a speech recognition device that uses the speech recognition model creation method of the present invention achieves improved word accuracy.

The speech recognition model creation device and method and the speech recognition device and method based on the technical idea of the present invention are not limited to the embodiments described above, and may be modified as appropriate without departing from the spirit of the invention. The processes described for the above devices and methods need not be executed in time series in the order described; they may be executed in parallel or individually according to the processing capability of the device executing them or as needed. For example, the above embodiment was described with the likelihood calculation unit 13 included in the speech recognition model creation device 100, but when a speech recognition device is configured, the likelihood calculation unit 13 can be omitted by using the likelihood or score calculated by the likelihood calculation unit of the speech recognition device. Likewise, although the initial value speech recognition model was described as being recorded as a single vector in the initial value speech recognition model recording unit 10, the plurality of speech recognition models may instead be recorded there independently, with the relation parameter generation unit 122 treating them as a single vector. Also, although the speech recognition device 200 was described as including the A/D conversion unit 91, the A/D conversion unit 91 is unnecessary when the speech data is an already digitized speech data file.

When the processing means of the above devices are realized by a computer, the processing content of the functions each device should have is described by a program. Executing this program on the computer realizes the processing means of each device on the computer.

The program describing this processing content can be recorded on a computer-readable recording medium. The computer-readable recording medium may be of any kind, for example a magnetic recording device, an optical disc, a magneto-optical recording medium, or a semiconductor memory. Specifically, for example, a hard disk device, flexible disk, or magnetic tape can be used as the magnetic recording device; a DVD (Digital Versatile Disc), DVD-RAM (Random Access Memory), CD-ROM (Compact Disc Read Only Memory), or CD-R (Recordable)/RW (ReWritable) as the optical disc; an MO (Magneto Optical disc) as the magneto-optical recording medium; and a flash memory as the semiconductor memory.

The program is distributed, for example, by selling, transferring, or lending a portable recording medium such as a DVD or CD-ROM on which the program is recorded. Alternatively, the program may be stored in a storage device of a server computer and distributed by transferring it from the server computer to other computers over a network.

Each means may be configured by executing a predetermined program on a computer, or at least part of the processing content may be realized in hardware.

A diagram conceptually explaining the union operation.
A diagram showing an example of the functional configuration of the speech recognition model creation device 100 of the present invention and the speech recognition device 200 using it.
A diagram showing the operation flow of the speech recognition model creation device 100.
A diagram schematically showing one state constituting a phoneme model.
A diagram showing an example of a phoneme model.
A diagram schematically showing the relationship between frames and state i.
A diagram showing an example of the functional configuration of the speech recognition model creation device 100 and the speech recognition device 200.
A diagram showing the operation flow of the speech recognition device 200.
A diagram showing an example of the functional configuration of a conventional speech recognition device 900 provided with a plurality of speech recognition models.

Claims

A speech recognition model creation device comprising:
an initial value speech recognition model recording unit that records an initial value speech recognition model, the initial value speech recognition model being a single vector obtained by concatenating vectors representing the parameters of a plurality of speech recognition models, the plurality of speech recognition models respectively corresponding to a plurality of sound sources;
a likelihood calculation unit that receives as input a set of state sequences, which is the result of speech recognition of a speech signal based on state probability transitions obtained by a union operation on a plurality of speech recognition networks respectively corresponding to the speech recognition models, and the feature vectors of the speech signal, and calculates the likelihood of each state for each frame;
a model update unit that receives the likelihood and the feature vectors as input and generates an updated speech recognition model by updating the initial value speech recognition model; and
an updated speech recognition model recording unit that records the updated speech recognition model.

The speech recognition model creation device according to claim 1,
wherein the model update unit comprises:
a posterior probability calculation unit that receives the likelihood and the feature vectors as input and calculates a posterior probability value for each Gaussian distribution constituting the state;
a relation parameter generation unit that receives the posterior probability value for each Gaussian distribution and the initial value speech recognition model as input and generates relation parameters for updating the initial value speech recognition model as a single vector; and
an update model generation unit that outputs an updated speech recognition model obtained by updating the initial value speech recognition model with the relation parameters.

A speech recognition model creation method in which an initial value speech recognition model, which is a single vector obtained by concatenating vectors representing the parameters of a plurality of speech recognition models corresponding to respective sound sources, is stored in an initial value speech recognition model recording unit, the method including:
a likelihood calculation step of calculating the likelihood of each state for each frame, taking as input a set of state sequences, which is the result of speech recognition of a speech signal based on state probability transitions obtained by a union operation on a plurality of speech recognition networks respectively corresponding to the plurality of speech recognition models, and the feature vectors of the speech signal;
a model update step in which a model update unit receives the likelihood and the feature vectors as input and generates an updated speech recognition model by updating the initial value speech recognition model; and
an updated speech recognition model recording step in which an updated speech recognition model recording unit records the updated speech recognition model.

The speech recognition model creation method according to claim 3,
wherein the model update step comprises:
a posterior probability calculation step in which a posterior probability calculation unit receives the likelihood as input and calculates a posterior probability value for each Gaussian distribution constituting the state;
a relation parameter generation step in which a relation parameter generation unit receives the posterior probability value for each Gaussian distribution, the initial value speech recognition model, and the feature vectors as input and generates relation parameters for updating the initial value speech recognition model as a single vector; and
an update model generation step in which an update model generation unit outputs an updated speech recognition model obtained by updating the initial value speech recognition model with the relation parameters.

A speech recognition device comprising:
the speech recognition model creation device according to claim 1 or 2;
a speech recognition network database that records state probability transitions composed of combinations of a plurality of speech recognition models;
a feature amount extraction unit that extracts a feature vector for each frame of a discrete speech signal;
a score calculation unit that receives the feature vector and the initial value speech recognition model as input and calculates a score using the updated speech recognition model obtained by updating the initial value speech recognition model with the speech recognition result; and
a speech recognition network selection unit that selects, from the speech recognition network database, the speech recognition network whose state probability transitions give the highest score and outputs it as the speech recognition result.

The speech recognition device according to claim 5,
wherein the speech recognition network selection unit outputs environment information from the selected speech recognition network.

A speech recognition method comprising:
the speech recognition model creation method according to claim 3 or 4;
a feature amount extraction step in which a feature amount extraction unit extracts a feature vector for each frame of a discrete speech signal;
a score calculation step in which a score calculation unit receives the feature vector and the updated speech recognition model as input and calculates a score corresponding to the feature vector; and
a speech recognition network selection step of selecting, from a speech recognition network database, the speech recognition network whose state probability transitions give the highest score and outputting it as a set of state sequences.

A method program for causing a computer to execute the speech recognition model creation method according to claim 3 or 4.

A method program for causing a computer to execute the speech recognition method according to claim 7.

A computer-readable recording medium on which the method program according to claim 8 or 9 is recorded.