JP2012113251A

JP2012113251A - Acoustic model creation apparatus, acoustic model creation method and program therefor

Info

Publication number: JP2012113251A
Application number: JP2010264318A
Authority: JP
Inventors: Satoru Kobashigawa; 哲小橋川; Atsunori Ogawa; 厚徳小川; Taichi Asami; 太一浅見; Yoshikazu Yamaguchi; 義和山口; Hirokazu Masataki; 浩和政瀧; Satoshi Takahashi; 敏高橋
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2010-11-26
Filing date: 2010-11-26
Publication date: 2012-06-14
Anticipated expiration: 2030-11-26
Also published as: JP5411837B2

Abstract

PROBLEM TO BE SOLVED: To create an acoustic model that can be shared by an inter-word triphone use decoder and an inter-word bi/monophone use decoder without impairing speech recognition accuracy when both decoders are used. SOLUTION: For the biphone and monophone phoneme models present in an inter-word bi/monophone use acoustic model, every triphone phoneme model name is generated that can be obtained by replacing the environment-independent symbols in their phoneme model names with each phoneme of the phoneme system. Among the generated names, those not present in the inter-word bi/monophone use acoustic model are identified. The model parameters of the biphone or monophone from which each identified name was generated are copied as the parameters of a triphone with that name, and the resulting phoneme models are added to construct an inter-word tri/bi/monophone shared acoustic model.

Description

The present invention relates to an acoustic model creation apparatus, an acoustic model creation method, and a program therefor, for creating an acoustic model used mainly in speech recognition processing.

In speech recognition processing by a decoder (a speech recognition engine, or the search processing unit of one), the phoneme models used between words are mainly ones that take the phoneme environment into account. A one-pass decoder, such as one based on a WFST (weighted finite-state transducer), uses triphones, which consider the phoneme environment both before and after the central phoneme. A two-pass decoder may use, in its first pass, either triphones or a combination of biphones, which consider the phoneme environment on only one side, and monophones, which consider no phoneme environment (hereinafter "bi/monophone"). Using bi/monophones generally gives faster processing (in the second pass, triphones are used in either case). When the phoneme models used between words are triphones only, training is performed with triphone training data only; when bi/monophones may also be used, training is also performed with bi/monophone training data.

Kiyohiro Shikano, Katsunobu Ito, Tatsuya Kawahara, Kazuya Takeda, Mikio Yamamoto, "IT Text 音声認識システム" (IT Text Speech Recognition System), Ohmsha, 2001, p. 144

When speech recognition uses both a decoder that uses only triphones between words (hereinafter "inter-word triphone use decoder") and a decoder that uses not only triphones but also bi/monophones between words (hereinafter "inter-word bi/monophone use decoder"), an acoustic model suited to each decoder must be prepared separately, as shown in the apparatus configuration example of FIG. 10. Here, an acoustic model is a set of phoneme models containing the required number of distinct triphone, biphone, and monophone phoneme models.

Specifically, for the inter-word triphone use decoder 10, an inter-word triphone use acoustic model storage unit 105 is prepared, which stores an acoustic model trained on triphones only (hereinafter "inter-word triphone use acoustic model"). For the inter-word bi/monophone use decoder 20, an inter-word bi/monophone use acoustic model storage unit 110 is prepared, which stores an acoustic model trained on each of triphones, biphones, and monophones (hereinafter "inter-word bi/monophone use acoustic model"). A language model 30 is also required, but it is unrelated to the present invention and its description is omitted. Separate models are needed for the following reasons.

First, consider sharing the inter-word triphone use acoustic model, which is trained on triphones only between words, between the inter-word triphone use decoder 10 and the inter-word bi/monophone use decoder 20. In this case, when the acoustic model contains no phoneme model for a certain triphone (for example, k-a+k) and that triphone is input as training data, a bi/monophone phoneme model whose phoneme environment is close to that triphone (for example, k-a+* or *-a+*) is trained with the triphone's training data. Falling into this situation is called "back-off". Consequently, if that bi/monophone phoneme model is used as-is by the inter-word bi/monophone use decoder 20, speech recognition accuracy degrades.

Next, consider sharing the inter-word bi/monophone use acoustic model, which is trained on each of triphones, biphones, and monophones between words, between the inter-word bi/monophone use decoder 20 and the inter-word triphone use decoder 10. Here too, when the acoustic model contains no phoneme model for a certain triphone (for example, k-a+k) and that triphone is input as training data, a bi/monophone phoneme model with a nearby phoneme environment (for example, k-a+* or *-a+*, where * is the environment-independent symbol) is trained with the triphone's training data through back-off. Because this bi/monophone phoneme model is also trained with bi/monophone training data, its variance widens and its discriminative performance degrades. Consequently, when this acoustic model is used by the inter-word triphone use decoder 10, speech recognition accuracy degrades compared with using the inter-word triphone use acoustic model trained on triphones only.
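The back-off behavior described above can be sketched as a name lookup. This is only an illustration: the function names, the dictionary representation of the acoustic model, and the order in which the two biphone forms are tried are assumptions, not something the text specifies.

```python
def split_name(name):
    """Split a model name 'left-center+right' into its three parts."""
    left, rest = name.split("-", 1)
    center, right = rest.split("+", 1)
    return left, center, right


def backoff_candidates(name):
    """List the names tried for a triphone, from most to least specific.

    Assumption: the left-context biphone is tried before the
    right-context biphone; the source does not fix this order.
    """
    left, center, right = split_name(name)
    return [
        f"{left}-{center}+{right}",  # the triphone itself
        f"{left}-{center}+*",        # biphone keeping the left context
        f"*-{center}+{right}",       # biphone keeping the right context
        f"*-{center}+*",             # monophone
    ]


def resolve(name, model):
    """Return the name of the model that actually serves `name`."""
    for candidate in backoff_candidates(name):
        if candidate in model:
            return candidate
    raise KeyError(name)


# Hypothetical acoustic model that lacks the triphone k-a+k.
model = {"k-a+*": "biphone parameters", "*-a+*": "monophone parameters"}
resolved = resolve("k-a+k", model)  # backs off to the biphone k-a+*
```

In this sketch, any training data labeled k-a+k would end up updating k-a+*, which is the contamination the two paragraphs above describe.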

An object of the present invention is to provide an acoustic model creation apparatus, an acoustic model creation method, and a program therefor, for creating an acoustic model that can be shared by an inter-word triphone use decoder and an inter-word bi/monophone use decoder without impairing the speech recognition accuracy of either decoder.

The acoustic model creation apparatus of the present invention includes an inter-word bi/monophone use acoustic model storage unit, an inter-word tri/bi/monophone shared acoustic model generation unit, and an inter-word tri/bi/monophone use acoustic model storage unit.

The inter-word bi/monophone use acoustic model storage unit stores an inter-word bi/monophone use acoustic model containing arbitrary numbers of distinct triphone, biphone, and monophone phoneme models, each trained between words with triphone, biphone, and monophone training data respectively.

The logical triphone generation unit takes the phoneme model names of the biphone and monophone phoneme models present in the inter-word bi/monophone use acoustic model, each name being a triple of left phoneme, central phoneme, and right phoneme (for example, a-k+* for a biphone or *-k+* for a monophone). It generates every triphone phoneme model name (for example, a-k+a, a-k+i, a-k+k, a-k+y) that can be obtained by replacing the environment-independent symbol ("*" in these examples) in the left and/or right phoneme position with each phoneme of the phoneme system. Among the generated names, it identifies those not present in the inter-word bi/monophone use acoustic model. It then copies the model parameters of the biphone or monophone phoneme model from whose name each identified name was generated, as the model parameters of the triphone phoneme model with that identified name, thereby newly generating a triphone phoneme model for each identified name.

The inter-word tri/bi/monophone shared acoustic model storage unit takes in and stores the inter-word bi/monophone use acoustic model, and also stores the triphone phoneme models newly generated by the logical triphone generation unit.

According to the acoustic model creation apparatus of the present invention, when an inter-word triphone use decoder and an inter-word bi/monophone use decoder are both used, an acoustic model can be created that is shared by both decoders without impairing the speech recognition accuracy of either.

FIG. 1 is a diagram showing an apparatus configuration example for speech recognition processing using the acoustic model creation apparatus of the present invention.
FIG. 2 is a diagram showing a functional configuration example of the acoustic model creation apparatus of Examples 1 and 3.
FIG. 3 is a diagram showing a functional configuration example of the acoustic model creation apparatus of Examples 2 and 5.
FIG. 4 is a diagram showing a functional configuration example of the acoustic model creation apparatus of Example 4.
FIG. 5 is a diagram showing a processing flow example of the acoustic model creation apparatus of Example 1.
FIG. 6 is a diagram showing a processing flow example of the acoustic model creation apparatus of Example 2.
FIG. 7 is a diagram showing a processing flow example of the acoustic model creation apparatus of Example 3.
FIG. 8 is a diagram showing a processing flow example of the acoustic model creation apparatus of Example 4.
FIG. 9 is a diagram showing a processing flow example of the acoustic model creation apparatus of Example 5.
FIG. 10 is a diagram showing an apparatus configuration example for conventional speech recognition processing.

FIG. 1 shows an apparatus configuration example for performing speech recognition processing using the acoustic model creation apparatus of the present invention. FIG. 1 shows the case where the acoustic model creation apparatus 100 of Example 1 is used; the acoustic model creation apparatuses of the other examples are placed at the same position.

FIG. 2 shows a functional configuration example of the acoustic model creation apparatus 100 of the present invention, and FIG. 5 shows an example of its processing flow. The acoustic model creation apparatus 100 includes an inter-word bi/monophone use acoustic model storage unit 110, a logical triphone generation unit 120, and an inter-word tri/bi/monophone shared acoustic model storage unit 130.

The inter-word bi/monophone use acoustic model storage unit 110 stores in advance an inter-word bi/monophone use acoustic model containing arbitrary numbers of triphone, biphone, and monophone phoneme models, each trained between words with triphone, biphone, and monophone training data respectively.

First, the logical triphone generation unit 120 takes the phoneme model names of the biphone and monophone phoneme models present in the inter-word bi/monophone use acoustic model, each name being a triple of left phoneme, central phoneme, and right phoneme (for example, a-k+* for a biphone or *-k+* for a monophone), and generates every triphone phoneme model name (for example, a-k+a, a-k+i, a-k+k, a-k+y) that can be obtained by replacing the environment-independent symbol ("*" in these examples) in the left and/or right phoneme position with each phoneme of the phoneme system (S1). Some phoneme combinations that could be generated do not occur in a given language (for example, k+k in Japanese), so such combinations may be excluded from generation. Next, among the phoneme model names generated in S1, it identifies those not present in the inter-word bi/monophone use acoustic model (S2). For example, if a-k+a and a-k+i from the S1 example are present in the acoustic model, a-k+k and a-k+y are identified. Next, it copies the model parameters of the biphone or monophone phoneme model from whose name each name identified in S2 was generated, as the model parameters of the triphone phoneme model with that identified name, thereby newly generating a triphone phoneme model for each identified name (S3). The model parameters are the state transition probabilities and the Gaussian mixture distribution of each state of the phoneme model, the individual Gaussians of each mixture, and the mean vector and covariance matrix of each Gaussian. For example, when the source name of a-k+k is a-k+*, the a-k+k phoneme model is generated by copying the model parameters of the a-k+* phoneme model as the model parameters of the a-k+k phoneme model.
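Steps S1 to S3 can be sketched as follows. The dictionary representation of the acoustic model, the placeholder parameter values, and the function names are illustrative assumptions. The sketch also assumes that biphone sources are processed before monophone sources when several sources could produce the same triphone name, which the description of Example 1 does not state.

```python
import copy


def split_name(name):
    """Split a model name 'left-center+right' into its three parts."""
    left, rest = name.split("-", 1)
    center, right = rest.split("+", 1)
    return left, center, right


def expand_to_triphones(name, phonemes):
    """S1: replace each '*' context with every phoneme of the system."""
    left, center, right = split_name(name)
    lefts = phonemes if left == "*" else [left]
    rights = phonemes if right == "*" else [right]
    return [f"{l}-{center}+{r}" for l in lefts for r in rights]


def add_logical_triphones(model, phonemes):
    """Return a shared model with the missing triphones filled in."""
    def wildcard_count(name):
        left, _, right = split_name(name)
        return (left == "*") + (right == "*")

    # Assumption: biphones (one '*') are handled before monophones (two '*').
    sources = sorted((n for n in model if wildcard_count(n) > 0),
                     key=wildcard_count)
    shared = dict(model)
    for source in sources:
        for tri in expand_to_triphones(source, phonemes):   # S1
            if tri not in shared:                           # S2
                # S3: copy the source parameters under the new name.
                shared[tri] = copy.deepcopy(model[source])
    return shared


# Toy phoneme system and model; the parameter dicts are placeholders.
phonemes = ["a", "i", "k", "y"]
model = {
    "a-k+a": {"mean": [1.0]},
    "a-k+i": {"mean": [2.0]},
    "a-k+*": {"mean": [3.0]},
}
shared = add_logical_triphones(model, phonemes)
new_names = sorted(set(shared) - set(model))
```

With this toy input, the only names added are a-k+k and a-k+y, matching the example in the text, and both carry a copy of the a-k+* parameters. Language-specific exclusions such as k+k are not modeled here.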

The inter-word tri/bi/monophone shared acoustic model storage unit 130 takes in and stores the inter-word bi/monophone use acoustic model from the inter-word bi/monophone use acoustic model storage unit 110, and also stores the triphone phoneme models newly generated by the logical triphone generation unit 120 (S4).

In this way, triphone phoneme models that do not exist in the acoustic model are newly generated, and an inter-word tri/bi/monophone shared acoustic model to which they have been added is constructed. Using this shared model, back-off no longer occurs, so the inter-word triphone use decoder and the inter-word bi/monophone use decoder can share one acoustic model without impairing the speech recognition accuracy of either decoder.

FIG. 3 shows a functional configuration example of the acoustic model creation apparatus 200 of the present invention, and FIG. 6 shows an example of its processing flow. The acoustic model creation apparatus 200 includes an inter-word bi/monophone use acoustic model storage unit 110, a logical triphone generation unit 120, a logical biphone generation unit 220, and an inter-word tri/bi/monophone shared acoustic model storage unit 130. That is, it is the acoustic model creation apparatus 100 of Example 1 with the logical biphone generation unit 220 added. This makes it possible to generate biphone phoneme models that do not exist in the inter-word bi/monophone use acoustic model storage unit 110 and add them to the acoustic model, so back-off can be prevented for biphones as well.

First, the logical biphone generation unit 220 takes the phoneme model names of the monophone phoneme models present in the inter-word bi/monophone use acoustic model, each name being a triple of left phoneme, central phoneme, and right phoneme, and generates every biphone phoneme model name that can be obtained by replacing one of the two environment-independent symbols, in either the left or the right phoneme position, with each phoneme of the phoneme system (S11). For example, from the monophone *-a+* it generates k-a+*, y-a+*, *-a+i, *-a+n, and so on. Some phoneme combinations that could be generated do not occur in a given language (for example, k+k in Japanese), so such combinations may be excluded from generation. Next, among the phoneme model names generated in S11, it identifies those not present in the inter-word bi/monophone use acoustic model (S12). For example, if k-a+* and *-a+n from the S11 example are present in the acoustic model, y-a+* and *-a+i are identified. Next, it copies the model parameters of the monophone phoneme model from whose name each name identified in S12 was generated, as the model parameters of the biphone phoneme model with that identified name, thereby newly generating a biphone phoneme model for each identified name (S13). For example, when the source name of k-a+* is *-a+*, the k-a+* phoneme model is generated by copying the model parameters of the *-a+* phoneme model as the model parameters of the k-a+* phoneme model.
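Steps S11 to S13 follow the same pattern as the triphone case, except that only one of the two "*" positions of a monophone is replaced at a time. The sketch below is a minimal illustration under assumed names and an assumed dictionary representation of the acoustic model.

```python
import copy


def expand_to_biphones(center, phonemes):
    """S11: replace exactly one '*' of the monophone '*-center+*'."""
    left_context = [f"{p}-{center}+*" for p in phonemes]
    right_context = [f"*-{center}+{p}" for p in phonemes]
    return left_context + right_context


def add_logical_biphones(model, phonemes):
    """Return a model with the missing biphones copied from monophones."""
    shared = dict(model)
    for name, params in model.items():
        left, rest = name.split("-", 1)
        center, right = rest.split("+", 1)
        if left != "*" or right != "*":
            continue  # only monophones act as sources here
        for bi in expand_to_biphones(center, phonemes):   # S11
            if bi not in shared:                          # S12
                shared[bi] = copy.deepcopy(params)        # S13
    return shared


# Toy phoneme system and model; the parameter dicts are placeholders.
phonemes = ["i", "k", "n", "y"]
model = {
    "*-a+*": {"mean": [0.5]},
    "k-a+*": {"mean": [1.5]},
    "*-a+n": {"mean": [2.5]},
}
shared = add_logical_biphones(model, phonemes)
```

Here k-a+* and *-a+n keep their trained parameters, while y-a+*, *-a+i, and the other missing biphones receive copies of the *-a+* parameters.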

The biphone phoneme models newly generated by the logical biphone generation unit 220 are then additionally stored in the inter-word tri/bi/monophone shared acoustic model storage unit 130 (S4).

FIG. 2 shows a functional configuration example of the acoustic model creation apparatus 300 of the present invention, and FIG. 7 shows an example of its processing flow. The acoustic model creation apparatus 300 includes an inter-word bi/monophone use acoustic model storage unit 110, a logical triphone generation unit 320, and an inter-word tri/bi/monophone shared acoustic model storage unit 130. That is, it is the acoustic model creation apparatus 100 of Example 1 with the logical triphone generation unit 120 replaced by the logical triphone generation unit 320.

First, the logical triphone generation unit 320 takes the phoneme model names of the biphone and monophone phoneme models present in the inter-word bi/monophone use acoustic model, each name being a triple of left phoneme, central phoneme, and right phoneme (for example, a-k+* for a biphone or *-k+* for a monophone), and generates every triphone phoneme model name (for example, a-k+a, a-k+i, a-k+k, a-k+y) that can be obtained by replacing the environment-independent symbol in the left and/or right phoneme position with each phoneme of the phoneme system (S21). Some phoneme combinations that could be generated do not occur in a given language (for example, k+k in Japanese), so such combinations may be excluded from generation. Next, among the phoneme model names generated in S21, it identifies those not present in the inter-word bi/monophone use acoustic model (S22). For example, if a-k+a and a-k+i from the S21 example are present in the acoustic model, a-k+k and a-k+y are identified. Next, among the triphone phoneme model names identified in S22, for those that share the same source biphone or monophone phoneme model name, that is, those that share the same back-off destination, it expresses the environment-independent symbol part of the source name as a set of phonemes, thereby generating a phoneme model set name that collectively represents the multiple triphone phoneme model names sharing that source (S23). For example, when the source name of a-k+k and a-k+y is a-k+*, it generates the phoneme model set name a-k+{*,k,y}. Next, it copies the model parameters of the source biphone or monophone phoneme model as the model parameters common to the phoneme models of all names that make up the set name generated in S23, thereby newly generating triphone phoneme models in units of phoneme model set names (S24). For example, the model parameters of the source a-k+* phoneme model are copied as the common model parameters of the phoneme models a-k+k and a-k+y that make up the set name a-k+{*,k,y}, and a phoneme model is generated for the set a-k+{*,k,y} as a unit. The environment-independent symbol "*" is kept in the set name generated in S23 so that, in S24, the source phoneme model name (a-k+* in this example) can be identified by backing off from the set name and the model parameters of that source phoneme model can be copied.
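Steps S21 to S24 can be sketched for the simplest case, a biphone source of the form l-c+* as in the a-k+* example. The function names and the dictionary representation are illustrative assumptions, and monophone sources and right-context biphones are deliberately left out of this sketch.

```python
import copy


def build_set_models(model, phonemes):
    """Group missing triphones by their source biphone 'l-c+*'.

    Returns a dict mapping each phoneme model set name, such as
    'a-k+{*,k,y}', to one copy of the source model's parameters.
    """
    set_models = {}
    for name, params in model.items():
        left, rest = name.split("-", 1)
        center, right = rest.split("+", 1)
        if left == "*" or right != "*":
            continue  # this sketch handles only sources of the form l-c+*
        # S21 and S22: generated triphone names absent from the model.
        missing = [p for p in phonemes
                   if f"{left}-{center}+{p}" not in model]
        if not missing:
            continue
        # S23: keep '*' in the set so the source name stays recoverable.
        members = ",".join(["*"] + missing)
        set_name = left + "-" + center + "+{" + members + "}"
        # S24: one parameter copy serves every name in the set.
        set_models[set_name] = copy.deepcopy(params)
    return set_models


def set_members(set_name):
    """Expand 'a-k+{*,k,y}' into ['a-k+*', 'a-k+k', 'a-k+y']."""
    head, body = set_name.split("{", 1)
    return [head + p for p in body.rstrip("}").split(",")]


# Toy phoneme system and model; the parameter dicts are placeholders.
phonemes = ["a", "i", "k", "y"]
model = {
    "a-k+a": {"mean": [1.0]},
    "a-k+i": {"mean": [2.0]},
    "a-k+*": {"mean": [3.0]},
}
set_models = build_set_models(model, phonemes)
```

Compared with copying parameters once per missing triphone, this keeps a single parameter copy per source model, which is the size saving the set name is meant to provide.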

The inter-word tri/bi/monophone shared acoustic model storage unit 130 takes in and stores the inter-word bi/monophone use acoustic model from the inter-word bi/monophone use acoustic model storage unit 110, and also stores the triphone phoneme models newly generated by the logical triphone generation unit 320 (S25). As described above, the phoneme model set name of a newly generated triphone phoneme model includes the source phoneme model name reached by back-off (a-k+* in the example), but the phoneme model with that name is already contained in the inter-word bi/monophone use acoustic model taken in from the storage unit 110. Therefore, to avoid duplication when storing, the environment-independent symbol "*" in the phoneme model set name must first be converted to another symbol (for example, any), giving a-k+{any,k,y} in the example. To limit the size increase, the substitute symbol (any in the example) may ultimately be deleted.
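The renaming at storage time (S25) can be sketched as a string substitution applied to the set name before it is merged into the shared model. The function names and the dictionary representation are hypothetical; only the placeholder symbol "any" comes from the text.

```python
import re


def rename_wildcard_in_set(set_name, placeholder="any"):
    """Replace '*' inside the braces of a set name with a placeholder."""
    def swap(match):
        members = match.group(1).split(",")
        renamed = [placeholder if m == "*" else m for m in members]
        return "{" + ",".join(renamed) + "}"

    # Only the brace-delimited set is rewritten; a plain name such as
    # 'a-k+*' contains no braces and is returned unchanged.
    return re.sub(r"\{([^}]*)\}", swap, set_name)


def store_shared_model(base_model, set_models):
    """Merge set-level models into the base model under renamed keys."""
    shared = dict(base_model)
    for set_name, params in set_models.items():
        shared[rename_wildcard_in_set(set_name)] = params
    return shared


# Toy inputs; the parameter values are placeholders.
base_model = {"a-k+*": {"mean": [3.0]}}
set_models = {"a-k+{*,k,y}": {"mean": [3.0]}}
shared = store_shared_model(base_model, set_models)
```

After the merge, the original a-k+* entry and the set entry a-k+{any,k,y} coexist without the set name claiming the name a-k+* a second time.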

According to the acoustic model creation apparatus 300 described above, back-off is prevented, and in addition the phoneme models for the multiple newly generated triphone phoneme model names share model parameters in units of phoneme model set names, so the increase in acoustic model size caused by adding new phoneme model names is limited. Furthermore, with triphones, separating model parameters per phoneme model name reduces the amount of training data available to each phoneme model and may degrade speech recognition accuracy; the present invention avoids this problem as well.

FIG. 4 shows a functional configuration example of the acoustic model creation apparatus 400 of the present invention, and FIG. 8 shows an example of its processing flow. The acoustic model creation apparatus 400 includes an inter-word bi/monophone use acoustic model storage unit 110, a logical triphone generation unit 420, and an inter-word tri/bi/monophone shared acoustic model storage unit 130. It is the acoustic model creation apparatus 300 with the logical triphone generation unit 320 replaced by the logical triphone generation unit 420. Specifically, whereas the logical triphone generation unit 320 uses common processing for back-off from triphone to biphone and for back-off from triphone to monophone, the logical triphone generation unit 420 performs these two kinds of processing separately.

The logical triphone generation unit 420 includes logical bi/triphone generation means 421 and logical mono/triphone generation means 422.

The logical bi/triphone generation means 421 first takes each biphone phoneme model present in the inter-word bi/monophone-use acoustic model, whose phoneme model name is a triplet of left phoneme, central phoneme, and right phoneme (for example, a-k+*), and replaces the environment-independent symbol contained as the left phoneme or the right phoneme with every phoneme in the phoneme system, thereby generating all triphone phoneme model names that can be generated (for example, a-k+a, a-k+i, a-k+k, a-k+y) (S31). Some of the generable phoneme combinations do not occur in a given language (for example, k+k in Japanese), and such combinations may be excluded from generation. Next, among the phoneme model names generated in S31, those not present in the inter-word bi/monophone-use acoustic model are identified (S32). In the example given for S31, if a-k+a and a-k+i are present in the acoustic model, a-k+k and a-k+y are identified. Next, for the triphone phoneme model names identified in S32 that share the same source biphone phoneme model name, that is, the same back-off destination, the environment-independent symbol part of the source biphone phoneme model name is expressed as a set of phonemes, which yields a phoneme model set name that collectively represents the multiple triphone phoneme model names sharing that source (S33). For example, when the source phoneme model name of a-k+k and a-k+y is a-k+*, the phoneme model set name a-k+{*,k,y} is generated. Next, the model parameters of the source biphone phoneme model are copied as model parameters common to the phoneme models of all phoneme model names constituting the set name generated in S33, so that a new triphone phoneme model is generated per phoneme model set name (S34). For example, the model parameters of the source phoneme model a-k+* are copied as the model parameters common to a-k+k and a-k+y, the phoneme model names constituting the set name a-k+{*,k,y}, and a phoneme model is generated for the set a-k+{*,k,y} as a unit. The environment-independent symbol "*" is retained in the set name generated in S33 so that, in S34, the source phoneme model name (a-k+* in the example above) can be identified by backing off from the set name and the model parameters of that source phoneme model can be copied.
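Steps S31 to S34 can be illustrated with a minimal Python sketch. It is not taken from the specification: the dictionary representation of the acoustic model, the function names `parse` and `logical_bi_triphones`, and the parameter placeholder are all assumptions made for illustration, and the optional exclusion of combinations that do not occur in the language is omitted.

```python
import copy

STAR = "*"  # environment-independent symbol


def parse(name):
    """Split a name of the form 'left-center+right' into its three parts."""
    left, rest = name.split("-", 1)
    center, right = rest.split("+", 1)
    return left, center, right


def logical_bi_triphones(model, phonemes):
    """Return {set_name: parameters} for triphones missing from `model`.

    `model` is a hypothetical dict mapping phoneme model names to parameters.
    """
    new_models = {}
    for name, params in model.items():
        left, center, right = parse(name)
        if (left == STAR) == (right == STAR):
            continue  # keep only biphones: exactly one side is '*'
        # S31: substitute every phoneme for the single '*'.
        if left == STAR:
            candidates = {p: f"{p}-{center}+{right}" for p in phonemes}
        else:
            candidates = {p: f"{left}-{center}+{p}" for p in phonemes}
        # S32: keep the phonemes whose triphone name is absent from the model.
        missing = [p for p, tri in candidates.items() if tri not in model]
        if not missing:
            continue
        # S33: one set name per source biphone, with '*' retained.
        group = "{" + ",".join([STAR] + missing) + "}"
        if left == STAR:
            set_name = f"{group}-{center}+{right}"
        else:
            set_name = f"{left}-{center}+{group}"
        # S34: copy the source biphone's parameters once for the whole set.
        new_models[set_name] = copy.deepcopy(params)
    return new_models
```

With a model containing a-k+*, a-k+a, and a-k+i and the phoneme list a, i, k, y, the sketch produces the single set name a-k+{*,k,y}, matching the example in the text.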

The logical mono/triphone generation means 422 first takes each monophone phoneme model present in the inter-word bi/monophone-use acoustic model, whose phoneme model name is a triplet of left phoneme, central phoneme, and right phoneme (for example, *-k+*), and replaces the environment-independent symbols contained as the left phoneme and the right phoneme with every phoneme in the phoneme system, thereby generating all triphone phoneme model names that can be generated (for example, a-k+a, a-k+i, a-k+k, a-k+y) (S35). Some of the generable phoneme combinations do not occur in a given language (for example, k+k in Japanese), and such combinations may be excluded from generation. Next, among the phoneme model names generated in S35, the triphone phoneme model names that are not present in the inter-word bi/monophone-use acoustic model and have not been identified by the logical bi/triphone generation means 421 are identified (S36). In the example given for S35, if a-k+a and a-k+i are present in the acoustic model, a-k+k and a-k+y are identified. Next, for the triphone phoneme model names identified in S36 that share the same source monophone phoneme model name, two ways of forming phoneme model set names are compared with respect to the environment-independent symbol parts of the source monophone phoneme model name: fixing the left phoneme to a particular phoneme and expressing the right phoneme as a set of phonemes, or fixing the right phoneme to a particular phoneme and expressing the left phoneme as a set of phonemes. In either case each set name collectively represents multiple triphone phoneme model names sharing the same source monophone, and the way that yields fewer phoneme model set names is selected (S37). For example, when the source phoneme model name of triphone phoneme model names such as a-k+k, a-k+y, i-k+k, i-k+y, ... is *-k+*, the set names a-k+{*,k,y}, i-k+{*,k,y}, ... obtained by fixing the left phoneme are compared with the set names {*,a,i}-k+k, {*,a,i}-k+y, ... obtained by fixing the right phoneme, and the alternative with fewer set names is selected. The number of set names depends on whether the left or the right phoneme is fixed because some phoneme combinations do not occur in a given language. Next, the model parameters of the source monophone phoneme model are copied as model parameters common to the triphone phoneme models of all phoneme model names constituting each set name selected in S37, so that a new triphone phoneme model is generated per phoneme model set name (S38). For example, the model parameters of the source phoneme model *-k+* are copied as the model parameters common to a-k+k and a-k+y, the phoneme model names constituting the set name a-k+{*,k,y}, and a phoneme model is generated for the set a-k+{*,k,y} as a unit. The environment-independent symbol "*" is retained in the set name generated in S37 so that, in S38, the source phoneme model name (*-k+* in the example above) can be identified by backing off from the set name and the model parameters of that source phoneme model can be copied.
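The selection in S37 between fixing the left phoneme and fixing the right phoneme can be sketched as follows. This is an illustrative sketch only: the function names, the use of plain sets of names for the existing model and for the names already covered by the logical bi/triphone generation means 421, and the tie-breaking rule (the left-fixed grouping is kept when both counts are equal) are assumptions not stated in the specification.

```python
from collections import defaultdict


def brace(members):
    """Format a phoneme set with the retained '*' first, e.g. {*,k,y}."""
    return "{" + ",".join(["*"] + members) + "}"


def mono_triphone_set_names(center, phonemes, existing, covered=frozenset()):
    """Return the smaller of the two groupings of missing triphone names.

    `existing` holds names already in the acoustic model; `covered` holds
    triphone names already identified from biphones (hypothetical inputs).
    """
    by_left = defaultdict(list)   # left phoneme  -> missing right phonemes
    by_right = defaultdict(list)  # right phoneme -> missing left phonemes
    # S35/S36: enumerate all triphones of the monophone and keep missing ones.
    for left in phonemes:
        for right in phonemes:
            name = f"{left}-{center}+{right}"
            if name in existing or name in covered:
                continue
            by_left[left].append(right)
            by_right[right].append(left)
    # S37: build both groupings and keep the one with fewer set names.
    fixed_left = [f"{l}-{center}+{brace(rs)}" for l, rs in by_left.items()]
    fixed_right = [f"{brace(ls)}-{center}+{r}" for r, ls in by_right.items()]
    # Tie-breaking toward fixed_left is an assumption of this sketch.
    return fixed_left if len(fixed_left) <= len(fixed_right) else fixed_right
```

In S38 each returned set name would then receive one copy of the parameters of the source monophone *-k+*, in the same way as in S34.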

The inter-word tri/bi/monophone shared acoustic model storage unit 130 then takes in and stores the inter-word bi/monophone-use acoustic model from the inter-word bi/monophone-use acoustic model storage unit 110, and also stores the triphone phoneme models newly generated by the logical bi/triphone generation means 421 and the logical mono/triphone generation means 422 (S25). The phoneme model set names of the newly generated triphone phoneme models (a-k+{*,k,y} in the example above) contain the source phoneme model name reached by back-off (a-k+* in the example above), but the phoneme models with these names are already included in the inter-word bi/monophone-use acoustic model taken in from the storage unit 110. To avoid duplication when storing, the environment-independent symbol "*" in each phoneme model set name therefore needs to be converted to another symbol (for example, any), giving a-k+{any,k,y} in the example above. To limit the increase in size, this other symbol (any in the example above) may ultimately be deleted.
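The renaming performed when storing (S25) might look like the short sketch below. The function name `store_shared_model`, the dictionary representation, and the assumption that "*" is always the first member inside the braces are illustrative choices, not details from the specification.

```python
def store_shared_model(base_model, new_models, placeholder="any"):
    """Merge new set-named models into a copy of the base model.

    The '*' inside each set name is replaced with `placeholder` so the set
    name does not duplicate the source model already present in `base_model`.
    Assumes '*' is the first member inside the braces (sketch convention).
    """
    shared = dict(base_model)
    for set_name, params in new_models.items():
        stored_name = set_name.replace("{*", "{" + placeholder, 1)
        shared[stored_name] = params
    return shared
```

Only the "*" inside the braces is rewritten, so a biphone set name such as {*,a,i}-k+* keeps its trailing environment-independent symbol.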

According to the acoustic model creation device 400 described above, back-off is prevented and, in addition, the phoneme models of the newly generated triphone phoneme model names share model parameters per phoneme model set name, so the increase in acoustic model size caused by adding new phoneme model names can be suppressed. In particular, when the source phoneme model name reached by back-off is a monophone, whichever of fixing the left phoneme or fixing the right phoneme yields fewer newly generated phoneme model names is selected, so the size increase can be suppressed further than in the third embodiment. Moreover, in the case of triphones, separating the model parameters for each phoneme model name reduces the amount of training data available to each phoneme model and may degrade speech recognition accuracy; the present invention avoids this problem as well.

FIG. 3 shows an example of the functional configuration of the acoustic model creation device 500 of the present invention, and FIG. 9 shows an example of its processing flow. The acoustic model creation device 500 includes the inter-word bi/monophone-use acoustic model storage unit 110, the logical triphone generation unit 320, a logical biphone generation unit 520, and the inter-word tri/bi/monophone shared acoustic model storage unit 130. In other words, it is the acoustic model creation device 300 of the third embodiment with the logical biphone generation unit 520 added. Biphone phoneme models that are not present in the inter-word bi/monophone-use acoustic model storage unit 110 can thereby be generated and added to the acoustic model, so back-off is prevented for biphones as well. Although the configuration described here adds the logical biphone generation unit 520 to the acoustic model creation device 300, the unit can be applied in the same way to the acoustic model creation device 400 of the fourth embodiment.

The logical biphone generation unit 520 first takes each monophone phoneme model present in the inter-word bi/monophone-use acoustic model, whose phoneme model name is a triplet of left phoneme, central phoneme, and right phoneme, and replaces either one of the environment-independent symbols contained as the left phoneme and the right phoneme with every phoneme in the phoneme system, thereby generating all biphone phoneme model names that can be generated (S41). For the monophone *-a+*, for example, names such as k-a+*, y-a+*, *-a+i, and *-a+n are generated. Some of the generable phoneme combinations do not occur in a given language (for example, k+k in Japanese), and such combinations may be excluded from generation. Next, among the phoneme model names generated in S41, those not present in the inter-word bi/monophone-use acoustic model are identified (S42). In the example given for S41, if k-a+* and *-a+n are present in the acoustic model, y-a+* and *-a+i are identified. Next, for the biphone phoneme model names identified in S42 that share the same source monophone phoneme model name, the environment-independent symbol part that was replaced in the source monophone phoneme model name is expressed as a set of phonemes, which yields a phoneme model set name that collectively represents the multiple biphone phoneme model names sharing that source (S43). For example, when the source phoneme model name of a-k+* and i-k+* is *-k+*, the phoneme model set name {*,a,i}-k+* is generated. Next, the model parameters of the source monophone phoneme model are copied as model parameters common to the phoneme models of all phoneme model names constituting the set name generated in S43, so that a new biphone phoneme model is generated per phoneme model set name (S44). For example, the model parameters of the source phoneme model *-k+* are copied as the model parameters common to a-k+* and i-k+*, the phoneme model names constituting the set name {*,a,i}-k+*, and a phoneme model is generated for the set {*,a,i}-k+* as a unit. The environment-independent symbol "*" is retained in the set name generated in S43 so that, in S44, the source phoneme model name (*-k+* in the example above) can be identified by backing off from the set name and the model parameters of that source phoneme model can be copied.
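Steps S41 to S43 for a single monophone can be sketched as below. The function name `logical_biphone_set_names` and the representation of the acoustic model as a set of names are assumptions for illustration; copying the monophone's parameters to each set name (S44) would follow the same pattern as S34 and is left out.

```python
def logical_biphone_set_names(center, phonemes, existing):
    """Return set names for the biphones of '*-center+*' missing in `existing`."""
    # S41/S42: replace one '*' at a time and keep the names that are absent.
    left_missing = [p for p in phonemes if f"{p}-{center}+*" not in existing]
    right_missing = [p for p in phonemes if f"*-{center}+{p}" not in existing]
    # S43: one set name per replaced side, with '*' retained in the set.
    set_names = []
    if left_missing:
        left_set = "{" + ",".join(["*"] + left_missing) + "}"
        set_names.append(f"{left_set}-{center}+*")
    if right_missing:
        right_set = "{" + ",".join(["*"] + right_missing) + "}"
        set_names.append(f"*-{center}+{right_set}")
    return set_names
```

For the monophone *-a+* with k-a+* and *-a+n already present and the hypothetical phoneme list i, k, n, y, the sketch groups the missing left contexts and the missing right contexts into one set name each.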

The inter-word tri/bi/monophone shared acoustic model storage unit 130 then additionally stores the biphone phoneme models newly generated by the logical biphone generation unit 520 (S25).

When an acoustic model that uses transition probabilities is used with a decoder that does not use them, a mismatch arises between acoustic model training and speech recognition, and recognition accuracy deteriorates. It is therefore conceivable to provide a transition probability conversion unit 610 that, before the triphone phoneme models are stored in the inter-word tri/bi/monophone shared acoustic model storage unit 130 of the acoustic model creation devices 100 to 500 of the first to fifth embodiments, converts any triphone phoneme model that uses transition probabilities into a phoneme model that does not use them or into a phoneme model with a predetermined transition probability value (for example, 0.5) (S51). When the constructed acoustic model uses transition probabilities, it can then be used by both decoders that use transition probabilities and decoders that do not, while speech recognition accuracy is maintained.
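The conversion in S51 can be sketched as follows. The data layout is hypothetical: each phoneme model is assumed to be a dict with an optional "trans" entry holding a transition matrix as nested lists, and replacing only the non-zero entries with the predetermined value is an interpretation made for this sketch rather than a rule given in the specification.

```python
import copy


def convert_transitions(model, fixed_value=0.5):
    """Return a copy of `model` with transition probabilities neutralized.

    fixed_value=None removes the transition matrix entirely (a model that
    does not use transition probabilities); otherwise every allowed
    (non-zero) transition is set to `fixed_value`.
    """
    converted = copy.deepcopy(model)
    for hmm in converted.values():
        if fixed_value is None:
            hmm.pop("trans", None)
        elif "trans" in hmm:
            hmm["trans"] = [
                [fixed_value if p > 0.0 else 0.0 for p in row]
                for row in hmm["trans"]
            ]
    return converted
```

The input model is left unchanged, so the original trained transition probabilities remain available if they are needed later.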

The model parameters of newly generated, untrained triphones and biphones are not necessarily appropriate, but high accuracy can be achieved by performing additional training.

It is therefore conceivable to provide an additional learning unit 710 that, before the phoneme models are stored in the inter-word tri/bi/monophone shared acoustic model storage unit 130 of the acoustic model creation devices 100 to 500 of the first to sixth embodiments, performs additional training on the newly generated phoneme models (S61).

Note that practicing the present invention creates and changes sharing relationships among phonemes, which changes the alignment that associates the training data with the phoneme models. To raise recognition accuracy further, it is therefore desirable to perform additional training on the existing phoneme models as well as on the newly generated ones. When the transition probability conversion unit 610 is provided, the transition probabilities may be ignored only during additional training.

When the acoustic model creation device of the present invention is implemented by a computer, the processing of the functions of the device and of each of its units is described by a program. The program is stored, for example, on a hard disk device, and at execution time the necessary programs and data are read into RAM (Random Access Memory). The CPU executes the loaded program, whereby each process is realized on the computer. At least part of the processing may instead be implemented in hardware.

The program in the present invention includes information that is provided for processing by an electronic computer and is equivalent to a program (such as data that is not a direct command to the computer but has the property of defining the computer's processing).

Claims

An acoustic model creation device comprising:
an inter-word bi/monophone-use acoustic model storage unit storing an inter-word bi/monophone-use acoustic model that includes arbitrary, respectively different numbers of triphone, biphone, and monophone phoneme models trained on inter-word triphone, biphone, and monophone training data, respectively;
a logical triphone generation unit that generates all triphone phoneme model names generable by replacing, with every phoneme in the phoneme system, the environment-independent symbols contained as the left phoneme and/or the right phoneme in the phoneme model names, each a triplet of left phoneme, central phoneme, and right phoneme, of the biphone and monophone phoneme models present in the inter-word bi/monophone-use acoustic model, identifies, among the generated phoneme model names, those not present in the inter-word bi/monophone-use acoustic model, and newly generates a triphone phoneme model for each identified phoneme model name by copying, as the model parameters of that triphone phoneme model, the model parameters of the biphone or monophone phoneme model that is the source of the identified name; and
an inter-word tri/bi/monophone shared acoustic model storage unit that takes in and stores the inter-word bi/monophone-use acoustic model and stores the triphone phoneme models newly generated by the logical triphone generation unit.

The acoustic model creation device according to claim 1,
further comprising a logical biphone generation unit that generates all biphone phoneme model names generable by replacing, with every phoneme in the phoneme system, either one of the environment-independent symbols contained as the left phoneme and the right phoneme in the phoneme model name, a triplet of left phoneme, central phoneme, and right phoneme, of each monophone phoneme model present in the inter-word bi/monophone-use acoustic model, identifies, among the generated phoneme model names, those not present in the inter-word bi/monophone-use acoustic model, and newly generates a biphone phoneme model for each identified phoneme model name by copying, as the model parameters of that biphone phoneme model, the model parameters of the monophone phoneme model that is the source of the identified name,
wherein the inter-word tri/bi/monophone shared acoustic model storage unit further stores the biphone phoneme models newly generated by the logical biphone generation unit.

An acoustic model creation device comprising:
an inter-word bi/monophone-use acoustic model storage unit storing an inter-word bi/monophone-use acoustic model that includes arbitrary, respectively different numbers of triphone, biphone, and monophone phoneme models trained on inter-word triphone, biphone, and monophone training data, respectively;
a logical triphone generation unit that generates all triphone phoneme model names generable by replacing, with every phoneme in the phoneme system, the environment-independent symbols contained as the left phoneme and/or the right phoneme in the phoneme model names, each a triplet of left phoneme, central phoneme, and right phoneme, of the biphone and monophone phoneme models present in the inter-word bi/monophone-use acoustic model, identifies, among the generated phoneme model names, those not present in the inter-word bi/monophone-use acoustic model, generates, for the identified triphone phoneme model names that share the same source biphone or monophone phoneme model name, a phoneme model set name that collectively represents the multiple triphone phoneme model names sharing that source by expressing the environment-independent symbol part of the source biphone or monophone phoneme model name as a set of phonemes, and newly generates a triphone phoneme model per phoneme model set name by copying the model parameters of the source biphone or monophone phoneme model as model parameters common to the phoneme models of the phoneme model names constituting the set name; and
an inter-word tri/bi/monophone shared acoustic model storage unit that takes in and stores the inter-word bi/monophone-use acoustic model and stores the triphone phoneme models newly generated by the logical triphone generation unit.

The acoustic model creation device according to claim 3,
wherein the logical triphone generation unit comprises:
logical bi/triphone generation means that generates all triphone phoneme model names generable by replacing, with every phoneme in the phoneme system, the environment-independent symbol contained as the left phoneme or the right phoneme in the phoneme model name, a triplet of left phoneme, central phoneme, and right phoneme, of each biphone phoneme model present in the inter-word bi/monophone-use acoustic model, identifies, among the generated phoneme model names, those not present in the inter-word bi/monophone-use acoustic model, generates, for the identified triphone phoneme model names that share the same source biphone phoneme model name, a phoneme model set name that collectively represents the multiple triphone phoneme model names sharing that source by expressing the environment-independent symbol part of the source biphone phoneme model name as a set of phonemes, and newly generates a triphone phoneme model per phoneme model set name by copying the model parameters of the source biphone phoneme model as model parameters common to the phoneme models of the phoneme model names constituting the generated set name; and
logical mono/triphone generation means that generates all triphone phoneme model names generable by replacing, with every phoneme in the phoneme system, the environment-independent symbols contained as the left phoneme and the right phoneme in the phoneme model name, a triplet of left phoneme, central phoneme, and right phoneme, of each monophone phoneme model present in the inter-word bi/monophone-use acoustic model, identifies, among the generated phoneme model names, the triphone phoneme model names that are not present in the inter-word bi/monophone-use acoustic model and have not been identified by the logical bi/triphone generation means, selects, for the identified triphone phoneme model names that share the same source monophone phoneme model name, whichever yields fewer phoneme model set names between a case in which, for the environment-independent symbol parts of the source monophone phoneme model name, the left phoneme is fixed to a particular phoneme and the right phoneme is expressed as a set of phonemes and a case in which the right phoneme is fixed to a particular phoneme and the left phoneme is expressed as a set of phonemes, each phoneme model set name collectively representing multiple triphone phoneme model names sharing the same source monophone phoneme model name, and newly generates a triphone phoneme model per phoneme model set name by copying the model parameters of the source monophone phoneme model as model parameters common to the triphone phoneme models of the phoneme model names constituting the selected set name.

The acoustic model creation device according to claim 3 or 4,
further comprising a logical biphone generation unit that generates all biphone phoneme model names generable by replacing, with every phoneme in the phoneme system, either one of the environment-independent symbols contained as the left phoneme and the right phoneme in the phoneme model name, a triplet of left phoneme, central phoneme, and right phoneme, of each monophone phoneme model present in the inter-word bi/monophone-use acoustic model, identifies, among the generated phoneme model names, those not present in the inter-word bi/monophone-use acoustic model, generates, for the identified biphone phoneme model names that share the same source monophone phoneme model name, a phoneme model set name that collectively represents the multiple biphone phoneme model names sharing that source by expressing said one environment-independent symbol part of the source monophone phoneme model name as a set of phonemes, and newly generates a biphone phoneme model per phoneme model set name by copying the model parameters of the source monophone phoneme model as model parameters common to the phoneme models of the phoneme model names constituting the generated set name,
wherein the inter-word tri/bi/monophone shared acoustic model storage unit further stores the biphone phoneme models newly generated by the logical biphone generation unit.

The acoustic model creation device according to any one of claims 1 to 5,
further comprising a transition probability conversion unit that, before the triphone phoneme models are stored in the inter-word tri/bi/monophone shared acoustic model storage unit, converts any triphone phoneme model that uses transition probabilities into a phoneme model that does not use transition probabilities or into a phoneme model having a predetermined transition probability value.

The acoustic model creation device according to any one of claims 1 to 6,
further comprising an additional learning unit that performs additional training on the newly generated phoneme models before the phoneme models are stored in the inter-word tri/bi/monophone shared acoustic model storage unit.

An acoustic model creation method that, using an inter-word bi / monophone use acoustic model storage unit storing an inter-word bi / monophone use acoustic model that includes an arbitrary number of distinct triphone, biphone, and monophone phoneme models trained respectively on inter-word triphone, biphone, and monophone training data, executes:
a triphone all-phoneme model name generation step of generating all triphone phoneme model names that can be generated by replacing, with every phoneme in the phoneme system, the environment-independent symbols included as the left phoneme and/or the right phoneme of the phoneme model names, each consisting of a triplet of left phoneme, central phoneme, and right phoneme, of the biphone and monophone phoneme models present in the inter-word bi / monophone use acoustic model;
a triphone phoneme model name specifying step of specifying, among the generated phoneme model names, those that do not exist in the inter-word bi / monophone use acoustic model;
a triphone phoneme model generation step of newly generating a triphone phoneme model for each specified phoneme model name by copying the model parameters of the biphone or monophone phoneme model that is the generation source of that name as the model parameters of the triphone phoneme model of that name; and
an inter-word tri / bi / monophone shared acoustic model storage step in which an inter-word tri / bi / monophone shared acoustic model storage unit takes in and stores the inter-word bi / monophone use acoustic model and also stores the triphone phoneme models newly generated by the logical triphone generation unit.
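
The name generation, name specifying, and parameter copying steps of this method can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions, not the claimed implementation: the `*` marker for the environment-independent symbol, the tuple representation of phoneme model names, the function names, and the rule that biphone sources take precedence over monophone sources when both could generate the same triphone name are all hypothetical choices made for the example.

```python
import copy
from itertools import product

ENV_INDEP = "*"  # assumed marker for an environment-independent context slot


def expand_to_triphones(name, phonemes):
    """Return every triphone name obtained by filling the '*' slots of a
    (left, center, right) biphone or monophone name with all phonemes."""
    left, center, right = name
    lefts = phonemes if left == ENV_INDEP else [left]
    rights = phonemes if right == ENV_INDEP else [right]
    return [(l, center, r) for l, r in product(lefts, rights)]


def build_shared_model(bi_mono_model, phonemes):
    """Copy source parameters to every triphone name missing from the model.

    bi_mono_model maps (left, center, right) names to model parameters.
    """
    shared = dict(bi_mono_model)  # take in the existing model unchanged
    sources = [n for n in bi_mono_model if ENV_INDEP in (n[0], n[2])]
    # Assumption of this sketch: biphones (one '*') are handled before
    # monophones (two '*'), so the more specific source wins.
    sources.sort(key=lambda n: (n[0], n[2]).count(ENV_INDEP))
    for source in sources:
        for tri in expand_to_triphones(source, phonemes):
            if tri not in shared:  # name not present yet: generate by copying
                shared[tri] = copy.deepcopy(bi_mono_model[source])
    return shared
```

With a two-phoneme system and a model holding one monophone, one biphone, and one trained triphone for the same central phoneme, the trained triphone is kept as is and the remaining triphone names receive copies of the biphone or monophone parameters.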

The acoustic model creation method according to claim 8,
further executing:
a biphone all-phoneme model name generation step of generating all biphone phoneme model names that can be generated by replacing, with every phoneme in the phoneme system, either one of the environment-independent symbols included as the left phoneme and the right phoneme of the phoneme model name, consisting of a triplet of left phoneme, central phoneme, and right phoneme, of each monophone phoneme model present in the inter-word bi / monophone use acoustic model;
a biphone phoneme model name specifying step of specifying, among the generated phoneme model names, those that do not exist in the inter-word bi / monophone use acoustic model; and
a biphone phoneme model generation step of newly generating a biphone phoneme model for each specified phoneme model name by copying the model parameters of the monophone phoneme model that is the generation source of that name as the model parameters of the biphone phoneme model of that name,
wherein the inter-word tri / bi / monophone shared acoustic model storage step further stores the biphone phoneme models newly generated in the biphone phoneme model generation step.
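
The biphone generation described in this claim can be illustrated with a short Python sketch. It is a hedged example only: the `*` marker, the tuple form of the names, and the function name `add_biphones_from_monophones` are assumptions introduced for illustration and do not come from the patent text.

```python
import copy

ENV_INDEP = "*"  # assumed marker for an environment-independent context slot


def add_biphones_from_monophones(model, phonemes):
    """Generate missing biphones by filling exactly one '*' of each monophone.

    model maps (left, center, right) names to model parameters.
    """
    shared = dict(model)
    for (left, center, right), params in model.items():
        if left != ENV_INDEP or right != ENV_INDEP:
            continue  # only monophones act as generation sources here
        # Replace either the left or the right symbol, never both.
        candidates = [(p, center, ENV_INDEP) for p in phonemes]
        candidates += [(ENV_INDEP, center, p) for p in phonemes]
        for biphone in candidates:
            if biphone not in model:  # specify names absent from the model
                shared[biphone] = copy.deepcopy(params)
    return shared
```

Biphones that were already trained are left untouched, so only the names absent from the original model receive copied monophone parameters.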

An acoustic model creation method that, using an inter-word bi / monophone use acoustic model storage unit storing an inter-word bi / monophone use acoustic model that includes an arbitrary number of distinct triphone, biphone, and monophone phoneme models trained respectively on inter-word triphone, biphone, and monophone training data, executes:
a triphone all-phoneme model name generation step of generating all triphone phoneme model names that can be generated by replacing, with every phoneme in the phoneme system, the environment-independent symbols included as the left phoneme and/or the right phoneme of the phoneme model names, each consisting of a triplet of left phoneme, central phoneme, and right phoneme, of the biphone and monophone phoneme models present in the inter-word bi / monophone use acoustic model;
a triphone phoneme model name specifying step of specifying, among the generated phoneme model names, those that do not exist in the inter-word bi / monophone use acoustic model;
a triphone phoneme model set name generation step of generating, for those specified triphone phoneme model names that share the same source biphone or monophone phoneme model name, a phoneme model set name that collectively represents the plurality of triphone phoneme model names sharing that source phoneme model name, by expressing the environment-independent symbol part of the source biphone or monophone phoneme model name as a set of phonemes;
a triphone phoneme model generation step of newly generating triphone phoneme models in units of the phoneme model set name by copying the model parameters of the source biphone or monophone phoneme model as model parameters common to the phoneme models of all phoneme model names constituting the generated phoneme model set name; and
an inter-word tri / bi / monophone shared acoustic model storage step in which an inter-word tri / bi / monophone shared acoustic model storage unit takes in and stores the inter-word bi / monophone use acoustic model and also stores the triphone phoneme models newly generated in the triphone phoneme model generation step.
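
The idea of a phoneme model set name can be illustrated as follows. This Python sketch is an assumption-laden example rather than the patented procedure: it covers biphone sources only (monophone sources have two open slots and need a choice of which side to fix), it represents the set as a `frozenset` in the open slot, and the `*` marker and function name are hypothetical.

```python
import copy

ENV_INDEP = "*"  # assumed marker for an environment-independent context slot


def biphone_source_set_names(model, phonemes):
    """Build one set-style triphone name per biphone source.

    The open slot of the source name is replaced by the set of phonemes for
    which no triphone exists yet, and the source parameters are copied once
    per set name instead of once per triphone name.
    """
    out = {}
    for (left, center, right), params in model.items():
        open_slots = (left == ENV_INDEP) + (right == ENV_INDEP)
        if open_slots != 1:
            continue  # skip triphones and monophones in this sketch
        if right == ENV_INDEP:
            missing = frozenset(p for p in phonemes
                                if (left, center, p) not in model)
            set_name = (left, center, missing)
        else:
            missing = frozenset(p for p in phonemes
                                if (p, center, right) not in model)
            set_name = (missing, center, right)
        if missing:  # nothing to generate when every triphone already exists
            out[set_name] = copy.deepcopy(params)
    return out
```

Compared with one entry per triphone name, a single entry per set name keeps the number of stored names and parameter copies proportional to the number of source models.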

The acoustic model creation method according to claim 10,
wherein the logical triphone generation step includes:
a first triphone all-phoneme model name generation sub-step of generating all triphone phoneme model names that can be generated by replacing, with every phoneme in the phoneme system, the environment-independent symbol included as the left phoneme or the right phoneme of the phoneme model name, consisting of a triplet of left phoneme, central phoneme, and right phoneme, of each biphone phoneme model present in the inter-word bi / monophone use acoustic model;
a first triphone phoneme model name specifying sub-step of specifying, among the phoneme model names generated in the first triphone all-phoneme model name generation sub-step, those that do not exist in the inter-word bi / monophone use acoustic model;
a first triphone phoneme model set name generation sub-step of generating, for those triphone phoneme model names specified in the first triphone phoneme model name specifying sub-step that share the same source biphone phoneme model name, a phoneme model set name that collectively represents the plurality of triphone phoneme model names sharing that source biphone phoneme model name, by expressing the environment-independent symbol part of the source biphone phoneme model name as a set of phonemes;
a first triphone phoneme model generation sub-step of newly generating triphone phoneme models in units of the phoneme model set name by copying the model parameters of the source biphone phoneme model as model parameters common to the phoneme models of all phoneme model names constituting the phoneme model set name generated in the first triphone phoneme model set name generation sub-step;
a second triphone all-phoneme model name generation sub-step of generating all triphone phoneme model names that can be generated by replacing, with every phoneme in the phoneme system, the environment-independent symbols included as the left phoneme and the right phoneme of the phoneme model name, consisting of a triplet of left phoneme, central phoneme, and right phoneme, of each monophone phoneme model present in the inter-word bi / monophone use acoustic model;
a second triphone phoneme model name specifying sub-step of specifying, among the phoneme model names generated in the second triphone all-phoneme model name generation sub-step, the triphone phoneme model names that do not exist in the inter-word bi / monophone use acoustic model and have not been specified by the logical bi / triphone generation unit;
a second triphone phoneme model set name generation sub-step of, for those triphone phoneme model names specified in the second triphone phoneme model name specifying sub-step that share the same source monophone phoneme model name, generating both (i) phoneme model set names in which, of the environment-independent symbol parts of the source monophone phoneme model name, the left phoneme is fixed to a particular phoneme and the right phoneme is expressed as a set of phonemes, and (ii) phoneme model set names in which the right phoneme is fixed to a particular phoneme and the left phoneme is expressed as a set of phonemes, each set name collectively representing a plurality of triphone phoneme model names sharing the same source monophone phoneme model name, and selecting whichever of (i) and (ii) yields the smaller number of generated phoneme model set names; and
a second triphone phoneme model generation sub-step of newly generating triphone phoneme models in units of the phoneme model set name by copying the model parameters of the source monophone phoneme model as model parameters common to the phoneme models of all triphone phoneme model names constituting the phoneme model set names selected in the second triphone phoneme model set name generation sub-step.
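
The selection between left-fixed and right-fixed set names for a monophone source can be sketched in Python as below. The sketch rests on assumptions that are not in the claim: the argument `covered` is a hypothetical set holding every triphone name that already exists or was generated from a biphone source, sets are represented as `frozenset` values, and a tie between the two groupings is resolved in favor of the left-fixed one.

```python
def monophone_set_names(center, covered, phonemes):
    """Group the uncovered triphones of one monophone source into set names.

    Two groupings are built: one set name per fixed left phoneme (right slot
    is a set) and one per fixed right phoneme (left slot is a set). The
    grouping that needs fewer set names is returned.
    """
    left_fixed = {}   # left phoneme  -> set of right phonemes still missing
    right_fixed = {}  # right phoneme -> set of left phonemes still missing
    for l in phonemes:
        for r in phonemes:
            if (l, center, r) in covered:
                continue  # already present or generated from a biphone
            left_fixed.setdefault(l, set()).add(r)
            right_fixed.setdefault(r, set()).add(l)
    by_left = [(l, center, frozenset(rs)) for l, rs in left_fixed.items()]
    by_right = [(frozenset(ls), center, r) for r, ls in right_fixed.items()]
    # Assumption: on a tie the left-fixed grouping is kept.
    return by_left if len(by_left) <= len(by_right) else by_right
```

For example, if every triphone with left context `a` or `b` is already covered, the three remaining triphones with left context `c` collapse into a single left-fixed set name, whereas the right-fixed grouping would need three.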

The acoustic model creation method according to claim 10 or 11,
further executing:
a biphone all-phoneme model name generation step of generating all biphone phoneme model names that can be generated by replacing, with every phoneme in the phoneme system, either one of the environment-independent symbols included as the left phoneme and the right phoneme of the phoneme model name, consisting of a triplet of left phoneme, central phoneme, and right phoneme, of each monophone phoneme model present in the inter-word bi / monophone use acoustic model;
a biphone phoneme model name specifying step of specifying, among the phoneme model names generated in the biphone all-phoneme model name generation step, those that do not exist in the inter-word bi / monophone use acoustic model;
a biphone phoneme model set name generation step of generating, for those phoneme model names specified in the biphone phoneme model name specifying step that share the same source monophone phoneme model name, a phoneme model set name that collectively represents the plurality of biphone phoneme model names sharing that source monophone phoneme model name, by expressing either one of the environment-independent symbol parts of the source monophone phoneme model name as a set of phonemes; and
a biphone phoneme model generation step of newly generating biphone phoneme models in units of the phoneme model set name by copying the model parameters of the source monophone phoneme model as model parameters common to the phoneme models of all phoneme model names constituting the phoneme model set name generated in the biphone phoneme model set name generation step,
wherein the inter-word tri / bi / monophone shared acoustic model storage step further stores the biphone phoneme models newly generated in the biphone phoneme model generation step.

The acoustic model creation method according to any one of claims 8 to 12,
wherein the inter-word tri / bi / monophone shared acoustic model storage step further includes a transition probability conversion step of converting, before the triphone phoneme model is stored in the inter-word tri / bi / monophone shared acoustic model storage unit, the triphone phoneme model, if it is a phoneme model that uses transition probabilities, into a phoneme model that does not use transition probabilities or into a phoneme model having predetermined transition probability values.
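
A minimal Python sketch of such a transition probability conversion is given below. The data layout is entirely hypothetical: a phoneme model is assumed to be a dictionary whose optional `transitions` entry is a nested list of probabilities, and the predetermined value is assumed to be applied to every transition that was allowed (nonzero) in the original model.

```python
def convert_transition_probabilities(phoneme_model, predetermined=None):
    """Return a converted copy of a phoneme model.

    With predetermined=None the transition probabilities are dropped, giving
    a model that does not use them. Otherwise every allowed (nonzero)
    transition is set to the predetermined value. The input is not modified.
    """
    converted = dict(phoneme_model)  # shallow copy; other parameters are kept
    transitions = converted.pop("transitions", None)
    if transitions is None or predetermined is None:
        return converted
    converted["transitions"] = [
        [predetermined if p > 0.0 else 0.0 for p in row] for row in transitions
    ]
    return converted
```

Calling the function without a second argument yields the variant that does not use transition probabilities, and passing a value such as `0.5` yields the variant with a fixed transition probability.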

The acoustic model creation method according to any one of claims 8 to 13,
further executing, in the inter-word tri / bi / monophone shared acoustic model storage step, an additional learning step of performing additional learning on a newly generated phoneme model before that phoneme model is stored in the inter-word tri / bi / monophone shared acoustic model storage unit.

A program for causing a computer to execute the acoustic model creation method according to any one of claims 8 to 14.