JP6903613B2

JP6903613B2 - Speech recognition device, speech recognition method and program

Info

Publication number: JP6903613B2
Application number: JP2018168708A
Authority: JP
Inventors: 寧丁
Original assignee: Toshiba Corp
Current assignee: Toshiba Corp
Priority date: 2018-09-10
Filing date: 2018-09-10
Publication date: 2021-07-14
Anticipated expiration: 2038-09-10
Also published as: JP2020042130A

Description

Embodiments of the present invention relate to a speech recognition device, a speech recognition method, and a program.

Speech recognition techniques that recognize speech data using an acoustic model and a language model and output the text of the utterances contained in the speech data are conventionally known. The acoustic model is trained in advance on a large amount of data (for example, several hundred hours or more). However, it is difficult to train an acoustic model that achieves a high recognition rate (for example, 85% or more) under all conditions. For example, when an acoustic model trained on speech data recorded in a clean environment is used, the recognition rate degrades in a highly reverberant conference room. One effective way to prevent this degradation is to adapt the acoustic model.

Japanese Patent No. 5852550

However, in the conventional techniques, adapting the acoustic model also has adverse effects. For example, when utterances with the same content are repeated, adaptation makes that utterance easier to recognize but makes other utterances harder to recognize. As another example, speech data contains both speech and non-speech; when the non-speech portion is large, adaptation makes non-speech recognition results more likely and speech recognition results less likely. The problem to be solved by the present invention is to provide a speech recognition device, a speech recognition method, and a program capable of suppressing the adverse effects of acoustic model adaptation.

The speech recognition device of the embodiment includes a generation unit, a determination unit, a selection unit, and an adaptation unit. The generation unit recognizes speech data using a language model and a first acoustic model, and generates a label that identifies the utterance contained in the speech data. The determination unit uses the label to identify the number of pieces of speech data containing the same utterance, and determines the weight to be given to the speech data according to that number. The selection unit selects speech data based on the weight. The adaptation unit generates a second acoustic model by adapting the first acoustic model using the speech data selected by the selection unit.

FIG. 1 is a block diagram showing an example of the functional configuration of the speech recognition device of the first embodiment.
FIG. 2 is a diagram showing an example of the label information of the first embodiment.
FIG. 3 is a flowchart showing an example of the operation method of the speech recognition device of the first embodiment.
FIG. 4 is a block diagram showing an example of the functional configuration of the speech recognition device of the second embodiment.
FIG. 5 is a diagram showing an example of the speech data of the second embodiment.
FIG. 6 is a block diagram showing an example of the functional configuration of the speech recognition device of the third embodiment.
FIG. 7 is a block diagram showing an example of the functional configuration of the speech recognition device of the fourth embodiment.
FIG. 8 is a block diagram showing an example of the functional configuration of the speech recognition device of the fifth embodiment.
FIG. 9 is a diagram showing an example of the hardware configuration of the speech recognition devices of the first to fifth embodiments.

Embodiments of the speech recognition device, the speech recognition method, and the program are described in detail below with reference to the accompanying drawings.

First, adaptation of the acoustic model is described. An acoustic model is adapted by taking a trained acoustic model and retraining it with adaptation data. Hereinafter, the initially trained acoustic model is referred to as the base acoustic model (first acoustic model), and the adapted acoustic model is referred to as the adaptive acoustic model (second acoustic model).

There are broadly two ways to adapt an acoustic model: supervised adaptation and unsupervised adaptation. Supervised adaptation uses adaptation data that contains both the speech data and the correct labels for the speech data. Unsupervised adaptation uses adaptation data that contains only the speech data (there are no correct labels).

Supervised adaptation works well because correct labels are available, but it is costly because the correct labels must be created, for example by transcription.

Unsupervised adaptation, on the other hand, is low in cost because no correct labels need to be created. In unsupervised adaptation, the speech data is recognized and the speech recognition result is used as the label. Because errors in the speech recognition result can adversely affect adaptation, higher speech recognition accuracy is basically better. In the conventional unsupervised adaptation method, speech is recognized using a language model and a base acoustic model, and a label, a confidence score, and an acoustic likelihood are output. The conventional method then selects speech data with higher confidence and lower acoustic likelihood and uses it to adapt the acoustic model.

(First Embodiment)
First, an example of the functional configuration of the speech recognition device 10 of the first embodiment is described.

[Example of functional configuration]
FIG. 1 is a diagram showing an example of the functional configuration of the speech recognition device 10 of the first embodiment. The speech recognition device 10 of the first embodiment includes a generation unit 1, a determination unit 2, a selection unit 3, and an adaptation unit 4. Some or all of the functions of the speech recognition device 10 may be implemented in software (a program) or in hardware.

The speech recognition device 10 of the first embodiment also stores a language model 101, a base acoustic model 102, and an adaptive acoustic model 103. The language model 101 is data that models the linguistic features of speech. The base acoustic model 102 and the adaptive acoustic model 103 are data that model the acoustic features of speech. The base acoustic model 102 is the initially trained data. The adaptive acoustic model 103 is data obtained by retraining the base acoustic model 102 with the adaptation data. The storage unit that stores the language model 101, the base acoustic model 102, and the adaptive acoustic model 103 may be provided in an external device.

The generation unit 1 recognizes speech data using the language model 101 and the base acoustic model 102, and generates a label. The speech data is, for example, data segmented into individual utterances. The label is data converted from the speech recognition result of the speech data, and is information that identifies the utterance contained in the speech data.

The determination unit 2 uses the label to identify the number of pieces of speech data containing the same utterance, and determines the weight to be given to the speech data according to that number.

The label and the number of occurrences of the label are stored in the speech recognition device 10, for example as the label information shown in FIG. 2.

FIG. 2 is a diagram showing an example of the label information of the first embodiment. The label information of the first embodiment includes the speech data, the speech recognition result, the label, the count, and the weight.

The speech recognition result is the result of recognizing the speech data. In the example of FIG. 2, the label is the speech recognition result converted into hiragana. The label is not limited to hiragana and may be, for example, romanized text.

The count indicates the number of occurrences of the label. For example, utterance-1, utterance-3, and utterance-5 have the same label. When the label of utterance-1 is generated, the count of the label becomes 1. When the label of utterance-3 is generated, the count becomes 2. When the label of utterance-5 is generated, the count becomes 3.

The weight indicates the weight of the label. In the example of FIG. 2, the larger the count of a label, the smaller its weight.

The generation unit 1 determines the weight of the label by, for example, the following equation (1).

$$\mu = e^{1 - x} \qquad (1)$$

Here, μ is the weight and x is the count. In the example of FIG. 2, the weights are determined by equation (1). For example, the weights of the labels of utterance-1, utterance-3, and utterance-5 are 1.00, 0.37, and 0.14, respectively. The weights of the labels of utterance-2, utterance-4, and utterance-6 are each 1.00.

The equation for determining the weight is not limited to equation (1) above; another decreasing function may be used.
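As a minimal sketch of how the running count and equation (1) interact, the following Python snippet assigns a count and a weight to each utterance in arrival order. The function name `assign_weights` and the placeholder labels are assumptions made for illustration; only the formula μ = e^(1−x) and the running-count behavior come from the description above.

```python
import math
from collections import defaultdict


def assign_weights(labels):
    """Return a (count, weight) pair per utterance, in input order.

    The count x is the number of times the label has occurred so far
    (including the current utterance); the weight is mu = e^(1 - x).
    """
    seen = defaultdict(int)
    result = []
    for label in labels:
        seen[label] += 1
        x = seen[label]
        result.append((x, math.exp(1 - x)))
    return result


# Placeholder labels: utterances 1, 3 and 5 share a label, the rest are unique.
labels = ["a", "b", "a", "c", "a", "d"]
weights = [round(w, 2) for _, w in assign_weights(labels)]
print(weights)  # [1.0, 1.0, 0.37, 1.0, 0.14, 1.0]
```

The first occurrence of any label always receives weight 1, and each repeat shrinks the weight by a factor of e.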

Returning to FIG. 1, the selection unit 3 selects the speech data (utterances) to be used as adaptation data based on the weights included in the label information generated by the generation unit 1. When the adaptation data contains multiple utterances with the same content, adaptation raises the posterior probability of that utterance and makes it easier to recognize. In that case, however, the posterior probabilities of other utterances fall, which makes those utterances harder to recognize.

The selection unit 3 therefore compares the weight of each utterance with a weight threshold and selects, as adaptation data, the utterances whose weight is larger than the weight threshold. This suppresses the adverse effects on speech recognition performed with the adaptive acoustic model 103 generated from the adaptation data.

The weight threshold is determined by, for example, the following equation (2).

$$\theta = e^{1 - \alpha n} \qquad (2)$$

Here, θ is the weight threshold, α is the utterance coefficient, and n is the total number of utterances. In other words, an utterance is selected as adaptation data when its count x is smaller than α times the total number of utterances n (x < αn).
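The equivalence between the weight comparison and the count comparison follows from the exponential function being strictly increasing. A short derivation, using only the symbols of equations (1) and (2):

```latex
\begin{align*}
\mu > \theta
  &\iff e^{1 - x} > e^{1 - \alpha n} \\
  &\iff 1 - x > 1 - \alpha n \\
  &\iff x < \alpha n
\end{align*}
```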

The utterance coefficient α is, for example, 0.2. In the example of FIG. 2, the total number of utterances n is 6, so the weight threshold θ is e^(1−1.2) ≈ 0.82. The weights of utterance-1, utterance-2, utterance-4, and utterance-6 (each 1.00) are larger than the weight threshold θ, so these utterances are selected as adaptation data by the selection unit 3. The weights of utterance-3 and utterance-5 (0.37 and 0.14) are smaller than the weight threshold θ, so these utterances are not selected as adaptation data by the selection unit 3.

Although the first embodiment is described with an utterance coefficient of 0.2, the utterance coefficient may be set to another value of 1 or less as needed. The weight threshold θ may also be determined from the absolute number of utterances (the total number of utterances n) instead of the proportion αn of the total number of utterances n. In this case, αn in equation (2) above is replaced with n.
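The threshold test of equation (2) can be sketched in a few lines of Python. The function names `weight_threshold` and `select_by_weight` are hypothetical; the weights are the six values stated for FIG. 2, and α = 0.2 is the example value from the text.

```python
import math


def weight_threshold(n_utterances, alpha=0.2):
    """Equation (2): theta = e^(1 - alpha * n)."""
    return math.exp(1 - alpha * n_utterances)


def select_by_weight(weights, alpha=0.2):
    """Return 1-based indices of utterances whose weight exceeds theta."""
    theta = weight_threshold(len(weights), alpha)
    return [i for i, w in enumerate(weights, start=1) if w > theta]


# Weights of utterance-1 .. utterance-6 as stated for FIG. 2.
weights = [1.00, 1.00, 0.37, 1.00, 0.14, 1.00]
theta = weight_threshold(len(weights))
selected = select_by_weight(weights)
print(round(theta, 2), selected)  # 0.82 [1, 2, 4, 6]
```

Note that with α = 0.2 the threshold exceeds 1 whenever n is less than 5, in which case no utterance passes; a smaller data set would need a larger α.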

The adaptation unit 4 generates the adaptive acoustic model 103 by adapting the base acoustic model 102 using the adaptation data selected by the selection unit 3. Specifically, the base acoustic model 102 is adapted by optimizing its parameters using the adaptation data. Methods for adapting the base acoustic model 102 include, for example, methods that use a DNN (Deep Neural Network), a CNN (Convolutional Neural Network), or an RNN (Recurrent Neural Network). The adaptive acoustic model 103 may be stored in a storage unit external to the speech recognition device 10.

[Example of operation method]
FIG. 3 is a flowchart showing an example of the operation method of the speech recognition device 10 of the first embodiment. First, the generation unit 1 recognizes speech data using the language model 101 and the base acoustic model 102 (step S1). Next, the generation unit 1 generates a label that identifies the utterance contained in the speech data recognized in step S1 (step S2).

Next, the determination unit 2 uses the label to identify the number of pieces of speech data containing the same utterance, and determines the weight to be given to the speech data according to that number (step S3). Next, the selection unit 3 selects the speech data to be used as adaptation data based on the weight (step S4). Next, the adaptation unit 4 generates the adaptive acoustic model 103 by adapting the base acoustic model 102 using the speech data (adaptation data) selected by the selection unit 3 (step S5).
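Steps S1 to S5 can be read as a single pipeline. The sketch below is only an illustration under stated assumptions: `recognize`, `to_label`, and `adapt` are hypothetical callables standing in for the recognizer (language model 101 plus base acoustic model 102), the label conversion, and the adaptation step, none of which are specified at this level of detail in the text.

```python
import math


def run_adaptation(speech_items, recognize, to_label, adapt, alpha=0.2):
    """Sketch of steps S1 to S5; all callables are hypothetical stand-ins."""
    # S1: recognize each utterance.
    results = [recognize(item) for item in speech_items]
    # S2: convert each recognition result into a label.
    labels = [to_label(result) for result in results]
    # S3: running count per label, then weight mu = e^(1 - x).
    counts = {}
    weights = []
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
        weights.append(math.exp(1 - counts[label]))
    # S4: keep utterances whose weight exceeds theta = e^(1 - alpha * n).
    theta = math.exp(1 - alpha * len(speech_items))
    selected = [item for item, w in zip(speech_items, weights) if w > theta]
    # S5: adapt the base model on the selected data.
    return adapt(selected)


# Toy stand-ins so the sketch runs end to end.
transcripts = {"u1": "Hello", "u2": "Thanks", "u3": "hello",
               "u4": "Goodbye", "u5": "HELLO", "u6": "Yes"}
model = run_adaptation(
    list(transcripts),
    recognize=transcripts.get,
    to_label=str.lower,
    adapt=lambda data: {"adapted_on": data},
)
print(model)  # {'adapted_on': ['u1', 'u2', 'u4', 'u6']}
```

Passing the three stages as callables keeps the selection logic independent of any particular recognizer or training framework.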

As described above, in the speech recognition device 10 of the first embodiment, the generation unit 1 recognizes speech data using the language model 101 and the base acoustic model 102 (first acoustic model), and generates a label that identifies the utterance contained in the speech data. The determination unit 2 uses the label to identify the number of pieces of speech data containing the same utterance, and determines the weight to be given to the speech data according to that number. The selection unit 3 selects speech data based on the weight. The adaptation unit 4 then generates the adaptive acoustic model 103 (second acoustic model) by adapting the base acoustic model 102 (first acoustic model) using the speech data (adaptation data) selected by the selection unit 3.

The speech recognition device 10 of the first embodiment can thereby suppress the adverse effects on speech recognition that arise when the acoustic model is adapted.

(Second Embodiment)
Next, the second embodiment is described. Descriptions that are the same as in the first embodiment are omitted.

The larger the non-speech portion of the adaptation data, the higher the probability of non-speech (and the lower the probability of speech) after adaptation with that data, so speech is more often recognized as non-speech. Conversely, the smaller the non-speech portion of the adaptation data, the lower the probability of non-speech (and the higher the probability of speech) after adaptation, so non-speech is more often recognized as speech.

In supervised learning, each utterance is cut out of the speech data manually, so the amount of non-speech data can be controlled. In unsupervised learning, utterances are basically cut out automatically by speech segment detection such as VAD (voice activity detection), so the amount of non-speech data is difficult to control.

The second embodiment describes a configuration that can suppress the adverse effects of adaptation even when the speech data contains a large speech (or non-speech) portion.

[Example of functional configuration]
FIG. 4 is a block diagram showing an example of the functional configuration of the speech recognition device 10-2 of the second embodiment. The speech recognition device 10-2 of the second embodiment includes a generation unit 1, a selection unit 3-2, an adaptation unit 4, and a calculation unit 5. The generation unit 1 and the adaptation unit 4 are the same as in the first embodiment, so their description is omitted.

The calculation unit 5 uses the label generated by the generation unit 1 to calculate the ratio between the speech frames and the non-speech frames contained in the speech data.

FIG. 5 is a diagram showing an example of the speech data of the second embodiment. The example of FIG. 5 shows speech data containing 20 frames. The 1st, 2nd, 18th, 19th, and 20th frames are examples of non-speech frames; "sil" is an abbreviation of "silence". The 3rd to 17th frames are speech frames. The utterance contained in the speech data of FIG. 5 is "おはようございます" ("Good morning"), and the label of the utterance is also "おはようございます".

To express the phoneme of each frame, the calculation unit 5 performs alignment using the generated label. Depending on the duration of the pronunciation, one phoneme may correspond to two or more frames. In the example of FIG. 5, frames 4 and 5, for instance, correspond to the same phoneme.

The calculation unit 5 calculates the ratio of speech frames to non-speech frames. In the example of FIG. 5, the proportion of speech frames is 15/20 = 0.75, and the proportion of non-speech frames is 5/20 = 0.25.
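The frame ratio can be computed directly from a per-frame alignment. In this sketch the alignment is assumed to be a list with one phoneme symbol per frame, "sil" marks non-speech as in FIG. 5, and "ph" is a placeholder for any speech phoneme; the function name `speech_frame_ratio` is hypothetical.

```python
def speech_frame_ratio(frame_phonemes, non_speech=("sil",)):
    """Return the fraction of frames whose aligned symbol is not non-speech."""
    if not frame_phonemes:
        raise ValueError("alignment must contain at least one frame")
    speech = sum(1 for p in frame_phonemes if p not in non_speech)
    return speech / len(frame_phonemes)


# 20 frames laid out as in FIG. 5: frames 1-2 and 18-20 are silence,
# frames 3-17 are speech ("ph" is a placeholder phoneme symbol).
frames = ["sil"] * 2 + ["ph"] * 15 + ["sil"] * 3
ratio = speech_frame_ratio(frames)
print(ratio, 1 - ratio)  # 0.75 0.25
```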

Returning to FIG. 4, the selection unit 3-2 selects, as adaptation data, speech data whose proportion of speech frames falls within a predetermined selection range. The predetermined selection range is, for example, 0.3 to 0.9. In the example of FIG. 5, the proportion of speech frames is 0.75, so the speech data is selected as adaptation data by the selection unit 3-2.

The predetermined selection range may be set according to the purpose of the adaptation. When the speech recognition device 10-2 should output speech recognition results as much as possible, a range covering higher values is used as the predetermined selection range (for example, 0.4 to 1.0). On the other hand, when the speech data contains background noise and the speech recognition device 10-2 should output recognition results for the background noise as little as possible, a range covering lower values is used as the predetermined selection range (for example, 0.0 to 0.5).
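A range filter of this kind might look as follows. Treating both ends of the range as inclusive is an assumption, as are the function name `select_by_speech_ratio` and the sample ratios; the three ranges are the example values given in the text.

```python
def select_by_speech_ratio(ratios, low=0.3, high=0.9):
    """Return 0-based indices of items whose speech-frame ratio is in [low, high].

    Inclusive bounds are an assumption; the text only gives example ranges.
    """
    return [i for i, r in enumerate(ratios) if low <= r <= high]


# Hypothetical speech-frame ratios for four pieces of speech data.
ratios = [0.75, 0.20, 0.95, 0.50]

default_range = select_by_speech_ratio(ratios)            # 0.3 to 0.9
favor_speech = select_by_speech_ratio(ratios, 0.4, 1.0)   # higher range
lower_range = select_by_speech_ratio(ratios, 0.0, 0.5)    # lower range
print(default_range, favor_speech, lower_range)  # [0, 3] [0, 2, 3] [1, 3]
```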

As described above, with the speech recognition device 10-2 of the second embodiment, even when the input includes speech data with a high proportion of non-speech frames (for example, 0.7 or more), the selection unit 3-2 does not select that speech data. This suppresses adverse effects on the results of speech recognition using the adaptive acoustic model 103.

(Third Embodiment)
Next, the third embodiment is described. Descriptions that are the same as in the first and second embodiments are omitted. The third embodiment describes the operation when the first and second embodiments are combined.

[Example of functional configuration]
FIG. 6 is a block diagram showing an example of the functional configuration of the speech recognition device 10-3 of the third embodiment. The speech recognition device 10-3 of the third embodiment includes a generation unit 1, a determination unit 2, a selection unit 3-3, an adaptation unit 4, and a calculation unit 5. The generation unit 1, the determination unit 2, and the adaptation unit 4 are the same as in the first embodiment, so their description is omitted. The calculation unit 5 is the same as in the second embodiment, so its description is omitted.

Let selection method A be the method by which the selection unit 3 of the first embodiment selects adaptation data, and let selection method B be the method by which the selection unit 3-2 of the second embodiment selects adaptation data. Selection methods A and B are independent of each other. The speech data to be used as adaptation data can therefore be selected by a combination of selection methods A and B.

The selection unit 3-3 selects the speech data to be used as adaptation data based on the weight determined by the determination unit 2 and the proportion of speech frames calculated by the calculation unit 5. Specifically, the selection unit 3-3, for example, first selects adaptation data candidates by selection method A and then selects the adaptation data from those candidates by selection method B. Alternatively, the selection unit 3-3 first selects adaptation data candidates by selection method B and then selects the adaptation data from those candidates by selection method A.
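The two orderings can be sketched as chained filters. This example assumes that the threshold of selection method A is always computed from the original total number of utterances n, not from the size of the candidate set; under that assumption both orderings select the same data. The record layout, function names, and sample values are all hypothetical.

```python
import math


def method_a(candidates, n_total, alpha=0.2):
    """Selection method A: keep items whose weight exceeds e^(1 - alpha * n)."""
    theta = math.exp(1 - alpha * n_total)
    return [u for u in candidates if u["weight"] > theta]


def method_b(candidates, low=0.3, high=0.9):
    """Selection method B: keep items whose speech-frame ratio is in range."""
    return [u for u in candidates if low <= u["speech_ratio"] <= high]


# Hypothetical records combining a weight and a speech-frame ratio.
utterances = [
    {"id": 1, "weight": 1.00, "speech_ratio": 0.75},
    {"id": 2, "weight": 1.00, "speech_ratio": 0.10},
    {"id": 3, "weight": 0.37, "speech_ratio": 0.80},
    {"id": 4, "weight": 1.00, "speech_ratio": 0.60},
    {"id": 5, "weight": 0.14, "speech_ratio": 0.95},
    {"id": 6, "weight": 1.00, "speech_ratio": 0.50},
]
n = len(utterances)

a_then_b = [u["id"] for u in method_b(method_a(utterances, n))]
b_then_a = [u["id"] for u in method_a(method_b(utterances), n)]
print(a_then_b, b_then_a)  # [1, 4, 6] [1, 4, 6]
```

If the threshold were instead recomputed from the number of remaining candidates, the two orderings could differ, so the choice of n is worth making explicit in an implementation.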

The speech recognition device 10-3 of the third embodiment thereby obtains the effects of both the first and second embodiments.

(Fourth Embodiment)
Next, the fourth embodiment is described. Descriptions that are the same as in the first embodiment are omitted. The fourth embodiment describes a configuration that performs speech recognition using the adaptive acoustic model 103.

[Example of functional configuration]
FIG. 7 is a diagram showing an example of the functional configuration of the speech recognition device 10-4 of the fourth embodiment. The speech recognition device 10-4 of the fourth embodiment includes a generation unit 1, a determination unit 2, a selection unit 3, an adaptation unit 4, and a recognition unit 6. The generation unit 1, the determination unit 2, the selection unit 3, and the adaptation unit 4 are the same as in the first embodiment, so their description is omitted.

The recognition unit 6 performs speech recognition on speech data using the language model 101 and the adaptive acoustic model 103. For example, when recognizing speech data acquired in an environment similar to the one in which the adaptation data was acquired, the parameters of the adaptive acoustic model 103 are better suited than those of the base acoustic model 102. Likewise, when recognizing speech data from a speaker who is similar to (or the same as) the speaker of the utterances in the adaptation data, the parameters of the adaptive acoustic model 103 are better suited than those of the base acoustic model 102. Speech recognition performed with the adaptive acoustic model 103 therefore achieves higher accuracy in these cases.

(Fifth Embodiment)
Next, the fifth embodiment is described. Descriptions that are the same as in the first embodiment are omitted. In the first embodiment, adaptation is performed using two types of models, the language model 101 and the base acoustic model 102. The fifth embodiment describes a configuration in which adaptation is performed with an end-to-end speech recognition method that does not distinguish between the language model 101 and the base acoustic model 102.

[Example of functional configuration]
FIG. 8 is a diagram showing an example of the functional configuration of the speech recognition device 10-5 of the fifth embodiment. The speech recognition device 10-5 of the fifth embodiment includes a generation unit 1-2, a determination unit 2, a selection unit 3, and an adaptation unit 4-2. The determination unit 2 and the selection unit 3 are the same as in the first embodiment, so their description is omitted.

The speech recognition device 10-5 of the fifth embodiment stores a speech recognition base model 104 and a speech recognition adaptation model 105. The speech recognition base model 104 is data that models both the linguistic features and the acoustic features of speech without distinguishing between them.

The generation unit 1-2 recognizes speech data using the speech recognition base model 104 and generates a label. In the fifth embodiment, the speech recognition base model 104 plays the roles of both the language model 101 and the base acoustic model 102. The label generation method is the same as in the first embodiment, so its description is omitted.

The adaptation unit 4-2 generates the speech recognition adaptation model 105 by adapting the speech recognition base model 104 using the adaptation data selected by the selection unit 3. Specifically, the speech recognition base model 104 is adapted by optimizing its parameters using the adaptation data. Methods for adapting the speech recognition base model 104 include, for example, methods that use a DNN, a CNN, or an RNN (Recurrent Neural Network). The speech recognition adaptation model 105 may be stored in a storage unit external to the speech recognition device 10.

Finally, an example of the hardware configuration of the voice recognition device 10 (10-2, 10-3, 10-4, 10-5) of the first to fifth embodiments will be described.

[Example of hardware configuration]
FIG. 9 is a diagram showing an example of the hardware configuration of the voice recognition device 10 (10-2, 10-3, 10-4, 10-5) of the first to fifth embodiments. The voice recognition device 10 of the first embodiment is described below as an example. The hardware configuration of the voice recognition devices 10-2, 10-3, 10-4, and 10-5 of the second to fifth embodiments is the same as that of the voice recognition device 10 of the first embodiment.

The voice recognition device 10 of the first embodiment includes a control device 301, a main storage device 302, an auxiliary storage device 303, a display device 304, an input device 305, and a communication device 306. The control device 301, the main storage device 302, the auxiliary storage device 303, the display device 304, the input device 305, and the communication device 306 are connected via a bus 310.

The control device 301 executes a program read from the auxiliary storage device 303 into the main storage device 302. The main storage device 302 is a memory such as a ROM (Read Only Memory) or a RAM (Random Access Memory). The auxiliary storage device 303 is an HDD (Hard Disk Drive), a memory card, or the like.

The display device 304 displays display information. The display device 304 is, for example, a liquid crystal display. The input device 305 is an interface for operating the voice recognition device 10. The input device 305 is, for example, a keyboard or a mouse. When the voice recognition device 10 is a smart device such as a smartphone or a tablet terminal, the display device 304 and the input device 305 are, for example, a touch panel. The communication device 306 is an interface for communicating with other devices.

The program executed by the voice recognition device 10 of the first embodiment is recorded, as a file in an installable or executable format, on a computer-readable storage medium such as a CD-ROM, a memory card, a CD-R, or a DVD (Digital Versatile Disc), and is provided as a computer program product.

The program executed by the voice recognition device 10 of the first embodiment may also be stored on a computer connected to a network such as the Internet and provided by being downloaded via the network. Alternatively, the program may be provided via a network such as the Internet without being downloaded.

The program of the voice recognition device 10 of the first embodiment may also be provided by being incorporated in advance in a ROM or the like.

The program executed by the voice recognition device 10 of the first embodiment has a module configuration that includes those of the above-described functional blocks that can be realized by a program. In terms of actual hardware, the control device 301 reads the program from the storage medium and executes it, whereby each of these functional blocks is loaded onto the main storage device 302. That is, each of these functional blocks is generated on the main storage device 302.

Some or all of the above-described functional blocks may be realized by hardware such as an IC (Integrated Circuit) instead of by software.

When the functions are realized using a plurality of processors, each processor may realize one of the functions or two or more of the functions.

The voice recognition device 10 of the first embodiment may operate in any form. For example, the voice recognition device 10 of the first embodiment may be operated as a cloud system on a network.

Although some embodiments of the present invention have been described, these embodiments are presented as examples and are not intended to limit the scope of the invention. These novel embodiments can be implemented in various other forms, and various omissions, replacements, and changes can be made without departing from the gist of the invention. These embodiments and their modifications are included in the scope and gist of the invention, and are also included in the invention described in the claims and its equivalents.

1 Generation unit
2 Determination unit
3 Selection unit
4 Adaptation unit
5 Calculation unit
6 Recognition unit
101 Language model
102 Base acoustic model
103 Adapted acoustic model
104 Speech recognition base model
105 Speech recognition adaptation model
301 Control device
302 Main storage device
303 Auxiliary storage device
304 Display device
305 Input device
306 Communication device
310 Bus

Claims

A voice recognition device comprising:
a generation unit that recognizes voice data using a language model and a first acoustic model and generates a label identifying an utterance included in the voice data;
a determination unit that uses the label to identify the number of pieces of voice data containing the same utterance and determines a weight to be given to the voice data according to that number;
a selection unit that selects the voice data based on the weight; and
an adaptation unit that generates a second acoustic model by adapting the first acoustic model using the voice data selected by the selection unit.

A voice recognition device comprising:
a generation unit that recognizes voice data using a language model and a first acoustic model and generates a label identifying an utterance included in the voice data;
a calculation unit that uses the label to calculate the ratio of speech frames included in the voice data to non-speech frames included in the voice data;
a selection unit that selects voice data for which the ratio of speech frames is within a predetermined selection range; and
an adaptation unit that generates a second acoustic model by adapting the first acoustic model using the voice data selected by the selection unit.

A voice recognition device comprising:
a generation unit that recognizes voice data using a language model and a first acoustic model and generates a label identifying an utterance included in the voice data;
a determination unit that uses the label to identify the number of pieces of voice data containing the same utterance and determines a weight to be given to the voice data according to that number;
a calculation unit that uses the label to calculate the ratio of speech frames included in the voice data to non-speech frames included in the voice data;
a selection unit that selects the voice data based on the weight and the ratio of speech frames; and
an adaptation unit that generates a second acoustic model by adapting the first acoustic model using the voice data selected by the selection unit.

The voice recognition device according to claim 1, wherein
the determination unit makes the weight smaller as the number increases.

The voice recognition device according to claim 1, wherein
the selection unit determines whether the weight is larger than a threshold value and selects voice data to which a weight larger than the threshold value is given.

The voice recognition device according to claim 1, further comprising
a recognition unit that performs voice recognition of voice data using the language model and the second acoustic model.

The voice recognition device according to claim 1, wherein
the language model and the first acoustic model are represented by a speech recognition base model that models the linguistic features and the acoustic features of speech without distinguishing between them, and
the adaptation unit adapts the speech recognition base model using the voice data selected by the selection unit.

A speech recognition method comprising:
a step of recognizing voice data using a language model and a first acoustic model and generating a label identifying an utterance included in the voice data;
a step of using the label to identify the number of pieces of voice data containing the same utterance and determining a weight to be given to the voice data according to that number;
a step of selecting the voice data based on the weight; and
a step of generating a second acoustic model by adapting the first acoustic model using the voice data selected in the selecting step.

A speech recognition method comprising:
a step of recognizing voice data using a language model and a first acoustic model and generating a label identifying an utterance included in the voice data;
a step of using the label to calculate the ratio of speech frames included in the voice data to non-speech frames included in the voice data;
a step of selecting voice data for which the ratio of speech frames is within a predetermined selection range; and
a step of generating a second acoustic model by adapting the first acoustic model using the voice data selected in the selecting step.

A speech recognition method comprising:
a step of recognizing voice data using a language model and a first acoustic model and generating a label identifying an utterance included in the voice data;
a step of using the label to identify the number of pieces of voice data containing the same utterance and determining a weight to be given to the voice data according to that number;
a step of using the label to calculate the ratio of speech frames included in the voice data to non-speech frames included in the voice data;
a step of selecting the voice data based on the weight and the ratio of speech frames; and
a step of generating a second acoustic model by adapting the first acoustic model using the voice data selected in the selecting step.

A program for causing a computer to function as:
a generation unit that recognizes voice data using a language model and a first acoustic model and generates a label identifying an utterance included in the voice data;
a determination unit that uses the label to identify the number of pieces of voice data containing the same utterance and determines a weight to be given to the voice data according to that number;
a selection unit that selects the voice data based on the weight; and
an adaptation unit that generates a second acoustic model by adapting the first acoustic model using the voice data selected by the selection unit.

A program for causing a computer to function as:
a generation unit that recognizes voice data using a language model and a first acoustic model and generates a label identifying an utterance included in the voice data;
a calculation unit that uses the label to calculate the ratio of speech frames included in the voice data to non-speech frames included in the voice data;
a selection unit that selects voice data for which the ratio of speech frames is within a predetermined selection range; and
an adaptation unit that generates a second acoustic model by adapting the first acoustic model using the voice data selected by the selection unit.

A program for causing a computer to function as:
a generation unit that recognizes voice data using a language model and a first acoustic model and generates a label identifying an utterance included in the voice data;
a determination unit that uses the label to identify the number of pieces of voice data containing the same utterance and determines a weight to be given to the voice data according to that number;
a calculation unit that uses the label to calculate the ratio of speech frames included in the voice data to non-speech frames included in the voice data;
a selection unit that selects the voice data based on the weight and the ratio of speech frames; and
an adaptation unit that generates a second acoustic model by adapting the first acoustic model using the voice data selected by the selection unit.