JP2013083798A

JP2013083798A - Sound model adaptation device, sound model adaptation method, and program

Info

Publication number: JP2013083798A
Application number: JP2011223745A
Authority: JP
Inventors: Taichi Asami; 太一浅見; Satoru Kobashigawa; 哲小橋川; Yoshikazu Yamaguchi; 義和山口; Hirokazu Masataki; 浩和政瀧; Satoshi Takahashi; 敏高橋
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2011-10-11
Filing date: 2011-10-11
Publication date: 2013-05-09
Anticipated expiration: 2031-10-11
Also published as: JP5651567B2

Abstract

PROBLEM TO BE SOLVED: To provide a sound model adaptation device that adapts a sound model using data that contributes strongly to parameter modification and is unlikely to reduce the effectiveness of adaptation. SOLUTION: In a sound model adaptation device 10, a voice recognition part 100 outputs a voice recognition result text and a reliability from an input voice by means of a pre-adaptation sound model. A voice recognition result registration part 200 stores a voice recognition result consisting of a speaker ID, the voice, the voice recognition result text and the reliability. A struggling-speaker detection part 300 extracts the speaker ID of each struggling speaker, that is, a speaker for whom voice recognition accuracy is lower than for other speakers. An adapting-data selection part 400 reads the voice recognition results whose speaker ID is that of a struggling speaker and whose reliability is equal to or higher than a predetermined reliability threshold, and outputs adapting-data. A sound model adapting-part 500 uses a preset adapting-parameter to output a post-adaptation sound model.

Description

The present invention relates to an acoustic model adaptation device, an acoustic model adaptation method, and a program for adapting an acoustic model used for speech recognition.

In general, a speaker-independent acoustic model is used to recognize the speech of an unspecified, large number of speakers. A speaker-independent acoustic model is an acoustic model whose parameters are set so that speech is correctly mapped to phonemes for many speakers (ideally, for all speakers). Its parameters are determined by a machine learning algorithm from pairs consisting of the speech of several hundred or more speakers and text transcribing what was said.

However, even if the speech of many speakers is used for training, it is impossible to cover every speaker who may use the speech recognition system. In an actual speech recognition system, there are speakers (hereinafter referred to as poor speakers) whose speech recognition accuracy is significantly lower than that of other speakers even when a speaker-independent acoustic model is used, and this reduces the convenience of the speech recognition system.

To address this problem, Patent Document 1 describes a method of updating the parameters of an acoustic model: among the speech input while the speech recognition system is in operation, when the recognition reliability of a piece of speech exceeds a certain threshold, an acoustic model adaptation algorithm is applied to the pair of that speech and its speech recognition result text (hereinafter, acoustic model adaptation using speech recognition result text is referred to as "unsupervised adaptation"). Using speech with high reliability prevents recognition errors contained in the speech recognition result text from weakening the adaptation effect. For example, by accumulating the speech input during operation of the speech recognition system and applying the method of Patent Document 1 once a certain amount has accumulated, the parameters of the acoustic model can be updated to match the speech that is actually input.

Patent Document 1: JP 2011-75622 A

However, when the method described in Patent Document 1 selects speech for unsupervised adaptation of an acoustic model, it selects a large amount of data that should not be used for adaptation. Such data includes speech that already matches the parameters of the pre-adaptation acoustic model and speech recognition result text that contains recognition errors. Most speech with high reliability already matches the parameters of the pre-adaptation acoustic model, so it contributes little to the parameter correction performed by adaptation. In addition, even speech recognition result text with high reliability contains a small number of recognition errors, and using text with recognition errors reduces the effect of acoustic model adaptation. Because such data is used for adaptation, the effect of unsupervised adaptation of the acoustic model is suppressed.

The present invention has been made in view of the above points, and its object is to provide an acoustic model adaptation device that can perform unsupervised adaptation of an acoustic model using data that contributes strongly to parameter correction and is unlikely to reduce the effect of adaptation.

To solve the above problem, the acoustic model adaptation device of the present invention includes an acoustic model storage unit, a speech recognition result storage unit, a speech recognition unit, a speech recognition result registration unit, a poor speaker detection unit, an adaptation data selection unit, and an acoustic model adaptation unit. The acoustic model storage unit stores a pre-adaptation acoustic model. The speech recognition result storage unit stores speech recognition results. The speech recognition unit outputs at least a speech recognition result text and a reliability from input speech, using the pre-adaptation acoustic model. The speech recognition result registration unit stores, in the speech recognition result storage unit, a speech recognition result consisting of at least a speaker ID, the speech, the speech recognition result text, and the reliability. The poor speaker detection unit reads all speech recognition results from the speech recognition result storage unit and, based on a preset detection condition, extracts the speaker IDs of poor speakers, that is, speakers whose speech recognition accuracy is lower than that of other speakers. The adaptation data selection unit reads from the speech recognition result storage unit the speech recognition results whose speaker ID is that of a poor speaker and whose reliability is equal to or higher than a preset reliability threshold, and extracts adaptation data consisting of at least the speech and the speech recognition result text. The acoustic model adaptation unit outputs a post-adaptation acoustic model from the pre-adaptation acoustic model and the adaptation data, using a preset adaptation parameter.

According to the present invention, in unsupervised adaptation of a speaker-independent acoustic model using accumulated speech, the improvement in recognition accuracy obtained by adaptation can be increased by using, as adaptation data, the high-reliability speech of poor speakers.

Furthermore, improving the recognition accuracy of the speaker-independent acoustic model reduces the variation in speech recognition accuracy among speakers, so a speech recognition system that is convenient for a larger number of users can be realized.

FIG. 1 is a block diagram showing the configuration of the acoustic model adaptation device of the first embodiment.
FIG. 2 is a flowchart showing the operation of the acoustic model adaptation device of the first embodiment.
FIG. 3 is a block diagram showing the configuration of the acoustic model adaptation device of a modification of the first embodiment.
FIG. 4 is a flowchart showing the operation of the acoustic model adaptation device of the modification of the first embodiment.
FIG. 5 is a block diagram showing the configuration of the acoustic model adaptation device of the second embodiment.
FIG. 6 is a flowchart showing the operation of the acoustic model adaptation device of the second embodiment.
FIG. 7 is a block diagram showing the configuration of the acoustic model adaptation device of a modification of the second embodiment.
FIG. 8 is a flowchart showing the operation of the acoustic model adaptation device of the modification of the second embodiment.

Hereinafter, embodiments of the present invention will be described in detail. Components having the same function are given the same reference numerals, and redundant description is omitted.

First, an outline of the present invention will be described. In the first embodiment, the speech accumulated in the speech recognition system is analyzed to detect poor speakers. Unsupervised adaptation of the speaker-independent acoustic model is then performed using, as adaptation data, the high-reliability speech among the speech of the detected poor speakers. Because the speech of a poor speaker does not match the parameters of the pre-adaptation acoustic model, its contribution to the parameter correction performed by adaptation is large. Furthermore, selecting only the high-reliability speech of poor speakers means that speech recognition result text with relatively few errors is used as adaptation data, which makes the effect of acoustic model adaptation less likely to be suppressed.

With the post-adaptation acoustic model finally output in the first embodiment, recognition accuracy may drop significantly for the speech of speakers other than the detected poor speakers. In the second embodiment, therefore, acoustic model adaptation is performed with a plurality of adaptation parameters to generate a plurality of post-adaptation acoustic model candidates. Each candidate is used to recognize the accumulated speech again and compute the post-adaptation reliability, and the candidate whose reliability improves most relative to the pre-adaptation acoustic model is adopted as the post-adaptation acoustic model. If reliability falls by a certain threshold or more whichever candidate is used, the post-adaptation acoustic model is rejected. This processing makes it possible to output a post-adaptation acoustic model whose recognition accuracy improves over the accumulated speech as a whole (that is, without a large drop in recognition accuracy for speakers other than the poor speakers).

The operation of the acoustic model adaptation device 10 according to the first embodiment of the present invention will be described in detail with reference to FIGS. 1 and 2. FIG. 1 is a block diagram showing the configuration of the acoustic model adaptation device 10 according to the first embodiment. FIG. 2 is a flowchart showing the operation of the acoustic model adaptation device 10 according to the first embodiment.

The description below follows the order in which the procedures are actually performed. The acoustic model adaptation device 10 of this embodiment includes a speech recognition unit 100, a speech recognition result registration unit 200, a poor speaker detection unit 300, an adaptation data selection unit 400, an acoustic model adaptation unit 500, an acoustic model storage unit 800, and a speech recognition result storage unit 900.

The acoustic model storage unit 800 stores a pre-adaptation acoustic model.

The speech recognition unit 100 receives speech and the pre-adaptation acoustic model stored in the acoustic model storage unit 800, and performs speech recognition on the input speech (S100). It computes a reliability at the same time as recognition and outputs the resulting speech recognition result text together with the reliability. The input speech may be a speech document (a sequence of utterances, such as a conference call or a lecture) or a single utterance (a stretch of speech produced in one breath, bounded by silent intervals). The output reliability is a speech document recognition reliability when the input is a speech document, and an utterance recognition reliability when the input is an utterance. The speech document recognition reliability is computed, for example, by the method described in Taichi Asami, Satoru Kobashigawa, Yoshikazu Yamaguchi, Hirokazu Masataki, Satoshi Takahashi, "Estimation of speech document recognition reliability using word context consistency and acoustic likelihood", IEICE Technical Report, SP, 110(43), pp. 43-48, 2010. The utterance recognition reliability is computed, for example, by the method described in JP 2005-148342 A. These reliability calculation methods are only examples, and various other methods can be used.

The speech recognition result registration unit 200 receives a speaker ID, the speech, and the speech recognition result text and reliability output by the speech recognition unit 100, and stores them as one speech recognition result in the speech recognition result storage unit 900 (S200). The input speaker ID may be, for example, the user ID obtained when the speech recognition system authenticates the user, or a value obtained by applying an existing speaker identification technique such as the one described in JP 2000-148187.
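The record stored in step S200 can be pictured with a minimal Python sketch. The names `RecognitionResult` and `ResultStore` and the field names are assumptions made for illustration; the source only states that a record holds at least a speaker ID, the speech, the recognition result text, and the reliability.

```python
from dataclasses import dataclass


@dataclass
class RecognitionResult:
    """One stored speech recognition result (hypothetical layout)."""
    speaker_id: str    # from user authentication or speaker identification
    audio: bytes       # the stored speech, or a reference to it
    text: str          # speech recognition result text
    confidence: float  # reliability computed alongside recognition


class ResultStore:
    """Illustrative stand-in for the speech recognition result storage unit 900."""

    def __init__(self):
        self.results = []

    def register(self, speaker_id, audio, text, confidence):
        # Corresponds to S200: bundle the four items and store them together.
        record = RecognitionResult(speaker_id, audio, text, confidence)
        self.results.append(record)
        return record

    def speaker_ids(self):
        # Distinct speaker IDs currently present in the store.
        return sorted({r.speaker_id for r in self.results})
```

Keeping the audio in the record matters for the later steps, which re-recognize the stored speech with adapted models.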

The poor speaker detection unit 300 receives a preset detection condition θ and outputs the poor-speaker IDs, a set containing zero or more speaker IDs. The unit may run, for example, when the system operator issues an execution instruction, at a preset interval (the first day of every month, every Sunday, and so on), or each time a preset amount of data has been registered (for example, every 1000 newly registered pieces of speech).

Poor speakers are detected by performing the following processes (1) to (4) for every speaker ID stored in the speech recognition result storage unit 900. In the following, the speaker ID being processed is denoted X.

(1) From the speech recognition result storage unit 900, acquire the reliabilities contained in all speech recognition results with speaker ID = X (S301). Hereinafter, this set of reliabilities is referred to as ConfListID_X.
(2) From the speech recognition result storage unit 900, acquire the reliabilities contained in all speech recognition results with speaker ID ≠ X (S302). Hereinafter, this set of reliabilities is referred to as ConfListID_notX.
(3) From ConfListID_X and ConfListID_notX, decide, based on the detection condition θ, whether to detect speaker ID = X as a poor-speaker ID. Either of the following patterns A and B may be used for this decision.
(Pattern A) Detect X if the value obtained by subtracting the mean m_X of ConfListID_X from the mean m_notX of ConfListID_notX is equal to or greater than the threshold θ. In this case, the detection condition θ is a threshold on the difference between the mean reliabilities. One way to determine θ is to compute the standard deviation σ of all reliabilities stored in the speech recognition result storage unit 900 and set θ = σ.
(Pattern B) Test whether the mean m_notX of ConfListID_notX and the mean m_X of ConfListID_X differ, and detect X if the hypothesis "m_X is smaller than m_notX" is supported at significance level θ%. A t-test is used as the test. In this case, the detection condition θ is the significance level of the test; for example, θ = 5%.
(4) If the decision is to detect, output speaker ID = X as a poor-speaker ID (S303).
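Steps (1) to (4) with Pattern A can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the function name `detect_poor_speakers` and the `(speaker_id, confidence)` input layout are hypothetical, and the population standard deviation is used for σ because the source does not say which estimator is intended.

```python
from statistics import mean, pstdev


def detect_poor_speakers(results, theta=None):
    """Pattern A sketch. results: list of (speaker_id, confidence) pairs.

    Returns the speaker IDs whose mean reliability is at least theta below
    the mean reliability of all other speakers.
    """
    if theta is None:
        # theta = sigma over all stored reliabilities, the example in the text.
        # Assumption: population standard deviation.
        theta = pstdev([c for _, c in results])
    flagged = []
    for x in sorted({s for s, _ in results}):
        conf_x = [c for s, c in results if s == x]      # ConfListID_X (S301)
        conf_not_x = [c for s, c in results if s != x]  # ConfListID_notX (S302)
        if not conf_not_x:
            continue  # only one speaker is stored, so there is nothing to compare
        if mean(conf_not_x) - mean(conf_x) >= theta:    # decision in step (3)
            flagged.append(x)                           # output in step (4) (S303)
    return flagged
```

Pattern B would replace the threshold comparison with a one-sided t-test on the two lists; it is omitted here because the Python standard library has no t-distribution.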

The poor-speaker IDs output by the poor speaker detection unit 300 identify speakers whose reliability is skewed lower than that of other speakers (that is, speakers for whom recognition accuracy is low). If no speaker ID is detected in (3), so that no poor-speaker ID is output, the subsequent processing is not executed.

The adaptation data selection unit 400 receives a preset reliability threshold δ and the poor-speaker IDs output by the poor speaker detection unit 300, and outputs the adaptation data used to adapt the acoustic model (S400). From the speech recognition result storage unit 900, it acquires, as pairs, the speech and the speech recognition result text of every speech recognition result whose speaker ID is one of the input poor-speaker IDs and whose reliability is equal to or higher than δ. The set of acquired pairs is output as the adaptation data. One way to determine δ is to compute the mean μ and standard deviation σ of all reliabilities stored in the speech recognition result storage unit 900 and set δ = μ − σ.
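The selection in S400 amounts to a two-condition filter, which the following Python sketch illustrates. The function name, the dictionary keys, and the use of the population standard deviation for σ are assumptions for illustration only.

```python
from statistics import mean, pstdev


def select_adaptation_data(results, poor_ids, delta=None):
    """Sketch of S400. results: list of dicts with the (hypothetical) keys
    'speaker_id', 'audio', 'text', 'confidence'.

    Returns (audio, text) pairs from poor speakers whose reliability is >= delta.
    """
    if delta is None:
        # delta = mu - sigma over all stored reliabilities, the example in the text.
        confs = [r["confidence"] for r in results]
        delta = mean(confs) - pstdev(confs)
    poor = set(poor_ids)
    return [
        (r["audio"], r["text"])
        for r in results
        if r["speaker_id"] in poor and r["confidence"] >= delta
    ]
```

Because δ = μ − σ is computed over all speakers, it is a fairly permissive cutoff for poor speakers, whose reliabilities sit low by definition; a stricter δ would leave them with little or no adaptation data.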

The adaptation data output by the adaptation data selection unit 400 is limited to speakers who do not match the parameters of the pre-adaptation acoustic model, so it contributes strongly to parameter correction during adaptation. In addition, because speech recognition result text with relatively few recognition errors has been selected, the effect of adaptation is less likely to be suppressed.

The acoustic model adaptation unit 500 receives a preset adaptation parameter τ, the adaptation data output by the adaptation data selection unit 400, and the pre-adaptation acoustic model stored in the acoustic model storage unit 800, and outputs a post-adaptation acoustic model (S500). The post-adaptation acoustic model is generated by applying an acoustic model adaptation algorithm to the pre-adaptation acoustic model and the adaptation data, using the input adaptation parameter τ. As the adaptation algorithm, for example, the one described in J.-L. Gauvain and C.-H. Lee, "Maximum a Posteriori Estimation for Multivariate Gaussian Mixture Observations of Markov Chains", IEEE Trans. on Speech and Audio Processing, 2(2), pp. 291-298, 1994 (Reference 1) can be used. The meaning of τ depends on the adaptation algorithm; with the algorithm of Reference 1, it is a positive number that sets the relative weight of the adaptation data and the pre-adaptation acoustic model. In this case, for example, τ = 50 may be set.
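To show how a single parameter τ can balance the prior model against the adaptation data, here is a minimal Python sketch of the commonly used MAP update of one Gaussian mean. It is an illustration, not the patent's implementation: the function name and argument layout are hypothetical, only the mean is updated, and the per-frame occupancies are assumed to have been computed elsewhere from the recognition result text.

```python
def map_update_mean(prior_mean, frames, posteriors, tau):
    """MAP re-estimate of one Gaussian mean (sketch).

    prior_mean: mean vector from the pre-adaptation model (list of floats)
    frames:     adaptation feature vectors (list of lists of floats)
    posteriors: occupancy of this Gaussian for each frame (list of floats)
    tau:        positive weight; here it acts as a pseudo-count on prior_mean
    """
    occupancy = sum(posteriors)
    dim = len(prior_mean)
    # Occupancy-weighted sum of the adaptation frames, per dimension.
    weighted = [
        sum(g * x[d] for g, x in zip(posteriors, frames)) for d in range(dim)
    ]
    # Interpolate between the prior mean and the data according to tau.
    return [
        (tau * prior_mean[d] + weighted[d]) / (tau + occupancy)
        for d in range(dim)
    ]
```

In this form, a Gaussian that sees little adaptation data stays close to its pre-adaptation mean, while one that sees much more data than τ moves toward the mean of that data.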

The post-adaptation acoustic model output by the acoustic model adaptation unit 500 is an acoustic model whose parameters have been corrected so that it also matches the poor speakers that the pre-adaptation acoustic model did not match.

[Modification]

The operation of the acoustic model adaptation device 10' according to a modification of the first embodiment of the present invention will be described in detail with reference to FIGS. 3 and 4. FIG. 3 is a block diagram showing the configuration of the acoustic model adaptation device 10' according to the modification of the first embodiment. FIG. 4 is a flowchart showing the operation of the acoustic model adaptation device 10' according to the modification of the first embodiment.

The acoustic model adaptation device 10' of this modification includes a speech recognition unit 100, a speech recognition result registration unit 200, a poor speaker detection unit 300, an adaptation data selection unit 400, an acoustic model adaptation unit 510, an acoustic model storage unit 800, and a speech recognition result storage unit 900.

The acoustic model adaptation unit 510 performs the same processing as the acoustic model adaptation unit 500 of the first embodiment (S511) and then stores the resulting post-adaptation acoustic model in the acoustic model storage unit 800 (S512). Next, it recognizes again the speech contained in all speech recognition results stored in the speech recognition result storage unit 900 (S513) and updates the speech recognition result text and the reliability (S514). After that, the processing from the poor speaker detection unit 300 onward is executed repeatedly. The iteration stops when, for example, either of the following holds: the post-adaptation acoustic model has been generated a pre-specified number of times (usually two or three) (S992), or the poor speaker detection unit 300 outputs no poor-speaker ID (S991).
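The control flow of this modification can be sketched in Python as a loop with two exit conditions. All five callables (`detect`, `select`, `adapt`, `rerecognize`) and the shape of `store` are hypothetical placeholders for the units described above; only the loop structure is taken from the text.

```python
def iterate_adaptation(model, store, detect, select, adapt, rerecognize,
                       max_rounds=3):
    """Repeat detection, selection, adaptation and re-recognition (sketch).

    Returns the final model and the number of adaptation rounds performed.
    """
    rounds = 0
    while rounds < max_rounds:             # stop condition S992
        poor_ids = detect(store)           # poor speaker detection unit 300
        if not poor_ids:                   # stop condition S991
            break
        data = select(store, poor_ids)     # adaptation data selection unit 400
        model = adapt(model, data)         # S511, then stored as current model (S512)
        store = rerecognize(model, store)  # S513 and S514: refresh text and reliability
        rounds += 1
    return model, rounds
```

Re-recognizing the stored speech after each round is what allows the set of poor speakers to shrink from one round to the next.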

Next, the operation of the acoustic model adaptation device 20 according to the second embodiment of the present invention will be described in detail with reference to FIGS. 5 and 6. FIG. 5 is a block diagram showing the configuration of the acoustic model adaptation device 20 according to the second embodiment. FIG. 6 is a flowchart showing the operation of the acoustic model adaptation device 20 according to the second embodiment.

The description below follows the order in which the procedures are actually performed. The acoustic model adaptation device 20 of this embodiment includes a speech recognition unit 100, a speech recognition result registration unit 200, a poor speaker detection unit 300, an adaptation data selection unit 400, an acoustic model adaptation unit 550, an acoustic model selection unit 600, an acoustic model storage unit 800, and a speech recognition result storage unit 900.

The acoustic model adaptation unit 550 receives a preset list of adaptation parameters, the pre-adaptation acoustic model stored in the acoustic model storage unit 800, and the adaptation data output by the adaptation data selection unit 400, and outputs a plurality of post-adaptation acoustic model candidates (S550). One candidate is generated for each adaptation parameter in the input list by applying the acoustic model adaptation algorithm with that parameter. The same adaptation algorithm as in the acoustic model adaptation unit 500 of the first embodiment can be used. When the algorithm of Reference 1 is used, the list of adaptation parameters can be set, for example, to the ten values from 10 to 100 in steps of 10. The wider the range of values and the finer the step, the more accurate the model that the acoustic model selection unit can choose, at the cost of longer computation time.

The acoustic model selection unit 600 receives a preset reliability decrease threshold ε and the plurality of post-adaptation acoustic model candidates output by the acoustic model adaptation unit 550, and outputs a post-adaptation acoustic model. First, it calculates the pre-adaptation reliability average value BeforeAveConf, which is the average of all the reliabilities stored in the speech recognition result storage unit 900 (S601). Next, using each of the input candidates in turn, it recognizes the speech contained in all the speech recognition results stored in the speech recognition result storage unit 900, calculates the reliability at the same time, and obtains the average reliability for each candidate (S602). From these per-candidate averages it selects the maximum, the post-adaptation reliability average value AfterAveConf, together with the corresponding post-adaptation acoustic model candidate MaxAcou. It then obtains the reliability decrease by subtracting AfterAveConf from BeforeAveConf (S603). If the reliability decrease is less than the input threshold ε, MaxAcou is output as the post-adaptation acoustic model; if the reliability decrease is equal to or greater than ε, the process ends without outputting anything (S604). The threshold ε is a value equal to or greater than 0. For example, the standard deviation σ of all the reliabilities stored in the speech recognition result storage unit 900 can be calculated in advance and ε set to σ. Alternatively, as a more conservative choice, ε = 0 may be set so that a model is output only when the reliability improves.
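Steps S601 to S604 can be sketched as below. This is a minimal illustration, not the patented implementation: `select_model` and `rescore` are hypothetical names, and `rescore` stands in for re-recognizing all stored speech with a candidate and returning the resulting reliabilities.

```python
from statistics import mean, pstdev
from typing import Callable, Dict, Optional, Sequence


def select_model(
    stored_reliabilities: Sequence[float],
    candidates: Dict[str, object],
    rescore: Callable[[object], Sequence[float]],
    epsilon: float,
) -> Optional[str]:
    """Return the key of the chosen candidate, or None if the decrease is too large."""
    before_ave_conf = mean(stored_reliabilities)              # S601
    averages = {name: mean(rescore(model))                    # S602
                for name, model in candidates.items()}
    max_acou = max(averages, key=averages.get)                # candidate with the highest average
    after_ave_conf = averages[max_acou]
    decrease = before_ave_conf - after_ave_conf               # S603
    return max_acou if decrease < epsilon else None           # S604


# Toy data: reliabilities before adaptation and after re-recognition with each candidate.
stored = [0.25, 0.5, 0.75]
table = {"model-a": [0.25, 0.25, 0.25], "model-b": [0.5, 0.5, 0.25]}
candidates = {"tau=10": "model-a", "tau=20": "model-b"}

chosen_sigma = select_model(stored, candidates, lambda m: table[m], pstdev(stored))
chosen_zero = select_model(stored, candidates, lambda m: table[m], 0.0)
```

In this toy data the best candidate lowers the average reliability slightly, so it passes with ε = σ but is rejected with ε = 0, which requires a strict improvement.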

The post-adaptation acoustic model output by the acoustic model selection unit 600 is therefore a model that does not greatly reduce (or, with ε = 0, always improves) the reliability over all the speech stored in the speech recognition result storage unit 900.

[Modification]
The operation of an acoustic model adaptation apparatus 20' according to a modification of the second embodiment of the present invention is described in detail with reference to FIGS. 7 and 8. FIG. 7 is a block diagram showing the configuration of the acoustic model adaptation apparatus 20' according to the modification of the second embodiment. FIG. 8 is a flowchart showing the operation of the acoustic model adaptation apparatus 20' according to the modification of the second embodiment.

The acoustic model adaptation apparatus 20' of this modification includes a speech recognition unit 100, a speech recognition result registration unit 200, a weak speaker detection unit 300, an adaptation data selection unit 400, an acoustic model adaptation unit 550, an acoustic model selection unit 610, an acoustic model storage unit 800, and a speech recognition result storage unit 900.

The acoustic model selection unit 610 performs the same processing as the acoustic model selection unit 600 of the second embodiment (S611 to S614) and then stores the output post-adaptation acoustic model in the acoustic model storage unit 800 (S615). It then recognizes again the speech contained in all the speech recognition results stored in the speech recognition result storage unit 900 (S616) and updates the speech recognition result text and the reliability (S617). After that, the processing from the weak speaker detection unit 300 onward is executed repeatedly. The repetition stops when, for example, any one of the following is satisfied: a post-adaptation acoustic model has been generated a number of times specified in advance (usually two to three) (S992); the weak speaker detection unit 300 outputs no speaker ID of a weak speaker (S991); or the acoustic model selection unit 610 outputs no post-adaptation acoustic model (S993).
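The repetition and its three stopping conditions can be sketched as a loop. All names here are hypothetical, the three callables stand in for the units described above, and the exact position of the count check (S992) relative to re-recognition is an assumption, since it depends on the flowchart of FIG. 8.

```python
from typing import Callable, List, Optional, Tuple


def iterate_adaptation(
    model: int,
    detect_weak_speakers: Callable[[], List[str]],
    adapt_and_select: Callable[[int, List[str]], Optional[int]],
    rerecognize: Callable[[int], None],
    max_rounds: int = 3,
) -> Tuple[int, int]:
    """Repeat detection, adaptation and selection until a stop condition holds."""
    rounds = 0
    while rounds < max_rounds:                       # S992: preset number of generated models
        weak = detect_weak_speakers()
        if not weak:                                 # S991: no weak speaker detected
            break
        new_model = adapt_and_select(model, weak)
        if new_model is None:                        # S993: selection unit output nothing
            break
        model = new_model                            # S615: store the adapted model
        rerecognize(model)                           # S616, S617: refresh text and reliability
        rounds += 1
    return model, rounds


# Toy run: each re-recognition "fixes" one weak speaker; a model is just a version number.
state = {"weak": ["spk1", "spk2"]}
final_model, n_rounds = iterate_adaptation(
    0,
    lambda: list(state["weak"]),
    lambda m, weak: m + 1,
    lambda m: state["weak"].pop(),
)
```

In the toy run the loop ends after two rounds because no weak speaker remains, before the limit of three rounds is reached.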

<Program, recording medium>
The various processes described above need not be executed in time series in the order described; they may also be executed in parallel or individually, depending on the processing capability of the apparatus that executes them or as needed. Needless to say, other changes can be made as appropriate without departing from the spirit of the present invention.

When the configuration described above is realized by a computer, the processing contents of the functions that each apparatus should have are described by a program. The processing functions are realized on the computer by executing this program on the computer.

The program describing these processing contents can be recorded on a computer-readable recording medium. The computer-readable recording medium may be of any kind, for example a magnetic recording device, an optical disc, a magneto-optical recording medium, or a semiconductor memory.

The program is distributed, for example, by selling, transferring, or lending a portable recording medium such as a DVD or CD-ROM on which the program is recorded. The program may also be distributed by storing it in a storage device of a server computer and transferring it from the server computer to other computers via a network.

A computer that executes such a program first stores, for example, the program recorded on a portable recording medium, or the program transferred from a server computer, temporarily in its own storage device. When executing the processing, the computer reads the program stored in its own recording medium and executes processing according to the read program. As another form of execution, the computer may read the program directly from the portable recording medium and execute processing according to it, or it may execute processing according to the received program sequentially each time the program is transferred to it from the server computer. The processing described above may also be executed by a so-called ASP (Application Service Provider) type service, which does not transfer the program from the server computer to the computer and realizes the processing functions only through execution instructions and result acquisition. Note that the program in this embodiment includes information that is provided for processing by an electronic computer and is equivalent to a program (such as data that is not a direct command to the computer but has properties that define the computer's processing).

In this embodiment, the apparatus is configured by executing a predetermined program on a computer, but at least part of the processing contents may instead be realized in hardware.

The present invention can be used for unsupervised adaptation of a speaker-independent acoustic model used for speech recognition.

10, 10', 20, 20' Acoustic model adaptation apparatus
100 Speech recognition unit
200 Speech recognition result registration unit
300 Weak speaker detection unit
400 Adaptation data selection unit
500, 510, 550 Acoustic model adaptation unit
600, 610 Acoustic model selection unit
800 Acoustic model storage unit
900 Speech recognition result storage unit

Claims

An acoustic model adaptation method in which an acoustic model storage unit stores a pre-adaptation acoustic model, the method comprising:
a speech recognition step in which a speech recognition unit outputs at least a speech recognition result text and a reliability from input speech using the pre-adaptation acoustic model;
a speech recognition result registration step in which a speech recognition result registration unit stores, in a speech recognition result storage unit, a speech recognition result including at least a speaker ID, the speech, the speech recognition result text, and the reliability;
a weak speaker detection step in which a weak speaker detection unit reads all the speech recognition results from the speech recognition result storage unit and, based on a preset detection condition, extracts the speaker ID of a weak speaker whose speech recognition accuracy is lower than that of other speakers;
an adaptation data selection step in which an adaptation data selection unit reads, from the speech recognition result storage unit, the speech recognition results whose speaker ID is the speaker ID of the weak speaker and whose reliability is equal to or higher than a preset reliability threshold, and extracts adaptation data comprising at least the speech and the speech recognition result text; and
an acoustic model adaptation step in which an acoustic model adaptation unit outputs a post-adaptation acoustic model from the pre-adaptation acoustic model and the adaptation data using a preset adaptation parameter.

The acoustic model adaptation method according to claim 1,
wherein the detection condition is that, when the value obtained by subtracting the average of the reliabilities included in the speech recognition results whose speaker ID is a given speaker ID from the average of the reliabilities included in the speech recognition results whose speaker ID is other than that speaker ID is equal to or greater than a preset threshold, that speaker ID is taken as the speaker ID of the weak speaker.
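As an illustration only, the mean-difference detection condition can be sketched as follows. The function name and the (speaker ID, reliability) input layout are assumptions, and the sketch reads the condition as "average over the other speakers minus the speaker's own average is at least the threshold".

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, List, Tuple


def detect_weak_speakers(results: Iterable[Tuple[str, float]], threshold: float) -> List[str]:
    """Return speaker IDs whose mean reliability trails the other speakers by >= threshold."""
    by_speaker: Dict[str, List[float]] = defaultdict(list)
    for speaker_id, reliability in results:
        by_speaker[speaker_id].append(reliability)

    weak = []
    for speaker_id, own in by_speaker.items():
        # Pool the reliabilities of every result that belongs to a different speaker.
        others = [r for sid, rs in by_speaker.items() if sid != speaker_id for r in rs]
        if others and mean(others) - mean(own) >= threshold:
            weak.append(speaker_id)
    return weak


results = [("a", 0.75), ("a", 0.75), ("b", 0.25), ("b", 0.25), ("c", 0.75)]
weak_ids = detect_weak_speakers(results, threshold=0.25)
```

Here only speaker "b" is flagged: the other speakers average 0.75 while "b" averages 0.25, a gap of 0.5.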

The acoustic model adaptation method according to claim 1,
wherein the detection condition is that, when a statistical test at a preset significance level supports that the average of the reliabilities included in the speech recognition results whose speaker ID is a given speaker ID is smaller than the average of the reliabilities included in the speech recognition results whose speaker ID is other than that speaker ID, that speaker ID is taken as the speaker ID of the weak speaker.

The acoustic model adaptation method according to any one of claims 1 to 3,
further comprising an acoustic model selection step,
wherein, in the acoustic model adaptation step, a plurality of adaptation parameters are preset, and a post-adaptation acoustic model candidate is output for each adaptation parameter from the pre-adaptation acoustic model and the adaptation data, and
in the acoustic model selection step, an acoustic model selection unit reads all the speech recognition results from the speech recognition result storage unit, obtains a pre-adaptation reliability average value using all the reliabilities included in the speech recognition results, obtains a post-adaptation reliability average value using all the speech included in the speech recognition results and the post-adaptation acoustic model candidates, and, if the reliability decrease obtained by subtracting the post-adaptation reliability average value from the pre-adaptation reliability average value is less than a preset reliability decrease threshold, outputs the post-adaptation acoustic model candidate corresponding to the post-adaptation reliability average value as the post-adaptation acoustic model.

The acoustic model adaptation method according to any one of claims 1 to 3,
wherein the acoustic model adaptation step stores the output post-adaptation acoustic model in the acoustic model storage unit, reads all the speech recognition results from the speech recognition result storage unit, outputs a speech recognition result text and a reliability for all the speech included in the speech recognition results using the post-adaptation acoustic model, and stores the speech recognition result text and the reliability in the speech recognition result storage unit, and
the weak speaker detection step, the adaptation data selection step, and the acoustic model adaptation step are repeatedly executed until a predetermined condition is satisfied.

The acoustic model adaptation method according to claim 4,
wherein the acoustic model selection step stores the output post-adaptation acoustic model in the acoustic model storage unit, reads all the speech recognition results from the speech recognition result storage unit, outputs a speech recognition result text and a reliability for all the speech included in the speech recognition results using the post-adaptation acoustic model, and stores the speech recognition result text and the reliability in the speech recognition result storage unit, and
the weak speaker detection step, the adaptation data selection step, the acoustic model adaptation step, and the acoustic model selection step are repeatedly executed until a predetermined condition is satisfied.

An acoustic model adaptation apparatus comprising:
an acoustic model storage unit that stores a pre-adaptation acoustic model;
a speech recognition result storage unit that stores speech recognition results;
a speech recognition unit that outputs at least a speech recognition result text and a reliability from input speech using the pre-adaptation acoustic model;
a speech recognition result registration unit that stores, in the speech recognition result storage unit, a speech recognition result including at least a speaker ID, the speech, the speech recognition result text, and the reliability;
a weak speaker detection unit that reads all the speech recognition results from the speech recognition result storage unit and, based on a preset detection condition, extracts the speaker ID of a weak speaker whose speech recognition accuracy is lower than that of other speakers;
an adaptation data selection unit that reads, from the speech recognition result storage unit, the speech recognition results whose speaker ID is the speaker ID of the weak speaker and whose reliability is equal to or higher than a preset reliability threshold, and extracts adaptation data comprising at least the speech and the speech recognition result text; and
an acoustic model adaptation unit that outputs a post-adaptation acoustic model from the pre-adaptation acoustic model and the adaptation data using a preset adaptation parameter.

A program for causing a computer to function as the acoustic model adaptation apparatus according to claim 7.