JP4779239B2

JP4779239B2 - Acoustic model learning apparatus, acoustic model learning method, and program thereof

Info

Publication number: JP4779239B2
Application number: JP2001179125A
Authority: JP
Inventors: 優高野
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 2001-06-13
Filing date: 2001-06-13
Publication date: 2011-09-28
Anticipated expiration: 2021-06-13
Also published as: JP2002372987A

Description

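[0001]
[Technical Field of the Invention]
The present invention relates to an acoustic model learning apparatus, an acoustic model learning method, and a program therefor, and in particular to an apparatus, method, and program that weight speech samples according to their characteristics in order to create a highly reliable acoustic model.
[0002]
[Prior Art]
An acoustic model learning apparatus typically uses actual speech to train the acoustic model used for speech recognition. The model to be trained is generally a Hidden Markov Model (hereinafter, HMM). The probability distribution representing each HMM state is often a continuous mixture distribution, and in many cases the forward-backward method is used to train the HMM. Parameter estimation for such HMM-based acoustic models is described in Lawrence Rabiner and Biing-Hwang Juang, "Fundamentals of Speech Recognition", 1993, pp. 333-389 (hereinafter, Conventional Example 1).
[0003]
In Conventional Example 1, each of the probability distributions making up the continuous mixture probability distribution used in the HMM is given a mixture weight indicating its mixing ratio within the mixture.
[0004]
The following describes how the HMM parameters are calculated with the forward-backward method.
[0005]
Let O_t (t an integer from 1 to T) be the feature at each time (frame) t. The forward probability α in the forward-backward method is given by (Equation 1.1) and (Equation 1.2) below.
[0006]
[Equation 1]

[0007]
The forward probability α(t, i) is the probability of observing feature O_t and being in state S_i. Likewise, α(1, i) is the probability of observing feature O_1 and being in state S_i, and α(t+1, j) is the probability of observing feature O_{t+1} and being in state S_j.
[0008]
The state transition probability a_ij is the probability of a transition from state S_i to state S_j. The observation probability b(i, O_t) is the probability that feature O_t of frame t is observed on the transition to state S_i.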
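The equation image for (Equation 1.1) and (Equation 1.2) is not reproduced in this text. As a sketch only, assuming the standard forward recursion and an initial state probability \pi_i (a symbol the description does not name), the two equations would take roughly this form:

```latex
\begin{align}
\alpha(1, i) &= \pi_i \, b(i, O_1) \tag{1.1} \\
\alpha(t+1, j) &= \Bigl[ \sum_{i} \alpha(t, i)\, a_{ij} \Bigr] b(j, O_{t+1}),
  \qquad 1 \le t \le T-1 \tag{1.2}
\end{align}
```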
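[0009]
The backward probability β in the forward-backward method is given by (Equation 2.1) and (Equation 2.2) below.
[0010]
[Equation 2]

[0011]
The backward probability β(t, i) is the probability of being in state S_i at frame t and subsequently observing feature O_{t+1} at frame (t+1). Frame T denotes the frame of the final state.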
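The image for (Equation 2.1) and (Equation 2.2) is likewise missing. Under the assumption that the conventional backward recursion is meant, with the usual initialization of 1 at the final frame T (an assumption, since the text does not state it), the equations would read approximately:

```latex
\begin{align}
\beta(T, i) &= 1 \tag{2.1} \\
\beta(t, i) &= \sum_{j} a_{ij}\, b(j, O_{t+1})\, \beta(t+1, j),
  \qquad t = T-1, \dots, 1 \tag{2.2}
\end{align}
```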
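[0012]
The correspondence probability γ in the forward-backward method is calculated from the forward probability α and the backward probability β, and is given by (Equation 3.1) below.
[0013]
[Equation 3]

[0014]
The correspondence probability γ(t, j, k) is the probability that, on a transition to state S_j at frame t, feature O_t is observed in the k-th mixture component of state S_j. N(O_t, μ_jk, U_jk) is the probability distribution of the k-th mixture component of state S_j, with modeled feature O_t, mean vector μ_jk, and covariance matrix U_jk. c_jk is the mixture weight coefficient for N(O_t, μ_jk, U_jk).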
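The image for (Equation 3.1) is not available here. A plausible form, assuming the usual factorization into a state posterior and a within-state component posterior (the summation indices i and m are introduced for illustration), is:

```latex
\begin{equation}
\gamma(t, j, k) =
  \frac{\alpha(t, j)\,\beta(t, j)}{\sum_{i} \alpha(t, i)\,\beta(t, i)}
  \cdot
  \frac{c_{jk}\, N(O_t, \mu_{jk}, U_{jk})}{\sum_{m} c_{jm}\, N(O_t, \mu_{jm}, U_{jm})}
  \tag{3.1}
\end{equation}
```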
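[0015]
The averages of the parameters of the k-th mixture component of state S_j in the HMM, namely the mixture weight c_jk, the mean vector μ(t, j, k), and the covariance matrix U(j, k), are calculated by (Equation 4.1), (Equation 4.2), and (Equation 4.3) below.
[0016]
[Equation 4]

[0017]
Here, the mixture weight c_jk is the mixture weight of the k-th mixture component of state S_j in the HMM. The mean vector μ(t, j, k) is the mean vector of the k-th mixture component of state S_j in the HMM. The covariance matrix U(j, k) is the covariance matrix of the k-th mixture component of state S_j in the HMM. V_k denotes a given character in the character string V. (O_t − μ_jk)' denotes the transpose of the vector (O_t − μ_jk).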
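The image for (Equation 4.1) to (Equation 4.3) is missing from this text. Assuming the conventional re-estimation formulas are intended (the bar notation for re-estimated values and the index m are assumptions), they would look roughly as follows:

```latex
\begin{align}
\bar{c}_{jk} &= \frac{\sum_{t=1}^{T} \gamma(t, j, k)}
                     {\sum_{t=1}^{T} \sum_{m} \gamma(t, j, m)} \tag{4.1} \\
\bar{\mu}_{jk} &= \frac{\sum_{t=1}^{T} \gamma(t, j, k)\, O_t}
                       {\sum_{t=1}^{T} \gamma(t, j, k)} \tag{4.2} \\
\bar{U}_{jk} &= \frac{\sum_{t=1}^{T} \gamma(t, j, k)\,(O_t - \mu_{jk})(O_t - \mu_{jk})'}
                     {\sum_{t=1}^{T} \gamma(t, j, k)} \tag{4.3}
\end{align}
```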
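[0018]
In the speaker adaptation method for acoustic models disclosed in Japanese Unexamined Patent Application Publication No. H5-232989 (hereinafter, Conventional Example 2), only the weight coefficients that determine the mixing ratio of each of the probability distributions making up the continuous mixture probability distribution used in the HMM are re-estimated.
[0019]
The hidden Markov model calculation method disclosed in Japanese Unexamined Patent Application Publication No. H10-11086 (hereinafter, Conventional Example 3) describes an HMM calculation method using the forward-backward method.
[0020]
[Problems to Be Solved by the Invention]
In general, training a highly reliable probability model requires a large amount of speech data. In particular, a speaker-independent acoustic model must absorb the variation in speech caused by individual differences among speakers, and therefore requires a large amount of speech data from speakers' utterances. When a large number of speech samples are collected, however, mispronounced utterances or low-quality speech may be mixed in.
[0021]
Estimating a probability model (acoustic model) also raises the following problem. Speech data normally has to be collected from natural utterances, so the content spoken consists of words that actually exist. The distribution of the phonemes (characters) that make up real words is inevitably skewed; in Japanese, for example, vowels, and "a" in particular, occur very frequently. When a probability model is estimated, the reliability of each probability distribution varies with the number of samples used to estimate it. Consequently, when the phonemes making up words are used as speech data for building an acoustic model, the skew in phoneme frequency has to be corrected.
[0022]
The present invention has been made in view of the above problems. In addition to the weights attached to each HMM mixture component in Conventional Example 1, Conventional Example 2, and the combination of Conventional Examples 2 and 3, it attaches to each frame of a speech sample a weight coefficient set according to the characteristics of the collected speech samples. Its object is to provide an acoustic model learning apparatus that thereby amplifies or removes specific speech samples, or specific parts of speech samples, during acoustic model training, corrects the skew in the frequency of the phonemes making up the speech samples, and provides a highly reliable acoustic model.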
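[0023]
[Means for Solving the Problems]
To achieve this object, the present invention has the following features. An acoustic model learning apparatus according to the present invention comprises:
speech analysis means for extracting a feature for each frame from input training speech;
dictionary generation means for generating a training dictionary from an input acoustic model and a correct-answer string, where the input acoustic model uses probability distributions of the features extracted frame by frame from predetermined speech to represent each frame-wise fragment of that speech as a state and has those states as its constituent units, the correct-answer string is character string information indicating the content of the training speech, and the training dictionary is information on the state sequence obtained by assigning the correct-answer string to the states of the input acoustic model;
correspondence probability calculation means for referring to the training dictionary generated by the dictionary generation means and calculating, for each frame of the training speech, the correspondence probability between the feature of the training speech and a state of the input acoustic model;
first maximum likelihood state sequence generation means for referring to the training dictionary, assigning the states represented by the input acoustic model, or state sequences made up of a plurality of those states, to each frame of the training speech with maximum likelihood, and generating a first maximum likelihood state sequence;
second maximum likelihood state sequence generation means for referring to a dictionary representing arbitrary characters, assigning the states represented by the input acoustic model, or state sequences made up of a plurality of those states, to each frame of the training speech with maximum likelihood, and generating a second maximum likelihood state sequence;
weight calculation means for comparing the first maximum likelihood state sequence with the second maximum likelihood state sequence and, based on the comparison result, calculating for each frame of the training speech a weight coefficient, which is a coefficient applied when weighting the correspondence probability; and
re-evaluation means for calculating statistics based on the correspondence probability calculated by the correspondence probability calculation means, the weight coefficient calculated by the weight calculation means, and the feature calculated by the speech analysis means, re-estimating the parameters of the input acoustic model based on the calculated statistics, and creating an output acoustic model.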
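[0024]
An acoustic model learning method according to the present invention comprises:
a speech analysis step of extracting a feature for each frame from input training speech;
a dictionary generation step of generating a training dictionary from an input acoustic model and a correct-answer string, where the input acoustic model uses probability distributions of the features extracted frame by frame from predetermined speech to represent each frame-wise fragment of that speech as a state and has those states as its constituent units, the correct-answer string is character string information indicating the content of the training speech, and the training dictionary is information on the state sequence obtained by assigning the correct-answer string to the states of the input acoustic model;
a correspondence probability calculation step of referring to the training dictionary generated in the dictionary generation step and calculating, for each frame of the training speech, the correspondence probability between the feature of the training speech and a state of the input acoustic model;
a first maximum likelihood state sequence generation step of referring to the training dictionary, assigning the states represented by the input acoustic model, or state sequences made up of a plurality of those states, to each frame of the training speech with maximum likelihood, and generating a first maximum likelihood state sequence;
a second maximum likelihood state sequence generation step of referring to a dictionary representing arbitrary characters, assigning the states represented by the input acoustic model, or state sequences made up of a plurality of those states, to each frame of the training speech with maximum likelihood, and generating a second maximum likelihood state sequence;
a weight calculation step of comparing the first maximum likelihood state sequence with the second maximum likelihood state sequence and, based on the comparison result, calculating for each frame of the training speech a weight coefficient, which is a coefficient applied when weighting the correspondence probability; and
a re-evaluation step of calculating statistics based on the correspondence probability calculated in the correspondence probability calculation step, the weight coefficient calculated in the weight calculation step, and the feature calculated in the speech analysis step, re-estimating the parameters of the input acoustic model based on the calculated statistics, and creating an output acoustic model.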
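[0025]
A program according to the present invention causes a computer to execute:
a speech analysis process of extracting a feature for each frame from input training speech;
a dictionary generation process of generating a training dictionary from an input acoustic model and a correct-answer string, where the input acoustic model uses probability distributions of the features extracted frame by frame from predetermined speech to represent each frame-wise fragment of that speech as a state and has those states as its constituent units, the correct-answer string is character string information indicating the content of the training speech, and the training dictionary is information on the state sequence obtained by assigning the correct-answer string to the states of the input acoustic model;
a correspondence probability calculation process of referring to the training dictionary generated by the dictionary generation process and calculating, for each frame of the training speech, the correspondence probability between the feature of the training speech and a state of the input acoustic model;
a first maximum likelihood state sequence generation process of referring to the training dictionary, assigning the states represented by the input acoustic model, or state sequences made up of a plurality of those states, to each frame of the training speech with maximum likelihood, and generating a first maximum likelihood state sequence;
a second maximum likelihood state sequence generation process of referring to a dictionary representing arbitrary characters, assigning the states represented by the input acoustic model, or state sequences made up of a plurality of those states, to each frame of the training speech with maximum likelihood, and generating a second maximum likelihood state sequence;
a weight calculation process of comparing the first maximum likelihood state sequence with the second maximum likelihood state sequence and, based on the comparison result, calculating for each frame of the training speech a weight coefficient, which is a coefficient applied when weighting the correspondence probability; and
a re-evaluation process of calculating statistics based on the correspondence probability calculated by the correspondence probability calculation process, the weight coefficient calculated by the weight calculation process, and the feature calculated by the speech analysis process, re-estimating the parameters of the input acoustic model based on the calculated statistics, and creating an output acoustic model.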
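[0041]
[Embodiments of the Invention]
(First Embodiment)
FIG. 1 shows the configuration of the acoustic model learning apparatus according to the first embodiment of the present invention, which is described below with reference to FIG. 1. In this embodiment, an HMM with continuous mixture probability distributions is used as the acoustic model. In this acoustic model, probability distributions representing the features extracted frame by frame from predetermined speech are used to represent each frame-wise fragment of the speech as a state, and those states are the constituent units.
[0042]
The acoustic model learning apparatus has a speech analysis unit 101, a dictionary unit 102, a forward-backward calculation unit 103, a re-evaluation unit 104, a Viterbi calculation unit 105, and a weight calculation unit 106. Each unit is described below with reference to FIG. 1.
[0043]
Training speech, which is the speech information used to train the acoustic model, is input to the speech analysis unit 101. The same training speech is also input to the Viterbi calculation unit 105.
[0044]
The speech analysis unit 101 divides the input training speech at a predetermined period, treats each resulting section as a "frame", and performs frequency analysis of the training speech frame by frame. The (acoustic) features of the training speech extracted for each frame by this analysis are input to the forward-backward calculation unit 103 and the re-evaluation unit 104. The feature may be the speech power, the change in power, the cepstrum, the change in cepstrum, or the like.
[0045]
An acoustic model and a correct-answer string are input to the dictionary unit 102. The correct-answer string may be character string information entered through predetermined input means (not shown). The input means supplies to the dictionary unit 102, as the correct-answer string, character information indicating the content of the training speech input to the speech analysis unit 101 and the Viterbi calculation unit 105.
[0046]
Based on the input acoustic model (hereinafter, the input acoustic model) and the input correct-answer string, the dictionary unit 102 creates and stores a training dictionary based on a subword model. A training dictionary based on a subword model is information on the state sequence obtained by dividing the input correct-answer string (for example, a word that actually exists) into phoneme or syllable units (subword units). Separately from the training dictionary, the dictionary unit 102 stores in advance a "dictionary representing arbitrary character strings", which is information on arbitrary character strings.
[0047]
The forward-backward calculation unit 103 refers to the training dictionary stored in the dictionary unit 102 and, based on the features of the training speech extracted by the speech analysis unit 101 and the input acoustic model, calculates the forward and backward probabilities of the forward-backward method. From these it then calculates the correspondence probability between the features of the training speech and the states of the input acoustic model, and outputs the calculated correspondence probability to the re-evaluation unit 104.
[0048]
Taking O_t (t an integer from 1 to T) as the feature of each frame t converted from the input training speech, the forward-backward calculation unit 103 calculates the forward probability α based on (Equation 1.1) and (Equation 1.2), and the backward probability β based on (Equation 2.1) and (Equation 2.2).
[0049]
Using the calculated forward probability α and backward probability β, the forward-backward calculation unit 103 calculates the correspondence probability γ based on (Equation 3.1).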
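[0050]
The same training speech as that input to the speech analysis unit 101 is input to the Viterbi calculation unit 105. The input acoustic model is also input to the Viterbi calculation unit 105 via the dictionary unit 102.
[0051]
The Viterbi calculation unit 105 divides the input training speech into predetermined time intervals (frames). Then, referring to predetermined character information, it performs Viterbi matching, assigning to each frame with maximum likelihood a state, or a state sequence made up of a plurality of states, based on the input acoustic model, and creates a predetermined maximum likelihood state sequence.
[0052]
The weight calculation unit 106 calculates the weight coefficient R_t based on the several kinds of maximum likelihood state sequences that the Viterbi calculation unit 105 creates by referring to several kinds of predetermined character information.
[0053]
The re-evaluation unit 104 calculates the statistics of each state of the acoustic model (the averages of the mixture weights, mean vectors, and covariance matrices) based on the weight coefficient R_t calculated by the weight calculation unit 106, the correspondence probability calculated by the forward-backward calculation unit 103, the features extracted by the speech analysis unit 101, and the input acoustic model received via the forward-backward calculation unit 103. Based on these statistics, it re-evaluates each parameter of the input acoustic model (the averages of the mixture weights, mean vectors, and covariance matrices), creates an acoustic model from the re-evaluated parameters, and outputs it as the output acoustic model.
[0054]
The re-evaluation unit 104 weights the correspondence probability γ by multiplying it by the weight coefficient R_t. Using the weighted correspondence probability γ·R_t, it calculates as statistics the averages of the mixture weight c_jk, the mean vector μ(t, j, k), and the covariance matrix U(j, k). These statistics are given by (Equation 5.1), (Equation 5.2), and (Equation 5.3) below.
[0055]
[Equation 5]

[0056]
Here, the mixture weight c_jk is the mixture weight of the k-th mixture component of state S_j in the HMM. The mean vector μ(t, j, k) is the mean vector of the k-th mixture component of state S_j in the HMM. The covariance matrix U(j, k) is the covariance matrix of the k-th mixture component of state S_j in the HMM. V_k denotes a given character in the character string V. (O_t − μ_jk)' denotes the transpose of the vector (O_t − μ_jk).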
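The image for (Equation 5.1) to (Equation 5.3) is not reproduced here. Because the text states that the statistics are computed from the weighted correspondence probability γ·R_t, one plausible reading replaces γ(t, j, k) with γ(t, j, k) R_t in the conventional re-estimation formulas. The bar notation, the index m, and the normalization of the mixture weight are assumptions:

```latex
\begin{align}
\bar{c}_{jk} &= \frac{\sum_{t=1}^{T} \gamma(t, j, k)\, R_t}
                     {\sum_{t=1}^{T} \sum_{m} \gamma(t, j, m)\, R_t} \tag{5.1} \\
\bar{\mu}_{jk} &= \frac{\sum_{t=1}^{T} \gamma(t, j, k)\, R_t\, O_t}
                       {\sum_{t=1}^{T} \gamma(t, j, k)\, R_t} \tag{5.2} \\
\bar{U}_{jk} &= \frac{\sum_{t=1}^{T} \gamma(t, j, k)\, R_t\,(O_t - \mu_{jk})(O_t - \mu_{jk})'}
                     {\sum_{t=1}^{T} \gamma(t, j, k)\, R_t} \tag{5.3}
\end{align}
```

Under this reading, setting R_t = 1 for every frame recovers the unweighted re-estimation, and R_t = 0 removes frame t from all three sums.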
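[0057]
FIG. 2 shows the phoneme set that the input acoustic model of the first embodiment can represent. FIG. 3 shows the training dictionary created by the acoustic model learning apparatus of the first embodiment. FIG. 4 shows the weight coefficients R_t of the first embodiment. FIG. 9 is a flowchart showing the flow of operation of the acoustic model learning apparatus of the first embodiment. The operation of the apparatus is described below along FIG. 9 with reference to FIGS. 1 to 4.
[0058]
In this embodiment, an utterance of the name "加藤今太郎" ("かとうこんたろう", katou kontarou) by a given speaker is used as an example of training speech. It is also assumed that the input acoustic model (initial model) is one that recognizes this speaker's utterance of "かとうこんたろう" as "さとうこんたろう" (satou kontarou).
[0059]
In an HMM, the length of speech corresponding to one state is variable, and the maximum likelihood state sequence of the HMM is obtained by Viterbi matching or the like. For simplicity, however, this embodiment assumes that the input speech consists of 14 frames and that one state is assigned to each frame.
[0060]
First, predetermined control means (not shown) determines whether training speech has been input to the speech analysis unit 101 (step S901). If it determines that no training speech has been input (step S901/No), step S901 is repeated.
[0061]
If it determines that training speech has been input to the speech analysis unit 101 (step S901/Yes), the speech analysis unit 101 analyzes the frequency of the training speech frame by frame and extracts the features of the training speech based on the analyzed frequencies (step S902). The extracted features are output to the forward-backward calculation unit 103 and the re-evaluation unit 104.
[0062]
Next, the control means determines whether a correct-answer string and an input acoustic model have been input to the dictionary unit 102 (step S903). If it determines that they have not been input (step S903/No), step S903 is repeated.
[0063]
If it determines that the correct-answer string and the input acoustic model have been input to the dictionary unit 102 (step S903/Yes), the dictionary unit 102 creates a training dictionary based on them and stores it (step S904).
[0064]
The process by which the dictionary unit 102 creates the training dictionary is now described with reference to FIGS. 2 and 3. FIG. 2 shows the set of phonemes (phoneme set) that the input acoustic model of this embodiment can represent. This phoneme set is included in the input acoustic model. Using the phoneme set, the dictionary unit 102 divides the training speech "かとうこんたろう" into the phoneme units "k-a-t-o-u-k-o-ng-t-a-r-o-u". It assigns the resulting phonemes to states S_i (i an integer from 1 to 13) to create a state sequence as shown in FIG. 3, that is, the training dictionary corresponding to the training speech, and stores it.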
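The dictionary-building step just described can be sketched in a few lines. This is only an illustration: the function name `build_training_dictionary`, the `PHONEME_SET` contents (only the phonemes used by the example, not the full set of FIG. 2), and the representation of the dictionary as a mapping from state index to phoneme are all assumptions.

```python
def build_training_dictionary(phoneme_string, phoneme_set, sep="-"):
    """Split a phoneme transcription and number the resulting states from 1."""
    phonemes = phoneme_string.split(sep)
    unknown = [p for p in phonemes if p not in phoneme_set]
    if unknown:
        raise ValueError(f"phonemes not in the model's phoneme set: {unknown}")
    # State S_i (1-based, as in the description) holds the i-th phoneme.
    return {i: p for i, p in enumerate(phonemes, start=1)}


# Hypothetical phoneme set: just the phonemes needed for the example utterance.
PHONEME_SET = {"k", "a", "t", "o", "u", "ng", "r"}

dictionary = build_training_dictionary("k-a-t-o-u-k-o-ng-t-a-r-o-u", PHONEME_SET)
```

With this input the mapping has 13 entries, with "k" at S_1 and S_6 and "r" at S_11, which is the state numbering the later embodiments refer to.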
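[0065]
After the dictionary unit 102 has created the training dictionary, the forward-backward calculation unit 103 refers to it and calculates the forward and backward probabilities based on the features extracted by the speech analysis unit 101 (step S905).
[0066]
Next, the forward-backward calculation unit 103 calculates the correspondence probability from the calculated forward and backward probabilities (step S906).
[0067]
The control means determines whether the same training speech as that input to the speech analysis unit 101 has been input to the Viterbi calculation unit 105, and whether the input acoustic model has been input to the Viterbi calculation unit 105 via the dictionary unit 102 (step S907). If it determines that the training speech and the input acoustic model have not been input (step S907/No), step S907 is repeated.
[0068]
If it determines that the training speech and the input acoustic model have been input to the Viterbi calculation unit 105 (step S907/Yes), the Viterbi calculation unit 105 uses them, refers to the training dictionary created by the dictionary unit 102, and generates a maximum likelihood state sequence by Viterbi matching (step S908). The maximum likelihood state sequence generated with reference to the training dictionary is called the first maximum likelihood state sequence.
[0069]
The Viterbi calculation unit 105 further uses the input training speech and input acoustic model, refers to the dictionary representing arbitrary character strings stored in the dictionary unit 102, and generates a maximum likelihood state sequence by Viterbi matching (step S909). The maximum likelihood state sequence generated with reference to the dictionary representing arbitrary characters is called the second maximum likelihood state sequence.
[0070]
Next, the weight calculation unit 106 compares each state of the first maximum likelihood state sequence with each state of the second maximum likelihood state sequence generated by the Viterbi calculation unit 105, and calculates the weight coefficient R_t given by (Equation 6.1) and (Equation 6.2) below (step S910). A weight coefficient R_t is calculated for each frame of the training speech.
[0071]
[Equation 6]

[0072]
When a speaker's mispronunciation or low-quality speech is used as training speech, a difference is likely to arise between the input correct-answer string and the training speech as recognized by the input acoustic model; that is, a speech sample of a given linguistic unit (for example, a phoneme or syllable) in the input training speech is likely to be misrecognized by the acoustic model. A highly reliable output acoustic model can be obtained by keeping such misrecognized speech samples from being strongly reflected in the output acoustic model.
[0073]
The weight calculation unit 106 compares, frame by frame, each state of the first maximum likelihood state sequence with each state of the second maximum likelihood state sequence, and calculates the weight coefficient R_t based on (Equation 6.1) and (Equation 6.2).
[0074]
(Equation 6.1) gives the weight coefficient R_t for a frame in which the first and second maximum likelihood state sequences differ; in this case R_t is calculated as "0".
[0075]
(Equation 6.2) gives the weight coefficient R_t for every frame in which the first and second maximum likelihood state sequences agree; in this case R_t is calculated as "1".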
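The frame-wise rule of (Equation 6.1) and (Equation 6.2) amounts to an element-wise comparison of the two state sequences. The sketch below assumes one state per frame and uses phoneme labels as stand-ins for states; the function name `frame_weights` and its parameters are hypothetical.

```python
def frame_weights(first_seq, second_seq, match=1.0, mismatch=0.0):
    """Return R_t per frame: `mismatch` where the sequences differ, else `match`."""
    if len(first_seq) != len(second_seq):
        raise ValueError("both state sequences must cover the same frames")
    return [match if a == b else mismatch for a, b in zip(first_seq, second_seq)]


# First sequence: aligned against the correct answer ("katou kontarou").
first = "k a t o u k o ng t a r o u".split()
# Second sequence: free recognition, which hears "satou kontarou".
second = "s a t o u k o ng t a r o u".split()

weights = frame_weights(first, second)
```

Only the first frame differs, so it alone receives weight 0; passing a `mismatch` value between 0 and 1 would soften the exclusion instead of removing the frame outright.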
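[0076]
When the quality of the training speech is degraded by a speaker's mispronunciation or the like, a difference arises between the state of the first maximum likelihood state sequence and the state of the second maximum likelihood state sequence assigned to the frames corresponding to the degraded part. To obtain a highly reliable output acoustic model, the part where the difference arose must therefore be kept from being reflected in the output acoustic model.
[0077]
In this embodiment, the weight coefficient R_t of the high-quality parts of the training speech (frames in which the states of the first and second maximum likelihood state sequences agree) is set to "1", and R_t of the low-quality parts is set to "0", a value lower than that of the high-quality parts. This keeps the low-quality parts of the training speech, that is, the parts misrecognized by the input acoustic model, from being reflected in the output acoustic model.
[0078]
With the input acoustic model of this embodiment, the training speech "かとうこんたろう" is recognized as "さとうこんたろう". In this case, how the "か" (ka) part was actually uttered is unknown, but the phoneme "k" in that part is misrecognized by the input acoustic model. To create an output acoustic model in which the model of phoneme "k" is recognized correctly, the phoneme "k" of the "か" part must be set so that it is not reflected in the output acoustic model.
[0079]
FIG. 4 shows the weight coefficients R_t given to the training dictionary of FIG. 3 by (Equation 6.1) and (Equation 6.2). R_t (t = 1 to 13) is the weight coefficient for S_i (i = 1 to 13), respectively. As FIG. 4 shows, setting the weight coefficient R_1 for phoneme "k" (= S_1) of the "か" part to "0" and the weight coefficients R_2 to R_13 for the other phonemes (S_2 to S_13) to "1" keeps the phoneme "k" of the "か" part from being reflected in the output acoustic model, making it possible to create a highly reliable acoustic model.
[0080]
In this embodiment, R_1 is set to "0" so that the phoneme "k" of the "か" part is not reflected in the output acoustic model. Setting R_1 instead to any value of at least 0 and less than 1 makes it possible to adjust the influence of that phoneme on the output acoustic model.
[0081]
The description now returns to the flowchart of FIG. 9. The re-evaluation unit 104 calculates the statistics of the acoustic model (the averages of the mixture weights, mean vectors, and covariance matrices) based on the weight coefficient R_t calculated by the weight calculation unit 106, the features extracted by the speech analysis unit 101, and the correspondence probability calculated by the forward-backward calculation unit 103 (step S911).
[0082]
After calculating the statistics, the re-evaluation unit 104 re-evaluates, based on them, each parameter of the input acoustic model received via the forward-backward calculation unit 103 (the averages of the mixture weights, mean vectors, and covariance matrices) and creates the output acoustic model (step S912). The output acoustic model is output from the re-evaluation unit 104 (step S913), after which the acoustic model learning apparatus ends its operation.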
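To make the effect of the weighting in steps S911 and S912 concrete, the sketch below re-estimates one mixture component from weighted occupancies γ·R_t. It is a simplification under stated assumptions: features are scalars rather than vectors, only one (state, component) pair is handled, the variance is taken about the newly estimated mean, and the name `reestimate_component` is hypothetical.

```python
def reestimate_component(gamma, weights, features):
    """Weighted re-estimation for one (state, component) pair.

    gamma[t]    -- correspondence probability of the component at frame t
    weights[t]  -- frame weight R_t
    features[t] -- scalar feature O_t (a simplifying assumption)
    Returns (weighted occupancy, mean, variance).
    """
    occupancy = [g * r for g, r in zip(gamma, weights)]
    total = sum(occupancy)
    if total == 0.0:
        raise ValueError("component has no weighted occupancy; keep old parameters")
    mean = sum(w * o for w, o in zip(occupancy, features)) / total
    var = sum(w * (o - mean) ** 2 for w, o in zip(occupancy, features)) / total
    return total, mean, var


# Frame 0 is an outlier from a misrecognized segment; R_0 = 0 removes it.
total, mean, var = reestimate_component(
    gamma=[1.0, 1.0, 1.0], weights=[0.0, 1.0, 1.0], features=[100.0, 2.0, 4.0]
)
```

The mixture weight of a component would then be its weighted occupancy divided by the summed occupancy of all components of the same state.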
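[0083]
(Second Embodiment)
Unless otherwise noted, the configuration and operation of the acoustic model learning apparatus of the second embodiment are the same as those of the first embodiment.
[0084]
In general, when a noisy environment degrades the quality of the training speech, misrecognition is not confined to a single phoneme but also affects the neighboring phonemes. In the first embodiment, the weight coefficient R_t was set per phoneme. When misrecognition extends over several phonemes because of environmental noise or the like, weighting in syllable units makes it possible to create a more reliable output acoustic model.
[0085]
FIG. 5 shows the weight coefficients R_t of the second embodiment. As in the first embodiment, R_t (t = 1 to 13) corresponds to state S_i (i = 1 to 13) in FIG. 3.
[0086]
In the first embodiment, the weight coefficient R_1 for phoneme "k" (= S_1) of "か" (ka) was set to "0". In this embodiment, quality is degraded in the syllable "か (k-a)" of the training speech "かとうこんたろう", and a difference arises there between the first and second maximum likelihood state sequences. When the quality degradation occurs in syllable units in this way, setting both R_1 for phoneme "k" (= S_1) and R_2 for phoneme "a" (= S_2) to "0" makes it possible to create a more reliable output acoustic model than setting only R_1 to "0".
[0087]
In this embodiment, R_1 and R_2 are set to "0" so that phonemes "k" and "a" of the "か" part are not reflected in the output acoustic model. Setting R_1 and R_2 instead to any value of at least 0 and less than 1 makes it possible to adjust the influence of those phonemes on the output acoustic model.
[0088]
(Third Embodiment)
Unless otherwise noted, the configuration and operation of the acoustic model learning apparatus of the third embodiment are the same as those of the first embodiment.
[0089]
The second embodiment dealt with the case where misrecognition caused by a noisy environment affects not just a single phoneme but also its neighbors, and set the weight coefficient R_t per syllable. When the range of misrecognized phonemes extends beyond a syllable because of environmental noise or the like, widening the weighted range from syllable units to word units makes it possible to create a more reliable output acoustic model.
[0090]
FIG. 6 shows the weight coefficients R_t of the third embodiment. As in the first embodiment, R_t (t = 1 to 13) corresponds to state S_i (i = 1 to 13) in FIG. 3.
[0091]
In the first embodiment, R_1 for phoneme "k" (= S_1) of "か" was set to "0". In the second embodiment, R_1 for phoneme "k" (= S_1) and R_2 for phoneme "a" (= S_2) were both set to "0". In this embodiment, quality is degraded in the word "かとう (k-a-t-o-u)" of the training speech "かとうこんたろう", and a difference arises there between the first and second maximum likelihood state sequences. When the quality degradation occurs in word units in this way, setting to "0" the weight coefficients R_1 to R_5 corresponding to phonemes "k" (= S_1), "a" (= S_2), "t" (= S_3), "o" (= S_4), and "u" (= S_5) of the word makes it possible to create a more reliable output acoustic model than setting R_t to "0" per phoneme or per syllable.
[0092]
In this embodiment, R_1 to R_5 are set to "0" so that the phonemes "k-a-t-o-u" of the "かとう" part are not reflected in the output acoustic model. Setting R_1 to R_5 instead to any value of at least 0 and less than 1 makes it possible to adjust the influence of those phonemes on the output acoustic model.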
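The syllable-level and word-level variants differ only in the unit over which a detected mismatch is spread, so one helper can illustrate both. In the sketch below the function name `unit_weights`, the inclusive 0-based `(start, end)` span format, and the particular syllable and word segmentations of the example are assumptions made for illustration.

```python
def unit_weights(mismatch, units, low=0.0, high=1.0):
    """Give every frame of a unit the weight `low` if any frame in it mismatches.

    mismatch -- list of booleans, one per frame
    units    -- list of (start, end) frame spans, 0-based and inclusive
    """
    weights = [high] * len(mismatch)
    for start, end in units:
        if any(mismatch[start:end + 1]):
            weights[start:end + 1] = [low] * (end - start + 1)
    return weights


# Only the first frame (phoneme "k") disagrees between the two state sequences.
mismatch = [True] + [False] * 12

# Assumed segmentations of k-a-t-o-u-k-o-ng-t-a-r-o-u.
syllables = [(0, 1), (2, 3), (4, 4), (5, 6), (7, 7), (8, 9), (10, 11), (12, 12)]
words = [(0, 4), (5, 12)]

syllable_level = unit_weights(mismatch, syllables)
word_level = unit_weights(mismatch, words)
```

Here the syllable-level result zeroes R_1 and R_2, and the word-level result zeroes R_1 to R_5, matching the two embodiments.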
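[0093]
(Fourth Embodiment)
Unless otherwise noted, the configuration and operation of the acoustic model learning apparatus of the fourth embodiment are the same as those of the first embodiment.
[0094]
In the first to third embodiments, the weight coefficient R_t of the parts where the first and second maximum likelihood state sequences differ (the parts where the quality of the training speech is degraded) was set to "0" so that they were not reflected in the output acoustic model. The apparatus of this embodiment instead actively takes in the parts where mispronunciation or low-quality speech occurred as one variation of utterance. By giving them a weight coefficient R_t higher than that of the high-quality parts, it increases the effective number of low-quality training samples and improves recognition performance for low-quality speech.
[0095]
FIG. 7 shows the weight coefficients R_t of the fourth embodiment, which are given by (Equation 7.1) and (Equation 7.2) below.
[0096]
[Equation 7]

[0097]
As in the first embodiment, the input is an acoustic model that recognizes the training speech "かとうこんたろう" uttered by a given speaker as "さとうこんたろう". In the first embodiment, a highly reliable output acoustic model was created by setting the weight coefficient R_1 for phoneme "k" (= S_1) of "か" to "0" so that it was not reflected in the output acoustic model. In this embodiment, phoneme "k" (= S_1) of "か", for which the first and second maximum likelihood state sequences differ, is given the weight coefficient R_1 = 10, which is higher than the weight coefficient R_t = 1 (t = 2 to 13) set for the other phonemes, for which the two sequences agree.
[0098]
Setting R_1 = 10 in this way allows phoneme "k" (= S_1) of "か", regarded as a rare feature that has not been sufficiently learned, to be reflected in the output acoustic model more strongly than the other phonemes.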
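The emphasis rule of this embodiment can be sketched by reversing the roles of the two weights. The function name `emphasis_weights` and the `boost` parameter are hypothetical; the value 10 follows the example in the text, and the state sequences again use phoneme labels with one state per frame as a simplifying assumption.

```python
def emphasis_weights(first_seq, second_seq, boost=10.0):
    """Return R_t per frame: `boost` where the sequences differ, 1.0 where they agree."""
    if len(first_seq) != len(second_seq):
        raise ValueError("both state sequences must cover the same frames")
    return [boost if a != b else 1.0 for a, b in zip(first_seq, second_seq)]


first = "k a t o u k o ng t a r o u".split()   # aligned to the correct answer
second = "s a t o u k o ng t a r o u".split()  # free recognition result

weights = emphasis_weights(first, second)
# Fraction of the total weight carried by the single mismatched frame.
share = weights[0] / sum(weights)
```

With R_1 = 10 and twelve frames at weight 1, the mismatched frame carries 10/22 of the total weight, which is how it behaves like ten samples rather than one.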
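[0099]
In this embodiment the weighting by R_t was done per phoneme, but it may be done per syllable as in the second embodiment or per word as in the third embodiment.
[0100]
Also, in this embodiment the weight coefficient R_t for a phoneme where the correct-answer string and the training speech as recognized by the input acoustic model differ was set to "10", but it may be any other value that is larger than the value for phonemes where they agree.
[0101]
(Fifth Embodiment)
Unless otherwise noted, the configuration and operation of the acoustic model learning apparatus of the fifth embodiment are the same as those of the first embodiment.
[0102]
The reliability of a statistical model is strongly affected by the amount of speech samples (phonemes, syllables, or words) used to train its parameters. To make reliability uniform across acoustic models, the amounts of the input speech samples must therefore not be markedly skewed.
[0103]
In this embodiment, the sum of the weight coefficients R_t for each state of the first maximum likelihood state sequence is made constant, equalizing the number of samples of each speech sample of the given linguistic unit (phoneme, syllable, word, or the like).
[0104]
FIG. 10 is a flowchart showing the flow of operation of the acoustic model learning apparatus of the fifth embodiment. The operation of the apparatus is described below along FIG. 10 with reference to FIG. 1.
[0105]
As in the first embodiment, an utterance of the name "加藤今太郎" ("かとうこんたろう", katou kontarou) by a given speaker is used as an example of training speech.
[0106]
First, predetermined control means (not shown) determines whether training speech has been input to the speech analysis unit 101 (step S1001). If it determines that no training speech has been input (step S1001/No), step S1001 is repeated.
[0107]
If it determines that training speech has been input to the speech analysis unit 101 (step S1001/Yes), the speech analysis unit 101 analyzes the frequency of the training speech frame by frame and extracts the features of the training speech based on the analyzed frequencies (step S1002). The extracted features are output to the forward-backward calculation unit 103 and the re-evaluation unit 104.
[0108]
Next, the control means determines whether a correct-answer string and an input acoustic model have been input to the dictionary unit 102 (step S1003). If it determines that they have not been input (step S1003/No), step S1003 is repeated.
[0109]
If it determines that the correct-answer string and the input acoustic model have been input to the dictionary unit 102 (step S1003/Yes), the dictionary unit 102 creates a training dictionary based on them and stores it (step S1004).
[0110]
After the dictionary unit 102 has created the training dictionary, the forward-backward calculation unit 103 refers to it and calculates the forward and backward probabilities based on the features extracted by the speech analysis unit 101 (step S1005).
[0111]
Next, the forward-backward calculation unit 103 calculates the correspondence probability from the calculated forward and backward probabilities (step S1006).
[0112]
The control means determines whether the same training speech as that input to the speech analysis unit 101 has been input to the Viterbi calculation unit 105, and whether the input acoustic model has been input to the Viterbi calculation unit 105 via the dictionary unit 102 (step S1007). If it determines that the training speech and the input acoustic model have not been input (step S1007/No), step S1007 is repeated.
[0113]
If it determines that the training speech and the input acoustic model have been input to the Viterbi calculation unit 105 (step S1007/Yes), the Viterbi calculation unit 105 uses them, refers to the training dictionary created by the dictionary unit 102, and generates a maximum likelihood state sequence by Viterbi matching (step S1008). The maximum likelihood state sequence generated with reference to the training dictionary is called the first maximum likelihood state sequence.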
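[0114]
Next, the weight calculation unit 106 refers to each state of the first maximum likelihood state sequence generated by the Viterbi calculation unit 105 and calculates the weight coefficient R_t based on (Equation 8.1), (Equation 9.1), (Equation 9.2), and (Equation 9.3) below (step S1009).
[0115]
[Equation 8]

[0116]
[Equation 9]

[0117]
In this embodiment, under the condition given by (Equation 8.1), the weight coefficients R_t are summed over the states to which the same speech sample (phoneme, syllable, or word unit) of the training speech is assigned, and R_t is calculated so that these sums are equal. As a result, each speech sample has a uniform influence on the output acoustic model.
[0118]
As in the first embodiment, the training dictionary shown in FIG. 3 is assumed to be generated. FIG. 8 shows the weight coefficients R_t of the fifth embodiment, which are set based on (Equation 9.1), (Equation 9.2), and (Equation 9.3). R_t (t = 1 to 13) in FIG. 8 corresponds to state S_i (i = 1 to 13) shown in FIG. 3.
[0119]
In this embodiment, when the phonemes making up the training speech are examined in ascending order of their assigned frames, the weight coefficient R_t for a phoneme of a kind observed for the first time is set to "1", and R_t for a phoneme of a kind already observed is set to "0".
[0120]
Referring to FIGS. 3 and 8, for example, phoneme "k" of S_6 has already been observed at S_1, so R_6 is set to "0". Phoneme "r" of S_11, on the other hand, has not been observed in S_1 to S_10, so R_11 is set to "1".
[0121]
With R_t calculated in this way, the sum of the weight coefficients R_t attached to each kind of phoneme is "1", so each phoneme is collected as a speech sample an equal number of times.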
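The first-occurrence rule described above can be sketched as a single pass over the phoneme sequence. The function name `first_occurrence_weights` is hypothetical, and the sketch again assumes one state per frame with phoneme labels standing in for states.

```python
def first_occurrence_weights(phonemes):
    """Return R_t = 1.0 the first time a phoneme appears and 0.0 afterwards."""
    seen = set()
    weights = []
    for p in phonemes:
        weights.append(0.0 if p in seen else 1.0)
        seen.add(p)
    return weights


phonemes = "k a t o u k o ng t a r o u".split()
weights = first_occurrence_weights(phonemes)

# Sum of the weights per phoneme type; each should come out equal.
totals = {}
for p, w in zip(phonemes, weights):
    totals[p] = totals.get(p, 0.0) + w
```

For the example utterance this sets R_6 to 0 (the second "k") and R_11 to 1 (the only "r"), and every phoneme type ends up with a total weight of 1.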
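[0122]
The description now returns to the flowchart of FIG. 10. The re-evaluation unit 104 calculates the statistics of the acoustic model (the averages of the mixture weights, mean vectors, and covariance matrices) based on the weight coefficient R_t calculated by the weight calculation unit 106, the features extracted by the speech analysis unit 101, and the correspondence probability calculated by the forward-backward calculation unit 103 (step S1010).
[0123]
After calculating the statistics, the re-evaluation unit 104 re-evaluates, based on them, each parameter of the input acoustic model received via the forward-backward calculation unit 103 (the averages of the mixture weights, mean vectors, and covariance matrices) and creates the output acoustic model (step S1011). The output acoustic model is output from the re-evaluation unit 104 (step S1012), after which the acoustic model learning apparatus ends its operation.
[0124]
As described above, this embodiment keeps constant the sum of the weight coefficients R_t over the states to which the same speech sample (phoneme, syllable, or word) is assigned. This equalizes the sample amount of each speech sample (phoneme, syllable, or word unit) and its influence on the output acoustic model, making it possible to create a highly reliable output acoustic model.
[0125]
The acoustic model learning apparatus performs the following processes: a speech analysis process that extracts a feature for each frame from input training speech; a dictionary generation process that generates a training dictionary from an input acoustic model and a correct-answer string, where the input acoustic model uses probability distributions of the features extracted frame by frame from predetermined speech to represent the frame-wise features of that speech as states and has those states as its constituent units, the correct-answer string is character string information indicating the content of the training speech, and the training dictionary is information on the state sequence obtained by assigning the correct-answer string to the states of the input acoustic model; a correspondence probability calculation process that refers to the training dictionary and calculates, for each frame of the training speech, the correspondence probability between the feature of the training speech and a state of the input acoustic model; a maximum likelihood state sequence generation process that uses a predetermined character string to assign the states represented by the input acoustic model, or state sequences made up of a plurality of states, to each frame of the training speech with maximum likelihood and generates a predetermined maximum likelihood state sequence; a weight calculation process that, based on the predetermined maximum likelihood state sequence, calculates for each frame of the training speech a weight coefficient applied when weighting the correspondence probability; and a re-evaluation process that calculates statistics based on the correspondence probability, the weight coefficient, and the feature calculated by the respective processes, re-estimates the parameters of the input acoustic model based on the calculated statistics, and creates an output acoustic model. These processes are executed by a computer program of the acoustic model learning apparatus; the program may be recorded on a recording medium such as an optical disc or a magnetic disk and loaded from that medium.
[0126]
The embodiments above are preferred examples of the present invention. The invention is not limited to them and can be implemented with various modifications without departing from its gist.
[0127]
[Effects of the Invention]
As described above, the present invention calculates a weight coefficient for each frame of the training speech and reflects the weighting by these coefficients in the output acoustic model. This makes it possible to extract, from the observed speech samples, only those useful for creating the acoustic model, and to create a highly reliable acoustic model.
[0128]
By setting the weight coefficient of high-quality speech samples of a given linguistic unit (phoneme, syllable, word, or the like) to "1" and that of low-quality speech samples to "0", the present invention also makes it possible to keep low-quality speech samples from being reflected in the output acoustic model.
[0129]
By setting the weight coefficient of high-quality speech samples of a given linguistic unit to "1" and that of low-quality speech samples to any value greater than 1, the present invention also makes it possible to create an output acoustic model with high recognition accuracy for low-quality speech samples.
[0130]
By keeping constant the sum of the weight coefficients over the states to which the same speech sample (phoneme, syllable, or word) is assigned, the present invention also equalizes the sample amount of each speech sample (phoneme, syllable, or word unit) and its influence on the output acoustic model, making it possible to create a highly reliable output acoustic model.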
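[Brief Description of the Drawings]
[FIG. 1] A diagram showing the configuration of the acoustic model learning apparatus according to the first embodiment of the present invention.
[FIG. 2] A diagram showing the phoneme set that the input acoustic model of the first embodiment can represent.
[FIG. 3] A diagram showing the training dictionary created by the acoustic model learning apparatus of the first embodiment.
[FIG. 4] A diagram showing the weight coefficients R_t of the first embodiment.
[FIG. 5] A diagram showing the weight coefficients R_t of the second embodiment.
[FIG. 6] A diagram showing the weight coefficients R_t of the third embodiment.
[FIG. 7] A diagram showing the weight coefficients R_t of the fourth embodiment.
[FIG. 8] A diagram showing the weight coefficients R_t of the fifth embodiment.
[FIG. 9] A flowchart showing the flow of operation of the acoustic model learning apparatus of the first embodiment.
[FIG. 10] A flowchart showing the flow of operation of the acoustic model learning apparatus of the fifth embodiment.
[Explanation of Reference Numerals]
101: Speech analysis unit
102: Dictionary unit
103: Forward-backward calculation unit
104: Re-evaluation unit
105: Viterbi calculation unit
106: Weight calculation unit
[0001]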
[Technical Field of the Invention]
The present invention relates to an acoustic model learning apparatus, an acoustic model learning method, and a program therefor, and in particular to an apparatus, method, and program that weight speech samples according to their characteristics in order to create a highly reliable acoustic model.
[0002]
[Prior Art]
An acoustic model learning apparatus typically uses actual speech to train the acoustic model used for speech recognition. The model to be trained is generally a Hidden Markov Model (hereinafter, HMM). The probability distribution representing each HMM state is often a continuous mixture distribution, and in many cases the forward-backward method is used to train the HMM. Parameter estimation for such HMM-based acoustic models is described in Lawrence Rabiner and Biing-Hwang Juang, "Fundamentals of Speech Recognition", 1993, pp. 333-389 (hereinafter, Conventional Example 1).
[0003]
In Conventional Example 1, each of the probability distributions making up the continuous mixture probability distribution used in the HMM is given a mixture weight indicating its mixing ratio within the mixture.
[0004]
The following describes how the HMM parameters are calculated with the forward-backward method.
[0005]
Let O_t (t an integer from 1 to T) be the feature at each time (frame) t. The forward probability α in the forward-backward method is given by (Equation 1.1) and (Equation 1.2) below.
[0006]
[Equation 1]

[0007]
The forward probability α(t, i) is the probability of observing feature O_t and being in state S_i. Likewise, α(1, i) is the probability of observing feature O_1 and being in state S_i, and α(t+1, j) is the probability of observing feature O_{t+1} and being in state S_j.
[0008]
The state transition probability a_ij is the probability of a transition from state S_i to state S_j. The observation probability b(i, O_t) is the probability that feature O_t of frame t is observed on the transition to state S_i.
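The forward recursion built from a_ij and b(i, O_t) can be sketched as follows. This is the textbook forward algorithm rather than a transcription of the patent's equation images; the function name `forward`, the explicit initial state probabilities `init`, and the use of precomputed observation probabilities are assumptions.

```python
def forward(init, trans, obs_prob):
    """Compute forward probabilities alpha[t][i] (t is 0-based here).

    init[i]        -- assumed initial probability of state i
    trans[i][j]    -- state transition probability a_ij
    obs_prob[t][i] -- observation probability b(i, O_t), precomputed per frame
    """
    n = len(init)
    alpha = [[init[i] * obs_prob[0][i] for i in range(n)]]
    for t in range(1, len(obs_prob)):
        prev = alpha[-1]
        alpha.append([
            sum(prev[i] * trans[i][j] for i in range(n)) * obs_prob[t][j]
            for j in range(n)
        ])
    return alpha


# Two-state left-to-right toy model with two frames of made-up probabilities.
alpha = forward(
    init=[1.0, 0.0],
    trans=[[0.5, 0.5], [0.0, 1.0]],
    obs_prob=[[0.9, 0.1], [0.2, 0.8]],
)
```

For real utterances the products underflow quickly, so practical implementations usually work with scaled or log-domain values; that refinement is omitted here.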
[0009]
The backward probability β in the forward-backward method is given by (Equation 2.1) and (Equation 2.2) below.
[0010]
[Equation 2]

[0011]
The backward probability β(t, i) is the probability of being in state S_i at frame t and subsequently observing feature O_{t+1} at frame (t+1). Frame T denotes the frame of the final state.
[0012]
The correspondence probability γ in the forward-backward method is calculated from the forward probability α and the backward probability β, and is given by (Equation 3.1) below.
[0013]
[Equation 3]

[0014]
The correspondence probability γ(t, j, k) is the probability that, on a transition to state S_j at frame t, feature O_t is observed in the k-th mixture component of state S_j. N(O_t, μ_jk, U_jk) is the probability distribution of the k-th mixture component of state S_j, with modeled feature O_t, mean vector μ_jk, and covariance matrix U_jk. c_jk is the mixture weight coefficient for N(O_t, μ_jk, U_jk).
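As an illustration of how γ(t, j, k) combines α, β, and the mixture terms, the sketch below computes it for one frame and one state. It assumes the common factorization into a state posterior times a component posterior; the function name `correspondence` and the convention of passing the weighted component likelihoods c_jk · N(O_t, μ_jk, U_jk) already evaluated are assumptions.

```python
def correspondence(alpha_t, beta_t, comp_like_j, j):
    """Return gamma(t, j, k) for every component k of state j at one frame.

    alpha_t[i], beta_t[i] -- forward and backward probabilities at this frame
    comp_like_j[k]        -- c_jk * N(O_t, mu_jk, U_jk), already evaluated
    """
    # Posterior probability of being in state j at this frame.
    norm = sum(a * b for a, b in zip(alpha_t, beta_t))
    state_post = alpha_t[j] * beta_t[j] / norm
    # Split the state posterior across the mixture components.
    total = sum(comp_like_j)
    return [state_post * v / total for v in comp_like_j]


gamma = correspondence(
    alpha_t=[0.09, 0.36], beta_t=[1.0, 1.0], comp_like_j=[0.6, 0.2], j=1
)
```

The values for the components of one state sum to that state's posterior, here 0.36 / 0.45 = 0.8.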
[0015]
The averages of the parameters of the k-th mixture component of state S_j in the HMM, namely the mixture weight c_jk, the mean vector μ(t, j, k), and the covariance matrix U(j, k), are calculated by (Equation 4.1), (Equation 4.2), and (Equation 4.3) below.
[0016]
[Equation 4]

[0017]
Here, the mixture weight c_jk is the mixture weight of the k-th mixture component of state S_j in the HMM. The mean vector μ(t, j, k) is the mean vector of the k-th mixture component of state S_j in the HMM. The covariance matrix U(j, k) is the covariance matrix of the k-th mixture component of state S_j in the HMM. V_k denotes a given character in the character string V. (O_t − μ_jk)' denotes the transpose of the vector (O_t − μ_jk).
[0018]
In the speaker adaptation method for acoustic models disclosed in Japanese Unexamined Patent Application Publication No. H5-232989 (hereinafter, Conventional Example 2), only the weight coefficients that determine the mixing ratio of each of the probability distributions making up the continuous mixture probability distribution used in the HMM are re-estimated.
[0019]
The hidden Markov model calculation method disclosed in Japanese Unexamined Patent Application Publication No. H10-11086 (hereinafter, Conventional Example 3) describes an HMM calculation method using the forward-backward method.
[0020]
[Problems to Be Solved by the Invention]
In general, training a highly reliable probability model requires a large amount of speech data. In particular, a speaker-independent acoustic model must absorb the variation in speech caused by individual differences among speakers, and therefore requires a large amount of speech data from speakers' utterances. When a large number of speech samples are collected, however, mispronounced utterances or low-quality speech may be mixed in.
[0021]
Estimating a probability model (acoustic model) also raises the following problem. Speech data normally has to be collected from natural utterances, so the content spoken consists of words that actually exist. The distribution of the phonemes (characters) that make up real words is inevitably skewed; in Japanese, for example, vowels, and "a" in particular, occur very frequently. When a probability model is estimated, the reliability of each probability distribution varies with the number of samples used to estimate it. Consequently, when the phonemes making up words are used as speech data for building an acoustic model, the skew in phoneme frequency has to be corrected.
[0022]
The present invention has been made in view of the above problems. In addition to the weights attached to each HMM mixture component in Conventional Example 1, Conventional Example 2, and the combination of Conventional Examples 2 and 3, it attaches to each frame of a speech sample a weight coefficient set according to the characteristics of the collected speech samples. Its object is to provide an acoustic model learning apparatus that thereby amplifies or removes specific speech samples, or specific parts of speech samples, during acoustic model training, corrects the skew in the frequency of the phonemes making up the speech samples, and provides a highly reliable acoustic model.
[0023]
[Means for Solving the Problems]
To achieve this object, the present invention has the following features. An acoustic model learning apparatus according to the present invention comprises:
speech analysis means for extracting a feature for each frame from input training speech;
dictionary generation means for generating a training dictionary from an input acoustic model and a correct-answer string, where the input acoustic model uses probability distributions of the features extracted frame by frame from predetermined speech to represent each frame-wise fragment of that speech as a state and has those states as its constituent units, the correct-answer string is character string information indicating the content of the training speech, and the training dictionary is information on the state sequence obtained by assigning the correct-answer string to the states of the input acoustic model;
correspondence probability calculation means for referring to the training dictionary generated by the dictionary generation means and calculating, for each frame of the training speech, the correspondence probability between the feature of the training speech and a state of the input acoustic model;
first maximum likelihood state sequence generation means for referring to the training dictionary, assigning the states represented by the input acoustic model, or state sequences made up of a plurality of those states, to each frame of the training speech with maximum likelihood, and generating a first maximum likelihood state sequence;
second maximum likelihood state sequence generation means for referring to a dictionary representing arbitrary characters, assigning the states represented by the input acoustic model, or state sequences made up of a plurality of those states, to each frame of the training speech with maximum likelihood, and generating a second maximum likelihood state sequence;
weight calculation means for comparing the first maximum likelihood state sequence with the second maximum likelihood state sequence and, based on the comparison result, calculating for each frame of the training speech a weight coefficient, which is a coefficient applied when weighting the correspondence probability; and
re-evaluation means for calculating statistics based on the correspondence probability calculated by the correspondence probability calculation means, the weight coefficient calculated by the weight calculation means, and the feature calculated by the speech analysis means, re-estimating the parameters of the input acoustic model based on the calculated statistics, and creating an output acoustic model.
[0024]
An acoustic model learning method according to the present invention comprises:
a speech analysis step of extracting a feature for each frame from input training speech;
a dictionary generation step of generating a training dictionary from an input acoustic model and a correct-answer string, where the input acoustic model uses probability distributions of the features extracted frame by frame from predetermined speech to represent each frame-wise fragment of that speech as a state and has those states as its constituent units, the correct-answer string is character string information indicating the content of the training speech, and the training dictionary is information on the state sequence obtained by assigning the correct-answer string to the states of the input acoustic model;
a correspondence probability calculation step of referring to the training dictionary generated in the dictionary generation step and calculating, for each frame of the training speech, the correspondence probability between the feature of the training speech and a state of the input acoustic model;
a first maximum likelihood state sequence generation step of referring to the training dictionary, assigning the states represented by the input acoustic model, or state sequences made up of a plurality of those states, to each frame of the training speech with maximum likelihood, and generating a first maximum likelihood state sequence;
a second maximum likelihood state sequence generation step of referring to a dictionary representing arbitrary characters, assigning the states represented by the input acoustic model, or state sequences made up of a plurality of those states, to each frame of the training speech with maximum likelihood, and generating a second maximum likelihood state sequence;
a weight calculation step of comparing the first maximum likelihood state sequence with the second maximum likelihood state sequence and, based on the comparison result, calculating for each frame of the training speech a weight coefficient, which is a coefficient applied when weighting the correspondence probability; and
a re-evaluation step of calculating statistics based on the correspondence probability calculated in the correspondence probability calculation step, the weight coefficient calculated in the weight calculation step, and the feature calculated in the speech analysis step, re-estimating the parameters of the input acoustic model based on the calculated statistics, and creating an output acoustic model.
[0025]
The program according to the present invention causes a computer to execute:
A speech analysis process of extracting a feature amount for each frame from input learning speech;
A dictionary generation process of generating a learning dictionary based on an input acoustic model and a correct answer sequence, where the input acoustic model expresses each sound fragment, obtained by dividing a predetermined sound frame by frame, as a state by means of a probability distribution over the feature amounts extracted from that sound for each frame, and takes the state as its structural unit, the correct answer sequence is character string information indicating the content of the learning speech, and the learning dictionary is information on a state sequence in which the correct answer sequence is assigned to the states of the input acoustic model;
A correspondence probability calculation process of referring to the learning dictionary generated by the dictionary generation process and calculating, for each frame of the learning speech, a correspondence probability between the feature amount of the learning speech and the states of the input acoustic model;
A first maximum likelihood state sequence generation process of referring to the learning dictionary and generating a first maximum likelihood state sequence by assigning, with maximum likelihood, a state represented by the input acoustic model, or a state sequence composed of a plurality of such states, to each frame of the learning speech;
A second maximum likelihood state sequence generation process of referring to a dictionary representing an arbitrary character string and generating a second maximum likelihood state sequence by assigning, with maximum likelihood, a state represented by the input acoustic model, or a state sequence composed of a plurality of such states, to each frame of the learning speech;
A weight calculation process of calculating, for each frame of the learning speech, a weight coefficient used to weight the correspondence probability, based on the result of comparing the first maximum likelihood state sequence with the second maximum likelihood state sequence; and
A re-estimation process of calculating statistics based on the correspondence probability calculated by the correspondence probability calculation process, the weight coefficient calculated by the weight calculation process, and the feature amount extracted by the speech analysis process, re-estimating the parameters of the input acoustic model based on the calculated statistics, and creating an output acoustic model.
[0041]
DETAILED DESCRIPTION OF THE INVENTION
(First embodiment)
FIG. 1 is a diagram illustrating the configuration of an acoustic model learning apparatus according to the first embodiment of the present invention. Hereinafter, the configuration of the acoustic model learning apparatus according to the present embodiment will be described with reference to FIG. 1. In this embodiment, an HMM based on a continuous mixture probability distribution is used as the acoustic model. This acoustic model expresses each sound fragment, obtained by dividing a predetermined sound frame by frame, as a state by means of a probability distribution over the feature amounts extracted from that sound for each frame, and takes the state as its structural unit.
[0042]
The acoustic model learning apparatus includes a speech analysis unit 101, a dictionary unit 102, a forward / backward calculation unit 103, a reevaluation unit 104, a Viterbi calculation unit 105, and a weight calculation unit 106. Hereinafter, each part of the acoustic model learning apparatus will be described with reference to FIG. 1.
[0043]
Learning speech, which is the speech information used for training the acoustic model, is input to the speech analysis unit 101. The same learning speech is also input to the Viterbi calculation unit 105.
[0044]
The speech analysis unit 101 divides the input learning speech into sections of a predetermined length, treats each section as a “frame”, and performs frequency analysis of the learning speech for each frame. The (acoustic) feature amount of the learning speech extracted for each frame as a result of this analysis is input to the forward / backward calculation unit 103 and the reevaluation unit 104. As the feature amount, the speech power, the amount of change in power, the cepstrum, the amount of change in the cepstrum, or the like may be used.
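The framing and feature extraction described above can be illustrated with a minimal sketch. It is not the analysis used by the speech analysis unit 101: the function name `frame_features`, the frame length, and the frame shift are hypothetical, and only two of the listed feature amounts (log power and its frame-to-frame change) are computed.

```python
import math

def frame_features(signal, frame_len=400, hop=160):
    """Split a list of samples into frames; return (log power, power change) per frame.

    frame_len and hop are illustrative values, not taken from the patent.
    """
    if len(signal) < frame_len:
        raise ValueError("signal is shorter than one frame")
    n_frames = 1 + (len(signal) - frame_len) // hop
    feats = []
    prev = None
    for i in range(n_frames):
        frame = signal[i * hop : i * hop + frame_len]
        # Log of the mean squared amplitude; the small constant avoids log(0).
        power = math.log(sum(x * x for x in frame) / frame_len + 1e-10)
        # Change in power relative to the previous frame (0.0 for the first frame).
        delta = 0.0 if prev is None else power - prev
        feats.append((power, delta))
        prev = power
    return feats
```

Each returned tuple plays the role of the feature amount O_t of one frame; a cepstrum-based feature would replace the power computation but keep the same framing.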
[0045]
The dictionary unit 102 receives an acoustic model and a correct answer sequence. The correct answer sequence may be character string information input by a predetermined input means (not shown). The predetermined input means inputs, to the dictionary unit 102 as the correct answer sequence, character information indicating the content of the learning speech input to the speech analysis unit 101 and the Viterbi calculation unit 105.
[0046]
The dictionary unit 102 creates and stores a learning dictionary based on a subword model from the acoustic model that has been input (hereinafter referred to as the input acoustic model) and the input correct answer sequence. The learning dictionary based on the subword model is information on the state sequence obtained by dividing the input correct answer sequence (for example, an actually existing word) into phoneme or syllable units (subword units). In addition to the learning dictionary, the dictionary unit 102 stores in advance a “dictionary representing an arbitrary character string”, which is information on arbitrary character strings.
[0047]
The forward / backward calculation unit 103 refers to the learning dictionary stored in the dictionary unit 102 and, based on the feature amount of the learning speech extracted by the speech analysis unit 101 and the input acoustic model, calculates the forward probability and the backward probability by the forward-backward method. Further, the forward / backward calculation unit 103 calculates a correspondence probability between the feature amount of the learning speech and the states of the input acoustic model based on the calculated forward probability and backward probability. The forward / backward calculation unit 103 outputs the calculated correspondence probability to the reevaluation unit 104.
[0048]
Letting O_t (t is an integer from 1 to T) denote the feature amount for each frame t obtained from the input learning speech, the forward / backward calculation unit 103 calculates the forward probability α based on (Equation 1.1) and (Equation 1.2). Further, the forward / backward calculation unit 103 calculates the backward probability β based on (Equation 2.1) and (Equation 2.2).
[0049]
Further, the forward / backward calculation unit 103 uses the calculated forward probability α and backward probability β to calculate the correspondence probability γ based on the equation represented by (Expression 3.1).
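As an illustration of this calculation, the following is a minimal sketch of the textbook forward-backward recursions. It assumes the standard forms of the recursions, which may differ in detail from (Equation 1.1) through (Equation 3.1); it computes a state-level correspondence probability only (no mixture element index k) and applies no scaling, so it is suitable only for short sequences. The names `pi`, `A`, and `B` are hypothetical: `pi[i]` is an initial state probability, `A[i][j]` is the transition probability a_ij, and `B[t][i]` is the observation probability of O_t in state S_i.

```python
def forward_backward(pi, A, B):
    """Return (alpha, beta, gamma), each a T x N list of lists."""
    T, N = len(B), len(pi)
    alpha = [[0.0] * N for _ in range(T)]
    beta = [[1.0] * N for _ in range(T)]  # backward probability is 1 at the last frame
    for i in range(N):
        alpha[0][i] = pi[i] * B[0][i]
    # Forward recursion: sum over predecessor states, then observe frame t.
    for t in range(1, T):
        for j in range(N):
            alpha[t][j] = sum(alpha[t - 1][i] * A[i][j] for i in range(N)) * B[t][j]
    # Backward recursion: sum over successor states and their observations.
    for t in range(T - 2, -1, -1):
        for i in range(N):
            beta[t][i] = sum(A[i][j] * B[t + 1][j] * beta[t + 1][j] for j in range(N))
    # Correspondence probability: normalized product of alpha and beta per frame.
    gamma = []
    for t in range(T):
        joint = [alpha[t][i] * beta[t][i] for i in range(N)]
        total = sum(joint)
        gamma.append([x / total for x in joint])
    return alpha, beta, gamma
```

A practical implementation would add per-frame scaling or work in the log domain to avoid underflow on long utterances.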
[0050]
The Viterbi calculation unit 105 receives the same learning speech as the speech analysis unit 101. In addition, the input acoustic model is input to the Viterbi calculation unit 105 via the dictionary unit 102.
[0051]
The Viterbi calculation unit 105 divides the input learning speech at predetermined time intervals (frames). Next, the Viterbi calculation unit 105 refers to predetermined character information, assigns to each frame a state of the input acoustic model or a state sequence composed of a plurality of states, and performs Viterbi matching to create a predetermined maximum likelihood state sequence.
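The Viterbi matching described above can be sketched as follows. This is the generic Viterbi algorithm, not the specific procedure of the Viterbi calculation unit 105; the argument names are hypothetical, and the constraint imposed by the referenced character information is assumed to be encoded in which transitions `A[i][j]` are non-zero.

```python
import math

def viterbi(pi, A, B):
    """Return the maximum likelihood state index for each frame.

    pi[i]: initial probability, A[i][j]: transition probability,
    B[t][i]: observation probability of frame t in state i.
    """
    def log(p):
        # Zero probabilities become -inf so forbidden paths never win.
        return math.log(p) if p > 0 else float("-inf")

    T, N = len(B), len(pi)
    score = [log(pi[i]) + log(B[0][i]) for i in range(N)]
    back = []  # back[t-1][j]: best predecessor of state j at frame t
    for t in range(1, T):
        new_score, ptr = [], []
        for j in range(N):
            best_i = max(range(N), key=lambda i: score[i] + log(A[i][j]))
            ptr.append(best_i)
            new_score.append(score[best_i] + log(A[best_i][j]) + log(B[t][j]))
        score = new_score
        back.append(ptr)
    # Trace back from the best final state.
    state = max(range(N), key=lambda i: score[i])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return path[::-1]
```

Running the same function twice, once with transitions restricted by the learning dictionary and once with transitions allowed by a dictionary of arbitrary character strings, would yield two state sequences of the kind that are compared frame by frame.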
[0052]
The weight calculation unit 106 calculates the weight coefficient R_t based on the plural kinds of maximum likelihood state sequences that the Viterbi calculation unit 105 creates with reference to plural kinds of predetermined character information.
[0053]
The reevaluation unit 104 calculates the statistics of each state of the acoustic model (the respective averages of the mixture weight, the mean vector, and the covariance matrix) based on the weight coefficient R_t calculated by the weight calculation unit 106, the correspondence probability calculated by the forward / backward calculation unit 103, the feature amount extracted by the speech analysis unit 101, and the input acoustic model input via the forward / backward calculation unit 103. The reevaluation unit 104 re-estimates each parameter (the respective averages of the mixture weight, the mean vector, and the covariance matrix) of the input acoustic model based on the calculated statistics. The reevaluation unit 104 creates an acoustic model from the re-estimated parameters of the input acoustic model and outputs the created acoustic model as the output acoustic model.
[0054]
The reevaluation unit 104 weights the correspondence probability γ by the weighting factor R_t. Using the weighted correspondence probability γ · R_t, the reevaluation unit 104 calculates, as statistics, the respective averages of the mixture weight c_jk, the mean vector μ (t, j, k), and the covariance matrix U (j, k). These statistics are given by (Equation 5.1), (Equation 5.2), and (Equation 5.3) shown below.
[0055]
[Equation 5]

[0056]
Note that the mixture weight c_jk is the mixture weight of the k-th mixture distribution element of the state S_j in the HMM. The mean vector μ (t, j, k) is the mean vector of the k-th mixture distribution element of the state S_j in the HMM. Further, the covariance matrix U (j, k) is the covariance matrix of the k-th mixture distribution element of the state S_j in the HMM. Also, V_k indicates a predetermined character in the character string V. Also, (O_t − μ_jk)' is the transpose of the vector (O_t − μ_jk).
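Since (Equation 5.1) through (Equation 5.3) are not reproduced in this text, the following is only a sketch of what weighted re-estimation could look like, assuming the standard Baum-Welch updates with the correspondence probability γ replaced by γ · R_t. For brevity it handles a single state with scalar feature amounts, so the covariance matrix reduces to a variance; the function name and argument layout are hypothetical, and the role of V_k is not modeled.

```python
def weighted_reestimate(gamma, R, O):
    """Weighted statistics for one state with K mixture elements.

    gamma[t][k]: correspondence probability of frame t with element k,
    R[t]: weighting factor of frame t, O[t]: scalar feature of frame t.
    Returns (c, mu, var), each a list of length K.
    """
    T, K = len(O), len(gamma[0])
    # Weighted correspondence probability gamma * R_t.
    w = [[gamma[t][k] * R[t] for k in range(K)] for t in range(T)]
    occ = [sum(w[t][k] for t in range(T)) for k in range(K)]
    if any(o == 0 for o in occ):
        raise ValueError("a mixture element received zero weighted occupancy")
    total = sum(occ)
    c = [o / total for o in occ]  # mixture weights
    mu = [sum(w[t][k] * O[t] for t in range(T)) / occ[k] for k in range(K)]
    var = [sum(w[t][k] * (O[t] - mu[k]) ** 2 for t in range(T)) / occ[k]
           for k in range(K)]
    return c, mu, var
```

With `R = [0, 1, 1]`, the first frame contributes nothing to any of the three statistics, which is the effect the weighting is intended to have on frames judged unreliable.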
[0057]
FIG. 2 is a diagram illustrating phoneme sets that can be represented by the input acoustic model according to the first embodiment of the present invention. FIG. 3 is a diagram showing a learning dictionary created by the acoustic model learning device according to the first embodiment of the present invention. FIG. 4 is a diagram showing the weighting factor R_t in the first embodiment of the present invention. FIG. 9 is a flowchart showing an operation flow of the acoustic model learning apparatus according to the first embodiment of the present invention. Hereinafter, the operation of the acoustic model learning apparatus according to the present embodiment will be described with reference to these figures.
[0058]
In the present embodiment, as an example of the learning voice, the utterance of “Kato Kontaro” by a predetermined speaker is used. In the present embodiment, it is assumed that an acoustic model that recognizes the utterance of “Kato Kontaro” by the predetermined speaker as “Sato Kontaro” is given as an input acoustic model (initial model).
[0059]
In the HMM, the voice length corresponding to one state is variable, and the most likely state sequence in the HMM is obtained by using Viterbi matching or the like. However, in this embodiment, for the sake of simplicity, it is assumed that the input sound is 14-frame sound, and one state is assigned to each frame.
[0060]
First, a predetermined control means (not shown) determines whether or not learning speech has been input to the speech analysis unit 101 (step S901). If it is determined that the learning voice is not input to the voice analysis unit 101 (step S901 / No), the process of step S901 is repeated.
[0061]
When it is determined that the learning speech has been input to the speech analysis unit 101 (step S901 / Yes), the speech analysis unit 101 performs frequency analysis of the learning speech for each frame and extracts the feature amount of the learning speech based on the result of the analysis (step S902). The extracted feature amount of the learning speech is output to the forward / backward calculation unit 103 and the reevaluation unit 104.
[0062]
Next, the predetermined control means determines whether or not the correct answer sequence and the input acoustic model are input to the dictionary unit 102 (step S903). When it is determined that the correct answer sequence and the input acoustic model are not input (step S903 / No), the process of step S903 is repeated.
[0063]
When it is determined that the correct answer sequence and the input acoustic model are input to the dictionary unit 102 (step S903 / Yes), the dictionary unit 102 creates a learning dictionary based on the input correct answer sequence and the input acoustic model, The created learning dictionary is stored (step S904).
[0064]
Here, the process in which the dictionary unit 102 creates a learning dictionary will be described with reference to FIGS. 2 and 3. FIG. 2 shows the phoneme set that the input acoustic model in the present embodiment can represent. This phoneme set is included in the input acoustic model. Using this phoneme set, the dictionary unit 102 divides the learning speech “Kato Kontaro” into the phonemes “k-a-t-o-u-k-o-ng-t-a-r-o-u”. The divided phonemes are associated with the states S_i (i is an integer from 1 to 13), and a state sequence as shown in FIG. 3, that is, a learning dictionary corresponding to the learning speech, is created. The dictionary unit 102 stores the created learning dictionary.
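The assignment of phonemes to states can be pictured with a small sketch. The helper name `build_learning_dictionary` is hypothetical, and the 13-phoneme segmentation used in the usage note is an assumption chosen to match the 13 states S_1 to S_13 referred to in this embodiment; the actual content of FIG. 3 is not reproduced here.

```python
def build_learning_dictionary(phonemes):
    """Map a phoneme sequence to numbered states, one state per phoneme."""
    return [(f"S_{i}", p) for i, p in enumerate(phonemes, start=1)]
```

For example, calling it with `["k", "a", "t", "o", "u", "k", "o", "ng", "t", "a", "r", "o", "u"]` yields 13 pairs in which S_1 to S_5 carry the phonemes k, a, t, o, and u of the “Kato” part.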
[0065]
After creating the learning dictionary by the dictionary unit 102, the forward / backward calculation unit 103 refers to the learning dictionary created by the dictionary unit 102, and based on the feature amount extracted by the speech analysis unit 101, the forward probability and A backward probability is calculated (step S905).
[0066]
Next, the forward / backward calculation unit 103 calculates a correspondence probability based on the calculated forward probability and backward probability (step S906).
[0067]
The predetermined control means determines whether or not the same learning speech as that input to the speech analysis unit 101 has been input to the Viterbi calculation unit 105. Further, the predetermined control means determines whether or not the input acoustic model has been input to the Viterbi calculation unit 105 via the dictionary unit 102 (step S907). When it is determined that the learning speech and the input acoustic model have not been input to the Viterbi calculation unit 105 (step S907 / No), the process of step S907 is repeated.
[0068]
When it is determined that the learning speech and the input acoustic model have been input to the Viterbi calculation unit 105 (step S907 / Yes), the Viterbi calculation unit 105 uses the input learning speech and the input acoustic model and, referring to the learning dictionary created by the dictionary unit 102, generates a maximum likelihood state sequence by Viterbi matching (step S908). The maximum likelihood state sequence generated with reference to the learning dictionary is defined as the first maximum likelihood state sequence.
[0069]
Further, the Viterbi calculation unit 105 generates a maximum likelihood state sequence by Viterbi matching, using the input learning speech and the input acoustic model and referring to the dictionary representing an arbitrary character string stored in the dictionary unit 102 (step S909). The maximum likelihood state sequence generated with reference to the dictionary representing an arbitrary character string is defined as the second maximum likelihood state sequence.
[0070]
Next, the weight calculation unit 106 compares each state of the first maximum likelihood state sequence generated by the Viterbi calculation unit 105 with each state of the second maximum likelihood state sequence, and calculates the weighting factor R_t given by (Equation 6.1) and (Equation 6.2) shown below (step S910). The weighting factor R_t is calculated for each frame of the learning speech.
[0071]
[Formula 6]

[0072]
If a speaker's misspoken utterance or low-quality speech is used as learning speech, a difference may arise between the input correct answer sequence and the learning speech as recognized by the input acoustic model. In that case, it is highly likely that a speech sample of a predetermined linguistic unit (for example, a phoneme or a syllable) in the learning speech has been erroneously recognized by the acoustic model. Preventing the erroneously recognized speech sample from being strongly reflected in the output acoustic model yields a highly reliable output acoustic model.
[0073]
The weight calculation unit 106 compares, for each frame, the state in the first maximum likelihood state sequence with the state in the second maximum likelihood state sequence, and calculates the weighting factor R_t based on (Equation 6.1) and (Equation 6.2) above.
[0074]
(Equation 6.1) gives the weighting factor R_t for the case where a difference occurs between the first maximum likelihood state sequence and the second maximum likelihood state sequence in a given frame. In this case, the weighting factor R_t is calculated as “0”.
[0075]
(Equation 6.2) gives the weighting factor R_t for the case where the first maximum likelihood state sequence matches the second maximum likelihood state sequence in a given frame. In this case, the weighting factor R_t is calculated as “1”.
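The two cases described for (Equation 6.1) and (Equation 6.2) can be sketched as a frame-by-frame comparison. The function name `frame_weights` and its parameters are hypothetical; the default values 0 and 1 follow the text of this embodiment.

```python
def frame_weights(first_seq, second_seq, mismatch=0.0, match=1.0):
    """Return one weighting factor per frame from two state sequences.

    A frame gets `mismatch` where the two sequences differ and `match`
    where they agree.
    """
    if len(first_seq) != len(second_seq):
        raise ValueError("state sequences must cover the same frames")
    return [match if a == b else mismatch for a, b in zip(first_seq, second_seq)]
```

For instance, if the first sequence begins with the state for “k” and the second with a state for “s” (a hypothetical label for the misrecognized first phoneme), while the remaining twelve states agree, the result is 0 for the first frame and 1 for all others.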
[0076]
When the quality of the learning speech is degraded by a speaker's misspoken utterance or the like, a difference arises, in the frames corresponding to the degraded portion, between the state of the first maximum likelihood state sequence and the state of the second maximum likelihood state sequence. Therefore, to obtain a highly reliable output acoustic model, the portion where this difference has occurred must be kept from being reflected in the output acoustic model.
[0077]
In the present embodiment, the weighting factor R_t of the high-quality part of the learning speech (where the states of the first and second maximum likelihood state sequences match in a given frame) is set to “1”, and the weighting factor R_t of the low-quality part is set to “0”, a value lower than that of the high-quality part. As a result, the low-quality part of the learning speech, that is, the part in which the learning speech is erroneously recognized by the input acoustic model, is not reflected in the output acoustic model.
[0078]
With the input acoustic model of the present embodiment, the learning speech “Kato Kontaro” is recognized as “Sato Kontaro”. In this case, it is unclear how the “ka” part was actually uttered, but the phoneme “k” in the “ka” part is erroneously recognized by the input acoustic model. To create an output acoustic model that recognizes the phoneme “k” correctly, the phoneme “k” in the “ka” part must be kept from being reflected in the output acoustic model.
[0079]
FIG. 4 shows the weighting factor R_t given to the learning dictionary of FIG. 3 by (Equation 6.1) and (Equation 6.2). R_t (t = 1 to 13) is the weighting factor for S_i (i = 1 to 13). As shown in FIG. 4, by setting the weighting factor R_1 of the phoneme “k” (= S_1) to “0” and the weighting factors R_2 to R_13 of the other phonemes (S_2 to S_13) to “1”, the phoneme “k” in the “ka” part can be prevented from being reflected in the output acoustic model, and a highly reliable acoustic model can be created.
[0080]
In this embodiment, the weighting factor R_1 is set to “0” so that the phoneme “k” in the “ka” part is not reflected in the output acoustic model; however, by setting the weighting factor R_1 to “any value between 0 and 1”, the influence of the phoneme “k” in the “ka” part on the output acoustic model can be adjusted.
[0081]
Hereinafter, the operation of the acoustic model learning apparatus will be described again along the flowchart of FIG. 9. The reevaluation unit 104 calculates each statistic of the acoustic model (the respective averages of the mixture weight, the mean vector, and the covariance matrix) based on the weighting factor R_t calculated by the weight calculation unit 106, the feature amount extracted by the speech analysis unit 101, and the correspondence probability calculated by the forward / backward calculation unit 103 (step S911).
[0082]
After calculating each statistic of the acoustic model, the reevaluation unit 104 re-estimates, based on the calculated statistics, each parameter (the respective averages of the mixture weight, the mean vector, and the covariance matrix) of the input acoustic model input via the forward / backward calculation unit 103, and creates an output acoustic model (step S912). The created output acoustic model is output from the reevaluation unit 104 (step S913). After the output acoustic model is output, the acoustic model learning device ends the operation.
[0083]
(Second Embodiment)
Unless otherwise specified, the configuration and operation of the acoustic model learning device according to the second embodiment of the present invention are assumed to be the same as the configuration and operation of the acoustic model learning device according to the first embodiment of the present invention.
[0084]
In general, when a noisy environment degrades the quality of the learning speech, misrecognition of the learning speech affects not only a single phoneme but also the phonemes surrounding it. In the first embodiment, the weighting factor R_t is set for each phoneme; however, when misrecognition extends across multiple phonemes because of environmental noise or the like, weighting in syllable units makes it possible to create a more reliable output acoustic model.
[0085]
FIG. 5 is a diagram showing the weighting factor R_t in the second embodiment of the present invention. As in the first embodiment, the weighting factor R_t (t = 1 to 13) corresponds to the state S_i (i = 1 to 13) in FIG. 3.
[0086]
In the first embodiment, the weighting factor R_1 of the phoneme “k” (= S_1) was set to “0”. In the present embodiment, the quality is degraded in the syllable “ka” of the learning speech “Kato Kontaro”, and a difference occurs between the first maximum likelihood state sequence and the second maximum likelihood state sequence. When the quality of the learning speech is degraded in syllable units in this way, setting both the weighting factor R_1 of the phoneme “k” (= S_1) and the weighting factor R_2 of the phoneme “a” (= S_2) to “0” makes it possible to create a more reliable output acoustic model than setting only the weighting factor R_1 of the phoneme “k” (= S_1) to “0”.
[0087]
In this embodiment, the weighting factors R_1 and R_2 are set to “0” so that the phoneme “k” and the phoneme “a” in the “ka” part are not reflected in the output acoustic model; however, by setting the weighting factors R_1 and R_2 to “any value between 0 and 1”, it is possible to adjust the influence of the phoneme “k” and the phoneme “a” in the “ka” part on the output acoustic model.
[0088]
(Third embodiment)
Hereinafter, unless otherwise specified, the configuration and operation of the acoustic model learning device according to the third embodiment of the present invention are assumed to be the same as the configuration and operation of the acoustic model learning device according to the first embodiment of the present invention.
[0089]
The second embodiment described the case in which misrecognition of the learning speech caused by a noisy environment affects not only a single phoneme but also its surrounding phonemes. In the second embodiment, the weighting factor R_t is set for each syllable; however, when environmental noise or the like causes misrecognition over a range of phonemes wider than a syllable, extending the range of weighted phonemes beyond the syllable to the word unit makes it possible to create a more reliable output acoustic model.
[0090]
FIG. 6 is a diagram showing the weighting factor R_t in the third embodiment of the present invention. As in the first embodiment, the weighting factor R_t (t = 1 to 13) corresponds to the state S_i (i = 1 to 13) in FIG. 3.
[0091]
In the first embodiment, the weighting factor R_1 of the phoneme “k” (= S_1) was set to “0”. In the second embodiment, the weighting factor R_1 of the phoneme “k” (= S_1) and the weighting factor R_2 of the phoneme “a” (= S_2) were each set to “0”. In the present embodiment, the quality is degraded in the word “Kato (ka-tou)” of the learning speech “Kato Kontaro”, and a difference occurs between the first maximum likelihood state sequence and the second maximum likelihood state sequence. When the quality of the learning speech is degraded in word units in this way, setting to “0” the weighting factors R_1 to R_5 corresponding respectively to the phoneme “k” (= S_1), the phoneme “a” (= S_2), the phoneme “t” (= S_3), the phoneme “o” (= S_4), and the phoneme “u” (= S_5) of the word “Kato (ka-tou)” makes it possible to create a more reliable output acoustic model than setting the weighting factor R_t to “0” in phoneme units or syllable units.
[0092]
In this embodiment, the weighting factors R_1 to R_5 are set to “0” so that the phonemes “katou” of the “Kato” part are not reflected in the output acoustic model; however, setting the weighting factors R_1 to R_5 to an arbitrary value of 0 or more and less than 1 makes it possible to adjust the influence of the phonemes “katou” of the “Kato” part on the output acoustic model.
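The phoneme-unit, syllable-unit, and word-unit weighting of the first three embodiments differ only in how far a detected mismatch is spread. A minimal sketch under that reading is given below; the function name `unit_weights` and the representation of units as (start, end) index pairs with an exclusive end are assumptions made for illustration.

```python
def unit_weights(mismatch, units, low=0.0, high=1.0):
    """Spread each mismatch over the whole unit that contains it.

    mismatch[t]: True where the two maximum likelihood state sequences differ.
    units: list of (start, end) index ranges (end exclusive), e.g. syllables or words.
    """
    weights = [low if m else high for m in mismatch]
    for start, end in units:
        if start < 0 or end > len(mismatch) or start >= end:
            raise ValueError("unit range is outside the sequence")
        # If any position in the unit mismatches, down-weight the whole unit.
        if any(mismatch[start:end]):
            weights[start:end] = [low] * (end - start)
    return weights
```

With a mismatch only at the first of 13 positions, passing no units reproduces the phoneme-unit case, `units=[(0, 2)]` zeroes the syllable “ka”, and `units=[(0, 5)]` zeroes the word “Kato”. Choosing `low` between 0 and 1 corresponds to the adjustable influence mentioned above.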
[0093]
(Fourth embodiment)
Hereinafter, unless otherwise specified, it is assumed that the configuration and operation of the acoustic model learning device according to the fourth embodiment of the present invention are the same as the configuration and operation of the acoustic model learning device according to the first embodiment of the present invention.
[0094]
In the first to third embodiments described above, the weighting factor R_t of the portion where a difference occurs between the first maximum likelihood state sequence and the second maximum likelihood state sequence (the portion where the quality of the learning speech is degraded) is set to “0” so that the portion is not reflected in the output acoustic model. The acoustic model learning apparatus according to the present embodiment instead actively takes in, as variations of utterance, the portions of the learning speech where a misspoken utterance or low-quality speech occurs. By setting a higher weighting factor R_t for those portions than for the high-quality portions of the learning speech, it increases the number of samples of low-quality learning speech and improves the recognition performance for low-quality learning speech.
[0095]
FIG. 7 is a diagram showing the weighting factor R_t in the fourth embodiment of the present invention. The weighting factor R_t shown in FIG. 7 is given by (Equation 7.1) and (Equation 7.2) below.
[0096]
[Expression 7]

[0097]
In the present embodiment, as in the first embodiment, an acoustic model that recognizes the learning speech “Kato Kontaro” uttered by a predetermined speaker as “Sato Kontaro” is given as the input acoustic model. In the first embodiment, the weighting factor R_1 corresponding to the phoneme “k” (= S_1) is set to “0” so that the phoneme is not reflected in the output acoustic model, thereby creating a highly reliable output acoustic model. In the present embodiment, for the phoneme “k” (= S_1) of “ka”, where a difference has occurred between the first maximum likelihood state sequence and the second maximum likelihood state sequence, “weighting factor R_1 = 10” is set, which is higher than the “weighting factor R_t = 1 (t = 2 to 13)” set for the other phonemes, which match between the first maximum likelihood state sequence and the second maximum likelihood state sequence.
[0098]
By setting “weighting factor R_1 = 10” as above, the phoneme “k” (= S_1) of “ka”, regarded as a rare feature that has not been sufficiently learned, can be reflected more strongly in the output acoustic model than the other phonemes.
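(Equation 7.1) and (Equation 7.2) are not reproduced in this text. From the values stated for this embodiment, they can plausibly be written as follows. Here s_t^{(1)} and s_t^{(2)} are hypothetical symbols for the states that the first and second maximum likelihood state sequences assign to frame t, and the pairing of equation numbers with the two cases is an assumption made by analogy with (Equation 6.1) and (Equation 6.2).

```latex
R_t =
\begin{cases}
10 & \text{if } s_t^{(1)} \neq s_t^{(2)} \quad (7.1) \\
1  & \text{if } s_t^{(1)} = s_t^{(2)} \quad (7.2)
\end{cases}
```

In the example of this embodiment, only t = 1 falls under the first case, giving R_1 = 10 and R_t = 1 for t = 2 to 13.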
[0099]
In this embodiment, the weighting by the weighting factor R_t is performed in units of phonemes, but it may be performed in units of syllables as in the second embodiment, or in units of words as in the third embodiment.
[0100]
In the present embodiment, the weighting factor R_t corresponding to the phoneme in which a difference has occurred between the correct answer sequence and the learning speech recognized by the input acoustic model is set to “10”; however, this weighting factor R_t may take another value, as long as it is larger than the weighting factor of the phonemes that match between the correct answer sequence and the learning speech.
[0101]
(Fifth embodiment)
Unless otherwise specified, the configuration and operation of the acoustic model learning device according to the fifth embodiment of the present invention are assumed to be the same as the configuration and operation of the acoustic model learning device according to the first embodiment of the present invention.
[0102]
The reliability of a statistical model is greatly affected by the amount of speech samples (phonemes, syllables, or words) used for statistical model parameter learning. Therefore, in order to make the reliability in each acoustic model uniform, it is necessary to prevent a significant deviation from occurring in the amount of each audio sample input.
[0103]
In the present embodiment, the weighting factor R_t is set for each state in the first maximum likelihood state sequence so that the number of samples of each input speech sample of a predetermined linguistic unit (phoneme, syllable, word, etc.) is made uniform.
[0104]
FIG. 10 is a flowchart showing an operation flow of the acoustic model learning apparatus according to the fifth embodiment of the present invention. Hereinafter, the operation of the acoustic model learning apparatus according to the present embodiment will be described with reference to FIG. 10.
[0105]
In the present embodiment, as in the first embodiment, the utterance of “Kato Kontaro” by a predetermined speaker is used as an example of the learning voice.
[0106]
First, a predetermined control means (not shown) determines whether or not learning speech has been input to the speech analysis unit 101 (step S1001). When it is determined that the learning voice has not been input to the voice analysis unit 101 (step S1001 / No), the process of step S1001 is repeated.
[0107]
When it is determined that the learning speech has been input to the speech analysis unit 101 (step S1001 / Yes), the speech analysis unit 101 performs frequency analysis of the learning speech for each frame and extracts the feature amount of the learning speech based on the result of the analysis (step S1002). The extracted feature amount of the learning speech is output to the forward / backward calculation unit 103 and the reevaluation unit 104.
[0108]
Next, the predetermined control means determines whether or not the correct answer sequence and the input acoustic model are input to the dictionary unit 102 (step S1003). When it is determined that the correct answer sequence and the input acoustic model are not input (step S1003 / No), the process of step S1003 is repeated.
[0109]
When it is determined that the correct answer sequence and the input acoustic model have been input to the dictionary unit 102 (step S1003 / Yes), the dictionary unit 102 creates a learning dictionary based on the input correct answer sequence and the input acoustic model, and stores the created learning dictionary (step S1004).
[0110]
After the dictionary unit 102 creates the learning dictionary, the forward / backward calculation unit 103 refers to the learning dictionary created by the dictionary unit 102 and calculates the forward probability and the backward probability based on the feature amount extracted by the speech analysis unit 101 (step S1005).
[0111]
Next, the forward / backward calculation unit 103 calculates a correspondence probability based on the calculated forward probability and backward probability (step S1006).
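Steps S1005 and S1006 can be illustrated with a minimal sketch. It assumes a small HMM whose per-frame observation likelihoods are already computed, and it returns only the per-state occupancy; the per-mixture-component correspondence probability described in this document would further split each state's value across its components. The function name `state_occupancy`, the explicit initial probabilities, the backward initialization to 1 at the last frame, and the toy numbers are all assumptions made for illustration, not the apparatus's actual procedure.

```python
def state_occupancy(init, trans, obs_lik):
    """Return gamma[t][i], the posterior probability of state i at frame t.

    init[i]       : initial probability of state i (assumed input)
    trans[i][j]   : state transition probability a_ij
    obs_lik[t][i] : likelihood of the feature of frame t under state i
    """
    T, N = len(obs_lik), len(init)

    # Forward pass: alpha[t][i] = P(O_1..O_t, state i at frame t)
    alpha = [[0.0] * N for _ in range(T)]
    for i in range(N):
        alpha[0][i] = init[i] * obs_lik[0][i]
    for t in range(1, T):
        for j in range(N):
            alpha[t][j] = obs_lik[t][j] * sum(
                alpha[t - 1][i] * trans[i][j] for i in range(N))

    # Backward pass: beta[t][i] = P(O_{t+1}..O_T | state i at frame t)
    beta = [[1.0] * N for _ in range(T)]
    for t in range(T - 2, -1, -1):
        for i in range(N):
            beta[t][i] = sum(
                trans[i][j] * obs_lik[t + 1][j] * beta[t + 1][j]
                for j in range(N))

    # Combine both passes and normalize within each frame
    gamma = []
    for t in range(T):
        row = [alpha[t][i] * beta[t][i] for i in range(N)]
        total = sum(row)
        gamma.append([v / total for v in row])
    return gamma


# Toy two-state left-to-right model (all values are illustrative)
init = [1.0, 0.0]
trans = [[0.6, 0.4], [0.0, 1.0]]
obs_lik = [[0.9, 0.1], [0.5, 0.5], [0.1, 0.9]]
gamma = state_occupancy(init, trans, obs_lik)
```

Each row of `gamma` sums to 1, so it can be read as a soft assignment of one frame to the states. A practical implementation would normally work with scaled or logarithmic values to avoid underflow on long utterances.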
[0112]
The predetermined control means determines whether or not the same learning voice as that input to the voice analysis unit 101 has been input to the Viterbi calculation unit 105. Further, the predetermined control means determines whether or not the input acoustic model has been input to the Viterbi calculation unit 105 via the dictionary unit 102 (step S1007). When it is determined that the learning voice and the input acoustic model have not been input to the Viterbi calculation unit 105 (step S1007 / No), the process of step S1007 is repeated.
[0113]
When it is determined that the learning speech and the input acoustic model have been input to the Viterbi calculation unit 105 (step S1007 / Yes), the Viterbi calculation unit 105 uses the input learning speech and the input acoustic model to generate a maximum likelihood state sequence by Viterbi matching with reference to the learning dictionary created by the dictionary unit 102 (step S1008). The maximum likelihood state sequence generated with reference to the learning dictionary is defined as the first maximum likelihood state sequence.
[0114]
Next, the weight calculation unit 106 refers to each state of the first maximum likelihood state sequence generated by the Viterbi calculation unit 105 and calculates the weighting factor R_t according to (Equation 8.1), (Equation 9.1), (Equation 9.2), and (Equation 9.3) shown below (step S1009).
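The Viterbi matching of step S1008 can be sketched as follows. This is a generic log-domain Viterbi alignment, not the Viterbi calculation unit's actual implementation. It assumes that the constraint imposed by the learning dictionary is expressed through zero transition probabilities, and the names `viterbi_align` and `safe_log` as well as the toy model are hypothetical.

```python
import math


def safe_log(p):
    """Logarithm that maps probability 0 to negative infinity."""
    return math.log(p) if p > 0 else -math.inf


def viterbi_align(log_obs, log_trans, log_init):
    """Return the maximum likelihood state index for every frame.

    log_obs[t][i]   : log likelihood of frame t under state i
    log_trans[i][j] : log transition probability from state i to state j
    log_init[i]     : log initial probability of state i
    """
    T, N = len(log_obs), len(log_init)
    score = [[-math.inf] * N for _ in range(T)]
    back = [[0] * N for _ in range(T)]

    for i in range(N):
        score[0][i] = log_init[i] + log_obs[0][i]

    for t in range(1, T):
        for j in range(N):
            # Best predecessor of state j at frame t
            best_i = max(range(N),
                         key=lambda i: score[t - 1][i] + log_trans[i][j])
            score[t][j] = (score[t - 1][best_i] + log_trans[best_i][j]
                           + log_obs[t][j])
            back[t][j] = best_i

    # Trace back from the best final state
    state = max(range(N), key=lambda i: score[T - 1][i])
    path = [state]
    for t in range(T - 1, 0, -1):
        state = back[t][state]
        path.append(state)
    path.reverse()
    return path


# Toy two-state left-to-right model (all values are illustrative)
init = [1.0, 0.0]
trans = [[0.6, 0.4], [0.0, 1.0]]
obs = [[0.9, 0.1], [0.5, 0.5], [0.1, 0.9]]
path = viterbi_align(
    [[safe_log(p) for p in row] for row in obs],
    [[safe_log(p) for p in row] for row in trans],
    [safe_log(p) for p in init],
)
```

For this toy input the alignment assigns the first frame to state 0 and the remaining two frames to state 1. The returned list plays the role of a maximum likelihood state sequence, with one state per frame.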
[0115]
[Equation 8]

[0116]
[Equation 9]

[0117]
In the present embodiment, the weighting factors R_t are summed over the states to which the same speech sample (phoneme, syllable, or word unit) constituting the learning speech is assigned, and R_t is calculated so that this sum satisfies the condition given in (Equation 8.1) above. As a result, the influence of each speech sample on the output acoustic model becomes uniform.
[0118]
In the present embodiment, it is assumed that the learning dictionary shown in FIG. 3 is generated, as in the first embodiment of the present invention. FIG. 8 is a diagram showing the weighting factor R_t in the fifth embodiment of the present invention. The weighting factor R_t shown in FIG. 8 is set based on (Equation 9.1), (Equation 9.2), and (Equation 9.3) above. The weighting factors R_t (t = 1 to 13) in FIG. 8 correspond to the states S_i (i = 1 to 13) of the learning dictionary, respectively.
[0119]
In the present embodiment, the phonemes constituting the learning speech are examined in ascending order of their assigned frame number. The weighting factor R_t corresponding to a phoneme of a type observed for the first time is set to "1", and the weighting factor R_t corresponding to a phoneme of a type that has already been observed is set to "0".
[0120]
For example, the phoneme "k" of S₆ has already been observed at S₁, so the weighting factor R₆ is set to "0". On the other hand, the phoneme "r" of S₁₁ has not been observed in S₁ to S₁₀, so the weighting factor R₁₁ is set to "1".
[0121]
When the weighting factor R_t is calculated as above, the sum of the weighting factors R_t assigned to phonemes of the same type is equal to "1" for every type, and the number of times each phoneme is collected as a speech sample becomes equal.
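The first-occurrence rule described above can be sketched in a few lines. The phoneme sequence assigned to the 13 states is an assumption made for illustration; the text itself only confirms that S₁ and S₆ carry "k" and that S₁₁ carries "r". The function name `first_occurrence_weights` is likewise hypothetical.

```python
from collections import defaultdict


def first_occurrence_weights(units):
    """Give weight 1 to the first state of each unit type and 0 to repeats."""
    seen = set()
    weights = []
    for unit in units:
        weights.append(0 if unit in seen else 1)
        seen.add(unit)
    return weights


# Assumed phoneme assignment for states S1..S13 (illustrative only)
phonemes = ["k", "a", "t", "o", "u", "k", "o", "n", "t", "a", "r", "o", "u"]
R = first_occurrence_weights(phonemes)

# Sum of the weights for each phoneme type; every type should total 1
totals = defaultdict(int)
for phoneme, r in zip(phonemes, R):
    totals[phoneme] += r
```

With this assumed sequence, `R[5]` (the weight for S₆) is 0 and `R[10]` (the weight for S₁₁) is 1, and every entry of `totals` equals 1, which is the uniform-sum property the paragraph describes.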
[0122]
Hereinafter, the operation of the acoustic model learning apparatus will be described again along the flowchart of FIG. 10. The re-evaluation unit 104 calculates each statistic of the acoustic model (mixing weight, average vector, and covariance matrix) based on the weighting factor R_t calculated by the weight calculation unit 106, the feature amount extracted by the speech analysis unit 101, and the correspondence probability calculated by the forward / backward calculation unit 103 (step S1010).
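A minimal sketch of how a per-frame weighting factor can enter the statistics is given below. It follows the general form stated in the claims, in which the correspondence probability of each frame is multiplied by the weighting factor before the statistics are accumulated, but it is not a reproduction of this document's own equations. The sketch assumes a single state with K mixture components and scalar features (a real system would use feature vectors and covariance matrices), and the name `weighted_statistics` is hypothetical.

```python
def weighted_statistics(features, gamma, weights):
    """Accumulate weighted statistics for one state with K mixture components.

    features[t] : scalar feature of frame t (simplification of a vector)
    gamma[t][k] : correspondence probability of frame t and component k
    weights[t]  : weighting factor R_t of frame t
    Returns (mixing_weights, means, variances), one entry per component.
    Assumes every component receives a non-zero total weight.
    """
    K = len(gamma[0])
    occ = [0.0] * K  # weighted occupancy per component
    s1 = [0.0] * K   # weighted sum of features
    s2 = [0.0] * K   # weighted sum of squared features
    for o, g_row, r in zip(features, gamma, weights):
        for k in range(K):
            w = r * g_row[k]  # weighted correspondence probability
            occ[k] += w
            s1[k] += w * o
            s2[k] += w * o * o

    total = sum(occ)
    mixing = [n / total for n in occ]
    means = [s1[k] / occ[k] for k in range(K)]
    variances = [s2[k] / occ[k] - means[k] ** 2 for k in range(K)]
    return mixing, means, variances


# Four frames, two components; the third frame is excluded by R_t = 0
features = [1.0, 3.0, 10.0, 5.0]
gamma = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
weights = [1, 1, 0, 1]
mixing, means, variances = weighted_statistics(features, gamma, weights)
```

Because the third frame has a weight of 0, its feature value of 10.0 does not affect the second component, whose mean comes out as 5.0 rather than 7.5. This is the mechanism by which a frame's weight controls its influence on the re-estimated parameters.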
[0123]
After calculating each statistic of the acoustic model, the reevaluation unit 104 re-estimates, based on the calculated statistics, each parameter (the respective averages of the mixing weight, the average vector, and the covariance matrix) of the input acoustic model input via the forward / backward calculation unit 103, and creates an output acoustic model (step S1011). The created output acoustic model is output from the reevaluation unit 104 (step S1012). After the output acoustic model is output, the acoustic model learning device ends the operation.
[0124]
In the present embodiment, as described above, the sum of the weighting factors R_t is made constant for each set of states to which the same speech sample (phoneme, syllable, or word) is assigned. This makes the sample quantity of each speech sample (phoneme, syllable, or word unit) and its influence on the output acoustic model uniform, so that a highly reliable output acoustic model can be created.
[0125]
In addition, the acoustic model learning device executes the following processes: a speech analysis process that extracts a feature amount for each frame from the input learning speech; a dictionary generation process that generates a learning dictionary, which is information on a state sequence in which the correct answer sequence is assigned to the states in the input acoustic model, based on the input acoustic model, which uses a probability distribution indicating the feature amount extracted for each frame from predetermined speech to express the feature amount of each frame of that speech as a state and has the state as its structural unit, and on the correct answer sequence, which is character string information indicating the content of the learning speech; a correspondence probability calculation process that refers to the learning dictionary generated by the dictionary generation process and calculates, for each frame of the learning speech, the correspondence probability between the feature amount of the learning speech and the states in the input acoustic model; a maximum likelihood state sequence generation process that uses a predetermined character string to assign, with maximum likelihood, a state represented by the input acoustic model or a state sequence composed of a plurality of states to each frame of the learning speech, and generates a predetermined maximum likelihood state sequence; a weight calculation process that calculates, for each frame of the learning speech, the weighting coefficient applied when weighting the correspondence probability, based on the predetermined maximum likelihood state sequence generated by the maximum likelihood state sequence generation process; and a re-evaluation process that calculates statistics based on the correspondence probability calculated by the correspondence probability calculation process, the weighting coefficient calculated by the weight calculation process, and the feature amount calculated by the speech analysis process, re-estimates the parameters of the input acoustic model based on the calculated statistics, and creates the output acoustic model. The above processes are executed by a computer program included in the acoustic model learning device. The program may also be recorded on a recording medium such as an optical disk or a magnetic disk and loaded from the recording medium.
[0126]
The above-described embodiment is an example of a preferred embodiment of the present invention. The embodiment of the present invention is not limited to this, and various modifications are possible without departing from the scope of the present invention.
[0127]
[Effects of the invention]
As described above, by calculating a weighting factor for each frame of the learning speech and reflecting the weighting by that factor in the output acoustic model, the present invention can extract, from among the observed speech samples, only those that are useful for creating the acoustic model, and can thereby create a highly reliable acoustic model.
[0128]
Further, by setting the weighting coefficient of a high-quality speech sample of a predetermined language unit (phoneme, syllable, word, etc.) to "1" and the weighting coefficient of a low-quality speech sample to "0", the present invention can prevent low-quality speech samples from being reflected in the output acoustic model.
[0129]
Further, by setting the weighting coefficient of a high-quality speech sample of a predetermined language unit to "1" and the weighting coefficient of a low-quality speech sample to "an arbitrary value greater than 1", the present invention can create an output acoustic model with high speech recognition accuracy for low-quality speech samples.
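The two weighting policies above can be sketched with one helper. Following the claims, a frame is treated here as high quality when the first and second maximum likelihood state sequences assign it the same state, and as low quality otherwise. The function name `comparison_weights`, the example state labels, and the mismatch value 2.0 are assumptions chosen for illustration.

```python
def comparison_weights(first_seq, second_seq, mismatch_weight=0.0):
    """Weight 1 where the two state sequences agree, mismatch_weight elsewhere."""
    if len(first_seq) != len(second_seq):
        raise ValueError("both sequences must cover the same frames")
    return [1.0 if a == b else mismatch_weight
            for a, b in zip(first_seq, second_seq)]


# Per-frame states from the two alignments (illustrative labels)
first = ["k", "k", "a", "a", "t", "o"]   # aligned with the learning dictionary
second = ["k", "k", "a", "e", "t", "o"]  # aligned with the second dictionary

# Policy 1: suppress frames where the alignments disagree
suppress = comparison_weights(first, second, mismatch_weight=0.0)

# Policy 2: emphasize those frames with a value greater than 1
emphasize = comparison_weights(first, second, mismatch_weight=2.0)
```

Only the fourth frame differs between the two sequences, so it receives 0 under the first policy and 2.0 under the second, while all other frames keep a weight of 1.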
[0130]
Further, by making the sum of the weighting coefficients constant for each set of states to which the same speech sample (phoneme, syllable, or word) is assigned, the present invention makes the sample quantity of each speech sample (phoneme, syllable, or word unit) and its influence on the output acoustic model uniform, so that a highly reliable output acoustic model can be created.
[Brief description of the drawings]
FIG. 1 is a diagram illustrating a configuration of an acoustic model learning device according to a first embodiment of the present invention.
FIG. 2 is a diagram illustrating phoneme sets that can be represented by an input acoustic model according to the first embodiment of the present invention.
FIG. 3 is a diagram showing a learning dictionary created by the acoustic model learning device according to the first embodiment of the present invention.
FIG. 4 is a diagram showing the weighting factor R_t in the first embodiment of the present invention.
FIG. 5 is a diagram showing the weighting factor R_t in the second embodiment of the present invention.
FIG. 6 is a diagram showing the weighting factor R_t in the third embodiment of the present invention.
FIG. 7 is a diagram showing the weighting factor R_t in the fourth embodiment of the present invention.
FIG. 8 is a diagram showing the weighting factor R_t in the fifth embodiment of the present invention.
FIG. 9 is a flowchart showing an operation flow of the acoustic model learning apparatus according to the first embodiment of the present invention.
FIG. 10 is a flowchart showing an operation flow of the acoustic model learning device according to the fifth embodiment of the present invention.
[Explanation of symbols]
101 Speech analysis unit
102 Dictionary unit
103 Forward / backward calculation unit
104 Re-evaluation unit
105 Viterbi calculation unit
106 Weight calculation unit

Claims

Speech analysis means for extracting feature values for each frame from input learning speech;
Dictionary generating means for generating a learning dictionary, which is information on a state sequence in which a correct answer sequence is assigned to the states in an input acoustic model, based on the input acoustic model, which uses a probability distribution indicating the feature amount extracted for each frame from a predetermined sound to express each fragment of the predetermined sound divided per frame as a state and has the state as its structural unit, and on the correct answer sequence, which is character string information indicating the content of the learning speech;
A correspondence probability calculating means for referring to the learning dictionary generated by the dictionary generating means and calculating a correspondence probability between the feature amount of the learning speech and the state in the input acoustic model for each frame of the learning speech;
First maximum likelihood state sequence generation means for referring to the learning dictionary, assigning, with maximum likelihood, the state represented by the input acoustic model or a state sequence composed of a plurality of the states to each frame of the learning speech, and generating a first maximum likelihood state sequence;
Second maximum likelihood state sequence generation means for referring to a dictionary representing an arbitrary character, assigning, with maximum likelihood, the state represented by the input acoustic model or a state sequence composed of a plurality of the states to each frame of the learning speech, and generating a second maximum likelihood state sequence;
Weight calculation means for calculating, for each frame of the learning speech, a weighting coefficient, which is a coefficient applied when weighting the correspondence probability, based on the result of comparing the first maximum likelihood state sequence with the second maximum likelihood state sequence;
Re-estimating means for calculating a statistic based on the correspondence probability calculated by the correspondence probability calculating means, the weighting coefficient calculated by the weight calculation means, and the feature amount calculated by the speech analysis means, re-estimating the parameters of the input acoustic model based on the calculated statistic, and creating an output acoustic model;
An acoustic model learning device comprising:

The acoustic model learning apparatus according to claim 1, wherein the re-estimating means multiplies the correspondence probability for each frame of the learning speech by the weighting coefficient to weight the correspondence probability for each frame of the learning speech, calculates the statistic using the weighted correspondence probability, and creates the output acoustic model by re-estimating the parameters of the input acoustic model based on the calculated statistic.

The acoustic model learning apparatus according to claim 1, wherein the weight calculation means compares the first maximum likelihood state sequence and the second maximum likelihood state sequence for each frame of the learning speech, sets the weighting coefficient to 1 in frames in which the assigned state or state sequence composed of a plurality of states matches, and calculates the weighting coefficient as a value smaller than 1 in frames in which they differ.

The acoustic model learning apparatus according to claim 1, wherein the weight calculation means compares the first maximum likelihood state sequence and the second maximum likelihood state sequence for each frame of the learning speech, sets the weighting coefficient to 1 in frames in which the assigned state or state sequence composed of a plurality of states matches, and calculates the weighting coefficient as a value larger than 1 in frames in which they differ.

A voice analysis step of extracting feature values for each frame from the input learning voice;
A dictionary generation step of generating a learning dictionary, which is information on a state sequence in which a correct answer sequence is assigned to the states in an input acoustic model, based on the input acoustic model, which uses a probability distribution indicating the feature amount extracted for each frame from a predetermined sound to express each fragment of the predetermined sound divided per frame as a state and has the state as its structural unit, and on the correct answer sequence, which is character string information indicating the content of the learning speech;
A correspondence probability calculation step of referring to the learning dictionary generated by the dictionary generation step, and calculating a correspondence probability between the feature amount of the learning speech and the state in the input acoustic model for each frame of the learning speech;
A first maximum likelihood state sequence generation step of referring to the learning dictionary, assigning, with maximum likelihood, the state represented by the input acoustic model or a state sequence composed of a plurality of the states to each frame of the learning speech, and generating a first maximum likelihood state sequence;
A second maximum likelihood state sequence generation step of referring to a dictionary representing an arbitrary character, assigning, with maximum likelihood, the state represented by the input acoustic model or a state sequence composed of a plurality of the states to each frame of the learning speech, and generating a second maximum likelihood state sequence;
A weight calculation step of calculating, for each frame of the learning speech, a weighting coefficient, which is a coefficient applied when weighting the correspondence probability, based on the result of comparing the first maximum likelihood state sequence with the second maximum likelihood state sequence;
A re-estimation step of calculating a statistic based on the correspondence probability calculated by the correspondence probability calculation step, the weighting coefficient calculated by the weight calculation step, and the feature amount calculated by the voice analysis step, re-estimating the parameters of the input acoustic model based on the calculated statistic, and creating an output acoustic model;
An acoustic model learning method characterized by comprising:

A voice analysis process that extracts features for each frame from the input learning voice;
A dictionary generation process for generating a learning dictionary, which is information on a state sequence in which a correct answer sequence is assigned to the states in an input acoustic model, based on the input acoustic model, which uses a probability distribution indicating the feature amount extracted for each frame from a predetermined sound to express each fragment of the predetermined sound divided per frame as a state and has the state as its structural unit, and on the correct answer sequence, which is character string information indicating the content of the learning speech;
A correspondence probability calculation process for referring to the learning dictionary generated by the dictionary generation process and calculating a correspondence probability between the feature amount of the learning voice and the state in the input acoustic model for each frame of the learning voice;
A first maximum likelihood state sequence generation process for referring to the learning dictionary, assigning, with maximum likelihood, the state represented by the input acoustic model or a state sequence composed of a plurality of the states to each frame of the learning speech, and generating a first maximum likelihood state sequence;
A second maximum likelihood state sequence generation process for referring to a dictionary representing an arbitrary character, assigning, with maximum likelihood, the state represented by the input acoustic model or a state sequence composed of a plurality of the states to each frame of the learning speech, and generating a second maximum likelihood state sequence;
A weight calculation process for calculating, for each frame of the learning speech, a weighting coefficient, which is a coefficient applied when weighting the correspondence probability, based on the result of comparing the first maximum likelihood state sequence with the second maximum likelihood state sequence;
A re-evaluation process for calculating a statistic based on the correspondence probability calculated by the correspondence probability calculation process, the weighting coefficient calculated by the weight calculation process, and the feature amount calculated by the voice analysis process, re-estimating the parameters of the input acoustic model based on the calculated statistic, and creating an output acoustic model;
A program characterized by causing a computer to execute the above processes.