JP2005010464A

JP2005010464A - Device, method, and program for speech recognition

Info

Publication number: JP2005010464A
Application number: JP2003174441A
Authority: JP
Inventors: Seiichi Miki; 清一三木
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 2003-06-19
Filing date: 2003-06-19
Publication date: 2005-01-13

Abstract

PROBLEM TO BE SOLVED: To solve the problem that, when speech recognition is carried out using a plurality of acoustic models, the correct answer cannot be obtained if the score of a hypothesis based on the desired acoustic model worsens.
SOLUTION: After A/D conversion of the input speech, a feature extraction part 100 extracts feature quantities from the converted speech, and a distance calculation part 110 calculates the distance to an acoustic model 120. A hypothesis development part 130 develops hypotheses according to the distance, and an all-hypothesis pruning part 140 prunes hypotheses across all hypotheses. A hypothesis-by-environment pruning part 150 prunes hypotheses separately for each environment-specific acoustic model, and a hypothesis holding part 160 holds the remaining hypotheses without duplication.
COPYRIGHT: (C)2005, JPO&NCIPI

Description

【０００１】
【発明の属する技術分野】
本発明は、音声認識装置、音声認識方法、および、音声認識プログラムに関し、特に、環境別音響モデル毎の仮説の削減により、音声認識率を向上させる音声認識装置、音声認識方法、および、音声認識プログラムに関する。
【０００２】
【従来の技術】
特許文献１によれば、複数ｍ人の話者に対応して複数ｍ個の発声内容の仮説が存在し、入力された１人の話者の発声内容に基づいて発声内容と話者の２方向を同時にサーチの対象としてビームサーチしながら音声認識を連続的に実行することが示されている。
【０００３】
図６は、従来の特許文献１の音声認識装置３０の構成を示すブロック図である。
【０００４】
分かりやすくするために、図６は、特許文献１の図を機能的に書き直してある。
【０００５】
図７は、従来の特許文献１の音声認識装置３０の動作の概略を示す説明図である。
【０００６】
図６、図７を参照すると、従来の音声認識装置３０は、特徴抽出部３００と、距離計算部３１０と、音響モデル３２０と、仮説展開部３３０と、全仮説枝刈り部３４０と、仮説保持部３５０とから構成される。
【０００７】
特徴抽出部３００が、入力音声から特徴量を抽出し、距離計算部３１０が、予め用意されたｍ個の音響モデル３２０との距離を計算し、仮説展開部３３０が仮説を展開し、全仮説枝刈り部３４０が、時間毎に各話者モデル（Ｓ１〜Ｓｍ）の仮説全てに対して単一の基準に基づき可能性の低い仮説を削減し、仮説保持部３５０が残った仮説を保持し、最終的に入力音声の終了時間において最も可能性の高かった仮説を認識結果とする。一般に、ｍ個の話者モデルそれぞれを用いて独立に得られた認識結果の中から最も可能性の高い認識結果を選択する方法はよい認識結果を与えることが知られており、この従来の音声認識装置３０ではさらに、各音響モデルの仮説の削減を時間毎に同時に行うことで処理効率を向上させている。
【０００８】
また、一般的な確率モデルによる音声認識の技術が知られている。（非特許文献１、ｐ．１０〜１２）
【特許文献１】
特開平０７−１０４７８０号公報
【非特許文献１】
中川聖一著，「確率モデルによる音声認識」，第２版，電子情報通信学会，平成元年８月１０日，ｐ．１０〜１２
【０００９】
【発明が解決しようとする課題】
上述した特許文献１記載の複数のｍ人の音響モデルを同時に用い、その結果得られる全ての仮説に対してビームサーチを行う音声認識装置では、発声全体としては最適であり正解を与えるような音響モデルに基づく仮説が、一時的にスコアが悪くなった場合に枝刈りされてしまい、最終的に正解が得られないという問題がある。
【００１０】
本発明の目的は、複数の音響モデル・言語モデルを同時に用いて効率よく精度の高い認識結果を得ながら、全体としては最適な音響モデル・言語モデルによる仮説が、一時的にスコアが悪くなるような場合でも正解が得られるようにして音声認識性能を向上させた音声認識装置、音声認識方法、及びプログラムを提供することである。
【００１１】
【課題を解決するための手段】
本発明の第１の音声認識装置は、複数の環境別音響モデルを格納する記憶装置と、入力音声に対する仮説を前記記憶装置から読み出した環境別音響モデルごとに展開する仮説展開部と、前記仮説展開部からの仮説を環境別音響モデルごとに削減する環境別仮説枝刈り部とを有することを特徴とする。
【００１２】
本発明の第２の音声認識装置は、前記第１の音声認識装置であって、入力された音声を認識に適した特徴量の時系列に変換する特徴抽出部と、前記記憶装置の各環境別音響モデルと前記特徴抽出部からの特徴量との距離を計算する距離計算部と、仮説を保持する仮説保持部と、前記仮説保持部に保持された仮説および前記距離計算部からの距離から新たな仮説を生成する前記仮説展開部と、前記仮説生成部で生成された全ての仮説に対し削減を実施する全仮説枝刈り部とを有することを特徴とする。
【００１３】
本発明の第３の音声認識装置は、前記第２の音声認識装置であって、環境別音響モデルごとにスコア上位の仮説を所定の個数残す前記環境別枝刈り部を有することを特徴とする。
【００１４】
本発明の第４の音声認識装置は、前記第２の音声認識装置であって、環境別音響モデルごとに閾値以上のスコアを持つ仮説を残す前記環境別枝刈り部を有することを特徴とする。
【００１５】
本発明の第５の音声認識装置は、複数の環境別音響モデル、複数の環境別言語モデルを格納する記憶装置と、入力音声に対する仮説を前記記憶装置から読み出した環境別言語モデルごとに展開する仮説展開部と、前記仮説展開部からの仮説を環境別言語モデルごとに削減する環境別仮説枝刈り部とを有することを特徴とする。
【００１６】
本発明の第６の音声認識装置は、前記第５の音声認識装置であって、入力された音声を認識に適した特徴量の時系列に変換する特徴抽出部と、前記記憶装置の各環境別音響モデルと前記特徴抽出部からの特徴量との距離を計算する距離計算部と、仮説を保持する仮説保持部と、前記仮説保持部に保持された仮説、前記距離計算部からの距離、および、前記記憶装置の環境別言語モデルから新たな仮説を生成する前記仮説展開部と、前記仮説生成部で生成された全ての仮説に対し削減を実施する全仮説枝刈り部とを有することを特徴とする。
【００１７】
本発明の第７の音声認識装置は、前記第６の音声認識装置であって、言語モデルごとにスコア上位の仮説を所定の個数残すようにする前記環境別枝刈り部を有することを特徴とする。
【００１８】
本発明の第８の音声認識装置は、前記第６の音声認識装置であって、言語モデルごとに閾値以上のスコアを持つ仮説を残すようにする前記環境別枝刈り部を有することを特徴とする。
【００１９】
本発明の第１の音声認識方法は、入力音声に対する仮説を記憶装置から読み出した環境別音響モデルごとに展開する仮説展開手順と、前記仮説展開手順からの仮説を環境別音響モデルごとに削減する環境別仮説枝刈り手順とを含むことを特徴とする。
【００２０】
本発明の第２の音声認識方法は、前記第１の音声認識方法であって、入力された音声を認識に適した特徴量の時系列に変換する特徴抽出手順と、前記記憶装置の各環境別音響モデルと前記特徴抽出手順からの特徴量との距離を計算する距離計算手順と、仮説保持部に保持された仮説および前記距離計算手順からの距離から新たな仮説を生成する前記仮説展開手順と、前記仮説生成手順で生成された全ての仮説に対し削減を実施する全仮説枝刈り手順とを含むことを特徴とする。
【００２１】
本発明の第３の音声認識方法は、前記第２の音声認識方法であって、環境別音響モデルごとにスコア上位の仮説を所定の個数残す前記環境別枝刈り手順を含むことを特徴とする。
【００２２】
本発明の第４の音声認識方法は、前記第２の音声認識方法であって、環境別音響モデルごとに閾値以上のスコアを持つ仮説を残す前記環境別枝刈り手順を含むことを特徴とする。
【００２３】
本発明の第５の音声認識方法は、入力音声に対する仮説を記憶装置から読み出した環境別言語モデルごとに展開する仮説展開手順と、前記仮説展開手順からの仮説を環境別言語モデルごとに削減する環境別仮説枝刈り手順とを含むことを特徴とする。
【００２４】
本発明の第６の音声認識方法は、前記第５の音声認識方法であって、入力された音声を認識に適した特徴量の時系列に変換する特徴抽出手順と、前記記憶装置からの各環境別音響モデルと前記特徴抽出手順からの特徴量との距離を計算する距離計算手順と、仮説保持部に保持された仮説、前記距離計算手順からの距離、および、前記記憶装置の環境別言語モデルから新たな仮説を生成する前記仮説展開手順と、前記仮説生成手順で生成された全ての仮説に対し削減を実施する全仮説枝刈り手順とを含むことを特徴とする。
【００２５】
本発明の第７の音声認識方法は、前記第６の音声認識方法であって、言語モデルごとにスコア上位の仮説を所定の個数残すようにする前記環境別枝刈り手順を含むことを特徴とする。
【００２６】
本発明の第８の音声認識方法は、前記第６の音声認識方法であって、言語モデルごとに閾値以上のスコアを持つ仮説を残すようにする前記環境別枝刈り手順を含むことを特徴とする。
【００２７】
本発明の第１の音声認識プログラムは、入力音声に対する仮説を記憶装置から読み出した環境別音響モデルごとに展開する仮説展開手順と、前記仮説展開手順からの仮説を環境別音響モデルごとに削減する環境別仮説枝刈り手順とをコンピュータに実行させることを特徴とする。
【００２８】
本発明の第２の音声認識プログラムは、前記第１の音声認識プログラムであって、入力された音声を認識に適した特徴量の時系列に変換する特徴抽出手順と、前記記憶装置の各環境別音響モデルと前記特徴抽出手順からの特徴量との距離を計算する距離計算手順と、仮説保持部に保持された仮説および前記距離計算手順からの距離から新たな仮説を生成する前記仮説展開手順と、前記仮説生成手順で生成された全ての仮説に対し削減を実施する全仮説枝刈り手順とをコンピュータに実行させることを特徴とする。
【００２９】
本発明の第３の音声認識プログラムは、前記第２の音声認識プログラムであって、環境別音響モデルごとにスコア上位の仮説を所定の個数残す前記環境別枝刈り手順をコンピュータに実行させることを特徴とする。
【００３０】
本発明の第４の音声認識プログラムは、前記第２の音声認識プログラムであって、環境別音響モデルごとに閾値以上のスコアを持つ仮説を残す前記環境別枝刈り手順をコンピュータに実行させることを特徴とする。
【００３１】
本発明の第５の音声認識プログラムは、入力音声に対する仮説を記憶装置から読み出した環境別言語モデルごとに展開する仮説展開手順と、前記仮説展開手順からの仮説を環境別言語モデルごとに削減する環境別仮説枝刈り手順とをコンピュータに実行させることを特徴とする。
【００３２】
本発明の第６の音声認識プログラムは、前記第５の音声認識プログラムであって、入力された音声を認識に適した特徴量の時系列に変換する特徴抽出手順と、前記記憶装置からの各環境別音響モデルと前記特徴抽出手順からの特徴量との距離を計算する距離計算手順と、仮説保持部に保持された仮説、前記距離計算手順からの距離、および、前記記憶装置の環境別言語モデルから新たな仮説を生成する前記仮説展開手順と、前記仮説生成手順で生成された全ての仮説に対し削減を実施する全仮説枝刈り手順とをコンピュータに実行させることを特徴とする。
【００３３】
本発明の第７の音声認識プログラムは、前記第６の音声認識プログラムであって、言語モデルごとにスコア上位の仮説を所定の個数残すようにする前記環境別枝刈り手順をコンピュータに実行させることを特徴とする。
【００３４】
本発明の第８の音声認識プログラムは、前記第６の音声認識プログラムであって、言語モデルごとに閾値以上のスコアを持つ仮説を残すようにする前記環境別枝刈り手順をコンピュータに実行させることを特徴とする。
【００３５】
【発明の実施の形態】
次に、本発明の第１の実施の形態について図面を参照して詳細に説明する。
【００３６】
図１は、本発明の第１の実施の形態の構成を示すブロック図である。
【００３７】
図１を参照すると、本発明の第１の実施の形態の音声認識装置１０は、プログラムで実現されるか、あるいは、プログラムを含む特徴抽出部１００と、距離計算部１１０と、仮説展開部１３０と、全仮説枝刈り部１４０と、環境別仮説枝刈り部１５０と、記憶手段（たとえば、メモリ、ハードディスク装置）に設けられる音響モデル１２０と、仮説保持部１６０とから構成される。
【００３８】
特徴抽出部１００は、入力された音声信号をＡ／Ｄ変換した後、たとえば、ＬＰＣ分析を実行するなどして音声認識に適した多次元のベクトルを特徴量として抽出し、入力時間順に従って特徴量の時系列を出力する。ＬＰＣ分析については、非特許文献１に詳しい。多次元ベクトルの要素は実数であり、次元数は１０〜４０程度のことが多い。音響モデル１２０には話者（性別・年齢等）や入力手段（電話・マイク等）、背景雑音等、異なる特徴を持つ環境毎に複数（ｍ［個］）の音響モデル（ｍ［個］の環境別音響モデルと呼ぶ）が蓄積される。音響モデルは、たとえば、具体的にはＨＭＭやニューラルネットワーク等で表現される。また、各環境別音響モデルは、単語や音節、音素といったシンボル毎に用意され、連続分布ＨＭＭであれば、距離計算部１１０は、それぞれのシンボル毎の特徴量の出現確率を求める。たとえば、シンボル（たとえば、音素）がｋ［種類］であれば、ｍ×ｋ［個］の音響モデルが存在する。
【００３９】
図５は、音響モデル１２０の例を示す説明図である。
【００４０】
図５を参照すると、音響モデル１２０は、ｍ［個］の環境別音響モデルＲ１〜環境別音響モデルＲｍを含む。また、各環境別音響モデルＲ１〜環境別音響モデルＲｍは、それぞれ、ｋ［個］のシンボル別音響モデルＬｅ１〜シンボル別音響モデルＬｅｋを含んでいる。ここで、ｅは、各環境別音響モデルＲ１〜環境別音響モデルＲｍを示す。
【００４１】
距離計算部１１０は、入力された特徴量の時系列の特定時刻における音響モデルからの距離を算出する。
【００４２】
たとえば、入力された特徴ベクトルｘ、音響モデルの特徴ベクトルｗのある音素に対する出現確率は、以下の式で与えられる。
【００４３】
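【数１】
$$P(x/w) = \frac{1}{(2\pi)^{n/2}\,|\Sigma|^{1/2}} \exp\left\{-\frac{1}{2}(x-\mu)^{t}\,\Sigma^{-1}\,(x-\mu)\right\}$$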
【００４４】
ここで、ｘ：入力された特徴ベクトル、ｎ：特徴ベクトルの次元数、μ：音響モデルｗの平均ベクトル、Σ：音響モデルｗの共分散行列、ｔ：転置行列を示す記号、−１：逆行列を示す記号である。
【００４５】
ｍ×ｋ個の音響モデルに対して、ｍ×ｋ個の距離がそれぞれ得られる。特定時刻における距離の例としては、音響モデルとして連続分布ＨＭＭを用いる場合に、各音響モデルがその時刻の特徴量（多次元ベクトル）を出力する確率の対数を距離として用いる例がある。前述の通り、音響モデルにはシンボルが定められており、単語や音節、音素といったシンボルに対して距離が与えられる。またＨＭＭであれば状態遷移確率を持ち、その対数値も距離に加えられる。
【００４６】
たとえば、距離Ｔは、Ｔ＝ｌｏｇ｛Ｐ（ｘ／ｗ）｝で与えられる。
【００４７】
仮説展開部１３０は、仮説保持部１６０に保持されている仮説から展開可能な仮説を距離計算部１１０で得られた距離に基づいて展開し、新たな仮説を生成する。仮説は、ある時刻（通常音声開始点Ｚ０）からある時刻（Ｚ１）までのシンボル列、および、シンボル列の各シンボルに対する音響モデル距離の総和などから計算されるスコアによって構成される。順次、時刻（Ｚｎ）からある時刻（Ｚｎ＋１）までのスコアが計算される。
【００４８】
たとえば、スコアＳは、Ｓ＝総和［ｌｏｇ｛Ｐ（ｘ／ｗ）｝］で与えられる。
【００４９】
スコアは、通常一つの実数である。当然ながら、同じ時間区間に対し複数の仮説が存在し、それらはスコアによって比較可能である。
【００５０】
また、仮説を展開可能とは、たとえば、日本語であれば、＜促音「っ」の後には母音が来ない＞といったような規則があり、シンボル列として許容できる場合を展開可能という。入力音声の終端に達した仮説は、認識結果として出力される。すなわち、認識結果とは音声開始から終了までの区間で最もスコアの高い仮説である。仮説保持部１６０で保持されている仮説は音響モデルによって区別され、それぞれ対応する音響モデルによる距離が用いられる。
【００５１】
全仮説枝刈り部１４０は、新たに生成された仮説を加えた全ての仮説に対し、たとえば、閾値以下のスコアを持つ仮説を削減する。環境別仮説枝刈り部１５０は、新たに生成された仮説を加えた対応する音響モデルの仮説について音響モデル毎に閾値以下のスコアを持つ仮説を削減する。前述の通り、仮説は、シンボル列とスコアとから構成される。仮説を環境別に区別するということは、すなわち、同じシンボル列であってもその仮説が得られた環境が異なれば区別するということである。たとえば、「ぱんがたべたい」という発声に対し、環境として男性音響モデル・女性音響モデルを用いることを考えた場合、処理過程で仮説として以下のものが得られたとする。
【００５２】
（男性音響モデル，シンボル列「ぱんが」，スコア：−１５）。
【００５３】
（男性音響モデル，シンボル列「ぱんだ」，スコア：−３０）。
【００５４】
（女性音響モデル，シンボル列「ぱすが」，スコア：−３０）。
【００５５】
（女性音響モデル，シンボル列「ばすが」，スコア：−５０）。
【００５６】
この時、全仮説枝刈り部１４０は、全ての仮説に対し仮説の削減を行う（これを枝刈りと呼ぶ）。たとえば、閾値が−２０であれば（男性音響モデル，系列「ぱんが」，スコア：−１５）のみが残ることとなる。これに対し、環境別仮説枝刈り部１５０は、それぞれ環境毎に区別して仮説の削減を行う。たとえば、男性音響モデルによる仮説では閾値−２０、女性音響モデルによる仮説では閾値−４０で削減を行うとすると、（男性音響モデル，系列「ぱんが」，スコア：−１５）、および、（女性音響モデル，系列「ぱすが」，スコア：−３０）がそれぞれ残る。
【００５７】
仮説保持部１６０は、全仮説枝刈り部１４０、および、環境別仮説枝刈り部１５０から出力される仮説を重複しないように保持する。前の例であれば、（男性音響モデル，系列「ぱんが」，スコア：−１５）は重複する仮説の例となっており、一つのみ保持される。
【００５８】
なお、全仮説枝刈り部１４０と環境別仮説枝刈り部１５０とを一つにまとめて、全仮説の削減（枝刈り）するときに、時間的に同時に、音響モデル毎の仮説の削減（枝刈り）も行うことで機能をそこなわずに処理を高速化できることは明らかである。
【００５９】
次に、本発明の第１の実施の形態の動作について図面を参照して詳細に説明する。
【００６０】
図２は、本発明の第１の実施の形態の動作を示すフローチャートである。
【００６１】
図２を参照すると、まず、ユーザの発声した音声が入力されると、特徴抽出部１００は、入力音声のＡ／Ｄや、分析等の処理を行い、音声認識に適した特徴量の時系列を出力する（図２ステップＳ５１）。音声認識装置１０の初期化を行う（図２ステップＳ５２）。詳細には、特徴抽出部１００が入力のどの時間の部分が処理されているかを示す時間ｔを“０”に初期化する（図示しないメモリ等に時間ｔ＝０を格納する）。この時、特徴抽出部１００は、入力音声の終端を検出すると、終端であることを示す終端情報を付加して格納する。また、仮説保持部１６０は、内容を初期化する。次に、距離計算部１１０が、音響モデル１２０の全ての音響モデルについて入力音声の時間ｔにおける距離を計算する（図２ステップＳ５３）。
【００６２】
次に、仮説展開部１３０が、仮説保持部１６０に保持されている仮説から展開可能な仮説について、距離計算部１１０で計算された距離に基づき新たな仮説を展開する（図２ステップＳ５４）。すなわち、スコアを計算する。仮説展開部１３０は、時間ｔが入力音声の終端であれば（図２ステップＳ５５／Ｙ）、現在の仮説のうち最もスコアの高い仮説を認識結果として出力し処理を終了する（図２ステップＳ６０）。
【００６３】
時間ｔが、入力音声の終端でない場合（図２ステップＳ５５／Ｎ）、全仮説枝刈り部１４０が、仮説展開部１３０で得られた仮説全てについて、たとえば、スコアがある閾値以下の仮説を棄却して仮説の削減を行う（図２ステップＳ５６）。また、環境別仮説枝刈り部１５０が得られた仮説を音響モデル毎に別々に削減する（図２ステップＳ５７）。仮説保持部１６０は、全仮説枝刈り部１４０、および、環境別仮説枝刈り部１５０で残った仮説をマージして保持する（図２ステップＳ５８）。マージの仕方としては、たとえば、重複するものは一方のみ残すような仕方が考えられる。
【００６４】
次に、特徴抽出部１００は、時間ｔ＝ｔ＋１とし（図２ステップＳ５９）、距離計算部１１０以降は、時間ｔ＝ｔ＋１における処理を行う（図２ステップＳ５３以降）。
【００６５】
このように、本発明の第１の実施の形態の音声認識装置１０では、環境別仮説枝刈り部１５０を持つことにより、全ての音響モデルの仮説が常に残るため、一時的にスコアが低くなるような発声に対して図６の従来の音声認識装置３０と比べよりよい認識結果を得ることができる。
【００６６】
図３は本発明の第１の実施の形態の動作を示す説明図である。
【００６７】
図３を参照すると、音響モデルの区別に関わらず全体で枝刈りを行う他に、環境別音響モデル毎（Ｓ１、Ｓ２、・・・）にも仮説の削減（枝刈り）を行う。
【００６８】
次に、本発明の第２の実施の形態について図面を参照して詳細に説明する。
【００６９】
図４は、本発明の第２の実施の形態の構成を示すブロック図である。
【００７０】
図４を参照すると、本発明の第２の実施の形態の音声認識装置２０は、本発明の第１の実施の形態の音声認識装置１０に対して、言語モデル２７０が追加され、仮説展開部２３０、環境別枝刈り部２５０、および、仮説保持部２６０の動作が、それぞれ、仮説展開部１３０、環境別枝刈り部１５０、および、仮説保持部１６０の動作とは異なる。なお、図４では、図１と同じ機能の構成要素は、図１と同じ符号を付しているのでこれらの構成要素の説明は省略する。
【００７１】
言語モデル２７０は、分野や機能の差など、異なる特徴を持つ環境毎に複数（ｎ個）の言語モデル（環境別言語モデルと呼ぶ）を蓄積する。言語モデル２７０は、たとえば、具体的には、単語辞書、ｎ−ｇｒａｍモデルやＣＦＧ文法、また、その組み合わせ等で表現される。たとえば、ｎ−ｇｒａｍモデルの一つである２−ｇｒａｍの言語モデル２７０は、Ｐ（ｙ｜ｙ１）のように、単語ｙ１に続いて単語ｙが出現する確率を記録する。音響モデルのシンボルから単語への変換方法は単語辞書に記述される。たとえば、音響モデルのシンボルが音素の場合、「ｎｉｐｐｏｎ」というシンボル（音素）列が「日本」という単語に対応することなどが単語辞書に記述される。
【００７２】
仮説展開部２３０は、仮説保持部２６０に保持されている仮説から展開可能な仮説を距離計算部１１０で得られた距離と言語モデル２７０とに基づいて展開し新たな仮説を生成する。言語モデルに基づく展開としては、たとえば、仮説の音響モデルのシンボル列を単語列ｙ１，ｙ２，・・・，ｙｎに変換し、２−ｇｒａｍモデルの場合にはΣｌｏｇＰ（ｙｉ｜ｙ（ｉ−１））（ｉは１〜ｎ）を仮説のスコアに加えることなどが具体的な方法として挙げられる。入力音声の終端に達した仮説は、認識結果として出力される。仮説保持部２６０に保持されている仮説は、音響モデル１２０、言語モデル２７０によって区別され、それぞれ対応する音響モデル１２０による距離と言語モデル２７０とが用いられる。環境別仮説枝刈り部２５０は、新たに生成された仮説について、環境別音響モデル毎・環境別言語モデル毎に閾値以下のスコアを持つ仮説を削減する。音響モデル１２０は、１個でもよい。または、複数の音響モデルに対して言語モデル２７０は、１個でもよい。
【００７３】
このように本発明の第２の実施の形態では、本発明の第１の実施の形態と同じ効果を得るとともに、言語モデル２７０も異なる環境として使用することで音響的な特徴だけでなく、言語的な特徴（分野や機能等）も考慮に入れることができる。
【００７４】
本発明の第１の実施の形態、本発明の第２の実施の形態の構成の他に、たとえば、仮説を削減する際に閾値とスコアとの比較で仮説を削減するのではなく、スコアが上位の仮説を所定の個数のみ残すようにして仮説を削減することも可能である。また、閾値と比較する値として仮説のスコアそのものや、全仮説中で最大のスコアからの差、環境毎の仮説中で最大のスコアからの差を用いることも可能である。認識結果として最大のスコアの仮説を出力するだけでなく、たとえば、全体としてスコア上位の複数候補を出力したり、環境毎にスコア上位の複数候補を出力したり、ワードグラフの形式で出力することも可能である。
【００７５】
また、仮説保持部１６０、仮説保持部２６０において過去一定時間のスコア、または、全仮説に対する順位を保持しておき、ある音響モデル１２０または言語モデル２７０による仮説が一定時間の間所定の閾値以下のスコア、または、順位であった場合に、その環境の仮説を削減することで処理を高速化することも可能である。
【００７６】
次に、本発明の第３の実施の形態について説明する。
【００７７】
本発明の第３の実施の形態は、全仮説枝刈り部１４０を省略し、全ての仮説を、それぞれ、音響モデル１２０、言語モデル２７０に応じて環境別枝刈り部２５０で仮説の削減を行い、そのそれぞれの仮説の削減（枝刈り）の基準を全体で制御する構成をとる。
【００７８】
次に、本発明の第４の実施の形態について説明する。
【００７９】
本発明の第４の実施の形態は、本発明の第１の実施の形態〜本発明の第３の実施の形態の処理の各ステップ（図２等）を含む方法である。
【００８０】
次に、本発明の第５の実施の形態について説明する。
【００８１】
本発明の第５の実施の形態は、本発明の第４の実施の形態の処理の各ステップをコンピュータ（たとえば、音声認識装置１０、音声認識装置２０）に実行させるプログラムである。
【００８２】
【発明の効果】
本発明によれば、全体的な仮説の枝刈りに加え、環境毎に仮説の枝刈りを行うようにしたので、一時的にスコアが悪くなるため正解が枝刈りされてしまうような場合でも正解を得ることができるようになり、認識率を改善できるという効果がある。
【図面の簡単な説明】
【図１】本発明の第１の実施の形態の構成を示すブロック図である。
【図２】本発明の第１の実施の形態の動作を示すフローチャートである。
【図３】本発明の第１の実施の形態の動作を示す説明図である。
【図４】本発明の第２の実施の形態の構成を示すブロック図である。
【図５】音響モデルの例を示す説明図である。
【図６】従来の技術の構成を示すブロック図である。
【図７】従来の技術の動作を示す説明図である。
【符号の説明】
１０音声認識装置
１００特徴抽出部
１１０距離計算部
１２０音響モデル
１３０仮説展開部
１４０全仮説枝刈り部
１５０環境別仮説枝刈り部
１６０仮説保持部
２０音声認識装置
２３０仮説展開部
２５０環境別仮説枝刈り部
２６０仮説保持部
２７０言語モデル
３０音声認識装置
３００特徴抽出部
３１０距離計算部
３２０音響モデル
３３０仮説展開部
３４０全仮説枝刈り部
３５０仮説保持部
[0001]
BACKGROUND OF THE INVENTION
The present invention relates to a speech recognition device, a speech recognition method, and a speech recognition program, and in particular to a speech recognition device, method, and program that improve the speech recognition rate by pruning hypotheses separately for each environment-specific acoustic model.
[0002]
[Prior art]
According to Patent Document 1, m hypotheses of utterance content exist corresponding to m speakers, and speech recognition is executed continuously while a beam search is performed simultaneously in two directions, utterance content and speaker, based on the input utterance of a single speaker.
[0003]
FIG. 6 is a block diagram showing the configuration of the conventional speech recognition device 30 of Patent Document 1.
[0004]
For ease of understanding, FIG. 6 redraws the figure of Patent Document 1 in functional terms.
[0005]
FIG. 7 is an explanatory diagram outlining the operation of the conventional speech recognition device 30 of Patent Document 1.
[0006]
Referring to FIGS. 6 and 7, the conventional speech recognition device 30 comprises a feature extraction unit 300, a distance calculation unit 310, an acoustic model 320, a hypothesis expansion unit 330, an all-hypothesis pruning unit 340, and a hypothesis holding unit 350.
[0007]
The feature extraction unit 300 extracts feature quantities from the input speech; the distance calculation unit 310 calculates the distances to the m acoustic models 320 prepared in advance; the hypothesis expansion unit 330 expands hypotheses; the all-hypothesis pruning unit 340, at each time step, prunes unlikely hypotheses from all hypotheses of the speaker models (S1 to Sm) on the basis of a single criterion; and the hypothesis holding unit 350 holds the surviving hypotheses. Finally, the most likely hypothesis at the end time of the input speech is taken as the recognition result. In general, selecting the most likely result from among the recognition results obtained independently with each of the m speaker models is known to give good recognition results. The conventional speech recognition device 30 further improves processing efficiency by pruning the hypotheses of all acoustic models simultaneously at each time step.
[0008]
General speech recognition techniques based on probabilistic models are also known (Non-Patent Document 1, pp. 10-12).
[Patent Document 1]
Japanese Patent Application Laid-Open No. 07-104780
[Non-Patent Document 1]
Seiichi Nakagawa, “Speech Recognition by Stochastic Models”, 2nd edition, IEICE, August 10, 1989, pp. 10-12
[0009]
[Problems to be solved by the invention]
The speech recognition device described in Patent Document 1 uses the acoustic models of m speakers simultaneously and performs a beam search over all of the resulting hypotheses. It therefore has the problem that a hypothesis based on the acoustic model that is optimal for the utterance as a whole, and that would give the correct answer, is pruned when its score temporarily worsens, so that the correct answer is ultimately not obtained.
[0010]
An object of the present invention is to provide a speech recognition device, a speech recognition method, and a program with improved speech recognition performance: they obtain highly accurate recognition results efficiently by using a plurality of acoustic models and language models simultaneously, while still reaching the correct answer even when the score of a hypothesis based on the acoustic model or language model that is optimal overall temporarily worsens.
[0011]
[Means for Solving the Problems]
The first speech recognition device of the present invention comprises a storage device that stores a plurality of environment-specific acoustic models, a hypothesis expansion unit that expands hypotheses for input speech for each environment-specific acoustic model read from the storage device, and an environment-specific hypothesis pruning unit that prunes the hypotheses from the hypothesis expansion unit for each environment-specific acoustic model.
[0012]
The second speech recognition device of the present invention is the first speech recognition device, further comprising a feature extraction unit that converts input speech into a time series of feature quantities suitable for recognition, a distance calculation unit that calculates the distance between each environment-specific acoustic model in the storage device and the feature quantities from the feature extraction unit, a hypothesis holding unit that holds hypotheses, the hypothesis expansion unit, which generates new hypotheses from the hypotheses held in the hypothesis holding unit and the distances from the distance calculation unit, and an all-hypothesis pruning unit that prunes all hypotheses generated by the hypothesis generation unit.
[0013]
The third speech recognition device of the present invention is the second speech recognition device, wherein the environment-specific pruning unit keeps a predetermined number of the highest-scoring hypotheses for each environment-specific acoustic model.
[0014]
The fourth speech recognition device of the present invention is the second speech recognition device, wherein the environment-specific pruning unit keeps, for each environment-specific acoustic model, the hypotheses whose score is equal to or higher than a threshold.
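The count-based variant described for the third device can be illustrated with a minimal sketch. The tuple layout `(environment, symbol_string, score)`, the function name `keep_top_n_per_env`, and the sample data are assumptions made for illustration only; the source does not prescribe a data structure.

```python
from collections import defaultdict
import heapq


def keep_top_n_per_env(hyps, n):
    """Keep the n highest-scoring hypotheses within each environment.

    hyps: list of (environment, symbol_string, score) tuples (hypothetical layout).
    """
    by_env = defaultdict(list)
    for hyp in hyps:
        by_env[hyp[0]].append(hyp)
    kept = []
    for env_hyps in by_env.values():
        # nlargest returns the entries sorted from highest to lowest score.
        kept.extend(heapq.nlargest(n, env_hyps, key=lambda h: h[2]))
    return kept


# Illustrative data: two environments with two hypotheses each.
example = [
    ("env_a", "h1", -10.0),
    ("env_a", "h2", -20.0),
    ("env_b", "h3", -30.0),
    ("env_b", "h4", -25.0),
]
print(keep_top_n_per_env(example, 1))
```

Because the count is applied per environment, every environment keeps at least one hypothesis as long as it has any, regardless of how its scores compare with those of other environments.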
[0015]
The fifth speech recognition device of the present invention comprises a storage device that stores a plurality of environment-specific acoustic models and a plurality of environment-specific language models, a hypothesis expansion unit that expands hypotheses for input speech for each environment-specific language model read from the storage device, and an environment-specific hypothesis pruning unit that prunes the hypotheses from the hypothesis expansion unit for each environment-specific language model.
[0016]
The sixth speech recognition device of the present invention is the fifth speech recognition device, further comprising a feature extraction unit that converts input speech into a time series of feature quantities suitable for recognition, a distance calculation unit that calculates the distance between each environment-specific acoustic model in the storage device and the feature quantities from the feature extraction unit, a hypothesis holding unit that holds hypotheses, the hypothesis expansion unit, which generates new hypotheses from the hypotheses held in the hypothesis holding unit, the distances from the distance calculation unit, and the environment-specific language models in the storage device, and an all-hypothesis pruning unit that prunes all hypotheses generated by the hypothesis generation unit.
[0017]
The seventh speech recognition device of the present invention is the sixth speech recognition device, wherein the environment-specific pruning unit keeps a predetermined number of the highest-scoring hypotheses for each language model.
[0018]
The eighth speech recognition device of the present invention is the sixth speech recognition device, wherein the environment-specific pruning unit keeps, for each language model, the hypotheses whose score is equal to or higher than a threshold.
[0019]
The first speech recognition method of the present invention includes a hypothesis expansion procedure that expands hypotheses for input speech for each environment-specific acoustic model read from a storage device, and an environment-specific hypothesis pruning procedure that prunes the hypotheses from the hypothesis expansion procedure for each environment-specific acoustic model.
[0020]
The second speech recognition method of the present invention is the first speech recognition method, further including a feature extraction procedure that converts input speech into a time series of feature quantities suitable for recognition, a distance calculation procedure that calculates the distance between each environment-specific acoustic model in the storage device and the feature quantities from the feature extraction procedure, the hypothesis expansion procedure, which generates new hypotheses from the hypotheses held in a hypothesis holding unit and the distances from the distance calculation procedure, and an all-hypothesis pruning procedure that prunes all hypotheses generated by the hypothesis generation procedure.
[0021]
The third speech recognition method of the present invention is the second speech recognition method, wherein the environment-specific pruning procedure keeps a predetermined number of the highest-scoring hypotheses for each environment-specific acoustic model.
[0022]
The fourth speech recognition method of the present invention is the second speech recognition method, wherein the environment-specific pruning procedure keeps, for each environment-specific acoustic model, the hypotheses whose score is equal to or higher than a threshold.
[0023]
The fifth speech recognition method of the present invention includes a hypothesis expansion procedure that expands hypotheses for input speech for each environment-specific language model read from a storage device, and an environment-specific hypothesis pruning procedure that prunes the hypotheses from the hypothesis expansion procedure for each environment-specific language model.
[0024]
The sixth speech recognition method of the present invention is the fifth speech recognition method, further including a feature extraction procedure that converts input speech into a time series of feature quantities suitable for recognition, a distance calculation procedure that calculates the distance between each environment-specific acoustic model from the storage device and the feature quantities from the feature extraction procedure, the hypothesis expansion procedure, which generates new hypotheses from the hypotheses held in a hypothesis holding unit, the distances from the distance calculation procedure, and the environment-specific language models in the storage device, and an all-hypothesis pruning procedure that prunes all hypotheses generated by the hypothesis generation procedure.
[0025]
The seventh speech recognition method of the present invention is the sixth speech recognition method, wherein the environment-specific pruning procedure keeps a predetermined number of the highest-scoring hypotheses for each language model.
[0026]
The eighth speech recognition method of the present invention is the sixth speech recognition method, wherein the environment-specific pruning procedure keeps, for each language model, the hypotheses whose score is equal to or higher than a threshold.
[0027]
The first speech recognition program of the present invention causes a computer to execute a hypothesis expansion procedure that expands hypotheses for input speech for each environment-specific acoustic model read from a storage device, and an environment-specific hypothesis pruning procedure that prunes the hypotheses from the hypothesis expansion procedure for each environment-specific acoustic model.
[0028]
The second speech recognition program of the present invention is the first speech recognition program, further causing the computer to execute a feature extraction procedure that converts input speech into a time series of feature quantities suitable for recognition, a distance calculation procedure that calculates the distance between each environment-specific acoustic model in the storage device and the feature quantities from the feature extraction procedure, the hypothesis expansion procedure, which generates new hypotheses from the hypotheses held in a hypothesis holding unit and the distances from the distance calculation procedure, and an all-hypothesis pruning procedure that prunes all hypotheses generated by the hypothesis generation procedure.
[0029]
The third speech recognition program of the present invention is the second speech recognition program, causing the computer to execute the environment-specific pruning procedure so that it keeps a predetermined number of the highest-scoring hypotheses for each environment-specific acoustic model.
[0030]
The fourth speech recognition program of the present invention is the second speech recognition program, causing the computer to execute the environment-specific pruning procedure so that it keeps, for each environment-specific acoustic model, the hypotheses whose score is equal to or higher than a threshold.
[0031]
The fifth speech recognition program of the present invention causes a computer to execute a hypothesis expansion procedure that expands hypotheses for input speech for each environment-specific language model read from a storage device, and an environment-specific hypothesis pruning procedure that prunes the hypotheses from the hypothesis expansion procedure for each environment-specific language model.
[0032]
The sixth speech recognition program of the present invention is the fifth speech recognition program, further causing the computer to execute a feature extraction procedure that converts input speech into a time series of feature quantities suitable for recognition, a distance calculation procedure that calculates the distance between each environment-specific acoustic model from the storage device and the feature quantities from the feature extraction procedure, the hypothesis expansion procedure, which generates new hypotheses from the hypotheses held in a hypothesis holding unit, the distances from the distance calculation procedure, and the environment-specific language models in the storage device, and an all-hypothesis pruning procedure that prunes all hypotheses generated by the hypothesis generation procedure.
[0033]
The seventh speech recognition program of the present invention is the sixth speech recognition program, causing the computer to execute the environment-specific pruning procedure so that it keeps a predetermined number of the highest-scoring hypotheses for each language model.
[0034]
The eighth speech recognition program of the present invention is the sixth speech recognition program, causing the computer to execute the environment-specific pruning procedure so that it keeps, for each language model, the hypotheses whose score is equal to or higher than a threshold.
[0035]
DETAILED DESCRIPTION OF THE INVENTION
Next, a first embodiment of the present invention will be described in detail with reference to the drawings.
[0036]
FIG. 1 is a block diagram showing the configuration of the first exemplary embodiment of the present invention.
[0037]
Referring to FIG. 1, the speech recognition device 10 according to the first embodiment of the present invention comprises a feature extraction unit 100, a distance calculation unit 110, a hypothesis expansion unit 130, an all-hypothesis pruning unit 140, and an environment-specific hypothesis pruning unit 150, each of which is implemented by a program or includes a program, together with an acoustic model 120 and a hypothesis holding unit 160 provided in storage means (for example, memory or a hard disk device).
[0038]
The feature extraction unit 100 performs A/D conversion on the input speech signal, then extracts multidimensional vectors suitable for speech recognition as feature quantities, for example by LPC analysis, and outputs them as a time series in input order. LPC analysis is described in detail in Non-Patent Document 1. The elements of each multidimensional vector are real numbers, and the number of dimensions is typically about 10 to 40. The acoustic model 120 stores a plurality (m) of acoustic models, one for each environment with different characteristics such as speaker (gender, age, etc.), input means (telephone, microphone, etc.), and background noise; these are called the m environment-specific acoustic models. An acoustic model is represented concretely by, for example, an HMM or a neural network. Each environment-specific acoustic model is prepared for each symbol, such as a word, syllable, or phoneme; with a continuous-density HMM, the distance calculation unit 110 obtains the output probability of the feature quantity for each symbol. For example, if there are k kinds of symbols (for example, phonemes), there are m × k acoustic models.
[0039]
FIG. 5 is an explanatory diagram illustrating an example of the acoustic model 120.
[0040]
Referring to FIG. 5, the acoustic model 120 includes m environment-specific acoustic models R1 to Rm. Each of the environment-specific acoustic models R1 to Rm in turn includes k symbol-specific acoustic models Le1 to Lek, where e identifies the environment-specific acoustic model (R1 to Rm) to which the symbol-specific model belongs.
[0041]
The distance calculation unit 110 calculates the distance between the input feature time series at a specific time and each acoustic model.
[0042]
For example, the output probability of an input feature vector x for a given phoneme under acoustic model w is given by the following equation.
[0043]
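[Expression 1]
$$P(x/w) = \frac{1}{(2\pi)^{n/2}\,|\Sigma|^{1/2}} \exp\left\{-\frac{1}{2}(x-\mu)^{t}\,\Sigma^{-1}\,(x-\mu)\right\}$$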
[0044]
Here, x is the input feature vector, n is the number of dimensions of the feature vector, μ is the mean vector of acoustic model w, Σ is the covariance matrix of acoustic model w, the superscript t denotes transposition, and the superscript −1 denotes the matrix inverse.
[0045]
For the m × k acoustic models, m × k distances are obtained. As an example of the distance at a specific time, when continuous-density HMMs are used as the acoustic models, the logarithm of the probability that each acoustic model outputs the feature quantity (multidimensional vector) at that time can be used as the distance. As described above, each acoustic model is associated with a symbol, so a distance is given for each symbol such as a word, syllable, or phoneme. An HMM also has state transition probabilities, whose logarithms are added to the distance.
[0046]
For example, the distance T is given by T = log{P(x/w)}.
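As a rough illustration of this distance, the sketch below evaluates T = log{P(x/w)} for every (environment, symbol) pair, giving m × k values per time step. It assumes each symbol model is a single Gaussian with a diagonal covariance matrix (a simplification chosen here for brevity; the text defines a general covariance matrix Σ), and the toy model table, names, and parameter values are hypothetical.

```python
import math


def log_gaussian(x, mean, var):
    """Log density of a diagonal-covariance Gaussian evaluated at x.

    var holds the diagonal entries of the covariance matrix (an assumption
    of this sketch; a full covariance matrix would need a matrix inverse).
    """
    n = len(x)
    log_det = sum(math.log(v) for v in var)
    maha = sum((xi - mi) ** 2 / vi for xi, mi, vi in zip(x, mean, var))
    return -0.5 * (n * math.log(2.0 * math.pi) + log_det + maha)


# Hypothetical toy table: models[environment][symbol] = (mean, variance).
# Here m = 2 environments and k = 2 symbols, so there are m * k = 4 models.
models = {
    "male": {"a": ([0.0, 0.0], [1.0, 1.0]), "i": ([2.0, 2.0], [1.0, 1.0])},
    "female": {"a": ([1.0, 1.0], [1.0, 1.0]), "i": ([3.0, 3.0], [1.0, 1.0])},
}


def distances(x):
    """Return the distance T for every (environment, symbol) pair."""
    return {
        (env, sym): log_gaussian(x, mean, var)
        for env, syms in models.items()
        for sym, (mean, var) in syms.items()
    }


d = distances([0.0, 0.0])
print(len(d), d[("male", "a")])
```

With this convention a larger (less negative) distance means a better match, which agrees with the later description of the highest-scoring hypothesis being the recognition result.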
[0047]
The hypothesis expansion unit 130 expands the expandable hypotheses among those held in the hypothesis holding unit 160, based on the distances obtained by the distance calculation unit 110, and generates new hypotheses. A hypothesis consists of a symbol string spanning from one time (normally the speech start point Z0) to another time (Z1) and a score calculated from, among other things, the sum of the acoustic model distances for the symbols in the string. Scores are then calculated successively from time Zn to time Zn+1.
[0048]
For example, the score S is given by S = sum[log{P(x/w)}].
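The running sum can be sketched as follows. The hypothesis layout `(environment, symbols, score)`, the helper name `extend`, and the numeric distances are illustrative assumptions, not values from the source.

```python
def extend(hypothesis, symbol, distance):
    """Append one symbol to a hypothesis and add its log-probability distance.

    hypothesis: (environment, tuple_of_symbols, score), a hypothetical layout.
    """
    env, symbols, score = hypothesis
    return (env, symbols + (symbol,), score + distance)


# Start from an empty hypothesis for one environment and extend it step by step.
hyp = ("male", (), 0.0)
for symbol, distance in [("p", -4.0), ("a", -5.0), ("N", -6.0)]:
    hyp = extend(hyp, symbol, distance)

print(hyp)  # ('male', ('p', 'a', 'N'), -15.0)
```

Keeping the environment inside the hypothesis is what later allows hypotheses with the same symbol string to be told apart by the acoustic model that produced them.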
[0049]
The score is usually a single real number. Naturally, multiple hypotheses exist for the same time interval, and they can be compared by their scores.
[0050]
A hypothesis is expandable when the extended symbol string is permissible. In Japanese, for example, there are rules such as "no vowel follows the geminate marker っ (sokuon)", and a symbol string that satisfies such rules can be expanded. A hypothesis that has reached the end of the input speech is output as the recognition result; that is, the recognition result is the hypothesis with the highest score over the interval from the start to the end of the speech. The hypotheses held in the hypothesis holding unit 160 are distinguished by acoustic model, and for each hypothesis the distance from its corresponding acoustic model is used.
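A check of this kind might look like the sketch below. It assumes kana characters as symbols and encodes only the single rule quoted in the text; the function name `is_expandable` and the vowel set are illustrative, and a real recognizer would consult a lexicon or grammar instead.

```python
# Hypothetical symbol inventory: the five hiragana vowels.
VOWELS = set("あいうえお")


def is_expandable(symbols, next_symbol):
    """Return False when appending next_symbol would put a vowel after っ."""
    if symbols and symbols[-1] == "っ" and next_symbol in VOWELS:
        return False
    return True


print(is_expandable("きっ", "て"))  # True: a consonant-initial kana may follow っ
print(is_expandable("きっ", "あ"))  # False: a vowel may not follow っ
```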
[0051]
The all-hypothesis pruning unit 140 reduces, for example, the hypotheses whose score is equal to or lower than a threshold, taking as its target all hypotheses including the newly generated ones. The environment-specific hypothesis pruning unit 150 reduces, separately for each acoustic model, the hypotheses whose score is equal to or lower than a threshold, taking as its target the hypotheses of that acoustic model including the newly generated ones. As described above, a hypothesis consists of a symbol string and a score. Distinguishing hypotheses by environment means that hypotheses with the same symbol string are still treated as distinct if they were obtained from different environments. For example, suppose that a male acoustic model and a female acoustic model are used as the environments for the utterance “Pangatatai”, and that the following hypotheses are obtained in the course of processing.
[0052]
(Male acoustic model, symbol string “Panga”, score: −15).
[0053]
(Male acoustic model, symbol string “Panda”, score: −30).
[0054]
(Female acoustic model, symbol string “Pasuga”, score: −30).
[0055]
(Female acoustic model, symbol string “Basuga”, score: −50).
[0056]
At this time, the all-hypothesis pruning unit 140 reduces hypotheses (this is called pruning) over all hypotheses. For example, if the threshold is −20, only (male acoustic model, symbol string “Panga”, score: −15) remains. In contrast, the environment-specific hypothesis pruning unit 150 reduces hypotheses separately for each environment. For example, if reduction is performed with a threshold of −20 for hypotheses based on the male acoustic model and a threshold of −40 for hypotheses based on the female acoustic model, (male acoustic model, symbol string “Panga”, score: −15) and (female acoustic model, symbol string “Pasuga”, score: −30) remain.
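The worked example above can be reproduced with the following sketch. Hypotheses are written as (environment, symbol string, score) tuples; the function names `prune_all` and `prune_by_env` are assumptions introduced here for illustration.

```python
# The four hypotheses from the example.
hyps = [
    ("male", "Panga", -15),
    ("male", "Panda", -30),
    ("female", "Pasuga", -30),
    ("female", "Basuga", -50),
]

def prune_all(hyps, threshold):
    """Drop every hypothesis whose score is at or below a single threshold."""
    return [h for h in hyps if h[2] > threshold]

def prune_by_env(hyps, thresholds):
    """Drop hypotheses whose score is at or below the threshold of their own environment."""
    return [h for h in hyps if h[2] > thresholds[h[0]]]

kept_all = prune_all(hyps, -20)
kept_env = prune_by_env(hyps, {"male": -20, "female": -40})
```

With the single threshold, no hypothesis of the female acoustic model survives, whereas the per-environment thresholds keep one hypothesis for each acoustic model.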
[0057]
The hypothesis holding unit 160 holds the hypotheses output from the all-hypothesis pruning unit 140 and the environment-specific hypothesis pruning unit 150 without duplication. In the previous example, (male acoustic model, symbol string “Panga”, score: −15) is output by both units, and only one copy is retained.
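One possible way to hold the two pruning outputs without duplication is sketched below, assuming hypotheses are hashable (environment, symbol string, score) tuples. The function name `merge` is hypothetical.

```python
def merge(*hypothesis_lists):
    """Merge pruning outputs, keeping only the first copy of each duplicate hypothesis."""
    seen = set()
    merged = []
    for hyps in hypothesis_lists:
        for h in hyps:
            if h not in seen:
                seen.add(h)
                merged.append(h)
    return merged

# Outputs of the two pruning units in the example above.
from_all = [("male", "Panga", -15)]
from_env = [("male", "Panga", -15), ("female", "Pasuga", -30)]
held = merge(from_all, from_env)
```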
[0058]
It is clear that, when the all-hypothesis pruning unit 140 and the environment-specific hypothesis pruning unit 150 are combined into a single hypothesis reduction (pruning) step so that the hypothesis reduction (pruning) for each acoustic model is performed simultaneously at each time, processing can be sped up without loss of function.
[0059]
Next, the operation of the first exemplary embodiment of the present invention will be described in detail with reference to the drawings.
[0060]
FIG. 2 is a flowchart showing the operation of the first exemplary embodiment of the present invention.
[0061]
Referring to FIG. 2, first, when speech uttered by a user is input, the feature extraction unit 100 performs processing such as A/D conversion and analysis of the input speech, and outputs a time series of feature quantities suitable for speech recognition (step S51 in FIG. 2). The speech recognition device 10 is then initialized (step S52 in FIG. 2). Specifically, the feature extraction unit 100 initializes the time t, which indicates which part of the input is being processed, to “0” (storing t = 0 in a memory or the like, not shown). At this time, when it detects the end of the input speech, the feature extraction unit 100 adds and stores end information indicating the end. The hypothesis holding unit 160 initializes its contents. Next, the distance calculation unit 110 calculates the distance of the input speech at time t for all the acoustic models of the acoustic model 120 (step S53 in FIG. 2).
[0062]
Next, the hypothesis development unit 130 develops new hypotheses, that is, calculates their scores, from the hypotheses held in the hypothesis holding unit 160 that can be developed, based on the distance calculated by the distance calculation unit 110 (step S54 in FIG. 2). If the time t is the end of the input speech (step S55/Y in FIG. 2), the hypothesis development unit 130 outputs the hypothesis with the highest score among the current hypotheses as the recognition result and ends the process (step S60 in FIG. 2).
[0063]
When the time t is not the end of the input speech (step S55/N in FIG. 2), the all-hypothesis pruning unit 140 reduces the hypotheses obtained by the hypothesis development unit 130, for example by rejecting those whose score is at or below a certain threshold among all hypotheses (step S56 in FIG. 2). Further, the environment-specific hypothesis pruning unit 150 reduces the hypotheses obtained by the hypothesis development unit 130 separately for each acoustic model (step S57 in FIG. 2). The hypothesis holding unit 160 merges and holds the hypotheses that remain after the all-hypothesis pruning unit 140 and the environment-specific hypothesis pruning unit 150 (step S58 in FIG. 2). As a method of merging, for example, only one of each set of duplicates may be kept.
[0064]
Next, the feature extraction unit 100 sets time t = t + 1 (step S59 in FIG. 2), and the distance calculation unit 110 and subsequent units perform processing at time t = t + 1 (step S53 and subsequent steps in FIG. 2).
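The flow of steps S52 to S60 can be sketched as a frame-synchronous loop. Everything below is an illustrative assumption: the toy acoustic models, the `expand` function, and the use of fixed absolute score thresholds are stand-ins chosen to keep the example small, not the implementation described in the specification.

```python
def recognize(frames, models, expand, global_threshold, env_thresholds):
    """Sketch of the search loop of FIG. 2 over (environment, symbols, score) tuples."""
    # S52: initialise with one empty hypothesis per environment (acoustic model).
    held = [(env, (), 0.0) for env in models]
    for t, x in enumerate(frames):
        # S53: distances of frame x under every acoustic model.
        dist = {env: model(x) for env, model in models.items()}
        # S54: develop new hypotheses using the distances of the matching model.
        new = [h2 for h in held for h2 in expand(h, dist[h[0]])]
        # S55/S60: at the end of the input, output the highest-scoring hypothesis.
        if t == len(frames) - 1:
            return max(new, key=lambda h: h[2])
        # S56: pruning over all hypotheses with a single threshold.
        kept_all = [h for h in new if h[2] > global_threshold]
        # S57: pruning separately for each acoustic model.
        kept_env = [h for h in new if h[2] > env_thresholds[h[0]]]
        # S58: merge, keeping one copy of duplicates (order preserved).
        held = list(dict.fromkeys(kept_all + kept_env))
        # S59: the loop advances to time t + 1.
    return None

# Toy acoustic models (assumed): each maps a frame to {symbol: log probability}.
models = {
    "male": lambda x: {"a": -1.0 * x, "b": -2.0 * x},
    "female": lambda x: {"a": -3.0 * x, "b": -1.5 * x},
}

def expand(h, dist):
    """Extend a hypothesis by every symbol, adding the symbol's distance to the score."""
    env, symbols, score = h
    return [(env, symbols + (s,), score + d) for s, d in dist.items()]

best = recognize([1.0, 1.0], models, expand, -1.2, {"male": -1.2, "female": -2.0})
```

In this toy run the global threshold alone would discard every hypothesis of the female model after the first frame; the per-environment threshold keeps one of them alive until the end of the input.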
[0065]
As described above, because the speech recognition device 10 according to the first exemplary embodiment of the present invention includes the environment-specific hypothesis pruning unit 150, hypotheses of all acoustic models always remain. Consequently, for utterances whose score temporarily decreases, a better recognition result can be obtained than with the conventional speech recognition apparatus 30.
[0066]
FIG. 3 is an explanatory diagram showing the operation of the first embodiment of the present invention.
[0067]
Referring to FIG. 3, in addition to performing pruning as a whole regardless of the distinction of acoustic models, hypothesis reduction (pruning) is performed for each environment-specific acoustic model (S1, S2,...).
[0068]
Next, a second embodiment of the present invention will be described in detail with reference to the drawings.
[0069]
FIG. 4 is a block diagram showing the configuration of the second exemplary embodiment of the present invention.
[0070]
Referring to FIG. 4, the speech recognition device 20 according to the second exemplary embodiment of the present invention has a language model 270 added to the speech recognition device 10 according to the first exemplary embodiment of the present invention, and the operations of the hypothesis development unit 230, the environment-specific hypothesis pruning unit 250, and the hypothesis holding unit 260 differ from those of the hypothesis development unit 130, the environment-specific hypothesis pruning unit 150, and the hypothesis holding unit 160, respectively. In FIG. 4, components having the same functions as those in FIG. 1 are denoted by the same reference numerals.
[0071]
The language model 270 stores a plurality (n) of language models (referred to as environment-specific language models), one for each environment with different characteristics such as differences in field or function. The language model 270 is expressed specifically by, for example, a word dictionary, an n-gram model, a CFG grammar, or a combination thereof. For example, a 2-gram language model, which is one kind of n-gram model, records the probability that the word y appears after the word y1 as P(y | y1). The word dictionary describes how symbols of the acoustic model are converted into words. For example, when the symbols of the acoustic model are phonemes, the word dictionary describes that the symbol (phoneme) string “nippon” corresponds to the word “Japan”.
[0072]
The hypothesis development unit 230 develops, based on the distance obtained by the distance calculation unit 110 and on the language model 270, the hypotheses that can be developed from those held in the hypothesis holding unit 260, and generates new hypotheses. As a specific method of development based on the language model, for example, the symbol string of the acoustic model in a hypothesis is converted into a word string y1, y2, ..., yn, and the logarithm of the language model probability of each word yi (i is 1 to n), for example log(P(yi | yi−1)) for a 2-gram model, is added to the hypothesis score. The hypothesis that has reached the end of the input speech is output as the recognition result. The hypotheses held in the hypothesis holding unit 260 are distinguished by acoustic model 120 and by language model 270, and each is developed using the distances of its corresponding acoustic model 120 and its corresponding language model 270. The environment-specific hypothesis pruning unit 250 reduces, separately for each environment-specific acoustic model and each environment-specific language model, the hypotheses (including the newly generated ones) whose score is equal to or lower than a threshold. A single acoustic model 120 may be used with a plurality of language models; alternatively, a single language model 270 may be used with a plurality of acoustic models.
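The addition of language model log probabilities to the hypothesis score can be sketched as follows for a 2-gram model. The probability table, the start-of-sentence token `<s>`, the word sequence, and the floor value for unseen word pairs are all assumptions made for this illustration.

```python
import math

# Hypothetical 2-gram probabilities P(y | y1), keyed as (y1, y).
bigram = {
    ("<s>", "Japan"): 0.2,
    ("Japan", "is"): 0.5,
}

def lm_log_score(words, bigram, floor=1e-10):
    """Sum of log P(y_i | y_{i-1}) over a word string, with <s> as the initial context."""
    total = 0.0
    prev = "<s>"
    for w in words:
        # Unseen pairs fall back to a small floor probability (an assumption).
        total += math.log(bigram.get((prev, w), floor))
        prev = w
    return total

# Acoustic score of a hypothesis (assumed) combined with its language model score.
acoustic_score = -12.0
score = acoustic_score + lm_log_score(["Japan", "is"], bigram)
```

With several environment-specific language models, the same hypothesis would receive a different combined score under each table, which is why hypotheses are distinguished by language model as well as by acoustic model.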
[0073]
As described above, the second embodiment of the present invention provides the same effect as the first embodiment, and, because the language model 270 is also treated as a distinct environment, environment-specific characteristics of the language (fields, functions, etc.) can also be taken into account.
[0074]
In addition to the configurations of the first and second embodiments of the present invention, hypotheses may be reduced, for example, not by comparing the score with a threshold but by leaving only a predetermined number of hypotheses with the highest scores. As the value to be compared with the threshold, it is also possible to use the hypothesis score itself, the difference from the maximum score among all hypotheses, or the difference from the maximum score among the hypotheses of each environment. In addition to outputting the highest-scoring hypothesis as the recognition result, it is also possible, for example, to output multiple candidates with high scores overall, to output multiple candidates with high scores for each environment, or to output the result in the form of a word graph.
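Two of the variations above, keeping a fixed number of top hypotheses per environment and comparing the difference from the per-environment maximum score with a threshold, might look like the following sketch. The function names and the `margin` parameter are hypothetical.

```python
from collections import defaultdict

def prune_top_n_per_env(hyps, n):
    """Keep the n highest-scoring hypotheses of each environment."""
    by_env = defaultdict(list)
    for h in hyps:
        by_env[h[0]].append(h)
    kept = []
    for env_hyps in by_env.values():
        kept.extend(sorted(env_hyps, key=lambda h: h[2], reverse=True)[:n])
    return kept

def prune_by_margin(hyps, margin):
    """Keep hypotheses whose score is within `margin` of the best score in their environment."""
    best = {}
    for env, _, score in hyps:
        best[env] = max(best.get(env, score), score)
    return [h for h in hyps if best[h[0]] - h[2] < margin]

hyps = [
    ("male", "Panga", -15),
    ("male", "Panda", -30),
    ("female", "Pasuga", -30),
    ("female", "Basuga", -50),
]
top1 = prune_top_n_per_env(hyps, 1)
near_best = prune_by_margin(hyps, 16)
```

Both variants guarantee that at least the best hypothesis of every environment survives, regardless of how the absolute scores of the environments compare.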
[0075]
In addition, the hypothesis holding unit 160 and the hypothesis holding unit 260 may hold, for a certain past period, the scores or the ranks among all hypotheses. When the hypotheses of a given acoustic model 120 or language model 270 have remained at or below a predetermined score or rank for a certain time, processing can be sped up by reducing the hypotheses of that environment.
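A rough sketch of this speed-up is given below, assuming that the rank of an environment at each time is the rank of its best hypothesis among all hypotheses. The class name `EnvironmentMonitor` and the parameters `window` and `rank_limit` are hypothetical, and the specification does not prescribe this particular bookkeeping.

```python
from collections import deque

class EnvironmentMonitor:
    """Tracks, per environment, the recent best rank among all hypotheses."""

    def __init__(self, window, rank_limit):
        self.window = window          # number of past time steps to remember
        self.rank_limit = rank_limit  # ranks worse (larger) than this count as low
        self.history = {}

    def update(self, hyps):
        """Record this time step's best rank per environment; return environments to drop."""
        ranked = sorted(hyps, key=lambda h: h[2], reverse=True)
        best_rank = {}
        for rank, (env, _, _) in enumerate(ranked, start=1):
            best_rank.setdefault(env, rank)
        dropped = []
        for env, rank in best_rank.items():
            hist = self.history.setdefault(env, deque(maxlen=self.window))
            hist.append(rank)
            # Drop only when the whole window is filled with low ranks.
            if len(hist) == self.window and all(r > self.rank_limit for r in hist):
                dropped.append(env)
        return dropped

hyps = [
    ("male", "Panga", -15),
    ("male", "Panda", -30),
    ("female", "Pasuga", -30),
    ("female", "Basuga", -50),
]
monitor = EnvironmentMonitor(window=2, rank_limit=2)
first = monitor.update(hyps)
second = monitor.update(hyps)
```

Here the female model's best hypothesis ranks third at both time steps, so it is reported for removal only after the window of two steps is full.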
[0076]
Next, a third embodiment of the present invention will be described.
[0077]
In the third embodiment of the present invention, the all-hypothesis pruning unit 140 is omitted, and all hypotheses are reduced by the environment-specific hypothesis pruning unit 250 for each acoustic model 120 and each language model 270, with the hypothesis reduction (pruning) criteria controlled as a whole.
[0078]
Next, a fourth embodiment of the present invention will be described.
[0079]
The fourth embodiment of the present invention is a method including each step (FIG. 2 and the like) of processing of the first embodiment to the third embodiment of the present invention.
[0080]
Next, a fifth embodiment of the present invention will be described.
[0081]
The fifth embodiment of the present invention is a program that causes a computer (for example, the speech recognition device 10 or the speech recognition device 20) to execute each step of the processing of the fourth embodiment of the present invention.
[0082]
[Effect of the invention]
According to the present invention, because hypotheses are pruned for each environment in addition to the pruning of all hypotheses, the correct answer can be obtained even in cases where it would otherwise be pruned because its score temporarily deteriorates, and the recognition rate can be improved.
[Brief description of the drawings]
FIG. 1 is a block diagram showing a configuration of a first exemplary embodiment of the present invention.
FIG. 2 is a flowchart showing the operation of the first exemplary embodiment of the present invention.
FIG. 3 is an explanatory diagram showing an operation of the first exemplary embodiment of the present invention.
FIG. 4 is a block diagram showing a configuration of a second exemplary embodiment of the present invention.
FIG. 5 is an explanatory diagram showing an example of an acoustic model.
FIG. 6 is a block diagram showing a configuration of a conventional technique.
FIG. 7 is an explanatory diagram showing the operation of a conventional technique.
[Explanation of symbols]
10 Speech recognition device
100 Feature extraction unit
110 Distance calculation unit
120 Acoustic model
130 Hypothesis development unit
140 All-hypothesis pruning unit
150 Environment-specific hypothesis pruning unit
160 Hypothesis holding unit
20 Speech recognition device
230 Hypothesis development unit
250 Environment-specific hypothesis pruning unit
260 Hypothesis holding unit
270 Language model
30 Speech recognition device
300 Feature extraction unit
310 Distance calculation unit
320 Acoustic model
330 Hypothesis development unit
340 All-hypothesis pruning unit
350 Hypothesis holding unit

Claims

A speech recognition apparatus comprising: a storage device that stores a plurality of environment-specific acoustic models; a hypothesis expansion unit that expands hypotheses for input speech for each environment-specific acoustic model read from the storage device; and an environment-specific hypothesis pruning unit that reduces the hypotheses from the hypothesis expansion unit for each environment-specific acoustic model.

The speech recognition apparatus according to claim 1, comprising: a feature extraction unit that converts input speech into a time series of feature quantities suitable for recognition; a distance calculation unit that calculates a distance between each acoustic model in the storage device and the feature quantities from the feature extraction unit; a hypothesis holding unit that holds hypotheses; the hypothesis expansion unit, which generates new hypotheses from the hypotheses held in the hypothesis holding unit and the distance from the distance calculation unit; and an all-hypothesis pruning unit that performs reduction on all of the hypotheses generated by the hypothesis expansion unit.

The speech recognition apparatus according to claim 2, comprising the environment-specific hypothesis pruning unit, which leaves a predetermined number of hypotheses with the highest scores for each environment-specific acoustic model.

The speech recognition apparatus according to claim 2, comprising the environment-specific hypothesis pruning unit, which leaves hypotheses having a score equal to or greater than a threshold for each environment-specific acoustic model.

A speech recognition apparatus comprising: a storage device that stores a plurality of environment-specific acoustic models and a plurality of environment-specific language models; a hypothesis expansion unit that expands hypotheses for input speech for each environment-specific language model read from the storage device; and an environment-specific hypothesis pruning unit that reduces the hypotheses from the hypothesis expansion unit for each environment-specific language model.

The speech recognition apparatus according to claim 5, comprising: a feature extraction unit that converts input speech into a time series of feature quantities suitable for recognition; a distance calculation unit that calculates a distance between each acoustic model in the storage device and the feature quantities from the feature extraction unit; a hypothesis holding unit that holds hypotheses; the hypothesis expansion unit, which generates new hypotheses from the hypotheses held in the hypothesis holding unit, the distance from the distance calculation unit, and the environment-specific language models in the storage device; and an all-hypothesis pruning unit that performs reduction on all of the hypotheses generated by the hypothesis expansion unit.

The speech recognition apparatus according to claim 6, comprising the environment-specific hypothesis pruning unit, which leaves a predetermined number of hypotheses with the highest scores for each language model.

The speech recognition apparatus according to claim 6, comprising the environment-specific hypothesis pruning unit, which leaves hypotheses having a score equal to or greater than a threshold for each language model.

A speech recognition method comprising: a hypothesis expansion procedure for expanding hypotheses for input speech for each environment-specific acoustic model read from a storage device; and an environment-specific hypothesis pruning procedure for reducing the hypotheses from the hypothesis expansion procedure for each environment-specific acoustic model.

The speech recognition method according to claim 9, comprising: a feature extraction procedure for converting input speech into a time series of feature quantities suitable for recognition; a distance calculation procedure for calculating a distance between each acoustic model in the storage device and the feature quantities from the feature extraction procedure; the hypothesis expansion procedure, which generates new hypotheses from the hypotheses held in a hypothesis holding unit and the distance from the distance calculation procedure; and an all-hypothesis pruning procedure for performing reduction on all of the hypotheses generated in the hypothesis expansion procedure.

The speech recognition method according to claim 10, comprising the environment-specific hypothesis pruning procedure, which leaves a predetermined number of hypotheses with the highest scores for each environment-specific acoustic model.

The speech recognition method according to claim 10, comprising the environment-specific hypothesis pruning procedure, which leaves hypotheses having a score equal to or greater than a threshold for each environment-specific acoustic model.

A speech recognition method comprising: a hypothesis expansion procedure for expanding hypotheses for input speech for each environment-specific language model read from a storage device; and an environment-specific hypothesis pruning procedure for reducing the hypotheses from the hypothesis expansion procedure for each environment-specific language model.

The speech recognition method according to claim 13, comprising: a feature extraction procedure for converting input speech into a time series of feature quantities suitable for recognition; a distance calculation procedure for calculating a distance between each environment-specific acoustic model from the storage device and the feature quantities from the feature extraction procedure; the hypothesis expansion procedure, which generates new hypotheses from the hypotheses held in a hypothesis holding unit, the distance from the distance calculation procedure, and the environment-specific language models in the storage device; and an all-hypothesis pruning procedure for performing reduction on all of the hypotheses generated in the hypothesis expansion procedure.

The speech recognition method according to claim 14, comprising the environment-specific hypothesis pruning procedure, which leaves a predetermined number of hypotheses with the highest scores for each language model.

The speech recognition method according to claim 14, comprising the environment-specific hypothesis pruning procedure, which leaves hypotheses having a score equal to or greater than a threshold for each language model.

A speech recognition program that causes a computer to execute: a hypothesis expansion procedure for expanding hypotheses for input speech for each environment-specific acoustic model read from a storage device; and an environment-specific hypothesis pruning procedure for reducing the hypotheses from the hypothesis expansion procedure for each environment-specific acoustic model.

The speech recognition program according to claim 17, which causes the computer to execute: a feature extraction procedure for converting input speech into a time series of feature quantities suitable for recognition; a distance calculation procedure for calculating a distance between each acoustic model in the storage device and the feature quantities from the feature extraction procedure; the hypothesis expansion procedure, which generates new hypotheses from the hypotheses held in a hypothesis holding unit and the distance from the distance calculation procedure; and an all-hypothesis pruning procedure for performing reduction on all of the hypotheses generated in the hypothesis expansion procedure.

The speech recognition program according to claim 18, which causes the computer to execute the environment-specific hypothesis pruning procedure, which leaves a predetermined number of hypotheses with the highest scores for each environment-specific acoustic model.

The speech recognition program according to claim 18, which causes the computer to execute the environment-specific hypothesis pruning procedure, which leaves hypotheses having a score equal to or greater than a threshold for each environment-specific acoustic model.

A speech recognition program that causes a computer to execute: a hypothesis expansion procedure for expanding hypotheses for input speech for each environment-specific language model read from a storage device; and an environment-specific hypothesis pruning procedure for reducing the hypotheses from the hypothesis expansion procedure for each environment-specific language model.

The speech recognition program according to claim 21, which causes the computer to execute: a feature extraction procedure for converting input speech into a time series of feature quantities suitable for recognition; a distance calculation procedure for calculating a distance between each environment-specific acoustic model from the storage device and the feature quantities from the feature extraction procedure; the hypothesis expansion procedure, which generates new hypotheses from the hypotheses held in a hypothesis holding unit, the distance from the distance calculation procedure, and the environment-specific language models in the storage device; and an all-hypothesis pruning procedure for performing reduction on all of the hypotheses generated in the hypothesis expansion procedure.

The speech recognition program according to claim 22, which causes the computer to execute the environment-specific hypothesis pruning procedure, which leaves a predetermined number of hypotheses with the highest scores for each language model.

The speech recognition program according to claim 22, which causes the computer to execute the environment-specific hypothesis pruning procedure, which leaves hypotheses having a score equal to or greater than a threshold for each language model.