JP2004117724A

JP2004117724A - Speech recognition device

Info

Publication number: JP2004117724A
Application number: JP2002279725A
Authority: JP
Inventors: Akira Baba; 馬場　朗
Original assignee: Matsushita Electric Works Ltd
Current assignee: Panasonic Electric Works Co Ltd
Priority date: 2002-09-25
Filing date: 2002-09-25
Publication date: 2004-04-15
Anticipated expiration: 2022-09-25
Also published as: JP4221986B2

Abstract

PROBLEM TO BE SOLVED: To provide a speech recognition device which enables a user to obtain a high speech recognition performance even in a plurality of spaces having different acoustic environments without re-learning acoustic models for respective acoustic environments or selecting standard patterns.

SOLUTION: A standard pattern storage part 6 preliminarily stores a plurality of standard patterns 6a to 6e corresponding to acoustic environments having different reverberation time. A pattern collation part 4 successively compares the plurality of standard patterns 6a to 6e stored in the standard pattern storage part 6 with the speech feature quantity of an input speech outputted from a speech feature quantity extraction part 3 and outputs a recognition result of the highest degree of similarity as a final recognition result 5.

COPYRIGHT: (C)2004,JPO

Description

【０００１】
【発明の属する技術分野】
本発明は、利用者の入力した音声を認識する音声認識装置に関するものである。
【０００２】
【従来の技術】
音声認識技術は、優れたヒューマンインターフェースを具現する上で重要な役割を担っている。音声認識技術を適用した音声認識装置としては図８に示すような構成の装置が従来提供されている。この音声認識装置は、音声を入力するマイクロフォンからなる音声入力部１と、音声入力部１からの出力信号をＡ／Ｄ変換するＡ／Ｄ変換部２と、Ａ／Ｄ変換部２からの出力信号を周波数変換して音声の特徴量を抽出する音声特徴量抽出部３と、標準音声から作成された音声認識用の標準パターンを記憶している標準パターン記憶部６と、音声特徴量抽出部３から出力される入力音声の音声特徴量と標準パターン記憶部６に記憶されている標準パターンとの類似度を計算して認識結果５を出力するパターン照合部４とから構成されており、標準パターン記憶部６に記憶させる標準パターンは、予め標準音声の特徴パターンを収集し、これを隠れマルコフモデルなどのモデル化手法を用いて作成したものが用いられている。
【０００３】
ところで、この従来例の音声認識装置では、装置の使用環境と標準パターンを作成したときの環境が異なる場合に、利用者の音声と標準パターンとの間に相違が生じることにより、認識率が低下するという問題があった。この問題に対する解決方法として、音声認識装置の使用環境が変化する毎に、標準パターンを学習し直す方法があるが、標準パターンの学習に多大な時間がかかるため実用的ではないという問題がある。
【０００４】
そこで、標準パターンを作成する際に、そのときの装置の使用環境を示す標準パターン作成環境情報を入力部から入力して標準パターンと対応付けて登録し、音声認識時に、標準パターン作成環境情報を表示部で表示して、利用者が表示されている情報から使用環境にあった標準パターンを選択する音声認識装置も提供されている（例えば、特許文献１参照）。
【０００５】
【特許文献１】
特開平８−７６７８４号公報（第１頁左欄第５−１７行）
【０００６】
【発明が解決しようとする課題】
上記のように特許文献１に記載された装置では使用環境を考慮しているが、利用者が使用環境に合った標準パターンを音声認識モードのときに選択するための操作などが必要で面倒であるという問題があった。
【０００７】
本発明は、上述の問題点に鑑みて為されたもので、その目的とするところは、音響環境の異なる複数の空間においても、夫々の音響環境毎に音響モデルの再学習や利用者が標準パターンを選択するための操作を行うことなく、高い音声認識性能を得ることが可能な音声認識装置を提供することにある。
【０００８】
【課題を解決するための手段】
上記目的を達成するために、請求項１の音声認識装置の発明では、音声を入力する音声入力部と、音声入力部からの出力信号をＡ／Ｄ変換するＡ／Ｄ変換部と、Ａ／Ｄ変換部からの出力信号を周波数変換して音声の特徴量を抽出する音声特徴量抽出部と、残響時間が異なる音響環境での標準音声から夫々作成された複数の標準パターンを記憶している標準パターン記憶部と、音声特徴量抽出部から出力される入力音声の音声特徴量と標準パターン記憶部に記憶されている標準パターンとの類似度を計算して認識結果を出力するパターン照合部とを備え、
前記パターン照合部は、前記標準パターン記憶部に記憶されている複数の標準パターンと前記音声特徴量抽出部から出力される入力音声の音声特徴量とを順次比較し、最も類似度の高い認識結果を最終的な認識結果として出力することを特徴とする。
【０００９】
上記請求項１の発明に係る音声認識装置によれば、残響時間の異なる複数の音響環境に対応した標準パターンの中から入力音声に対して最も類似度の高い標準パターンを認識動作時に自動的に選択可能であるので、入力音声を最も高い音声認識性能で認識することが可能な標準パターンを利用して音声認識を行え、その結果高い音声認識性能を得ることができる。
【００１０】
また高い音声認識性能を得るためには、装置が利用される音響環境と類似した環境に対応した標準パターンを有している必要があるが、音声認識性能に強く悪影響を与える音響特性の一要素は残響時間であるので、残響時間の異なる複数の音響環境に対応した標準パターンを用意することにより、ごく少数の標準パターンでもって様々な利用環境に対応することが可能となる。
【００１１】
このように、残響時間が夫々異なる音響環境に対応する複数の標準パターンから、入力音声との類似度が高い標準パターンを選択することにより、少数の標準パターンでも入力音声と標準パターンの不一致による音声認識性能の低下を回避することが可能となる。
【００１２】
請求項２の音声認識装置の発明では、音声を入力する音声入力部と、音声入力部からの出力信号をＡ／Ｄ変換するＡ／Ｄ変換部と、Ａ／Ｄ変換部からの出力信号を周波数変換して音声の特徴量を抽出する音声特徴量抽出部と、残響時間が異なる音響環境での標準音声から夫々作成された複数の標準パターンを記憶している標準パターン記憶部と、音声特徴量抽出部から出力される入力音声の音声特徴量と標準パターン記憶部に記憶されている標準パターンとの類似度を計算して認識結果を出力するパターン照合部とを備えるとともに、従前の音声認識動作時に最も類似度の高かった標準パターンを記憶する選択パターン記憶部を備え、
前記パターン照合部は、前記選択パターン記憶部に記憶されている標準パターンと前記音声特徴量抽出部から出力される入力音声の音声特徴量との類似度の計算結果より認識結果を出力することを特徴とする。
【００１３】
上記請求項２の音声認識装置の発明によれば、音声が入力される度に全ての標準パターンと入力音声の類似度を計算する必要がなくなる。
【００１４】
請求項３の発明の音声認識装置の発明では、音声を入力する音声入力部と、音声入力部からの出力信号をＡ／Ｄ変換するＡ／Ｄ変換部と、Ａ／Ｄ変換部からの出力信号を周波数変換して音声の特徴量を抽出する音声特徴量抽出部と、残響時間が異なる音響環境での標準音声から夫々作成された複数の標準パターンを記憶している標準パターン記憶部と、音声特徴量抽出部から出力される入力音声の音声特徴量と標準パターン記憶部に記憶されている標準パターンとの類似度を計算して認識結果を出力するパターン照合部とを備えるとともに、前記音声入力部に入力する音声に基づいて、直接波と反射波との相関をとることにより、直接波と反射波の音声入力部への到達時間差を計算し、この到達時間差の大小に応じて標準パターン記憶部から最適な標準パターンを選択する到来時間計算部を備え、
前記パターン照合部は、到来時間計算部の計算結果に基づいて選択される標準パターンと前記音声特徴量抽出部から出力される入力音声の音声特徴量との類似度の計算結果より認識結果を出力することを特徴とする。
【００１５】
上記請求項３の音声認識装置の発明によれば、直接波と反射波の音声入力部への到達時間差に応じて、入力音声を認識するのに適した標準パターンを選択することが可能となるので、全ての標準パターンと入力音声との間の類似度を計算する必要がなくなる。
【００１６】
請求項４の音声認識装置の発明では、音声を入力する音声入力部と、音声入力部からの出力信号をＡ／Ｄ変換するＡ／Ｄ変換部と、Ａ／Ｄ変換部からの出力信号を周波数変換して音声の特徴量を抽出する音声特徴量抽出部と、残響時間が異なる音響環境での標準音声から夫々作成された複数の標準パターンを記憶している標準パターン記憶部と、音声特徴量抽出部から出力される入力音声の音声特徴量と標準パターン記憶部に記憶されている標準パターンとの類似度を計算して認識結果を出力するパターン照合部とを備えるとともに、前記各標準パターンに対応した環境の特定のみに用いる各環境パターンを記憶している環境パターン記憶部を備え、
前記パターン照合部は、前記音声特徴量抽出部から出力される入力音声の音声特徴量と環境パターン記憶部に記憶されている各環境パターンとの類似度を計算して最も類似度の高い環境パターンを選択して当該環境パターンに対応する標準パターンと前記音声特徴量抽出部から出力される入力音声の音声特徴量との類似度の計算結果より認識結果を出力することを特徴とする。
【００１７】
上記請求項４の音声認識装置の発明によれば、音声の認識に用いない環境の特定のみに用いる環境パターンの類似度を計算することによって、音声認識に用いる標準パターンを決定するので、音声認識に当たり入力音声の音声特徴量と標準パターンとの類似度の計算を行う場合に比べて計算量を少なくすることが可能である。
【００１８】
請求項５の音声認識装置の発明では、音声を入力する音声入力部と、音声入力部からの出力信号をＡ／Ｄ変換するＡ／Ｄ変換部と、Ａ／Ｄ変換部からの出力信号を周波数変換して音声の特徴量を抽出する音声特徴量抽出部と、残響時間が異なる音響環境での標準音声から夫々作成された複数の標準パターンを記憶している標準パターン記憶部と、音声特徴量抽出部から出力される入力音声の音声特徴量と標準パターン記憶部に記憶されている標準パターンとの類似度を計算して認識結果を出力するパターン照合部と、予め記憶されている指定音を前記各標準パターンに対応する各環境で発生した場合の指定音パターンを記憶している指定音パターン記憶部とを備えるとともに、話者の近傍に配置され、予め用意した指定音を再生して前記音声入力部に入力させるスピーカを具備し、
前記パターン照合部は、前記スピーカで再生された前記指定音の前記音声特徴量抽出部から出力される音声特徴量と前記指定音パターン記憶部に記憶されている各指定音パターンとの間の類似度を計算して最も類似度の高かった指定音パターンと同じ環境の標準パターンを選択し、該標準パターンと前記音声特徴量抽出部から出力される入力音声の音声特徴量との類似度の計算結果より認識結果を出力することを特徴とする。
【００１９】
上記請求項５の音声認識装置の発明によれば、予め用意した指定音の再生音の音声特徴量と、標準パターンを作成した環境でこの指定音を再生した場合の指定パターンとの比較を行うことにより、音声認識に用いる標準パターンを決定するので、入力音声と標準パターンとの類似度を高い精度で計算することが可能となる。
【００２０】
請求項６の音声認識装置の発明では、請求項１乃至５の何れかの発明において、前記標準パターンに対応する音響環境として、装置利用が想定される部屋の環境を用いたことを特徴とする。
【００２１】
上記請求項６の音声認識装置の発明によれば、標準パターンと入力音声の音声特徴量との類似度を、残響時間以外の要素についても高くすることが可能となり、その結果より高い音声認識性能を得ることが可能となる。
【００２２】
請求項７の音声認識装置の発明では、請求項１乃至６の何れかの発明において、前記標準パターンの作成に用いる音声データは、残響時間の異なる複数の音響環境での伝達関数を、残響のない環境で収録された音声に畳み込んだ音声データから成ることを特徴とする。
【００２３】
上記請求項７の音声認識装置の発明によれば、標準パターンの作成に必要となる多量の音声データを、残響時間の異なる複数の音響環境毎に収録する必要がなくなるので、標準パターンの作成が容易になる。また、夫々の標準パターンの作成に同じ音声を使用することが可能であるので、発声のばらつきがなくなり、音響環境の特性だけが異なる標準パターン群を作成することが可能であり、その結果として標準パターンを高い精度で選択可能となり、高い音声認識性能を得ることが可能となる。
【００２４】
【発明の実施の形態】
以下本発明を実施形態により説明する。
【００２５】
（実施形態１）
本実施形態は、図１に示すように音声を入力するためのマイクロフォンからなる音声入力部１と、音声入力部１からの入力音声を例えば量子化ビット数１６、標本化周波数１６ｋＨｚでＡ／Ｄ変換するＡ／Ｄ変換部２と、Ａ／Ｄ変換部２によってＡ／Ｄ変換された音声データを例えば分析フレーム長２５ミリ秒、分析間隔１０ミリ秒で周波数変換し、音声特徴量として例えばメル周波数ケプストラム係数などの音声特徴を抽出する音声特徴量抽出部３と、残響時間の異なる複数の音響環境として、例えば残響時間３４０ミリ秒の可変残響室の環境と、残響時間５５０ミリ秒の小さな和室の環境と、残響時間８１０ミリ秒の大きな和室の環境、残響時間９００ミリ秒の会議室の環境と、残響時間１６８０ミリ秒の可変残響室の環境とを用い、各環境において夫々測定された伝達関数を、無残響室で収録された標準音声に畳み込んだ音声データを用いて作成された、例えば１状態当たりのガウス分布数が１６で音素モデル当たり３状態からなる隠れマルコフモデルによる４３個の音素モデルをからなる複数の標準パターン６ａ〜６ｅを予め記憶する標準パターン記憶部６と、音声特徴量抽出部３で抽出された入力音声の音声特徴量と、標準パターン記憶部６で記憶されている各標準パターン６ａ〜６ｅとの間の類似度を、例えば尤度をフォワードアルゴリズムによって計算して最も類似度の高い標準パターンを選択し、当該標準パターンを構成する４３個の音素モデルからなる認識結果５を出力するパターン照合部４とから成る。
【００２６】
次に本実施形態の動作を説明する。先ず利用者が音声入力部１を構成するマイクロフォンを用いて、例えば「こんにちは」という音声を入力したとする。この入力音声信号は、Ａ／Ｄ変換部２によってＡ／Ｄ変換されたのち、音声特徴量抽出部３によって音声特徴量が抽出される。
【００２７】
この抽出された音声特徴量と伝達関数が畳み込まれた音声データから形成された６ａ〜６ｅの５つの標準パターンとの類似度を順次パターン照合部４において計算し、最も類似度の高い標準パターン、例えば６ａの標準パターン６ａが最も類似度の高い場合、当該標準パターン６ａに含まれる４３の音素モデルから「こんにちは」という音声に対する尤度の大きい（類似度の高い）音素モデルの列が認識結果５として出力されるのである。
【００２８】
而して、本実施形態では、残響時間が異なる音響環境での標準音声から作成された標準パターン６ａ〜６ｅの中から入力音声に対して最も類似度の高い標準パターンが認識動作時に自動的に選択されるので、入力音声を最も高い音声認識性能で認識可能な標準パターンを利用することが可能となる。
【００２９】
また、高い音声認識性能を得るためには、装置が利用される音響環境と類似した環境に対応した標準パターンを有している必要があり、特に音声認識性能に強く悪影響を与える音響特性の一要素が残響時間であるが、本実施形態では、残響時間の異なる複数の音響環境に対応した標準パターン６ａ〜６ｅ群を用意することにより、ごく少数の標準パターンにより様々な利用環境に対応することが可能となる。
【００３０】
そして残響時間が夫々異なる音響環境に対応した複数の標準パターン６ａ〜６ｅと入力音声の音声特徴量との類似度を計算して最も類似度の高い標準パターンによって音声認識を行うことにより、少数の標準パターンでも入力音声と標準パターンの不一致による音声認識性能の低下を回避することが可能となる。
【００３１】
また、音声認識装置が利用される空間は、異なる空間であってもその種類が同じであるならば残響時間は比較的類似しているので、残響時間の異なる複数の音響環境として音声認識装置が利用されることが予想されるような代表的な部屋を利用して標準パターン６ａ〜６ｅを上述のように作成することにより、標準パターン６ａ〜６ｅと入力音声の音声特徴量との類似度を、残響時間以外の環境要素についても高くすることが可能となり、その結果より高い音声認識性能を得られることになる。
【００３２】
また、標準パターン６ａ〜６ｅの作成に必要となる多量の音声データを、残響時間の異なる複数の音響環境毎に収録する必要がなくなるので、標準パターン６ａ〜６ｅの作成が容易になる。更にまた、夫々の標準パターン６ａ〜６ｅの作成に同じ標準音声を使用することが可能であるので、標準パターンの作成音声によるばらつきがなくなり、そのため音響環境の特性だけが異なる標準パターン群を作成することができ、その結果として高い音声認識性能を得ることが可能となる。
【００３３】
（実施形態２）
上記実施形態１の場合には、入力された音声から抽出された音声特徴量と標準パターン記憶部６に記憶してある標準パターン６ａ〜６ｅとの類似度を順次パターン照合部４で計算することで、類似度の最も高い標準パターンを選択する構成であったが、本実施形態は図２にに示すように、標準パターン記憶部６に記憶されている標準パターン６ａ〜６ｅの作成に用いた音声から作成された、例えば１個の環境モデルを１状態あたりのガウス分布数が６４で、環境モデルあたり１状態からなる隠れマルコフモデルによる環境パターン７ａ〜７ｅを記憶した環境パターン記憶部７を備え、パターン照合部４が、入力音声から抽出された音声特徴量と環境パターン記憶部７に記憶されている環境パターン７ａ〜７ｅとの類似度として例えば尤度をフォワードアルゴリズムによって計算して、最も類似度の高い環境パターンを選択するとともにその選択された環境パターンと同一の環境での音声から作成された音声認識用の標準パターンを決定し、この決定された標準パターンと入力音声の音声特徴量とを用いて尤度の大きい（類似度の高い）音素モデル列からなる認識結果５を出力する点に特徴がある。尚図２中図１と同じ構成には同じ符号を付し説明を省略する。
【００３４】
次に本実施形態の動作を説明する。残響時間が例えば４００ミリ秒の環境において、利用者が音声入力部１を構成するマイクロフォンを用いて、例えば「こんにちは」という音声を入力したとする。この入力音声信号は、Ａ／Ｄ変換部２によってＡ／Ｄ変換された後、音声特徴量抽出部３によって音声特徴量が抽出される。この音声特徴量と７ａ〜７ｅの５つの環境パターンとの類似度がパターン照合部４で計算することで最も大きな類似度の環境パターンが選択される。
【００３５】
図３はこの環境パターンの選択処理のフローチャートを示しており、先ず音声特徴量をステップＳ１で入力したパターン照合部４は、音声特徴量とステップＳ２、Ｓ３によって環境パターン記憶部７に記憶されている各環境パターン７ａ〜７ｅとの類似度を順次計算し、その計算が終了すると、計算された類似度が最大だった環境パターンを選択するのである。従って上述のように例えば残響時間４００ミリ秒の環境下で音声入力が為された場合、類似度の高い環境パターンとして環境パターン７ａ（例えば残響時間３４０ミリ秒）が選択される。
【００３６】
次に、パターン照合部４は、選択された環境パターン７ａのパターン名から対応する標準パターン６ａを決定し、この標準パターン６ａと入力音声の音声特徴量とを用いて標準パターン６ａに含まれる４３個の音素モデルから入力音声の「こんにちは」に対する尤度の大きい（類似度の高い）音素モデル列を認識結果５として出力する。
【００３７】
（実施形態３）
上記実施形態２では、音声入力の都度、環境パターン記憶部７に記憶されている全環境パターン７ａ〜７ｅと音声特徴量との類似度をパターン照合部４が計算しているが、初期スタート後等の最初の認識動作時においてのみ実施形態２と同様に音声特徴量と、環境パターン７ａ〜７ｅとの類似度をパターン照合部４が計算し、最も大きな類似度の環境パターンを選択し、この環境パターン名から標準パターンを決定する処理を行うが、このときの標準パターンの決定に用いた環境パターン名を図４に示す選択パターン記憶部８に記憶し、以後の音声認識時には選択パターン記憶部８に記憶している環境パターン名に対応する標準パターンと、音声特徴量とを用いて当該標準パターンに含まれる４３個の音素モデルから入力音声に対する尤度の大きい（類似度の高い）音素モデル列を認識結果５として出力する点で実施形態２と相違する。尚本実施形態では選択パターン記憶部８に記憶されている環境パターン名を、音声認識装置に付設している押し釦スイッチ（図示せず）を操作することでリセットすることがようになっている。勿論音声認識装置に振動検知センサ等の装置の移動を検知するセンサを設け、装置が移動した場合に自動的にリセットさせるようにしても良い。
【００３８】
図５は、本実施形態におけるパターン照合部４の動作のフローチャートを示しており、音声入力部１を構成するマイクロフォンを通じて入力された音声の音声特徴量が音声特徴量抽出部３を通じてパターン照合部４に入力される（ステップＳ１０）と、パターン照合部４は選択パターン記憶部８をチェックして既に環境パターン名が記憶されているか否かを判定し（ステップＳ１１）、記憶されていない場合には最初の音声認識動作と判断し、音声特徴量と環境パターン７ａ〜７ｅとの類似度を順次計算し（ステップＳ１２、Ｓ１３）、全ての環境パターン７ａ〜７ｅに対する類似度の計算が終了すると、類似度が最も高い環境パターンを選択し（ステップＳ１４）、この選択した環境パターン名を選択パターン記憶部８に記憶させる（ステップＳ１５）。そしてこの環境パターン名によって、標準パターンを決定し、上述したように音声特徴量抽出部３から出力される音声入力の音声特徴量と決定された標準パターンとを用いて当該標準パターンに含まれる４３個の音素モデルから入力音声に対する尤度の大きい（類似度の高い）音素モデル列を認識結果５として出力する。
【００３９】
一方ステップＳ１１で既に選択パターン記憶部８で既に環境パターン名が記憶されている場合には、記憶されている環境パターン名に基づいて標準パターンを決定するのである。
【００４０】
（実施形態４）
本実施形態は、空間の残響時間がその空間の容量に比例する性質があり、また空間で発声された音声の音声入力部に入力する音波には、その音源から音声入力部へ直接到達する直接波と、音源から一旦壁などに反射してから音声入力部へ到達する反射波とがあって、両者の到達時間の差が空間の容量に比例する、つまり残響時間が長ければ到達時間差が大きくなる傾向がある点に注目して為されたものである。
【００４１】
つまり本実施形態では、図６に示すようにＡ／Ｄ変換部２によってＡ／Ｄ変換された音声信号から、利用者あるいは指定音９を再生するスピーカ１０から、音声入力部１へ到来する直接波と、一度以上壁などによって反射してから到来する反射波の時間差を、例えば音声信号の先頭区間と時間差△Ｔ秒後の区間との間の相関係数の高さから推測し、この時間差△Ｔとしていくつかの値を用いて上記推測を試し、最も相関係数の大きくなる時間差△Ｔを到来時間として選択し、この到来時間に近似する残響時間の音響環境に対応して作成されている標準パターンを選択する到来時間計算部１１を設けている点に特徴がある。尚時間差△Ｔが大きければ、高いほど残響時間の長い環境に対応した標準パターンが選択される。
【００４２】
そしてパターン照合部４は選択した標準パターンと、音声特徴量抽出部３からの音声特徴量とを用いて当該標準パターンに含まれる４３個の音素モデルから入力音声に対する尤度の大きい（類似度の高い）音素モデル列を認識結果５として出力する。
【００４３】
（実施形態５）
本実施形態では、図７に示すように標準パターン６ａ〜６ｅに対応する環境下で、予め用意した基準音たる指定音の特徴量を夫々指定パターン１２ａ〜１２ｅとして記録している指定音パターン記憶部１２を設け、先ず音声認識に当たって、話者の近傍に配置したスピーカ１０から上記指定音９を再生させ、この再生した指定音の音声特徴量を、音声入力部１を構成するマイクロフォンとＡ／Ｄ変換部２とを介して音声特徴量抽出部３で抽出し、パターン照合部４はこの指定音９の音声特徴量と、指定音パターン記憶部１２で記憶させている指定パターン１２ａ〜１２ｅとの類似度を計算し、最も類似度の高かった指定音パターンに対応する標準パターンを音声認識用の標準パターンとして選択し、その後の話者の入力音声の音声認識には選択された標準パターンを用いて行う。
【００４４】
この場合選択した指定音パターンを、上述した環境パターン選択時と同様に選択パターン記憶部（図示せず）に記憶させ、以後の話者の音声認識時には、その都度記憶している当該指定音パターンに対応する標準パターンを音声認識用として用いるようにすれば、類似度の計算量を少なくすることができる。勿論音声認識装置の移動があったり、使用環境が変化する場合には記憶をリセットし、上記の指定音の再生によって、類似度の最も高い指定音パターンを選択する処理を行うようにすれば良い。
【００４５】
尚音声認識装置では、一般に「言語モデル」も使用されることが多いが、本発明の音声認識装置においても言語モデルを備えるようにしても良い。
【００４６】
つまり、例えば「おはよう」という言葉を認識する際に、「おはよお」などと認識する可能性がある。
【００４７】
しかし実際には「おはよお」という単語は日常生活にはないので、ここで、例えば「おはよう」や「オハイオ」などの登録されている単語の日常生活における出現頻度を言語モデルと呼ばれる確率モデルに記憶しておき、この確率値と類似度との積を最終的な類似度とするようにしても良い。
【００４８】
【発明の効果】
請求項１の音声認識装置の発明は、音声を入力する音声入力部と、音声入力部からの出力信号をＡ／Ｄ変換するＡ／Ｄ変換部と、Ａ／Ｄ変換部からの出力信号を周波数変換して音声の特徴量を抽出する音声特徴量抽出部と、残響時間が異なる音響環境での標準音声から夫々作成された複数の標準パターンを記憶している標準パターン記憶部と、音声特徴量抽出部から出力される入力音声の音声特徴量と標準パターン記憶部に記憶されている標準パターンとの類似度を計算して認識結果を出力するパターン照合部とを備え、該パターン照合部が、前記標準パターン記憶部に記憶されている複数の標準パターンと前記音声特徴量抽出部から出力される入力音声の音声特徴量とを順次比較し、最も類似度の高い認識結果を最終的な認識結果として出力するので、利用者が環境に応じた標準パターンを選択する操作を行うことなく、残響時間が夫々異なる音響環境に対応した複数の標準パターンの中から入力音声に対して最も類似度の高い標準パターンを認識動作時に自動的に選択することができるものであって、入力音声を最も高い音声認識性能で認識可能な標準パターンを利用することが可能なため高い音声認識性能が得られ、特に残響時間の異なる複数の音響環境に適した標準パターン群を用意することにより、ごく少数の標準パターンでもって様々な利用環境に対応することが可能となる上に、入力音声と標準パターンの不一致による音声認識性能の低下を回避することが可能となるという効果がある。
【００４９】
請求項２の音声認識装置の発明は、音声を入力する音声入力部と、音声入力部からの出力信号をＡ／Ｄ変換するＡ／Ｄ変換部と、Ａ／Ｄ変換部からの出力信号を周波数変換して音声の特徴量を抽出する音声特徴量抽出部と、残響時間が異なる音響環境での標準音声から夫々作成された複数の標準パターンを記憶している標準パターン記憶部と、音声特徴量抽出部から出力される入力音声の音声特徴量と標準パターン記憶部に記憶されている標準パターンとの類似度を計算して認識結果を出力するパターン照合部とを備えるとともに、従前の音声認識動作時に最も類似度の高かった標準パターンを記憶する選択パターン記憶部を備え、前記パターン照合部が、前記選択パターン記憶部に記憶されている標準パターンと前記音声特徴量抽出部から出力される入力音声の音声特徴量との類似度の計算結果より認識結果を出力するので、上述の請求項１の発明と同様に高い音声認識性能が得られる上に、音声が入力される度に全ての標準パターンと入力音声の類似度を計算する必要がなくなって、認識結果を高速に出力できるという効果がある。
【００５０】
請求項３の発明の音声認識装置の発明は、音声を入力する音声入力部と、音声入力部からの出力信号をＡ／Ｄ変換するＡ／Ｄ変換部と、Ａ／Ｄ変換部からの出力信号を周波数変換して音声の特徴量を抽出する音声特徴量抽出部と、残響時間が異なる音響環境での標準音声から夫々作成された複数の標準パターンを記憶している標準パターン記憶部と、音声特徴量抽出部から出力される入力音声の音声特徴量と標準パターン記憶部に記憶されている標準パターンとの類似度を計算して認識結果を出力するパターン照合部とを備えるとともに、前記音声入力部に入力する音声に基づいて、直接波と反射波との相関をとることにより、直接波と反射波の音声入力部への到達時間差を計算し、この到達時間差の大小に応じて標準パターン記憶部から最適な標準パターンを選択する到来時間計算部を備え、前記パターン照合部が、到来時間計算部の計算結果に基づいて選択される標準パターンと前記音声特徴量抽出部から出力される入力音声の音声特徴量との類似度の計算結果より認識結果を出力するので、上述の請求項１の発明と同様に高い音声認識性能が得られる上に、直接波と反射波の音声入力部への到達時間差に応じて、入力音声を認識するのに適した標準パターンを選択することが可能となって、全ての標準パターンと入力音声との間の類似度を計算する必要がなくなり、認識結果を高速に出力できるという効果がある。
【００５１】
請求項４の音声認識装置の発明は、音声を入力する音声入力部と、音声入力部からの出力信号をＡ／Ｄ変換するＡ／Ｄ変換部と、Ａ／Ｄ変換部からの出力信号を周波数変換して音声の特徴量を抽出する音声特徴量抽出部と、残響時間が異なる音響環境での標準音声から夫々作成された複数の標準パターンを記憶している標準パターン記憶部と、音声特徴量抽出部から出力される入力音声の音声特徴量と標準パターン記憶部に記憶されている標準パターンとの類似度を計算して認識結果を出力するパターン照合部とを備えるとともに、前記各標準パターンに対応した環境の特定のみに用いる各環境パターンを記憶している環境パターン記憶部を備え、前記パターン照合部が、前記音声特徴量抽出部から出力される入力音声の音声特徴量と環境パターン記憶部に記憶されている各環境パターンとの類似度を計算して最も類似度の高い環境パターンを選択して当該環境パターンに対応する標準パターンと前記音声特徴量抽出部から出力される入力音声の音声特徴量との類似度の計算結果より認識結果を出力するので、上述の請求項１の発明と同様に高い音声認識性能が得られる上に、音声の認識に用いない環境の特定のみに用いる環境パターンの類似度を計算することによって、音声認識に用いる標準パターンを決定するので、音声認識に当たり入力音声の音声特徴量と標準パターンとの類似度の計算を行う場合に比べて計算量を少なくすることでき、その結果認識結果を高速に出力できるという効果がある。
【００５２】
請求項５の音声認識装置の発明は、音声を入力する音声入力部と、音声入力部からの出力信号をＡ／Ｄ変換するＡ／Ｄ変換部と、Ａ／Ｄ変換部からの出力信号を周波数変換して音声の特徴量を抽出する音声特徴量抽出部と、残響時間が異なる音響環境での標準音声から夫々作成された複数の標準パターンを記憶している標準パターン記憶部と、音声特徴量抽出部から出力される入力音声の音声特徴量と標準パターン記憶部に記憶されている標準パターンとの類似度を計算して認識結果を出力するパターン照合部と、予め記憶されている指定音を前記各標準パターンに対応する各環境で発生した場合の指定音パターンを記憶している指定音パターン記憶部とを備えるとともに、話者の近傍に配置され、予め用意した指定音を再生して前記音声入力部に入力させるスピーカを具備し、前記パターン照合部が、前記スピーカで再生された前記指定音の前記音声特徴量抽出部から出力される音声特徴量と前記指定音パターン記憶部に記憶されている各指定音パターンとの間の類似度を計算して最も類似度の高かった指定音パターンと同じ環境の標準パターンを選択し、該標準パターンと前記音声特徴量抽出部から出力される入力音声の音声特徴量との類似度の計算結果より認識結果を出力するので、入力音声と標準パターンとの類似度を高い精度で計算することが可能となり、より高い音声認識性能が得られるという効果がある。
【００５３】
請求項６の音声認識装置の発明は、請求項１乃至５の何れかの発明において、前記標準パターンに対応する音響環境として、装置利用が想定される部屋の環境を用いたので、標準パターンと入力音声の音声特徴量との類似度を、残響時間以外の要素についても高くすることが可能となり、その結果より高い音声認識性能を得られる。
【００５４】
請求項７の音声認識装置の発明は、請求項１乃至６の何れかの発明において、前記標準パターンの作成に用いる音声データが、残響時間の異なる複数の音響環境での伝達関数を、残響のない環境で収録された音声に畳み込んだ音声データから成るので、標準パターンの作成に必要となる多量の音声データを、残響時間の異なる複数の音響環境毎に収録する必要がなくなり、標準パターンの作成が容易になり、また、夫々の標準パターンの作成に同じ音声を使用することが可能となるあるので、発声のばらつきがなくなり、音響環境の特性だけが異なる標準パターン群を作成することが可能であり、その結果として標準パターンを高い精度で選択可能となり、高い音声認識性能を得られる。
【図面の簡単な説明】
【図１】本発明の実施形態１の構成図である。
【図２】本発明の実施形態２の構成図である。
【図３】同上の動作説明用フローチャートである。
【図４】本発明の実施形態３の構成図である。
【図５】同上の動作説明用フローチャートである。
【図６】本発明の実施形態４の構成図である。
【図７】本発明の実施形態５の構成図である。
【図８】従来例の構成図である。
【符号の説明】
１　音声入力部
２　Ａ／Ｄ変換部
３　音声特徴量抽出部
４　パターン照合部
５　認識結果
６　標準パターン記憶部
６ａ〜６ｅ　標準パターン

[0001]
[Technical field of the invention]
The present invention relates to a voice recognition device that recognizes voice input by a user.
[0002]
[Prior art]
Speech recognition technology plays an important role in realizing a good human interface. A device configured as shown in FIG. 8 has conventionally been provided as a speech recognition device applying this technology. The device comprises a voice input unit 1 consisting of a microphone for inputting voice; an A/D conversion unit 2 that A/D-converts the output signal of the voice input unit 1; a voice feature amount extraction unit 3 that frequency-converts the output signal of the A/D conversion unit 2 and extracts a voice feature amount; a standard pattern storage unit 6 that stores a standard pattern for speech recognition created from standard voice; and a pattern matching unit 4 that calculates the similarity between the voice feature amount of the input voice output from the voice feature amount extraction unit 3 and the standard pattern stored in the standard pattern storage unit 6, and outputs a recognition result 5. The standard pattern stored in the standard pattern storage unit 6 is created by collecting feature patterns of standard voice in advance and modeling them with a technique such as the hidden Markov model.
[0003]
In this conventional speech recognition device, however, when the environment in which the device is used differs from the environment in which the standard pattern was created, a discrepancy arises between the user's voice and the standard pattern, and the recognition rate falls. One solution is to re-learn the standard pattern every time the use environment of the speech recognition device changes, but learning a standard pattern takes so much time that this is not practical.
[0004]
Accordingly, a speech recognition device has also been provided in which, when a standard pattern is created, standard pattern creation environment information indicating the use environment of the device at that time is entered from an input unit and registered in association with the standard pattern. At recognition time, the standard pattern creation environment information is shown on a display unit, and the user selects from the displayed information a standard pattern suited to the use environment (for example, see Patent Document 1).
[0005]
[Patent Document 1]
JP-A-8-76784 (page 1, left column, lines 5-17)
[0006]
[Problems to be solved by the invention]
As described above, the device described in Patent Document 1 takes the use environment into account, but it has the problem that, in the voice recognition mode, the user must perform an operation to select a standard pattern suited to the use environment, which is troublesome.
[0007]
The present invention has been made in view of the above problems, and its object is to provide a speech recognition device that can obtain high speech recognition performance even in a plurality of spaces having different acoustic environments, without re-learning the acoustic model for each acoustic environment and without requiring the user to perform an operation to select a standard pattern.
[0008]
[Means for Solving the Problems]
In order to achieve the above object, the speech recognition device according to the first aspect of the present invention comprises: a voice input unit for inputting voice; an A/D conversion unit that A/D-converts the output signal of the voice input unit; a voice feature amount extraction unit that frequency-converts the output signal of the A/D conversion unit and extracts a voice feature amount; a standard pattern storage unit that stores a plurality of standard patterns, each created from standard voice in an acoustic environment with a different reverberation time; and a pattern matching unit that calculates the similarity between the voice feature amount of the input voice output from the voice feature amount extraction unit and the standard patterns stored in the standard pattern storage unit and outputs a recognition result,
wherein the pattern matching unit sequentially compares the plurality of standard patterns stored in the standard pattern storage unit with the voice feature amount of the input voice output from the voice feature amount extraction unit, and outputs the recognition result with the highest similarity as the final recognition result.
[0009]
According to the speech recognition device of the first aspect, the standard pattern with the highest similarity to the input voice can be selected automatically, during the recognition operation, from among the standard patterns corresponding to a plurality of acoustic environments with different reverberation times. Speech recognition can therefore be performed with the standard pattern best able to recognize the input voice, and as a result high speech recognition performance is obtained.
[0010]
To obtain high speech recognition performance, the device must have a standard pattern corresponding to an environment similar to the acoustic environment in which it is used. Because reverberation time is one element of the acoustic characteristics that has a strong adverse effect on speech recognition performance, preparing standard patterns for a plurality of acoustic environments with different reverberation times makes it possible to cope with various use environments with only a very small number of standard patterns.
[0011]
Thus, by selecting, from a plurality of standard patterns corresponding to acoustic environments with different reverberation times, a standard pattern with high similarity to the input voice, the drop in speech recognition performance caused by a mismatch between the input voice and the standard pattern can be avoided even with a small number of standard patterns.
[0012]
The speech recognition device according to the second aspect of the present invention comprises: a voice input unit for inputting voice; an A/D conversion unit that A/D-converts the output signal of the voice input unit; a voice feature amount extraction unit that frequency-converts the output signal of the A/D conversion unit and extracts a voice feature amount; a standard pattern storage unit that stores a plurality of standard patterns, each created from standard voice in an acoustic environment with a different reverberation time; and a pattern matching unit that calculates the similarity between the voice feature amount of the input voice output from the voice feature amount extraction unit and the standard patterns stored in the standard pattern storage unit and outputs a recognition result; and further comprises a selected pattern storage unit that stores the standard pattern that had the highest similarity in a previous speech recognition operation,
wherein the pattern matching unit outputs the recognition result from the result of calculating the similarity between the standard pattern stored in the selected pattern storage unit and the voice feature amount of the input voice output from the voice feature amount extraction unit.
[0013]
According to the second aspect of the present invention, it is not necessary to calculate the similarity between all the standard patterns and the input voice each time a voice is input.
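The caching behaviour of the second aspect can be sketched as follows. This is a minimal illustration only, assuming each standard pattern can be wrapped as a scoring function that returns a larger value for a closer match. The names `CachedPatternSelector`, `score_fns`, `evaluations`, and `reset` are hypothetical and do not come from the patent, and the toy scoring functions stand in for real likelihood computations.

```python
from typing import Callable, Dict, List, Optional, Sequence

Features = Sequence[float]
ScoreFn = Callable[[Features], float]  # hypothetical: higher means more similar


class CachedPatternSelector:
    """Remember the best-matching standard pattern from an earlier utterance."""

    def __init__(self, score_fns: Dict[str, ScoreFn]) -> None:
        self.score_fns = score_fns
        # Plays the role of the selected pattern storage unit.
        self.selected: Optional[str] = None
        # Bookkeeping for this example only: which patterns were scored.
        self.evaluations: List[str] = []

    def _score(self, name: str, features: Features) -> float:
        self.evaluations.append(name)
        return self.score_fns[name](features)

    def recognize(self, features: Features) -> str:
        if self.selected is None:
            # First utterance: score every standard pattern once.
            scores = {name: self._score(name, features) for name in self.score_fns}
            self.selected = max(scores, key=lambda name: scores[name])
        else:
            # Later utterances: score only the remembered pattern.
            self._score(self.selected, features)
        return self.selected

    def reset(self) -> None:
        """Forget the selection, for example after the device is moved."""
        self.selected = None


# Toy scoring functions: similarity is the negative distance to a reference value.
selector = CachedPatternSelector({
    "6a": lambda f: -abs(sum(f) / len(f) - 0.34),
    "6b": lambda f: -abs(sum(f) / len(f) - 0.55),
    "6c": lambda f: -abs(sum(f) / len(f) - 0.81),
})
first = selector.recognize([0.5, 0.6])   # scores all three patterns
second = selector.recognize([0.5, 0.6])  # scores only the cached pattern
```

After the first call the selector evaluates one pattern per utterance instead of all of them, which is the saving the paragraph above describes.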
[0014]
The speech recognition device according to the third aspect of the present invention comprises: a voice input unit for inputting voice; an A/D conversion unit that A/D-converts the output signal of the voice input unit; a voice feature amount extraction unit that frequency-converts the output signal of the A/D conversion unit and extracts a voice feature amount; a standard pattern storage unit that stores a plurality of standard patterns, each created from standard voice in an acoustic environment with a different reverberation time; and a pattern matching unit that calculates the similarity between the voice feature amount of the input voice output from the voice feature amount extraction unit and the standard patterns stored in the standard pattern storage unit and outputs a recognition result; and further comprises an arrival time calculation unit that, on the basis of the voice input to the voice input unit, takes the correlation between the direct wave and the reflected wave to calculate the difference in their arrival times at the voice input unit, and selects the optimum standard pattern from the standard pattern storage unit according to the magnitude of this arrival time difference,
wherein the pattern matching unit outputs the recognition result from the result of calculating the similarity between the standard pattern selected on the basis of the calculation result of the arrival time calculation unit and the voice feature amount of the input voice output from the voice feature amount extraction unit.
[0015]
According to the third aspect of the present invention, it is possible to select a standard pattern suitable for recognizing an input voice according to a difference in arrival time of a direct wave and a reflected wave to a voice input unit. Therefore, there is no need to calculate the similarity between all the standard patterns and the input speech.
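One way to picture the arrival time calculation is the sketch below, which correlates the leading segment of the signal with segments starting at several candidate lags and keeps the lag with the largest correlation coefficient. It is an illustration under stated assumptions, not the patent's implementation: the function names, the candidate lags, the synthetic test signal, and above all the delay boundaries that map a delay to a pattern name are invented for the example.

```python
import math
import random
from typing import List, Sequence


def correlation(a: Sequence[float], b: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length segments."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0.0 or var_b == 0.0:
        return 0.0  # a silent segment carries no evidence
    return cov / math.sqrt(var_a * var_b)


def estimate_delay(signal: Sequence[float], candidate_lags: Sequence[int],
                   segment_len: int) -> int:
    """Return the candidate lag (in samples) whose segment best matches the head."""
    head = signal[:segment_len]
    return max(candidate_lags,
               key=lambda lag: correlation(head, signal[lag:lag + segment_len]))


def pattern_for_delay(delay_s: float, boundaries: Sequence[float],
                      names: Sequence[str]) -> str:
    """Longer delays map to patterns for longer reverberation times."""
    for bound, name in zip(boundaries, names):
        if delay_s <= bound:
            return name
    return names[-1]


# Synthetic input: a noise burst (direct wave) plus one weaker copy 80 samples later.
rng = random.Random(0)
burst = [rng.uniform(-1.0, 1.0) for _ in range(64)]
signal: List[float] = [0.0] * 256
for i, s in enumerate(burst):
    signal[i] += s
    signal[i + 80] += 0.6 * s

lag_samples = estimate_delay(signal, [40, 80, 120, 160], segment_len=64)
delay_s = lag_samples / 16000  # assuming 16 kHz sampling
# Hypothetical boundaries in seconds; the patent gives no numeric thresholds.
chosen = pattern_for_delay(delay_s, [0.003, 0.006, 0.009, 0.012],
                           ["6a", "6b", "6c", "6d", "6e"])
```

Lag 0 is deliberately left out of the candidates, since the head segment always correlates perfectly with itself.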
[0016]
The speech recognition device according to the fourth aspect of the present invention comprises: a voice input unit for inputting voice; an A/D conversion unit that A/D-converts the output signal of the voice input unit; a voice feature amount extraction unit that frequency-converts the output signal of the A/D conversion unit and extracts a voice feature amount; a standard pattern storage unit that stores a plurality of standard patterns, each created from standard voice in an acoustic environment with a different reverberation time; and a pattern matching unit that calculates the similarity between the voice feature amount of the input voice output from the voice feature amount extraction unit and the standard patterns stored in the standard pattern storage unit and outputs a recognition result; and further comprises an environment pattern storage unit that stores environment patterns, each used only to identify the environment corresponding to one of the standard patterns,
wherein the pattern matching unit calculates the similarity between the voice feature amount of the input voice output from the voice feature amount extraction unit and each environment pattern stored in the environment pattern storage unit, selects the environment pattern with the highest similarity, and outputs the recognition result from the result of calculating the similarity between the standard pattern corresponding to that environment pattern and the voice feature amount of the input voice output from the voice feature amount extraction unit.
[0017]
According to the fourth aspect of the present invention, the standard pattern used for speech recognition is determined by calculating the similarity of environment patterns, which are not used for recognizing speech and serve only to identify the environment. The amount of calculation can therefore be made smaller than when the similarity between the voice feature amount of the input voice and the standard patterns is calculated at recognition time.
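As a rough illustration of environment identification, the sketch below scores the input frames against one small Gaussian mixture per environment and then looks up the matching standard pattern. It assumes that each environment pattern behaves like a single-state model, so that the log-likelihood of an utterance is the sum of per-frame mixture log-densities. The mixture parameters, the two-dimensional features, and the names `gmm_log_likelihood`, `select_standard_pattern`, and `env_to_standard` are all made up for the example.

```python
import math
from typing import Dict, List, Sequence, Tuple

# One mixture component: (weight, means, variances) with diagonal covariance.
Component = Tuple[float, Sequence[float], Sequence[float]]


def log_gaussian(x: Sequence[float], means: Sequence[float],
                 variances: Sequence[float]) -> float:
    """Log density of a diagonal-covariance Gaussian at x."""
    return sum(-0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, means, variances))


def gmm_log_likelihood(frames: Sequence[Sequence[float]],
                       gmm: Sequence[Component]) -> float:
    """Sum over frames of the log mixture density (log-sum-exp for stability)."""
    total = 0.0
    for x in frames:
        terms = [math.log(w) + log_gaussian(x, m, v) for w, m, v in gmm]
        peak = max(terms)
        total += peak + math.log(sum(math.exp(t - peak) for t in terms))
    return total


def select_standard_pattern(frames: Sequence[Sequence[float]],
                            environment_patterns: Dict[str, List[Component]],
                            env_to_standard: Dict[str, str]) -> str:
    """Pick the most likely environment, then return its standard pattern name."""
    best_env = max(environment_patterns,
                   key=lambda name: gmm_log_likelihood(frames, environment_patterns[name]))
    return env_to_standard[best_env]


# Toy environment patterns with two components each (illustrative values only).
environment_patterns = {
    "7a": [(0.5, [0.0, 0.0], [1.0, 1.0]), (0.5, [1.0, 1.0], [1.0, 1.0])],
    "7b": [(0.5, [3.0, 3.0], [1.0, 1.0]), (0.5, [4.0, 4.0], [1.0, 1.0])],
}
env_to_standard = {"7a": "6a", "7b": "6b"}
frames = [[3.2, 2.9], [3.8, 4.1]]
standard_name = select_standard_pattern(frames, environment_patterns, env_to_standard)
```

Only the chosen standard pattern then needs to be evaluated for the recognition itself, which is where the reduction in calculation comes from.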
[0018]
The speech recognition device according to the fifth aspect of the present invention comprises: a voice input unit for inputting voice; an A/D conversion unit that A/D-converts the output signal of the voice input unit; a voice feature amount extraction unit that frequency-converts the output signal of the A/D conversion unit and extracts a voice feature amount; a standard pattern storage unit that stores a plurality of standard patterns, each created from standard voice in an acoustic environment with a different reverberation time; a pattern matching unit that calculates the similarity between the voice feature amount of the input voice output from the voice feature amount extraction unit and the standard patterns stored in the standard pattern storage unit and outputs a recognition result; and a designated sound pattern storage unit that stores the designated sound patterns obtained when a designated sound stored in advance is produced in each of the environments corresponding to the standard patterns; and further comprises a loudspeaker, placed near the talker, that reproduces the designated sound prepared in advance and inputs it to the voice input unit,
wherein the pattern matching unit calculates the similarity between the voice feature amount, output from the voice feature amount extraction unit, of the designated sound reproduced by the loudspeaker and each designated sound pattern stored in the designated sound pattern storage unit, selects the standard pattern of the same environment as the designated sound pattern with the highest similarity, and outputs the recognition result from the result of calculating the similarity between that standard pattern and the voice feature amount of the input voice output from the voice feature amount extraction unit.
[0019]
According to the fifth aspect of the present invention, the standard pattern used for speech recognition is determined by comparing the voice feature amount of the reproduced designated sound, prepared in advance, with the designated sound patterns obtained when that designated sound was reproduced in the environments in which the standard patterns were created, so the similarity between the input voice and the standard pattern can be calculated with high accuracy.
[0020]
The speech recognition device according to the sixth aspect of the present invention is characterized in that, in any one of the first to fifth aspects, the environment of a room in which the device is expected to be used is used as the acoustic environment corresponding to each standard pattern.
[0021]
According to the sixth aspect of the present invention, the similarity between the standard pattern and the voice feature amount of the input voice can be increased for elements other than the reverberation time as well, and as a result higher speech recognition performance can be obtained.
[0022]
The speech recognition device according to the seventh aspect of the present invention is characterized in that, in any one of the first to sixth aspects, the voice data used to create the standard patterns consists of voice data obtained by convolving the transfer functions of a plurality of acoustic environments with different reverberation times into voice recorded in a reverberation-free environment.
[0023]
According to the seventh aspect of the present invention, the large amount of voice data needed to create the standard patterns does not have to be recorded in each of the acoustic environments with different reverberation times, so the standard patterns become easy to create. In addition, because the same voice can be used to create every standard pattern, variation in utterance is eliminated and a group of standard patterns differing only in the characteristics of the acoustic environment can be created. As a result, the standard pattern can be selected with high accuracy and high speech recognition performance can be obtained.
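The data preparation described here amounts to convolving one dry recording with a separate room response per environment. The sketch below shows the idea with a direct time-domain convolution, treating each transfer function as a sampled impulse response. The sample values and room names are invented for illustration, and a practical system would typically use an FFT-based convolution on much longer signals.

```python
from typing import Dict, List, Sequence


def convolve(dry: Sequence[float], impulse_response: Sequence[float]) -> List[float]:
    """Full linear convolution of a dry signal with a room impulse response."""
    out = [0.0] * (len(dry) + len(impulse_response) - 1)
    for i, s in enumerate(dry):
        for j, h in enumerate(impulse_response):
            out[i + j] += s * h
    return out


# The same reverberation-free utterance is reused for every environment.
dry_speech = [1.0, 0.5, 0.0, 0.0]

# Hypothetical impulse responses: a direct path plus one reflection each.
room_responses: Dict[str, List[float]] = {
    "short_reverb": [1.0, 0.0, 0.5],
    "long_reverb": [1.0, 0.0, 0.0, 0.0, 0.5],
}

# One reverberant training signal per environment, all from the same voice.
training_speech = {name: convolve(dry_speech, h) for name, h in room_responses.items()}
```

Because every environment is derived from the same `dry_speech`, any difference between the resulting signals is due to the room response alone.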
[0024]
[Embodiments of the invention]
Hereinafter, the present invention will be described with reference to embodiments.
[0025]
(Embodiment 1)
As shown in FIG. 1, the present embodiment comprises a voice input unit 1, an A/D conversion unit 2, a voice feature amount extraction unit 3, a standard pattern storage unit 6, and a pattern matching unit 4. The voice input unit 1 consists of a microphone for inputting voice. The A/D conversion unit 2 A/D-converts the input voice from the voice input unit 1 with, for example, 16 quantization bits and a sampling frequency of 16 kHz. The voice feature amount extraction unit 3 frequency-converts the A/D-converted voice data with, for example, an analysis frame length of 25 milliseconds and an analysis interval of 10 milliseconds, and extracts voice features such as mel-frequency cepstrum coefficients as the voice feature amount. The standard pattern storage unit 6 stores in advance a plurality of standard patterns 6a to 6e. As the acoustic environments with different reverberation times, the following are used, for example: a variable reverberation room with a reverberation time of 340 milliseconds, a small Japanese-style room with a reverberation time of 550 milliseconds, a large Japanese-style room with a reverberation time of 810 milliseconds, a conference room with a reverberation time of 900 milliseconds, and a variable reverberation room with a reverberation time of 1680 milliseconds. The transfer function measured in each environment is convolved into standard voice recorded in an anechoic room, and each standard pattern is created from the resulting voice data as, for example, 43 phoneme models based on hidden Markov models with three states per phoneme model and 16 Gaussian distributions per state. The pattern matching unit 4 calculates the similarity between the voice feature amount of the input voice extracted by the voice feature amount extraction unit 3 and each of the standard patterns 6a to 6e stored in the standard pattern storage unit 6, for example as a likelihood computed by the forward algorithm, selects the standard pattern with the highest similarity, and outputs a recognition result 5 composed of the 43 phoneme models constituting that standard pattern.
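The example analysis settings translate into concrete frame sizes: at 16 kHz, a 25 millisecond frame is 400 samples and a 10 millisecond interval is 160 samples. The short sketch below splits a signal into such frames. The function name `frame_signal` and the choice to drop a trailing partial frame are assumptions made for the example, since the text does not say how the end of the signal is handled.

```python
from typing import List, Sequence


def frame_signal(samples: Sequence[float], sample_rate_hz: int = 16000,
                 frame_ms: int = 25, hop_ms: int = 10) -> List[Sequence[float]]:
    """Split a signal into overlapping analysis frames, dropping any partial tail."""
    frame_len = sample_rate_hz * frame_ms // 1000  # 400 samples at 16 kHz
    hop_len = sample_rate_hz * hop_ms // 1000      # 160 samples at 16 kHz
    last_start = len(samples) - frame_len
    return [samples[start:start + frame_len]
            for start in range(0, last_start + 1, hop_len)]


one_second = [0.0] * 16000
frames = frame_signal(one_second)
frame_count = len(frames)  # 1 + (16000 - 400) // 160
```

Each frame would then be passed to the feature extraction step, for example to compute cepstrum coefficients.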
[0026]
Next, the operation of the present embodiment will be described. First, suppose that the user speaks, for example, "Hello" into the microphone constituting the voice input unit 1. The input voice signal is A/D-converted by the A/D converter 2, after which the voice feature amount extraction unit 3 extracts the voice feature amount.
[0027]
The pattern matching unit 4 sequentially calculates the similarity between the extracted voice feature amount and each of the five standard patterns 6a to 6e created from voice data convolved with the transfer functions, and finds the standard pattern with the highest similarity. If, for example, the standard pattern 6a has the highest similarity, a phoneme model sequence having a large likelihood (high similarity) for the utterance "Hello" is formed from the 43 phoneme models included in the standard pattern 6a and is output as the recognition result 5.
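The selection described here (score the input against every standard pattern, keep the best) can be sketched as follows. This is an illustration under strong assumptions: the description uses Gaussian-mixture phoneme HMMs, whereas the toy models below are discrete-observation HMMs chosen only to keep the example short, and all names (`forward_log_likelihood`, `select_standard_pattern`, the model dictionaries and their probabilities) are hypothetical.

```python
import math

NEG_INF = float("-inf")


def safe_log(p):
    """Natural log that maps zero probability to -inf instead of raising."""
    return math.log(p) if p > 0.0 else NEG_INF


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v))) over a list of log values."""
    peak = max(values)
    if peak == NEG_INF:
        return NEG_INF
    return peak + math.log(sum(math.exp(v - peak) for v in values))


def forward_log_likelihood(model, observations):
    """Forward algorithm in the log domain for a discrete-observation HMM."""
    n_states = len(model["initial"])
    alpha = [
        safe_log(model["initial"][s])
        + safe_log(model["emission"][s].get(observations[0], 0.0))
        for s in range(n_states)
    ]
    for symbol in observations[1:]:
        alpha = [
            log_sum_exp([alpha[p] + safe_log(model["transition"][p][s])
                         for p in range(n_states)])
            + safe_log(model["emission"][s].get(symbol, 0.0))
            for s in range(n_states)
        ]
    return log_sum_exp(alpha)


def select_standard_pattern(patterns, observations):
    """Score every pattern in turn and return the best name with all scores."""
    scores = {name: forward_log_likelihood(model, observations)
              for name, model in patterns.items()}
    return max(scores, key=scores.get), scores


# Toy stand-ins for two of the standard patterns (values are made up).
patterns = {
    "6a": {"initial": [1.0, 0.0],
           "transition": [[0.7, 0.3], [0.0, 1.0]],
           "emission": [{"x": 0.9, "y": 0.1}, {"x": 0.2, "y": 0.8}]},
    "6b": {"initial": [1.0, 0.0],
           "transition": [[0.7, 0.3], [0.0, 1.0]],
           "emission": [{"x": 0.5, "y": 0.5}, {"x": 0.5, "y": 0.5}]},
}
best, scores = select_standard_pattern(patterns, ["x", "x", "y", "y"])
print(best)
```

The sketch covers only the comparison of likelihoods across patterns; producing the phoneme model sequence for the recognition result 5 is outside its scope.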
[0028]
Thus, in the present embodiment, the standard pattern with the highest similarity to the input voice is automatically selected during the recognition operation from among the standard patterns 6a to 6e created from standard voices in acoustic environments with different reverberation times, so the input voice can be recognized using the standard pattern that gives the highest voice recognition performance.
[0029]
Also, obtaining high speech recognition performance requires a standard pattern corresponding to an environment similar to the acoustic environment in which the device is used. The element of the acoustic environment considered here is the reverberation time, and in this embodiment, preparing a group of standard patterns 6a to 6e corresponding to a plurality of acoustic environments having different reverberation times makes it possible to cope with various use environments with a very small number of standard patterns.
[0030]
Furthermore, by calculating the similarity between the voice feature amount of the input voice and each of the standard patterns 6a to 6e corresponding to acoustic environments having different reverberation times, and performing voice recognition with the standard pattern having the highest similarity, a decrease in voice recognition performance due to mismatch between the input voice and the standard pattern can be avoided even with a small number of standard patterns.
[0031]
In addition, spaces of the same type tend to have relatively similar reverberation times even when they are different spaces. By creating the standard patterns 6a to 6e as described above from representative rooms in which the voice recognition device is expected to be used, the similarity between the standard patterns 6a to 6e and the voice feature amount of the input voice can be increased for environmental factors other than reverberation time as well, and as a result higher speech recognition performance can be obtained.
[0032]
Further, because the large amount of voice data required to create the standard patterns 6a to 6e does not have to be recorded separately in each of the acoustic environments having different reverberation times, the standard patterns 6a to 6e are easy to create. Moreover, because the same standard voice can be used to create each of the standard patterns 6a to 6e, no variation due to differences in utterance arises among the standard patterns, so a group of standard patterns that differ only in the characteristics of the acoustic environment can be created. As a result, high speech recognition performance can be obtained.
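The data-creation idea above (one anechoic recording, several measured room responses) can be illustrated with a direct time-domain convolution. This is a sketch only: it assumes each transfer function is available as an impulse response, and the short sample lists and the names `convolve`, `dry_voice` and `room_responses` are invented for illustration.

```python
def convolve(dry, impulse_response):
    """Direct-form linear convolution of two sample sequences."""
    out = [0.0] * (len(dry) + len(impulse_response) - 1)
    for i, x in enumerate(dry):
        for j, h in enumerate(impulse_response):
            out[i + j] += x * h
    return out


# The same dry (anechoic) voice is reused for every environment, so the
# resulting training signals differ only in the room characteristics.
dry_voice = [1.0, 0.5, 0.0, 0.0]
room_responses = {           # hypothetical, heavily shortened responses
    "6a": [1.0, 0.0, 0.3],
    "6e": [1.0, 0.0, 0.6, 0.0, 0.4],
}
training_voice = {name: convolve(dry_voice, response)
                  for name, response in room_responses.items()}
print(training_voice["6a"])
```

For realistic signal lengths an FFT-based convolution would normally be preferred; the nested loop is used here only because it needs no external library.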
[0033]
(Embodiment 2)
In the first embodiment, the pattern matching unit 4 sequentially calculates the similarity between the voice feature amount extracted from the input voice and each of the standard patterns 6a to 6e stored in the standard pattern storage unit 6, and selects the standard pattern with the highest similarity. The present embodiment, as shown in FIG. 2, additionally includes an environment pattern storage unit 7 that stores environment patterns 7a to 7e created from voice in the same environments as the standard patterns 6a to 6e stored in the standard pattern storage unit 6. Each environment pattern is based on a hidden Markov model consisting of, for example, a single environment model with one state and 64 Gaussian distributions per state. The pattern matching unit 4 calculates the similarity between the voice feature amount extracted from the input voice and each of the environment patterns 7a to 7e stored in the environment pattern storage unit 7, for example as a likelihood computed by the forward algorithm, selects the environment pattern with the highest similarity, and determines the standard pattern for speech recognition that was created from voice in the same environment as the selected environment pattern. The embodiment is characterized in that, using the determined standard pattern and the voice feature amount of the input voice, it outputs a recognition result 5 composed of a phoneme model sequence having a large likelihood (high similarity). In FIG. 2, the same components as those in FIG. 1 are denoted by the same reference numerals, and description thereof will be omitted.
[0034]
Next, the operation of the present embodiment will be described. Suppose that, in an environment with a reverberation time of, for example, 400 milliseconds, the user speaks, for example, "Hello" into the microphone constituting the voice input unit 1. The input voice signal is A/D-converted by the A/D converter 2, and then the voice feature amount extraction unit 3 extracts the voice feature amount. The pattern matching unit 4 calculates the similarity between the voice feature amount and each of the five environment patterns 7a to 7e, and selects the environment pattern with the highest similarity.
[0035]
FIG. 3 shows a flowchart of this environment pattern selection processing. The pattern matching unit 4, having received the voice feature amount in step S1, sequentially calculates in steps S2 and S3 the similarity between the voice feature amount and each of the environment patterns 7a to 7e stored in the environment pattern storage unit 7, and, after the calculation is completed, selects the environment pattern with the highest calculated similarity. Therefore, when voice is input in an environment with a reverberation time of 400 milliseconds as described above, the environment pattern 7a (for example, a reverberation time of 340 milliseconds) is selected as the environment pattern with the highest similarity.
[0036]
Next, the pattern matching unit 4 determines the corresponding standard pattern 6a from the pattern name of the selected environment pattern 7a and, using the standard pattern 6a and the voice feature amount of the input voice, outputs as the recognition result 5 a phoneme model sequence, formed from the 43 phoneme models included in the standard pattern 6a, that has a large likelihood (high similarity) for the input utterance "Hello".
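The two-stage selection of this embodiment can be sketched as below. For a hidden Markov model with a single state, the forward likelihood reduces to the product of the per-frame mixture densities, which is what the sketch computes in the log domain. The one-dimensional features, the mixture parameters and the names `gmm_log_density`, `ENVIRONMENT_TO_STANDARD` and `choose_standard_pattern` are all assumptions made for illustration; the description uses 64 Gaussians over cepstral feature vectors.

```python
import math


def gmm_log_density(x, mixture):
    """Log density of scalar x under a mixture of (weight, mean, variance)."""
    total = 0.0
    for weight, mean, variance in mixture:
        total += (weight * math.exp(-(x - mean) ** 2 / (2.0 * variance))
                  / math.sqrt(2.0 * math.pi * variance))
    return math.log(max(total, 1e-300))  # floor avoids log(0) on underflow


def environment_log_likelihood(frames, mixture):
    """One-state HMM: the forward likelihood is a product over frames."""
    return sum(gmm_log_density(x, mixture) for x in frames)


# Hypothetical name mapping from environment pattern to standard pattern.
ENVIRONMENT_TO_STANDARD = {"7a": "6a", "7b": "6b"}


def choose_standard_pattern(frames, environment_patterns):
    """Pick the best-matching environment, then look up its standard pattern."""
    best_env = max(
        environment_patterns,
        key=lambda name: environment_log_likelihood(frames,
                                                    environment_patterns[name]),
    )
    return best_env, ENVIRONMENT_TO_STANDARD[best_env]


environment_patterns = {      # toy mixtures, not values from the source
    "7a": [(1.0, 0.0, 1.0)],
    "7b": [(0.5, 4.0, 1.0), (0.5, 6.0, 1.0)],
}
best_env, standard = choose_standard_pattern([0.1, -0.2, 0.3],
                                             environment_patterns)
print(best_env, standard)
```

Only the chosen standard pattern then needs to be evaluated for recognition, which is where the reduction in computation described for this embodiment comes from.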
[0037]
(Embodiment 3)
In the second embodiment, the pattern matching unit 4 calculates the similarity between the voice feature amount and all of the environment patterns 7a to 7e stored in the environment pattern storage unit 7 every time voice is input. In the present embodiment, the pattern matching unit 4 calculates the similarity between the voice feature amount and the environment patterns 7a to 7e in the same manner as in the second embodiment only at an initial recognition operation, such as the first recognition operation, and determines the standard pattern from the name of the selected environment pattern. The environment pattern name used to determine the standard pattern at this time is stored in the selected pattern storage unit 8 shown in FIG. 4. In subsequent recognition operations, using the standard pattern corresponding to the environment pattern name stored in the selected pattern storage unit 8 and the voice feature amount, a phoneme model sequence having a large likelihood (high similarity) for the input voice is output as the recognition result 5 from the 43 phoneme models included in that standard pattern. In this respect the present embodiment differs from the second embodiment. In the present embodiment, the environment pattern name stored in the selected pattern storage unit 8 is reset by operating a push-button switch (not shown) attached to the voice recognition device. Of course, the voice recognition device may instead be provided with a sensor that detects movement of the device, such as a vibration detection sensor, so that the stored name is reset automatically when the device is moved.
[0038]
FIG. 5 shows a flowchart of the operation of the pattern matching unit 4 in the present embodiment. When the voice feature amount of the voice input through the microphone constituting the voice input unit 1 is supplied to the pattern matching unit 4 via the voice feature amount extraction unit 3 (step S10), the pattern matching unit 4 checks the selected pattern storage unit 8 to determine whether an environment pattern name is already stored (step S11). If none is stored, it judges that this is the first voice recognition operation and sequentially calculates the similarity between the voice feature amount and the environment patterns 7a to 7e (steps S12 and S13). When the similarity has been calculated for all the environment patterns 7a to 7e, the environment pattern with the highest similarity is selected (step S14), and the selected environment pattern name is stored in the selected pattern storage unit 8 (step S15). Then a standard pattern is determined based on the environment pattern name, and, using the voice feature amount of the input voice output from the voice feature amount extraction unit 3 and the determined standard pattern as described above, a phoneme model sequence having a large likelihood (high similarity) for the input voice is output as the recognition result 5 from the phoneme models included in the standard pattern.
[0039]
On the other hand, if the environment pattern name is already stored in the selected pattern storage unit 8 in step S11, the standard pattern is determined based on the stored environment pattern name.
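The store-once, reuse-until-reset behaviour of steps S10 to S15 might be sketched as follows. The class name `EnvironmentSelector`, the scoring callback and the scalar "feature" used in the usage example are hypothetical stand-ins; in particular, comparing a reverberation time directly is only a toy substitute for the likelihood calculation.

```python
class EnvironmentSelector:
    """Caches the selected environment pattern name until reset() is called."""

    def __init__(self, score_environment, environment_names):
        self._score = score_environment       # callable(name, features) -> float
        self._names = list(environment_names)
        self._stored_name = None              # plays the role of storage unit 8
        self.evaluations = 0                  # counts similarity calculations

    def select(self, features):
        if self._stored_name is None:                        # step S11
            scores = {}
            for name in self._names:                         # steps S12, S13
                scores[name] = self._score(name, features)
                self.evaluations += 1
            self._stored_name = max(scores, key=scores.get)  # steps S14, S15
        return self._stored_name

    def reset(self):
        """Corresponds to the push-button switch or the movement sensor."""
        self._stored_name = None


# Toy scoring: closeness of a scalar "feature" to each reverberation time (ms).
reverberation_ms = {"7a": 340, "7b": 550, "7c": 810, "7d": 900, "7e": 1680}
selector = EnvironmentSelector(
    lambda name, feature: -abs(reverberation_ms[name] - feature),
    reverberation_ms,
)
first = selector.select(400)     # all five patterns are scored
second = selector.select(1600)   # stored name is reused, nothing is rescored
selector.reset()
third = selector.select(1600)    # scored again after the reset
print(first, second, third, selector.evaluations)
```

The `evaluations` counter is included only to make visible that the second call performs no similarity calculation.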
[0040]
(Embodiment 4)
The present embodiment was devised by focusing on the following tendency. The reverberation time of a space is proportional to the volume of the space. The sound waves of voice uttered in the space that enter the voice input unit include a direct wave, which travels directly from the sound source to the voice input unit, and a reflected wave, which leaves the sound source, is reflected once by a wall or the like, and then reaches the voice input unit, and the difference between their arrival times is proportional to the volume of the space. In other words, the longer the reverberation time, the larger the arrival time difference tends to be.
[0041]
That is, as shown in FIG. 6, the present embodiment is characterized in that an arrival time calculation unit 11 is provided. For the audio signal A/D-converted by the A/D converter 2, the arrival time calculation unit 11 estimates the time difference between the direct wave, which reaches the voice input unit 1 directly from the user or from the speaker 10 reproducing the designated sound 9, and the reflected wave, which arrives after being reflected at least once by a wall or the like. The estimate is based on, for example, the magnitude of the correlation coefficient between the head section of the audio signal and the section ΔT seconds later. The estimation is tried with several values of ΔT, the ΔT giving the largest correlation coefficient is taken as the arrival time difference, and the standard pattern created for the acoustic environment whose reverberation time corresponds most closely to that arrival time difference is selected. The larger the time difference ΔT, the longer the reverberation time of the environment whose standard pattern is selected.
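The search over candidate values of ΔT can be sketched with a plain Pearson correlation coefficient between the head section and the section starting ΔT later. Everything concrete here is an assumption: the function names, the 8 kHz toy signal, the head length and the candidate delays. The mapping from the estimated ΔT to a particular standard pattern is not given numerically in the description and is therefore left out.

```python
import math


def correlation_coefficient(a, b):
    """Pearson correlation of two equal-length sequences (0.0 if either is flat)."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0.0 or var_b == 0.0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)


def estimate_arrival_time_difference(signal, sample_rate_hz, head_len,
                                     candidate_delays_s):
    """Return the candidate delay whose section best correlates with the head."""
    head = signal[:head_len]
    best_delay, best_corr = None, -1.0
    for delay in candidate_delays_s:
        offset = round(delay * sample_rate_hz)
        section = signal[offset:offset + head_len]
        if len(section) < head_len:
            continue  # candidate runs past the end of the signal
        corr = correlation_coefficient(head, section)
        if corr > best_corr:
            best_delay, best_corr = delay, corr
    return best_delay, best_corr


# Toy signal: a short direct pulse plus an attenuated copy 5 ms later.
sample_rate_hz = 8000
pulse = [1.0, -0.5, 0.25, 0.8, -0.3]
signal = [0.0] * 200
for i, value in enumerate(pulse):
    signal[i] += value
    signal[i + 40] += 0.6 * value   # 40 samples = 5 ms at 8 kHz

delay, corr = estimate_arrival_time_difference(
    signal, sample_rate_hz, head_len=len(pulse),
    candidate_delays_s=[0.0025, 0.005, 0.01],
)
print(delay, round(corr, 3))
```

Because the correlation coefficient is scale-invariant, the attenuation of the reflected copy does not lower its score, which is why a weaker echo can still be located.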
[0042]
Then, the pattern matching unit 4 uses the selected standard pattern and the voice feature amount from the voice feature amount extraction unit 3 to output, as the recognition result 5, a phoneme model sequence having a large likelihood (high similarity) for the input voice from the 43 phoneme models included in the standard pattern.
[0043]
(Embodiment 5)
In the present embodiment, as shown in FIG. 7, a designated sound pattern storage unit 12 is provided in which the feature amounts of a designated sound, prepared in advance as a reference sound, as produced in the environments corresponding to the standard patterns 6a to 6e are recorded as designated sound patterns 12a to 12e, respectively. Before voice recognition is performed, the designated sound 9 is reproduced from a speaker 10 placed near the person speaking, and the voice feature amount of the reproduced designated sound 9 is extracted via the microphone, the A/D converter 2, and the voice feature amount extraction unit 3. The pattern matching unit 4 calculates the similarity between this voice feature amount and each of the designated sound patterns 12a to 12e stored in the designated sound pattern storage unit 12, and selects the standard pattern corresponding to the designated sound pattern with the highest similarity as the standard pattern for speech recognition. Subsequent speech recognition of the input voice of the person speaking is performed using the selected standard pattern.
[0044]
In this case, as with the environment pattern selection described above, the selected designated sound pattern may be stored in a selected pattern storage unit (not shown), and the standard pattern corresponding to the stored designated sound pattern may be used each time the speaker's voice is subsequently recognized, which reduces the amount of similarity calculation. Of course, when the voice recognition device is moved or the usage environment changes, the stored content may be reset and the designated sound reproduced again to select the designated sound pattern with the highest similarity.
[0045]
In general, a "language model" is often used in speech recognition devices, and the speech recognition device of the present invention may also be provided with a language model.
[0046]
That is, for example, when recognizing the word "good morning" ("Ohayo"), there is a possibility that a similar-sounding word such as "Ohio" is recognized instead.
[0047]
However, in practice a word such as "Ohio" rarely appears in everyday life. Therefore, the appearance frequencies in everyday life of registered words such as "Ohayo" and "Ohio" may be stored as probability values in a model called a language model, and the product of the probability value and the similarity may be used as the final similarity.
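The rescoring rule stated here (final similarity = probability value × similarity) is simple enough to show directly. The similarity and probability values below are invented for illustration, and the function name `rescore` is hypothetical.

```python
def rescore(similarities, word_probabilities):
    """Multiply each word's acoustic similarity by its language-model probability."""
    return {word: similarity * word_probabilities.get(word, 0.0)
            for word, similarity in similarities.items()}


# Hypothetical values: the acoustic match slightly prefers the rarer word.
similarities = {"Ohayo": 0.40, "Ohio": 0.45}
language_model = {"Ohayo": 0.9, "Ohio": 0.1}

final = rescore(similarities, language_model)
best_word = max(final, key=final.get)
print(best_word, final)
```

In this toy case the acoustic similarity alone would pick the rarer word, while the product favours the word that is frequent in everyday use.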
[0048]
[Effects of the invention]
According to the first aspect of the present invention, the device comprises a voice input unit for inputting voice; an A/D converter that A/D-converts the output signal from the voice input unit; a voice feature amount extraction unit that frequency-converts the output signal from the A/D converter and extracts the voice feature amount; a standard pattern storage unit that stores a plurality of standard patterns created respectively from standard voices in acoustic environments with different reverberation times; and a pattern matching unit that calculates the similarity between the voice feature amount of the input voice output from the voice feature amount extraction unit and the standard patterns stored in the standard pattern storage unit and outputs a recognition result. The pattern matching unit sequentially compares the plurality of standard patterns stored in the standard pattern storage unit with the voice feature amount of the input voice output from the voice feature amount extraction unit and outputs the recognition result with the highest similarity as the final recognition result. Therefore, the user does not need to select a standard pattern according to the environment. The standard pattern with the highest similarity to the input voice is automatically selected during the recognition operation from among the plurality of standard patterns corresponding to acoustic environments with different reverberation times, so the input voice can be recognized with the standard pattern that gives the highest voice recognition performance. In addition, by preparing a group of standard patterns suited to a plurality of different acoustic environments, various usage environments can be handled with a very small number of standard patterns, and a decrease in speech recognition performance caused by mismatch between the input voice and the standard pattern can be avoided.
[0049]
According to the second aspect of the present invention, the device comprises a voice input unit for inputting voice; an A/D converter that A/D-converts the output signal from the voice input unit; a voice feature amount extraction unit that frequency-converts the output signal from the A/D converter and extracts the voice feature amount; a standard pattern storage unit that stores a plurality of standard patterns created respectively from standard voices in acoustic environments with different reverberation times; a pattern matching unit that calculates the similarity between the voice feature amount of the input voice output from the voice feature amount extraction unit and the standard patterns stored in the standard pattern storage unit and outputs a recognition result; and a selection pattern storage unit that stores the standard pattern that had the highest similarity during a previous voice recognition operation. The pattern matching unit outputs the recognition result from the result of calculating the similarity between the standard pattern stored in the selection pattern storage unit and the voice feature amount of the input voice output from the voice feature amount extraction unit. Therefore, high speech recognition performance is obtained in the same manner as in the first aspect of the present invention, and, because there is no need to calculate the similarity between all the standard patterns and the input voice, the recognition result can be output at high speed.
[0050]
According to the third aspect of the present invention, the device comprises a voice input unit for inputting voice; an A/D converter that A/D-converts the output signal from the voice input unit; a voice feature amount extraction unit that frequency-converts the output signal from the A/D converter and extracts the voice feature amount; a standard pattern storage unit that stores a plurality of standard patterns created respectively from standard voices in acoustic environments with different reverberation times; a pattern matching unit that calculates the similarity between the voice feature amount of the input voice output from the voice feature amount extraction unit and the standard patterns stored in the standard pattern storage unit and outputs a recognition result; and an arrival time calculation unit that, based on the voice input to the voice input unit, calculates the arrival time difference between the direct wave and the reflected wave at the voice input unit by computing the correlation between the direct wave and the reflected wave, and selects a standard pattern from the standard pattern storage unit according to the magnitude of the arrival time difference. The pattern matching unit outputs the recognition result from the result of calculating the similarity between the standard pattern selected based on the calculation result of the arrival time calculation unit and the voice feature amount of the input voice output from the voice feature amount extraction unit. Therefore, high speech recognition performance can be obtained as in the first aspect of the present invention. Moreover, a standard pattern suited to recognizing the input voice can be selected without calculating the similarity between all the standard patterns and the input voice, so the recognition result can be output at high speed.
[0051]
According to the fourth aspect of the present invention, the device comprises a voice input unit for inputting voice; an A/D converter that A/D-converts the output signal from the voice input unit; a voice feature amount extraction unit that frequency-converts the output signal from the A/D converter and extracts the voice feature amount; a standard pattern storage unit that stores a plurality of standard patterns created respectively from standard voices in acoustic environments with different reverberation times; a pattern matching unit that calculates the similarity between the voice feature amount of the input voice output from the voice feature amount extraction unit and the standard patterns stored in the standard pattern storage unit and outputs a recognition result; and an environment pattern storage unit that stores environment patterns used only for specifying the environment corresponding to each standard pattern. The pattern matching unit calculates the similarity between the voice feature amount of the input voice output from the voice feature amount extraction unit and each environment pattern stored in the environment pattern storage unit, selects the environment pattern with the highest similarity, and outputs the recognition result from the result of calculating the similarity between the standard pattern corresponding to that environment pattern and the voice feature amount of the input voice output from the voice feature amount extraction unit. Therefore, high speech recognition performance can be obtained as in the first aspect of the present invention. In addition, because the standard pattern used for speech recognition is determined by calculating the similarity against environment patterns that are not used for speech recognition and serve only to specify the environment, the amount of calculation can be reduced compared with calculating the similarity between the voice feature amount of the input voice and the standard patterns for speech recognition, and as a result the recognition result can be output at high speed.
[0052]
According to the fifth aspect of the present invention, the device comprises a voice input unit for inputting voice; an A/D converter that A/D-converts the output signal from the voice input unit; a voice feature amount extraction unit that frequency-converts the output signal from the A/D converter and extracts the voice feature amount; a standard pattern storage unit that stores a plurality of standard patterns created respectively from standard voices in acoustic environments with different reverberation times; a pattern matching unit that calculates the similarity between the voice feature amount of the input voice output from the voice feature amount extraction unit and the standard patterns stored in the standard pattern storage unit and outputs a recognition result; a designated sound pattern storage unit that stores in advance the designated sound patterns obtained when a designated sound is produced in each of the environments corresponding to the standard patterns; and a speaker that is arranged near the person speaking and reproduces the designated sound prepared in advance so that it is input to the voice input unit. The pattern matching unit calculates the similarity between the voice feature amount, output from the voice feature amount extraction unit, of the designated sound reproduced by the speaker and each designated sound pattern stored in the designated sound pattern storage unit, selects the standard pattern of the same environment as the designated sound pattern with the highest similarity, and outputs the recognition result from the result of calculating the similarity between that standard pattern and the voice feature amount of the input voice output from the voice feature amount extraction unit. Therefore, the similarity between the input voice and the standard pattern can be calculated with high accuracy, and higher speech recognition performance can be obtained.
[0053]
According to the sixth aspect of the present invention, in the invention according to any one of the first to fifth aspects, the environment of a room in which the device is expected to be used is used as the acoustic environment corresponding to the standard pattern, so the similarity between the standard pattern and the voice feature amount of the input voice can be increased for elements other than the reverberation time as well. As a result, higher speech recognition performance can be obtained.
[0054]
According to the seventh aspect of the present invention, in the invention of any one of the first to sixth aspects, the voice data used for creating the standard patterns consists of voice data obtained by convolving the transfer functions of a plurality of acoustic environments having different reverberation times with voice recorded in an environment without reverberation. It is therefore unnecessary to record, in each of the acoustic environments with different reverberation times, the large amount of voice data needed to create a standard pattern, which makes the standard patterns easier to create. In addition, because the same voice can be used to create each standard pattern, variation in utterances is eliminated and standard patterns that differ only in acoustic environment characteristics can be created. As a result, a standard pattern can be selected with high accuracy, and high speech recognition performance can be obtained.
[Brief description of the drawings]
FIG. 1 is a configuration diagram of a first embodiment of the present invention.
FIG. 2 is a configuration diagram of a second embodiment of the present invention.
FIG. 3 is a flowchart for explaining the operation of the above.
FIG. 4 is a configuration diagram of a third embodiment of the present invention.
FIG. 5 is a flowchart for explaining the operation of the above.
FIG. 6 is a configuration diagram of a fourth embodiment of the present invention.
FIG. 7 is a configuration diagram of a fifth embodiment of the present invention.
FIG. 8 is a configuration diagram of a conventional example.
[Explanation of symbols]
1 Voice input section
2 A / D converter
3 Voice feature extraction unit
4 Pattern matching section
5 Recognition results
6 Standard pattern storage
6a-6e Standard pattern

Claims

A voice recognition device comprising: a voice input unit for inputting voice; an A/D converter that A/D-converts an output signal from the voice input unit; a voice feature amount extraction unit that frequency-converts an output signal from the A/D converter to extract a voice feature amount; a standard pattern storage unit that stores a plurality of standard patterns created respectively from standard voices in acoustic environments with different reverberation times; and a pattern matching unit that calculates a similarity between the voice feature amount of an input voice output from the voice feature amount extraction unit and the standard patterns stored in the standard pattern storage unit and outputs a recognition result,
wherein the pattern matching unit sequentially compares the plurality of standard patterns stored in the standard pattern storage unit with the voice feature amount of the input voice output from the voice feature amount extraction unit, and outputs the recognition result having the highest similarity as a final recognition result.

A voice recognition device comprising: a voice input unit for inputting voice; an A/D converter that A/D-converts an output signal from the voice input unit; a voice feature amount extraction unit that frequency-converts an output signal from the A/D converter to extract a voice feature amount; a standard pattern storage unit that stores a plurality of standard patterns created respectively from standard voices in acoustic environments with different reverberation times; a pattern matching unit that calculates a similarity between the voice feature amount of an input voice output from the voice feature amount extraction unit and the standard patterns stored in the standard pattern storage unit and outputs a recognition result; and a selection pattern storage unit that stores the standard pattern that had the highest similarity during a previous voice recognition operation,
wherein the pattern matching unit outputs the recognition result from a result of calculating a similarity between the standard pattern stored in the selection pattern storage unit and the voice feature amount of the input voice output from the voice feature amount extraction unit.

A voice recognition device comprising: a voice input unit for inputting voice; an A/D converter that A/D-converts an output signal from the voice input unit; a voice feature amount extraction unit that frequency-converts an output signal from the A/D converter to extract a voice feature amount; a standard pattern storage unit that stores a plurality of standard patterns created respectively from standard voices in acoustic environments with different reverberation times; a pattern matching unit that calculates a similarity between the voice feature amount of an input voice output from the voice feature amount extraction unit and the standard patterns stored in the standard pattern storage unit and outputs a recognition result; and an arrival time calculation unit that, based on the voice input to the voice input unit, calculates an arrival time difference between a direct wave and a reflected wave at the voice input unit by correlating the direct wave with the reflected wave, and selects an optimal standard pattern from the standard pattern storage unit according to the magnitude of the arrival time difference,
wherein the pattern matching unit outputs the recognition result from a result of calculating a similarity between the standard pattern selected based on a calculation result of the arrival time calculation unit and the voice feature amount of the input voice output from the voice feature amount extraction unit.

A voice recognition device comprising: a voice input unit for inputting voice; an A/D converter that A/D-converts an output signal from the voice input unit; a voice feature amount extraction unit that frequency-converts an output signal from the A/D converter to extract a voice feature amount; a standard pattern storage unit that stores a plurality of standard patterns created respectively from standard voices in acoustic environments with different reverberation times; a pattern matching unit that calculates a similarity between the voice feature amount of an input voice output from the voice feature amount extraction unit and the standard patterns stored in the standard pattern storage unit and outputs a recognition result; and an environment pattern storage unit that stores environment patterns each used only for specifying the environment corresponding to a respective standard pattern,
wherein the pattern matching unit calculates a similarity between the voice feature amount of the input voice output from the voice feature amount extraction unit and each environment pattern stored in the environment pattern storage unit, selects the environment pattern having the highest similarity, and outputs the recognition result from a result of calculating a similarity between the standard pattern corresponding to that environment pattern and the voice feature amount of the input voice output from the voice feature amount extraction unit.

A voice recognition device comprising: a voice input unit for inputting voice; an A/D converter that A/D-converts an output signal from the voice input unit; a voice feature amount extraction unit that frequency-converts an output signal from the A/D converter to extract a voice feature amount; a standard pattern storage unit that stores a plurality of standard patterns created respectively from standard voices in acoustic environments with different reverberation times; a pattern matching unit that calculates a similarity between the voice feature amount of an input voice output from the voice feature amount extraction unit and the standard patterns stored in the standard pattern storage unit and outputs a recognition result; a designated sound pattern storage unit that stores in advance designated sound patterns obtained when a designated sound is produced in each of the environments corresponding to the standard patterns; and a speaker that is arranged near the person speaking and reproduces the designated sound prepared in advance so that it is input to the voice input unit,
wherein the pattern matching unit calculates a similarity between the voice feature amount, output from the voice feature amount extraction unit, of the designated sound reproduced by the speaker and each designated sound pattern stored in the designated sound pattern storage unit, selects the standard pattern of the same environment as the designated sound pattern having the highest similarity, and outputs the recognition result from a result of calculating a similarity between that standard pattern and the voice feature amount of the input voice output from the voice feature amount extraction unit.

The voice recognition device according to any one of claims 1 to 5, wherein the environment of a room in which use of the device is assumed is used as the acoustic environment corresponding to the standard pattern.

The voice recognition device according to any one of claims 1 to 6, wherein the voice data used to create the standard pattern is voice data obtained by convolving transfer functions of a plurality of acoustic environments with different reverberation times with voice recorded in an environment without reverberation.