JP3905620B2

JP3905620B2 - Voice recognition device

Info

Publication number: JP3905620B2
Application number: JP32302797A
Authority: JP
Inventors: 浩二赤塚
Original assignee: Honda Motor Co Ltd
Current assignee: Honda Motor Co Ltd
Priority date: 1997-06-10
Filing date: 1997-11-25
Publication date: 2007-04-18
Anticipated expiration: 2017-11-25
Also published as: JPH1165589A

Description

【０００１】
【発明の属する技術分野】
本発明は、不特定話者から離散的に発話された音声を自動的に認識する音声認識装置に関する。
【０００２】
【従来の技術】
複数の不特定話者からの音声を誤認識せずに認識する従来の音声認識装置の多くは、種々の周波数分析手法を用いて音声信号に対してある程度の周波数解像度を有する周波数分析を行って周波数−時間の符号系列に変換し、出現が予想される音素の数の隠れマルコフモデルを用意し、さらに該用意した隠れマルコフモデルを多くの話者からの発話音声によって学習させて予め用意しておく。
【０００３】
この学習済みの隠れマルコフモデルを用いて、不特定話者から発話された音声に基づく周波数−時間の符号系列の部分区間を、全ての音素モデルと照合することによって音素系列の候補の時系列に変換し、この音素の時系列が最もよく表される単語を認識結果として出力するようになされている。
【０００４】
【発明が解決しようとする課題】
しかしながら、従来の音声認識装置では、不特定話者の発話の多様性に対応して高性能な音声認識特性を維持するための隠れマルコフモデルの学習に多くの学習データを必要とし、隠れマルコフモデルで音素を精密に特定するためにある程度の周波数分析の解像度、すなわち、ある程度の大きさのベクトル次数を必要とするという問題点があった。
【０００５】
この結果、隠れマルコフモデルの学習時と音素特定時の演算負荷が重く、さらに単語の認識過程において少なくとも音素照合と単語照合の２段階の照合演算処理を必要とするという問題点があった。
【０００６】
本発明は、簡単な横成で、不特定話者の発話の多様性に対しても高性能を維持することができて、誤認識を低減させた音声認識装置を提供することを目的とする。
【０００７】
【課題を解決するための手段】
本発明にかかる音声認識装置は、音声信号を周波数分析して得た周波数スペクトルを、時間軸に沿って順次求めて時系列データ群に変換する周波数分析手段と、複数の学習話者から発話された音声に基づく音声信号が入力された前記周波数分析手段からの出力時系列データを予め定めた時間窓で切り出す切り出し手段と、前記切り出し手段によって切り出された時系列データ群を用いて主成分分析を行う主成分分析手段と、前記主成分分析により得た主成分中における低い周波数部分および時間窓の中心部分を用いて畳み込み積分を行って入力時系列データを低次の時系列データに圧縮する特徴抽出フィルタ手段とを備え、前記複数の学習話者から発話された音声に基づく低次の時系列データを参照用低次時系列データとし、該参照用低次時系列データと不特定話者から発話された音声に基づく低次の時系列データとを照合して照合結果に基づいて音声認識をすることを特徴とする。
【０００８】
本発明にかかる音声認識装置は、複数の学習話者から発話された音声に基づく音声信号が周波数分析手段に入力されて時系列データ群に変換され、周波数分析手段によって変換された時系列データが切り出し手段によって予め定めた時間窓で切り出され、切り出し手段によって切り出された時系列データ群を用いて主成分分析手段によって主成分分析され、主成分分析により得られた主成分中における低い周波数部分および時間窓の中心部分を用いて畳み込み積分が行われて特徴抽出フィルタ手段にて入力時系列データが低次の時系列データに圧縮される。複数の学習話者から発話された音声に基づく低次の時系列データが参照用低次時系列データとされて、不特定話者から発話された音声に基づく低次の時系列データと照合されて、照合結果に基づいて不特定話者から発話された音声に対する音声認識がなされる。
【０００９】
【発明の実施の形態】
以下、本発明にかかる音声認識装置を実施の一形態によって説明する。
【００１０】
図１は本発明の実施の一形態にかかる音声認識装置の構成を示す模式ブロック図である。
【００１１】
図１の模式ブロック図において、作用の理解を容易にするために、同一の構成要素であっても異なる音声信号ラインに使用する構成要素は重複して示してあって、図１において２重枠の構成要素がこれに当たり、同一符号は同一の構成手段を示している。
【００１２】
本発明の実施の一形態にかかる音声認識装置１は、複数の学習話者から発せられる発話音声に基づき学習話者の音素に対する特徴を抽出し、抽出した特徴に基づいて特徴抽出フィルタを作成する特徴抽出フィルタ作成部αと、複数の学習話者の発話たとえば単語の音声信号に基づく情報を特徴抽出フィルタに供給し、特徴抽出フィルタによって前記情報を圧縮して照合用低次圧縮時系列データ群を生成する照合時系列データ作成部βと、入力された不特定話者からの音声信号を特徴抽出フィルタに供給して、特徴抽出フィルタによって圧縮した時系列データを生成し、該時系列データを照合用低次圧縮時系列データと照合して音声認識結果を出力する不特定話者音声認識部γとを備えている。
【００１３】
特徴抽出フィルタ作成部αは、複数の学習話者から発話された音声（以下、学習音声群とも記す）の周波数スペクトルの時間的変化を示すため、複数の学習話者から発話された音声に基づく音声信号を周波数分析して得た周波数スペクトルを、時間軸に沿って順次求めた時系列データ群（周波数−時間の時系列データ群）に変換する周波数分析器２と、周波数分析器２によって変換された前記複数の学習話者からの音声に基づく周波数−時間の時系列データ群から小さな時間窓の範囲における部分周波数−時間の時系列データを切り出す部分周波数−時間パターン作成器３と、部分周波数−時間パターン生成器３によって切り出された複数の部分周波数−時間の時系列データを用いて主成分分析を行う主成分分析器４と、主成分分析器４による主成分分析結果の低次主成分において、周波数軸方向には低い周波数部分を用い、かつ時間軸方向には時間窓の中央部のみを用いて畳み込み積分を行う特徴抽出フィルタ５を備えて、複数の学習話者からの発話音声から学習話者の音素に対する特徴を抽出する。
【００１４】
照合時系列データ作成部βは照合用低次圧縮時系列データ記憶器６を備え、複数の学習話者から発話された単語音声の周波数スペクトルの時間的変化を示すため、複数の学習話者から発話された前記単語音声の音声信号を周波数分析器２によって周波数分析して得た周波数スペクトルを、時間軸に沿って順次求めた周波数−時間の時系列データ群に変換し、変換された周波数−時間の時系列データ群を特徴抽出フィルタ５に送出し、特徴抽出フィルタ５にて周波数−時間の時系列データを次元圧縮して照合用低次圧縮時系列データ群を得て、照合用低次圧縮時系列データ記憶器６に記憶させる。
【００１５】
不特定話者音声認識部γは時系列データ照合器７を備え、不特定話者から発話された音声の周波数スペクトルの時間的変化を示すため、不特定話者から発話された音声に基づく音声信号を周波数分析器２によって周波数分析して得た周波数スペクトルを、時間軸に沿って順次求めた周波数−時間の時系列データ群に変換し、変換された周波数−時間の時系列データ群を特徴抽出フィルタ５に送出し、特徴抽出フィルタ５にて周波数−時間の時系列データを次元圧縮して時系列データ群を得て、時系列データ群と照合用低次圧縮時系列データ記憶器６から読み出した照合用低次圧縮時系列データとを時系列データ照合器７にて照合し、照合用低次圧縮時系列データ群中から、時系列データ群に最も近いものを求め、照合結果に基づいて不特定話者からの発生音声に基づく言葉を認識する。
【００１６】
次に周波数分析器２、部分周波数−時間パターン作成器３、主成分分析器４、特徴抽出フィルタ５のそれぞれについて具体的に説明する。
【００１７】
周波数分析器２では、入力音声信号がＡ／Ｄ変換され、Ａ／Ｄ変換された音声信号に対して高域強調処理がなされ、高域処理されたＡ／Ｄ変換音声信号に対して時間窓としてのハニング窓がかけられ、線形予測（ＬＰＣ）分析によってＬＰＣ係数が求められ、このＬＰＣ係数に対してフーリエ変換が行われて、周波数スペクトルが求められ、これを時間軸に沿って逐次求めることで、音声スペクトルの時間的変化を示すための周波数−時間の時系列データに変換される。したがって周波数分析器２では入力音声のサウンドスペクトルパターンである周波数−時間パターンに実質的に展開されることになる。なおこの場合、周波数−時間の時系列データの各時刻における周波数−時間の時系列データはＮ次ベクトルＸｉである。
【００１８】
この周波数分析法に応じて特徴抽出フィルタ５を作成すれば、音声情報の欠落が少ない。また、周波数分析法に応じて特徴抽出フィルタ５を作成したときに音声情報に欠落がないような他の周波数分析法によってもよい。したがって、周波数分析器２による方法によれば、所謂ＬＰＣスペクトル包絡による方法よりも、さらにベクトル次数の少ない周波数−時間パターンにも適用することができる。この結果、周波数−時間の時系列データ群によって実質的に音声信号の周波数−時間パターンが示される。
【００１９】
部分周波数−時間パターン作成器３では、周波数分析器２から出力される周波数−時間の時系列データ群中から、所定の小さな時間窓の範囲における周波数−時間の時系列データが切り出される。このため、部分周波数−時間パターン作成器３から出力される周波数−時間の時系列データに基づく音声の周波数−時間パターンは、周波数分析器２から出力される周波数−時間の時系列データに基づく音声の周波数−時間パターンの一部分であって、部分周波数−時間パターンであるといえる。
【００２０】
特徴抽出フィルタ５は、周波数−時間の時系列データから情報の欠落を最小限に抑え、情報圧縮した時系列データを作成する。本例では情報の圧縮に主成分分析を用いている。さらに詳細には部分周波数−時間パターンをサンプルデータとして主成分分析を行った結果の主成分のうち低次主成分において、周波数軸方向には低い周波数部分を用い、かつ時間軸方向には時間窓の中央部分のみを用いて、畳み込み積分を行っている。
【００２１】
さらに詳細に、例えば９名の異なる学習話者の共通した１００語の発話データを学習音声信号群として用いた場合の例を説明する。
【００２２】
この場合、発話データには、単語音声信号区間中の発話音素と、発話音素の音声信号の時間軸上における開始点と終了点とに対応がつけられたラベルデータとを持っているものとする。例えば図３（Ａ）に示すように、音素Ｅに対する開始点の時間ラベルａ、音素Ｅに対する終了点の時間ラベルでありかつ音素Ｆに対する開始点の時間ラベルである時間ラベルｂ、音素Ｆに対する終了点の時間ラベルｃを持っている。
【００２３】
部分周波数−時間パターン作成器３は、周波数分析器２から出力される周波数−時間の時系列データをラベルデータと共に、時間抽上の音素の中心位置、図３（Ａ）に示す例では（ａ＋ｂ）／２、（ｂ＋ｃ）／２を求め、この中心位置を中心に時間窓部分の周波数−時間の時系列データを切り出す。
【００２４】
すなわち、学習音声信号群に対して、部分周波数−時間パターン作成器３によって、例えば３０ｍｓの時間窓Ｄで切り出しを行い、部分周波数−時間の時系列データ群を作成する。部分周波数−時間パターン作成器３によって作成された部分周波数−時間の時系列データの時間窓Ｄによる切り出しは、図３（Ｂ）に示すように、音素Ｅに対しては時間ラベルａと時間ラベルｂとの間の中央に時間窓Ｄがくるように、［｛（ａ＋ｂ）／２｝−（Ｄ／２）］の位置から［｛（ａ＋ｂ）／２｝＋（Ｄ／２）］の位置までが切り出され、音素Ｆに対しては時間ラベルｂと時間ラベルｃとの間の中央に時間窓Ｄがくるように、［｛（ｂ＋ｃ）／２｝−（Ｄ／２）］の位置から［｛（ｂ＋ｃ）／２｝＋（Ｄ／２）］の位置までが切り出される。
【００２５】
この切り出し処理を同じ音素のラベル区間について行うことによって、同じ音素の周波数−時間の時系列データを複数集めることができる。同じ音素の複数集めた周波数−時間の時系列データの平均値を求め、これを部分周波数−時間の時系列データとする。この部分周波数−時間の時系列データを音素毎に作成することによって部分周波数−時間の時系列データ群が作成される。
【００２６】
また、この切り出し処理を変化の少ない音素毎、すなわち比較的定常的な音素毎に行ってもよい。
【００２７】
この部分周波数−時間の時系列データ群から、主成分分析器４によって主成分が求められる。
【００２８】
部分周波数−時間の時系列データから主成分分析器４による主成分の出力までの作用について図４に基づいて説明する。図４においては、部分周波数−時間の時系列データをパターンと略記してある。
【００２９】
切り出された音素Ａの部分周波数−時間の時系列データ群、音素Ｂの部分周波数−時間の時系列データ群、……、音素Ｚの部分周波数−時間の時系列データ群は図４（Ａ）に模式的に示す如くであり、各音素Ａ〜Ｚについての部分周波数−時間の時系列データ群の平均値が求められる。音素Ａの部分周波数−時間の時系列データ群の平均値、音素Ｂの部分周波数−時間の時系列データ群の平均値、……、音素Ｚの部分周波数−時間の時系列データ群の平均値は図４（Ｂ）に模式的に示す如くである。
【００３０】
各音素Ａ〜Ｚの部分周波数−時間の時系列データの平均値は主成分分析器４によって、図４（Ｃ）に模式的に示すように、主成分分析が行われる。主成分分析の結果、図４（Ｄ）に模式的に示すように、第１主成分、第２主成分、……、第Ｋ主成分（Ｚ＞Ｋ）が求められる。
【００３１】
すなわち、主成分分析ではサンプルデータ空間のベクトル次元数と同数の次元数の主成分が求められ、サンプルデータの分散が最も多い軸を決める主成分を第１主成分、分散が２番目に大きい軸を決める主成分を第２主成分、以下同様に第Ｋ主成分が決まる。
【００３２】
主成分の内の低次主成分は部分周波数−時間の時系列データ群の特徴に多く含まれる成分の固有空間を定義しており、音声信号の周波数−時間の時系列データに基づく周波数−時間パターン中に最も含まれる部分の特徴を表している。そこで、音声信号に含まれる学習話者の個人性に基づく成分や認識に悪影響を及ぼすと考えられるノイズ成分は、低次主成分には含まれていないと考えられる。
【００３３】
特徴抽出フィルタ５では、部分周波数−時間パターンをサンプルデータとして、主成分分析を行った結果の低次主成分において、周波数軸方向には低い周波数部分を用い、かつ時間軸方向には時間窓Ｄの中央部分のみを用いて畳み込み積分を行う。この畳み込み積分を行うベクトルを特徴抽出ベクトルとも記す。
【００３４】
例えば、２つの特徴抽出ベクトルの場合は、第１主成分ベクトルの周波数軸方向には低い周波数部分を用い、かつ時間軸方向には時間窓Ｄの中央部分のみを用いて畳み込み積分を行うものを第１特徴抽出ベクトルδ１ｉ、第２主成分ベクトルの周波数軸方向には低い周波数部分を用い、かつ時間軸方向には時間窓Ｄの中央部分のみを用いて畳み込み積分を行うものを第２特徴抽出ベクトルδ２ｉと呼ぶことにする。
【００３５】
この第１、第２特徴抽出ベクトルδ１ｉ、δ２ｉを特徴抽出フィルタ５で用い、周波数分析器２から出力される周波数−時間時系列データの各時刻における周波数−時間の時系列データと、第１、第２特徴抽出ベクトルδ１ｉ、δ２ｉとの間で相関値を求める。この各特徴抽出ベクトル毎の相関値出力をチャンネル出力とも記す。この相関値出力を各チャンネル毎に正規化して２チャンネルフィルタ出力とする。
【００３６】
上記から明らかなように、特徴抽出フィルタ５は２つの特徴抽出ベクトルδ１ｉ、δ２ｉで構成される場合を例に示せば、図２に示すように、周波数分析結果のＮ次ベクトルＸｉと第１、第２の特徴抽出ベクトルδ１ｉ、δ２ｉとの積和演算を各時刻について積和演算器５１１、５１２にてそれぞれ入力のＮ次ベクトルＸｉに対して行って、各積和演算器５１１、５１２からの出力を、正規化器５２１、５２２によってそれぞれにレベルを正規化し、正規化された各正規化器５２１、５２２からの出力を各チャンネルの出力として送出する。
【００３７】
次に、照合用低次圧縮時系列データ群の作成について説明する。
【００３８】
各単語の学習音声信号が周波数分析器２に供給されて、学習音声信号に基づく周波数−時間の時系列データが作成される。この周波数−時間の時系列データが既に学習音声信号群における音素に対して求めておいた低次主成分を基底とする特徴抽出フィルタ５に供給され、特徴抽出フィルタ５において次元圧縮されて特徴抽出フィルタ５の各チャンネルから時系列データが出力され、この時系列データが照合用低次圧縮時系列データ群とされる。
【００３９】
このように作成された照合用低次圧縮時系列データ群の構造は、図５に示すごとくであって、図５（Ａ）、（Ｂ）、（Ｃ）はそれぞれ学習音声の発話者、例えばａ′、ｂ′、ｃ′による同じ単語の学習音声による場合の照合用低次圧縮時系列データ群であって、９名の話者による１００単語に対する場合には９００個の照合用低次圧縮時系列データ群が得られ、照合用低次圧縮時系列データ群の各要素は学習音声信号の各発話単語名とそれに対応する照合用低次圧縮時系列データの対で構成される。この照合用低次圧縮時系列データ群は照合用低次圧縮時系列データ記憶器６に記憶される。
【００４０】
上記のように照合用低次圧縮時系列データ群が照合用低次圧縮時系列データ記憶器６に記憶させてある状態で、不特定話者からの音声認識が行われる。不特定話者からの音声信号は周波数分析器２によって周波数分析され、既に学習音声信号群からの音声信号に基づいて予め特徴抽出フィルタ作成部αで求められた特徴抽出フィルタ５に供給されて、特徴抽出フィルタ５において次元圧縮処理がなされて、時系列データに変換される。
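[0041]
The time-series data based on the speech signal from the unspecified speaker is collated in the time-series data collator 7 against the collation low-order compressed time-series data group obtained by the collation time-series data creation unit β from the learning speech signal group. The collation low-order compressed time-series data closest to the time-series data based on the speech signal from the unspecified speaker is selected from the group, and the uttered word name associated with the selected data is output as the recognition result.
[0042]
Next, the time-series data collator 7 of this embodiment is described, taking collation by the DP (dynamic programming) method as an example.
[0043]
The DP method is a collation method that performs time normalization by non-linearly stretching and compressing the time axis between the input time-series data and each stored time-series data, and thereby establishes a correspondence between them. With this method, a distance after time normalization is defined between the input time-series data and each stored time-series data, and the stored time-series data with the smallest distance is taken to best represent the input time-series data and is used as the recognition result. In this embodiment, the DP method is applied between the time-series data based on the speech signal from the unspecified speaker and the collation low-order compressed time-series data, and the word name associated with the collation low-order compressed time-series data having the smallest distance after time normalization is output.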
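The DP collation described above can be sketched in Python as a plain dynamic time warping search. This is only an illustration under stated assumptions: the Euclidean local distance, the symmetric step pattern, the normalization by the summed sequence lengths, and the names `dtw_distance` and `recognize` are not given in the source, which specifies only non-linear time normalization and selection of the minimum distance.

```python
import math


def dtw_distance(seq_a, seq_b):
    """Time-normalized DP distance between two sequences of channel vectors.

    Assumed details: Euclidean local distance, steps (i-1, j), (i, j-1),
    (i-1, j-1), and normalization by len(seq_a) + len(seq_b).
    """
    inf = float("inf")
    n, m = len(seq_a), len(seq_b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = math.dist(seq_a[i - 1], seq_b[j - 1])
            cost[i][j] = local + min(cost[i - 1][j],
                                     cost[i][j - 1],
                                     cost[i - 1][j - 1])
    return cost[n][m] / (n + m)


def recognize(input_series, references):
    """Return the word name of the reference closest to the input.

    references: list of (word_name, series) pairs, as stored for collation.
    """
    best = min(references, key=lambda ref: dtw_distance(input_series, ref[1]))
    return best[0]


# Hypothetical two-channel data: the input is a time-stretched version of "up".
references = [
    ("up", [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]),
    ("down", [[2.0, 2.0], [1.0, 1.0], [0.0, 0.0]]),
]
stretched = [[0.0, 0.0], [0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [2.0, 2.0]]
result = recognize(stretched, references)
```

Because the warping path may repeat frames, the stretched input aligns with the "up" reference at zero cost even though the two sequences differ in length.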
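[0044]
Next, the results of an evaluation experiment based on this embodiment are described. As test samples, 100 words from each of 10 speakers in a database for speaker verification evaluation were used.
[0045]
The utterance data of the nine speakers other than the one test speaker were used as the learning speech signal group, and the feature extraction filter 5 was created by the feature extraction filter creation unit α. The phonemes used as samples were vowels, plosives, fricatives, and nasals. The partial frequency-time pattern generator 3 was used to obtain partial frequency-time time-series data for each speaker, and the principal component analyzer 4 obtained the principal components from these data. Of the first and second principal components, the portion covering low frequencies of 4.5 kHz or below in the frequency axis direction and only the one unit time at the center of time window D in the time axis direction was used as the feature extraction vectors δ1i and δ2i. FIG. 6 shows an example of the shapes of the feature extraction vectors δ1i and δ2i, with frequency on the horizontal axis and weight coefficient on the vertical axis.
[0046]
The collation low-order compressed time-series data group used by the time-series data collator 7 was obtained as 900 collation low-order compressed time-series data by the collation time-series data creation unit β using the above feature extraction filter 5, with the utterance data of the nine speakers other than the test speaker as the learning speech signal group. The evaluation experiment was repeated with a different test speaker each time, and each time the feature extraction filter 5 was recomputed and the collation low-order compressed time-series data were recreated.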
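The evaluation procedure, in which each of the 10 speakers is held out in turn while the other nine supply the training data, can be sketched in Python as below. The function name and the speaker identifiers are hypothetical; the step of rebuilding the filter and the collation data is indicated only in a comment.

```python
def leave_one_speaker_out(speakers):
    """Yield (test_speaker, training_speakers) with each speaker held out once."""
    for index, test_speaker in enumerate(speakers):
        training = speakers[:index] + speakers[index + 1:]
        yield test_speaker, training


speakers = ["speaker%02d" % i for i in range(10)]  # hypothetical identifiers
folds = list(leave_one_speaker_out(speakers))
# For each fold, the feature extraction filter and the 900 collation entries
# (9 training speakers x 100 words) would be rebuilt from `training` only.
```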
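[0047]
Next, a variation of the speech recognition apparatus according to the embodiment of the present invention is described.
[0048]
As stated above, the low-order principal components define the eigenspace of the components that are abundantly contained in the features of the partial frequency-time time-series data group, and they represent the features of the parts most commonly contained in the frequency-time pattern based on the frequency-time time-series data of the speech signal. Components due to the individuality of the learning speakers, and noise components considered to adversely affect recognition, are considered not to be contained in the low-order principal components.
[0049]
For this reason, in this variation, instead of the feature extraction vectors δ1i and δ2i of the feature extraction filter 5, the principal components from the first, which has the largest variance, to the fourth in order of decreasing variance may be used as feature extraction vectors. For example, four principal components taken in order from the smallest to the largest information loss may be used as the low-order principal components.
[0050]
When the above four principal components are used as the low-order principal components, the feature extraction filter of this variation uses them as its basis, for example the first to fourth low-order principal component vectors δ1i′, δ2i′, δ3i′, and δ4i′, and obtains correlation values between the frequency-time time-series data at each time output from the frequency analyzer 2 and each of δ1i′, δ2i′, δ3i′, and δ4i′. The correlation value output for each low-order principal component is also referred to as a channel. The correlation values are normalized for each channel to give a four-channel filter output.
[0051]
As is also apparent from the above, taking the case of four low-order principal components as an example, the feature extraction filter of this variation, as shown in FIG. 7, performs at each time the product-sum operation between the input N-order vector Xi of the frequency analysis result and each of the low-order principal component vectors δ1i′, δ2i′, δ3i′, and δ4i′ in product-sum calculators 511′, 512′, 513′, and 514′, respectively. The outputs of the product-sum calculators 511′, 512′, 513′, and 514′ are level-normalized separately by normalizers 521′, 522′, 523′, and 524′, and the normalized outputs of the normalizers 521′, 522′, 523′, and 524′ are sent out as the outputs of the respective channels.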
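[0052]
Next, creation of the collation low-order compressed time-series data group is described.
[0053]
The learning speech signal of each word is supplied to the frequency analyzer 2, which creates frequency-time time-series data based on the learning speech signal. This data is supplied to the feature extraction filter 5, whose basis is the low-order principal components already obtained for the phonemes in the learning speech signal group. The feature extraction filter 5 dimensionally compresses the data and outputs time-series data from each of its channels, and this time-series data is used as the collation low-order compressed time-series data.
[0054]
The structure of the collation low-order compressed time-series data of this variation created in this way is as shown in FIG. 8. FIGS. 8(A), (B), (C), and (D) show the collation low-order compressed time-series data for learning speech of the same word uttered by different learning speakers, for example a′, b′, c′, and d′. With nine speakers and 100 words, 900 collation low-order compressed time-series data are obtained. Each element of the collation low-order compressed time-series data group is a pair consisting of the uttered word name of the learning speech signal and the corresponding collation low-order compressed time-series data. The group is stored in the collation low-order compressed time-series data storage 6.
[0055]
In other respects, apart from the shapes of the feature extraction vectors δ1i and δ2i shown in FIG. 6, the variation is the same as the speech recognition according to the embodiment of the present invention described above.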
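[0056]
FIG. 9 shows the speech recognition results for two cases: the speech recognition apparatus 1 of the embodiment described above, in which the feature extraction filter 5 is set to two channels and uses the feature extraction vectors δ1i and δ2i shown in FIG. 6, and the case using a feature extraction filter set to three channels using the four low-order principal component analysis results described in the variation.
[0057]
In FIG. 9, a shows the recognition results for the former, that is, the two-channel feature extraction filter 5 using the feature extraction vectors shown in FIG. 6, and b shows the recognition results for the latter, that is, the feature extraction filter set to three channels using the four low-order principal component analysis results. Good recognition results are obtained in both cases, but the former is seen to be even better.
[0058]
For the principal component analysis, the frequency range and time window D cut out in the frequency analysis are 0 to 8 kHz and 30 msec wide. In the variation, the frequency range is 8 kHz (32 points) and time window D is 30 msec (= 5 msec × 6, six unit times), whereas in the embodiment the frequency range is 0 to 4.5 kHz (18 points) and time window D is 5 msec (one unit time), so the frequency range is roughly 1/2 and the time width is 1/6. A frequency range and time width that are steadily stable are cut out for speech recognition, and it was found that features of speech with little speaker dependence can be extracted sufficiently even in the range of 0 to 4.5 kHz and 5 msec.
[0059]
Accordingly, as for the amount of computation needed to extract speech features per unit time, the embodiment requires a total of 18 multiplications, since it uses 18 points in frequency and 1 point on the time axis, whereas the variation requires a total of 192 multiplications, with 32 points in frequency and 6 points on the time axis. In the embodiment, the computation time for low-order compression of one unit time is thus shortened to 1/10.6 of that of the variation, which is a large reduction in computation, while speech recognition results that are equally good or better are obtained.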
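The multiplication counts quoted above follow directly from the size of the filter support. A short Python check, using only the point counts given in the text (the function name is illustrative, and the count is per feature extraction vector):

```python
def multiplications_per_unit_time(freq_points, time_points):
    """Product-sum length of one feature extraction vector for one unit time."""
    return freq_points * time_points


embodiment = multiplications_per_unit_time(18, 1)  # 0 to 4.5 kHz, 1 unit time
variation = multiplications_per_unit_time(32, 6)   # 0 to 8 kHz, 6 unit times
ratio = variation / embodiment                     # 192 / 18, about 10.67
```

The exact ratio is 192/18, roughly 10.67, which the text gives as 1/10.6.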
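[0060]
Furthermore, the storage capacity of the memory for holding the reference time-series vectors is also reduced. Since the number of channels used is two in the embodiment and three in the variation, the embodiment needs only 2/3 of the capacity.
[0061]
[Effects of the invention]
As described above, with the speech recognition apparatus according to the present invention, both the computation for feature extraction and the processing for collation are simple, so the configuration can be simple, and speech can be recognized with little misrecognition even across the diversity of unspecified speakers. Furthermore, according to the present invention, good speech recognition characteristics are obtained while reducing the amount of computation required for extracting speech features, the amount of computation required for collation, and the memory capacity for storing the reference time-series vectors.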
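[Brief description of the drawings]
[FIG. 1] A schematic block diagram showing the configuration of a speech recognition apparatus according to an embodiment of the present invention.
[FIG. 2] A block diagram showing the configuration of the feature extraction filter in the speech recognition apparatus according to the embodiment of the present invention.
[FIG. 3] A schematic diagram for explaining the operation of the partial frequency-time pattern generator in the speech recognition apparatus according to the embodiment of the present invention.
[FIG. 4] A schematic diagram for explaining the operation of the partial frequency-time pattern generator and the principal component analyzer in the speech recognition apparatus according to the embodiment of the present invention.
[FIG. 5] A schematic diagram showing an example of the structure of the collation low-order compressed time-series data in the speech recognition apparatus according to the embodiment of the present invention.
[FIG. 6] A diagram showing the feature extraction vectors of the feature extraction filter in the speech recognition apparatus according to the embodiment of the present invention.
[FIG. 7] A block diagram showing another configuration of the feature extraction filter in a variation of the speech recognition apparatus according to the embodiment of the present invention.
[FIG. 8] A schematic diagram showing an example of the structure of the collation low-order compressed time-series data in the variation of the speech recognition apparatus according to the embodiment of the present invention.
[FIG. 9] A characteristic diagram showing speech recognition results obtained with the speech recognition apparatus according to the embodiment of the present invention.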
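[Explanation of reference signs]
α Feature extraction filter creation unit
β Collation time-series data creation unit
γ Unspecified speaker speech recognition unit
1 Speech recognition apparatus
2 Frequency analyzer
3 Partial frequency-time pattern generator
4 Principal component analyzer
5 Feature extraction filter
6 Collation low-order compressed time-series data storage
7 Time-series data collator
[0001]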
[Technical field to which the invention belongs]
The present invention relates to a speech recognition apparatus that automatically recognizes speech uttered discretely by unspecified speakers.
[0002]
[Prior art]
Many conventional speech recognition devices that aim to recognize speech from multiple unspecified speakers without misrecognition apply one of various frequency analysis techniques to the speech signal, with a certain frequency resolution, and convert it into a frequency-time code sequence. They prepare as many hidden Markov models as the number of phonemes expected to appear, and train these hidden Markov models in advance on speech uttered by many speakers.
[0003]
Using these trained hidden Markov models, partial intervals of the frequency-time code sequence based on speech uttered by an unspecified speaker are matched against all the phoneme models and thereby converted into a time series of candidate phoneme sequences, and the word that best represents this phoneme time series is output as the recognition result.
[0004]
[Problems to be solved by the invention]
However, conventional speech recognition devices require a large amount of training data to train the hidden Markov models so as to maintain high recognition performance across the diversity of utterances of unspecified speakers. They also require a certain frequency analysis resolution, that is, a vector order of a certain size, in order for the hidden Markov models to identify phonemes precisely.
[0005]
As a result, the computational load is heavy both when training the hidden Markov models and when identifying phonemes, and the word recognition process requires at least two stages of matching computation, namely phoneme matching and word matching.
[0006]
An object of the present invention is to provide a speech recognition apparatus that, with a simple configuration, can maintain high performance across the diversity of utterances of unspecified speakers and reduces misrecognition.
[0007]
[Means for Solving the Problems]
The speech recognition apparatus according to the present invention comprises: frequency analysis means that sequentially obtains, along the time axis, the frequency spectrum obtained by frequency analysis of a speech signal and converts it into a time-series data group; cut-out means that cuts out, with a predetermined time window, the time-series data output from the frequency analysis means when it receives speech signals based on speech uttered by a plurality of learning speakers; principal component analysis means that performs principal component analysis using the time-series data group cut out by the cut-out means; and feature extraction filter means that compresses input time-series data into low-order time-series data by performing convolution integration using the low-frequency part and the time-window center part of the principal components obtained by the principal component analysis. The low-order time-series data based on speech uttered by the plurality of learning speakers is used as reference low-order time-series data, the reference low-order time-series data is collated with low-order time-series data based on speech uttered by an unspecified speaker, and speech recognition is performed based on the collation result.
[0008]
In the speech recognition apparatus according to the present invention, speech signals based on speech uttered by a plurality of learning speakers are input to the frequency analysis means and converted into a time-series data group. The converted time-series data is cut out by the cut-out means with a predetermined time window, and the principal component analysis means performs principal component analysis using the cut-out time-series data group. Convolution integration is performed using the low-frequency part and the time-window center part of the principal components obtained by the principal component analysis, so that the feature extraction filter means compresses the input time-series data into low-order time-series data. The low-order time-series data based on speech uttered by the plurality of learning speakers is used as reference low-order time-series data and is collated with low-order time-series data based on speech uttered by an unspecified speaker, and the speech uttered by the unspecified speaker is recognized based on the collation result.
[0009]
DETAILED DESCRIPTION OF THE INVENTION
Hereinafter, a speech recognition apparatus according to the present invention will be described with reference to an embodiment.
[0010]
FIG. 1 is a schematic block diagram showing the configuration of a speech recognition apparatus according to an embodiment of the present invention.
[0011]
In the schematic block diagram of FIG. 1, to make the operation easier to understand, a component that is used on different speech signal lines is drawn more than once even though it is the same component. The double-framed components in FIG. 1 are such components, and the same reference numeral denotes the same component.
[0012]
The speech recognition apparatus 1 according to an embodiment of the present invention comprises: a feature extraction filter creation unit α that extracts features for the phonemes of learning speakers from speech uttered by a plurality of learning speakers and creates a feature extraction filter based on the extracted features; a collation time-series data creation unit β that supplies information based on speech signals of utterances, for example words, by the plurality of learning speakers to the feature extraction filter and compresses the information with the feature extraction filter to generate a collation low-order compressed time-series data group; and an unspecified speaker speech recognition unit γ that supplies an input speech signal from an unspecified speaker to the feature extraction filter to generate compressed time-series data, collates that time-series data with the collation low-order compressed time-series data, and outputs a speech recognition result.
[0013]
The feature extraction filter creation unit α comprises: a frequency analyzer 2 that, in order to represent the temporal change in the frequency spectrum of speech uttered by a plurality of learning speakers (hereinafter also referred to as the learning speech group), converts speech signals based on that speech into a time-series data group (a frequency-time time-series data group) in which the frequency spectrum obtained by frequency analysis is obtained sequentially along the time axis; a partial frequency-time pattern generator 3 that cuts out partial frequency-time time-series data within a small time window from the frequency-time time-series data group converted by the frequency analyzer 2; a principal component analyzer 4 that performs principal component analysis using the plurality of partial frequency-time time-series data cut out by the partial frequency-time pattern generator 3; and a feature extraction filter 5 that performs convolution integration using, in the low-order principal components of the analysis result from the principal component analyzer 4, the low-frequency part in the frequency axis direction and only the central part of the time window in the time axis direction. The unit thereby extracts features for the phonemes of the learning speakers from the speech uttered by the plurality of learning speakers.
[0014]
The collation time-series data creation unit β includes a collation low-order compressed time-series data storage 6. To represent the temporal change in the frequency spectrum of word speech uttered by a plurality of learning speakers, the speech signals of the word speech are frequency-analyzed by the frequency analyzer 2 and converted into a frequency-time time-series data group in which the frequency spectrum is obtained sequentially along the time axis. The converted group is sent to the feature extraction filter 5, which dimensionally compresses the frequency-time time-series data to obtain the collation low-order compressed time-series data group, and this group is stored in the collation low-order compressed time-series data storage 6.
[0015]
The unspecified speaker speech recognition unit γ includes a time-series data collator 7. To represent the temporal change in the frequency spectrum of speech uttered by an unspecified speaker, the speech signal based on that speech is frequency-analyzed by the frequency analyzer 2 and converted into a frequency-time time-series data group in which the frequency spectrum is obtained sequentially along the time axis. The converted group is sent to the feature extraction filter 5, which dimensionally compresses the frequency-time time-series data to obtain a time-series data group. The time-series data collator 7 collates this time-series data group with the collation low-order compressed time-series data read from the collation low-order compressed time-series data storage 6, finds the entry in the collation low-order compressed time-series data group that is closest to the time-series data group, and recognizes the words of the unspecified speaker's utterance based on the collation result.
[0016]
Next, each of the frequency analyzer 2, the partial frequency-time pattern generator 3, the principal component analyzer 4, and the feature extraction filter 5 will be specifically described.
[0017]
In the frequency analyzer 2, the input speech signal is A/D converted, high-frequency emphasis is applied to the A/D-converted speech signal, and a Hanning window is applied to the emphasized signal as a time window. LPC coefficients are obtained by linear prediction (LPC) analysis, and a Fourier transform is applied to the LPC coefficients to obtain the frequency spectrum. Obtaining this spectrum successively along the time axis converts the signal into frequency-time time-series data that represents the temporal change of the speech spectrum. The frequency analyzer 2 thus in effect expands the input speech into a frequency-time pattern, that is, its sound spectrum pattern. In this case, the frequency-time time-series data at each time is an N-order vector Xi.
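The processing chain of the frequency analyzer 2 can be sketched in Python as follows. This is a minimal illustration rather than the patented implementation: the pre-emphasis coefficient 0.97, the LPC order 10, the autocorrelation method with the Levinson-Durbin recursion, and all function names are assumptions. The text specifies only the sequence of steps (high-frequency emphasis, Hanning window, LPC analysis, Fourier transform of the LPC coefficients). The spectrum is taken here as the magnitude of 1/A at N frequency points, which is one common reading of that last step.

```python
import cmath
import math


def pre_emphasize(signal, coeff=0.97):
    """High-frequency emphasis: y[n] = x[n] - coeff * x[n-1] (coeff assumed)."""
    return [signal[0]] + [signal[n] - coeff * signal[n - 1]
                          for n in range(1, len(signal))]


def hanning(length):
    """Hanning window of the given length."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * i / (length - 1))
            for i in range(length)]


def lpc(frame, order):
    """LPC coefficients a[0..order] with a[0] = 1, by Levinson-Durbin."""
    r = [sum(frame[i] * frame[i + k] for i in range(len(frame) - k))
         for k in range(order + 1)]
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        if err <= 0.0:  # silent or degenerate frame: stop early
            break
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err
        prev = a[:]
        for j in range(1, i):
            a[j] = prev[j] + k * prev[i - j]
        a[i] = k
        err *= 1.0 - k * k
    return a


def lpc_spectrum(a, n_points):
    """|1 / A(e^{jw})| at n_points frequencies from 0 up to just below Nyquist."""
    spectrum = []
    for m in range(n_points):
        w = math.pi * m / n_points
        denom = sum(a[k] * cmath.exp(-1j * w * k) for k in range(len(a)))
        spectrum.append(1.0 / max(abs(denom), 1e-12))
    return spectrum


def analyze(signal, frame_len, hop, order=10, n_points=32):
    """Return the frequency-time series: one N-point vector Xi per frame."""
    window = hanning(frame_len)
    emphasized = pre_emphasize(signal)
    frames = []
    for start in range(0, len(emphasized) - frame_len + 1, hop):
        frame = [emphasized[start + i] * window[i] for i in range(frame_len)]
        frames.append(lpc_spectrum(lpc(frame, order), n_points))
    return frames


# Hypothetical input: a 2 kHz tone sampled at 16 kHz, analyzed in 5 ms frames.
tone = [math.sin(2.0 * math.pi * 2000.0 * n / 16000.0) for n in range(400)]
pattern = analyze(tone, frame_len=80, hop=40)
```

With 32 frequency points spanning 0 to 8 kHz, each row of `pattern` plays the role of the N-order vector Xi at one time step.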
[0018]
If the feature extraction filter 5 is created to suit this frequency analysis method, little speech information is lost. Any other frequency analysis method may also be used, provided that no speech information is lost when the feature extraction filter 5 is created to suit it. Accordingly, the method of the frequency analyzer 2 can also be applied to frequency-time patterns with a smaller vector order than the so-called LPC spectral envelope method. As a result, the frequency-time time-series data group in effect represents the frequency-time pattern of the speech signal.
[0019]
The partial frequency-time pattern generator 3 cuts out the frequency-time time-series data in a predetermined small time window from the frequency-time time-series data group output from the frequency analyzer 2. For this reason, the frequency-time pattern of the voice based on the frequency-time time-series data output from the partial frequency-time pattern generator 3 is the voice based on the frequency-time time-series data output from the frequency analyzer 2. It can be said that it is a part of the frequency-time pattern, and is a partial frequency-time pattern.
[0020]
The feature extraction filter 5 creates information-compressed time-series data from the frequency-time time-series data while keeping information loss to a minimum. In this example, principal component analysis is used for the compression. More specifically, principal component analysis is performed with the partial frequency-time patterns as sample data, and convolution integration is performed using, in the low-order principal components of the result, the low-frequency part in the frequency axis direction and only the central part of the time window in the time axis direction.
[0021]
In more detail, an example is described in which utterance data of 100 words common to, for example, nine different learning speakers is used as the learning speech signal group.
[0022]
In this case, the utterance data is assumed to carry label data that associates each uttered phoneme in a word speech signal section with the start point and end point of that phoneme's speech signal on the time axis. For example, as shown in FIG. 3(A), the data has a time label a for the start point of phoneme E, a time label b that is both the end point of phoneme E and the start point of phoneme F, and a time label c for the end point of phoneme F.
[0023]
Using the frequency-time time-series data output from the frequency analyzer 2 together with the label data, the partial frequency-time pattern generator 3 obtains the center position of each phoneme on the time axis, (a + b) / 2 and (b + c) / 2 in the example shown in FIG. 3(A), and cuts out the frequency-time time-series data of the time window centered on that position.
[0024]
That is, the partial frequency-time pattern generator 3 cuts the learning speech signal group with a time window D of, for example, 30 ms and creates a partial frequency-time time-series data group. As shown in FIG. 3(B), for phoneme E the time window D is centered between time labels a and b, so the span from [{(a + b) / 2} - (D / 2)] to [{(a + b) / 2} + (D / 2)] is cut out. For phoneme F the time window D is centered between time labels b and c, so the span from [{(b + c) / 2} - (D / 2)] to [{(b + c) / 2} + (D / 2)] is cut out.
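The window placement described above can be sketched in Python as follows. The function name, the frame indexing, and the 5 ms frame period in the example are illustrative assumptions; labels and the window are in the same time unit.

```python
def cut_window(frames, start_label, end_label, window, frame_period):
    """Cut the frames inside a window of length `window` centered on a phoneme.

    frames: list of spectral vectors, one per `frame_period`.
    The window spans [(a + b) / 2 - D / 2, (a + b) / 2 + D / 2].
    """
    center = (start_label + end_label) / 2.0
    low = center - window / 2.0
    high = center + window / 2.0
    first = max(0, int(round(low / frame_period)))
    last = min(len(frames), int(round(high / frame_period)))
    return frames[first:last]


# Hypothetical phoneme labelled from 100 ms to 180 ms, D = 30 ms, 5 ms frames.
frames = [[float(i)] for i in range(60)]
partial = cut_window(frames, 100.0, 180.0, 30.0, 5.0)
```

Here the phoneme center is at 140 ms, so the window covers 125 ms to 155 ms, which is six 5 ms frames.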
[0025]
Performing this cut-out for every label section of the same phoneme yields multiple frequency-time time-series data for that phoneme. The average of the frequency-time time-series data collected for the same phoneme is computed and used as the partial frequency-time time-series data for that phoneme. Creating this for each phoneme produces the partial frequency-time time-series data group.
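The per-phoneme averaging can be sketched in Python as below. The flat-list representation of a cut-out pattern and the function name are assumptions made for illustration.

```python
def average_patterns(labelled_patterns):
    """Average the cut-out patterns of each phoneme element by element.

    labelled_patterns: iterable of (phoneme, pattern) pairs, where each
    pattern is a flat list of floats of the same length.
    Returns {phoneme: mean pattern}.
    """
    sums = {}
    counts = {}
    for phoneme, pattern in labelled_patterns:
        if phoneme not in sums:
            sums[phoneme] = [0.0] * len(pattern)
            counts[phoneme] = 0
        for index, value in enumerate(pattern):
            sums[phoneme][index] += value
        counts[phoneme] += 1
    return {p: [s / counts[p] for s in sums[p]] for p in sums}


# Hypothetical cut-out patterns for two phonemes.
means = average_patterns([
    ("a", [1.0, 2.0]),
    ("a", [3.0, 4.0]),
    ("i", [10.0, 0.0]),
])
```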
[0026]
Further, this cut-out process may be performed for each phoneme with little change, that is, for each relatively stationary phoneme.
[0027]
The principal component is obtained by the principal component analyzer 4 from this partial frequency-time time series data group.
[0028]
The operation from the time-series data of partial frequency-time to the output of the principal component by the principal component analyzer 4 will be described with reference to FIG. In FIG. 4, the time-series data of partial frequency-time is abbreviated as a pattern.
[0029]
The partial frequency-time time-series data groups cut out for phoneme A, phoneme B, ..., and phoneme Z are as shown schematically in FIG. 4(A), and the average of the partial frequency-time time-series data group is obtained for each of phonemes A to Z. The averages for phoneme A, phoneme B, ..., and phoneme Z are as shown schematically in FIG. 4(B).
[0030]
The averages of the partial frequency-time time-series data of phonemes A to Z are subjected to principal component analysis by the principal component analyzer 4, as shown schematically in FIG. 4(C). As a result, as shown schematically in FIG. 4(D), a first principal component, a second principal component, ..., and a K-th principal component (Z > K) are obtained.
[0031]
That is, principal component analysis yields as many principal components as the number of vector dimensions of the sample data space. The principal component that defines the axis along which the variance of the sample data is largest is the first principal component, the one that defines the axis with the second largest variance is the second principal component, and so on down to the K-th principal component.
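The ordering of principal components by variance can be illustrated with a small Python sketch. The text does not say how the principal components are computed; power iteration with deflation on the covariance matrix is used here purely as an assumption, and the function name and sample data are hypothetical.

```python
import math
import random


def principal_components(samples, k, iters=200, seed=0):
    """Return the k leading principal components and their variances.

    Assumed method: power iteration with deflation on the population
    covariance matrix. k should not exceed the rank of the data.
    """
    rng = random.Random(seed)
    n, dim = len(samples), len(samples[0])
    mean = [sum(s[d] for s in samples) / n for d in range(dim)]
    centered = [[s[d] - mean[d] for d in range(dim)] for s in samples]
    cov = [[sum(c[i] * c[j] for c in centered) / n for j in range(dim)]
           for i in range(dim)]

    def matvec(v):
        return [sum(cov[i][j] * v[j] for j in range(dim)) for i in range(dim)]

    components, variances = [], []
    for _ in range(k):
        v = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
        for _ in range(iters):
            w = matvec(v)
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:
                break
            v = [x / norm for x in w]
        variance = sum(a * b for a, b in zip(v, matvec(v)))
        components.append(v)
        variances.append(variance)
        # Deflation: remove the component just found from the covariance.
        cov = [[cov[i][j] - variance * v[i] * v[j] for j in range(dim)]
               for i in range(dim)]
    return components, variances


# Hypothetical 2-D samples spread mainly along the direction (1, 1).
t = [-2.0, -1.0, 0.0, 1.0, 2.0]
e = [0.1, -0.2, 0.2, -0.2, 0.1]
samples = [[ti + ei, ti - ei] for ti, ei in zip(t, e)]
components, variances = principal_components(samples, 2)
```

In this toy data the first principal component lies along (1, 1) and carries almost all of the variance, while the second lies along (1, -1).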
[0032]
Among the principal components, the low-order principal components define the eigenspace of the components that are abundantly contained in the features of the partial frequency-time time-series data group, and they represent the features of the parts most commonly contained in the frequency-time pattern based on the frequency-time time-series data of the speech signal. Components due to the individuality of the learning speakers, and noise components considered to adversely affect recognition, are therefore considered not to be contained in the low-order principal components.
[0033]
The feature extraction filter 5 performs convolution integration using, in the low-order principal components obtained by principal component analysis with the partial frequency-time patterns as sample data, the low-frequency part in the frequency axis direction and only the central part of time window D in the time axis direction. A vector used for this convolution integration is also referred to as a feature extraction vector.
[0034]
For example, with two feature extraction vectors, the vector that performs convolution integration using the low-frequency part of the first principal component vector in the frequency axis direction and only the central part of time window D in the time axis direction is called the first feature extraction vector δ1i, and the corresponding vector formed in the same way from the second principal component vector is called the second feature extraction vector δ2i.
[0035]
These first and second feature extraction vectors δ1i and δ2i are used in the feature extraction filter 5, and correlation values are obtained between the frequency-time time-series data at each time output from the frequency analyzer 2 and each of δ1i and δ2i. The correlation value output for each feature extraction vector is also referred to as a channel output. The correlation value outputs are normalized for each channel to give a two-channel filter output.
[0036]
As is apparent from the above, taking the case where the feature extraction filter 5 consists of two feature extraction vectors δ1i and δ2i as an example, as shown in FIG. 2, product-sum calculators 511 and 512 perform, at each time, the product-sum operation between the input N-order vector Xi of the frequency analysis result and the first and second feature extraction vectors δ1i and δ2i, respectively. The outputs of the product-sum calculators 511 and 512 are level-normalized by normalizers 521 and 522, respectively, and the normalized outputs of the normalizers 521 and 522 are sent out as the outputs of the respective channels.
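The product-sum and normalization stages of the feature extraction filter can be sketched in Python as follows. The text says only that each channel is level-normalized; dividing by the peak absolute value of the channel is an assumption made here, as are the function name and the sample data. Each feature extraction vector is applied to the leading (low-frequency) points of the input vector.

```python
def feature_extraction_filter(frames, feature_vectors):
    """Compress each N-point vector Xi to one value per channel.

    frames: list of N-point spectral vectors, one per unit time.
    feature_vectors: one vector per channel, covering only the first
    len(vector) low-frequency points of each frame.
    Returns one normalized time series per channel (peak normalization
    is an assumed choice).
    """
    channels = []
    for delta in feature_vectors:
        raw = [sum(x * d for x, d in zip(frame[:len(delta)], delta))
               for frame in frames]
        peak = max((abs(value) for value in raw), default=0.0)
        channels.append([value / peak for value in raw] if peak > 0.0 else raw)
    return channels


# Hypothetical 3-point frames filtered by two 2-point feature vectors.
frames = [[1.0, 0.0, 5.0], [2.0, 2.0, 5.0], [0.0, 4.0, 5.0]]
delta_1 = [1.0, 0.0]
delta_2 = [0.0, 1.0]
outputs = feature_extraction_filter(frames, [delta_1, delta_2])
```

The third point of each frame is ignored because the feature vectors cover only the first two points, mirroring the use of the low-frequency part alone.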
[0037]
Next, creation of a collation low-order compressed time series data group will be described.
[0038]
The learning speech signal of each word is supplied to the frequency analyzer 2, which creates frequency-time time-series data based on the learning speech signal. This frequency-time time-series data is supplied to the feature extraction filter 5, which is based on the low-order principal components already obtained for the phonemes in the learning speech signal group, and the feature extraction filter 5 performs dimension compression and feature extraction. Time-series data is output from each channel of the feature extraction filter 5, and this time-series data is used as the collation low-order compressed time-series data group.
[0039]
The structure of the collation low-order compressed time-series data group created in this way is shown in FIG. 5. FIGS. 5 (A), (B), and (C) show the collation low-order compressed time-series data obtained from learning speech of the same word by, for example, learning speakers a', b', and c'. For 100 words by nine speakers, 900 items of collation low-order compressed time-series data are obtained. Each element of the collation low-order compressed time-series data group consists of a pair: the uttered word name of the learning speech signal and the corresponding collation low-order compressed time-series data. The collation low-order compressed time-series data group is stored in the collation low-order compressed time-series data storage 6.
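The pairing of word names with compressed time series can be pictured with the following sketch. The names `ReferenceEntry` and `build_reference_store`, the nested-dictionary input layout, and the `compress` callback (standing in for the frequency analyzer 2 followed by the feature extraction filter 5) are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Series = List[List[float]]  # T time steps, each holding one value per channel


@dataclass(frozen=True)
class ReferenceEntry:
    """One element of the collation data group: a word name and its compressed series."""
    word: str
    series: Series


def build_reference_store(
    utterances: Dict[str, Dict[str, Series]],
    compress: Callable[[Series], Series],
) -> List[ReferenceEntry]:
    """Build one entry per (speaker, word) utterance.

    `utterances` maps speaker -> word -> frequency-time frames (hypothetical layout).
    """
    store: List[ReferenceEntry] = []
    for speaker in sorted(utterances):
        for word, frames in sorted(utterances[speaker].items()):
            store.append(ReferenceEntry(word=word, series=compress(frames)))
    return store
```

With nine learning speakers and 100 words each, the returned list has 900 entries, matching the count given in the text.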
[0040]
As described above, speech from an unspecified speaker is recognized with the collation low-order compressed time-series data group stored in the collation low-order compressed time-series data storage 6. The speech signal from the unspecified speaker is frequency-analyzed by the frequency analyzer 2 and supplied to the feature extraction filter 5, which has been obtained in advance by the feature extraction filter creation unit α based on the learning speech signal group. The feature extraction filter 5 performs dimension compression and converts the signal into time-series data.
[0041]
The time-series data based on the speech signal from the unspecified speaker is collated, in the time-series data collator 7, with the collation low-order compressed time-series data group obtained by the collation time-series data creation unit β based on the learning speech signal group. The collation low-order compressed time-series data closest to the time-series data based on the speech signal from the unspecified speaker is selected from the group, and the uttered word name paired with the selected collation low-order compressed time-series data is output as the recognition result.
[0042]
Next, the time-series data collator 7 according to the present embodiment will be described with reference to an example of collation using the DP (dynamic programming) method.
[0043]
The DP method is a matching method that performs time normalization by nonlinearly expanding and contracting the time axis between the input time-series data and each item of time-series data stored in advance. In this method, a distance after time normalization is defined between the input time-series data and each stored item of time-series data, and the stored time-series data with the smallest distance is taken to best represent the input time-series data and is used as the recognition result. In this embodiment, the DP method is applied between the time-series data based on the speech signal from an unspecified speaker and the collation low-order compressed time-series data, and the word name corresponding to the collation low-order compressed time-series data with the minimum distance after time normalization is output.
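A minimal sketch of DP matching in this sense is given below. The text does not specify the local distance, the allowed warping steps, or the form of the normalization, so the sketch assumes a Euclidean frame distance, the common three-way step pattern, and division by the summed sequence lengths. The names `dp_distance` and `recognize` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Series = Sequence[Sequence[float]]  # T time steps, each holding one value per channel


def dp_distance(a: Series, b: Series) -> float:
    """Time-normalized distance between two series by dynamic programming."""
    n, m = len(a), len(b)
    if n == 0 or m == 0:
        raise ValueError("both series must contain at least one frame")
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = math.dist(a[i - 1], b[j - 1])  # assumed Euclidean frame distance
            # Nonlinear time warping: stretch either series or advance both.
            cost[i][j] = local + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m] / (n + m)  # assumed length normalization


def recognize(query: Series, references: List[Tuple[str, Series]]) -> str:
    """Return the word name of the stored series with the smallest DP distance."""
    best_word, _ = min(references, key=lambda ref: dp_distance(query, ref[1]))
    return best_word
```

Because the warping path may repeat frames, an utterance spoken more slowly than its reference can still reach a distance of zero, which is the effect of time normalization described in the text.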
[0044]
Next, the results of an evaluation experiment based on this embodiment will be described. Here, 100 words from each of 10 speakers in the speaker certification evaluation database were used as test samples.
[0045]
The feature extraction filter 5 was created by the feature extraction filter creation unit α using the speech data of the nine speakers other than the one test speaker as the learning speech signal group. The phonemes used as samples were vowels, plosives, fricatives, and nasals. Partial frequency-time time-series data was obtained for each speaker using the partial frequency-time pattern generator 3, and the principal component analyzer 4 obtained the principal components from this partial frequency-time time-series data. From the first and second principal components, the low-frequency portion of 4.5 kHz or less in the frequency axis direction and only one unit time at the center of the time window D in the time axis direction were used to form the feature extraction vectors δ1i and δ2i. FIG. 6 shows an example of the shape of the feature extraction vectors δ1i and δ2i, where the horizontal axis represents frequency and the vertical axis represents the weighting coefficient.
[0046]
The collation low-order compressed time-series data group used in the time-series data collator 7 was obtained by the collation time-series data creation unit β, which applied the feature extraction filter 5 to the utterance data of the nine speakers other than the one test speaker as the learning speech signal group, yielding 900 items of collation low-order compressed time-series data. In the evaluation experiment, each time the test speaker was changed, the feature extraction filter 5 was obtained again and the collation low-order compressed time-series data was recreated.
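The evaluation protocol described here, in which each speaker in turn is held out as the test speaker, can be sketched as below. The function name `leave_one_speaker_out` and the speaker labels are hypothetical; rebuilding the filter and the collation data inside each round is left to the caller.

```python
from typing import Iterator, List, Sequence, Tuple


def leave_one_speaker_out(speakers: Sequence[str]) -> Iterator[Tuple[str, List[str]]]:
    """Yield (test_speaker, learning_speakers) for every choice of test speaker."""
    for i, test_speaker in enumerate(speakers):
        learning = [s for j, s in enumerate(speakers) if j != i]
        yield test_speaker, learning


# Example with ten hypothetical speaker labels: each round uses nine learning
# speakers, so 100 words per speaker give 900 collation entries per round.
for test_speaker, learning in leave_one_speaker_out([f"spk{k}" for k in range(10)]):
    entries_per_round = len(learning) * 100
```

Keeping the test speaker out of both the filter creation and the collation data is what makes each round a test on an unspecified speaker.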
[0047]
Next, a modification of the speech recognition apparatus according to the embodiment of the present invention will be described.
[0048]
As described above, among the principal components, the low-order principal components define the eigenspace of the components most strongly contained in the features of the partial frequency-time time-series data group, that is, the components most contained in the frequency-time pattern based on the frequency-time time-series data of the speech signal. The individuality of the learning speakers contained in the speech signal and the noise components considered to adversely affect recognition are thought not to be contained in the low-order principal components.
[0049]
For this reason, in this modified example, instead of the feature extraction vectors δ1i and δ2i in the feature extraction filter 5, the principal components from the first principal component, which has the largest variance, through the fourth principal component, in order of decreasing variance, may be used as the feature extraction vectors. In other words, four principal components may be used as the low-order principal components, for example, taken in order from the smallest to the largest amount of information loss.
[0050]
When the above four principal components are used as the low-order principal components, the feature extraction filter in this modified example uses, for example, the first, second, third, and fourth low-order principal component vectors δ1i ′, δ2i ′, δ3i ′, and δ4i ′ as its basis. Correlation values are obtained between the frequency-time time-series data at each time, output from the frequency analyzer 2, and the first, second, third, and fourth low-order principal component vectors δ1i ′, δ2i ′, δ3i ′, and δ4i ′. The correlation value output for each low-order principal component is also referred to as a channel output. The correlation value is normalized for each channel to obtain a 4-channel filter output.
[0051]
As is clear from the above, when the feature extraction filter in this modified example uses, for example, four low-order principal components, it operates as shown in FIG. 7. At each time, the product-sum operation units 511 ′, 512 ′, 513 ′, and 514 ′ compute the product-sum of the input Nth-order vector Xi of the frequency analysis result with the first, second, third, and fourth low-order principal component vectors δ1i ′, δ2i ′, δ3i ′, and δ4i ′, respectively. The levels of the outputs from the product-sum operation units 511 ′, 512 ′, 513 ′, and 514 ′ are normalized by the normalizers 521 ′, 522 ′, 523 ′, and 524 ′, respectively, and the normalized outputs from the normalizers 521 ′, 522 ′, 523 ′, and 524 ′ are sent out as the outputs of the respective channels.
[0052]
Next, creation of a collation low-order compressed time series data group will be described.
[0053]
The learning speech signal of each word is supplied to the frequency analyzer 2, which creates frequency-time time-series data based on the learning speech signal. This frequency-time time-series data is supplied to the feature extraction filter 5, which is based on the low-order principal components already obtained for the phonemes in the learning speech signal group, and the feature extraction filter 5 performs dimension compression and feature extraction. Time-series data is output from each channel of the feature extraction filter 5, and this time-series data is used as the collation low-order compressed time-series data.
[0054]
The structure of the collation low-order compressed time-series data in this modified example, created as described above, is shown in FIG. 8. FIGS. 8 (A), (B), (C), and (D) show the collation low-order compressed time-series data obtained from learning speech of the same word by, for example, learning speakers a', b', c', and d'. For 100 words by nine speakers, 900 items of collation low-order compressed time-series data are obtained. Each element of the collation low-order compressed time-series data group consists of a pair: the uttered word name of the learning speech signal and the corresponding collation low-order compressed time-series data. The collation low-order compressed time-series data group is stored in the collation low-order compressed time-series data storage 6.
[0055]
Except for the shapes of the feature extraction vectors δ1i and δ2i shown in FIG. 6, the rest is the same as in the speech recognition according to the embodiment of the present invention described above.
[0056]
FIG. 9 shows the speech recognition results for two cases: the case where, in the speech recognition apparatus 1 according to the embodiment of the present invention described above, the feature extraction vectors δ1i and δ2i shown in FIG. 6 are used in the feature extraction filter 5 set to 2 channels, and the case where the feature extraction filter set to 3 channels using the four low-order principal component analysis results is used.
[0057]
In FIG. 9, a indicates the recognition result for the former case, that is, when two channels are set and the feature extraction vectors shown in FIG. 6 are used in the feature extraction filter 5, and b indicates the recognition result for the latter case, that is, when the feature extraction filter set to 3 channels using the four low-order principal component analysis results is used. Both give good recognition results, but the former is better still.
[0058]
For the principal component analysis, the frequency range and the time window D cut out in the frequency analysis are 0 to 8 kHz and 30 msec wide. In the modified example, the frequency range is 0 to 8 kHz (32 points) and the time window D is 30 msec (5 msec × 6 points). In the embodiment, the frequency range is 0 to 4.5 kHz (18 points) and the time window D is 5 msec (1 unit time), so the frequency range is approximately 1/2 and the time width is 1/6 of those in the modified example. This cuts out the portion of the frequency range and time width that is steady and stable for speech recognition, because it was found that features with sufficiently low speaker dependency can be extracted even from the range of 0 to 4.5 kHz and 5 msec.
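The reduction from a full principal component to a feature extraction vector of the embodiment can be pictured as a simple slice. The sketch below assumes the principal component is held as 6 time frames of 32 frequency weights ordered from low to high frequency, and that the central frame is the one at index `len(component) // 2`. Neither the memory layout nor the choice of central frame in an even-length window is stated in the text, and the name `to_feature_vector` is hypothetical.

```python
from typing import List, Optional, Sequence


def to_feature_vector(
    component: Sequence[Sequence[float]],
    n_low: int = 18,
    center: Optional[int] = None,
) -> List[float]:
    """Keep the n_low lowest-frequency weights of the central time frame.

    `component` is a principal component arranged as time frames x frequency
    points (assumed layout). With 32 points spanning 0 to 8 kHz, 18 points
    correspond to 0 to 4.5 kHz.
    """
    if center is None:
        center = len(component) // 2  # assumed choice of the central unit time
    frame = component[center]
    if n_low > len(frame):
        raise ValueError("n_low exceeds the number of frequency points")
    return list(frame[:n_low])
```

The resulting vector has 18 weights, so filtering one frame against it needs 18 multiplications per channel.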
[0059]
Therefore, the amount of calculation required to extract speech features per unit time is as follows. In the embodiment, there are 18 points on the frequency axis and 1 point on the time axis, so a total of 18 multiplications are required. In the modified example, there are 32 points on the frequency axis and 6 points on the time axis, so a total of 192 multiplications are required. In the embodiment, the amount of calculation for low-order compression per unit time is thus reduced to about 1/10.6 of that in the modified example, so the amount of calculation is greatly reduced while a speech recognition result equal to or better than that of the modified example is obtained.
[0060]
Furthermore, since the number of channels used is 2 in the embodiment and 3 in the modified example, the storage capacity of the memory for storing the reference time-series vectors can be reduced to 2/3.
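The figures quoted above can be checked with a few lines of arithmetic. The helper name `multiplications` is hypothetical, and the counts are per channel and per unit time, as in the text.

```python
def multiplications(freq_points: int, time_points: int) -> int:
    """Multiplications needed for one product-sum over a frequency-time patch."""
    return freq_points * time_points


embodiment = multiplications(18, 1)    # 18 frequency points x 1 time point
modified = multiplications(32, 6)      # 32 frequency points x 6 time points
compute_ratio = modified / embodiment  # how many times more work the modified example needs

# Reference storage scales with the number of channels kept per time step.
memory_ratio = 2 / 3                   # 2 channels (embodiment) vs 3 channels (modified example)
```

The ratio of 192 to 18 is about 10.7, in line with the reduction to roughly 1/10.6 stated in the text.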
[0061]
[Effects of the invention]
As described above, according to the speech recognition apparatus of the present invention, the calculation for feature extraction and the processing for collation are simple, so the configuration is simple, and speech can be recognized with little misrecognition despite the diversity of utterances from unspecified speakers. Furthermore, according to the present invention, the amount of calculation required for speech feature extraction, the amount of calculation required for collation, and the memory capacity for storing the reference time-series vectors can all be reduced while good speech recognition characteristics are maintained.
[Brief description of the drawings]
FIG. 1 is a schematic block diagram showing a configuration of a speech recognition apparatus according to an embodiment of the present invention.
FIG. 2 is a block diagram showing a configuration of a feature extraction filter in the speech recognition apparatus according to the embodiment of the present invention.
FIG. 3 is a schematic diagram for explaining the operation of the partial frequency-time pattern generator in the speech recognition apparatus according to the embodiment of the present invention.
FIG. 4 is a schematic diagram for explaining the operation of the partial frequency-time pattern generator and the principal component analyzer in the speech recognition apparatus according to the embodiment of the present invention.
FIG. 5 is a schematic diagram showing an example of the structure of collation low-order compressed time-series data in the speech recognition apparatus according to the embodiment of the present invention.
FIG. 6 is a diagram showing a feature extraction vector of a feature extraction filter in the speech recognition apparatus according to the embodiment of the present invention.
FIG. 7 is a block diagram showing another configuration of the feature extraction filter in the modified example of the speech recognition apparatus according to the embodiment of the present invention.
FIG. 8 is a schematic diagram showing an example of the structure of low-order compressed time-series data for verification in a modification of the speech recognition apparatus according to the embodiment of the present invention.
FIG. 9 is a characteristic diagram showing a speech recognition result by the speech recognition apparatus according to the embodiment of the present invention.
[Explanation of symbols]
α Feature extraction filter creation unit
β Collation time-series data creation unit
γ Unspecified-speaker speech recognition unit
1 Speech recognition apparatus
2 Frequency analyzer
3 Partial frequency-time pattern generator
4 Principal component analyzer
5 Feature extraction filter
6 Collation low-order compressed time-series data storage
7 Time-series data collator

Claims

1. A speech recognition apparatus comprising: frequency analysis means for sequentially obtaining, along the time axis, frequency spectra obtained by frequency analysis of a speech signal and converting them into a time-series data group; cut-out means for cutting out, with a predetermined time window, output time-series data from the frequency analysis means to which speech signals based on speech uttered by a plurality of learning speakers are input; principal component analysis means for performing principal component analysis using the time-series data group cut out by the cut-out means; and feature extraction filter means for compressing input time-series data into low-order time-series data by performing convolution integration using a low-frequency portion and a central portion of the time window in the principal components obtained by the principal component analysis, wherein low-order time-series data based on speech uttered by the plurality of learning speakers is used as reference low-order time-series data, the reference low-order time-series data is collated with low-order time-series data based on speech uttered by an unspecified speaker, and speech recognition is performed based on the collation result.

2. The speech recognition apparatus according to claim 1, wherein the feature extraction filter means uses, as a basis, low-order principal components among the principal components obtained by the principal component analysis.

3. The speech recognition apparatus according to claim 1, wherein the reference low-order time-series data is low-order time-series data obtained by supplying, to the feature extraction filter means, output time-series data from the frequency analysis means to which speech signals based on speech uttered by the plurality of learning speakers are input, and compressing it by the feature extraction filter means.

4. The speech recognition apparatus according to claim 1, further comprising storage means for storing, as reference time-series data, low-order time-series data obtained by supplying, to the feature extraction filter means, output time-series data from the frequency analysis means to which speech signals based on speech uttered by the plurality of learning speakers are input, and compressing it, wherein speech recognition is performed by collating low-order time-series data based on speech uttered by an unspecified speaker with the reference time-series data read from the storage means.

5. The speech recognition apparatus according to claim 1, wherein the low-order time-series data based on speech uttered by an unspecified speaker is low-order time-series data obtained by supplying, to the feature extraction filter means, output time-series data from the frequency analysis means to which a speech signal based on speech uttered by the unspecified speaker is input, and compressing it by the feature extraction filter means.

6. The speech recognition apparatus according to claim 1, wherein the cut-out means cuts out time-series data for each identical phoneme of the plurality of learning speakers and creates average time-series data of the plurality of learning speakers.

7. The speech recognition apparatus according to claim 1, wherein the cut-out means cuts out time-series data for each relatively stationary phoneme.