JP3798681B2

JP3798681B2 - Speech spectrum estimation method, apparatus thereof, program thereof, and recording medium thereof

Info

Publication number: JP3798681B2
Application number: JP2001348735A
Authority: JP
Inventors: 清明相川; 健太郎石塚
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2001-11-14
Filing date: 2001-11-14
Publication date: 2006-07-19
Anticipated expiration: 2021-11-14
Also published as: JP2003150191A

Description

【０００１】
【発明の属する技術分野】
この発明は、音声から例えばその音声認識や音声符号化のために用いるスペクトルを求める方法、その装置、そのプログラム及びその記録媒体に関する。
【０００２】
【従来の技術】
従来から、音声認識においては、音の特徴を表すのに、スペクトルが用いられてきた。スペクトルから求められる、ケプストラム、ＭＦＣＣ（Mel Frequency Cepstrum Coefficient）などもこの範疇に含む。スペクトルを求めるには、２０ｍｓから４０ｍｓ程度の一定長の時間窓の中にはいる音声波形全体をフーリエ変換、あるいは線形予測分析を行なうことにより求めてきた［例えば、古井貞煕、ディジタル音声処理、東海大学出版会、１９８５］。
音声は声帯が開閉して生じるパルス状の音が声道や口腔を通る時にその音響的な周波数成分伝達特性により形成される。その声帯の開閉周期は２ｍｓ程度から１０ｍｓ程度であるので、上記の２０から４０ｍｓのスペクトル分析区間には通常複数の音声波形の繰り返しが含まれる。従来法では、この区間を１つの時系列信号として扱い、その区間での平均的なスペクトルを求める。
【０００３】
音声スペクトルを分析する区間に音声の他に雑音が含まれていた場合には、その影響がスペクトルに現れる。この区間をいくつかの小区間に分けたとすると、この区間全体での総雑音パワーは、小区間のパワーの和となる。ｉ時点から始まる区間において、ｍ番目の長さＬ_mの小区間のｋ時点目の信号をｓ_m(ｋ）とすると、ｉ時点から始まるｍ番目の長さＬ_mの小区間の平均パワーＰ（ｉ，ｍ）は、
Ｐ（ｉ，ｍ）＝（１／Ｌ_m）Σ_k=1 ^Lmｓ_m(ｋ）
により与えられる。ｉ時点から始まる区間のパワーは、その区間の小区間数をＭとすると、
Σ_m=1 ^MＬ_mＰ（ｉ，ｍ）であり、その区間の平均パワーＰ（ｉ）は
Ｐ（ｉ）＝Σ_m=1 ^MＬ_mＰ（ｉ，ｍ）／Σ_m=1 ^MＬ_m
により与えられる。この式は、信号の種類によらず成り立つ。
【０００４】
すなわち、従来法では、小区間に分けても、小区間の間で、演算が行なわれるわけではないので、雑音は音声信号、周期性雑音、非周期性雑音のすべてについて、分析区間内のすべての小区間のパワーが加算され、特定の信号が減衰したり増加したりすることは無い。また、区間内の信号パワーと雑音パワーの比率は小区間の分割方法を変えても一定である。
【０００５】
【発明が解決しようとする課題】
一方音声信号の内、声帯開閉によって生じる波形は、声帯の開閉が規則的ではないため、位相がずれ、信号パワーは変わらないものの、スペクトル形状がぼけ、ホルマントと呼ばれるエネルギーが集中した周波数成分のピークが低くなり、雑音に埋もれやすくなるという欠点もあった。つまり音声スペクトルが雑音に埋もれ、正しく音声スペクトルを推定することが困難な場合があった。
この発明の目的は雑音の影響を低減できる音声スペクトル推定方法、その装置、そのプログラム及びその記録媒体を提供することにある。
【０００６】
【課題を解決するための手段】
この発明によれば、音声スペクトルを求める区間中の音声波形の繰り返しの単位波形を複数個取り出し、その繰り返し波形をちょうど波形が合うようにして加算し、その加算した１周期分の波形のスペクトルを推定する。
この構成によれば例えば図１に示す音声波形の一分析区間において、繰り返しの単位波形１−１〜１−４がその各繰り返しの開始点Ｓｔ−１〜Ｓｔ−４を一致させて同じ波形が加算されるので繰り返し位置Ｓｔ−１〜Ｓｔ−４の不規則性によるスペクトルのぼけが防止できる。一方、以下の理由で、音声波形１に重畳する雑音２は抑制される。
【０００７】
（１）目的の音声以外の信号２の周期成分のうち、音声波形１と繰り返しの周期がずれたものは、音声の単位波形１−１〜１−４の加算にともなってこれらと重畳している音声以外の信号２−１〜２−４が互いに加算され、これらの信号２−１〜２−４は色々な位相で加算されるため、１周期の波形を複数個加算平均することにより、位相が打ち消されて減衰する。極端な例としては、雑音成分が隣り合う小区間の間で位相が反転していた場合には、完全に打ち消される。
（２）音声以外の周期成分のうち、音声と繰り返し周期がほぼ同一で一定のものも減衰する。
【０００８】
なぜならば、音声の繰り返し周期の位置ずれ、つまり繰り返しの開始点Ｓｔ−１〜Ｓｔ−４の等間隔分割点に対するずれを補正して加算するため、同一繰り返し周期の雑音成分はその波形が位相をずらして加算され、互いに打消されるようになる。音声は位相を同期させて加算されるのでＮ個加算すれば振幅がＮ倍、エネルギーはＮ^２倍になる。これに対し雑音はＮ個の雑音に相関がなければエネルギーはＮ倍となり、信号対雑音比はＮ倍改善される。
【０００９】
【発明の実施の形態】
図２にこの発明の実施形態の手順を示し、図３にこの実施形態を実行するこの発明の音声スペクトル推定装置の機能構成例を示す。
ステップＳ１で入力端子１１より入力された音声信号をＡ／Ｄ変換器１２によりディジタル音声信号に変換して記憶部１３に格納する。
ステップＳ２で記憶部１３から音声信号を２０〜４０ｍｓ、例えば３０ｍｓの分析区間Ｔづつ切り出し、切り出した音声信号が有声／無声判別部１４により有声音か無声音か、つまり声帯の開閉による周期性があるか、周期性がない摩擦性子音あるいは無音部分かの判別をする。この判別は従来より用いられている手法によればよい。
【００１０】
この判別結果が有声音であれば、ステップＳ４で単位波形検出部１５により、複数の繰り返しの単位の波形、例えば図１中の単位波形１−１〜１−４を検出する。ステップＳ５でこれら検出した単位波形１−１〜１−４を、互いに最も形状が合うように時間をずらして、つまり各単位波形１−１〜１−４の繰り返しの開始点Ｓｔ−１〜Ｓｔ−４を一致させて同位相加算部１６により加算する。
なおステップＳ４で単位波形を１つ検出するごとにステップＳ５で同位相加算してステップＳ４に戻るようにしてもよい。
ステップＳ６でこの加算した単位位相波形のスペクトルをスペクトル推定部１７により推定する。このスペクトル推定は、線形予測分析法、高速フーリエ変換による方法、バンドパスフィルタによる方法など従来行われている各種の手法を用いることができる。
【００１１】
ステップＳ３でその分析区間が無声音と判別された場合は例えばステップＳ７で無声音スペクトル推定部１８において分析区間Ｔを複数の小区間に分割し、ステップＳ８でこれら小区間の波形加算平均して、ステップＳ６に移り、スペクトルを推定する。この場合その分割小区間の長さは必ずしも同一としなくてもよい。あるいは図２中に破線で示すように音声音区間であればステップＳ６に移り、その分析区間ＴについてステップＳ６でスペクトル推定してもよい。
次に図２中のステップＳ４における単位波形検出処理の具体例を説明する。つまり単位波形検出部１５における処理を説明する。例えば図４に示すように切り出し区間Ｔを、その始点ｉ＝０から長さｎmin の小区間ＳＴ−１とその隣接する小区間ＳＴ−２との両波形の相互相関をとり、この小区間ＳＴ−１とＳＴ−２の長さを音声波形の１サンプル周期１／ｆ_sずつ長くしながら両小区間ＳＴ−１とＳＴ−２の波形の相互相関をとることを小区間ＳＴ−１とＳＴ−２の各長さがｎmax となるまで行い、これら２つの小区間ＳＴ−１とＳＴ−２の両波形の相互相関が最大となる小区間のｉ＝０〜ｎｘを求めてこの小区間の波形を１つの単位波形とする。なおｉ＝０は音声波形の立上りと一致するようにしておく。
【００１２】
このことを式を用いて説明すると以下のようになる。ｉ番目の音声波形サンプルをｓ（ｉ）とする。ｉ₁を始点とする音声波形において、隣接する長さｎサンプルの波形の相互相関ｃ（ｉ₁，ｎ）が次式により計算される。
【００１３】
【数１】

【００１４】
ｎを取り扱う最小の繰り返し周期から最大の繰り返し周期まで変化させて、式（１）の相互相関の最大値を求める。繰り返し周期の最小値ｎmin と最大値ｎmax は、人間の音声の声帯振動周波数、すなわち基本周波数あるいはピッチの最小値ｆmin と最大値ｆmax 、および、サンプリング周波数ｆ_sからそれぞれ式（２）、（３）により求める。
ｎmin ＝ｆ_s／ｆmax （２）
ｎmax ＝ｆ_s／ｆmin （３）
このようにしてｎをｎmin からｎmax で変化させた時の式（１）の相互相関ｃ（ｉ，ｎｘ）が最大となるときのｎの値をｎｘとすると次の相互相関分析開始点を
ｉ₂＝ｉ₁＋ｎｘ
とおいて、通常の音声スペクトル分析区間Ｔの範囲中で、順次繰り返し、波形を１つずつ切り出す。
【００１５】
こうして得られた分析区間Ｔ内の複数個の単位波形を、ステップＳ５で時間的な位置ずれが無いように時間をずらせて加算するが、その具体的手法を以下に説明する。
１つの音声スペクトル分析区間で得られるｍ番目の繰り返し波形の始端をｉ_mとする。この開始点ｉ_mは図１に示したように必ずしも複数の繰り返し波形の対応する時点になっているとは限らない。ｍ番目の繰り返し波形の開始点の位置ずれをｊ_mとすると、ｑまで加算された繰り返し波形（単位波形）ｒ_ｑ(ｋ）は、次式で与えられる。ｋは時点を表す。
【００１６】
ｒ_ｑ(ｋ）＝Σ^ｑ _m=1ｓ（ｉ_m＋ｊ_m＋ｋ）
ｑ＋１番目の単位波形の位置ずれｊ_ｑ ₊₁は、ｑ個の単位波形を加算した波形ｒ_q(ｋ）と、単位波形ｓ（ｉ_ｑ ₊₁＋ｊ_x＋ｋ）との相互相関ｇ（ｊ_x）（式（４））がもっとも高くなるｊ_xとして求めることができる。Ｍは単位波形の最大長を表わす。
【００１７】
【数２】

【００１８】
このようにして位置ずれｊ_xを用いてｑ＋１番目の単位波形をそれまでの加算波形に加算した、つまり新しい波形を取り込んだ１周期分の波形は、
ｒ_ｑ ₊₁(ｋ）＝ｒ_ｑ(ｋ）＋ｓ（ｉ_ｑ ₊₁＋ｊ_x＋ｋ）
により与えられる。なお音声スペクトルのエネルギーも求める場合は、分析区間Ｔごとに加算した単位波形を、その単位波形の数で割算して平均波形とする。
また前述したように単位波形を検出することはそれまでに得られた単位波形に開始点を合せて加算するようにしてもよい。
以上の音声スペクトル推定の具体的処理手順の例を図５に示す。取り扱う繰り返し周期の最小値ｎmin 及び最大値ｎmax 、分析区間Ｔを単位波形検出部１５内の記憶部１５ａに予め格納しておき、また単位波形検出部１５内にパラメータｉの格納レジスタ１５ｂ、パラメータｎの格納レジスタ１５ｃ、ｃ（ｉ，ｎｘ），ｎｘの記憶部１５ｄが設けられる。図３に示した音声スペクトルに推定装置中にバッファメモリ１９が設けられる。ｉ番目の音声波形サンプルをｓ（ｉ）とする。
【００１９】
まずステップ２１で波形蓄積バッファメモリ１９の記憶内容を初期化し、ステップ２２でｉ＝０とし、ステップ２３でｎ＝ｎmin とする。
次にステップ２４で隣接する長さｎサンプルの波形の相互相関ｃ（ｉ，ｎ）を式（１）により計算する。
ステップ２５でその相互相関ｃ（ｉ，ｎ）と、記憶部１５ｄ内のそれまでの最大相互相関ｃ（ｉ，ｎｘ）を比較し、ｃ（ｉ，ｎ）＞ｃ（ｉ，ｎｘ）であれば、そのｃ（ｉ，ｎ）に記憶部１５ｄの最大相互相関ｃ（ｉ，ｎｘ）を更新し、かつ、その繰り返し周期ｎｘも更新保持する。
【００２０】
ステップ２６でｎを＋１し、ステップ２７でｎがｎmax を超えたかを調べ超えていなければステップ２４に戻る。
ステップ２４でｎがｎmax を超えていれば、その時の記憶部１５ｄ内の周期ｎｘが求める１つの単位波形の周期であって、その時のｋ＝１〜ｋ＝ｎｘの音声波形ｓ（ｉ＋ｋ）が単位波形である。この例では単位波形が得られるごとに、それまでの加算した単位波形に求めた単位波形を位置を合わせて加算するようにした場合である。
つまり、ステツプ２７でｎがｎmax を超えていれば、ステップ２８で波形蓄積バッファメモリ１９内の波形と、現に求めた単位波形とを位置を合わせて加算して、波形蓄積バッファメモリ１９内の加算単位波形を更新する。この位置合せの具体的処理例については後で説明する。
【００２１】
ステップ２９でｉをｉ＋ｎｘに更新し、ステップ３０でｉが分析区間Ｔを超えたかを調べ、超えていなければ、ステップ２３に戻って、次の単位波形の検出を同様に行い、検出した単位波形を位置合せ加算する。ステップ３０でｉがＴを超えていれば終了する。
次にステップ２８の具体的処理例を図５を参照して説明するが、図２中の同位相加算部１６内の記憶部１６ａに、波形蓄積バッフアメモリ１９の格納最大長Ｌ、１単位波形の最大長Ｍ、最大ずらし幅Ｊが予め格納され、またパラメータｊのレジスタ１６ｂ、ｊ_x記憶部１６ｃが設けられているものとする。なお最大ずらし幅Ｊは例えば±１ミリ秒程度であり、音声波形のサンプル周期１／ｆ_sが０．１ｍｓの場合Ｊ＝±１０である。
【００２２】
ステップ３１で検出した単位波形ｓ（ｋ），（ｋ＝１，２，…，Ｍ）が１回目であるかを調べ、１回目であればステップ３２でその単位波形ｓ（ｋ），（ｋ＝１，２，…，Ｍ）をｒ（ｋ）としてバッファメモリ１９に格納する。
ステップ３１で検出単位波形ｓ（ｋ）が１回目でなければ、ステップ３３でその波形ｓ（ｋ）をｊ＝−Ｊだけ時間的位置をずらし、ステップ３４でその位置をずらした波形ｓ（ｊ＋ｋ）とバッファメモリ１９内の波形ｒ（ｋ）との相互相関ｇ（ｊ）を式（４）により計算する。式（４）の計算を正しく行うためにはｒ（ｋ）としても常にｋ＝−Ｊだけ余分に、またｋ＝Ｍ＋Ｊだけ余分に取り込んでおく必要がある。これらの長さは単位波形長と比較してわずかであるから、図５中のステップ３４に示すようにｋの負の値は省略し、またバッファメモリの格納容量長ＬとＭ−ｊの小さい方の値よりｋが大きい部分は省略して、ｋに関する各積の総和を求めている。
【００２３】
ステップ３５でこのようにして求めた相互相関ｇ（ｊ）と、記憶部１６ｃ内の相互相関ｇ（ｊ_x）とを比較し、ｇ（ｊ）＞ｇ（ｊ_x）であれば、記憶部１６ｃのｇ（ｊ_x）をそのｇ（ｊ）で更新し、かつ記憶部１６ｃのｊ_xをその時のｊで更新する。
ステップ３６でｊを＋１し、ステップ３７でｊがＪを超えたかを調べ、超えていなければステップ３４に戻る。このようにして検出した単位波形ｓ（ｋ）を−Ｊから１サンプル周期ずつ順次ずらしてバッファメモリ１９内の波形ｒ（ｋ）との相互相関をとり、その相互相関の大きいものとｊの値を残すことにより、ステップ３７でｊがＪを超えたと判断された時の、ｊ_xの値が、ｓ（ｋ）のｒ（ｋ）に対する位置ずれとなる。
【００２４】
ステップ３８でバッファメモリ１９に格納されている波形ｒ（ｋ）は位置（位相）合せした単位波形ｓ（ｊ_x＋ｋ）を加算して、バッファメモリ１９内の波形ｒ（ｋ）（ｋ＝１，２，…，Ｍ）を更新する。
図３に示した音声スペクトル推定装置の各部は制御部２０より順次制御される。この音声スペクトル推定装置はコンピュータによりプログラムを実行させて機能させることもできる。この場合は図２、図４及び図５に示した方法をコンピュータに実行させるための音声スペクトル推定プログラムを、コンピュータにＣＤ−ＲＯＭ、可撓性磁気ディスクなどからインストールし、又は通信回線を介してダウンロードして実行させればよい。コンピュータに実行させる場合は、例えば図３において、制御部２０がマイクロプロセッサ又はＣＰＵであり、有声／無声判別部１４は有声／無声判別サブルーチン、単位波形検出部１５は単位波形検出サブルーチン、同位相加算部１６は同位相加算サブルーチンをそれぞれ格納した記憶部であり、これらサブルーチンを用いて全体としての音声スペクトル推定プログラムを構成するためのプログラムが、図に示していないプログラムメモリに格納されることになる。
【００２５】
図１に示したように、周期性のある音声波形の繰り返しを小区間に切り離し、最も形状が一致するように時間をずらして加算すると、音声波形に同期していない雑音成分は、小区間の間で位相が異なるため打ち消しあって減衰する。この効果は音の高さ（ピッチ）が一見一定であるように見える部分でも、音声のピッチには自然な揺らぎがあり、これを補償するために時間をずらして加算することになる。従って、音声以外の信号の位相をずらして加算することになるため、音声以外の信号は減衰が起こり、雑音成分を抑制する効果がある。特に、イントネーションの変化が大きいところでは、雑音成分に周期性がある場合に、雑音の位相を大きくずらして加算できるので、雑音抑制効果は大きい。
【００２６】
雑音成分が周波数成分の位相がランダムに変化する白色雑音などの場合には、長い区間を一括してスペクトル分析した場合でも、短区間に分けて加算しても雑音成分のエネルギーの期待値は変わらないが、音声信号成分は位相を合わせて加算することにより、スペクトルのぼけを防ぐことができるので、雑音の影響は軽減できる。
実施例（１）
図７はこの発明の方法を用いたＬＰＣ分析による音声認識装置の実施例である。
【００２７】
１０１は、ＡＤ変換部で入力音声波形は、例えば、サンプリングレート１１０２５ｋＨｚでサンプルされる。
１０２は、スペクトル分析区間切出部で、連続的な音声信号から３０ｍｓ程度の音声を切り出す。
１０３は、この発明方法中の雑音抑圧部、つまり図２中のステップＳ６のスペクトル推定処理を除いた処理部分として図５に示した処理手順に従って、繰り返しの単位波形を位相を合わせて加算した波形を求める。
１０４は、自己相関係数計算部で、１６次程度の自己相関係数を出力する。
【００２８】
１０５は、線形予測分析部で、自己相関係数から、１６次程度の線形予測係数を求める。
１０６は、ケプストラム分析部で、線形予測係数から、１６次程度のケプストラムを求める。
１０７は、例えばＨＭＭ（Hidden Markov Model：隠れマルコフモデル）を用いた音声認識部で、スペクトル分析が１ステップ行なわれるたびに、１ステップ音声認識が進む。
１１１は、ケプストラム係数を用いて、音声認識の音響モデルをＨＭＭとして作成する。
【００２９】
１１２は、ＨＭＭ音響モデルを蓄積する。
１０８は、音声の終了を判定する。
１１０は、認識結果を表示する。
実施例（２）
図８はこの方法を用いたＦＦＴ（高速フーリエ変換）による音声認識装置の実施例である。
２０１は、ＡＤ変換部で、音声波形は、例えば、サンプリングレート１１０２５ｋＨｚでサンプルされる。
【００３０】
２０２は、スペクトル分析区間切出部で、連続的な音声信号から３０ｍｓ程度の音声を切り出す。
２０３は、この発明方法中の雑音抑圧部、つまり図２中のステップＳ６のスペクトル推定処理を除いた処理部分として図６に示した処理手順に従って、繰り返し単位波形を位相を合わせて加算した波形を求める。
２０４は、例えば２５６次のＦＦＴ（Fast Fourier Transform：高速フーリエ変換）を行なう。なお分析区間中の加算した単位波形以外の部分は０を詰めてＦＦＴを行う。
【００３１】
２０５は、対数スペクトルを求める。
２０６は、対数スペクトルの逆フーリエ変換を行ない、
２０７は、例えば１６次のケプストラム係数を得る。
２０８は、例えばＨＭＭ音声認識部で、スペクトル分析が１ステップ行なわれるたびに、１ステップ音声認識が進む。
２１１は、ケプストラム係数を用いて、音声認識の音響モデルをＨＭＭとして作成する。
２１２は、ＨＭＭ音響モデルを蓄積する。
【００３２】
２０９は、音声の終了を判定する。
２１０は、認識結果を表示する。
この発明は音声認識のみならず、スペクトルを求めて音声を符号化する場合などにも適用できる。
【００３３】
【発明の効果】
以上述べたようにこの発明によれば、スペクトルのぼけがあっても、そのぼけを少なくし、雑音の影響を軽減できる。
例えばこの発明を音声認識に適用して、ＨＭＭによる音素認識実験の結果、信号対雑音パワー比ＳＮＲが２０ｄＢのピンク雑音環境下において、２３音素認識率を４５．５％から５８．５％に、母音認識率を、９１．６％から９４．８％に改善できた。
【図面の簡単な説明】
【図１】音声波形とその繰り返しの時間ずれと、その時間ずれを補償して加算することによる雑音抑制を示す図。
【図２】この発明の実施形態を示す流れ図。
【図３】この発明の音声スペクトル推定装置の機能構成例を示す図。
【図４】この発明において単位波形を検出する手順を説明するための図。
【図５】この発明方法における単位波形検出部分の具体的手順の例を示す流れ図。
【図６】図５中のステップ２８の単位波形の位置合せ加算の具体的処理手順の例を示す流れ図。
【図７】この発明方法を適用し、ＬＰＣ分析を用いた音声認識の処理手順の例を示す流れ図。
【図８】この発明方法を適用し、ＦＦＴを用いた音声認識の処理手順の例を示す流れ図。
[0001]
[Technical field to which the invention belongs]
The present invention relates to a method, an apparatus, a program, and a recording medium for obtaining, from speech, a spectrum used for purposes such as speech recognition or speech coding.
[0002]
[Prior art]
Conventionally, spectra have been used in speech recognition to represent the characteristics of sounds; this category also includes quantities derived from the spectrum, such as the cepstrum and MFCC (Mel Frequency Cepstrum Coefficient). The spectrum has been obtained by applying a Fourier transform or linear prediction analysis to the entire speech waveform within a fixed-length time window of about 20 ms to 40 ms [for example, Sadaoki Furui, Digital Speech Processing, Tokai University Press, 1985].
Speech is formed when the pulse-like sound produced by the opening and closing of the vocal cords passes through the vocal tract and oral cavity and is shaped by their acoustic frequency transfer characteristics. Since the opening/closing period of the vocal cords is about 2 ms to about 10 ms, a spectrum analysis section of 20 to 40 ms usually contains several repetitions of the speech waveform. The conventional method treats this section as a single time-series signal and obtains the average spectrum over it.
[0003]
If noise is included along with speech in the section over which the speech spectrum is analyzed, its effect appears in the spectrum. If this section is divided into several small sections, the total noise power over the whole section is the sum of the powers of the small sections. In the section starting at time i, let s_m(k) be the signal at the k-th time point of the m-th small section, whose length is L_m. The average power P(i, m) of the m-th small section of length L_m starting at time i is given by
P(i, m) = (1 / L_m) \sum_{k=1}^{L_m} s_m(k)
The power of the section starting at time i, where M is the number of small sections in that section, is \sum_{m=1}^{M} L_m P(i, m), and the average power P(i) of the section is given by
P(i) = \sum_{m=1}^{M} L_m P(i, m) / \sum_{m=1}^{M} L_m
This equation holds regardless of the type of signal.
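The point that the section average does not depend on how the section is divided can be checked numerically. The following is a minimal sketch, not part of the patent; it assumes the per-sample power is the squared sample value, and the function names are hypothetical.

```python
def average_power(samples):
    """Average power of one small section (assumption: power = squared sample)."""
    return sum(v * v for v in samples) / len(samples)


def section_average_power(samples, lengths):
    """Length-weighted average of the small-section powers, i.e. P(i)."""
    if sum(lengths) != len(samples):
        raise ValueError("small-section lengths must cover the whole section")
    total, start = 0.0, 0
    for length in lengths:
        sub = samples[start:start + length]
        total += length * average_power(sub)   # L_m * P(i, m)
        start += length
    return total / sum(lengths)
```

Whatever list of lengths is passed, the result equals `average_power` of the whole section, which is why subdividing alone cannot change the signal-to-noise ratio.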
[0004]
That is, in the conventional method, even if the section is divided into small sections, no computation is performed between the small sections. For the speech signal, periodic noise, and non-periodic noise alike, the powers of all the small sections in the analysis section are simply added, and no particular signal is attenuated or amplified. Moreover, the ratio of signal power to noise power in the section stays constant however the section is divided.
[0005]
[Problems to be solved by the invention]
On the other hand, within the speech signal, the waveform generated by the opening and closing of the vocal cords shifts in phase because the vocal cords do not open and close regularly. Although the signal power does not change, the spectral shape is blurred, the peaks of the frequency components where energy is concentrated, called formants, become lower, and the speech is easily buried in noise. That is, there are cases where the speech spectrum is buried in noise and it is difficult to estimate it correctly.
An object of the present invention is to provide a speech spectrum estimation method, apparatus, program, and recording medium that can reduce the influence of noise.
[0006]
[Means for Solving the Problems]
According to the present invention, a plurality of repeating unit waveforms are extracted from the speech waveform in the section for which the speech spectrum is to be obtained, the repeated waveforms are added so that they line up exactly, and the spectrum of the summed one-period waveform is estimated.
With this configuration, for example in one analysis section of the speech waveform shown in FIG. 1, the repeating unit waveforms 1-1 to 1-4 are added as identical waveforms with their repetition start points St-1 to St-4 aligned, so the spectral blurring caused by irregularity of the repetition positions St-1 to St-4 can be prevented. Meanwhile, the noise 2 superimposed on the speech waveform 1 is suppressed for the following reasons.
[0007]
(1) Among the periodic components of the signal 2 other than the target speech, those whose repetition period differs from that of the speech waveform 1 are attenuated. As the speech unit waveforms 1-1 to 1-4 are added, the non-speech signals 2-1 to 2-4 superimposed on them are also added to one another. Because these signals 2-1 to 2-4 are added at various phases, summing and averaging a plurality of one-period waveforms makes their phases cancel. As an extreme example, a noise component whose phase is inverted between adjacent small sections is canceled completely.
(2) Among the periodic components other than speech, those whose repetition period is constant and substantially the same as that of the speech are also attenuated.
[0008]
This is because the summation corrects for the positional deviation of the speech repetition period, that is, the deviation of the repetition start points St-1 to St-4 from equally spaced division points, so noise components with the same repetition period are added with their phases shifted and tend to cancel one another. Speech is added with its phase synchronized, so summing N waveforms gives N times the amplitude and N^2 times the energy. In contrast, if the N noise segments are uncorrelated, the noise energy becomes only N times, so the signal-to-noise ratio improves by a factor of N.
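The N^2 versus N argument can be illustrated with a small simulation. This is only a sketch under stated assumptions: the "speech" is a toy sinusoidal period, the noise is independent Gaussian noise, and the function names and parameters are hypothetical rather than taken from the patent.

```python
import math
import random


def energy(x):
    """Sum of squared samples."""
    return sum(v * v for v in x)


def coherent_sum_gain(n_periods=16, period_len=400, noise_std=1.0, seed=0):
    """Return (signal energy gain, noise energy gain) after summing N periods."""
    rng = random.Random(seed)
    # Toy one-period waveform standing in for a voiced unit waveform (assumption).
    unit = [math.sin(2 * math.pi * 3 * k / period_len) for k in range(period_len)]
    sig_sum = [0.0] * period_len
    noise_sum = [0.0] * period_len
    for _ in range(n_periods):
        for k in range(period_len):
            sig_sum[k] += unit[k]                      # phase-aligned speech
            noise_sum[k] += rng.gauss(0.0, noise_std)  # uncorrelated noise
    sig_gain = energy(sig_sum) / energy(unit)
    # Expected energy of one noise segment is period_len * noise_std**2.
    noise_gain = energy(noise_sum) / (period_len * noise_std ** 2)
    return sig_gain, noise_gain
```

With N = 16 the signal gain is 256 (N^2), while the noise gain stays close to 16 (N), so the ratio improves by roughly a factor of N.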
[0009]
DETAILED DESCRIPTION OF THE INVENTION
FIG. 2 shows the procedure of the embodiment of the present invention, and FIG. 3 shows a functional configuration example of the speech spectrum estimation apparatus of the present invention that executes this embodiment.
In step S1, the speech signal input from the input terminal 11 is converted into a digital speech signal by the A/D converter 12 and stored in the storage unit 13.
In step S2, the speech signal is cut out from the storage unit 13 in analysis sections T of 20 to 40 ms, for example 30 ms, and the voiced/unvoiced discrimination unit 14 determines whether each cut-out speech signal is voiced or unvoiced, that is, whether it has the periodicity caused by the opening and closing of the vocal cords, or is a fricative consonant or silent part without periodicity. This determination may use a conventional method.
[0010]
If the result of this determination is voiced, in step S4 the unit waveform detector 15 detects a plurality of repeating unit waveforms, for example the unit waveforms 1-1 to 1-4 in FIG. 1. In step S5, the in-phase adder 16 adds the detected unit waveforms 1-1 to 1-4 while shifting them in time so that their shapes best match, that is, with the repetition start points St-1 to St-4 of the unit waveforms aligned.
Alternatively, each time one unit waveform is detected in step S4, it may be added in phase in step S5 before the process returns to step S4.
In step S6, the spectrum estimation unit 17 estimates the spectrum of the summed unit waveform. For this spectrum estimation, various conventional methods can be used, such as linear prediction analysis, the fast Fourier transform, and bandpass filtering.
[0011]
If the analysis section is determined to be unvoiced in step S3, then, for example, in step S7 the unvoiced sound spectrum estimation unit 18 divides the analysis section T into a plurality of small sections, in step S8 the waveforms of these small sections are summed and averaged, and the process moves to step S6, where the spectrum is estimated. In this case, the lengths of the small sections need not be equal. Alternatively, as indicated by the broken line in FIG. 2, the process may move directly to step S6 for such a section and estimate the spectrum of the whole analysis section T.
Next, a specific example of the unit waveform detection process in step S4 in FIG. 2, that is, the processing in the unit waveform detector 15, will be described. For example, as shown in FIG. 4, in the cut-out section T, the cross-correlation is taken between the waveform of a small section ST-1 of length nmin starting at the start point i = 0 and that of the adjacent small section ST-2. The cross-correlation between the two waveforms is then taken repeatedly while the lengths of ST-1 and ST-2 are increased by one sample period 1/f_s of the speech waveform at a time, until each length reaches nmax. The span i = 0 to nx of the small section for which the cross-correlation between the two waveforms is maximized is found, and the waveform of that small section is taken as one unit waveform. Note that i = 0 is made to coincide with the rising edge of the speech waveform.
[0012]
This can be explained using equations as follows. Let s(i) be the i-th speech waveform sample. For the speech waveform starting at i_1, the cross-correlation c(i_1, n) between adjacent waveforms, each n samples long, is calculated by the following expression.
[0013]
[Expression 1]

[0014]
The maximum of the cross-correlation in expression (1) is found by varying n from the minimum repetition period handled to the maximum. The minimum value nmin and maximum value nmax of the repetition period are obtained from the minimum value fmin and maximum value fmax of the vocal cord vibration frequency of human speech, that is, the fundamental frequency or pitch, and the sampling frequency f_s by expressions (2) and (3), respectively.
nmin = f_s / fmax (2)
nmax = f_s / fmin (3)
Let nx be the value of n at which the cross-correlation c(i, nx) of expression (1) is maximized when n is varied from nmin to nmax in this way. The next cross-correlation analysis start point is then set to
i_2 = i_1 + nx
and the process is repeated in sequence within the normal speech spectrum analysis section T, cutting out the waveforms one at a time.
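The period search described above can be sketched as follows. Expression (1) is not reproduced in this text, so the normalized cross-correlation used here is an assumption, as are the integer rounding of expressions (2) and (3) and all function names.

```python
import math


def period_bounds(fs, f_min, f_max):
    """Expressions (2) and (3): nmin = f_s / fmax, nmax = f_s / fmin (rounded down)."""
    return int(fs / f_max), int(fs / f_min)


def adjacent_xcorr(s, i, n):
    """Correlation between s[i:i+n] and the adjacent s[i+n:i+2n].

    The normalization is an assumption standing in for expression (1).
    """
    a = s[i:i + n]
    b = s[i + n:i + 2 * n]
    if len(b) < n:
        return float("-inf")   # not enough samples for two adjacent sections
    den = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
    if den == 0.0:
        return float("-inf")
    return sum(x * y for x, y in zip(a, b)) / den


def detect_period(s, i, n_min, n_max):
    """Scan n from nmin to nmax and return (nx, c(i, nx)) for the best n."""
    best_n, best_c = None, float("-inf")
    for n in range(n_min, n_max + 1):
        c = adjacent_xcorr(s, i, n)
        if c > best_c:
            best_n, best_c = n, c
    return best_n, best_c
```

For example, with f_s = 10 kHz and a pitch range of 125 Hz to 500 Hz, the search covers n = 20 to 80 samples; the next start point is then `i + nx`.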
[0015]
A plurality of unit waveforms in the analysis section T obtained in this way are added with a time shift so that there is no time misalignment in step S5. The specific method will be described below.
Let i_m be the start of the m-th repetitive waveform obtained in one speech spectrum analysis section. As shown in FIG. 1, this start point i_m does not necessarily fall at the corresponding time point of each of the repetitive waveforms. If the positional deviation of the start point of the m-th repetitive waveform is j_m, the repetitive waveform (unit waveform) r_q(k) summed up to the q-th is given by the following expression, where k denotes the time point.
[0016]
r_q(k) = \sum_{m=1}^{q} s(i_m + j_m + k)
The positional deviation j_{q+1} of the (q+1)-th unit waveform can be found as the j_x that maximizes the cross-correlation g(j_x) (expression (4)) between the waveform r_q(k) obtained by summing q unit waveforms and the unit waveform s(i_{q+1} + j_x + k). M denotes the maximum length of a unit waveform.
[0017]
[Expression 2]

[0018]
In this way, the (q+1)-th unit waveform is added to the summed waveform so far using the positional deviation j_x; that is, the one-period waveform that incorporates the new waveform is given by
r_{q+1}(k) = r_q(k) + s(i_{q+1} + j_x + k)
When the energy of the speech spectrum is also to be obtained, the unit waveform summed over each analysis section T is divided by the number of unit waveforms to give an average waveform.
As described above, each unit waveform may also be added, with its start point aligned, to the unit waveforms obtained so far as soon as it is detected.
An example of a specific processing procedure for the speech spectrum estimation described above is shown in FIG. 5. The minimum value nmin and maximum value nmax of the repetition period to be handled and the analysis section T are stored in advance in a storage unit 15a in the unit waveform detector 15. The unit waveform detector 15 also has a storage register 15b for parameter i, a storage register 15c for parameter n, and a storage unit 15d for c(i, nx) and nx. A buffer memory 19 is provided in the speech spectrum estimation apparatus shown in FIG. 3. The i-th speech waveform sample is denoted s(i).
[0019]
First, the stored contents of the waveform accumulation buffer memory 19 are initialized at step 21, i = 0 is set at step 22, and n = nmin is set at step 23.
Next, in step 24, the cross-correlation c(i, n) between adjacent waveforms, each n samples long, is calculated by expression (1).
In step 25, this cross-correlation c(i, n) is compared with the maximum cross-correlation c(i, nx) held so far in the storage unit 15d. If c(i, n) > c(i, nx), the maximum cross-correlation c(i, nx) in the storage unit 15d is updated to c(i, n), and the repetition period nx is updated and held as well.
[0020]
In step 26, n is incremented by 1, and in step 27 it is checked whether n has exceeded nmax; if not, the process returns to step 24.
If n has exceeded nmax in step 27, the period nx held in the storage unit 15d at that time is the period of the unit waveform being sought, and the speech waveform s(i + k) for k = 1 to k = nx is the unit waveform. In this example, each time a unit waveform is obtained, it is aligned with and added to the unit waveforms summed so far.
That is, if n has exceeded nmax in step 27, then in step 28 the waveform in the waveform accumulation buffer memory 19 and the unit waveform just obtained are aligned and added, and the summed unit waveform in the waveform accumulation buffer memory 19 is updated. A specific processing example of this alignment is described later.
[0021]
In step 29, i is updated to i + nx. In step 30, it is checked whether i has exceeded the analysis section T. If not, the process returns to step 23, the next unit waveform is detected in the same way, and the detected unit waveform is aligned and added. If i has exceeded T in step 30, the process ends.
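The loop of steps 22 to 30 can be sketched as below, leaving out the aligned addition of step 28. This is an illustration only: the normalized correlation stands in for expression (1), the stop rule (requiring room for a full scan up to nmax) is a simplifying assumption, and the function names are hypothetical.

```python
import math


def best_period(s, i, n_min, n_max):
    """Steps 23 to 27: scan n and keep the n that maximizes the correlation."""
    best_n, best_c = None, float("-inf")
    for n in range(n_min, n_max + 1):
        a, b = s[i:i + n], s[i + n:i + 2 * n]
        den = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
        if len(b) < n or den == 0.0:
            continue
        c = sum(x * y for x, y in zip(a, b)) / den   # assumed form of expression (1)
        if c > best_c:                               # step 25: keep the maximum
            best_n, best_c = n, c
    return best_n


def split_into_unit_waveforms(s, n_min, n_max):
    """Cut the analysis section s into consecutive unit waveforms."""
    units, i = [], 0                       # step 22: i = 0
    # Assumption: stop once a full scan up to n_max no longer fits in the section.
    while i + 2 * n_max <= len(s):
        nx = best_period(s, i, n_min, n_max)
        if nx is None:
            break
        units.append(s[i:i + nx])          # unit waveform starting at i
        i += nx                            # step 29: i <- i + nx
    return units
```

For a 300-sample section with a 50-sample pitch period and a search range of 20 to 80 samples, this yields three 50-sample unit waveforms before the stop rule applies.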
Next, a specific processing example of step 28 will be described with reference to FIG. 6. It is assumed that the maximum storage length L of the waveform accumulation buffer memory 19, the maximum length M of one unit waveform, and the maximum shift width J are stored in advance in a storage unit 16a in the in-phase adder 16 in FIG. 3, and that a register 16b for parameter j and a j_x storage unit 16c are provided. The maximum shift width J is, for example, about ±1 ms; when the sample period 1/f_s of the speech waveform is 0.1 ms, J = ±10.
[0022]
In step 31, it is checked whether the detected unit waveform s(k) (k = 1, 2, ..., M) is the first one. If it is, in step 32 that unit waveform s(k) (k = 1, 2, ..., M) is stored in the buffer memory 19 as r(k).
If the detected unit waveform s(k) is not the first one in step 31, its time position is shifted by j = -J in step 33, and in step 34 the cross-correlation g(j) between the shifted waveform s(j + k) and the waveform r(k) in the buffer memory 19 is calculated by expression (4). To evaluate expression (4) exactly, r(k) would always have to be stored with extra samples down to k = -J and up to k = M + J. Since these extra lengths are small compared with the unit waveform length, negative values of k are omitted, as shown in step 34 in FIG. 6, and the part where k exceeds the smaller of the buffer memory storage length L and M - j is also omitted when summing the products over k.
[0023]
In step 35, the cross-correlation g(j) thus obtained is compared with the cross-correlation g(j_x) in the storage unit 16c. If g(j) > g(j_x), g(j_x) in the storage unit 16c is updated with g(j), and j_x in the storage unit 16c is updated with the current j.
In step 36, j is incremented by 1, and in step 37 it is checked whether j has exceeded J. If not, the process returns to step 34. In this way, the detected unit waveform s(k) is shifted one sample period at a time starting from -J, its cross-correlation with the waveform r(k) in the buffer memory 19 is taken at each shift, and the largest cross-correlation and the corresponding value of j are retained. The value of j_x at the point where j is judged to have exceeded J in step 37 is therefore the positional deviation of s(k) relative to r(k).
[0024]
In step 38, the position-aligned (phase-aligned) unit waveform s(j_x + k) is added to the waveform r(k) stored in the buffer memory 19, and the waveform r(k) (k = 1, 2, ..., M) in the buffer memory 19 is updated.
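Steps 31 to 38 can be sketched as follows. Expression (4) is not reproduced in this text, so the plain sum of products used for g(j) is an assumption, as is the choice to skip shifts whose window would leave the stored signal; the function names are hypothetical.

```python
def aligned_shift(r, s, start, length, max_shift):
    """Steps 33 to 37: find j in [-J, J] that maximizes sum_k r[k] * s[start + j + k]."""
    best_j, best_g = 0, float("-inf")
    for j in range(-max_shift, max_shift + 1):
        lo = start + j
        if lo < 0 or lo + length > len(s):
            continue                       # shifted window would leave the signal
        g = sum(r[k] * s[lo + k] for k in range(length))   # assumed form of g(j)
        if g > best_g:                     # step 35: keep the best shift so far
            best_j, best_g = j, g
    return best_j


def accumulate(s, starts, length, max_shift):
    """Add the unit waveforms that begin near each start point, aligned to r."""
    r = None
    for start in starts:
        if r is None:                      # steps 31 and 32: first unit waveform
            r = list(s[start:start + length])
            continue
        j_x = aligned_shift(r, s, start, length, max_shift)
        for k in range(length):            # step 38: r(k) <- r(k) + s(j_x + k)
            r[k] += s[start + j_x + k]
    return r
```

With a 0.1 ms sample period, `max_shift=10` corresponds to the ±1 ms shift width mentioned above. Dividing the returned waveform by the number of unit waveforms gives the average waveform used when the spectral energy is also needed.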
Each unit of the speech spectrum estimation apparatus shown in FIG. 3 is controlled in sequence by the control unit 20. The apparatus can also be made to function by having a computer execute a program. In that case, a speech spectrum estimation program for causing the computer to execute the method shown in FIGS. 2, 4 and 5 is installed on the computer from a CD-ROM, a flexible magnetic disk, or the like, or downloaded via a communication line, and then executed. When a computer is used, for example in FIG. 3, the control unit 20 is a microprocessor or CPU; the voiced/unvoiced discrimination unit 14, the unit waveform detector 15, and the in-phase adder 16 are storage units holding a voiced/unvoiced discrimination subroutine, a unit waveform detection subroutine, and an in-phase addition subroutine, respectively; and the program that combines these subroutines into the overall speech spectrum estimation program is stored in a program memory (not shown).
[0025]
As shown in FIG. 1, when the repetitions of a periodic speech waveform are cut into small sections and added with time shifts chosen so that their shapes match best, noise components that are not synchronized with the speech waveform differ in phase between the small sections, so they cancel one another and are attenuated. This effect arises even where the pitch appears constant at first glance, because the pitch of speech fluctuates naturally and the time shifts are applied to compensate for that fluctuation. Signals other than speech are therefore added with their phases shifted, so they are attenuated, which suppresses the noise components. In particular, where the intonation changes greatly and the noise component is periodic, the noise is added with large phase shifts, so the noise suppression effect is large.
[0026]
When the noise is, for example, white noise whose frequency components vary randomly in phase, the expected energy of the noise component is the same whether a long section is analyzed as a whole or divided into short sections that are added. However, adding the speech signal components in phase prevents spectral blurring, so the influence of noise can still be reduced.
Example (1)
FIG. 7 shows an embodiment of a speech recognition apparatus by LPC analysis using the method of the present invention.
[0027]
Reference numeral 101 denotes an AD converter, and the input speech waveform is sampled at a sampling rate of 11.025 kHz, for example.
Reference numeral 102 denotes a spectrum analysis section extraction unit that extracts a voice of about 30 ms from a continuous voice signal.
103 is the noise suppression unit of the method of the present invention, that is, the processing other than the spectrum estimation of step S6 in FIG. 2; following the procedure shown in FIG. 5, it obtains the waveform produced by adding the repeating unit waveforms in phase.
Reference numeral 104 denotes an autocorrelation coefficient calculator that outputs autocorrelation coefficients of about the 16th order.
[0028]
A linear prediction analysis unit 105 obtains linear prediction coefficients of about the 16th order from the autocorrelation coefficients.
A cepstrum analysis unit 106 obtains a cepstrum of about the 16th order from the linear prediction coefficients.
Reference numeral 107 denotes a speech recognition unit using, for example, an HMM (Hidden Markov Model), and one-step speech recognition proceeds each time a spectrum analysis is performed for one step.
111 creates an acoustic model for speech recognition as an HMM using a cepstrum coefficient.
[0029]
112 stores the HMM acoustic model.
108 determines the end of the voice.
110 displays the recognition result.
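Blocks 104 and 105 of Example (1) can be sketched with the standard autocorrelation method and Levinson-Durbin recursion. The patent does not specify the algorithm, so this choice is an assumption, the function names are hypothetical, and the conversion to cepstrum in block 106 is not shown.

```python
def autocorrelation(x, order):
    """Autocorrelation coefficients r[0..order] of the summed unit waveform x."""
    n = len(x)
    return [sum(x[t] * x[t + lag] for t in range(n - lag))
            for lag in range(order + 1)]


def levinson_durbin(r, order):
    """Return (a, err), where x[t] is predicted as sum_k a[k-1] * x[t-k].

    Assumes r[0] > 0 and that the prediction error never reaches zero.
    """
    a = [0.0] * (order + 1)                # a[0] is unused
    err = r[0]
    for m in range(1, order + 1):
        acc = r[m] - sum(a[k] * r[m - k] for k in range(1, m))
        kappa = acc / err                  # reflection coefficient
        new_a = a[:]
        new_a[m] = kappa
        for k in range(1, m):
            new_a[k] = a[k] - kappa * a[m - k]
        a = new_a
        err *= (1.0 - kappa * kappa)       # remaining prediction error
    return a[1:], err
```

For the 16th-order analysis mentioned above, `autocorrelation(x, 16)` followed by `levinson_durbin(r, 16)` would be applied to the phase-aligned summed waveform from block 103.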
Example (2)
FIG. 8 shows an embodiment of a speech recognition apparatus using FFT (Fast Fourier Transform) using this method.
Reference numeral 201 denotes an AD converter, and the speech waveform is sampled at a sampling rate of 11.025 kHz, for example.
[0030]
Reference numeral 202 denotes a spectrum analysis section cutout unit that cuts out a voice of about 30 ms from a continuous voice signal.
203 is the noise suppression unit of the method of the present invention, that is, the processing other than the spectrum estimation of step S6 in FIG. 2; following the procedure shown in FIG. 6, it obtains the waveform produced by adding the repeating unit waveforms in phase.
204 performs, for example, 256th-order FFT (Fast Fourier Transform). Note that the portion other than the unit waveform added in the analysis interval is filled with 0 and subjected to FFT.
[0031]
205 obtains a logarithmic spectrum.
206 performs an inverse Fourier transform of the logarithmic spectrum, and
207 obtains, for example, 16th-order cepstrum coefficients.
208 is, for example, an HMM speech recognition unit, and the one-step speech recognition proceeds every time one spectrum analysis is performed.
211 creates an acoustic model for speech recognition as an HMM using a cepstrum coefficient.
212 stores the HMM acoustic model.
[0032]
209 determines the end of the voice.
210 displays the recognition result.
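Blocks 204 to 207 of Example (2) (zero-padded transform, logarithmic spectrum, inverse transform, cepstrum coefficients) can be sketched as below. To stay self-contained the sketch uses a naive DFT in place of an FFT; the log-magnitude floor, the omission of c_0, and the function names are assumptions.

```python
import cmath
import math


def dft(x):
    """Naive discrete Fourier transform (stand-in for the FFT of block 204)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * f * t / n) for t in range(n))
            for f in range(n)]


def cepstrum(unit_waveform, n_fft=256, n_coef=16, floor=1e-12):
    """Cepstrum coefficients c_1..c_n_coef of a summed unit waveform."""
    # Block 204: pad the part outside the unit waveform with zeros.
    frame = list(unit_waveform[:n_fft])
    frame += [0.0] * (n_fft - len(frame))
    spectrum = dft(frame)
    # Block 205: logarithmic magnitude spectrum (floored to avoid log(0)).
    log_mag = [math.log(max(abs(v), floor)) for v in spectrum]
    # Block 206: inverse transform of the real, symmetric log spectrum.
    ceps = [sum(log_mag[f] * math.cos(2 * math.pi * f * q / n_fft)
                for f in range(n_fft)) / n_fft
            for q in range(n_coef + 1)]
    # Block 207: keep c_1..c_n_coef (dropping c_0 is an assumption).
    return ceps[1:]
```

A single impulse has a flat spectrum, so all of its cepstrum coefficients above c_0 are zero, which gives a quick sanity check of the pipeline.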
The present invention can be applied not only to speech recognition but also to the case of encoding speech by obtaining a spectrum.
[0033]
[Effect of the invention]
As described above, according to the present invention, even if there is a spectrum blur, the blur can be reduced and the influence of noise can be reduced.
For example, the present invention is applied to speech recognition, and as a result of a phoneme recognition experiment by HMM, the 23-phoneme recognition rate is increased from 45.5% to 58.5% in a pink noise environment with a signal-to-noise power ratio SNR of 20 dB. The vowel recognition rate was improved from 91.6% to 94.8%.
[Brief description of the drawings]
FIG. 1 is a diagram showing a time lag of a speech waveform and its repetition, and noise suppression by compensating and adding the time lag.
FIG. 2 is a flowchart showing an embodiment of the present invention.
FIG. 3 is a diagram showing a functional configuration example of a speech spectrum estimation apparatus according to the present invention.
FIG. 4 is a diagram for explaining a procedure for detecting a unit waveform in the present invention.
FIG. 5 is a flowchart showing an example of a specific procedure of a unit waveform detection portion in the method of the present invention.
FIG. 6 is a flowchart showing an example of a specific processing procedure of unit waveform alignment addition in step 28 in FIG. 5.
FIG. 7 is a flowchart showing an example of a speech recognition processing procedure using LPC analysis to which the method of the present invention is applied.
FIG. 8 is a flowchart showing an example of a speech recognition processing procedure using FFT by applying the method of the present invention.

Claims

A speech spectrum estimation method for obtaining, from a speech waveform, the spectrum of a section having a short time length of 20 to 40 ms, in which
a voiced/unvoiced determination unit determines whether the speech waveform of the section is voiced or unvoiced,
when the speech waveform of the section is determined to be voiced, a unit waveform detection unit detects a plurality of repeated unit waveforms,
an in-phase addition unit adds the detected unit waveforms while shifting them in time so that their shapes best match each other, and
a spectrum estimation unit estimates the spectrum of the added one-period waveform,
the method being characterized in that:
(A) the detection of the plurality of repeated unit waveforms comprises
storing the speech waveform of the section in a storage unit,
taking out from the storage unit, starting at the beginning of the section, the waveforms of two temporally adjacent small sections while changing the length of the small sections,
obtaining the cross-correlation of the waveforms of the two small sections taken out,
taking the waveform of the small section with the highest cross-correlation as a one-cycle waveform and storing this waveform in a buffer memory,
next shifting the time by one cycle of the obtained waveform and performing the same operations, from taking out the waveforms of two small sections to storing the result in the buffer memory, and
repeating the above operation, shifting the time by one cycle each time, until the end of the section, thereby sequentially extracting a plurality of one-cycle waveforms; and
(B) the addition with time shifting so that the shapes of the plurality of unit waveforms best match comprises
sequentially taking out the plurality of unit waveforms stored in the buffer memory,
putting one of the extracted waveforms into a buffer memory,
shifting the time position of another extracted waveform so that its cross-correlation with the one waveform is the highest, and
repeating the operation of adding the other waveform, at that time position, to the one waveform in the buffer memory.
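Operations (A) and (B) of the method above can be sketched roughly as follows. This is an illustrative sketch under stated assumptions, not the claimed implementation: the names `normalized_corr`, `detect_unit_waveforms`, `shifted`, and `aligned_sum` are hypothetical, the cross-correlation in (A) is normalized so that different section lengths are comparable, the search only runs while the longest candidate length still fits in the section, and the alignment search in (B) is limited to a hypothetical `max_shift` in samples.

```python
import math


def normalized_corr(a, b):
    """Normalized cross-correlation of two equal-length sequences."""
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
    return num / den if den > 0.0 else 0.0


def detect_unit_waveforms(frame, min_len, max_len):
    """(A) Pick the length whose two adjacent sub-sections correlate best,
    store that one-cycle waveform, then advance by one cycle."""
    units = []
    start = 0
    # Simplification: search only while the longest candidate pair fits.
    while start + 2 * max_len <= len(frame):
        best_len, best_score = min_len, -2.0
        for length in range(min_len, max_len + 1):
            first = frame[start:start + length]
            second = frame[start + length:start + 2 * length]
            score = normalized_corr(first, second)
            if score > best_score:
                best_len, best_score = length, score
        units.append(frame[start:start + best_len])
        start += best_len
    return units


def shifted(wave, shift, length):
    """Delay `wave` by `shift` samples (negative advances it),
    zero-padding or cropping the result to `length` samples."""
    out = [0.0] * length
    for i, v in enumerate(wave):
        j = i + shift
        if 0 <= j < length:
            out[j] = v
    return out


def aligned_sum(units, max_shift):
    """(B) Add each unit waveform to the buffer at the time position
    that maximizes its cross-correlation with the buffer contents."""
    buffer = list(units[0])
    n = len(buffer)
    for wave in units[1:]:
        best = max(
            range(-max_shift, max_shift + 1),
            key=lambda s: sum(x * y for x, y in zip(buffer, shifted(wave, s, n))),
        )
        for i, v in enumerate(shifted(wave, best, n)):
            buffer[i] += v
    return buffer
```

As a usage example, assuming an 8 kHz sampling rate and a fundamental-frequency range of 100 to 400 Hz (both values are assumptions, not taken from the text), the candidate lengths would be `min_len = 8000 // 400 = 20` and `max_len = 8000 // 100 = 80` samples, and a 40 ms section would hold 320 samples: `aligned_sum(detect_unit_waveforms(frame, 20, 80), max_shift=5)`.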

The speech spectrum estimation method according to claim 1, wherein the length of the small section is varied between the reciprocal of the maximum value and the reciprocal of the minimum value of the fundamental frequency of human speech.

The speech spectrum estimation method according to claim 1, wherein each time a unit waveform is detected, it is added to the previously summed waveform with its time shift aligned.

A speech spectrum estimation program for causing a computer to execute each step of the method according to claim 1.

A computer-readable recording medium on which the speech spectrum estimation program according to claim 4 is recorded.

A speech spectrum estimation apparatus that extracts an input speech waveform for each analysis section of 20 to 40 ms and obtains the spectrum of the speech waveform in that section, comprising:
a voiced/unvoiced determination unit that determines whether the speech waveform of the extracted analysis section is voiced or unvoiced;
a unit waveform detection unit that detects a plurality of repeated unit waveforms from the speech waveform of an analysis section determined to be voiced by the voiced/unvoiced determination unit;
an in-phase addition unit that adds the unit waveforms detected by the unit waveform detection unit while shifting them in time so that their shapes best match each other; and
a spectrum estimation unit that estimates the spectrum of the one-period waveform added by the in-phase addition unit,
wherein
(A) the unit waveform detection unit
stores the speech waveform of the analysis section in a storage unit,
takes out from the storage unit, starting at the beginning of the section, the waveforms of two temporally adjacent small sections while changing the length of the small sections,
obtains the cross-correlation of the waveforms of the two small sections taken out,
takes the waveform of the small section with the highest cross-correlation as a one-cycle waveform and stores this waveform in a buffer memory,
next shifts the time by one cycle of the obtained waveform and performs the same operations, from taking out the waveforms of two small sections to storing the result in the buffer memory, and
repeats the above operation, shifting the time by one cycle each time, until the end of the analysis section, thereby sequentially extracting a plurality of one-cycle waveforms; and
(B) the in-phase addition unit
sequentially takes out the plurality of unit waveforms stored in the buffer memory,
puts one of the extracted waveforms into a buffer memory,
shifts the time position of another extracted waveform so that its cross-correlation with the one waveform is the highest, and
adds the other waveform, at that time position, to the one waveform in the buffer memory.