JPH0552512B2

Info

Publication number: JPH0552512B2
Application number: JP19465683A
Authority: JP
Inventors: Yoichiro Sako; Masao Watari; Makoto Akaha; Atsunobu Hiraiwa
Original assignee: Sony Corp
Current assignee: Sony Corp
Priority date: 1983-10-18
Filing date: 1983-10-18
Publication date: 1993-08-05
Also published as: JPS6086600A

Description

[Detailed description of the invention]

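[Industrial field of application]

The present invention relates to a speech recognition device for unspecified speakers.

[Background art and its problems]

In speech recognition, word recognition for a specific speaker has already been put to practical use. In this approach the specific speaker pronounces every word to be recognized, and the acoustic parameters of each word are detected with a band-pass filter bank or the like and stored (registered). When the speaker later utters a word, its acoustic parameters are detected and compared with the registered parameters of each word; when they match, the utterance is recognized as that word.

In such a device, if the time axis of the utterance differs from that at registration, the time series of acoustic parameters extracted at fixed intervals (5 to 20 msec) is stretched or compressed to align the time axes. This copes with variations in speaking rate.

With this device, however, the acoustic parameters of the whole of every word to be recognized must be registered and stored in advance, which requires a huge storage capacity and a large amount of computation. The recognizable vocabulary was therefore limited.

As an alternative, recognition in units of phonemes (for Japanese written in Roman letters: A, I, U, E, O, K, S, T, etc.) or syllables (KA, KI, KU, etc.) has been proposed. In this case, although phonemes with quasi-stationary portions, such as vowels, are easy to recognize, it is extremely difficult to identify a single phoneme from acoustic parameters alone when its phonemic features are very short, as with plosives (K, T, P, etc.).

Furthermore, when unspecified speakers are the target, the acoustic parameters have a large variance due to individual differences, and recognition cannot be achieved by time-axis alignment alone. Methods have therefore been proposed in which multiple sets of acoustic parameters are registered for each word and the closest set is recognized, or in which the whole word is converted into fixed-dimensional parameters and discriminated with a discriminant function. Each of these requires a huge storage capacity or a large amount of computation, so the recognizable vocabulary becomes extremely small.

Observation of how phonemes are uttered shows that phonemes such as vowels and fricatives (S, H, etc.) can be drawn out. Consider the utterance "hai" ("yes"). As shown in Fig. 1A, it changes as "silence → H → A → I → silence". The same "hai" can also be uttered as in Fig. 1B. The lengths of the quasi-stationary portions of H, A and I change with each utterance, which produces variation along the time axis. It was found, however, that the transition portions between phonemes (hatched in the figure) show comparatively little time-axis variation.

Focusing on this point, the present inventors previously proposed the following device.

In Fig. 2, a speech signal supplied to microphone 1 is fed through microphone amplifier 2 and low-pass filter 3 (5.5 kHz or below) to AD converter circuit 4. A 12.5 kHz sampling clock (80 μsec interval) from clock generator 5 is supplied to AD converter circuit 4, and at this timing the speech signal is converted into digital signals of a predetermined number of bits (= 1 word).

The digital signal is supplied to band-pass filters 6₁, 6₂, ..., 6₃₀ for frequency analysis and divided into, for example, 30 bands according to the mel frequency scale, which matches human hearing. The signal of each band is supplied to emphasis circuits 7₁, 7₂, ..., 7₃₀, which apply a high-frequency boost matched to human hearing. The signals are then supplied to absolute-value circuits 8₁, 8₂, ..., 8₃₀ and made unipolar, and to averaging circuits 9₁, 9₂, ..., 9₃₀, which extract the signal envelopes.

These signals are supplied to logarithm circuits 10₁, 10₂, ..., 10₃₀ and converted into logarithmic values. This removes the redundancy introduced by the weighting in emphasis circuits 7₁, 7₂, ..., 7₃₀. Here, let the waveform function represented by the $n_f$ samples contained in a time length $T$ be

$$U_{n_f T}(t) \tag{1}$$

The logarithmic power spectrum obtained by frequency analysis followed by taking the logarithm,

$$\log \left| U^{2}_{n_f T}(f) \right| \tag{2}$$

is called the spectral parameter $x_{(i)}$ ($i = 0, 1, \ldots, 29$).

The spectral parameters $x_{(i)}$ are supplied to discrete Fourier transform (DFT) circuit 11. In DFT circuit 11, with $M$ the number of bands, the $M$-dimensional spectral parameter $x_{(i)}$ ($i = 0, 1, \ldots, M-1$) is regarded as a real symmetric parameter of $2M-1$ points, and a $(2M-2)$-point DFT is performed. Thus

$$X_{(m)} = \sum_{i=0}^{2M-3} x_{(i)} \, W_{2M-2}^{mi} \tag{3}$$

where

$$W_{2M-2}^{mi} = e^{-j \frac{2\pi i m}{2M-2}}, \qquad m = 0, 1, \ldots, 2M-3.$$

Because the function being transformed is regarded as an even function,

$$W_{2M-2}^{mi} = \cos\frac{2\pi i m}{2M-2} = \cos\frac{\pi i m}{M-1},$$

and therefore

$$X_{(m)} = \sum_{i=0}^{2M-3} x_{(i)} \cos\frac{\pi i m}{M-1} \tag{4}$$

This DFT extracts acoustic parameters that represent the envelope of the spectrum.

From the spectral parameters $x_{(i)}$ transformed in this way, the values of $P$ dimensions from order 0 to $P-1$ (for example $P = 8$) are taken as the local parameters $L_{(p)}$ ($p = 0, 1, \ldots, P-1$):

$$L_{(p)} = \sum_{i=0}^{2M-3} x_{(i)} \cos\frac{\pi i p}{M-1} \tag{5}$$

Taking into account that the spectral parameters are symmetric, and setting $x_{(i)} = x_{(2M-i-2)}$, the local parameters become

$$L_{(p)} = x_{(0)} + \sum_{i=1}^{M-2} x_{(i)} \left\{ \cos\frac{\pi i p}{M-1} + \cos\frac{\pi (2M-2-i) p}{M-1} \right\} + x_{(M-1)} \cos\frac{\pi (M-1) p}{M-1} \tag{6}$$

where $p = 0, 1, \ldots, P-1$. In this way the 30-word signal is compressed into $P$ (for example 8) words.

The local parameters $L_{(p)}$ are supplied to memory device 12. Memory device 12 consists of storage sections of $P$ words per row arranged as a matrix of, for example, 83 rows. The local parameters $L_{(p)}$ are stored in sequence for each dimension, and a clock at 0.96 msec intervals from clock generator 5 shifts the parameters of each row laterally in sequence. Memory device 12 thus holds 83 points (79.68 msec) of $P$-dimensional local parameters $L_{(p)}$ at 0.96 msec intervals, and is updated with new parameters at each clock.

Speech transition point detection circuit 20 is configured as follows. The signals $V_{(n)}$ ($n = 0, 1, \ldots, 29$) from averaging circuits 9₁ to 9₃₀, which correspond to the signal amount in each band, are supplied to biased logarithm circuits 21₁, 21₂, ..., 21₃₀ to form

$$v'_{(n)} = \log\left(V_{(n)} + B\right) \tag{7}$$

The signals $V_{(n)}$ are also supplied to accumulating average circuit 22 to form

$$V_a = \sum_{n=1}^{30} \frac{V_{(n)}}{30},$$

and this signal $V_a$ is supplied to logarithm circuit 21ₓ to form

$$v'_a = \log\left(V_a + B\right) \tag{8}$$

These signals are supplied to arithmetic circuit 23 to form

$$v_{(n)} = v'_a - v'_{(n)} \tag{9}$$

Because the signals $V_{(n)}$ are used in this way, the change in each order ($n = 0, 1, \ldots, 29$) for a change from one phoneme to another is of similar magnitude, which avoids dispersion in the amount of change depending on the type of phoneme. In addition, because the normalized parameter $v_{(n)}$ is formed by taking logarithms and subtracting, variation of $v_{(n)}$ with the level of the input speech is eliminated.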
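The normalization in equations (7) to (9) can be sketched in a few lines. This is only an illustrative sketch, not the circuit of the patent: the function name `normalized_band_params`, the use of the natural logarithm, and the sample levels are assumptions made for the example.

```python
import math


def normalized_band_params(band_levels, bias=0.0):
    """Return v(n) = log(V_a + B) - log(V(n) + B) for every band level V(n).

    band_levels: envelope levels V(n), one per band (hypothetical input).
    bias:        the bias B of equations (7) and (8).
    With bias == 0 every level must be positive, otherwise log() fails.
    """
    mean_level = sum(band_levels) / len(band_levels)   # V_a
    log_mean = math.log(mean_level + bias)             # v'_a, equation (8)
    # Equation (9): subtract the biased log of each band, equation (7).
    return [log_mean - math.log(v + bias) for v in band_levels]


levels = [1.0, 2.0, 4.0]
quiet = normalized_band_params(levels)
loud = normalized_band_params([10.0 * v for v in levels])
```

With `bias=0.0`, `quiet` and `loud` agree up to rounding, which is the level independence described above; a very large bias pushes every `v(n)` toward zero.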
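Furthermore, because the bias $B$ is added before the operation, the sensitivity to small components of the input speech (noise and the like) can be reduced, as is clear from the fact that $v_{(n)} \to 0$ as $B \to \infty$.

The parameters $v_{(n)}$ are supplied to memory device 24, which stores $2w+1$ (for example 57) points. The stored signals are supplied to arithmetic circuit 25 to form

$$Y_{n,t} = \min_{I \in GF_t} \left\{ v_{(n)}(I) \right\} \tag{10}$$

where

$$GF_t = \{\, I \,;\; -w + t \le I \le w + t \,\}.$$

This signal and the parameters $v_{(n)}$ are supplied to arithmetic circuit 26 to form

$$T_{(t)} = \sum_{n=0}^{N-1} \sum_{I=-w}^{w} \left( v_{(n)}(I + t) - Y_{n,t} \right) \tag{11}$$

$T_{(t)}$ is the transition point detection parameter. It is supplied to peak discrimination circuit 27, which detects the phoneme transition points of the input speech signal.

Because the parameter $T_{(t)}$ is defined over $w$ points on each side of point $t$, there is no risk of spurious ripples or multiple peaks. Fig. 3 shows the result of this detection for the utterance "zero", digitized as 12-bit data at a sampling frequency of 12.5 kHz, with point interval 0.96 msec, number of bands $N = 30$, bias $B = 0$, and number of detection points $2w + 1 = 57$. In the figure, A is the speech waveform, B the phonemes, and C the detection signal, which shows a pronounced peak at each of the transitions "silence → Z", "Z → E", "E → R", "R → O" and "O → silence". Noise produces some ripple in the silent portions, but increasing the bias $B$ reduces it to approximately zero, as shown by the broken line.

The transition point detection signal $T_{(t)}$ is supplied to memory device 12. Memory device 12 is read out at the moment when the local parameters $L_{(p)}$ corresponding to the timing of this detection signal have been shifted to the 42nd row.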
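As a rough illustration of equations (10) and (11), the sketch below computes the detection parameter for a list of per-band sequences. The function name `transition_parameter`, the choice to leave frames without a full window at zero, and the toy step signal are assumptions for the example; the patent itself uses 30 bands and w = 28.

```python
def transition_parameter(v, w):
    """Compute T(t) from v[n][t] (band n, frame t) with a half-window of w frames."""
    frames = len(v[0])
    # Assumption: frames that lack a full window of 2w + 1 points stay at 0.
    result = [0.0] * frames
    for t in range(w, frames - w):
        total = 0.0
        for band in v:
            window = band[t - w:t + w + 1]
            floor = min(window)                      # Y(n, t), equation (10)
            total += sum(x - floor for x in window)  # inner sum of equation (11)
        result[t] = total
    return result


# One band that jumps from 0 to 1 at frame 10 (toy data).
step = [[0.0] * 10 + [1.0] * 10]
t_values = transition_parameter(step, w=2)
```

For this step, `t_values` rises as the window moves onto the new level and returns to zero once the window lies entirely on it, so a peak detector in the role of circuit 27 would mark one transition near frame 10.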
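In this readout of memory device 12, the signals of 83 points are read out laterally for each dimension $p$. The signals read out are supplied to DFT circuit 13.

Circuit 13 performs a DFT in the same way as described above and extracts the envelope of the time-series change of the acoustic parameters. From the transformed signal, the values of $Q$ dimensions from order 0 to $Q-1$ (for example $Q = 3$) are taken. This DFT is performed for each dimension $p$, forming transition point parameters $K_{(p,q)}$ ($p = 0, 1, \ldots, P-1$; $q = 0, 1, \ldots, Q-1$) of $P \times Q$ (= 24) words in total. Because $K_{(0,0)}$ represents the power of the speech waveform, $q$ may instead be taken as 1 to $Q$ when $p = 0$ for the purpose of power normalization.

That is, in Fig. 4, when transition points such as B are detected for an input speech signal (HAI) such as A, the overall power spectrum of the signal is as shown in C. If the power spectrum at the transition point "H → A" is as shown in D, this signal is emphasized to become E and compressed on the mel scale to become F. This signal is transformed by DFT to become G, the 83 points before and after are arranged as a matrix as in H, and this is transformed by DFT successively along the time axis $t$ to form the transition point parameters $K_{(p,q)}$.

The transition point parameters $K_{(p,q)}$ are supplied to Mahalanobis distance calculation circuit 14, together with the cluster coefficients from memory device 15, and the Mahalanobis distance to each set of cluster coefficients is calculated. The cluster coefficients are obtained by extracting transition point parameters from the utterances of multiple speakers as described above, classifying them by phonemic content, and analyzing them statistically.

The calculated Mahalanobis distances are supplied to decision circuit 16, which determines from which phoneme to which phoneme the detected transition point leads, and the result is taken out at output terminal 17.

For example, for the 12 words "hai" ("yes"), "iie" ("no") and "0 (zero)" to "9 (kyuu)", the speech of many speakers (more than one hundred) is supplied to the device in advance, the transition points are detected, and the transition point parameters are extracted. These parameters are classified into a table such as that of Fig. 5 and analyzed statistically for each class (cluster). In the figure, * denotes silence.

For these transition point parameters, take an arbitrary sample $R^{(a)}_{r,n}$ ($r = 1, 2, \ldots, 24$), where $a$ is the cluster index (for example, $a = 1$ corresponds to * → H and $a = 2$ to H → A) and $n$ is the speaker number. The covariance matrix

$$A^{(a)}_{rs} \equiv E\left[ \left( R^{(a)}_{r,n} - \bar{R}^{(a)}_{r} \right) \left( R^{(a)}_{s,n} - \bar{R}^{(a)}_{s} \right) \right] \tag{12}$$

is calculated, where $\bar{R}^{(a)}_{r} = E\left( R^{(a)}_{r,n} \right)$ and $E$ denotes the ensemble average, and its inverse matrix

$$B^{(a)}_{rs} = \left( A^{(a)}_{tu} \right)^{-1}_{rs} \tag{13}$$

is obtained.

The distance between an arbitrary transition point parameter $K_r$ and cluster $a$ is then given by the Mahalanobis distance

$$D(K_r, a) \equiv \sum_{r} \sum_{s} \left( K_r - \bar{R}^{(a)}_{r} \right) B^{(a)}_{rs} \left( K_s - \bar{R}^{(a)}_{s} \right) \tag{14}$$

Accordingly, by obtaining $B^{(a)}_{rs}$ and $\bar{R}^{(a)}_{r}$ and storing them in memory device 15, Mahalanobis distance calculation circuit 14 can calculate the Mahalanobis distance to the transition point parameters of the input speech.

Circuit 14 thus outputs, for each transition point of the input speech, the minimum distance to the clusters and the order of the transition points. These are supplied to decision circuit 16, which makes the recognition decision when the input speech becomes silent. For example, for each word, a word distance is obtained as the mean of the square roots of the minimum distances between each transition point parameter and the clusters. To allow for some transition points being missed, the word distance is obtained for several variants of each word that assume such omissions. Candidates whose transition point order differs from that of the table are rejected.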
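Equation (14) and the word distance described above can be sketched as follows. This is a minimal sketch under stated assumptions: the function names, the dictionary layout of the cluster coefficients (mean vector and inverse covariance per cluster label), and the two-dimensional toy values are all hypothetical; the patent works with 24-dimensional parameters.

```python
import math


def mahalanobis(k, mean, inv_cov):
    """Quadratic form of equation (14) for parameter vector k and one cluster."""
    d = [ki - mi for ki, mi in zip(k, mean)]
    size = len(d)
    return sum(d[r] * inv_cov[r][s] * d[s] for r in range(size) for s in range(size))


def nearest_cluster(k, clusters):
    """Return (label, distance) of the cluster with the smallest distance to k.

    clusters maps a label to (mean vector, inverse covariance matrix).
    """
    scored = ((label, mahalanobis(k, mean, inv_cov))
              for label, (mean, inv_cov) in clusters.items())
    return min(scored, key=lambda item: item[1])


def word_distance(min_distances):
    """Mean of the square roots of the per-transition minimum distances."""
    return sum(math.sqrt(d) for d in min_distances) / len(min_distances)


identity = [[1.0, 0.0], [0.0, 1.0]]
clusters = {"*->H": ([0.0, 0.0], identity), "H->A": ([3.0, 3.0], identity)}
best = nearest_cluster([3.0, 4.0], clusters)
```

Here `best` names the cluster closest to the toy parameter vector; the word whose `word_distance` is smallest over its expected transition sequence would then be chosen.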
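The word with the smallest word distance is then taken as the recognition result.

Speech recognition is performed in this way. Because the device detects the phoneme changes at the transition points of the speech, it is free of time-axis variation and achieves good recognition even for unspecified speakers.

In addition, because the parameters are extracted at the transition points as described above, a single transition point can be recognized in, for example, 24 dimensions, which makes recognition very easy and accurate.

When the device was trained with 120 speakers and tested on the 12 words above with speakers other than those 120, an average recognition rate of 98.2% was obtained.

Furthermore, in the example above, the "H → A" of "hai" and the "H → A" of "8 (hachi)" can be classified into the same cluster. Therefore, if the number of phonemes of the language to be recognized is $\alpha$, then by calculating ${}_{\alpha}C_2$ clusters in advance and storing their cluster coefficients in memory device 15, the method can be applied to the recognition of various words, and a large vocabulary can be recognized easily.

In the example above, specific words such as "hai" and "iie" were recognized, but general speech can also be recognized, for example monosyllable by monosyllable.

In that case, however, the number of phonemes in human pronunciation is large, and the number of transition point clusters becomes as large as 100 to 200. Calculating the Mahalanobis distance for all of these clusters would require a very large amount of computation and was not practical.

Also, in monosyllable recognition for example, at the final "vowel → silence" transition, fluctuations in the speech level and the like may produce multiple transition points, and the vowels determined for them may differ. It was found that in such a case the candidate with the smallest Mahalanobis distance is not necessarily the phoneme actually uttered.

[Object of the invention]

In view of these points, the present invention enables good speech recognition with a simple configuration.

[Summary of the invention]

The present invention is a speech recognition method in which transition portions between phonemes, including silence, are detected, a predetermined length of speech is extracted from each detected transition portion and converted into parameters, and these parameters are used as the basic unit of recognition, characterized in that, when there are multiple vowel-to-silence transition points classified into different cluster coefficients, the cluster coefficient is determined on the basis of the number of transition points classified into each cluster coefficient. This allows good speech recognition with a simple configuration.

[Embodiment]

The following device is used in the embodiment below. In Fig. 6, emphasis circuit 7 is provided in the stage before band-pass filters 6₁ to 6₃₀. In emphasis circuit 7, for bands 1 to 16 on the low-frequency side, the signal is supplied without correction to band-pass filters 6₁ to 6₁₆, and for bands 17 to 30 on the high-frequency side, the signal is supplied through difference circuit 31 to band-pass filters 6₁₇ to 6₃₀.

In emphasis circuit 7, the characteristic of difference circuit 31 is expressed as

$$y_{(n)} = x_{(n)} - x_{(n-1)} \tag{15}$$

and its Z-transform is

$$Y_{(Z)} = \left( 1 - Z^{-1} \right) X_{(Z)} \tag{16}$$

The transfer function $H_{(Z)}$ of this circuit satisfies

$$\left| H_{(Z)} \right|^{2} = \left| H_{(Z)} \cdot H_{(Z^{-1})} \right| = \left| 2 - 2\cos\omega T \right| \tag{17}$$

which, as shown in Fig. 7, is small on the low-frequency side and large on the high-frequency side. This transfer function becomes 1 at the point where the angular frequency $\omega$ is $\pi/2$. When the spectrum is divided into 30 bands on the mel scale as described above, the point where $\omega$ is $\pi/2$ lies between bands 16 and 17. Therefore, by leaving bands 1 to 16 uncorrected and taking the difference for bands 17 to 30, a high-frequency boost matched to human hearing can be obtained, as shown in Fig. 8.

The signals from averaging circuits 9₁ to 9₃₀ of the respective bands are supplied to noise removal circuits 32₁ to 32₃₀. The signal from AD converter circuit 4 is supplied to silence detection circuit 33, and its detection signal is supplied to removal circuits 32₁ to 32₃₀. In removal circuits 32₁ to 32₃₀, the signal in the silent state (noise) is measured, and its mean value (or its peak value, or a value calculated from these) is taken as threshold level $N$. The circuit outputs 0 when the input signal $x$ is smaller than $N$, and $(x - N)$ when it is larger. This signal is supplied to logarithm circuits 10₁ to 10₃₀.

That is, in noise removal circuits 32₁ to 32₃₀, when a signal such as that of Fig. 9A is supplied to the removal circuit of one band, detection circuit 33 detects the silent portion, and the threshold level $N$, for example the mean value of the signal in that portion, yields an output signal such as that of Fig. 9B. Because the noise level is measured for each band, noise is removed according to the frequency characteristics of the noise. The rest of the configuration is the same as in Fig. 2.

With this device, good emphasis matched to human hearing is obtained with only a simple difference circuit and without multipliers. The amount of computation is also reduced when the processing is done in software.

Furthermore, noise can be removed according to its frequency characteristics, which greatly improves the accuracy of the parameters.

In this device, distance calculation circuit 14 and decision circuit 16 are configured as follows. In Fig. 10, the signal from DFT circuit 13 is supplied to first distance calculation circuit 41, and the distance to the cluster coefficients from memory device 51 is calculated. Memory device 51 holds three kinds of cluster coefficients: "* → [sound]" (any non-silent sound), "[sound] → [vowel]", and "[vowel] → *". A monosyllable is formed of these three kinds of transition point.

The calculated distances are supplied to first decision circuit 61, and the input transition point parameters are classified into the three clusters above.

Of the classified parameters, those of "[vowel] → *" are supplied to second distance calculation circuit 42, and the distance to the cluster coefficients from memory device 52 is calculated.

Memory device 52 holds six kinds of cluster coefficients: "A → *", "I → *", "U → *", "E → *", "O → *", and "[ん] → *" (the moraic nasal "ん").

The calculated distances are supplied to second decision circuit 62, which determines to which of the six clusters the input parameters correspond.

This decision result is supplied to processing circuit 71, which makes an overall decision on the vowel.

At a "[vowel] → *" transition, noise-like components such as so-called breath noise may cause multiple transition points to be detected, in which case parameters close to another cluster may appear by chance. Processing circuit 71 therefore evaluates the number of detections together with the calculated distances. For example, suppose that the transition point detection of Fig. 11A yields the decision results and distances shown in B. Here the candidate with the shortest distance is, for example, "U".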
In this case, however, more transition points were judged to be "A". Experiments and simulations on such cases showed that the more numerous candidate is generally the correct one.

Processing circuit 71 therefore makes its decision by, for example, a majority vote over the transition point parameters. When the vote is tied, or when the distances differ extremely, the distances may also be taken into account. The final vowel is determined in this way.

The transition point parameters of "* → [sound]" and "[sound] → [vowel]" classified by decision circuit 61 are supplied to third and fourth distance calculation circuits 43 and 44 respectively, and the distances to the cluster coefficients from memory devices 53 and 54 are calculated.

Memory device 53 holds cluster coefficients such as those in the following table, classified by final vowel.

[Table]

[Table]

Here, for example, the clusters classified under final vowel "A" total 32: the 10 sounds of the A column of the gojūon (50-sound) table, 5 voiced and semi-voiced sounds, 11 contracted sounds (yōon) and the buzz sound, making 27, plus 5 plosives for which the distinction between "* → [sound]" and "[sound] → [vowel]" is difficult.

"I" comprises 15 clusters in total: those of "A" excluding the ya row, the wa row, the da row and the contracted sounds.

Likewise "U", "E" and "O" consist of 30, 17 and 31 clusters respectively, according to the characteristics of their pronunciation. Note that "[…]" is included in "[…]".

Memory device 54 holds cluster coefficients such as those in the following table, classified by final vowel.

[Table]

[Table]

Here too, as with the memory device 53 described above, the coefficients are written classified into 26 clusters for "A", 12 for "I", 25 for "U", 13 for "E", and 25 for "O", according to their pronunciation characteristics. The contracted sounds may each be merged into "Y→A", "Y→U", and "Y→O". The plosives are the same as those in the memory device 53 and are provided again here.

In accordance with the final vowel determination output from the processing circuit 71 described above, only the portion of each memory device 53, 54 that corresponds to that vowel is supplied to the calculation circuits 43, 44, and the distances are calculated. The calculated distances are supplied to the third and fourth determination circuits 63 and 64, respectively, which determine which of their clusters each input parameter corresponds to.

These determination results, together with the result from the determination circuit 62, are supplied to the word/monosyllable determination circuit 81, which identifies the word or monosyllable of the input speech.

Speech recognition is thus performed in this device. The device first classifies the transition points into three types and then determines the final vowel. Vowels are generally easy to detect, and because the initial three-way classification and the vowel determination involve only 3 and 6 clusters, respectively, the number of parameter dimensions can be increased to make these determinations extremely accurate.

When a plurality of final vowels are detected, judging them comprehensively by distance and count further raises the accuracy of the determination.

Furthermore, using the determined final vowel to restrict the clusters searched for the preceding transition points reduces the amount of distance calculation, which makes the device easier to implement and can also improve its accuracy.

Effects of the Invention

According to the present invention, good speech recognition can be performed with a simple configuration.
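The restriction of the cluster search by the determined final vowel can be illustrated with a minimal Python sketch. It assumes each cluster coefficient is stored as a mean vector and an inverse covariance matrix and that the distance is a squared Mahalanobis distance, in line with the Mahalanobis distance calculation circuits 41 to 44. The names `mahalanobis_sq`, `nearest_cluster`, and `CLUSTERS`, the cluster labels, and the two-dimensional toy values are hypothetical; the actual table contents are not reproduced in this text.

```python
def mahalanobis_sq(x, mean, inv_cov):
    """Squared Mahalanobis distance (x - mean)^T * inv_cov * (x - mean)."""
    diff = [xi - mi for xi, mi in zip(x, mean)]
    n = len(diff)
    return sum(diff[i] * inv_cov[i][j] * diff[j]
               for i in range(n) for j in range(n))


def nearest_cluster(x, clusters_by_vowel, final_vowel):
    """Search only the clusters stored under the determined final vowel.

    clusters_by_vowel: {vowel: {cluster_name: (mean, inv_cov)}}
    (hypothetical layout standing in for memory devices 53 and 54).
    """
    subset = clusters_by_vowel[final_vowel]
    return min(subset, key=lambda name: mahalanobis_sq(x, *subset[name]))


# Toy 2-D data with identity covariance, for illustration only.
IDENTITY = [[1.0, 0.0], [0.0, 1.0]]
CLUSTERS = {
    "A": {"K->A": ([0.0, 0.0], IDENTITY), "S->A": ([3.0, 0.0], IDENTITY)},
    "U": {"K->U": ([1.0, 1.0], IDENTITY)},
}

x = [0.9, 0.8]
# "K->U" is numerically closest overall, but with the final vowel
# already determined as "A" it is never evaluated.
print(nearest_cluster(x, CLUSTERS, "A"))  # K->A
```

With the cluster counts stated for the memory device 53 (32, 15, 30, 17, and 31), a determined final vowel of "A" means 32 distance calculations instead of 125.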

[Brief description of the drawings]

FIG. 1 is a diagram for explaining speech, FIGS. 2 to 5 are diagrams for explaining the conventional device, FIGS. 6 to 9 are diagrams for explaining the present invention, FIG. 10 is a system diagram of an example of the present invention, and FIG. 11 is a diagram for explaining it.

1 is a microphone, 3 is a low-pass filter, 4 is an AD conversion circuit, 5 is a clock generator, 6 is a band-pass filter, 7 is an emphasis circuit, 8 is an absolute value circuit, 9 is an average value circuit, 10 is a logarithmic circuit, 11 and 13 are discrete Fourier transform circuits, 12, 15, and 51 to 54 are memory devices, 14 and 41 to 44 are Mahalanobis distance calculation circuits, 16 and 61 to 64 are determination circuits, 17 is an output terminal, 20 is a transition point detection circuit, 31 is a difference circuit, 32 is a noise removal circuit, 33 is a silent-part detection circuit, 71 is a processing circuit, and 81 is a word/monosyllable determination circuit.

Claims

[Claims] 1. A speech recognition method in which transition parts between phonemes, including silence, are detected, speech of a predetermined length at each detected transition part is extracted and converted into parameters, and these parameters are used as the basic unit of recognition, characterized in that, when there are a plurality of transition points from a vowel to silence that are classified into different cluster coefficients, the cluster coefficient is determined on the basis of the number of transition points classified into each cluster coefficient.