JPH0455518B2

Info

Publication number: JPH0455518B2
Application number: JP59170659A
Authority: JP
Inventors: Satoshi Fujii; Katsuyuki Futayada
Original assignee: Matsushita Electric Industrial Co Ltd
Current assignee: Panasonic Holdings Corp
Priority date: 1984-08-16
Filing date: 1984-08-16
Publication date: 1992-09-03
Also published as: JPS6148897A

Description

[Detailed description of the invention]

Field of Industrial Application

The present invention relates to a speech recognition device for automatically recognizing the content of speech.

Structure of the Conventional Example and Its Problems

In speech recognition for unspecified speakers, the properties of speech differ greatly with gender and age, and the challenge is how to normalize those properties so that the voices of unspecified speakers can be recognized.

When speech is recognized in units of phonemes, the phoneme standard patterns vary widely with gender and age. For the vowel /a/, for example, the spectral shape differs greatly between men and women.

To deal with this problem, the conventional approach prepares several standard patterns for the same phoneme, calculates the similarity of the input speech to every standard pattern, and recognizes the phoneme according to which standard pattern is most similar.

This method has two drawbacks: the more standard patterns are prepared, the more they are confused with one another, which lowers recognition performance, and the amount of computation becomes enormous.

FIG. 1 shows an example block diagram of a conventional speech recognition device. First, speech data from many speakers is divided into groups in advance, using a clustering method or the like, and standard pattern groups are created in units of phonemes or syllables and stored in the standard pattern storage unit 11. For the sake of explanation, assume that standard pattern group 1 in the standard pattern storage unit 11 is built from male data only and standard pattern group 2 from female data only, with six standard patterns prepared for each group.

Input speech from the microphone 1 is AD-converted by the AD converter 2. One copy of the converted signal is sent to the signal processing circuit 3, which performs pre-emphasis and windowing, and then to the linear predictive analysis processor 4. The other copy is sent to the segmentation unit 5, which performs band power calculation, speech interval detection, voiced/unvoiced/silence determination, and consonant segmentation, and transfers the results to the main memory 7. Using the LPC parameters obtained by the linear predictive analysis processor 4, the similarity calculation unit 6 calculates similarities as follows. First, the standard patterns of standard pattern group 1 stored in the standard pattern storage unit 11 are transferred to the similarity calculation unit 6, which calculates the similarity for each frame and transfers it to the main memory 7. The same is then done for standard pattern group 2. The main processor 8 refers to the main memory 7, adopts for each frame the phoneme or syllable corresponding to the standard pattern with the highest similarity as the recognition result, and combines it with the result of the segmentation unit 5 to create a phoneme or syllable sequence. The sequence is matched against the word dictionary 12 to perform word recognition, and the result is sent to the output unit 9.

The drawbacks of this conventional example arise because similarity must always be calculated for all the standard pattern groups stored in the standard pattern storage unit 11:

(1) The amount of computation in the similarity calculation unit 6 becomes large, so an expensive unit capable of high-speed computation is required.

(2) Because several standard pattern groups are prepared and the phoneme with the highest similarity across all of them is used for recognition, the number of mutually similar phonemes grows, confusion among them increases, and recognition performance falls.

Object of the Invention

The present invention eliminates these drawbacks. Its object is to provide a speech recognition device that, by using the unknown input speech to automatically select the standard patterns best suited to that speech, recognizes the speech of unspecified speakers with high reliability without burdening the speaker, and that, by greatly reducing the computation needed for similarity calculation, can operate at high speed.

Structure of the Invention

To achieve this object, the present invention relates to a speech recognition device characterized in that the speech of many speakers is divided into groups in advance and standard patterns are obtained for each group; when unknown speech is input, its similarity to the standard patterns is calculated; from these similarities, a reliability that the unknown speech belongs to each group is calculated; and, once the reliability exceeds a certain threshold, only the standard patterns of the group that exceeded it are used to recognize subsequent unknown input speech.

Description of the Embodiments

The present invention is characterized in that the speech of unspecified speakers can be recognized stably regardless of age or gender. To this end, the standard patterns used for recognition are prepared in groups according to voice quality: men, women, children, the elderly, and so on.
Which group an actual input voice belongs to is not known in advance, but with this method the input speech is analyzed and its group is determined automatically and with high accuracy.

The embodiment below covers the case of only two groups, men (group 1) and women (group 2). FIG. 2 shows the configuration of the speech recognition device of the embodiment.

First, the contents of the standard pattern storage unit 25 are explained. In this embodiment, a mean value is obtained for each of groups 1 and 2, a covariance matrix is obtained over the groups as a whole, and from these the standard patterns, namely the weighting coefficients $a_{ij}$ and the average distances $d_i$, are obtained and stored in the standard pattern storage unit 25.

Let the mean of the LPC cepstrum coefficients of phoneme $i$ in the speech of group 1 be

$$m_i^{(1)} = \left(m_{i1}^{(1)}, m_{i2}^{(1)}, \ldots, m_{iP}^{(1)}\right)$$

where the superscript (1) denotes group 1 and $P$ is the number of parameters used. When the standard pattern is built frame by frame, $P = p$, where $p$ is the order of the LPC cepstrum coefficients. When the standard pattern is built as a time pattern of $n$ frames, $P = p \times n$.

Likewise, let the mean of the LPC cepstrum coefficients of phoneme $i$ in the speech of group 2 be

$$m_i^{(2)} = \left(m_{i1}^{(2)}, m_{i2}^{(2)}, \ldots, m_{iP}^{(2)}\right)$$

These means are obtained, for each of groups 1 and 2, for the vowels /a/, /i/, /u/, /e/, /o/ and the nasals, giving 12 in total.

Next, using the group 1 means $m_i^{(1)}$ and the group 2 means $m_i^{(2)}$, let $R$ be the covariance matrix common to these 12 phonemes and $R^{-1}$ its inverse. If the $(j, j')$ element of $R^{-1}$ is $r_{jj'}$, the weighting coefficient for the $j$-th LPC cepstrum coefficient is obtained, for phoneme $i$ of group 1, as

$$a_{ij}^{(1)} = 2 \sum_{j'=1}^{P} r_{jj'}\, m_{ij'}^{(1)} \qquad (1)$$

The average distance $d_i^{(1)}$ for phoneme $i$ is obtained as

$$d_i^{(1)} = m_i^{(1)t} R^{-1} m_i^{(1)} \qquad (2)$$

where $t$ denotes the transpose. These $a_{ij}^{(1)}$ and $d_i^{(1)}$ are obtained for each phoneme and stored in standard pattern group 1 of the standard pattern storage unit 25. Similarly, $a_{ij}^{(2)}$ and $d_i^{(2)}$ are obtained for group 2.
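As a rough illustration of equations (1) and (2), the following Python sketch computes the pair $(a_i, d_i)$ for one phoneme of one group. It is a minimal sketch under stated assumptions: the leading factor in equation (1) is taken to be 2 (consistent with the Mahalanobis form named in claim 2), the inverse covariance matrix is assumed to be already available, and the function name, dictionary keys, and numeric values are hypothetical.

```python
def build_standard_pattern(mean, r_inv):
    """Return (a, d) for one phoneme of one group.

    mean  : mean LPC cepstrum vector m_i (length P)
    r_inv : inverse covariance matrix R^-1 as a P x P list of lists
    """
    p = len(mean)
    # Equation (1), assuming a leading factor of 2: a_ij = 2 * sum_j' r_jj' * m_ij'
    a = [2.0 * sum(r_inv[j][k] * mean[k] for k in range(p)) for j in range(p)]
    # Equation (2): d_i = m_i^t R^-1 m_i
    d = sum(mean[j] * r_inv[j][k] * mean[k] for j in range(p) for k in range(p))
    return a, d


# Toy data with P = 2 and an identity inverse covariance (illustrative values only).
r_inv = [[1.0, 0.0], [0.0, 1.0]]
means = {("group1", "a"): [1.0, 2.0], ("group2", "a"): [3.0, 0.5]}
patterns = {key: build_standard_pattern(m, r_inv) for key, m in means.items()}
```

Storing only $a_i$ and $d_i$ means that recognition needs one dot product and one subtraction per phoneme, with no matrix arithmetic at run time.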
The group 2 values $a_{ij}^{(2)}$ and $d_i^{(2)}$ are stored in standard pattern group 2. The covariance matrix may of course also be obtained separately for each of groups 1 and 2.

Next, the operation of the similarity calculation unit 24 is explained. When unknown input speech enters from the microphone 20 (assumed here to be the word "hajime" (ha zi me)), the signal processing circuit 22 performs pre-emphasis and windowing, and the linear predictive analysis processor 23 then obtains the LPC cepstrum coefficients $c_j$ ($j = 1, 2, \ldots, p$). When a time pattern is used, the coefficients $c_1$ to $c_p$ of the $n$ frames are concatenated to form $c_1$ to $c_P$. The similarity calculation unit 24 calculates the similarity from these $c_j$ and the standard patterns sent through the standard pattern switching unit 32. For phoneme $i$ of group 1, the similarity $l_i^{(1)}$ is obtained as

$$l_i^{(1)} = \sum_{j=1}^{P} a_{ij}^{(1)} c_j - d_i^{(1)} \qquad (3)$$

and for group 2 likewise as

$$l_i^{(2)} = \sum_{j=1}^{P} a_{ij}^{(2)} c_j - d_i^{(2)}$$

The similarities for all 12 phonemes are obtained and transferred to the main memory 27.

The segmentation unit 26 calculates the band power and makes the voiced/unvoiced decision, determines the speech interval, detects the consonant intervals (here /h/, /z/, /m/ of ha zi me), and transfers the results to the main memory 27. The main processor 28 uses the consonant intervals and similarities registered in the main memory 27 to determine the vowel and nasal intervals (here /a/, /i/, /m/, /e/ of ha zi me) and obtains $N$ phoneme centers (the middle of each interval or the position of maximum similarity; here $N = 4$).

Next, the operation of the selection unit 29 is explained. At the phoneme center of each of the four phonemes found above (/a/, /i/, /m/, /e/), the maximum similarity over all the prepared phonemes (/a/, /i/, /u/, /e/, /o/, nasals) is obtained for each group; it is written $l_i^{(1)}$ for group 1 and $l_j^{(2)}$ for group 2. This is done at each of the $N$ phoneme centers, and the similarities are summed for each group.
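The similarity of equation (3) and the per-group maximum taken at one phoneme center can be sketched in Python as follows. This is only an illustrative sketch: the function names, the nested-dictionary layout, and the numeric patterns are assumptions made for the example, not values from the patent.

```python
def similarity(c, pattern):
    """Equation (3): l_i = sum_j a_ij * c_j - d_i."""
    a, d = pattern
    return sum(a_j * c_j for a_j, c_j in zip(a, c)) - d


def max_similarity_per_group(c, patterns):
    """patterns: {group: {phoneme: (a, d)}} -> {group: (best_phoneme, best_similarity)}."""
    best = {}
    for group, by_phoneme in patterns.items():
        # Pick the phoneme whose standard pattern scores highest within this group.
        phoneme = max(by_phoneme, key=lambda ph: similarity(c, by_phoneme[ph]))
        best[group] = (phoneme, similarity(c, by_phoneme[phoneme]))
    return best


# Hypothetical standard patterns (P = 2) for two phonemes in each of two groups.
patterns = {
    "group1": {"a": ([2.0, 4.0], 5.0), "i": ([-2.0, 2.0], 2.0)},
    "group2": {"a": ([6.0, 1.0], 9.25), "i": ([0.0, -2.0], 1.0)},
}
c = [1.0, 1.5]  # LPC cepstrum vector at one phoneme center
best = max_similarity_per_group(c, patterns)
```

In this toy case the frame scores 3.0 against group 1 and -1.75 against group 2, so it contributes evidence in favor of group 1.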
The per-group sums of these maximum similarities are denoted $L^{(1)}$ and $L^{(2)}$:

$$L^{(1)} = \sum_{n=1}^{N} l_i^{(1)} \qquad (4)$$

$$L^{(2)} = \sum_{n=1}^{N} l_j^{(2)} \qquad (5)$$

Using $L^{(1)}$ and $L^{(2)}$, the reliability $R_e$ is defined as

$$R_e^{(1)} = L^{(1)} - L^{(2)} \qquad (6)$$

When there are three or more groups, the sum of similarities is obtained for each group, and the reliability $R_e^{(1)}$ is calculated by equation (6) from the two largest sums.

If $R_e^{(1)}$ is positive and exceeds a predetermined threshold, the user's voice is determined to belong to group 1. If it is negative and its absolute value exceeds the threshold, the user's voice is determined to belong to group 2. Once the group has been determined in this way, the selection unit 29 instructs the standard pattern switching unit 32 to supply only the standard patterns of the determined group to the similarity calculation unit 24, and then ends its operation.

If $R_e$ does not exceed the threshold, the selection unit 29 instructs the standard pattern switching unit 32 to select the standard patterns of both group 1 and group 2, and further instructs the main processor 28 to use the group 1 similarities for phoneme recognition when $R_e$ is positive and the group 2 similarities when it is negative.

Accordingly, while the reliability $R_e$ does not exceed the threshold, the main processor 28 follows the instruction of the selection unit 29: it performs phoneme recognition with the indicated similarities, matches the result against the word dictionary 30 to perform word recognition, and transfers the dictionary word with the highest similarity to the output unit 31 as the recognition result.

Also, while the reliability $R_e$ does not exceed the threshold, the standard pattern switching unit 32 transfers standard pattern groups 1 and 2 in turn, as instructed by the selection unit 29, and the similarity calculation unit 24 repeats the similarity calculation for both groups. During this period the similarity calculation unit 24 has a heavy computational load, but once the selection unit 29 has finished its operation, only the similarities of the determined group need to be calculated, and the amount of computation is greatly reduced. In addition, the main processor 28 obtains its phoneme recognition from highly reliable standard patterns, so the accuracy of word recognition improves.

In the embodiment described above, the reliability is calculated from the sum of the maximum similarities, but it may instead be calculated from the number of times the maximum similarity is obtained. Suppose that, among the similarities obtained by the similarity calculation unit 24, the highest is $l_i^{(1)}$. The main processor 28 notifies the selection unit 29 that the maximum is $l_i^{(1)}$. The selection unit 29 takes $l_i^{(1)}$ as belonging to group 1 and increments the count $N^{(1)}$. When $l_j^{(2)}$ is reported, it increments the count $N^{(2)}$. The reliability $R_e$ is then calculated as

$$R_e^{(1)} = N^{(1)} / N^{(2)} \qquad (7)$$

$$R_e^{(2)} = N^{(2)} / N^{(1)} \qquad (8)$$

When either $R_e^{(1)}$ or $R_e^{(2)}$ exceeds a predetermined threshold, group 1 is chosen if it is $R_e^{(1)}$ and group 2 if it is $R_e^{(2)}$.
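Both reliability measures, the similarity-sum form of equations (4) to (6) and the count form of equations (7) and (8), can be sketched in Python as below. The function names and numbers are hypothetical. Two behaviors are assumptions not stated in the text: a reliability of exactly zero is treated as favoring group 2, and a zero count in a denominator yields an infinite ratio.

```python
def sum_reliability(max_sims_g1, max_sims_g2):
    """Equations (4)-(6): Re = L1 - L2, where L is the sum of per-center maximum similarities."""
    return sum(max_sims_g1) - sum(max_sims_g2)


def decide_by_sum(re, threshold):
    """Return (locked_group, group_to_use_now).

    locked_group stays None until |Re| exceeds the threshold.
    Assumption: Re == 0 is treated as favoring group 2.
    """
    leading = 1 if re > 0 else 2
    locked = leading if abs(re) > threshold else None
    return locked, leading


def count_reliability(n1, n2):
    """Equations (7), (8): ratios of how often each group gave the maximum similarity.

    Assumption: a zero denominator is mapped to infinity.
    """
    re1 = n1 / n2 if n2 else float("inf")
    re2 = n2 / n1 if n1 else float("inf")
    return re1, re2


# Maximum similarities at N = 4 phoneme centers (illustrative values only).
re = sum_reliability([3.0, 2.5, 1.0, 2.0], [-1.75, 0.5, 1.5, 0.25])
decision = decide_by_sum(re, threshold=5.0)
```

With these numbers the difference is 8.0, which exceeds the assumed threshold of 5.0, so group 1 would be fixed for all later input.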
While the threshold is not exceeded, $N^{(1)}$ and $N^{(2)}$ are compared, and the main processor 28 is instructed to use the similarities of the group with the larger count for phoneme recognition.

Because this method uses only the number of times the maximum similarity is obtained, it is more stable than the similarity-sum method against factors that distort the speech spectrum, such as noise.

When there are three or more groups, the reliability $R_e$ may be calculated for the two groups with the highest counts.

Next, the automatic selection of the standard pattern group is explained using the block diagram of FIG. 2 and the flowchart of FIG. 3.

As shown in step イ, suppose an arbitrary word, for example "hajime" (ha zi me), is spoken into the microphone. The speech is A/D-converted by the A/D converter 21 (step ロ); one copy is sent to the signal processing circuit 22 and the other to the segmentation unit 26. As shown in step ハ, the signal processing circuit 22 performs pre-emphasis and Hamming-window calculation for each frame and sends the result to the linear predictive analysis processor 23.
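The per-frame pre-emphasis and Hamming windowing of step ハ can be illustrated with a short Python sketch. The pre-emphasis coefficient of 0.97 is a commonly used value chosen here as an assumption, since the text gives none; the function names are likewise hypothetical. The Hamming window uses its standard definition.

```python
import math


def pre_emphasize(samples, alpha=0.97):
    """First-order pre-emphasis y[n] = x[n] - alpha * x[n-1] (alpha is an assumed value)."""
    if not samples:
        return []
    return [samples[0]] + [samples[n] - alpha * samples[n - 1] for n in range(1, len(samples))]


def hamming_window(length):
    """Standard Hamming window w[n] = 0.54 - 0.46 * cos(2*pi*n / (length - 1))."""
    if length < 2:
        return [1.0] * length
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (length - 1)) for n in range(length)]


def window_frame(frame):
    """Multiply one frame of samples by a Hamming window of the same length."""
    return [x * w for x, w in zip(frame, hamming_window(len(frame)))]
```

Pre-emphasis boosts the high-frequency part of the spectrum before analysis, and the window tapers the frame edges so that the subsequent linear predictive analysis sees a smoothly bounded segment.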
The linear predictive analysis processor 23 performs linear predictive analysis to obtain the LPC cepstrum coefficients $C = (c_1, c_2, \ldots, c_j, \ldots, c_p)$ (step ニ) and sends them to the similarity calculation unit 24.

Meanwhile, the segmentation unit 26 performs the band-filter calculation (step ホ) and, using the LPC cepstrum coefficients $C$ obtained by the linear predictive analysis processor 23, performs voiced/unvoiced determination and speech interval detection (step ヘ), followed by segmentation and discrimination of the consonants /h/, /z/, /m/ of ha zi me (step ト), and sends the results to the main memory 27 (step チ).

The selection unit 29 sends the $a_{ij}^{(1)}$ and $d_i^{(1)}$ prepared in advance in standard pattern group 1 to the similarity calculation unit 24 (step リ). As shown in step ヌ, the similarity calculation unit 24 obtains the similarity $l_i^{(1)}$ for phoneme $i$ of group 1 by

$$l_i^{(1)} = \sum_{j=1}^{P} a_{ij}^{(1)} c_j - d_i^{(1)} \qquad (3)$$

The similarity is preferably based on a statistical distance measure such as the Bayes decision or the Mahalanobis distance.

Similarly, $l_i^{(2)}$ is obtained for group 2, and these similarities are transferred to the main memory 27 (step ル).

The main processor 28 refers to the segmentation result and the similarities for vowels and nasals and, as in step オ, determines the vowel and nasal segments /a/, /i/, /m/, /e/ of ha zi me. For each of these segments it selects the center frame that is most vowel-like or nasal-like (the middle of the segment or the position of maximum similarity) and passes its position information to the selection unit 29.

The selection unit 29 obtains the maximum similarity for each group at the center frames and then the similarity sums $L^{(1)}$ and $L^{(2)}$. It then calculates the reliability using equation (6) or equations (7) and (8) and determines whether or not the threshold is exceeded (step カ).
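The remark that the similarity should rest on a statistical distance measure can be made concrete for equation (3). The derivation below is a sketch under two assumptions: the leading factor in equation (1) is 2, and the covariance matrix $R$ is shared by all phonemes and groups, as in the embodiment. Group superscripts are omitted, and $D_i^2$ (a symbol introduced here) denotes the squared Mahalanobis distance between the input vector $c$ and the mean $m_i$.

```latex
\begin{aligned}
D_i^2 &= (c - m_i)^{t} R^{-1} (c - m_i) \\
      &= c^{t} R^{-1} c \;-\; 2\, m_i^{t} R^{-1} c \;+\; m_i^{t} R^{-1} m_i \\
      &= c^{t} R^{-1} c \;-\; \Bigl( \sum_{j=1}^{P} a_{ij} c_j - d_i \Bigr) \\
      &= c^{t} R^{-1} c \;-\; l_i
\end{aligned}
```

The second line uses the symmetry of $R^{-1}$; the third substitutes $a_{ij} = 2 \sum_{j'} r_{jj'} m_{ij'}$ and $d_i = m_i^{t} R^{-1} m_i$. Because $c^{t} R^{-1} c$ is the same for every phoneme and every group, the standard pattern with the largest $l_i$ is the one with the smallest Mahalanobis distance, which is why comparing the linear scores across groups is meaningful.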
Based on this result, the standard pattern switching unit 32 selects a standard pattern group in the standard pattern storage unit 25.

Next, FIG. 4 shows the flow of processing in the speech recognition device of this embodiment. When the first speech is input (decision ツ), it is acoustically analyzed (step ネ) and, via decision ナ, segmentation and similarity calculation 1 are performed (step ラ). At this stage the similarity is calculated for the standard patterns of all the prepared groups. Next, the phoneme centers of the vowels and nasals in the speech are extracted, and the reliability for discriminating the group is calculated (step ム). If the reliability does not exceed the threshold (decision ウ), phoneme recognition is performed using the similarities of the group whose reliability is highest at that point. If it exceeds the threshold, a standard-pattern-selection-end command is issued (step マ), and phoneme recognition is performed using the similarities of the group that exceeded the threshold (step イ). Word recognition is performed from the phoneme recognition result (step ヲ), the word recognition result is output (step ワ), and the device returns to waiting for speech input.

When the next speech is input, it is acoustically analyzed (step ネ), and the device checks whether the standard-pattern-selection-end command has been issued (decision ナ). If it has not, the same processing as for the first speech is repeated. If it has, the group has already been determined, so segmentation and similarity calculation 2 are performed using only the standard patterns of that group (step ヤ), and processing moves on to the phoneme recognition routine.

The processing flow of the device is thus simple, and the device can be realized without particularly complex arithmetic.

Using the method of this embodiment, with 100 adult men and women as subjects and the first 10 of 212 words, the number of words needed to exceed the threshold was determined for each speaker; the resulting numbers of speakers are shown in Table 1.
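The FIG. 4 procedure, in which evidence accumulates over successive words until the threshold is crossed and only one group is scored afterwards, can be sketched in Python as follows. This is a minimal sketch assuming two groups and the similarity-sum reliability of equation (6), accumulated across utterances as the claims describe; the class name, method names, and numbers are hypothetical.

```python
class GroupSelector:
    """Accumulates per-group similarity and locks a group once |Re| exceeds the threshold."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.totals = {1: 0.0, 2: 0.0}  # running L(1) and L(2)
        self.locked_group = None        # set when the selection-end command is issued

    def groups_to_score(self):
        """Groups whose standard patterns must be scored for the next utterance."""
        return [self.locked_group] if self.locked_group else [1, 2]

    def update(self, max_sims_by_group):
        """max_sims_by_group: {group: [max similarity at each phoneme center]}.

        Returns the group whose similarities should be used for phoneme recognition.
        """
        if self.locked_group:
            return self.locked_group
        for group, sims in max_sims_by_group.items():
            self.totals[group] += sum(sims)
        re = self.totals[1] - self.totals[2]  # equation (6)
        leading = 1 if re > 0 else 2          # assumption: Re == 0 favors group 2
        if abs(re) > self.threshold:
            self.locked_group = leading       # selection finished; score one group from now on
        return leading


selector = GroupSelector(threshold=5.0)
first = selector.update({1: [3.0, 2.5], 2: [2.0, 1.5]})   # Re = 2.0, still undecided
second = selector.update({1: [3.0, 2.0], 2: [0.5, 0.5]})  # Re = 6.0, group 1 is locked
```

After the second word the selector reports only group 1 from `groups_to_score()`, which corresponds to the switch from similarity calculation 1 to similarity calculation 2 in the flowchart.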

[Table]

That is, with up to four words the group is judged correctly for 98 of the 100 speakers. Of the remaining two speakers, one needs up to nine words but is then judged correctly. When the group judgment is wrong, the vowel and nasal recognition rate falls sharply, from 88.4% to 59.3%, so it is important to use enough learning words to avoid misjudgment. The one speaker who was misjudged was a woman judged to be a man. For this speaker, however, using the male standard patterns changed the vowel and nasal recognition rate only from 78.5% to 75.5%, a very small drop. In other words, this speaker's voice also fits the male standard patterns, so the wrong male/female decision causes no problem. With this embodiment, men and women can thus be discriminated with high accuracy.

[Table]

Table 2 shows the results of evaluating and comparing, frame by frame, the average phoneme recognition rate for the five vowels and the nasals, with 20 men and women as subjects. "No male/female distinction" is the conventional method described above: separate male and female standard patterns are prepared, and the standard pattern giving the maximum similarity is taken as the recognition result without distinguishing between men and women. "With male/female distinction" is the method of this embodiment. Each entry gives the frame recognition rate in %, with the variation in the recognition rate shown in parentheses as a standard deviation.

Compared with the conventional method, this embodiment raises the recognition rate and reduces its variation. The effect is particularly large in improving the recognition rate for women and in reducing the variation for men, which demonstrates the effectiveness of this embodiment.

Effects of the Invention

As described above, the present invention divides the speech of many speakers into groups in advance, creates standard patterns for recognition for each group, and provides a function that uses the unknown input speech to automatically select the standard patterns best suited to that speech. It therefore has the following advantages:

(1) Speech can be recognized using the standard patterns best suited to the user's voice without burdening the user, so stable, highly accurate recognition is achieved for unspecified speakers.

(2) By narrowing the standard patterns in use to a single set, the amount of computation is reduced and a speech recognition device with a high processing speed can be realized.

[Brief description of the drawings]

FIG. 1 is a functional block diagram of a conventional speech recognition device. FIG. 2 is a functional block diagram of a speech recognition device in one embodiment of the present invention. FIG. 3 is a flowchart explaining the automatic selection function for the standard pattern groups in one embodiment of the present invention. FIG. 4 is a flowchart showing an example of the recognition procedure of the speech recognition device of the present invention.

23: linear predictive analysis processor; 24: similarity calculation unit; 25: standard pattern storage unit; 26: segmentation unit; 28: main processor; 29: selection unit; 30: word dictionary unit; 32: standard pattern switching unit.

Claims

[Scope of Claims]

1. A speech recognition device comprising: an acoustic analysis unit that calculates, for each frame period, a spectrum or information similar to a spectrum (hereinafter referred to as spectral information) from input speech; a segmentation unit that detects the speech interval of the input speech and performs segmentation into phonemes; a standard pattern storage unit that stores in advance a plurality of standard pattern groups obtained by classifying standard speech signals from many speakers into groups of speakers with similar characteristics; a standard pattern switching unit that selects a standard pattern group in the standard pattern storage unit; a similarity calculation unit that calculates, for each phoneme, a similarity based on a statistical distance measure using the standard pattern group in the standard pattern storage unit and the spectral information; a processor unit that determines at least the vowel parts from the results of the similarity calculation unit and the segmentation unit and selects position information of frames indicating the stationary part of each phoneme; a selection unit that, using the similarities calculated by the similarity calculation unit at the positions given by the position information obtained by the processor unit, obtains for each standard pattern group a cumulative total of the degree of resemblance of all input speech to the prepared standard pattern groups, calculates a reliability as the difference or ratio between the two standard pattern groups with the largest cumulative totals, and controls the standard pattern switching unit so as to select one of the two standard pattern groups when the reliability exceeds a certain threshold and to read out all the standard pattern groups in sequence when the reliability does not exceed the threshold; and a word dictionary unit storing a word dictionary to be matched against the phoneme or syllable sequence created by the processor unit using the standard pattern group selected by the standard pattern switching unit or the standard pattern group with the largest cumulative total.

2. The speech recognition device according to claim 1, wherein the statistical distance measure is a Mahalanobis distance formed from a covariance matrix common to all the standard pattern groups concerned and a mean value set for each phoneme of each standard pattern group.

3. The speech recognition device according to claim 1, wherein the cumulative degree of resemblance of all input speech to the prepared standard pattern groups is obtained from the number of standard patterns giving the maximum similarity or from the sum of the maximum similarities.

4. The speech recognition device according to claim 1, wherein the standard pattern storage unit stores standard pattern groups comprising at least a male voice group and a female voice group.