JP3467556B2

JP3467556B2 - Voice recognition device

Info

Publication number: JP3467556B2
Application number: JP14648293A
Authority: JP
Inventors: 満広稲積
Original assignee: Seiko Epson Corp
Current assignee: Seiko Epson Corp
Priority date: 1992-06-19
Filing date: 1993-06-17
Publication date: 2003-11-17
Anticipated expiration: 2018-11-17
Also published as: JPH0667698A

Description

Detailed Description of the Invention

[0001]

[Field of Industrial Application] The present invention relates to a speech recognition device using a neural network.

[0002]

[Prior Art] The techniques in practical use in conventional speech recognition devices fall broadly into two groups: the DP matching method and the hidden Markov model (HMM) method. Both are described in detail in, for example, Seiichi Nakagawa, "Speech Recognition by Probabilistic Models" (確率モデルによる音声認識).

[0003] In summary, the DP matching method assumes that the start and end points of the input data correspond to those of the reference data, and warps the interior using various time-normalization functions. The warping that minimizes the difference is found, and the distance between the patterns under that warping is taken as the penalty score of that reference pattern. Among the plural reference patterns, the one with the smallest penalty score is taken as the matching result.
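The DP matching procedure summarized above can be sketched as follows. This is only an illustrative sketch, not code from the cited literature: the Euclidean frame distance, the symmetric step pattern without slope constraints, and the names `dtw_distance` and `dp_match` are all assumptions made for the example.

```python
def dtw_distance(seq_a, seq_b):
    """Smallest accumulated frame distance with start and end points pinned."""
    inf = float("inf")
    n, m = len(seq_a), len(seq_b)
    # cost[i][j] = best accumulated distance aligning seq_a[:i] with seq_b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Euclidean distance between the two frames (an assumed metric)
            d = sum((a - b) ** 2 for a, b in zip(seq_a[i - 1], seq_b[j - 1])) ** 0.5
            # Allowed warping moves: stretch either sequence or advance both
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def dp_match(input_seq, templates):
    """Return the label of the reference pattern with the smallest penalty score."""
    scores = {label: dtw_distance(input_seq, ref) for label, ref in templates.items()}
    best = min(scores, key=scores.get)
    return best, scores
```

For instance, `dp_match(frames, {"a": ref_a, "b": ref_b})` returns the label whose reference pattern warps onto `frames` at the lowest accumulated distance, together with all the scores.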

[0004] The HMM-based speech recognition method, by contrast, attempts to perform speech recognition probabilistically. In this method, an HMM is set up that corresponds to the reference pattern of the DP method. One HMM consists of a plurality of states and a plurality of transitions. Each state is given an existence probability, and each transition is given a transition probability and an output probability. This makes it possible to calculate the probability that a given HMM generates a given time-series pattern.
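The probability calculation described above can be illustrated with the standard forward algorithm. The sketch below assumes discrete output symbols, treats the "existence probability" of a state as its initial probability, attaches the output probabilities to transitions as the text describes, and sums over all final states; the function name and data layout are hypothetical.

```python
def sequence_probability(initial, transitions, observations):
    """Forward algorithm for an HMM whose outputs are attached to transitions.

    initial:     {state: probability of starting in that state}
    transitions: {(src, dst): (transition_prob, {symbol: output_prob})}
    """
    alpha = dict(initial)  # probability mass currently sitting in each state
    for symbol in observations:
        nxt = {}
        for (src, dst), (p_trans, p_out) in transitions.items():
            # Take the transition and emit `symbol` while doing so
            p = alpha.get(src, 0.0) * p_trans * p_out.get(symbol, 0.0)
            if p:
                nxt[dst] = nxt.get(dst, 0.0) + p
        alpha = nxt
    # No designated final state is assumed, so sum over all states
    return sum(alpha.values())
```

Evaluating this for each word model and picking the model with the highest probability plays the same role as picking the lowest penalty score in DP matching.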

[0005]

[Problem to Be Solved by the Invention] The characteristics of speech data differ greatly from speaker to speaker. In particular, when speakers differ in sex or age group, such as men and women or adults and children, the same sentence (or word) yields speech data whose voice patterns have entirely different characteristics. Consequently, a conventional speech recognition device built using speech data from specific speakers as training data could hardly recognize speech data from a third party whose voice pattern differs greatly from the characteristics of the training speakers.

[0006] An object of the present invention is to provide a speech recognition device that can accurately recognize speech data having different voice patterns.

[0007]

[0008]

[Means for Solving the Problem] To achieve this object, the speech recognition device of the present invention comprises: speech recognition processing means including a plurality of speech recognition neural network units, each trained in advance with a voice pattern of different characteristics to recognize predetermined speech data, and each performing a speech recognition operation that determines whether input speech data matches the speech data to be recognized, together with an operation that outputs fitness determination data representing the degree of fit of the speech recognition; selection means for selecting, on the basis of the fitness determination data output from the speech recognition neural network units, the speech recognition neural network unit with the highest degree of fit; and output control means for outputting the speech recognition result from the speech recognition neural network unit selected by the selection means.

[0009] Here, the speech recognition device preferably includes feature extraction means that cuts out the input speech data frame by frame, converts it into feature vectors, and outputs them sequentially, and each speech recognition neural network unit is preferably formed so that the feature vectors output from the feature extraction means are input to it as the speech data.

[0010] Furthermore, each speech recognition neural network unit is preferably configured by interconnecting a plurality of neurons each having an internal state value X. Each neuron is formed as a dynamic neuron whose internal state value X changes over time to a value satisfying a function X = G(X, Zj) expressed using the input data Zj (j = 0 to n, where n is a natural number) given to that neuron and the internal state value X, and each dynamic neuron is formed so as to convert its internal state value X into a value satisfying a function F(X) and output it.

[0011]-[0013] Here, the function X = G(X, Zj) can be formed so as to be expressed by Equation 5:

[Equation 5]

[0014]-[0016] The function X = G(X, Zj) can also be expressed, using the coupling strength Wij that couples the output of the j-th neuron to the input of the i-th neuron, an external input value Di, and a bias value θi, by Equation 6:

[Equation 6]

[0017]-[0019] The function X = G(X, Zj) can also be expressed, using a sigmoid function S, by Equation 7:

[Equation 7]

[0020]-[0022] The function X = G(X, Zj) can also be expressed, using the sigmoid function S, the coupling strength Wij that couples the output of the j-th neuron to the input of the i-th neuron, the external input value Di, and the bias value θi, by Equation 8:

[Equation 8]
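The equation images for Equations 5 to 8 are not reproduced in this text. Purely as an aid to reading, and not as the patent's actual equations, the following shows one set of forms that would be consistent with the symbols named above and with the later statement that the update uses a first-order differential equation with a constant τ. The use of Y_j for the output of the j-th neuron, the signs, and the placement of S are assumptions.

```latex
\begin{align}
\tau \frac{dX_i}{dt} &= -X_i + \sum_{j=0}^{n} Z_j
  && \text{(assumed reading of Equation 5)} \\
\tau \frac{dX_i}{dt} &= -X_i + \sum_{j} W_{ij} Y_j + D_i + \theta_i
  && \text{(assumed reading of Equation 6)} \\
\tau \frac{dX_i}{dt} &= -X_i + S\!\left( \sum_{j=0}^{n} Z_j \right)
  && \text{(assumed reading of Equation 7)} \\
\tau \frac{dX_i}{dt} &= -X_i + S\!\left( \sum_{j} W_{ij} Y_j + D_i + \theta_i \right)
  && \text{(assumed reading of Equation 8)}
\end{align}
```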

[0023] Each speech recognition neural network unit can include input neurons to which the speech data is input, recognition result output neurons that output the recognition result for the speech data, and a fitness output neuron that outputs the fitness determination data. The fitness output neuron can be formed so as to estimate the speech data input to the input neurons and output this estimate as the fitness determination data, and the selection means can be formed so as to calculate, as the degree of fit of the speech recognition, the rate at which the estimated data is correct with respect to the actual speech data.

[0024] The function F(X) can be a sigmoid function.

[0025] The function F(X) can also be a threshold function.

[0026] Each dynamic neuron can be formed so that the input data Zj includes data obtained by multiplying the neuron's own output by a weight and feeding it back.

[0027] Each dynamic neuron can also be formed so that the input data Zj includes data obtained by multiplying the outputs of other neurons by weights.

[0028] Each dynamic neuron can also be formed so that the input data Zj includes desired data supplied from outside.

[0029] In the speech recognition device of the present invention, the input speech data is supplied to the plurality of speech recognition neural network units provided in the speech recognition processing means. Each speech recognition neural network unit performs recognition processing on the input speech data and computes fitness determination data indicating how well the input speech data fits the speech data used for training.

[0030] Because the speech recognition neural network units have each been trained in advance to recognize the speech data with a different voice pattern, the degree of recognition fit also takes a different value for each neural network unit.

[0031] The fitness determination data of each speech recognition neural network unit is supplied to the selection means, which selects the speech recognition neural network unit with the highest degree of recognition fit. The selection result is supplied to the output control means, which outputs the speech recognition result from the selected speech recognition neural network unit.

[0032] In this way, speech data having different voice patterns can be recognized accurately.

[0033] Here, each speech recognition neural network unit is preferably configured by interconnecting a plurality of neurons each having an internal state value X. Each neuron is preferably configured as a dynamic neuron whose internal state value X changes over time to a value satisfying the function X = G(X, Zj) expressed using the input data Zj (j = 0 to n, where n is a natural number) and the internal state value X.

[0034] This simplifies the data processing of the neural network unit as a whole and improves speech recognition accuracy.

[0035] Another speech recognition device of the present invention comprises: feature extraction means that cuts out the input speech data frame by frame, converts it into feature vectors, and outputs them sequentially; speech recognition processing means including a plurality of speech recognition neural network units, each trained in advance to predict, on the basis of the feature vectors of a speaker to be recognized that are input from the feature extraction means, the input feature vectors of that speaker and to output the prediction as fitness determination data representing the degree of fit of the speech recognition, and each formed so as to output the fitness determination data on the basis of the feature vectors actually input from the feature extraction means; and speaker recognition means that calculates, for each speech recognition neural network unit, the rate at which the fitness determination data output from that unit is correct with respect to the actual speaker's feature vectors input from the feature extraction means, and thereby recognizes the speaker of the input speech.

[0036] With this configuration, a plurality of speakers can be accurately recognized from the input speech data.

[0037] Here, each speech recognition neural network unit is preferably configured by interconnecting a plurality of neurons each having an internal state value X. Each neuron is formed as a dynamic neuron whose internal state value X changes over time to a value satisfying a function X = G(X, Zj) expressed using the input data Zj (j = 0 to n, where n is a natural number) given to that neuron and the internal state value X, and each dynamic neuron is formed so as to convert its internal state value X into a value satisfying a function F(X) and output it.

[0038] Each speech recognition neural network unit can also include input neurons to which the feature vectors are input and a fitness output neuron that outputs the fitness determination data, and the fitness output neuron can be formed so as to estimate the input feature vectors and output this estimate as the fitness determination data.

[0039]

[Embodiments] Preferred embodiments of the present invention will now be described in detail with reference to the drawings.

[0040] FIG. 1 shows a preferred embodiment of the speech recognition device of the present invention.

[0041] Overall Description of the Speech Recognition Device

The speech recognition device of the embodiment includes a feature extraction unit 10, a speech recognition processing unit 20, a selection unit 30, and an output control unit 40.

[0042] As shown in FIG. 2, the feature extraction unit 10 cuts out the input analog speech data 100 frame by frame, converts it into feature vectors 110, and outputs them to the speech recognition processing unit 20. The feature vectors 110 are obtained as follows. As shown in FIG. 2(A), the analog speech data 100 is cut out sequentially in units of a predetermined frame 102. As shown in FIG. 2(B), features are extracted from the speech data 100 cut out in frame units by linear prediction analysis, a filter bank, or the like, and are output sequentially to the speech recognition processing unit 20 as a sequence of feature vectors 110.
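The frame-by-frame conversion described above can be sketched as follows. The sketch assumes the analog speech data has already been digitized into a list of samples, and it substitutes two simple per-frame features (log energy and zero-crossing rate) for the linear prediction analysis or filter bank named in the text. The frame length, hop size, and function names are hypothetical.

```python
import math


def split_into_frames(samples, frame_len, hop):
    """Cut the sample sequence into overlapping frames of frame_len samples."""
    return [samples[start:start + frame_len]
            for start in range(0, len(samples) - frame_len + 1, hop)]


def frame_features(frame):
    """Toy feature vector for one frame: [log energy, zero-crossing rate]."""
    energy = sum(x * x for x in frame) / len(frame)
    log_energy = math.log(energy + 1e-12)  # small offset avoids log(0) on silence
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0))
    return [log_energy, crossings / (len(frame) - 1)]


def extract_feature_vectors(samples, frame_len=160, hop=80):
    """Return one feature vector per frame, in time order (frame_len must be >= 2)."""
    return [frame_features(f) for f in split_into_frames(samples, frame_len, hop)]
```

Each returned vector would then be presented to every neural network unit in turn, one frame per time step.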

[0043] The speech recognition processing unit 20 includes a plurality of neural network units 200-1, 200-2, ..., 200-k. The feature vectors 110 output from the feature extraction unit 10 are input to each of the neural network units.

[0044] The neural network units 200-1, 200-2, ..., 200-k have each been trained with a voice pattern of different characteristics to recognize predetermined speech data. Each neural network unit performs a speech recognition operation that determines whether the speech data input as feature vectors 110 matches the speech data to be recognized, and additionally outputs fitness determination data representing the degree of fit of that recognition.

[0045] Suppose, for example, that the spoken word "biiru" (beer) is to be recognized. As noted above, even when the same word "biiru" is spoken, the characteristics of its voice pattern differ greatly depending on the speaker. Therefore, for example, neural network units 200-1 and 200-2 are trained to recognize the speech data "biiru" using male voices whose voice patterns have different characteristics, and neural network unit 200-k is trained to recognize the speech data "biiru" using a female voice. Each of the neural network units 200-1, 200-2, ..., 200-k then performs a speech recognition operation that determines whether the input speech data matches the speech data "biiru" used in training, and outputs its recognition result 120 to the output control unit 40. At the same time, each neural network unit computes data for judging the degree of fit of this speech recognition and outputs it to the selection unit 30 as fitness determination data 130. On the basis of the fitness determination data 130 output from the neural networks 200-1, 200-2, ..., 200-k, the selection unit 30 identifies the neural network with the highest degree of recognition fit and outputs selection data 140 to the output control unit 40.

[0046] The recognition-fit determination process judges, by way of the fitness determination data 130, the degree of fit between the input speech data and the speech data used in training. In this process, the neural network is trained so that, from the input speech data, it can estimate speech data earlier in time (in the past) than that input, and the rate at which the estimates are correct is obtained as the degree of recognition fit. For example, in FIG. 2, the neural network 200 is trained so that, when a feature vector 110 is input, it can predict the feature vector 110a that was input one step earlier, and the estimated feature vector is output to the selection unit 30 as fitness determination data 130. The temporal relationships within the input data reflect the individuality of the speaker, so a speaker whose speech data is easy to estimate has phonemes whose individuality or characteristics resemble those of the speech data used to train that neural network.

[0047] The selection unit 30 therefore compares the fitness determination data 130 (the estimated preceding feature vector) output from each of the neural networks 200-1, 200-2, ..., 200-k with the preceding feature vector 110a actually output from the feature extraction unit 10, and calculates the correct-answer rate for each neural network. The speech recognition output of the neural network with the highest correct-answer rate (degree of recognition fit) can be regarded as the most likely to be correct, so that output should be adopted as the recognition result of the speech recognition device. The selection unit 30 then outputs, to the output control unit 40, selection data 140 identifying the neural network with the highest degree of recognition fit.
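The comparison performed by the selection unit 30 can be sketched as below. The text does not say how a predicted feature vector is judged "correct", so this sketch assumes a prediction counts as correct when its Euclidean distance to the actual vector is within a tolerance. The tolerance value and all names are hypothetical, and the caller is assumed to have aligned each prediction with the actual vector it was meant to estimate.

```python
import math


def correct_answer_rate(predicted, actual, tolerance=0.1):
    """Fraction of frames whose predicted vector lies within tolerance of the actual one."""
    if not predicted:
        return 0.0
    hits = sum(1 for p, a in zip(predicted, actual) if math.dist(p, a) <= tolerance)
    return hits / len(predicted)


def select_network(predictions_by_network, actual, tolerance=0.1):
    """Return the name of the network with the highest rate, plus all rates."""
    rates = {name: correct_answer_rate(pred, actual, tolerance)
             for name, pred in predictions_by_network.items()}
    best = max(rates, key=rates.get)
    return best, rates
```

The returned name corresponds to the selection data 140: it tells the output stage which network's recognition result to pass through.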

[0048] The output control unit 40 then selects the recognition data 120 of the best-matched neural network unit 200, as designated by the selection data 140 input in this way, and outputs it as recognition result data 150.
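The overall data flow through units 20, 30, and 40 can be sketched in a few lines. Here each neural network unit is modeled as a hypothetical callable that takes the feature vector sequence and returns a pair (recognition result, degree of fit); the real units are the trained networks, and the scalar degree of fit stands in for the correct-answer rate computed by the selection unit.

```python
def recognize(feature_vectors, network_units):
    """Run every unit, pick the best-fitting one, and pass its result through."""
    # Speech recognition processing unit 20: every unit sees the same input
    outputs = {name: unit(feature_vectors) for name, unit in network_units.items()}
    # Selection unit 30: the unit with the highest degree of fit is chosen
    selected = max(outputs, key=lambda name: outputs[name][1])
    # Output control unit 40: only the selected unit's recognition result is output
    return selected, outputs[selected][0]
```

A design consequence worth noting is that the recognition results of the unselected units are simply discarded; no averaging or voting across units is described.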

[0049] In this way, the speech recognition device of the present invention can accurately recognize speech data 100 input from speakers whose voice patterns have different characteristics, such as men and women or adults and children, without being affected by those differences.

[0050] As shown in FIG. 2, each of the neural networks 200-1, 200-2, ..., 200-k may instead be formed so as to predict, from the feature vector 110 input from the feature extraction unit 10, either that feature vector 110 itself or some feature vector 110b that is input later (in the future) than the feature vector 110, and to output the predicted feature vector to the selection unit 30 as fitness determination data 130.

[0051] In this case too, the selection unit 30 need only be formed so as to compare the feature vector predicted by each of the neural networks 200-1, 200-2, ..., 200-k with the actual feature vector 110 input from the feature extraction unit 10 as the prediction target, and to calculate the correct-answer rate for each neural network as its degree of recognition fit.

[0052] The neural network unit 200 used in the present invention may be a conventional static neural network, such as one represented by a hierarchical model or a Markov model. However, to achieve better recognition with a simple configuration, it is preferable to use a dynamic neural network as detailed below.

[0053] Configuration of the Speech Recognition Neural Network

FIG. 3 shows, in simplified form, an example of a dynamic neural network used as the speech recognition neural network 200. The neural network 200 of the embodiment is configured by interconnecting a plurality of neurons 210-1, 210-2, ..., 210-6 that serve as its nerve cells. Each connection between neurons 210 is provided with a weight of variable magnitude. Training changes these weights to predetermined values so that accurate speech recognition is performed.

[0054] The feature vectors 110 of the speech data 100 are supplied to neurons 210-2 and 210-3, and the recognition result data 150 of the speech recognition processing is output from neurons 210-5 and 210-6. Neuron 210-5 outputs a negative output 158-B, and neuron 210-6 outputs an affirmative output 158-A. In addition, neuron 210-4 is configured to output the fitness determination data 130.

[0055] Configuration of the Neuron

FIG. 4 schematically shows the configuration of the neuron 210. The neuron 210 includes internal state value storage means 220 that stores a predetermined internal state value X, internal state value update means 240 that takes the internal state value X and the external input values Zj described below as inputs and updates the internal state value X held in the internal state value storage means 220, and output value generation means 260 that converts the internal state value X into an external output Y.
【００５６】このように、実施例に用いたニューラルネ
ットワーク２００では、ニューロン２１０の内部状態値
Ｘの値を、その値Ｘそのものを基にして順次更新してい
く。従って、そのニューロン２１０へ入力されるデータ
の過去の履歴が、その内部状態値Ｘとして変換、保存さ
れる。つまり、内部状態値Ｘとして、入力の時間的な履
歴が保存され、出力Ｙに反映される。この意味で、実施
例のニューロン２１０の動作はダイナミックなものであ
るといえる。したがって、静的なニューロンを用いたネ
ットワークと異なり、実施例のニューラルネットワーク
２００は、ニューラルネットワークの構造等によらず、
時系列データを処理することができ、全体の回路規模を
小さくできる。As described above, in the neural network 200 used in the embodiment, the value of the internal state value X of the neuron 210 is sequentially updated based on the value X itself. Therefore, the past history of the data input to the neuron 210 is converted and stored as the internal state value X. That is, the time history of the input is stored as the internal state value X and reflected in the output Y. In this sense, the operation of the neuron 210 of the embodiment can be said to be dynamic. Therefore, unlike the network using static neurons, the neural network 200 of the embodiment does not depend on the structure of the neural network,
The time series data can be processed, and the overall circuit scale can be reduced.

[0057]-[0058] FIG. 5 shows a specific example of the neuron 210. The internal state value storage means 220 includes a memory 222 that stores the internal state value X. The internal state value update means 240 includes accumulation means 242 for the inputs Zj, and a computation unit 244 that performs the computation given by Equation 9 to obtain a new internal state value X and updates the contents of the memory 222.

[Equation 9]

[0059] The output value generation means 260 includes a computation unit 262. The computation unit 262 is formed so as to convert the internal state value X stored in the memory 222 into a range-limited output value Y using a sigmoid (logistic) function or the like, and to output it.

[0060] Regarding the changes over time of the internal state value X and the output value Y, let Xcurr be the current internal state value, Xnext the updated internal state value, and Zj the external input values at the time of the update (j runs from 0 to n, where n is the number of external inputs to the neuron 210). Writing the operation of the internal state value update means 240 formally as a function G, we have Xnext = G(Xcurr, Z1, ..., Zi, ..., Zn). Various concrete forms of this expression are conceivable; for example, it can be given by Equation 9 above, which uses a first-order differential equation. Here τ is a constant.
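The update Xnext = G(Xcurr, Z1, ..., Zn) can be sketched numerically. Because the image of Equation 9 is not reproduced in this text, the sketch assumes the common leaky-integrator form τ·dX/dt = −X + ΣZj and discretizes it with a forward Euler step; that form, the step size `dt`, and the function name are assumptions, not the patent's stated equation.

```python
def next_state(x_curr, inputs, tau=5.0, dt=1.0):
    """One Euler step of the assumed update tau * dX/dt = -X + sum(Zj)."""
    total_input = sum(inputs)  # role of the accumulation means 242
    # Role of the computation unit 244: new value depends on the old value itself
    return x_curr + (dt / tau) * (-x_curr + total_input)
```

With `tau=2.0` and `dt=1.0`, feeding a constant input of 1.0 moves the state from 0.0 to 0.5 and then 0.75, and removing the input lets it decay to 0.375 rather than drop to zero at once. This lingering value is the sense in which the internal state preserves the history of past inputs.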

[0061]-[0062] As a slightly modified form of Equation 9, an expression such as the following Equation 10 is also possible.

[Equation 10]

[0063] Here, Wij denotes the coupling strength that couples the output of the j-th neuron to the input of the i-th neuron, Di denotes an external input value, and θi denotes a bias value. The bias value can also be regarded as a coupling to a fixed value and thereby included in Wij.

[0064]-[0065] Let X be the internal state of the neuron 210 at a given instant as determined in this way. Writing the operation of the output value generation means 260 formally as a function F, the output Y of the neuron 210 can be expressed as Y = F(X). One conceivable concrete form of F is a sigmoid (logistic) function with positive-negative symmetric output, as given by Equation 11.

[Equation 11]

[0066] This functional form is not essential, however; a simpler linear transformation, a threshold function, or the like is also conceivable.
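The three candidate forms of the output function F mentioned here can be sketched side by side. The image of Equation 11 is not reproduced in this text; the sigmoid below assumes the usual symmetric form Y = 2 / (1 + exp(−X)) − 1, and the gain, threshold, and the ±1 output levels of the threshold function are hypothetical choices.

```python
import math


def sigmoid_output(x):
    """Symmetric sigmoid in (-1, 1); tanh(x/2) is the overflow-safe equivalent
    of 2 / (1 + exp(-x)) - 1."""
    return math.tanh(x / 2.0)


def linear_output(x, gain=1.0):
    """Plain linear transformation; the output is not range-limited."""
    return gain * x


def threshold_output(x, threshold=0.0):
    """Two-level output: +1 at or above the threshold, otherwise -1."""
    return 1.0 if x >= threshold else -1.0
```

Only the sigmoid and threshold variants limit the range of Y, which matters when Y is fed back into other neurons through the coupling weights.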

[0067] Using these expressions, the time series of the output Y of the dynamic neuron 210 of the embodiment is computed by the processing shown in FIG. 6. In FIG. 6, neurons are referred to simply as nodes for brevity.

[0068] The inputs Zj to the neuron 210 include the neuron's own output multiplied by a weight, the outputs of other neurons multiplied by coupling weights, and external inputs from outside the neural network.
[0069] In the embodiment, as shown in FIG. 3, neurons 210-2 and 210-3 receive their own weighted outputs, the weighted outputs of other neurons, and the output 110 from the feature extraction unit 10. Neuron 210-1 receives its own weighted output and the weighted outputs of other neurons. Neurons 210-4, 210-5, and 210-6 likewise receive their own weighted outputs and the weighted outputs of other neurons. The output of neuron 210-4 is supplied to the selection unit 30, and the outputs of neurons 210-5 and 210-6 are supplied to the output control unit 40.

【００７０】内部状態量の初期値設定また、実施例の各ニューロン２１０は、内部状態記憶手
段２２０内に記憶された内部状態量Ｘを、前述したよう
に内部状態値更新手段２４０を用いて順次更新していく
ように構成されている。したがって、このようなニュー
ロン２１０を用いて構成されたニューラルネットワーク
２００では、動作に先立って予めその初期値を設定して
やることが必要となる。 Initial Value Setting of Internal State Quantity. Each neuron 210 of the embodiment is configured to successively update the internal state quantity X stored in the internal state storage means 220 by means of the internal state value updating means 240, as described above. A neural network 200 built from such neurons 210 therefore needs its initial values set before it starts operating.

【００７１】このため、図１に示すよう、実施例の音声
認識装置には、内部状態初期値設定部６０が設けられて
いる。そして、この内部状態初期値設定部６０は、ニュ
ーラルネットワーク２００が動作するに先立って、予め
定められた初期値を全てのニューロンに与えるよう形成
されている。すなわち、ニューラルネットワーク２００
の動作に先立って、全てのニューロン２１０に、適当に
選択された初期内部状態値Ｘをセットし、それに対応す
る出力Ｙをセットする。このようにして初期値をセット
することにより、ニューラルネットワークは速やかにス
タートすることになる。For this purpose, as shown in FIG. 1, the voice recognition apparatus of the embodiment is provided with an internal state initial value setting unit 60, which is configured to give predetermined initial values to all neurons before the neural network 200 starts operating. That is, before the neural network 200 operates, an appropriately selected initial internal state value X and the corresponding output Y are set in every neuron 210. Setting the initial values in this way lets the neural network start up quickly.

【００７２】ニューラルネットワークの学習次に、ニューラルネットワーク２００の音声認識処理の
学習方法について説明する。 Learning of the Neural Network. Next, the method of training the neural network 200 for speech recognition processing will be described.

【００７３】図７には、ニューラルネットワーク２００
を学習させるための学習装置３００の構成が示されてい
る。この学習装置３００は、図１に示す各ニューラルネ
ットワーク２００−１，２００−２……２００−ｋをそ
れぞれ異なる特徴の音声パターンで学習させるように形
成されている。FIG. 7 shows the configuration of a learning device 300 for training the neural network 200. The learning device 300 is configured to train each of the neural networks 200-1, 200-2, ..., 200-k shown in FIG. 1 with voice patterns having different characteristics.

【００７４】この学習装置３００は、学習用の入力音声
データが記憶された入力データ記憶部３１０と、入力音
声データに対応する模範となる出力データが記憶された
出力データ記憶部３１２と、学習させたい入力データを
選択する入力データ選択部３１４と、出力データを選択
する出力データ選択部３１６と、ニューラルネットワー
ク２００の学習を制御する学習制御部３１８とを含む。The learning device 300 includes an input data storage unit 310 that stores input voice data for learning, an output data storage unit 312 that stores model output data corresponding to the input voice data, an input data selection unit 314 that selects the input data to be learned, an output data selection unit 316 that selects the output data, and a learning control unit 318 that controls the learning of the neural network 200.

【００７５】そして、この学習装置３００による学習方
法を行う場合には、まず、学習対象となるニューラルネ
ットワーク２００を構成する全てのニューロン２１０
に、初期状態値Ｘをセットする。次に、学習させたい音
声データが、入力データ選択部３１０により選択され、
学習制御部３１８に入力される。このとき、選択した学
習用入力データに対応する学習用出力データが、出力デ
ータ選択部３１６により選択され、学習制御部３１８に
入力される。選択された学習用の入力音声データは、音
声抽出部１０に入力され、ここで抽出された特徴ベクト
ル１１０がニューラルネットワーク２００へ外部入力と
して入力される。全てのニューロン２１０についてそれ
ぞれ入力Ｚj の和を求め、その内部状態量Ｘが更新され
る。そして、更新されたＸによりニューロン２１０の出
力Ｙを求める。When training with the learning device 300, the initial state value X is first set in all the neurons 210 of the neural network 200 to be trained. Next, the voice data to be learned is selected by the input data selection unit 314 and input to the learning control unit 318. At the same time, the learning output data corresponding to the selected learning input data is selected by the output data selection unit 316 and input to the learning control unit 318. The selected input voice data for learning is input to the feature extraction unit 10, and the feature vector 110 extracted there is supplied to the neural network 200 as an external input. For every neuron 210, the sum of its inputs Zj is computed and its internal state quantity X is updated; the output Y of the neuron 210 is then obtained from the updated X.

【００７６】初期状態では、ニューラルネットワーク２
００の各ニューロン間の結合強度にはランダムな値が与
えられている。したがって、図３の各ニューロン２１０
−５，２１０−６から出力される認識結果１２０Ｂ，１
２０Ａはでたらめな値である。これらの出力が正しい値
となるように、少しだけ各ニューロン間の重みを変更す
る。In the initial state, the connection strengths between the neurons of the neural network 200 are given random values. The recognition results 120B and 120A output from the neurons 210-5 and 210-6 in FIG. 3 are therefore arbitrary. The weights between the neurons are changed slightly so that these outputs approach the correct values.

【００７７】学習対象となるニューラルネットワーク２
００は、認識対象となる音声データが入力された場合
に、図８に示すよう、ニューロン２１０−６から肯定出
力１２０Ａとしてハイレベルの信号が出力され、ニュー
ロン２１０−５から否定出力１２０Ｂとしてローレベル
の信号が出力されるよう学習を行う。このように、肯定
出力と否定出力の２種類の認識結果データ１２０Ａ，１
２０Ｂを出力させるのは、音声認識処理の精度を向上さ
せるためである。The neural network 200 to be trained learns so that, when the voice data to be recognized is input, a high-level signal is output from the neuron 210-6 as the positive output 120A and a low-level signal is output from the neuron 210-5 as the negative output 120B, as shown in FIG. 8. Two kinds of recognition result data, the positive output 120A and the negative output 120B, are output in order to improve the accuracy of the speech recognition processing.

【００７８】そして、認識させたい音声データ１００を
何回も繰返入力し、少しづつ各ニューロン間の重みを変
更する。これにより、次第にニューロン２１０−５，２
１０−６から正しい値が出力されるようになる。入力さ
れる音声データが認識させたくないデータを学習される
場合は、肯定出力１２０Ａがローレベル、否定出力がハ
イレベルとなるように各ニューロン間の重みを変更す
る。The voice data 100 to be recognized is then input repeatedly, many times, and the weights between the neurons are changed little by little. The neurons 210-5 and 210-6 thereby gradually come to output the correct values. When the input voice data is data that should not be recognized, the weights between the neurons are changed so that the positive output 120A goes to the low level and the negative output goes to the high level.

【００７９】ニューラルネットワーク２００の出力が収
束するまでの繰りかえし学習回数は、数千回程度であ
る。The neural network 200 requires on the order of several thousand training repetitions before its output converges.

【００８０】なお、学習方法の応用として、二つの音声
データを続けて入力し、学習させる方法がある。その理
由は、音声データを一つづつ用いた学習では、一度ハイ
レベルになった肯定出力はローレベルに下げることが出
来ず、また一度ローレベルになった否定出力はハイレベ
ルに上げることができないからである。つまり、音声デ
ータを一つづつ用いた学習では、図９（Ａ）に示すよう
に、認識させたい音声データ（以下真データという）を
与えて肯定出力をハイレベルに上昇させる学習（この場
合、否定出力はローレベルを保持している）、あるいは
図９（Ｂ）に示すよう、認識させたくないデータ（以
下、偽データという）を与えて否定出力をハイレベルに
上昇させる学習（この場合、肯定出力はローレベルを保
持している）が行われる。この学習では、肯定出力及び
否定出力とも、一旦ハイレベルに上昇した後は、その出
力値がローレベルになることはないという問題が生ず
る。As an application of the learning method, two pieces of voice data may be input in succession for training. The reason is that, when training with one piece of voice data at a time, a positive output that has once gone to the high level cannot be brought back down to the low level, and a negative output that has once gone to the low level cannot be raised to the high level. That is, training with one piece of voice data at a time consists either of giving voice data to be recognized (hereinafter, true data) and raising the positive output to the high level while the negative output stays at the low level, as shown in FIG. 9(A), or of giving data that should not be recognized (hereinafter, false data) and raising the negative output to the high level while the positive output stays at the low level, as shown in FIG. 9(B). With this training, once the positive or negative output has risen to the high level, its value never returns to the low level.

【００８１】したがって、真データと偽データが混在し
た複数の音声データが連続して与えられた場合、真デー
タの入力で一度ハイレベルに上がった肯定出力は、その
後、偽データの入力があってもローレベルに下がること
はない。これは否定出力についても同様である。Consequently, when several pieces of voice data mixing true data and false data are given in succession, a positive output that has once risen to the high level on input of true data does not fall to the low level even when false data is input afterwards. The same applies to the negative output.

【００８２】そこで、本実施例では、図１０（Ａ）〜
（Ｄ）に示すように、二つの音声データを連続して与
え、出力の上昇と下降の両方の学習を行わせる方法が取
られている。図１０（Ａ）では、真データと偽データを
連続して入力し、これを繰り返して学習させている。こ
の学習によって、肯定出力の上昇、否定出力の上昇と降
下が学べる。図１０（Ｂ）では、偽データと真データを
連続して入力し、これを繰り返して学習させている。こ
の学習によって、肯定出力の上昇と降下、否定出力の上
昇が学べる。図１０（Ｃ）では、偽データを連続して入
力し、これを繰り返して学習させている。この学習は、
図１０（Ｂ）に示した学習によって、偽データの次のデ
ータは真データであるといった誤った認識をニューラル
ネットワーク２００に持たせないためのものである。同
様に図１０（Ｄ）では、真データを二つ連続して入力
し、これを繰り返して学習させている。この学習も、図
１０（Ａ）に示した学習によって、真データの次のデー
タは偽データであるといった誤った認識をニューラルネ
ットワーク２００に持たせないためのものである。In this embodiment, therefore, as shown in FIGS. 10(A) to 10(D), two pieces of voice data are given in succession so that both the rise and the fall of the outputs are learned. In FIG. 10(A), true data and false data are input in succession, and this is repeated for training. This teaches the rise of the positive output and the rise and fall of the negative output. In FIG. 10(B), false data and true data are input in succession, and this is repeated. This teaches the rise and fall of the positive output and the rise of the negative output. In FIG. 10(C), two pieces of false data are input in succession, and this is repeated. This prevents the neural network 200 from wrongly learning, from the training of FIG. 10(B), that the data following false data is always true data. Similarly, in FIG. 10(D), two pieces of true data are input in succession, and this is repeated. This prevents the neural network 200 from wrongly learning, from the training of FIG. 10(A), that the data following true data is always false data.
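The four pairings of FIGS. 10(A) to 10(D) can be sketched as a small generator of two-utterance training sequences. This is only an illustration under stated assumptions: the names `targets` and `paired_sequences`, the numeric levels `HIGH` and `LOW`, and the idea of enumerating every ordered pair are hypothetical and are not taken from the patent, which does not say how the pairs are chosen.

```python
from itertools import product

HIGH, LOW = 1.0, -1.0  # assumed numeric values for "high level" and "low level"


def targets(is_true):
    """Return (positive_output, negative_output) target levels for one utterance."""
    return (HIGH, LOW) if is_true else (LOW, HIGH)


def paired_sequences(true_items, false_items):
    """Yield every ordered pair of utterances with their target levels.

    Covers all four cases: true->false (A), false->true (B),
    false->false (C), and true->true (D).
    """
    labelled = [(x, True) for x in true_items] + [(x, False) for x in false_items]
    for (a, a_true), (b, b_true) in product(labelled, repeat=2):
        yield [(a, targets(a_true)), (b, targets(b_true))]
```

Each yielded sequence would be presented to the network without resetting its internal state between the two utterances, so that the second utterance forces the outputs to move away from the levels reached on the first.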

【００８３】このような学習を、図１に示すニューラル
ネットワーク２００−１，２００−２……２００−ｋに
対し、それぞれ異なる特徴の音声パターンで行う。例え
ば各ニューラルネットワーク２００−１，２００−２…
…２００−ｋで、それぞれ「ビール」という音声データ
を認識させたい場合には、異なる特徴の音声パターンを
有する音声データ「ビール」を学習用音声データとして
用い、各ニューラルネットワーク２００−１，２００−
２……２００−ｋの学習を前述したように行わせる。こ
のような学習の結果、認識に最適な入力音声パターンが
各ニューラルネットワーク毎にそれぞれ設定される。し
たがって、同じ「ビール」という音声データ１００を与
えても、各ニューラルネットワーク毎にその認識率は異
なったものとなる。例えば、男性の音声でニューラルネ
ットワーク２００−１の学習を行い、女性の音声でニュ
ーラルネットワーク２００−２の学習を行った音声認識
装置では、別の男性の音声で入力データを与えた場合、
ニューラルネットワーク２００−１では高い確率で認識
できるが、ニューラルネットワーク２００−２ではほと
んど認識できない事態が生ずる。逆に、別の女性の音声
で入力データを与えた場合は、ニューラルネットワーク
２００−２での認識率は高くなり、ニューラルネットワ
ーク２００−１の認識率は低下する。This training is performed on the neural networks 200-1, 200-2, ..., 200-k shown in FIG. 1 using voice patterns with different characteristics for each. For example, to make each of the neural networks recognize the voice data "beer" (「ビール」), voice data for "beer" having different voice pattern characteristics is used as the learning voice data for each network, and each network is trained as described above. As a result, the input voice pattern best suited to recognition is set individually for each neural network. Thus, even when the same voice data 100 for "beer" is given, the recognition rate differs from network to network. For example, in a voice recognition device in which the neural network 200-1 was trained with a male voice and the neural network 200-2 with a female voice, input data spoken by a different man is recognized with high probability by the neural network 200-1 but hardly at all by the neural network 200-2. Conversely, for input data spoken by a different woman, the recognition rate of the neural network 200-2 is high and that of the neural network 200-1 is low.

【００８４】このように、本実施例は各ニューラルネッ
トワーク２００−１，２００−２……２００−ｋをそれ
ぞれ異なった特徴の人の音声で学習させるので、特徴抽
出部１０から同一の音声ベクトル１１０が各ニューラル
ネットワーク２００−１，２００−２……２００−ｋに
与えられても、音声認識結果１２０は各ニューラルネッ
トワーク毎にそれぞれ異なったものとなる。Thus, because this embodiment trains the neural networks 200-1, 200-2, ..., 200-k with the voices of people having different characteristics, the speech recognition result 120 differs from network to network even when the same voice vector 110 is supplied to each of them from the feature extraction unit 10.

【００８５】各ニューラルネットワーク２００−１，２
００−２……２００−ｋから出力される複数の音声認識
結果１２０の内、一番認識率の高い認識結果を採用する
ために、本実施例では、音声データとの認識適合度判別
用データ１３０が各ニューラルネットワーク２００−
１，２００−２……２００−ｋからそれぞれ出力される
ように工夫されている。In order to adopt, from among the plural speech recognition results 120 output from the neural networks 200-1, 200-2, ..., 200-k, the result with the highest recognition rate, this embodiment is arranged so that each neural network also outputs data 130 for judging its recognition fitness to the voice data.

【００８６】前述したように、認識適合度の判定処理と
は、入力された音声データと、学習で用いられた音声デ
ータとの適合度１３０を判定する処理をいう。この判定
処理は、入力された音声データに基づき、当該入力デー
タより時間的に前の音声データが推定できるようにニュ
ーラルネットワークを学習させて、推定の正解率を認識
適合度として求める処理をいう。As described above, the recognition fitness determination process determines the fitness 130 between the input voice data and the voice data used for training. In this process, the neural network is trained so that, from the input voice data, it can estimate the voice data that preceded that input in time, and the accuracy of the estimate is taken as the recognition fitness.

【００８７】例えば、図２において、ニューラルネット
ワーク２００に特徴ベクトル１１０が入力されると、こ
の入力データ１１０より時間的に一つ前（過去）に入力
された特徴ベクトル１１０ａが予測できるようにニュー
ラルネットワーク２００を学習させ、推定した特徴ベク
トルを適合度判定用データ１３０として選択部３０へ向
け出力させる。つまり、入力データの時間的な関係は、
話者の個性を反映したもので、推定しやすい話者の音声
データは、そのニューラルネットワークの学習で用いた
音声データと類似した個性である音韻または特徴である
音韻を持っているのである。For example, in FIG. 2, when the feature vector 110 is input to the neural network 200, the network is trained so that it can predict the feature vector 110a that was input one step earlier (in the past), and the estimated feature vector is output to the selection unit 30 as the fitness determination data 130. The temporal relationship of the input data reflects the individuality of the speaker, so voice data that is easy to estimate comes from a speaker whose phonemes are similar in individuality or characteristics to the voice data used to train that neural network.

【００８８】そこで、選択部３０は、ニューラルネット
ワーク２００−１，２００−２……２００−ｋから出力
される適合度判定データ１３０（推定された一つ前の特
徴ベクトル）を、特徴抽出部１０から実際に出力される
一つ前の特徴ベクトル１１０と照合し、各ニューラルネ
ットワーク毎に正解率を演算する。この正解率（認識適
合度）の最も高いニューラルネットワークの音声認識結
果の出力が最も正解であるといえるので、その出力を音
声認識装置の認識結果として採用する。The selection unit 30 therefore compares the fitness determination data 130 (the estimated preceding feature vector) output from each of the neural networks 200-1, 200-2, ..., 200-k with the preceding feature vector 110 actually output from the feature extraction unit 10, and computes a correct answer rate for each neural network. The speech recognition result of the neural network with the highest correct answer rate (recognition fitness) can be regarded as the most reliable, so that output is adopted as the recognition result of the voice recognition device.
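The comparison performed by the selection unit 30 can be sketched as follows. The patent does not state how the correct answer rate is computed, so this sketch substitutes the negative mean squared error between each network's estimate of the preceding frame and the frame actually observed; that measure and the names `fitness` and `select_network` are assumptions for illustration only.

```python
def fitness(predicted_prev, actual_frames):
    """Score how well a network estimated the preceding feature vectors.

    predicted_prev[t] is the network's estimate, made at time t, of
    actual_frames[t - 1]. Higher (closer to zero) is better.
    Assumed measure: negative mean squared error over frames 1..T-1.
    """
    errors = []
    for t in range(1, len(actual_frames)):
        pred, actual = predicted_prev[t], actual_frames[t - 1]
        errors.append(sum((p - a) ** 2 for p, a in zip(pred, actual)) / len(actual))
    return -sum(errors) / len(errors)


def select_network(predictions, actual_frames):
    """Return (index of best network, list of fitness scores)."""
    scores = [fitness(p, actual_frames) for p in predictions]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores
```

The recognition result 120 of the network at the returned index would then be the one passed on as the device's output. At least two frames are needed, since the first frame has no predecessor to compare against.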

【００８９】この認識適合度の判定処理の学習は、前述
した音声認識処理の学習と同時に行う。すなわち、ニュ
ーラルネットワーク２００を構成する要素の一つである
適合度出力ニューロン２１０−４が、入力ニューロン２
１０−２，２１０−３から前に入力された過去の特徴ベ
クトルを推定し、これを適合度判定用データ１３０とし
て出力するように、学習用の音声データを用いてニュー
トラルネットワーク２００を学習させればよい。The recognition fitness determination is trained at the same time as the speech recognition processing described above. That is, the neural network 200 is trained with the learning voice data so that the fitness output neuron 210-4, one of its constituent elements, estimates the past feature vector previously input to the input neurons 210-2 and 210-3 and outputs it as the fitness determination data 130.

【００９０】なお、認識適合度の判定処理は、このよう
に前のデータの予測以外に、図２に示すよう、入力され
た特徴ベクトル１１０そのものの推定データ、あるいは
次に入力される未来の特徴ベクトル１１０ｂの予測デー
タに基づいて行ってもよい。しかし、実験によれば、過
去の特徴ベクトルを予測させる方が、より高い精度で認
識動作を行うことができた。Besides predicting the preceding data in this way, the recognition fitness determination may also be based, as shown in FIG. 2, on estimated data for the input feature vector 110 itself or on predicted data for the future feature vector 110b to be input next. In experiments, however, predicting the past feature vector gave more accurate recognition.

【００９１】音声認識処理動作次に、このように構成されたニューラルネットワーク２
００の行う音声認識処理動作を、図１１のフローチャー
トに従って簡単に説明する。 Speech Recognition Processing Operation. Next, the speech recognition processing performed by the neural network 200 configured as above will be described briefly with reference to the flowchart of FIG. 11.

【００９２】まず、音声認識処理が開始されると、全て
のニューロン２１０−１，２１０−２……２１０−６
に、適当に選択された初期内部状態値Ｘがセットされ、
それに対応する出力Ｙがセットされる（ステップ１０
１）。First, when the speech recognition processing starts, an appropriately selected initial internal state value X and the corresponding output Y are set in all the neurons 210-1, 210-2, ..., 210-6 (step 101).

【００９３】次に、全てのニューロンについて、前述し
た入力データＺj の和が求められる（ステップ１０４，
１０３）。Next, the sum of the input data Zj described above is computed for every neuron (steps 104 and 103).

【００９４】次に、全てのニューロンのそれぞれについ
て、ステップ１０３で求めたＺj の和と、内部状態値Ｘ
とにより、Ｘの値を更新する（ステップ１０５）。そし
て、更新されたＸの値に基づいて、それぞれのニューロ
ンの出力値を計算する（ステップ１０６）。この計算を
した後、処理をステップ１０２に戻し、処理終了の指令
があれば終了する。Next, for every neuron, the value of X is updated from the sum of Zj obtained in step 103 and the current internal state value X (step 105). The output value of each neuron is then calculated from the updated value of X (step 106). After this calculation the processing returns to step 102, and it ends if an end command has been given.
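The loop of steps 101 to 106 can be sketched in a few lines of Python. The update rule is not reproduced in this part of the text, so the sketch assumes a simple leaky update X ← X + rate·(ΣZj − X) and the output function Y = tanh(X/2); `run_network`, `input_map`, `rate`, and `x0` are hypothetical names, and the end-of-processing command of step 102 is modeled simply as running out of input frames.

```python
import math


def run_network(weights, frames, input_map, x0=0.0, rate=0.5):
    """Run a fully recurrent network over a sequence of feature frames.

    weights[i][j] : connection weight from neuron j's output to neuron i
    frames        : list of feature vectors (one per time step)
    input_map     : {neuron index: feature dimension} for external inputs
    Returns the list of output vectors Y, one per frame.
    """
    n = len(weights)
    x = [x0] * n                              # step 101: initial internal states
    y = [math.tanh(v / 2.0) for v in x]       # step 101: corresponding outputs
    history = []
    for frame in frames:                      # step 102: stop when input ends
        ext = [0.0] * n
        for neuron, dim in input_map.items():
            ext[neuron] = frame[dim]
        # steps 103-104: sum of weighted outputs plus external input
        z = [sum(weights[i][j] * y[j] for j in range(n)) + ext[i] for i in range(n)]
        # step 105: update internal state (assumed leaky update)
        x = [xi + rate * (zi - xi) for xi, zi in zip(x, z)]
        # step 106: compute outputs from the updated state
        y = [math.tanh(xi / 2.0) for xi in x]
        history.append(list(y))
    return history
```

In the six-neuron arrangement of FIG. 3, the entries of each returned vector that correspond to the neurons 210-5 and 210-6 would be read as the negative and positive outputs, and the entry for the neuron 210-4 as the fitness determination output.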

【００９５】ニューラルネットワーク２００の認識結果
は、ニューロン２１０−５，２１０−６の出力として与
えられる。また、適合度判定用の出力１３０は、ニュー
ロン２１０−４の出力として与えられる。The recognition result of the neural network 200 is given as the output of the neurons 210-5 and 210-6. Further, the output 130 for determining the fitness is given as the output of the neuron 210-4.

【００９６】図１２、図１３、図１４には、実施例の音
声認識装置を用いて、実際に音声認識動作を行った場合
の実験データが示されている。この実験では、ニューラ
ルネットワーク２００−１，２００−２を、それぞれ入
力ニューロン数が２０、出力ニューロン数が２、その他
のニューロン数が３２のニューラルネットワークとして
構成したものを用いた。そして、特徴抽出部１０から２
０次元のＬＰＣケプストラムを各ニューラルネットワー
ク２００−１，２００−２に与え、このときニューラル
ネットワーク２００−１，２００−２から出力されるデ
ータを実測した。FIGS. 12, 13, and 14 show experimental data obtained when speech recognition was actually performed with the voice recognition device of the embodiment. In this experiment, the neural networks 200-1 and 200-2 were each configured with 20 input neurons, 2 output neurons, and 32 other neurons. A 20-dimensional LPC cepstrum was supplied from the feature extraction unit 10 to each of the neural networks 200-1 and 200-2, and the data they output were measured.

【００９７】図１２（Ａ），図１３（Ａ），図１４
（Ａ）に、ニューラルネットワーク２００−１の肯定出
力４１０と否定出力４１２とを示す。また、図１２
（Ｂ），図１３（Ｂ），図１４（Ｂ）に、ニューラルネ
ットワーク２００−２の肯定出力４２０と否定出力４２
２とを示す。さらに図１２（Ｃ），図１３（Ｃ），図１
４（Ｃ）に、入力された音声データとニューラルネット
ワーク２００−１の適合度４３０と、入力された音声デ
ータとニューラルネットワーク２００−２の適合度４３
２とを示す。FIGS. 12(A), 13(A), and 14(A) show the positive output 410 and the negative output 412 of the neural network 200-1. FIGS. 12(B), 13(B), and 14(B) show the positive output 420 and the negative output 422 of the neural network 200-2. FIGS. 12(C), 13(C), and 14(C) show the fitness 430 between the input voice data and the neural network 200-1 and the fitness 432 between the input voice data and the neural network 200-2.

【００９８】この実験では、音韻グループの異なる二人
の話者Ａ，Ｂを用意し、ニューラルネットワーク２００
−１を話者Ａの音声で、ニューラルネットワーク２００
−２を話者Ｂの音声で学習させた。各ニューラルネット
ワーク２００−１，２００−２は、それぞれ肯定的な認
識対象として、「とりあえず」を与え、否定的な認識対
象として「終点」，「腕前」，「拒絶」，「超越」，
「分類」，「ロッカー」，「山脈」，「隠れピューリタ
ン」の８つの単語を与えた。各ニューラルネットワーク
２００−１，２００−２は、肯定的認識対象が与えられ
た場合、その対象の半分までが認識された時点で肯定出
力、否定出力が変化するように、それぞれ話者Ａ、話者
Ｂの音声で学習させてある。同図での縦軸は、出力ニュ
ーロンの出力値を、横軸は左から右へ時間の流れを表
す。In this experiment, two speakers A and B belonging to different phoneme groups were prepared; the neural network 200-1 was trained with the voice of speaker A and the neural network 200-2 with the voice of speaker B. Each network was given the word 「とりあえず」 ("for the time being") as the positive recognition target and the eight words 「終点」 ("end point"), 「腕前」 ("skill"), 「拒絶」 ("rejection"), 「超越」 ("transcendence"), 「分類」 ("classification"), 「ロッカー」 ("locker"), 「山脈」 ("mountain range"), and 「隠れピューリタン」 ("hidden Puritan") as negative recognition targets. Each network was trained, with the voice of speaker A or speaker B respectively, so that when the positive recognition target is given, the positive and negative outputs change once half of the target has been recognized. In the figures, the vertical axis is the output value of the output neuron, and the horizontal axis is time, running from left to right.

【００９９】ここにおいて、図１２の実験データは、こ
のようにして学習された音声認識装置に話者Ａの音声デ
ータを認識させた場合の結果である。図１２（Ａ）から
明らかなように、話者Ａの音声で学習したニューラルネ
ットワーク２００−１は、単語「とりあえず」の入力に
対し、その肯定出力４１０が大きな値に変化している。
また、その否定出力４１２は小さな値に変化している。
これに対し、図１２（Ｂ）に示すよう、別の話者の音声
で学習された他のニューラルネットワーク２００−２の
肯定出力４２０、否定出力４２２は、単語「とりあえ
ず」の入力に対しては大きく変化していない。このこと
により、ニューラルネットワーク２００−１は、単語
「とりあえず」を正しく識別しているが、ニューラルネ
ットワーク２００−２は識別できていないことがわか
る。これは、図１２（Ｃ）の認識適合度の判定結果を示
すグラフから明らかである。ニューラルネットワーク２
００−１の適合度４３０の値の方が、他のニューラルネ
ットワーク２００−２の適合度４３２に比べて常に大き
な値を示しているからである。FIG. 12 shows the results obtained when the voice recognition device trained in this way was made to recognize the voice data of speaker A. As is clear from FIG. 12(A), the positive output 410 of the neural network 200-1, trained with the voice of speaker A, changes to a large value in response to the word "for the time being", and its negative output 412 changes to a small value. In contrast, as shown in FIG. 12(B), the positive output 420 and the negative output 422 of the other neural network 200-2, trained with the voice of a different speaker, do not change significantly in response to the word "for the time being". This shows that the neural network 200-1 correctly identifies the word "for the time being" while the neural network 200-2 does not. The same conclusion follows from the graph of the recognition fitness in FIG. 12(C), where the fitness 430 of the neural network 200-1 is always larger than the fitness 432 of the neural network 200-2.

【０１００】以上の結果より、認識適合度の判定結果に
基づいてニュラルネットワーク２００−１の音声認識結
果を採用すれば、単語「とりあえず」を正しく認識した
肯定出力および否定出力が得られることが理解されよ
う。From these results it can be understood that, if the speech recognition result of the neural network 200-1 is adopted on the basis of the recognition fitness determination, positive and negative outputs that correctly recognize the word "for the time being" are obtained.

【０１０１】これに対し、図１３は、実施例の音声認識
装置に、話者Ｂが入力した音声データを同様にして認識
させた場合に得られるデータである。On the other hand, FIG. 13 shows data obtained when the voice recognition apparatus of the embodiment recognizes voice data input by the speaker B in the same manner.

【０１０２】図１３（Ａ）に示すよう、別の話者Ａで学
習されたニューラルネットワーク２００−１は、話者Ｂ
の入力した単語「とりあえず」を正確に認識できない。
これに対し、話者Ｂの音声を学習に用いた他方のニュー
ラルネットワーク２００−２は、話者Ｂの入力する単語
「とりあえず」を正確に認識できている。これは図１３
（Ｃ）に示す、認識適合度の判定結果を示すグラフから
明らかである。As shown in FIG. 13(A), the neural network 200-1, trained with a different speaker A, cannot accurately recognize the word "for the time being" spoken by speaker B. In contrast, the other neural network 200-2, trained with the voice of speaker B, accurately recognizes the word "for the time being" spoken by speaker B. This is clear from the graph of the recognition fitness determination shown in FIG. 13(C).

【０１０３】この例でも選択部３０での認識適合度の判
定結果に基づき、ニューラルネットワーク２００−２の
認識結果を採用すれば、正しく認識した出力が得られる
ことが分かる。In this example too, a correctly recognized output is obtained if the recognition result of the neural network 200-2 is adopted on the basis of the recognition fitness determination in the selection unit 30.

【０１０４】図１４には、図１２，図１３と同様な処理
を、音質の異なる別の話者Ｃによる音声データを用いて
行った場合のデータである。FIG. 14 shows data obtained when the same processing as that shown in FIGS. 12 and 13 is performed using voice data of another speaker C having different sound quality.

【０１０５】図１４（Ａ），（Ｂ）から明らかなよう
に、話者Ｃが入力した音声データに対し、ニューラルネ
ットワーク２００−１では単語「とりあえず」を正しく
認識できている。これに対し、ニューラルネットワーク
２００−２では、単語「とりあえず」は正しく認識でき
ているものの、別の単語「拒絶」を単語「とりあえず」
と誤って認識している。これは、図１４（Ｃ）の認識適
合度の判定結果を示すグラフから明らかである。この例
でも選択部３０での認識適合度の判定結果に基づいて、
ニューラルネットワーク２００−１の認識結果を採用す
れば、正しく認識した出力が得られることが理解されよ
う。As is clear from FIGS. 14(A) and 14(B), for the voice data input by speaker C, the neural network 200-1 correctly recognizes the word "for the time being". The neural network 200-2 also recognizes the word "for the time being" correctly, but it misrecognizes another word, "rejection", as "for the time being". This is clear from the graph of the recognition fitness determination in FIG. 14(C). In this example too, a correctly recognized output is obtained if the recognition result of the neural network 200-1 is adopted on the basis of the recognition fitness determination in the selection unit 30.

【０１０６】図１５は、実施例の音声認識装置のハード
ウエア構成図である。実施例の音声認識装置は、特徴抽
出部１０として機能するアナログデジタルコンバータ７
０と、ニューラルネットワーク２００の内部状態値Ｘ等
のデータが格納されたデータメモリ７２と、ＣＰＵ７６
と、ＣＰＵ７６を選択部３０あるいは出力制御部４０と
して機能させるための処理プログラムが格納された認識
処理プログラムメモリ７４とを含んで構成されている。FIG. 15 is a hardware configuration diagram of the voice recognition device of the embodiment. The device comprises an analog-to-digital converter 70 that functions as the feature extraction unit 10, a data memory 72 that stores data such as the internal state values X of the neural network 200, a CPU 76, and a recognition processing program memory 74 that stores a processing program for making the CPU 76 function as the selection unit 30 or the output control unit 40.

【０１０７】他の実施例なお、本発明は前記実施例に限定されるものではなく、
本発明の要旨の範囲内で各種の変型実施が可能である。 Other Embodiments. The present invention is not limited to the embodiment described above, and various modifications are possible within the scope of the gist of the invention.

【０１０８】他のニューロンの実施例例えば、前記実施例では、ニューラルネットワーク２０
０を構成するニューロン２１０を、図５に示すような構
成のニューロンとして形成する場合を例にとり説明した
が、本発明はこれ以外にも各種ニューロンを用いること
ができる。 Other Neuron Embodiments. For example, the embodiment above described the case where the neurons 210 forming the neural network 200 are configured as shown in FIG. 5, but the present invention can use various other neurons.

【０１０９】図１６には、本発明のニューラルネットワ
ーク２００に用いられる他のダイナミックニューロン２
１０の具体例が示されている。FIG. 16 shows a specific example of another dynamic neuron 210 that can be used in the neural network 200 of the present invention.

【０１１０】実施例のダイナミックニューロン２１０に
おいて、内部状態更新手段２４０は、積算部２５０と、
関数変換部２５２と、演算部２５４とを用いて構成さ
れ、次式に基づく演算を行い、メモリ２２２の内部状態
量Ｘを更新するように形成されている。In the dynamic neuron 210 of this embodiment, the internal state updating means 240 comprises an integrating unit 250, a function conversion unit 252, and a calculation unit 254; it performs the operation given by the following equation and updates the internal state quantity X in the memory 222.

【０１１１】[0111]

【数１２】 [Equation 12]

【０１１２】すなわち、積算部２５０は、入力Ｚj を積
算し、関数部２５２は、この積算した値をシグモイド
（ロジスティック）関数Ｓを用いて変換するように構成
されている。そして、演算部２５４は、関数変換された
値と、メモリ２２２の内部状態量Ｘとに基づき、前記数
１２の演算を行い、新たな内部状態量Ｘを求め、メモリ
２２２の値を更新するように形成されている。That is, the integrating unit 250 accumulates (sums) the inputs Zj, and the function conversion unit 252 converts the summed value using a sigmoid (logistic) function S. The calculation unit 254 then performs the operation of Equation 12 on the converted value and the internal state quantity X in the memory 222 to obtain a new internal state quantity X, and updates the value in the memory 222.

【０１１３】また、より具体的な演算としては、次式に
示すような演算を実行するようにしてもよい。As a more specific operation, the operation shown in the following equation may be executed.

【０１１４】[0114]

【数１３】 [Equation 13]

【０１１５】この中で、Ｗｉｊはｊ番目のニューロンの
出力を、ｉ番目のニューロンの入力へ結合する結合強度
を表す。Ｄｉは外部入力値を示す。またθｉはバイアス
値を示す。このバイアス値は、固定された値との結合と
してＷｉｊの中に含めて考えることも可能である。ま
た、値域制限関数Ｓの具体的な形としては、正負対称出
力のシグモイド関数等を用いればよい。Here, Wij is the connection strength coupling the output of the j-th neuron to the input of the i-th neuron, Di is the external input value, and θi is the bias value. The bias can also be treated as part of Wij, as a connection to a fixed value. As a concrete form of the range-limiting function S, a sigmoid function with positive/negative symmetric output or the like may be used.
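Equations 12 and 13 are not reproduced in this text, so the following Python sketch is only one plausible reading of the description: the weighted outputs, external input Di, and bias θi are summed, passed through the range-limiting function S, and then combined with the current state X by a first-order (leaky) update, with the output Y taken as a constant multiple of X. The update form, the time constant `tau`, the step `dt`, the `gain`, and the function names are all assumptions and should not be read as the patent's actual equations.

```python
import math


def s(u):
    """Range-limiting function S: sigmoid with symmetric output (assumed tanh(u/2))."""
    return math.tanh(u / 2.0)


def update_states(x, w, d, theta, tau=4.0, dt=1.0, gain=1.0):
    """One assumed update step for all dynamic neurons.

    x     : current internal states X_i
    w     : w[i][j] couples the output of neuron j to the input of neuron i
    d     : external inputs D_i
    theta : bias values theta_i
    Returns (new internal states, new outputs Y_i = gain * X_i).
    """
    y = [gain * xi for xi in x]                    # output is a constant multiple of X
    new_x = []
    for i, xi in enumerate(x):
        drive = sum(w[i][j] * y[j] for j in range(len(x))) + d[i] + theta[i]
        new_x.append(xi + (dt / tau) * (-xi + s(drive)))  # leaky step toward S(drive)
    return new_x, [gain * v for v in new_x]
```

Because S is bounded, a state that starts inside [-1, 1] stays there under this update whenever dt ≤ tau, which is one reason a range-limiting function is convenient inside the state update rather than only at the output.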

【０１１６】出力生成手段２６０は、内部状態値Ｘを定
数倍した出力値Ｙへ変換する写像関数演算部２６４とし
て形成されている。The output generation means 260 is formed as a mapping function calculation unit 264 that converts the internal state value X into an output value Y equal to a constant multiple of X.

【０１１７】また、前記各実施例では音声データとして
単語等の認識を行う場合を例にとり説明したが、本発明
はこれに限らず、各種の音素や音節等の認識を行うよう
形成することも可能である。The embodiments above were described for the case of recognizing words or the like as voice data, but the present invention is not limited to this and may also be configured to recognize various phonemes, syllables, and so on.

【０１１８】[0118]

【０１１９】話者認識型の音声認識装置の実施例図１７には、話者認識型の音声認識装置の好適な実施例
が示されている。なお、前述した実施例と対応する部材
には同一符号を付してその説明は省略する。 Embodiment of a Speaker Recognition Type Voice Recognition Device. FIG. 17 shows a preferred embodiment of a speaker recognition type voice recognition device. Members corresponding to those of the embodiment described above are given the same reference numerals, and their description is omitted.

【０１２０】ここにおいて、音声認識処理部２０は、異
なる話者を認識対象とする複数のニューラルネットワー
ク２００−１，２００−２・・・２００−ｋを含む。各
ニューラルネットワーク２００は、認識対象話者の特徴
ベクトル１１０に基づき、入力される認識対象者の話者
ベクトル１００を予測し、音声認識の適合度を表す適合
度判定用データ１３０として出力するよう予め学習され
ている（学習の詳細は、前記実施例と同様である）。こ
こで用いた話者の特徴量は、８次のＰＡＲＣＯＲ係数で
ある。話者特徴量としてはＰＡＲＣＯＲ係数の他にも、
種々のものを使用することが可能である。しかし、ＰＡ
ＲＣＯＲ係数は、その値が原理的に−１〜１の値にある
こと、また、比較的話者に依存する割合が高い等の特徴
があり、話者認識においては有効な特徴量である。Here, the voice recognition processing unit 20 includes a plurality of neural networks 200-1, 200-2, ..., 200-k, each having a different speaker as its recognition target. Each neural network 200 is trained in advance to predict, from the feature vector 110 of its target speaker, the speaker vector 100 of the person being recognized, and to output it as the fitness determination data 130 representing the fitness of speech recognition (the details of training are the same as in the embodiment above). The speaker feature used here is the 8th-order PARCOR coefficients. Various other features could be used, but PARCOR coefficients are effective for speaker recognition because their values lie, in principle, between -1 and 1 and because they depend relatively strongly on the speaker.

【０１２１】そして、話者認識部９０は、各ニューラル
ネットワーク２００−１，２００−２，・・・２００−
ｋから入力される適合度判定用データ１３０と、特徴抽
出部１０から入力される実際の話者の特徴ベクトル１０
０との正解率を各ニューラルネットワーク毎に演算し、
最も正解率の高いニューラルネットワーク２００を選択
する。そして、選択されたニューラルネットワークの正
解率が、所定基準レベル以上の場合に、入力された音声
データ１００が、選択されたニューラルネットワーク２
００の学習に用いた話者であると判断し、これを認識結
果１５０として出力する。例えば、話者Ａを認識対象と
するニューラルネットワーク２００−１が選択された場
合には、入力された音声データ１００が話者が話者Ａで
あると認識し、これを認識結果１５０として出力するこ
とになる。The speaker recognition unit 90 computes, for each neural network, the correct answer rate between the fitness determination data 130 input from the neural networks 200-1, 200-2, ..., 200-k and the actual speaker feature vector 100 input from the feature extraction unit 10, and selects the neural network 200 with the highest correct answer rate. If the correct answer rate of the selected network is at or above a predetermined reference level, the unit judges that the input voice data 100 comes from the speaker used to train the selected neural network 200 and outputs this as the recognition result 150. For example, if the neural network 200-1, whose recognition target is speaker A, is selected, the speaker of the input voice data 100 is recognized as speaker A, and this is output as the recognition result 150.

【０１２２】If the correct answer rate of the selected neural network 200 is at or below the predetermined reference, the unit judges that the speaker is not the recognition target of any of the neural networks 200-1, 200-2, ..., 200-k, and outputs the recognition result 150 accordingly.
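The selection and rejection logic described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation. The text does not define how the correct answer rate is computed, so it is assumed here to be the fraction of frames whose predicted vector matches the actual vector within a tolerance. The names `correct_rate`, `recognize_speaker`, `tol` and `reference_level` are hypothetical.

```python
from typing import Dict, Optional, Sequence

Vector = Sequence[float]


def correct_rate(predicted: Sequence[Vector], actual: Sequence[Vector],
                 tol: float = 0.1) -> float:
    """Assumed definition: fraction of frames whose predicted vector is
    within `tol` of the actual vector in every component."""
    if not actual:
        return 0.0
    hits = sum(
        all(abs(p - a) <= tol for p, a in zip(pv, av))
        for pv, av in zip(predicted, actual)
    )
    return hits / len(actual)


def recognize_speaker(predictions: Dict[str, Sequence[Vector]],
                      actual: Sequence[Vector],
                      reference_level: float = 0.8,
                      tol: float = 0.1) -> Optional[str]:
    """Pick the network with the highest correct answer rate.

    Returns the speaker that network was trained on, or None when even the
    best rate falls below the reference level (no registered speaker).
    """
    rates = {spk: correct_rate(pred, actual, tol)
             for spk, pred in predictions.items()}
    if not rates:
        return None
    best = max(rates, key=rates.get)
    return best if rates[best] >= reference_level else None
```

In this sketch a rate exactly equal to the reference level counts as accepted, following the "at or above" wording of paragraph 0121.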

【０１２３】In addition to this speaker recognition operation, the speaker recognition unit 90 may also be configured to recognize the voice data itself, as in the embodiment shown in FIG. 1. In that case, the speaker recognition unit 90 is configured to include the selection unit 30 and the output control unit 40.

【０１２４】The selection unit 30 computes the correct answer rate for each of the neural networks 200-1, 200-2, ..., 200-k and outputs it to the output control unit 40.

【０１２５】The output control unit 40 performs speaker recognition for each input voice data 100 based on the correct answer rate input for each neural network. Furthermore, when a speaker to be recognized is present, it is configured to output the voice recognition data 120 output from the selected neural network 200 as the recognition result 150.

【０１２６】In this way, the device can recognize not only the speaker but also, at the same time, the voice data of the recognized speaker, which further broadens its field of application as a voice recognition device.

【０１２７】Next, the actual voice recognition operation using the voice recognition device of FIG. 17 is described in detail. In this embodiment, nine words were used as the standard data for training the neural networks: 「終点」 ("end point"), 「腕前」 ("skill"), 「拒絶」 ("rejection"), 「超越」 ("transcendence"), 「とりあえず」 ("for the time being"), 「分類」 ("classification"), 「ロッカー」 ("locker"), 「山脈」 ("mountain range"), and 「隠れピューリタン」 ("hidden Puritan"). The voice data were taken from ATR's Japanese speech database for research use.

【０１２８】FIGS. 18 and 19 show experimental results of speaker recognition by the neural networks 200 trained in this way. In this experiment, speaker recognition uses the error between the feature vector predicted by the neural network and the actual feature vector, instead of the correct answer rate between them.

【０１２９】The solid line in each figure shows the change over time of the output error of the neural network trained to recognize the voice of speaker MAU. The broken line shows the change over time of the output error of the neural network trained to recognize the voice of speaker MXM. The error shown is the absolute value of the length of the error vector, obtained by comparing the 8th-order input vector data with the output vector, averaged over the 32 frames before and after the frame at each point in time. The input speaker is MAU in FIG. 18 and MXM in FIG. 19.
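The error measure described above can be sketched as follows. The text does not say whether "32 frames before and after" means 32 frames on each side or 32 in total, so the sketch assumes 32 on each side and truncates the window at the ends of the utterance. The function names and the `half_window` parameter are illustrative.

```python
import math


def error_norms(inputs, outputs):
    """Per-frame Euclidean length of the error vector (output - input)."""
    return [math.sqrt(sum((o - i) ** 2 for i, o in zip(x, y)))
            for x, y in zip(inputs, outputs)]


def smoothed_error(norms, half_window=32):
    """Average each frame's error with up to `half_window` frames on each
    side (assumed interpretation); windows are truncated at the edges."""
    n = len(norms)
    smoothed = []
    for t in range(n):
        lo = max(0, t - half_window)
        hi = min(n, t + half_window + 1)
        smoothed.append(sum(norms[lo:hi]) / (hi - lo))
    return smoothed
```

For example, `smoothed_error(error_norms(x, y))` would give one curve of the kind plotted in FIGS. 18 and 19, with `x` the sequence of 8th-order input vectors and `y` the sequence of vectors predicted by one network.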

【０１３０】As the figures make clear, in FIG. 18 the data restoration error of the neural network trained on MAU's voice is small, while that of the neural network trained on MXM's voice is large. This shows that restoration using MAU's speech characteristics is the more accurate one, which in turn indicates that the input voice is MAU's.

【０１３１】In FIG. 19, conversely, the data restoration error of the neural network trained on MXM's voice is small, indicating that this input voice is MXM's.

【０１３２】As FIGS. 18 and 19 make clear, the speaker recognition method of the present invention yields speaker recognition results continuously over time.
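One way to read "continuous" recognition results is as a per-frame decision: at each frame, the speaker whose network currently has the smallest smoothed error is taken as the speaker. The sketch below shows that reading only; the function name and the error values in the usage note are hypothetical and are not taken from FIGS. 18 and 19.

```python
def label_frames(error_tracks):
    """Return, for each frame, the speaker whose network has the smallest error.

    error_tracks maps a speaker name to that network's per-frame error curve.
    Tracks of unequal length are compared only over their common frames.
    """
    speakers = list(error_tracks)
    if not speakers:
        return []
    n = min(len(error_tracks[s]) for s in speakers)
    return [min(speakers, key=lambda s: error_tracks[s][t]) for t in range(n)]
```

With two made-up curves such as `{"MAU": [0.1, 0.2, 0.9], "MXM": [0.8, 0.7, 0.3]}`, the first two frames are attributed to MAU and the last to MXM.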

【０１３３】Table 1 below shows the average error when the voices of 11 speakers in total, including 9 speakers other than the training speakers, were input to the two neural networks of the example above. The input consisted of the same 9 words used for training. The average was taken over the entire utterance interval. As Table 1 shows, for each neural network the error for its training speaker is the smallest among the 11 input voices, which indicates that the training speaker is correctly identified from among the 11 speakers.

【０１３４】[0134]

【表１】 [Table 1]

【０１３５】Table 2 below shows results of the same kind as Table 1, except that the input words differ in content from the words used for training. The words used here are 「カレンダー」 ("calendar"), 「いらっしゃる」 (irassharu, an honorific verb meaning "to be", "to come" or "to go"), 「極端」 ("extreme"), 「駐車」 ("parking"), 「プログラム」 ("program"), 「録音」 ("recording"), 「購入」 ("purchase"), and 「タイピュータ」 ("typuter").

【０１３６】[0136]

【表２】 [Table 2]

【０１３７】As the table above shows, the speaker recognition method of the present invention correctly identifies the training speaker even when the content of the input utterance differs.

【０１３８】Although the above description has dealt with the discrete-time case, the invention is also applicable to continuous-time processing, for example by using analog processing.

【０１３９】[0139]

[Effects of the Invention] As described above, according to the inventions of claims 1 to 13, even when a plurality of voice data having different voice patterns are input, recognition is performed by the speech recognition neural network unit with the highest fitness. This yields a voice recognition device whose recognition rate is not affected by the voice pattern of the voice data, such as sound quality or phonemes.

【０１４０】In particular, using dynamic neurons whose internal state quantity changes over time as the neurons constituting the speech recognition neural network unit simplifies the overall configuration of the neural network unit and improves its recognition accuracy.

【０１４１】[0141]

[Brief description of drawings]

【図１】FIG. 1 is a block diagram showing an embodiment of the voice recognition device of the present invention.

【図２】FIG. 2 is an explanatory diagram showing the conversion process in the feature extraction unit shown in FIG. 1.

【図３】FIG. 3 is a conceptual diagram showing the configuration of the neural network unit of the embodiment.

【図４】FIG. 4 is an explanatory diagram of a neuron constituting the neural network unit of the embodiment.

【図５】FIG. 5 is an explanatory diagram showing a specific configuration of the neuron shown in FIG. 4.

【図６】FIG. 6 is a flowchart showing the operation of the neuron of the embodiment.

【図７】FIG. 7 is an explanatory diagram of the learning device used to train the neural network unit of the embodiment.

【図８】FIG. 8 is an explanatory diagram showing an example of a learning method.

【図９】FIG. 9 is an explanatory diagram showing an example of a learning method.

【図１０】FIG. 10 is an explanatory diagram showing an example of a learning method.

【図１１】FIG. 11 is a flowchart showing the voice recognition processing operation.

【図１２】FIG. 12 is an explanatory diagram showing an output example of the voice recognition processing.

【図１３】FIG. 13 is an explanatory diagram showing an output example of the voice recognition processing.

【図１４】FIG. 14 is an explanatory diagram showing an output example of the voice recognition processing.

【図１５】FIG. 15 is a hardware configuration diagram of the embodiment.

【図１６】FIG. 16 is an explanatory diagram of another specific example of the dynamic neuron used in the embodiment.

【図１７】FIG. 17 is a block diagram of a voice recognition device used for speaker recognition.

【図１８】FIG. 18 is a diagram showing a speaker recognition result obtained with the voice recognition device of the embodiment.

【図１９】FIG. 19 is a diagram showing a speaker recognition result obtained with the voice recognition device of the embodiment.

[Explanation of symbols]

10 Feature extraction unit
20 Speech recognition processing unit
30 Selection unit
40 Output control unit
100 Voice data
110 Feature vector
120 Recognition data
130 Fitness determination data
140 Selection data
150 Recognition output
200 Neural network
210 Neuron
220 Internal state value storage means
240 Internal state value storage updating means
260 Output value generation means

Continuation of front page

(56) References
JP-A-4-328799 (JP, A)
JP-A-6-309464 (JP, A)
JP-A-6-110500 (JP, A)
Kaoru Nakano, "Fundamentals of Neurocomputers", Japan, Corona Publishing Co., Ltd., April 5, 1990, first edition, pp. 44-49, 115-122, ISBN: 4-339-02276-4
Masaaki Sato et al., "Learning of Voice Fluctuation by Recurrent Net", IEICE Technical Report, Japan, The Institute of Electronics, Information and Communication Engineers, March 19, 1991, Vol. 90, No. 484 (NC90-112 to 141), pp. 169-174
Takao Watanabe, "Application of Neural Networks to Speech Processing", Systems, Control and Information, Japan, The Institute of Systems, Control and Information Engineers, January 15, 1991, Vol. 35, No. 1, pp. 19-25, JST material number: G0902A

(58) Fields searched (Int. Cl.7, DB name)
G06N 1/00 - 7/08
G10L 3/00
G10L 3/02
G10L 5/06
G10L 7/08
G10L 9/00 - 9/02
G10L 9/06 - 9/20
JST file (JOIS)
CSDB (Japan Patent Office)

Claims

(57) [Claims]

1. A voice recognition device comprising: a speech recognition processing unit including a plurality of speech recognition neural network units, each trained in advance on voice patterns of different characteristics so as to recognize predetermined voice data, and each performing a speech recognition operation that determines whether input voice data matches the voice data to be recognized and an operation that outputs fitness determination data representing the degree of fitness of the speech recognition; selecting means for selecting, based on the fitness determination data output from each speech recognition neural network unit, the speech recognition neural network unit having the highest fitness for speech recognition; and output control means for outputting the speech recognition result of the speech recognition neural network unit selected by the selecting means.

2. The voice recognition device according to claim 1, further comprising feature extraction means for cutting out the input voice data in frame units, converting it into feature vectors, and sequentially outputting the feature vectors, wherein each speech recognition neural network unit receives the feature vectors output from the feature extraction means as voice data.

3. The voice recognition device according to claim 1, wherein each speech recognition neural network unit is configured by mutually connecting a plurality of neurons each having an internal state value X; each neuron is formed as a dynamic neuron whose internal state value X changes over time so as to satisfy a function X = G(X, Zj) expressed using the input data Zj (j = 0 to n, where n is a natural number) given to the neuron and the internal state value X; and each dynamic neuron is formed so as to convert its internal state value X into a value satisfying a function F(X) and output that value.

4. The voice recognition device according to claim 3, wherein the function X = G(X, Zj) is formed as follows: [equation not reproduced]

5. The voice recognition device according to claim 3, wherein the function X = G(X, Zj) is formed as follows, using the coupling strength Wij that couples the output of the j-th neuron to the input of the i-th neuron, the external input value Di, and the bias value θi: [equation not reproduced]

6. The function according to claim 3, wherein the function X = G (X, Zj) is calculated by using a sigmoid function S: A speech recognition device characterized by being formed as follows.

7. The voice recognition device according to claim 3, wherein the function X = G(X, Zj) is formed as follows, using a sigmoid function S, the coupling strength Wij that couples the output of the j-th neuron to the input of the i-th neuron, the external input value Di, and the bias value θi: [equation not reproduced]

8. The voice recognition device according to claim 3, wherein each speech recognition neural network unit includes an input neuron to which voice data is input, a recognition result output neuron that outputs the recognition result of the voice data, and a fitness output neuron that outputs the fitness determination data; the fitness output neuron is formed so as to estimate the voice data input to the input neuron and to output the estimated data as the fitness determination data; and the selecting means computes the correct answer rate of the estimated data with respect to the actual voice data as the fitness of speech recognition.

9. The voice recognition device according to claim 3, wherein the function F (X) of each of the dynamic neurons is a sigmoid function.

10. The voice recognition device according to claim 3, wherein the function F (X) of each of the dynamic neurons is a threshold function.

11. The voice recognition device according to claim 3, wherein the input data Zj of each dynamic neuron includes data obtained by multiplying the neuron's own output by a weight and feeding it back.

12. The speech recognition device according to claim 3, wherein each of the dynamic neurons includes, as the input data Zj, data obtained by multiplying an output of another neuron by a weight.

13. The voice recognition device according to claim 3, wherein each of the dynamic neurons includes desired data given from the outside as the input data Zj.