JPH064097A

JPH064097A - Speaker recognizing method

Info

Publication number: JPH064097A
Application number: JP4159442A
Authority: JP
Inventors: Mitsuhiro Inazumi; 満広稲積
Original assignee: Seiko Epson Corp
Current assignee: Seiko Epson Corp
Priority date: 1992-06-18
Filing date: 1992-06-18
Publication date: 1994-01-14

Abstract

PURPOSE: To recognize a speaker accurately from a small amount of training data by using a neural network whose elements take an internal state value and external input values as input, update the internal state value from them, and convert it into an external output value. CONSTITUTION: The method comprises a speech feature extracting means 501, a neural network 502, an input value storage means 504 which stores the input to the neural network 502, and a data comparing means 503 which compares the stored input with the output. The input value storage means 504 stores only the input data of the preceding frame, and the speaker is recognized by comparing that data with the output produced for the current frame. As is readily understood from the mechanism of speech production, a speech feature time series reflects the continuous motion of the speech and articulatory organs. Because the individuality of the speaker appears as a characteristic of that motion, the individual can be recognized by processing the feature time series. The individuality of the input speech is therefore evaluated by restoring, from the current data, the data immediately preceding it.

Description

Detailed Description of the Invention

[0001]

[Field of Industrial Application] The present invention relates to a method of recognizing a speaker.

[0002]

[Prior Art] Methods used for speaker recognition fall broadly into two groups. One uses the time series of utterance feature vectors obtained from the input speech as it is; the other uses statistical feature quantities obtained by statistically processing that time series. These methods are described in detail in, for example, Chapter 9 of "Digital Speech Processing" by Sadaoki Furui (Tokai University Press).

[0003] However, the methods that use statistical feature quantities require a large amount of data to achieve highly accurate recognition, and obtaining such a large amount of data from a speaker is very difficult in practice. For example, requiring the speaker to be recognized to talk for several minutes in front of the speaker recognition device severely limits the applications of the device. If, to relax this limitation, the statistics are estimated from a small amount of data, the estimation error degrades the recognition accuracy.

[0004] When a large amount of data is available, the statistical methods described above are considered a good choice. When the data is small in quantity, however, methods that use the speaker feature time series obtained from it often give better results. This is because, whereas statistical quantities are mostly static, a method that processes the time series itself can also exploit the dynamic aspects of the speaker's characteristics. Conversely, unless a method can handle these dynamic characteristics accurately, it cannot recognize the speaker from the speaker feature data. As also noted in the above-mentioned "Digital Speech Processing", this means that, although speaker recognition calls for its own special considerations, it essentially requires methods very similar to those of speech recognition.

[0005] Conventional speech recognition methods are broadly classified into the DP matching method (DP method), the hidden Markov model method (HMM method), and the method using a multilayer perceptron, a neural network trained by backpropagation (MLP method). Details are given in, for example, "Speech Recognition by Probabilistic Models" by Seiichi Nakagawa (Institute of Electronics, Information and Communication Engineers) and "Speech, Hearing and Neural Network Models" by Nakagawa, Shikano, and Tohkura (Ohmsha).

[0006] When the DP method is used for speaker recognition, standard data is first collected for each speaker. At recognition time, for each standard pattern, the start and end points of the input data and of the standard pattern are assumed to correspond, and the correspondence between their internal elements is varied using various time normalization functions. The correspondence that minimizes the difference is found, and the difference between the patterns under that correspondence is taken as the distance between the input data and the standard pattern. The speaker represented by the standard pattern with the smallest distance is the recognition result.

[0007] The correspondence between start and end points has to be assumed here because the distance between the input pattern and a standard pattern grows in proportion to the length of the pattern. For example, even when speaker recognition uses a particular word or sentence, the speaking rate differs from person to person and also changes with circumstances for the same speaker. To compare distances between standard patterns of different lengths without depending on pattern length, the distance must therefore be normalized by the length of the standard pattern or of the input data, and for that the extent of the pattern, that is, the correspondence between its start and end points, is indispensable.
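As an illustration of the DP matching procedure described in the two preceding paragraphs, the following Python sketch aligns an input feature sequence with each speaker's standard pattern, assuming that the start and end points correspond, and divides the accumulated distance by the combined length. The function names, the Euclidean frame distance, the three-way step pattern, and the normalization by n + m are assumptions made for this example; the source does not specify them.

```python
import math


def dtw_distance(a, b):
    """Length-normalised DP matching distance between two feature sequences."""
    def frame_dist(x, y):
        return math.sqrt(sum((p - q) ** 2 for p, q in zip(x, y)))

    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j]: smallest accumulated distance aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0  # the start points are assumed to correspond
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_dist(a[i - 1], b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    # the end points are assumed to correspond; dividing by n + m removes
    # the dependence of the distance on pattern length
    return cost[n][m] / (n + m)


def recognise_speaker(utterance, templates):
    """Return the speaker whose standard pattern is closest to the utterance."""
    return min(templates, key=lambda name: dtw_distance(utterance, templates[name]))
```

Note that both the alignment and the normalization need the whole pattern from start to end, which is the dependence on endpoints that the text identifies as a limitation.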

[0008] In the HMM method, a speaker is represented not by a standard pattern as in the DP method but by an HMM model composed of a plurality of states and a plurality of transitions. Each state of the HMM model is given an existence probability, and each transition is given a transition probability and an output probability; these probability values are determined by training on training data. Through these learned probability values, an HMM model represents one speaker statistically and probabilistically.

[0009] At recognition time, for each HMM model representing a speaker, the HMM method assumes, as in the DP method, a correspondence of start and end points with the input data. The probability of a transition from the start state to the end state, under the condition that the model outputs the input data sequence, is calculated as a measure of how close the input data is to that speaker. The speaker represented by the HMM model that maximizes this probability is taken as the speaker to whom the input data belongs, and this is the recognition result.
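The probability computation described above can be sketched with the standard forward algorithm. The sketch below is a simplification made for illustration: it uses a discrete HMM in which outputs are attached to states, whereas the text attaches output probabilities to transitions. All names and the tuple layout of a model are assumptions of this example.

```python
def forward_probability(obs, init, trans, emit, final_state):
    """Probability that the model emits `obs` and ends in `final_state`."""
    n_states = len(init)
    # alpha[s]: probability of the observations so far, ending in state s
    alpha = [init[s] * emit[s][obs[0]] for s in range(n_states)]
    for symbol in obs[1:]:
        alpha = [
            sum(alpha[r] * trans[r][s] for r in range(n_states)) * emit[s][symbol]
            for s in range(n_states)
        ]
    return alpha[final_state]


def most_likely_speaker(obs, models):
    """Pick the speaker whose model (init, trans, emit, final_state) scores highest."""
    return max(models, key=lambda name: forward_probability(obs, *models[name]))
```

Here `init`, `trans`, and `emit` play the roles of the existence, transition, and output probabilities learned for one speaker.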

[0010] An HMM model represents time series data as a time series in a statistical, probabilistic form, namely states and transitions. During training it is therefore necessary to identify which part of the training input data lies near the start state, which part lies near the end state, which part lies in between, and so on, and this requires that the start and end points of the data be given accurately. If the start point is given inaccurately and more varied data than necessary is assigned to the part near the start, the recognition ability of the model is lowered. Conversely, if necessary data is missing from the training data, input data containing the missing data cannot be recognized correctly. As a result, the likelihood of misrecognition increases.

[0011] At recognition time, the decision criterion in the HMM method is the probability of transition from the start state to the end state, that is, the existence probability in the final state. Because this value is a product of the output probabilities of the components of the input data sequence, the transition probabilities, and the existence probabilities of states, it decreases monotonically and very rapidly with the length of the data. Consequently, even if the existence probability of the final state is, say, 0.5, its weight, or meaning, is entirely different depending on whether it was obtained for a data sequence of length 10 or one of length 20. A decision that does not depend on data length therefore requires some processing that corrects for the length, and this in turn requires the length of the data, that is, its start and end points.
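A small numerical sketch makes the length dependence concrete. It compares the raw product of per-frame probabilities with a per-frame average of log-probabilities. The averaging is one common length correction chosen here as an assumption; the text only says that some correction is needed and does not name one.

```python
import math


def sequence_probability(frame_probs):
    """Raw product of per-frame probabilities; shrinks quickly with length."""
    prob = 1.0
    for p in frame_probs:
        prob *= p
    return prob


def per_frame_log_probability(frame_probs):
    """Length-corrected score: mean log-probability per frame."""
    return sum(math.log(p) for p in frame_probs) / len(frame_probs)


short = [0.9] * 10  # equally good per-frame match, 10 frames
long_ = [0.9] * 20  # equally good per-frame match, 20 frames
```

With these inputs the raw product for `long_` is much smaller than for `short` even though every frame matches equally well, while the per-frame score is the same for both. Computing it requires knowing the number of frames, that is, where the data starts and ends.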

[0012] As described above, both the DP method and the HMM method require a self-contained unit of data delimited by a start point and an end point, and a result can be obtained only for each such unit. If the unit is small, data collection is easy, but the variation in the data becomes large and recognition accuracy deteriorates. If the unit is extremely large, data collection becomes difficult, the modeling accuracy of the DP or HMM method also falls, and recognition accuracy again deteriorates. There should therefore be an optimal unit size, but it cannot be determined a priori. Moreover, approaches such as processing while varying this size, or processing each of the many possible start and end points, are very time-consuming.

[0013] A more important problem is that these methods can, in principle, perform speaker recognition only on data whose content is identical to that held as standard data. This is an inevitable consequence of the fact that the processing described above is simple pattern matching between input data and standard data.

[0014] With the MLP method, the other conventional approach, any number of output values can be obtained at any time, and there is no particular need to assume a unit such as the start and end of the data. However, the conventional MLP method raises a new start-and-end problem, concerning not the start and end of the data but the range of the input data. The MLP method is basically a method for recognizing static data, and to make it recognize time series data, the temporal structure of the input data must be reflected in the structure of the neural network in some way. The most widely used approach is to feed data covering a certain time range as a single input, thereby processing the time information equivalently. This time range, however, is necessarily fixed by the structure of the MLP.
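The fixed-range input described above can be sketched as a sliding window that stacks a fixed number of consecutive frames into one input vector. The function name and the flat concatenation are assumptions made for illustration.

```python
def fixed_windows(frames, width):
    """Stack `width` consecutive frames into one flat input vector per position."""
    return [
        [value for frame in frames[start:start + width] for value in frame]
        for start in range(len(frames) - width + 1)
    ]
```

Because `width` determines the size of the network's input layer, it cannot change once the network is built. An utterance shorter than `width` frames yields no input at all, and a temporal feature longer than `width` frames never appears whole in any single input.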

[0015] In that case it is difficult to recognize temporal features that extend beyond the input time range, and equally difficult to recognize temporal features that are too short compared with the input time range, because unnecessary data is inserted before and after the temporal feature to be recognized. Since the length of the input speaker feature time series can vary greatly between speakers, and even for the same speaker, such a mismatch with the input range can occur with very high probability.

[0016] One example that does not have such a fixed input range is a variant of the conventional MLP method in which the output is fed back to the input side. An example, although for character recognition, can be found in "Recognition of Printed Handwritten Character Strings Using a Three-Layer BP Model with Feedback Connections", Transactions of the Institute of Electronics, Information and Communication Engineers D-II, Vol. J74 (1991), pp. 1556-1564.

[0017] However, as is clear from the above reference, these methods have problems: it is difficult to make the training of the neural network converge, and the training outputs (teacher signals) needed for it must be devised by trial and error.

[0018]

[Problems to Be Solved by the Invention] As described above, conventional speaker recognition methods have the following problems. 1) With methods that use statistical quantities, collecting the data is very difficult, and estimating the quantities from a small amount of data is prone to error.

[0019] 2) Methods that process the feature time series data with the DP or HMM method require data of a suitable length together with its start and end points, and are time-consuming. It is also difficult to obtain results continuously.

[0020] 3) In this speaker recognition processing, the input data is constrained to have the same utterance content as the standard data.

[0021] 4) Methods that process the feature time series with the conventional MLP method require the start and end of an input range and have difficulty coping with changes in data length. It is also difficult to make the training converge.

[0022] These are among the problems of the conventional methods.

[0023]

[Means for Solving the Problems] To solve the above problems, the speaker recognition method of the present invention is a speaker recognition method using a neural network, characterized in that the neural network is composed of nerve cell-like elements each having at least an internal state value storage means, an internal state value updating means that updates the internal state value using the internal state value and external input values as input, and an output value generating means that converts the internal state value into an external output value.

[0024]

[Embodiments] (Embodiment 1) FIG. 1 schematically shows the function of a nerve cell-like element constituting the neural network of the present invention. In the figure, 101 denotes the internal state value storage means of the nerve cell-like element; 102 denotes the internal state value updating means, which updates the internal state value using as input the internal state value stored in 101 and the external input values described below; 103 denotes the output value generating means, which converts the internal state value into an external output; and 104 denotes the nerve cell-like element as a whole.

[0025] The external input values shown in this figure may include the output of the nerve cell-like element itself multiplied by a connection weight, the outputs of other nerve cell-like elements likewise multiplied by connection weights, a fixed output value multiplied by a connection weight so as to act equivalently as a bias on the internal state value updating means, and the input data supplied to the neural network.

[0026] FIG. 2 schematically shows the function of a nerve cell-like element constituting a neural network of the conventional MLP method. In the figure, 201 denotes an internal state value calculating means that calculates the internal state value, 202 denotes an output value generating means that converts the internal state value calculated by 201 into an external output, and 203 denotes the nerve cell-like element as a whole.

[0027] As is clear from FIG. 2, the output value of a conventional nerve cell-like element is determined solely by the input values at that moment. In that sense, the operation of the conventional nerve cell-like element is static. To have such static elements process time series data, the temporal structure of the target time series data must be reflected in the structure of the neural network in some way.

[0028] In a neural network using the nerve cell-like elements of the present invention, by contrast, the past history of the data is converted into and held as the internal state values of the elements. The operation of the nerve cell-like element of the present invention is dynamic in the sense that the past history of its input is preserved as the internal state value and reflected in its output. Therefore, unlike a neural network using conventional nerve cell-like elements, the neural network of the present invention can process time series data without relying on the structure of the network.

[0029] In some variants of the conventional approach, such history information is stored as context in a special group of nerve cell-like elements. In such a configuration, however, the functions of the nerve cell-like elements making up the neural network are no longer uniform, and the processing becomes complicated. In either case, as described earlier, the conventional techniques lead to more complicated processing, larger amounts of data and data memory, and lower recognition accuracy.

[0030] The operation of the nerve cell-like element of the present invention is now described in detail. Consider the time evolution of its internal state value X and its output value Y. Let Xcurr be the current internal state value, Xnext the updated internal state value, and Zi the external input values described above at the time of the update (i runs from 0 to n, where n is the number of external inputs to the element). If the operation of the internal state value updating means is written formally as a function G, then Xnext = G(Xcurr, Z1, ..., Zi, ..., Zn). Various concrete forms of G are conceivable; one possibility is Equation 2 below, which uses a first-order differential equation. Here τ is a constant.

[0031]

[Equation 2]

[0032] A slightly modified form of this can also be expressed as Equation 3 below.

[0033]

[Equation 3]

[0034] Here, Wij denotes the connection strength that couples the output of the j-th nerve cell-like element to the input of the i-th nerve cell-like element, Di denotes an external input value, and θi denotes a bias value. The bias value can also be treated as a connection to a fixed value and included in Wij.
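Equations 2 and 3 are not reproduced in this text, so the following Python sketch rests on an assumed form that is consistent with the symbols defined above: a first-order system τ dXi/dt = -Xi + Σj Wij Yj + Di + θi, integrated with one explicit Euler step per frame. The exact right-hand side, the step size `dt`, the use of tanh as the output function, and all function names are assumptions of this example, not the patent's stated equations.

```python
import math


def output(x):
    """Assumed output function F: a symmetric sigmoid with values in (-1, 1)."""
    return math.tanh(x / 2.0)


def step(x, w, d, theta, tau=2.0, dt=1.0):
    """Advance every element's internal state by one Euler step.

    x: current internal states, w[i][j]: weight from element j to element i,
    d: external inputs, theta: bias values.
    """
    y = [output(xi) for xi in x]  # outputs derived from the current states
    n = len(x)
    x_next = []
    for i in range(n):
        drive = sum(w[i][j] * y[j] for j in range(n)) + d[i] + theta[i]
        # the new state keeps part of the old state, so past inputs persist
        x_next.append(x[i] + (dt / tau) * (-x[i] + drive))
    return x_next
```

Under these assumptions a single element driven by a constant input of 1 moves from 0 to 0.5 and then to 0.75, and it decays gradually rather than resetting when the input is removed. This is the sense in which the internal state carries the input history.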

[0035] Let X be the internal state of the nerve cell-like element at a given moment, determined as above. If the operation of the output value generating means is written formally as a function F, the output Y of the element is Y = F(X). A concrete form of F may be, for example, a sigmoid (logistic) function with symmetric positive and negative output, as shown in Equation 4 below.

[0036]

[Equation 4]

[0037] This functional form is not essential, however; a simpler linear transformation, a threshold function, or the like may also be used.
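The three kinds of output function mentioned here can be sketched as follows. Equation 4 is not reproduced in this text, so the specific sigmoid 2 / (1 + exp(-x)) - 1 is an assumed, commonly used symmetric form; the gain parameter and the threshold levels of -1 and 1 are likewise assumptions of this example.

```python
import math


def symmetric_sigmoid(x):
    """Sigmoid with output in (-1, 1).

    tanh(x / 2) is algebraically equal to 2 / (1 + exp(-x)) - 1 and does
    not overflow for large negative x.
    """
    return math.tanh(x / 2.0)


def linear(x, gain=1.0):
    """Simple linear transformation of the internal state."""
    return gain * x


def threshold(x):
    """Two-level output: 1 at or above zero, -1 below."""
    return 1.0 if x >= 0.0 else -1.0
```

A symmetric range matches feature values that themselves lie between -1 and 1.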

[0038] Following these equations, the time series of the outputs of the neural network of the present invention is calculated by the processing shown in FIG. 7. In FIG. 7, the nerve cell-like elements are referred to simply as nodes for brevity.

[0039] For the neural network to perform the desired processing, it must be trained. One possible training method is a learning rule that uses the quantity C introduced by Equation 5 below.

[0040]

[Equation 5]

[0041] Here, C is a learning evaluation value and E is an error evaluation value. Following this equation, C is determined by the processing shown in FIG. 8.

[0042] A concrete form of the error evaluation E, where Y is the actual output value and T is the desired output value, may be the Kullback-Leibler distance expressed by Equation 6 below.

[0043]

[Equation 6]

[0044] When the output values range from -1 to 1, the expression in Equation 7 below is used; it is substantially equivalent to Equation 6.

[0045]

[Equation 7]
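Equations 6 and 7 are not reproduced in this text. The sketch below shows one common way to write a Kullback-Leibler-type error, treating each value as a Bernoulli probability, together with the variant for values between -1 and 1 obtained by the mapping v -> (1 + v) / 2. These formulas and the function names are assumptions made for illustration and may differ from the patent's own expressions.

```python
import math


def _term(p, q):
    """p * log(p / q), with the usual convention that 0 * log(0) is 0."""
    return 0.0 if p == 0.0 else p * math.log(p / q)


def kl_error_unit(targets, outputs):
    """KL-type error for targets in [0, 1] and outputs strictly inside (0, 1)."""
    return sum(_term(t, y) + _term(1.0 - t, 1.0 - y) for t, y in zip(targets, outputs))


def kl_error_symmetric(targets, outputs):
    """Same error for values in [-1, 1], after mapping v to (1 + v) / 2."""
    return kl_error_unit(
        [(1.0 + t) / 2.0 for t in targets],
        [(1.0 + y) / 2.0 for y in outputs],
    )
```

The error is zero when the output equals the target and positive otherwise. The outputs must stay strictly inside the open interval, which a sigmoid output function guarantees.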

[0046] Under these assumptions, Equation 5 above can be written more concretely as Equation 8 below.

[0047]

[Equation 8]

[0048] Given these, the update rule for the weighting coefficients of the various external inputs described above is given by Equation 9 below.

[0049]

[Equation 9]

[0050] In the present invention, unlike backpropagation learning in the conventional MLP method, the internal states and output values of all the nerve cell-like elements constituting the neural network can be updated simultaneously. They may of course also be updated sequentially. Likewise, the neural network of the present invention need not be layered as in the conventional MLP method. Fully connected neural networks, layered neural networks, and neural networks of other, more general configurations are all possible.

[0051] FIG. 3 schematically shows the processing by which speaker feature time series data for speaker recognition is extracted. In outline, speech input through a speech input means is first digitized by an A/D converter or the like. A portion called a frame is then taken from the digitized input, and its features are extracted to form one feature vector. The temporal succession of such feature vectors is the feature time series of the input speaker.
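The framing step can be sketched as below. The frame length and hop size are not given in the source, so both are free parameters here, and the function name is hypothetical.

```python
def split_into_frames(samples, frame_length, hop):
    """Cut digitised speech samples into (possibly overlapping) frames."""
    return [
        samples[start:start + frame_length]
        for start in range(0, len(samples) - frame_length + 1, hop)
    ]
```

Each frame would then be passed to a feature extractor to produce one feature vector, and the resulting sequence of vectors forms the feature time series.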

[0052] FIG. 5 is a schematic diagram of the configuration of a speaker recognition method according to one embodiment of the present invention. In the figure, 501 denotes a speech feature extracting means as described with reference to FIG. 3, 502 denotes a neural network as described above, 504 denotes an input value storage means that stores the input to the neural network, and 503 denotes a data comparing means that compares the stored input with the output.

[0053] In this embodiment, the neural network was trained to take the feature vector of a given frame as input and to output the feature vector of the immediately preceding frame. The speaker features used here are 8th-order PARCOR coefficients. Various features other than PARCOR coefficients can also be used, but PARCOR coefficients have the properties that their values lie in principle between -1 and 1 and that they depend comparatively strongly on the speaker, which makes them particularly effective features for speaker recognition.
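For reference, PARCOR (partial autocorrelation, or reflection) coefficients of a frame can be obtained with the Levinson-Durbin recursion on the frame's autocorrelation. The sketch below is a minimal version under stated assumptions: no windowing or pre-emphasis is applied, the function names are hypothetical, and the sign convention is one of several in use. The source does not describe how its coefficients were computed.

```python
def autocorrelation(frame, max_lag):
    """Autocorrelation of one frame for lags 0..max_lag."""
    n = len(frame)
    return [
        sum(frame[t] * frame[t + lag] for t in range(n - lag))
        for lag in range(max_lag + 1)
    ]


def parcor(frame, order=8):
    """Reflection (PARCOR) coefficients via the Levinson-Durbin recursion."""
    r = autocorrelation(frame, order)
    a = [0.0] * (order + 1)  # linear prediction coefficients a[1..m]
    err = r[0]               # prediction error energy
    ks = []
    for m in range(1, order + 1):
        if err <= 0.0:       # silent or perfectly predictable frame
            ks.extend([0.0] * (order - len(ks)))
            break
        acc = r[m] - sum(a[j] * r[m - j] for j in range(1, m))
        k = acc / err
        ks.append(k)
        new_a = a[:]
        new_a[m] = k
        for j in range(1, m):
            new_a[j] = a[j] - k * a[m - j]
        a = new_a
        err *= 1.0 - k * k
    return ks
```

With this autocorrelation estimate each coefficient has magnitude at most 1, which fits the range of the symmetric output function used by the network.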

[0054] In this embodiment, the input value storage means stores only the input data of the preceding frame, and speaker recognition is performed by comparing it with the output data produced for the current frame. As is readily understood from the mechanism of speech production, a speech feature time series reflects the continuous motion of the speech and articulatory organs. Because the individuality of the speaker appears as a characteristic of this motion, the individual can be recognized by processing the feature time series. In this embodiment, the individuality of the input speech is evaluated by restoring the preceding data from the current data. The neural network in this example has an asymmetric, fully connected configuration that includes self-loops. As noted above, however, the neural network of the present invention can take a random configuration, of which layered and fully connected networks are special cases. The numbers of input elements, hidden elements, and output elements were all set to the same value.
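The comparison loop of this embodiment can be sketched as follows. The trained network is represented by a hypothetical callable `network_step` that takes the current frame's feature vector and returns its estimate of the previous frame's vector. The Euclidean norm of the difference is used as the error, which is an assumption of this example.

```python
import math


def reconstruction_errors(frames, network_step):
    """Error between the stored previous input and the network output, per frame."""
    errors = []
    stored = None  # input value storage: holds only the preceding frame
    for frame in frames:
        predicted_previous = network_step(frame)
        if stored is not None:
            diff = [p - s for p, s in zip(predicted_previous, stored)]
            errors.append(math.sqrt(sum(d * d for d in diff)))
        stored = frame  # overwrite: nothing older than one frame is kept
    return errors
```

Only one frame is ever stored, and an error value becomes available at every frame after the first, so no start or end point of the utterance has to be known.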

[0055] In this embodiment, nine words were used as standard data for training the neural network: 「終点」 (end point), 「腕前」 (skill), 「拒絶」 (rejection), 「超越」 (transcendence), 「とりあえず」 (for the time being), 「分類」 (classification), 「ロッカー」 (locker), 「山脈」 (mountain range), and 「隠れピューリタン」 (hidden Puritan). The speech data was taken from ATR's Japanese speech database for research use.

[0056] With the method of the present invention configured as above, the problems seen in neural networks of the "BP model with feedback connections" type, a variant of the conventional MLP method, do not arise: namely, that training is difficult to make converge and that the training outputs for it must be created by trial and error. The neural network of the speaker recognition method of the present invention could be made to generate the desired output very easily, with several hundred to several thousand training iterations.

【００５７】図９、図１０はそのようにして学習させた
ニューラルネットワークによる話者認識の結果の例であ
る。図中の実線は話者ＭＡＵの音声を認識させるために
学習させたニューラルネットワークの生成した出力によ
る誤差の時間変化を、また波線は話者ＭＸＭの音声を認
識させるために学習させたニューラルネットワークが生
成しった出力の誤差の時間変化を示したものである。こ
こで示した誤差は、８次の入力ベクトルデータ、及び出
力ベクトルとによりデータ比較手段により生成された誤
差ベクトルの長さの絶対値を、その時点でのフレームの
前後３２フレームについて平均した値を示したものであ
る。また図９の入力話者はＭＡＵであり、図１０入力話
者はＭＸＭである。FIGS. 9 and 10 show examples of speaker recognition results obtained with neural networks trained in this way. The solid line in each figure shows the time variation of the error of the output generated by the neural network trained to recognize the voice of speaker MAU, and the broken line shows the time variation of the error of the output generated by the neural network trained to recognize the voice of speaker MXM. The error shown is the magnitude of the error vector generated by the data comparison means from the 8th-order input vector and the output vector, averaged over the 32 frames before and after the frame at each point in time. The input speaker is MAU in FIG. 9 and MXM in FIG. 10.
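
The averaging of the error over neighbouring frames can be sketched as a centred moving average. This is only an illustration under two assumptions that the text does not settle: "32 frames before and after" is read as 32 frames on each side of the current frame, and the window is simply truncated at the start and end of the utterance. The name `smoothed_error` and the parameter `half_window` are hypothetical.

```python
def smoothed_error(errors, half_window=32):
    """Average per-frame error magnitudes over a centred window.

    errors      : list of error-vector lengths, one per frame
    half_window : number of frames taken on each side (assumed reading)
    The window is truncated at the utterance boundaries (an assumption).
    """
    smoothed = []
    for t in range(len(errors)):
        start = max(0, t - half_window)
        stop = min(len(errors), t + half_window + 1)
        window = errors[start:stop]
        smoothed.append(sum(window) / len(window))
    return smoothed
```

Plotting the returned list against the frame index would give a curve of the same kind as those described for FIGS. 9 and 10.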

【００５８】図より明かであるように、図９の場合はＭ
ＡＵの声で訓練されたニューラルネットワークによるデ
ータ復元誤差が小さく、ＭＸＭで訓練されたニューラル
ネットワークによる復元誤差の方が大きい。これはＭＡ
Ｕの発話特徴を用いたデータ復元の方が精度の良い復元
が可能である事を示し、つまり入力された音声がＭＡＵ
によるものである事を示している。As is clear from the figure, in the case of FIG. 9 the data restoration error of the neural network trained on MAU's voice is small, and the restoration error of the neural network trained on MXM's voice is larger. This shows that restoration using MAU's utterance features is more accurate, that is, that the input speech was produced by MAU.

【００５９】また図１０の場合は図９の場合とは逆にＭ
ＸＭの声で訓練されたニューラルネットワークによるデ
ータ復元誤差が小さく、つまりこの入力された音声がＭ
ＸＭによるものである事を示している。In the case of FIG. 10, conversely to FIG. 9, the data restoration error of the neural network trained on MXM's voice is small, which shows that the input speech was produced by MXM.
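
The decision implied by FIGS. 9 and 10, attributing the input to the speaker whose network restores it with the smaller error, might be sketched as follows. The function name `identify_speaker` and the numeric values are illustrative assumptions and are not taken from the figures or tables.

```python
def identify_speaker(errors_by_network):
    """Pick the network with the smallest mean restoration error.

    errors_by_network maps a speaker label to the list of per-frame
    errors produced by the network trained on that speaker.
    Returns (best_label, mean_errors).
    """
    mean_errors = {
        name: sum(errors) / len(errors)
        for name, errors in errors_by_network.items()
    }
    best_label = min(mean_errors, key=mean_errors.get)
    return best_label, mean_errors


# Illustrative values only, not measurements from the specification.
example = {"MAU": [0.10, 0.20, 0.30], "MXM": [0.50, 0.60, 0.70]}
best, means = identify_speaker(example)
```

With these made-up numbers the input would be attributed to MAU, because the MAU-trained network has the smaller mean error.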

【００６０】上の図より明かであるように、本発明の話
者認識方法によれば、連続した話者認識結果を得る事が
できる。As is clear from the above figure, according to the speaker recognition method of the present invention, continuous speaker recognition results can be obtained.

【００６１】下の表１は上の例の二つのニューラルネッ
トワークに、訓練話者以外の９話者を含む合計１１人の
音声を入力した場合の誤差の平均値を示したものであ
る。入力は訓練に用いた９単語そのもであり、平均はそ
の全発話区間について行った。表より明かであるよう
に、それぞれのニューラルネットワークにおいて、１１
人の音声入力に対し訓練話者に対する誤差が一番小さ
く、１１人の中から正確に訓練話者認識している事が示
される。Table 1 below shows the average error when the speech of 11 speakers in total, including 9 speakers other than the training speakers, was input to the two neural networks of the above example. The input was the same 9 words used for training, and the average was taken over all utterance intervals. As is clear from the table, in each neural network the error for the training speaker is the smallest among the 11 input speakers, showing that the training speaker is correctly recognized from among the 11 speakers.

【００６２】[0062]

【表１】 [Table 1]

【００６３】また、下の表２は表１と同様の結果である
が、上の場合と異なり、訓練に用いいた単語音声とは内
容が異なる単語音声を入力した場合の結果である。ここ
で用いた単語は「カレンダー」「いらっしゃる」「極
端」「駐車」「プログラム」「録音」「購入」「タイピ
ュータ」である。Table 2 below shows results of the same kind as Table 1, except that the input word utterances differ in content from those used for training. The words used here are "calendar", "irassharu" (an honorific verb meaning "to come" or "to be"), "extreme", "parking", "program", "recording", "purchase", and "typuter".

【００６４】[0064]

【表２】 [Table 2]

【００６５】上の表より明かであるように、本発明の話
者認識方法は入力された音声の発話内容が異なっても正
確に訓練話者を認識している事が示される。As is clear from the above table, it is shown that the speaker recognition method of the present invention accurately recognizes the training speaker even if the utterance content of the input voice is different.

【００６６】また、上の説明は時間的に離散的な場合に
ついて説明をしてきたが、例えばアナログ的な処理を行
う事により連続時間処理においても適用可能である。Although the above description has dealt with the discrete-time case, the method is also applicable to continuous-time processing, for example by performing the processing in analog form.

【００６７】（実施例２）図４は実施例１の変形とし
て、入力された音声特徴そのものを出力するように訓練
した例である。図中の番号４０１は音声特徴入力手段
を、４０２はニューラルネットワークを、４０３はデー
タ比較手段をそれぞれ模式的に示す。(Embodiment 2) FIG. 4 is a modification of Embodiment 1 in which training is carried out so that the input speech feature itself is output. In the figure, reference numeral 401 is a voice feature input means, 402 is a neural network, and 403 is a data comparison means.

【００６８】この例においても実施例１と同様の効果を
得る事ができる。Also in this example, the same effect as that of the first embodiment can be obtained.

【００６９】（実施例３）図６は実施例１の変形とし
て、入力された音声特徴から、ｎステップ将来のフレー
ムの入力データを予測して出力するように訓練した例で
ある。図中の番号６０１は音声特徴入力手段を、６０２
はニューラルネットワークを、６０３はデータ比較手段
を、６０４は出力値記憶手段をそれぞれ模式的に示す。(Embodiment 3) FIG. 6 shows a modification of Embodiment 1 in which the network is trained to predict and output, from the input speech features, the input data of the frame n steps in the future. In the figure, reference numeral 601 schematically denotes the voice feature input means, 602 the neural network, 603 the data comparison means, and 604 the output value storage means.
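
For the n-step prediction variant, the comparison would pair an output stored n frames earlier with the input that actually arrives now. The sketch below shows only that alignment. The function name `prediction_errors` and the use of the Euclidean distance are assumptions made for illustration.

```python
import math


def prediction_errors(inputs, outputs, n):
    """Compare the output stored n frames ago with the current input.

    outputs[t - n] is taken to be the network's prediction of inputs[t].
    Returns one error value per frame, starting at frame n.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    return [
        math.dist(outputs[t - n], inputs[t])  # stored prediction vs. actual input
        for t in range(n, len(inputs))
    ]
```

The resulting error sequence can then be averaged and compared across speaker-specific networks in the same way as the restoration error of Embodiment 1.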

【００７０】この例においても実施例１と同様の効果を
得る事ができる。Also in this example, the same effect as that of the first embodiment can be obtained.

【００７１】（実施例４）図１１は上の実施例とは異な
り、入力された話者と訓練に用いた話者の類似度を直接
に出力するように学習させる例である。この場合の入力
は上の実施例と同様のものが可能であり、また学習用出
力としては、ある特定の話者の入力に対し、その話者に
対応付けられた特定の類似度出力素子が出力を出すよう
にすれば良い。この出力は任意の数である事が可能であ
る。(Embodiment 4) FIG. 11 shows an example which, unlike the above embodiments, is trained to output directly the similarity between the input speaker and the speakers used for training. The input in this case can be the same as in the above embodiments, and the training output is chosen so that, for input from a particular speaker, the specific similarity output element associated with that speaker produces an output. There can be any number of such outputs.

【００７２】（実施例５）図１２は上の実施例４と類似
したものである。この場合、実施例４の、ある特定の話
者の入力に対し、その話者に対応付けられた特定の類似
度出力素子が出力を出すと言う事に加えて、目的とする
話者以外の入力に対し、非類似度出力が出力を出すよう
に学習させる例である。この出力は上と同様に任意の数
である事が可能である。一般にこのような学習の結果得
られる非類似度出力は、類似度出力を単純に反転したも
のにはならず、それらを組み合わせたより高度な判断が
可能となる。(Embodiment 5) FIG. 12 is similar to Embodiment 4 above. In addition to having the specific similarity output element associated with a particular speaker produce an output for that speaker's input, as in Embodiment 4, the network in this example is trained so that a dissimilarity output produces an output for input from anyone other than the target speaker. As above, there can be any number of such outputs. In general, the dissimilarity output obtained through such learning is not a simple inversion of the similarity output, so combining the two allows a more sophisticated judgment.
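
One way to picture the training outputs of Embodiments 4 and 5 is as a target vector with one similarity element per target speaker, followed by dissimilarity elements. The sketch below is hypothetical: the layout (one dissimilarity element per target speaker), the target levels `high` and `low`, and the name `make_targets` are all assumptions, since the specification leaves the number and arrangement of outputs open.

```python
def make_targets(input_speaker, target_speakers, high=1.0, low=0.0):
    """Build a training target for one utterance (assumed layout).

    The first len(target_speakers) elements are similarity targets,
    the remaining ones are dissimilarity targets for the same speakers.
    """
    similarity = [
        high if speaker == input_speaker else low
        for speaker in target_speakers
    ]
    dissimilarity = [
        low if speaker == input_speaker else high
        for speaker in target_speakers
    ]
    return similarity + dissimilarity
```

These are targets only. As the description notes, the dissimilarity output a trained network actually produces need not be a simple inversion of its similarity output.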

【００７３】[0073]

【発明の効果】以上述べてきたように、本発明の話者認
識方法によれば、１）、非常に少数の学習データで精度の高い話者認識が
可能である。As described above, according to the speaker recognition method of the present invention, 1) it is possible to perform highly accurate speaker recognition with a very small amount of learning data.

【００７４】２）、話者の発話特徴そのものを認識する
ために、話者認識処理の際のデータが訓練時のものと異
なっても話者認識がかのうである。2) In order to recognize the utterance feature itself of the speaker, it is possible to recognize the speaker even if the data in the speaker recognition process is different from that in the training.

【００７５】３）、学習が極めて容易であり、そのため
の試行錯誤的な部分が非常に少ない。などの効果がある。3) Learning is extremely easy, and it involves very little trial and error. These and other effects are obtained.

【００７６】また本発明の方法は話者認識のみではな
く、未知話者と既知話者との類似度の判定等に用いる事
ができる。また本発明の方法は音声のみではなく、広く
時系列情報一般の処理においても有効である。The method of the present invention can be used not only for speaker recognition, but also for determining the degree of similarity between an unknown speaker and a known speaker. Further, the method of the present invention is effective not only for speech but also for processing of time series information in general.

[Brief description of drawings]

【図１】本発明におけるニューラルネットワークを構成
する神経細胞様素子の機能の模式図である。FIG. 1 is a schematic diagram of the function of a nerve cell-like element that constitutes a neural network in the present invention.

【図２】従来例のニューラルネットワークを構成する神
経細胞様素子の機能の模式図である。FIG. 2 is a schematic diagram of the function of a nerve cell-like element that constitutes a conventional neural network.

【図３】音声特徴抽出手段の構成の模式図である。FIG. 3 is a schematic diagram of a configuration of voice feature extraction means.

【図４】本発明の話者認識方法の構成の１実施例の模式
図である。FIG. 4 is a schematic diagram of an embodiment of the configuration of the speaker recognition method of the present invention.

【図５】本発明の話者認識方法の構成の１実施例の模式
図である。FIG. 5 is a schematic diagram of one embodiment of the configuration of the speaker recognition method of the present invention.

【図６】本発明の話者認識方法の構成の１実施例の模式
図である。FIG. 6 is a schematic diagram of an embodiment of the configuration of the speaker recognition method of the present invention.

【図７】本発明の話者認識方法におけるニューラルネッ
トワークの処理の流れの模式図である。FIG. 7 is a schematic diagram of a flow of processing of a neural network in the speaker recognition method of the present invention.

【図８】本発明の話者認識方法におけるニューラルネッ
トワークの学習の際お誤差評価の流れを示す模式図であ
る。FIG. 8 is a schematic diagram showing a flow of error evaluation when learning a neural network in the speaker recognition method of the present invention.

【図９】本発明の１実施例における話者認識の結果を示
す図である。FIG. 9 is a diagram showing a result of speaker recognition according to an embodiment of the present invention.

【図１０】本発明の１実施例における話者認識の結果を
示す図である。FIG. 10 is a diagram showing a result of speaker recognition according to an embodiment of the present invention.

【図１１】本発明の話者認識方法の構成の１実施例の模
式図である。FIG. 11 is a schematic diagram of an embodiment of the configuration of the speaker recognition method of the present invention.

【図１２】本発明の話者認識方法の構成の１実施例の模
式図である。FIG. 12 is a schematic diagram of one embodiment of the configuration of the speaker recognition method of the present invention.

[Explanation of symbols]

１０１：内部状態値記憶手段
１０２：内部状態値更新手段
１０３：出力値生成手段
１０４：神経細胞様素子
２０１：内部状態値計算手段
２０２：出力値生成手段
２０３：神経細胞様素子
４０１：音声特徴抽出手段
４０２：ニューラルネットワーク
４０３：データ比較手段
５０１：音声特徴抽出手段
５０２：ニューラルネットワーク
５０３：データ比較手段
５０４：入力値記憶手段
６０１：音声特徴抽出手段
６０２：ニューラルネットワーク
６０３：データ比較手段
６０４：出力値記憶手段
１１０１：音声特徴抽出手段
１１０２：ニューラルネットワーク
１１０３：話者類似度出力１
１１０４：話者類似度出力２
１２０１：音声特徴抽出手段
１２０２：ニューラルネットワーク
１２０３：話者類似度・非類似度出力１
１２０４：話者類似度・非類似度出力２

101: internal state value storage means
102: internal state value updating means
103: output value generation means
104: nerve cell-like element
201: internal state value calculation means
202: output value generation means
203: nerve cell-like element
401: voice feature extraction means
402: neural network
403: data comparison means
501: voice feature extraction means
502: neural network
503: data comparison means
504: input value storage means
601: voice feature extraction means
602: neural network
603: data comparison means
604: output value storage means
1101: voice feature extraction means
1102: neural network
1103: speaker similarity output 1
1104: speaker similarity output 2
1201: voice feature extraction means
1202: neural network
1203: speaker similarity / dissimilarity output 1
1204: speaker similarity / dissimilarity output 2

Claims

[Claims]

1. A speaker recognition method using a neural network, wherein the neural network comprises nerve cell-like elements each including at least internal state value storage means, internal state value updating means for updating the internal state value based on the internal state value and external input values, and output value generating means for converting the internal state value into an external output value.

2. The speaker recognition method according to claim 1, wherein the internal state value updating means of the nerve cell-like element updates the internal state value to a value that satisfies the following relation, expressed in terms of the internal state value X of the nerve cell-like element and the inputs Zi to the nerve cell-like element (i is 0 to n; n is a natural number): [equation missing from this text]

3. The speaker recognition method according to claim 1, comprising data comparison means for comparing an input value with an output value.

4. The speaker recognition method according to claim 1, comprising input value storage means for storing an input value from a certain time T earlier, and data comparison means for comparing the stored input value with the current output data.

5. The speaker recognition method according to claim 1 or 2, comprising output value storage means for storing an output value from a certain time T earlier, and data comparison means for comparing the stored output data with the current input data.

6. The speaker recognition method according to claim 1, having a similarity output corresponding to each of one or more recognition target speakers.

7. The speaker recognition method according to claim 1 or 2, having a similarity output and a dissimilarity output corresponding to each of one or more recognition target speakers.

8. The speaker recognition method according to claim 1, wherein the input Zi includes at least PARCOR coefficients.

9. The speaker recognition method according to claim 8, wherein the output value generating means is an output function whose output value lies between -1 and 1.

10. The speaker recognition method according to claim 1, wherein the input Zi includes at least a value obtained by multiplying the output of the nerve cell-like element itself by a weight.

11. The speaker recognition method according to claim 1, wherein the input Zi includes at least a value obtained by multiplying the output of another nerve cell-like element by a weight.

12. The speaker recognition method according to claim 1, wherein the input Zi includes at least desired data provided from the outside.