JPH0415698A

JPH0415698A - Speaker collating system

Info

Publication number: JPH0415698A
Application number: JP2120865A
Authority: JP
Inventors: Shingo Nishimura; 新吾西村
Original assignee: Sekisui Chemical Co Ltd
Current assignee: Sekisui Chemical Co Ltd
Priority date: 1990-05-09
Filing date: 1990-05-09
Publication date: 1992-01-21

Abstract

PURPOSE: To improve the verification rate by reconstructing the neural network with updated learning input data, so that a fresh neural network that tracks the secular change of the voice is used. CONSTITUTION: A neural network outputs a decision value indicating whether the current input speaker is a registered speaker. The neural network is first constructed by learning based on the learning input data of the registered speakers. New input data of the registered speakers are then added to the existing learning input data to update it, and the neural network is reconstructed by additional learning based on the updated learning input data. Consequently, a speaker verification system is obtained whose verification rate does not deteriorate even when time has passed since the initial construction of the neural network.

Description

[Detailed Description of the Invention] [Field of Industrial Application] The present invention relates to a speaker verification system.

[Prior Art] The present applicant has proposed a speaker verification system that uses a neural network to output a decision value for determining whether the current input speaker is a registered speaker or an unregistered speaker (Acoustical Society of Japan lecture proceedings, 2-6-4, pp. 53-54, March 1989).

In this speaker verification system, the neural network is constructed by learning based on the learning input data of the registered speakers.

[Problems to be Solved by the Invention] In the prior art, however, the learning input data are fixed to data collected during one limited period.

As a result, the voice input at verification time may have changed over time relative to the voice used for learning, and the verification rate may deteriorate.

An object of the present invention is to provide a speaker verification system whose verification rate does not deteriorate even when time has passed since the initial construction of the neural network.

[Means for Solving the Problems] The invention according to claim 1 is a speaker verification system that uses a neural network to output a decision value for determining whether the current input speaker is a registered speaker or an unregistered speaker. After the neural network is constructed by learning based on the learning input data of the registered speakers, new input data of the registered speakers are accumulated onto the existing learning input data to update the learning input data, and the neural network is reconstructed by additional learning based on the updated learning input data.

The invention according to claim 2 uses, as inputs to the neural network, one or more of the following:
[1] temporal changes in the frequency characteristics of the speech,
[2] average linear prediction coefficients of the speech,
[3] average PARCOR coefficients of the speech,
[4] average frequency characteristics of the speech, and pitch frequency,
[5] average frequency characteristics of the speech waveform after high-frequency emphasis, and
[6] average frequency characteristics of the speech.

[Operation] The invention according to claim 1 has the following effect [1].

[1] In the present invention, as shown in FIG. 3, suppose that the existing learning input data of a registered speaker consist of n data items in chronological order, A1, A2, A3, ..., An (A1 being the oldest and An the newest).

When new input data An+1 of the registered speaker are sampled, An+1 is accumulated onto A1 to An, the n+1 data items A1, A2, A3, ..., An, An+1 become the updated learning input data, and the neural network is reconstructed using these updated learning input data.

That is, because the neural network is reconstructed from the updated learning input data, a fresh neural network that follows the change of the voice over time is used, and the verification rate can be improved.

The invention according to claim 2 has the following effect [2].

[2] Because one or more of the elements [1] to [6] recited in claim 2 are used as inputs to the neural network, the preprocessing needed to obtain the inputs is simple and takes little time, so the speaker verification system can easily run in real time without a complicated processing device.

[Embodiment] FIG. 1 is a block diagram showing the learning system of the neural network, FIG. 2 is a block diagram showing the speaker verification system using the neural network, and FIG. 3 is a schematic diagram showing the learning input data.

(A) First, the learning system of the neural network is described (see FIG. 1).

This system consists of a voice input section 11, a preprocessing section 12, a learning data storage section 13, and a neural network 14.

The configurations of the preprocessing section 12, the learning data storage section 13, and the neural network 14 are described below.

(1) Preprocessing section
The preprocessing section 12 applies simple preprocessing to the input speech and creates the input data for the neural network 14.

A specific example of the configuration of the preprocessing section 12 is as follows.

The preprocessing section 12 can be a combination of a low-pass filter, band-pass filters, and an averaging circuit.

[1] The high-frequency noise components of the input speech signal are cut by the low-pass filter. The input speech is then divided into four blocks of equal duration.

[2] The speech waveform is passed through band-pass filters of a plurality of (n) channels to obtain the frequency characteristics of each block, that is, of each fixed time interval.

The output signals of the band-pass filters are averaged by the averaging circuit over each block, that is, over each fixed time interval.

This preprocessing yields the "temporal change of the average frequency characteristics of the speech within fixed time intervals".
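The preprocessing steps above can be sketched in software. This is only an illustration under stated assumptions: FFT magnitude bands stand in for the band-pass filter bank and averaging circuit, the low-pass noise cut is omitted for brevity, and the function name `block_band_features`, the 16-channel default, and the 8 kHz sampling rate are hypothetical choices, not values fixed by the description of the preprocessing section.

```python
import numpy as np

def block_band_features(signal, n_channels=16, n_blocks=4):
    """Return an (n_blocks * n_channels,) vector of per-block band averages.

    Assumption: contiguous FFT magnitude bands replace the analog
    band-pass filters and the averaging circuit described in the text.
    """
    signal = np.asarray(signal, dtype=float)
    # Split the utterance into equal-length time blocks (leftover samples dropped).
    block_len = len(signal) // n_blocks
    features = []
    for b in range(n_blocks):
        block = signal[b * block_len:(b + 1) * block_len]
        spectrum = np.abs(np.fft.rfft(block))
        # Group FFT bins into n_channels contiguous bands and average each band.
        for band in np.array_split(spectrum, n_channels):
            features.append(band.mean())
    return np.array(features)

# Usage: one second of synthetic noise at a hypothetical 8 kHz sampling rate.
rng = np.random.default_rng(0)
x = rng.standard_normal(8000)
vec = block_band_features(x)
print(vec.shape)  # (64,)
```

With 4 blocks and 16 channels the vector has 64 elements, which matches the 64-dimensional feature vectors reported in the experiment.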

(2) Learning data storage section
The learning data storage section 13 stores the learning input data.

As shown in FIG. 3, the learning data storage section 13 updates the learning input data of the registered speakers. Suppose that the existing learning input data of a registered speaker consist of n data items in chronological order, A1, A2, A3, ..., An (A1 being the oldest and An the newest). When new input data An+1 of the registered speaker are sampled, An+1 is accumulated onto A1 to An, and the n+1 data items A1, A2, A3, ..., An, An+1 become the updated learning input data.

(3) Neural network
The neural network 14 is constructed using the existing learning input data and the updated learning input data held in the learning data storage section 13, and outputs values for determining whether or not the input speaker is a registered speaker.

A specific example of the configuration of the neural network 14 is as follows.

[1] Structure
The neural network 14 is, for example, a three-layer perceptron. The number of input units is 4 × n, corresponding to the 4 blocks and n channels of the preprocessing section 12, and the number of output units equals the number of registered speakers.

[2] Learning
The target values are as follows: (i) for a registered speaker, the output value of the corresponding output unit is 1 and the other output values are 0; (ii) for an unregistered speaker, the output values of all output units are 0.

(a) The speech of a registered speaker is preprocessed by the preprocessing section 12 and input to the neural network 14. The weights and transfer function of the neural network 14 are corrected so that the outputs approach the target values.

(b) The speech of an unregistered speaker is preprocessed by the preprocessing section 12 and input to the neural network 14. The weights and transfer function of the neural network 14 are corrected so that the outputs approach the target values.

Steps (a) and (b) are repeated until the error between the target values and the output values of the output units becomes sufficiently small (for example, 1 × 10^-[exponent illegible in the source]).

(B) Next, the speaker verification system using the neural network is described (see FIG. 2).

This system consists of the voice input section 11, the preprocessing section 12, and the neural network 14 described in (A) above, together with a determination section 15.

The determination section 15 receives the output pattern of the neural network 14. If the output value of any one of the output units of the neural network 14 exceeds a certain threshold and is close to "1", the determination section 15 recognizes the current input speaker as a registered speaker.

Specific results obtained with the above speaker verification system are described below.

(1) Speech samples collected over five months, from month 1 to month 5, were used as learning input data. Preprocessing was applied to the speech of 5 registered speakers and 25 unregistered speakers to obtain 64-dimensional feature vectors (4 blocks × 16 channels).

(2) A neural network was constructed by learning based on the learning input data of (1) above.

(a) Without additional learning
Using speech samples from month 1 only, 25 data items per registered speaker were taken as the learning input data. A neural network was constructed by learning based on these learning input data.

(b) With additional learning
(b-1) Using speech samples from month 1 only, 25 data items per registered speaker were taken as the learning input data. The neural network was initially constructed by learning based on these learning input data.

(b-2) Each month after month 1, 5 newly sampled data items were accumulated onto the existing learning input data to create updated learning input data, and the neural network was reconstructed by additional learning based on the updated learning input data.

(b-3) The additional learning of (b-2) was performed once each month from month 2 to month 5, four times in total.
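The growth of the learning set under schedule (b) can be worked out directly. The sketch assumes that the 5 new data items per month are counted per registered speaker, which the text does not state explicitly.

```python
initial_per_speaker = 25   # month 1, per registered speaker
added_per_round = 5        # assumption: counted per registered speaker
rounds = 4                 # additional learning in months 2 through 5

# Size of the learning set after the initial construction and after each round.
sizes = [initial_per_speaker + added_per_round * k for k in range(rounds + 1)]
print(sizes)  # [25, 30, 35, 40, 45]
```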

(3) Evaluation data of registered and unregistered speakers were input to the neural network of (2)(a) and to the neural network of (2)(b), and verification decisions were made.

The verification rate was 90.2% for the neural network without additional learning and 94.5% for the neural network with additional learning. That is, the present invention improved the verification rate by 4.3 percentage points.

One or more of the following can be used as the inputs to the neural network 14 that the preprocessing section 12 creates by preprocessing the input speech:
[1] temporal changes in the frequency characteristics of the speech,
[2] average linear prediction coefficients of the speech,
[3] average PARCOR coefficients of the speech,
[4] average frequency characteristics of the speech, and pitch frequency,
[5] average frequency characteristics of the speech waveform after high-frequency emphasis, and
[6] average frequency characteristics of the speech.

Element [1] can be used as the "temporal change of the average frequency characteristics of the speech within fixed time intervals", element [2] as the "temporal change of the average linear prediction coefficients of the speech within fixed time intervals", element [3] as the "temporal change of the average PARCOR coefficients of the speech within fixed time intervals", element [4] as the "temporal change of the average frequency characteristics and pitch frequency of the speech within fixed time intervals", and element [5] as the "temporal change of the average frequency characteristics, within fixed time intervals, of the speech waveform after high-frequency emphasis".

The linear prediction coefficients of [2] above are defined as follows.

It is known that there is generally a high correlation between neighboring sample values $\{x_t\}$ of a speech waveform.

It is therefore assumed that the following linear prediction is possible.

Linear predicted value:
$$\hat{x}_t = -\sum_{i=1}^{p} \alpha_i x_{t-i} \qquad (1)$$
Linear prediction error:
$$\varepsilon_t = x_t - \hat{x}_t \qquad (2)$$
where $x_t$ is the sample value of the speech waveform at time $t$ and $\{\alpha_i\}$ ($i = 1, \ldots, p$) are the ($p$-th order) linear prediction coefficients.

In implementing the present invention, the linear prediction coefficients $\{\alpha_i\}$ are determined so that the mean square value of the linear prediction error $\varepsilon_t$ is minimized.

Specifically, $(\varepsilon_t)^2$ is computed and its time average is written $\overline{(\varepsilon_t)^2}$. Setting
$$\frac{\partial\,\overline{(\varepsilon_t)^2}}{\partial \alpha_i} = 0, \qquad i = 1, 2, \ldots, p,$$
gives the equation (3) from which $\{\alpha_i\}$ is obtained. [Equation (3) is not reproduced in the source text.]

The PARCOR coefficients of [3] above are defined as follows.

Let $\{k_n\}$ ($n = 1, \ldots, p$) be the ($p$-th order) PARCOR coefficients (partial autocorrelation coefficients). The PARCOR coefficient $k_{n+1}$ is defined, by equation (4), as the normalized correlation coefficient between the forward residual $\varepsilon_t^{(f)}$ and the backward residual $\varepsilon_{t-(n+1)}^{(b)}$ of linear prediction. [Equation (4) is not reproduced in the source text.]

Here
$$\varepsilon_t^{(f)} = x_t - \sum_{i} \alpha_i x_{t-i}, \qquad \{\alpha_i\}: \text{forward prediction coefficients},$$
$$\varepsilon_{t-(n+1)}^{(b)} = x_{t-(n+1)} - \sum_{j} \beta_j x_{t-j}, \qquad \{\beta_j\}: \text{backward prediction coefficients}.$$

The pitch frequency of the speech in [4] above is the reciprocal of the repetition period (pitch period) of the vocal cord wave.

Because the pitch frequency, a basic parameter of the vocal cords that differs between individuals, is added as an input to the neural network, the speaker recognition rate can be improved, particularly between adult and child speakers and between male and female speakers.
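Since the pitch frequency is the reciprocal of the pitch period, one simple way to estimate it is to find the lag of the autocorrelation peak. The sketch below is an illustration of that common method, not the method of the patent; the function name, the 50 to 400 Hz search range, and the 8 kHz sampling rate are assumptions.

```python
import numpy as np

def estimate_pitch(signal, fs, fmin=50.0, fmax=400.0):
    """Estimate the pitch frequency in Hz as fs / (lag of the autocorrelation peak)."""
    signal = np.asarray(signal, dtype=float)
    signal = signal - signal.mean()
    # Autocorrelation for non-negative lags.
    ac = np.correlate(signal, signal, mode="full")[len(signal) - 1:]
    lag_min = int(fs / fmax)
    lag_max = int(fs / fmin)
    lag = lag_min + int(np.argmax(ac[lag_min:lag_max + 1]))
    return fs / lag

# Usage: a synthetic 200 Hz tone sampled at a hypothetical 8 kHz.
fs = 8000
t = np.arange(800) / fs
f0 = estimate_pitch(np.sin(2 * np.pi * 200 * t), fs)
print(f0)
```

Real speech would need voicing detection and some smoothing across frames, which this sketch leaves out.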

The high-frequency emphasis of [5] above compensates for the average slope of the spectrum of the speech waveform so that energy is not concentrated in the low-frequency range. The average slope of the spectrum of the speech waveform is common to all speakers and is irrelevant to speaker recognition.

If a speech waveform whose average spectral slope has not been compensated is input to the neural network as it is, the neural network tends to extract the average spectral slope first during learning, and it takes time to extract the spectral peaks and valleys that are needed for speaker recognition. In contrast, when the input to the neural network is high-frequency emphasized, the average spectral slope, which is common to all speakers and irrelevant to recognition yet affects learning, is compensated, and learning becomes faster.
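High-frequency emphasis is often implemented as a first-order pre-emphasis filter. The sketch below shows that common form as an assumption; the text does not specify the filter, and the coefficient 0.97 and the function name are hypothetical.

```python
import numpy as np

def pre_emphasize(signal, coeff=0.97):
    """Apply y[n] = x[n] - coeff * x[n-1]; the first sample is passed through."""
    signal = np.asarray(signal, dtype=float)
    return np.append(signal[0], signal[1:] - coeff * signal[:-1])

# Usage: a constant (purely low-frequency) input is almost cancelled.
y = pre_emphasize([1.0, 1.0, 1.0, 1.0])
print(y)
```

The filter attenuates slowly varying components and leaves rapid changes largely intact, which flattens the downward spectral slope described above.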

The above embodiment has the following effects [1] and [2].

[1] Because the neural network 14 is reconstructed from the updated learning input data, a fresh neural network 14 that follows the change of the voice over time is used, and the verification rate can be improved.

[2] Because one or more of the elements [1] to [6] described above, such as the "temporal change of the average frequency characteristics of the speech within fixed time intervals", are used as inputs to the neural network 14, the preprocessing needed to obtain the inputs is simple and takes little time, so the speaker verification system can easily run in real time without a complicated processing device.

[Effects of the Invention] As described above, the present invention provides a speaker verification system whose verification rate does not deteriorate even when time has passed since the initial construction of the neural network.

[Brief Description of the Drawings]

FIG. 1 is a block diagram showing the learning system of the neural network, FIG. 2 is a block diagram showing the speaker verification system using the neural network, and FIG. 3 is a schematic diagram showing the learning input data.
11: voice input section; 12: preprocessing section; 13: learning data storage section; 14: neural network; 15: determination section.
Patent applicant: Sekisui Chemical Co., Ltd. (representative's name partly illegible in the source: 廣…馨)

Claims

[Claims]

(1) A speaker verification system that uses a neural network to output a decision value for determining whether the current input speaker is a registered speaker or an unregistered speaker, wherein, after the neural network is constructed by learning based on learning input data of registered speakers, new input data of the registered speakers are accumulated onto the existing learning input data to update the learning input data, and the neural network is reconstructed by additional learning based on the updated learning input data.

(2) The speaker verification system according to claim 1, wherein one or more of the following are used as inputs to the neural network: [1] temporal changes in the frequency characteristics of the speech; [2] average linear prediction coefficients of the speech; [3] average PARCOR coefficients of the speech; [4] average frequency characteristics of the speech, and pitch frequency; [5] average frequency characteristics of the speech waveform after high-frequency emphasis; and [6] average frequency characteristics of the speech.