JP2515609B2

JP2515609B2 - Speaker recognition method

Info

Publication number: JP2515609B2
Application number: JP2075634A
Authority: JP
Inventors: 英樹麻生; 多喜夫栗田; 雅幸海野; 新吾西村
Original assignee: Agency of Industrial Science and Technology; Sekisui Chemical Co Ltd
Current assignee: National Institute of Advanced Industrial Science and Technology AIST; Sekisui Chemical Co Ltd
Priority date: 1990-03-27
Filing date: 1990-03-27
Publication date: 1996-07-10
Anticipated expiration: 2011-07-10
Also published as: JPH03276200A

Description

[Detailed Description of the Invention] [Industrial Field of Application] The present invention relates to a speaker recognition method suitable for recognizing the speaker from input speech, for example in an electronic lock.

[Prior Art] The present applicant has proposed a speaker recognition method using a neural network. In that method, speech of a registered speaker uttering a specific training word is input to the neural network, and a learning operation adjusts the transfer function and weights of each unit making up the network so that the network output for this input approaches a fixed target value. The speech of an arbitrary speaker is then input to the neural network built by repeating this learning operation, and whether the current speaker is a registered speaker is recognized from the corresponding network output.

[Problems to be Solved by the Invention] However, the conventional speaker recognition method using a neural network performs speaker recognition only on speech whose content is identical to the utterance content learned in advance (the training word). If speaker recognition were to be performed on input speech whose utterance content is not restricted, the neural network would have to exploit speaker information common to the various phonemes in the input speech, so a fairly long utterance would be required as input, and a high recognition rate would be difficult to obtain.

An object of the present invention is to obtain a high recognition rate from a relatively short utterance in speaker recognition based on input speech whose utterance content is not restricted.

[Means for Solving the Problems] The invention according to claim 1 is a speaker recognition method using a neural network that receives speech feature quantities at its input layer and has been trained so that each unit of its output layer corresponds to a speaker, wherein the input speech is divided into a plurality of frames at a predetermined period, the feature quantities extracted from each frame are input to the neural network frame by frame, and the speaker is determined on the basis of those output values, among the per-frame output values obtained for that input, that are selected by preset thresholds.

The invention according to claim 2 is the invention according to claim 1 in which, further, the speaker is determined on the basis of the sum or the product of the output values.

[Operation] In the present invention, the training speech is first divided into a plurality of frames at a predetermined period, and the neural network is constructed through a learning operation in which the feature quantities extracted from each frame are input to the network frame by frame. The training speech may be either an entire sentence of some length (for example, "I will be going out to Tokyo tomorrow, so I am sorry, but I will be away", 「明日は東京に出ますのですみませんが留守にします。」) or representative phonemes selected from such a sentence (for example, "a", "i", ...).

During recognition with the neural network constructed by this learning, the input speech of an unspecified speaker, with arbitrary utterance content, is divided into a plurality of frames at the predetermined period in the same way as during training, and the feature quantities extracted from each frame are input to the neural network frame by frame. The network output values are then obtained for each frame. Each of these output values suggests a speaker for a short-time input (the input of one frame). In the present invention, preset thresholds are used to select, from all the output vectors, only those that suggest a single speaker with at least a certain degree of confidence (in other words, highly reliable output vectors), and a single speaker recognition result is obtained by evaluating the whole sequence of selected output vectors through their sum, their product, or the like.

That is, the present invention is characterized in particular by the configuration in which "the feature quantities extracted from each frame are input to the neural network frame by frame, and the speaker is determined on the basis of those output values, among the per-frame output values obtained for that input, that are selected by preset thresholds".

In other words, the invention is characterized in that the speaker is first judged roughly for each frame, only the frames whose output values give a clear judgment are selected, and the final determination of the speaker is made on the basis of the judgments so obtained.

Therefore, according to the invention of claim 1, one frame is short enough to fall within a phoneme or a transition between phonemes, so the characteristic configuration described above makes comparison at the phoneme level possible. By learning a variety of phonemes, the method can handle arbitrary utterances without restricting their content, and a high recognition rate can be obtained even from a short utterance. In the invention of claim 2, the recognition rate can be further improved by evaluating the output sequence as a whole through its sum or product.

Furthermore, because only the frames whose preliminary judgment is clear are selected for the final determination of the speaker, the reliability of the recognition result is markedly improved.

[Embodiment] FIG. 1 is a block diagram showing a speaker recognition device used to carry out the present invention, and FIG. 2 is a process diagram showing the speaker recognition principle of the present invention.

As shown in FIG. 1, the speaker recognition device 10 comprises a speech input unit 11, a preprocessing unit 12, a neural network 13, an output vector selection unit 14A, an output vector calculation unit 14B, and a determination unit 15. An embodiment of the present invention using this speaker recognition device 10 is described below.

(A) Learning

Five male speakers were chosen as the registered speakers to be learned, and the short sentence "I will be going out to Tokyo tomorrow, so I am sorry, but I will be away" (about 5 seconds) was prepared as training speech. This training speech was input to the speech input unit 11.

The preprocessing unit 12 subjected the above input speech to Fourier analysis at a sampling frequency of 10 kHz, a frame length of 25.6 msec, and a frame period of 12.8 msec (n frames in total), and obtained for each frame a 68-channel (1/12 octave) power vector covering the 100 to 5000 Hz band (see FIG. 2). This yields a sequence of n power vectors, each of dimension m = 68, as the training input data.
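The framing figures above can be checked with a short sketch. It only derives the frame length and frame period in samples and one possible set of 1/12-octave band edges. The patent does not say how the band edges were placed or how power was accumulated per band, so `band_edges` (bands starting at 100 Hz, with the last band clipped at 5000 Hz) and the helper name `split_frames` are assumptions made for illustration.

```python
import math

SAMPLE_RATE_HZ = 10_000                       # sampling frequency from the embodiment
FRAME_LEN = round(0.0256 * SAMPLE_RATE_HZ)    # 25.6 msec frame length in samples
FRAME_HOP = round(0.0128 * SAMPLE_RATE_HZ)    # 12.8 msec frame period in samples
F_LOW_HZ, F_HIGH_HZ = 100.0, 5000.0           # analysis band from the embodiment
BANDS_PER_OCTAVE = 12                         # 1/12-octave channels


def band_edges():
    """Return band edges in Hz (assumed layout: start at 100 Hz, clip at 5000 Hz)."""
    n_bands = math.ceil(BANDS_PER_OCTAVE * math.log2(F_HIGH_HZ / F_LOW_HZ))
    edges = [F_LOW_HZ * 2 ** (k / BANDS_PER_OCTAVE) for k in range(n_bands)]
    edges.append(F_HIGH_HZ)
    return edges


def split_frames(samples):
    """Cut a sample sequence into overlapping frames; a trailing partial frame is dropped."""
    last_start = len(samples) - FRAME_LEN
    return [samples[s:s + FRAME_LEN] for s in range(0, last_start + 1, FRAME_HOP)]


if __name__ == "__main__":
    one_second = [0.0] * SAMPLE_RATE_HZ
    frames = split_frames(one_second)
    print(FRAME_LEN, FRAME_HOP, len(frames), len(band_edges()) - 1)
```

Under these assumptions the band count comes out at 68, which matches the dimension m = 68 stated in the text. Note that the lowest 1/12-octave bands are narrower than the spacing of a 256-point DFT at 10 kHz, so a real implementation would need some interpolation or filter-bank scheme that the source does not describe.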

The vectors obtained above were used as inputs to the neural network 13, which was trained thoroughly so that each unit of the output layer corresponds to one speaker.

The neural network 13 used here is a three-layer hierarchical network with 68 units in the input layer, 30 in the intermediate layer, and 5 in the output layer, trained with the error backpropagation method. As described above, each utterance yields as many 68-dimensional input vectors as there are frames. The target output values of the output-layer units, one target vector per registered speaker, are (1, 0, 0, 0, 0), (0, 1, 0, 0, 0), (0, 0, 1, 0, 0), (0, 0, 0, 1, 0), and (0, 0, 0, 0, 1).
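A minimal sketch of such a 68-30-5 network with backpropagation is given below. Only the layer sizes and the one-hot targets come from the text. The sigmoid units, bias terms, squared-error loss, learning rate, weight initialization, and the synthetic `toy_frame` inputs are all assumptions made for illustration, since the patent does not specify them.

```python
import math
import random

N_IN, N_HIDDEN, N_OUT = 68, 30, 5   # layer sizes stated in the embodiment


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


class ThreeLayerNet:
    def __init__(self, seed=0):
        rng = random.Random(seed)
        # One row per unit; the last entry of each row is a bias weight (assumption).
        self.w1 = [[rng.uniform(-0.1, 0.1) for _ in range(N_IN + 1)] for _ in range(N_HIDDEN)]
        self.w2 = [[rng.uniform(-0.1, 0.1) for _ in range(N_HIDDEN + 1)] for _ in range(N_OUT)]

    def forward(self, x):
        xb = list(x) + [1.0]                      # input plus constant bias term
        h = [sigmoid(sum(w * v for w, v in zip(row, xb))) for row in self.w1]
        hb = h + [1.0]                            # hidden activations plus bias term
        y = [sigmoid(sum(w * v for w, v in zip(row, hb))) for row in self.w2]
        return xb, hb, y

    def train_step(self, x, target, lr=0.3):
        """One backpropagation update; returns the squared error before the update."""
        xb, hb, y = self.forward(x)
        d_out = [(yk - tk) * yk * (1.0 - yk) for yk, tk in zip(y, target)]
        # Hidden deltas use the output weights as they were before this update.
        d_hid = [hb[j] * (1.0 - hb[j]) * sum(d_out[k] * self.w2[k][j] for k in range(N_OUT))
                 for j in range(N_HIDDEN)]
        for k in range(N_OUT):
            for j in range(N_HIDDEN + 1):
                self.w2[k][j] -= lr * d_out[k] * hb[j]
        for j in range(N_HIDDEN):
            for i in range(N_IN + 1):
                self.w1[j][i] -= lr * d_hid[j] * xb[i]
        return sum((yk - tk) ** 2 for yk, tk in zip(y, target))


def one_hot(speaker):
    """Target vector such as (1, 0, 0, 0, 0) for speaker index 0."""
    return [1.0 if k == speaker else 0.0 for k in range(N_OUT)]


def toy_frame(speaker):
    """Synthetic stand-in for a 68-dimensional power vector (not real speech data)."""
    return [1.0 if i % N_OUT == speaker else 0.0 for i in range(N_IN)]


def train(net, epochs=30):
    """Train on one toy frame per speaker and return the total loss of each epoch."""
    return [sum(net.train_step(toy_frame(s), one_hot(s)) for s in range(N_OUT))
            for _ in range(epochs)]
```

In the embodiment, each training example would be one frame's power vector paired with the one-hot target of the speaker who produced it. The toy inputs here only exercise the update rule.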

(B) Recognition

Next, the speaker is identified using the neural network 13 constructed in (A) above.

For an arbitrary utterance captured by the speech input unit 11, the preprocessing unit 12 obtains a sequence of n power vectors of dimension m = 68 in the same manner as above.

The vectors obtained above are input to the neural network 13 to obtain the following sequence of output vectors.

$\{X^1, X^2, \ldots, X^n\}$ ... (1)

$X^t = (X^t_1, \ldots, X^t_5)$ ... (2)

Here (1) denotes the sequence of output vectors for all frames, and (2) denotes the output vector for the t-th frame. If, in the output vector $X^t$ of (2), the value of $X^t_1$ is larger than the values of $X^t_2$ to $X^t_5$, this output vector suggests that the speaker for the t-th frame input is the first of the five speakers.

The output vector selection unit 14A selects, from all the output vectors $X^t$ obtained above, only those in which some one component $X^t_i$ (i = 1 to 5) is at or above the threshold θ1 and all the remaining components are at or below the threshold θ2.
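The selection rule can be sketched as follows. The function names and the default values θ1 = 0.8 and θ2 = 0.2 are hypothetical, as the patent gives no numerical thresholds.

```python
def is_confident(vec, theta1, theta2):
    """True if one component is >= theta1 and every other component is <= theta2."""
    for i, value in enumerate(vec):
        others_low = all(other <= theta2 for j, other in enumerate(vec) if j != i)
        if value >= theta1 and others_low:
            return True
    return False


def select_confident(outputs, theta1=0.8, theta2=0.2):
    """Keep only the per-frame output vectors that clearly point at a single speaker."""
    return [vec for vec in outputs if is_confident(vec, theta1, theta2)]


if __name__ == "__main__":
    outputs = [
        [0.90, 0.10, 0.05, 0.10, 0.10],   # clear vote for the first speaker: kept
        [0.60, 0.50, 0.10, 0.10, 0.10],   # ambiguous frame: dropped
        [0.10, 0.95, 0.30, 0.10, 0.10],   # third component exceeds theta2: dropped
    ]
    print(select_confident(outputs))
```

Frames that fall on silence or on sounds that the network cannot attribute to one speaker tend to produce ambiguous vectors, and this filter keeps them out of the final decision.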

The output vector calculation unit 14B evaluates the sequence of output vectors selected above as a whole by the following methods (a) and (b), recognizes which speaker the input speech belongs to, and displays the recognition result on the determination unit 15.

(a) The speaker s for which the product of the output components $X^t_s$, that is $\prod_t X^t_s$, is largest.

(b) The speaker s for which the sum of the output components $X^t_s$, that is $\sum_t X^t_s$, is largest.

As an example with arbitrary utterances, a speaker recognition experiment was carried out in which the network was trained on the short sentence "I will be going out to Tokyo tomorrow, so I am sorry, but I will be away" and tested on three words, 「ただいま」 ("I'm home"), 「こんにちわ」 ("hello"), and 「おはようございます」 ("good morning"). All five speakers were identified correctly.
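Decision rules (a) and (b) can be sketched as below, applied to output vectors that have already passed the threshold selection. The function names are hypothetical, and returning `None` when no frame was selected is an assumption, since the patent does not say how that case is handled.

```python
import math


def decide_by_product(selected):
    """Rule (a): index of the speaker s with the largest product over t of X^t_s."""
    if not selected:
        return None   # assumption: no decision when no frame passed the selection
    n_speakers = len(selected[0])
    products = [math.prod(vec[s] for vec in selected) for s in range(n_speakers)]
    return max(range(n_speakers), key=products.__getitem__)


def decide_by_sum(selected):
    """Rule (b): index of the speaker s with the largest sum over t of X^t_s."""
    if not selected:
        return None   # assumption: no decision when no frame passed the selection
    n_speakers = len(selected[0])
    totals = [sum(vec[s] for vec in selected) for s in range(n_speakers)]
    return max(range(n_speakers), key=totals.__getitem__)


if __name__ == "__main__":
    selected = [
        [0.90, 0.10, 0.05, 0.10, 0.10],
        [0.85, 0.20, 0.10, 0.10, 0.00],
        [0.10, 0.95, 0.10, 0.10, 0.10],
    ]
    print(decide_by_product(selected), decide_by_sum(selected))
```

Speaker indices here are zero-based, so index 0 corresponds to the first speaker in the text. Over many frames a raw product of values below 1 can underflow, so summing logarithms is a common substitute, although the source describes only the plain product.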

Next, the operation of the above embodiment is described.

In the above embodiment, the training speech was first divided into a plurality of frames at a predetermined period. For each frame, a vector (feature quantity) representing the rough shape of the short-time spectrum (frame length 25.6 msec) was obtained, and the neural network 13 was constructed through a learning operation in which this sequence of vectors was input to the network frame by frame.

During recognition with the trained neural network 13, the input speech of an unspecified speaker, with arbitrary utterance content, was divided into a plurality of frames at the predetermined period in the same way as during training. For each frame, a vector (feature quantity) representing the rough shape of the short-time spectrum was obtained, and this sequence of vectors was input to the neural network 13 frame by frame. A sequence of per-frame output vectors (output values) of the neural network 13 was then obtained for that input. Each output vector in the sequence suggests a speaker for a short-time input (the input of one frame), and in the above embodiment the output vector calculation unit 14B evaluated the sequence as a whole through its sum or product to obtain a single speaker recognition result.

That is, the present invention is characterized in particular by the configuration in which "the feature quantities extracted from each frame are input to the neural network 13 frame by frame, and the speaker is determined on the basis of those output values, among the per-frame output values obtained for that input, that are selected by preset thresholds".

Therefore, according to the present invention, one frame is short enough to fall within a phoneme or a transition between phonemes, so the characteristic configuration described above makes comparison at the phoneme level possible. By learning a variety of phonemes, the method can handle arbitrary utterances without restricting their content, and a high recognition rate can be obtained even from a short utterance. In addition, the recognition rate can be further improved by evaluating the output sequence as a whole through its sum or product.

[Effects of the Invention] As described above, according to the present invention, a high recognition rate can be obtained from a relatively short utterance in speaker recognition based on input speech whose utterance content is not restricted.

[Brief description of drawings]

FIG. 1 is a block diagram showing a speaker recognition device used to carry out the present invention, and FIG. 2 is a process diagram showing the speaker recognition principle of the present invention.

10: speaker recognition device; 11: speech input unit; 12: preprocessing unit; 13: neural network; 14A: output vector selection unit; 14B: output vector calculation unit; 15: determination unit.

Continuation of the front page

(72) Inventor: Shingo Nishimura, c/o Applied Electronics Research Laboratory, Sekisui Chemical Co., Ltd., 32 Wadai, Tsukuba, Ibaraki
Examiner: Satoshi Watanabe
(56) References: JP S59-111699 (JP, A); Proceedings of the Acoustical Society of Japan, October 1989, 2-P-19, pp. 167-168

Claims

(57) [Claims]

1. A speaker recognition method using a neural network that receives speech feature quantities at its input layer and has been trained so that each unit of its output layer corresponds to a speaker, wherein input speech is divided into a plurality of frames at a predetermined period, the feature quantities extracted from each frame are input to the neural network frame by frame, and the speaker is determined on the basis of those output values, among the per-frame output values obtained for that input, that are selected by preset thresholds.

2. The speaker recognition method according to claim 1, wherein the speaker is determined on the basis of the sum or the product of the output values.