JPH04369699A - Unspecified speaker voice recognition system using neural network - Google Patents

Unspecified speaker voice recognition system using neural network

Info

Publication number
JPH04369699A
Authority
JP
Japan
Prior art keywords
network
neural network
speaker
speakers
layer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
JP3147224A
Other languages
Japanese (ja)
Other versions
JPH0752359B2 (en)
Inventor
Hidefumi Sawai
沢井 秀文
Satoru Nakamura
悟 中村
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
A T R JIDO HONYAKU DENWA KENKYUSHO KK
Original Assignee
A T R JIDO HONYAKU DENWA KENKYUSHO KK
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by A T R JIDO HONYAKU DENWA KENKYUSHO KK filed Critical A T R JIDO HONYAKU DENWA KENKYUSHO KK
Priority to JP3147224A priority Critical patent/JPH0752359B2/en
Publication of JPH04369699A publication Critical patent/JPH04369699A/en
Publication of JPH0752359B2 publication Critical patent/JPH0752359B2/en
Anticipated expiration legal-status Critical
Expired - Fee Related legal-status Critical Current


Abstract

PURPOSE: To provide a neural network that extends a neural network architecture proposed for the recognition of a specific speaker or a limited number of speakers, so that the speech of unspecified speakers can be recognized. CONSTITUTION: A speech feature quantity is input to an input layer 3 in the form of a feature-parameter time series and propagated in parallel and simultaneously to first hidden layers 40, 41, ..., 4n to extract the features characteristic of the respective speakers; at the same time, the first hidden layer 40 extracts feature quantities effective for discriminating between the speakers. The extracted features are propagated to second hidden layers 50, 51, 52, ..., 5n, the network is trained by the error backpropagation method, and the result is output to an output layer 6.

Description

[Detailed Description of the Invention]

[0001]

[Industrial Field of Application] The present invention relates to a speaker-independent speech recognition method using a neural network, and more particularly to a speaker-independent speech recognition method applied in the field of speech recognition technology in which a neural network is used to recognize the speech of unspecified speakers.

[0002]

[Prior Art and Problems to Be Solved by the Invention] In recent years, neural networks have been actively applied in the field of speech recognition. In particular, ever since the time-delay neural network (TDNN) demonstrated high performance in phoneme recognition of the voiced plosives /b, d, g/, many networks based on the TDNN structure have been proposed, including networks for recognizing 18 consonants, networks for recognizing 23 phonemes, and networks for multi-speaker phoneme recognition.

[0003] However, no system has yet appeared that can recognize the speech of unspecified speakers in earnest at the level of phoneme recognition. Systems that perform phoneme recognition for a limited number of speakers have been proposed, for example in Hampshire, J. and A. Waibel: "The Meta-Pi Network: Connectionist Rapid Adaptation for High Performance Multi-Speaker Phoneme Recognition", Proceedings of the 1990 IEEE International Conference on Acoustics, Speech and Signal Processing, S3.9, pp. 164-168, 1990. Even for these recognition systems, however, performance on the speech of unknown speakers different from the training speakers had not been verified.

[0004] Therefore, the principal object of the present invention is to provide a speaker-independent speech recognition method using a neural network that reduces the training time and the number of training samples while enabling highly accurate recognition.

[0005]

[Means for Solving the Problems] In the present invention, networks trained for each individual speaker and a speaker-identification network trained to discriminate between speakers are integrated to form a single network, and the entire network is then constructed by additional training.

[0006]

[Operation] In the speaker-independent speech recognition method using a neural network according to the present invention, a network trained for each speaker is integrated with a speaker-identification network trained to discriminate between speakers, and each network is trained individually; this reduces the training time and the number of training samples and enables highly accurate recognition.

[0007]

[Embodiments of the Invention] Fig. 1 is a schematic block diagram of an embodiment of the present invention. Referring to Fig. 1, a speech input signal is supplied to a feature analysis section 1, where FFT analysis or LPC analysis is performed; the resulting features are supplied to a neural network 2, which characterizes the present invention, where speech recognition is performed and the recognition result is output.

[0008] Fig. 2 is a detailed block diagram of the neural network shown in Fig. 1. Referring to Fig. 2, the neural network includes an input layer 3, first hidden layers 40, 41, 42, ..., 4n, second hidden layers 50, 51, 52, ..., 5n, and an output layer 6. The first hidden layer 41 is a subnetwork trained on training samples of speaker 1, and the second hidden layer 51 is a subnetwork trained on training samples of the same speaker 1; likewise, the first hidden layer 42 and the second hidden layer 52 are subnetworks trained on training samples of speaker 2, and the first hidden layer 4n and the second hidden layer 5n are subnetworks of speaker N. The first hidden layer 40, called the speaker-identification network, is a subnetwork that uses training samples of speakers 1 through N to determine which speaker a given phoneme belongs to. The output layer 6 finally determines the phoneme category C1, C2, ..., Ck, ..., CK from the values of its output units.
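The modular layout just described can be summarized in code. The following is a minimal sketch; the subnetwork sizes, the field names, and the helper `build_network` are illustrative assumptions, since the patent does not specify layer dimensions:

```python
# Sketch of the network of Fig. 2: one speaker-identification
# subnetwork (first hidden layer 40) plus N per-speaker subnetworks
# (first hidden layers 41..4n), each feeding its own second hidden
# layer (50..5n).  All sizes are assumptions for illustration.

def build_network(n_speakers, n_categories, hidden1=8, hidden2=4):
    """Describe the subnetworks and their roles as plain data."""
    net = {"input": "feature-parameter time series",
           "output_categories": n_categories,
           "subnets": []}
    # Subnetwork 0 is the speaker-identification network (layer 40).
    net["subnets"].append({"id": 0,
                           "role": "speaker-identification",
                           "hidden1_units": hidden1,
                           "hidden2_units": hidden2})
    # Subnetworks 1..N are each trained on one speaker's samples.
    for s in range(1, n_speakers + 1):
        net["subnets"].append({"id": s,
                               "role": f"speaker-{s}",
                               "hidden1_units": hidden1,
                               "hidden2_units": hidden2})
    return net

network = build_network(n_speakers=6, n_categories=18)
print(len(network["subnets"]))  # 7: the speaker-ID subnet plus 6 speaker subnets
```

Each entry corresponds to one independently trainable module, which is the property the modular architecture is designed to exploit.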

[0009] Next, the operation of this embodiment will be described. A speech feature quantity, input to the input layer 3 in the form of a feature-parameter time series, is propagated in parallel and simultaneously to the first hidden layers 41, 42, ..., 4n through the connections between the input layer 3 and those layers. Each subnetwork then extracts the features characteristic of its own speaker, while the first hidden layer 40 simultaneously extracts feature quantities that are effective for discriminating between the speakers.

[0010] Next, the outputs of the first hidden layers 40, 41, 42, ..., 4n are propagated to the second hidden layers 50, 51, 52, ..., 5n through the connections between the two sets of layers. As shown in Fig. 2, the connections from the second hidden layers 50, 51, 52, ..., 5n to the output layer 6 are arranged so that the k-th sublayer of each speaker's subnetwork is connected to the unit of the output layer 6 corresponding to the k-th category Ck. The speaker-identification network is connected in the same manner, except that its connections to the output layer 6 are full connections. To preserve modularity, the subnetworks are not connected to one another. This network can be trained by the error backpropagation method (McClelland, J. L., D. E. Rumelhart and the PDP Research Group: "Parallel Distributed Processing", Vol. 1, Chap. 8, MIT Press (1988)).
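The selective output wiring and the backpropagation step described above can be illustrated with a toy example. This is a sketch under assumed sizes, not the patent's implementation: the connection mask, the softmax output, and the learning rate are all assumptions introduced for the example.

```python
import math
import random

random.seed(0)
N, K, H = 3, 4, 5        # speakers, phoneme categories, ID-subnet units
M = N * K + H            # total second-hidden-layer units

# Activations of the second hidden layers (toy values).
h2 = [random.random() for _ in range(M)]

# mask[i][k] = 1 if unit i may connect to output category Ck:
# the k-th sublayer unit of a speaker subnetwork reaches only Ck,
# while the speaker-identification units are fully connected.
mask = [[0.0] * K for _ in range(M)]
for s in range(N):
    for k in range(K):
        mask[s * K + k][k] = 1.0
for i in range(N * K, M):
    for k in range(K):
        mask[i][k] = 1.0

# Output weights, zeroed wherever the mask forbids a connection.
W = [[random.gauss(0.0, 1.0) * mask[i][k] for k in range(K)]
     for i in range(M)]

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

# Forward pass, then one error-backpropagation step on the output
# weights; multiplying the gradient by the mask keeps every update
# inside the modular wiring.
target = [0.0, 1.0, 0.0, 0.0]        # true category C2, one-hot
y = softmax([sum(h2[i] * W[i][k] for i in range(M)) for k in range(K)])
for i in range(M):
    for k in range(K):
        W[i][k] -= 0.1 * h2[i] * (y[k] - target[k]) * mask[i][k]
```

Masking the gradient in this way is one simple reading of keeping the subnetworks unconnected while the integrated network is still trained jointly by backpropagation.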

[0011] Because the network that integrates the per-speaker networks and the speaker-identification network described above is highly modular, each subnetwork can be trained on its own. Compared with previously proposed networks, or with a simple four-layer network having a comparable number of degrees of freedom (network connections), it therefore has the advantage of greatly reducing the training time and the number of training samples. That the recognition rate is also stably high was demonstrated experimentally in Satoru Nakamura and Hidefumi Sawai: "A Study of Neural Network Architectures for Speaker-Independent Phoneme Recognition", IEICE Technical Report on Speech, SP90-61, December 20, 1990.
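A back-of-the-envelope connection count makes the modularity argument concrete. The layer sizes below are illustrative assumptions (the patent gives no dimensions); the point is only that training one subnetwork touches far fewer connections than retraining a monolithic four-layer network of comparable total width:

```python
# Assumed sizes: input dimension, per-subnet hidden sizes, phoneme
# categories, and number of speakers (all illustrative).
D, H1, H2, K, N = 100, 20, 10, 18, 6

# Modular network: N speaker subnets plus one speaker-ID subnet,
# each trainable in isolation, so one training run updates only:
per_subnet = D * H1 + H1 * H2 + H2 * K
print(per_subnet)        # connections touched when training one module

# A monolithic four-layer net of comparable total width must update
# every connection on every training run:
width1, width2 = H1 * (N + 1), H2 * (N + 1)
monolithic = D * width1 + width1 * width2 + width2 * K
print(monolithic)
```

Under these assumed sizes a single module has roughly a tenth of the connections of the monolithic net, which is the intuition behind the reduced training time and sample count.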

[0012]

[Effects of the Invention] As described above, according to the present invention, the neural network is composed of modules, namely a subnetwork for each speaker and a speaker-identification network, and each subnetwork can be trained individually. The training time and the number of samples can therefore be reduced, and highly accurate recognition becomes possible.

[Brief Description of the Drawings]

[Fig. 1] A schematic block diagram of an embodiment of the present invention.

[Fig. 2] A detailed block diagram of the neural network shown in Fig. 1.

[Explanation of Reference Numerals]

1  Feature analysis section
2  Neural network
3  Input layer
40, 41, 42, ..., 4n  First hidden layers
50, 51, 52, ..., 5n  Second hidden layers
6  Output layer

Claims (1)

[Claims]

[Claim 1] A speaker-independent speech recognition method using a neural network, characterized in that networks trained for each individual speaker and a speaker-identification network trained to discriminate between speakers are integrated to form a single network, the entire network being constructed by additional training.
JP3147224A 1991-06-19 1991-06-19 Speaker-Independent Speech Recognition Method Using Neural Network Expired - Fee Related JPH0752359B2 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP3147224A JPH0752359B2 (en) 1991-06-19 1991-06-19 Speaker-Independent Speech Recognition Method Using Neural Network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
JP3147224A JPH0752359B2 (en) 1991-06-19 1991-06-19 Speaker-Independent Speech Recognition Method Using Neural Network

Publications (2)

Publication Number Publication Date
JPH04369699A true JPH04369699A (en) 1992-12-22
JPH0752359B2 JPH0752359B2 (en) 1995-06-05

Family

ID=15425382

Family Applications (1)

Application Number Title Priority Date Filing Date
JP3147224A Expired - Fee Related JPH0752359B2 (en) 1991-06-19 1991-06-19 Speaker-Independent Speech Recognition Method Using Neural Network

Country Status (1)

Country Link
JP (1) JPH0752359B2 (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPS56119199A (en) * 1980-02-26 1981-09-18 Sanyo Electric Co Voice identifying device

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH06348675A (en) * 1993-06-07 1994-12-22 Ebara Corp Neuro-computer application equipment and machinery provided with the neuro-computer application equipment
CN104903954A (en) * 2013-01-10 2015-09-09 感官公司 Speaker verification and identification using artificial neural network-based sub-phonetic unit discrimination
JP2016509254A (en) * 2013-01-10 2016-03-24 センソリー・インコーポレイテッド Speaker verification and identification using artificial neural network based subphoneme discrimination

Also Published As

Publication number Publication date
JPH0752359B2 (en) 1995-06-05

Similar Documents

Publication Publication Date Title
FI117954B (en) System for verifying a speaker
EP0342630B1 (en) Speech recognition with speaker adaptation by learning
JP2764277B2 (en) Voice recognition device
KR100309205B1 (en) Voice processing apparatus and method
EP0549265A2 (en) Neural network-based speech token recognition system and method
CN112331216A (en) Speaker recognition system and method based on composite acoustic features and low-rank decomposition TDNN
Lippmann Speech perception by humans and machines
CN112331218B (en) Single-channel voice separation method and device for multiple speakers
US5758021A (en) Speech recognition combining dynamic programming and neural network techniques
Caminero et al. On-line garbage modeling with discriminant analysis for utterance verification
Matsuoka et al. Syllable recognition using integrated neural networks
CN114387997A (en) Speech emotion recognition method based on deep learning
Hansen et al. Robust speech recognition training via duration and spectral-based stress token generation
JPH04369699A (en) Unspecified speaker voice recognition system using neural network
Gammal et al. Combating reverberation in speaker verification
JPH04318900A (en) Multidirectional simultaneous sound collection type voice recognizing method
Webb et al. Speaker identification experiments using HMMs
Van Hout et al. Tackling unseen acoustic conditions in query-by-example search using time and frequency convolution for multilingual deep bottleneck features
International Neural Network Society (INNS), the IEEE Neural Network Council Cooperating Societies et al. Text-dependent speaker identification using learning vector quantization
Lohrenz et al. On temporal context information for hybrid BLSTM-based phoneme recognition
CN114420111B (en) One-dimensional hypothesis-based speech vector distance calculation method
JPH05204399A (en) Unspecified speaker's phoneme recognition method
Ting et al. Malay syllable recognition based on multilayer perceptron and dynamic time warping
Zhing-Xuan et al. A kind of fuzzy-neural networks for text-independent speaker identification
JPH0323920B2 (en)

Legal Events

Date Code Title Description
A01 Written decision to grant a patent or to grant a registration (utility model)

Free format text: JAPANESE INTERMEDIATE CODE: A01

Effective date: 19951219

LAPS Cancellation because of no payment of annual fees