JPH05128286A

JPH05128286A - Keyword spotting system by neural network

Info

Publication number: JPH05128286A
Application number: JP3317545A
Authority: JP
Inventors: Hidefumi Sawai; 秀文沢井
Original assignee: Ricoh Co Ltd
Current assignee: Ricoh Co Ltd
Priority date: 1991-11-05
Filing date: 1991-11-05
Publication date: 1993-05-25

Abstract

PURPOSE: To establish an effective neural network configuration method and learning method for a highly versatile system for spotting spoken keywords. CONSTITUTION: When speech is input at a speech input unit 5, a feature extraction unit 1 performs feature analysis of the speech, such as an FFT (fast Fourier transform). Because the task is keyword spotting, speech segments are not cut out on the basis of waveform power. In a learning mode 6, a neural net learning unit 2 uses a neural network to perform discriminative learning between keywords and speech other than keywords (including noise). In a recognition mode 7, a neural network recognition unit 3 spots keywords using the trained neural network. A keyword detection unit 4 outputs the spotting result.

Description

Detailed Description of the Invention

[0001]

[Technical Field] The present invention relates to a speech keyword spotting method using a neural network. It is applicable, for example, to speech recognition devices in general that use a neural network.

[0002]

[Prior Art] A known document describing prior art related to the present invention is, for example, "A Study of Syllable Spotting Using a Time-Delay Neural Network" (Sawai, Waibel, Shikano, Proceedings of the Acoustical Society of Japan, 2-P-11, pp. 223-224, October 1988). That work used a neural network called a time-delay neural network (TDNN) to spot single syllables in Japanese. Taking "BA" as the example syllable, discriminative learning between "BA" and the five syllables "DA", "GA", "PA", "TA", and "KA", which are considered especially easy to confuse with it, was performed using the error backpropagation method. As a result, "BA" could be spotted in spoken Japanese words with good performance, and all syllables other than "BA" could be suppressed. In the prior art, however, neural networks were applied only to the very limited task of spotting single Japanese syllables.

[0003]

[Object] The present invention has been made in view of the above circumstances. Its object is to establish an effective neural network configuration method and learning method for a highly versatile method of spotting spoken keywords, and to construct an effective neural network learning method even when the speech containing the keyword also contains noise. Specifically, in a keyword spotting method using a neural network, the words other than a given keyword are innumerable, so the words used for learning must be selected efficiently; to this end, words that form a phoneme minimal pair with the keyword (words that differ from it in only one phoneme) are selected. A further object, aimed at reducing the scale of the neural network, easing learning, and easing the selection of pairs of a keyword and non-keyword words, is to associate one neural network with one keyword. As a measure for when the number of keywords increases, a further object is to realize a more compact keyword spotting method by integrating a group of keywords into one neural network. Sounds other than keyword speech include noise as well as word speech, and a further object is to use the learning ability of the neural network to distinguish non-keyword sounds, including noise, from the keyword. Finally, because it is sometimes difficult to learn to distinguish noise-superimposed speech from the keyword all at once, a further object is to control the learning of the neural net stably by starting from noise-free pairs of the keyword and other words and then adding noise gradually.

[0004]

[Constitution] To achieve the above objects, the present invention is characterized in that: (1) in a keyword spotting method that comprises a feature extraction unit that receives a speech waveform containing a predetermined keyword and performs feature analysis, a neural net learning unit that trains a neural network using the features extracted by the feature extraction unit, a neural network recognition unit that recognizes keywords using the neural network trained by the neural net learning unit, and a keyword detection unit that outputs the keyword recognized by the recognition unit, and that performs keyword spotting using a neural network that distinguishes the keyword from non-keywords, the keyword is detected using a neural network trained on pairs of the keyword and a plurality of words that form phoneme minimal pairs with the keyword; further, (2) one neural network is associated with one keyword, the value "1" of the output unit of the neural network is associated with the keyword and "0" with words other than the keyword, and learning is performed using the error backpropagation method; further, (3) a single neural net is used for a plurality of keywords and has an output unit for each keyword; further, (4) noise is superimposed in addition to the words other than the keyword, and the neural net is trained so that it can distinguish the keyword from noise-superimposed non-keyword speech; and further, (5) in (4), the learning of the neural network is advanced step by step while the superimposed noise is added gradually. The invention is described below on the basis of embodiments.

[0005] FIG. 1 is a block diagram for explaining an embodiment of the keyword spotting method using a neural network according to the present invention. In the figure, 1 is a feature extraction unit, 2 is a neural net learning unit, 3 is a neural network recognition unit, 4 is a keyword detection unit, 5 is a speech input unit, 6 is a learning mode, and 7 is a recognition mode. First, when speech is input at the speech input unit 5, the feature extraction unit 1 performs feature analysis of the speech, such as an FFT (Fast Fourier Transform). Because the task is keyword spotting, speech segments are not cut out on the basis of waveform power. In the learning mode 6, the neural net learning unit 2 uses a neural network to perform discriminative learning between keywords and speech other than keywords (including noise). In the recognition mode 7, the neural network recognition unit 3 spots keywords using the trained neural network. The keyword detection unit 4 outputs the spotting result.
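The feature extraction step described above can be sketched as follows. This is only an illustration under stated assumptions: the function names, the frame length, and the hop size are hypothetical and do not come from the patent, and a naive DFT stands in for the FFT so that the sketch needs only the standard library. The point it shows is that the whole waveform is framed and analyzed, with no power-based cutting of speech segments.

```python
import cmath
import math

def frame_signal(samples, frame_len, hop):
    """Split the entire waveform into overlapping frames (no power-based endpointing)."""
    return [samples[s:s + frame_len]
            for s in range(0, len(samples) - frame_len + 1, hop)]

def magnitude_spectrum(frame):
    """Naive DFT magnitudes for bins 0..N/2; an FFT computes the same values faster."""
    n = len(frame)
    return [abs(sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(frame)))
            for k in range(n // 2 + 1)]

def extract_features(samples, frame_len=16, hop=8):
    """Return one spectral feature vector per frame (frame_len and hop are assumed values)."""
    return [magnitude_spectrum(f) for f in frame_signal(samples, frame_len, hop)]

# Usage: a test tone with exactly 2 cycles per 16-sample frame.
tone = [math.sin(2 * math.pi * 2 * t / 16) for t in range(64)]
features = extract_features(tone)
```

Each element of `features` is the spectrum of one frame; the sequence of frames is what a spotting network would later be scanned over.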

[0006] FIG. 2 is a diagram for explaining the arrangement of a keyword and speech other than the keyword in the feature vector space. In general, the feature space is multidimensional. 21 is the local region in which the keyword is distributed, and 21a to 21g are regions in which speech other than the keyword (including noise) is distributed. 22a to 22d are hyperplanes that separate the keyword from everything else. They are formed by the error backpropagation method in the neural net learning unit 2 of FIG. 1. If the keyword is, for example, "oNsei" (speech), then the words chosen as non-keywords are so-called minimal pair words that differ from it in only one phoneme (they need not be meaningful words, but meaningful words are preferable because they are easier to utter). Examples are "oNseN" (hot spring) and "KoNsei" (mixture). One way to create such words is to generate words artificially by replacing one phoneme at random or arbitrarily, and then to select the meaningful ones from among them. In the multidimensional space formed by the features of the whole word pattern, these words, which differ from the keyword in only one phoneme, occupy the positions 21a to 21g adjacent to the keyword. Therefore, once decision boundaries such as 22a and 22b have been formed by learning at the borders adjacent to the keyword, all words other than the minimal pairs can also be discriminated from the keyword automatically. The keyword can thus be spotted in speech while being distinguished from all other words.
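The selection of minimal pair words can be illustrated with a short sketch. Everything here is an assumption made for illustration: the phoneme inventory, the tiny lexicon, and the function names are hypothetical, and "differs in one phoneme" is taken to cover both a single substitution and a single insertion so that both examples from the text are produced.

```python
def one_phoneme_variants(keyword, inventory):
    """Return every phoneme sequence one substitution or one insertion away from keyword."""
    keyword = tuple(keyword)
    variants = set()
    # Substitute one phoneme at each position.
    for i in range(len(keyword)):
        for p in inventory:
            if p != keyword[i]:
                variants.add(keyword[:i] + (p,) + keyword[i + 1:])
    # Insert one phoneme at each possible position.
    for i in range(len(keyword) + 1):
        for p in inventory:
            variants.add(keyword[:i] + (p,) + keyword[i:])
    return variants

def select_negatives(keyword, inventory, lexicon):
    """Keep only the artificially generated variants that are real (meaningful) words."""
    lexicon_set = {tuple(w) for w in lexicon}
    return sorted(one_phoneme_variants(keyword, inventory) & lexicon_set)

# Usage with a hypothetical inventory and lexicon.
KEYWORD = ("o", "N", "s", "e", "i")                      # "oNsei" (speech)
INVENTORY = ("a", "i", "u", "e", "o", "k", "s", "t", "N")
LEXICON = [
    ("o", "N", "s", "e", "N"),                           # "oNseN" (hot spring)
    ("k", "o", "N", "s", "e", "i"),                      # "KoNsei" (mixture)
    ("s", "a", "k", "u", "r", "a"),                      # unrelated word, not a minimal pair
]
negatives = select_negatives(KEYWORD, INVENTORY, LEXICON)
```

The resulting `negatives` are the non-keyword training words; unrelated words are left out of training on the expectation that the learned boundaries reject them anyway.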

[0007] FIG. 3 shows an embodiment of a neural net (NN) for keyword spotting. In the figure, 31 is an input layer, 32 is an intermediate layer, 33 is an output layer (unit), 31a is the connection between the input layer and the intermediate layer, and 32a is the connection between the intermediate layer and the output layer. The figure uses a three-layer NN as an example, but four or more layers may of course be used. The figure shows the case where the output layer has a single output unit; in this case the output is "1" for the keyword and "0" for anything other than the keyword. This example corresponds to assigning one NN (and one output unit) to one keyword. When there are multiple keywords (say m), the output layer 33 has m output units. In that case, for the i-th keyword,

[0008]

[Table 1]

[0009] the network is trained so that the output units fire as shown in Table 1. For anything other than a keyword, a value of 0 is taught to all output units.
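The teacher signals for the multi-keyword case can be sketched as below. Table 1 is not reproduced in this text, so the pattern used here (only the unit for the i-th keyword set to 1, all other units 0) is an assumption consistent with the surrounding description; the function name is hypothetical, and the index is 0-based in the code whereas the text counts keywords from 1.

```python
def target_vector(keyword_index, num_keywords):
    """Build the teacher signal for an NN with one output unit per keyword.

    keyword_index is the 0-based keyword number, or None for a non-keyword
    (a minimal pair word or noise), in which case every unit is taught 0.
    """
    target = [0.0] * num_keywords
    if keyword_index is not None:
        target[keyword_index] = 1.0   # assumed pattern: only this keyword's unit fires
    return target

# Usage with m = 3 keywords.
second_keyword_target = target_vector(1, 3)    # [0.0, 1.0, 0.0]
non_keyword_target = target_vector(None, 3)    # [0.0, 0.0, 0.0]
```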

[0010] FIG. 4 is an explanatory diagram of the keyword spotting method. 10 is a speech waveform containing a keyword, 13 is the FFT output obtained by analyzing the speech waveform 10, and A is the neural net for keyword spotting shown in FIG. 3. The neural net 3 is scanned across the input in order, starting from the beginning of the speech, and the presence of the keyword is confirmed when the output unit 33 fires. The figure shows the output unit 33 firing when the scan reaches the interval in which the keyword is present.
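The scanning procedure can be sketched as a sliding window over the feature frames. This is a minimal sketch under assumptions: the window length, the firing threshold, and the function names are hypothetical, and `toy_score` is only a stand-in for the trained network's output unit, not the patent's network.

```python
def spot_keyword(frames, window, score_fn, threshold=0.5):
    """Slide a fixed-length window over the frames and report where the output fires."""
    hits = []
    for start in range(len(frames) - window + 1):
        output = score_fn(frames[start:start + window])
        if output > threshold:          # the output unit is considered to have fired
            hits.append((start, output))
    return hits

def toy_score(window_frames):
    """Stand-in for the trained NN: the mean feature value inside the window."""
    flat = [v for frame in window_frames for v in frame]
    return sum(flat) / len(flat)

# Usage: the "keyword" occupies frames 2 to 4 of a 7-frame input.
frames = [[0.0], [0.0], [1.0], [1.0], [1.0], [0.0], [0.0]]
hits = spot_keyword(frames, window=3, score_fn=toy_score, threshold=0.9)
```

Here `hits` contains only the window starting at frame 2, the one position where the window fully covers the keyword interval.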

[0011]

[Effects] As is clear from the above description, the present invention has the following effects.
(1) Effect corresponding to claim 1: Because minimal pair words are used as the non-keyword word speech, only the minimum necessary word pairs need to be constructed, which minimizes the amount of learning required of the neural network; in addition, words that were not used in learning are also distinguished from the keyword and rejected automatically.
(2) Effect corresponding to claim 2: Because one neural network is associated with one keyword, a smaller network, easier learning, and easier selection of non-keyword words can be achieved when the number of keywords is relatively small.
(3) Effect corresponding to claim 3: When the number of keywords is relatively large, integrating multiple keywords into one neural network yields a keyword spotting device whose neural network is more compact overall.
(4) Effect corresponding to claim 4: Because noise is learned as part of the non-keyword speech, noise is rejected automatically during keyword detection.
(5) Effect corresponding to claim 5: When noise is added to the non-keyword speech, learning of the neural network starts from the noise-free state and noise is superimposed gradually, so the learning proceeds stably.
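The gradual noise addition of effect (5) can be illustrated with a toy training loop. This is a sketch under loud assumptions, not the patent's procedure: a single sigmoid output unit stands in for the multi-layer network, the two-dimensional patterns, the noise schedule, the learning rate, and all function names are hypothetical, and the update is plain gradient descent on a cross-entropy loss (backpropagation reduced to one layer). Targets follow the text: 1 for the keyword, 0 for non-keyword speech.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def add_noise(pattern, std, rng):
    """Superimpose Gaussian noise of the given standard deviation on a pattern."""
    return [v + rng.gauss(0.0, std) for v in pattern]

def output(weights, bias, x):
    """Output unit value for input pattern x."""
    return sigmoid(sum(w * v for w, v in zip(weights, x)) + bias)

def train_with_noise_schedule(keyword, non_keyword, noise_schedule,
                              epochs_per_stage=200, lr=0.5, seed=0):
    """Train in stages, starting noise-free and raising the noise level stage by stage."""
    rng = random.Random(seed)
    weights = [0.0] * len(keyword)
    bias = 0.0
    for std in noise_schedule:                      # e.g. 0.0 first, then larger values
        for _ in range(epochs_per_stage):
            samples = [(keyword, 1.0),
                       (add_noise(non_keyword, std, rng), 0.0)]
            for x, target in samples:
                err = output(weights, bias, x) - target   # cross-entropy gradient term
                weights = [w - lr * err * v for w, v in zip(weights, x)]
                bias -= lr * err
    return weights, bias

# Usage: hypothetical 2-D patterns and a three-stage noise schedule.
weights, bias = train_with_noise_schedule([1.0, 0.0], [0.0, 1.0], [0.0, 0.1, 0.2])
```

The first stage sees only clean patterns, so the decision boundary is established before noisy non-keyword samples begin to perturb the updates.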

[Brief description of drawings]

FIG. 1 is a block diagram for explaining an embodiment of the keyword spotting method using a neural network according to the present invention.

FIG. 2 is a diagram for explaining the arrangement of a keyword and phonemes other than the keyword in the feature vector space.

FIG. 3 is a diagram showing an embodiment of a neural net (NN) for keyword spotting.

FIG. 4 is an explanatory diagram of the keyword spotting method.

[Explanation of symbols]

1: feature extraction unit, 2: neural net learning unit, 3: neural network recognition unit, 4: keyword detection unit, 5: speech input unit, 6: learning mode, 7: recognition mode.

Claims

[Claims]

1. A keyword spotting method using a neural network, comprising: a feature extraction unit that receives a speech waveform containing a predetermined keyword and performs feature analysis; a neural net learning unit that trains a neural network using the features extracted by the feature extraction unit; a neural network recognition unit that recognizes keywords using the neural network trained by the neural net learning unit; and a keyword detection unit that outputs the keyword recognized by the recognition unit, the method performing keyword spotting using a neural network that distinguishes the keyword from non-keywords, characterized in that the keyword is detected using a neural network trained on pairs of the keyword and a plurality of words that form phoneme minimal pairs with the keyword.

2. The keyword spotting method using a neural network according to claim 1, wherein one neural network is associated with one keyword, the value "1" of the output unit of the neural network is associated with the keyword and "0" with words other than the keyword, and learning is performed using the error backpropagation method.

3. The keyword spotting method using a neural network according to claim 1, wherein a single neural net is used for a plurality of keywords and has an output unit for each keyword.

4. The keyword spotting method using a neural network according to claim 1, wherein noise is superimposed in addition to the words other than the keyword, and the neural net is trained so that it can distinguish the keyword from noise-superimposed non-keyword speech.

5. The keyword spotting method using a neural network according to claim 4, wherein the learning of the neural network is advanced step by step while the superimposed noise is added gradually.