JPH0457099A

JPH0457099A - Voice recognizing device

Info

Publication number: JPH0457099A
Application number: JP2169062A
Authority: JP
Inventors: Koichi Yamaguchi; 耕市山口
Original assignee: Sharp Corp
Current assignee: Sharp Corp
Priority date: 1990-06-27
Filing date: 1990-06-27
Publication date: 1992-02-24
Anticipated expiration: 2013-09-03
Also published as: JP2792720B2

Abstract

PURPOSE: To recognize words of the recognition vocabulary in input speech even in a noisy environment, by configuring the device so that the feature quantities input to the event nets are shifted in time relative to one another and the position at which the output value is maximum is selected, and so that a supernet receives the outputs of the word nets and outputs a value corresponding to the recognized word to which the input speech belongs. CONSTITUTION: Speech input from a microphone 21 is amplified by an amplifier 22, A/D-converted by an A/D converter 23, and then input to an acoustic analysis part 24, which outputs the output power value of each BPF 25. A compression part 26 reduces the dimension of the feature vector of the input speech. All frames output from a feature vector storage part 27 are then input to the event nets 28. A word net output storage part 32 stores the output of a word net 31 for each word beginning selected by the event net, selects the largest stored value, and outputs it to a supernet 33. For each word net, only the maximum of its several outputs is selected and input to the supernet, and the output of the supernet is calculated. The calculated supernet output is sent to a result deciding part 34, which applies threshold decisions and outputs the recognition result to a result display part 35.

Description

DETAILED DESCRIPTION OF THE INVENTION [Field of Industrial Application] The present invention relates to a speech recognition device using a neural network.

[Prior Art] In general, a speech recognition device must extract only the speech section from the signal input from a microphone by removing the silent sections and noise sections before and after the utterance; that is, it must detect the speech section.

Detecting the speech section is not very difficult when the signal-to-noise ratio (hereinafter, S/N ratio) is good. In that case, it suffices to detect, as the speech section, the interval in which the power time series of the speech signal exceeds an appropriate threshold.
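The threshold rule described above can be sketched as follows. This is a minimal illustration only: the function name `detect_speech_section`, the sample power values, and the threshold are assumptions, and treating everything from the first to the last above-threshold frame as one section is a simplification not stated in the source.

```python
from typing import List, Optional, Tuple

def detect_speech_section(power: List[float], threshold: float) -> Optional[Tuple[int, int]]:
    """Return (first, last) indices of the frames whose power exceeds the threshold.

    Returns None when no frame exceeds the threshold.
    """
    above = [i for i, p in enumerate(power) if p > threshold]
    if not above:
        return None
    # Simplification: the section spans from the first to the last loud frame.
    return above[0], above[-1]

# Hypothetical per-frame power values.
frame_power = [0.1, 0.2, 3.0, 4.5, 0.3, 2.8, 0.1]
print(detect_speech_section(frame_power, threshold=1.0))
```

With a good S/N ratio the silent frames stay well below the threshold, which is why this simple rule is adequate in that case.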

In a real environment, however, the S/N ratio is degraded by various kinds of noise, and it becomes difficult to detect weak fricatives and the low-amplitude voiced sounds found at the beginning (hereinafter, word beginning) and end (hereinafter, word end) of speech. Non-stationary noise may also be erroneously detected as a speech section.

One method of detecting speech sections in a noisy environment is to select a suitable speech section from a plurality of section candidates.

In this method, speech recognition is actually performed on each of the candidate sections, and the section with the highest matching score is selected as the suitable speech section.

A further development of this method treats every time point in the data as a candidate word beginning and word end, performs speech recognition on every section, and finds the sections with high matching scores. One example is word spotting. Word-spotting matching uses the continuous dynamic programming method (hereinafter, continuous DP method).

As a word speech recognition device, there is the "speech recognition device" by Yamaguchi and Itamoto (Japanese Patent Application No. Hei 2-69248).

This word speech recognition device inputs feature quantities obtained by acoustic analysis of speech to the units of the input layer of a multilayer perceptron neural network, and obtains the speech recognition result from the output values of the units of the output layer.

In the above word speech recognition device, when the feature quantities obtained by acoustically analyzing each frame of the input speech are input to the units of the input layer of the event nets, the feature quantities input to each event net are shifted in time within a predetermined range, based on time interval information, in order from the vicinity of the word beginning detected in advance by a predetermined method. By selecting, among the time-shifted feature quantities, the position at which the output value of each event net is maximum, the device corrects for the time stretching and compression of the input speech and takes the maximum-output position of the final event net as the end of the input speech.

[Problems to be Solved by the Invention] A speech recognition device using word spotting based on the continuous DP method described above has a low ability to reject inputs outside the recognition vocabulary and low noise robustness. In addition, because only local frame-by-frame distances are observed, insertions of extra words and omissions of words and phonemes tend to occur, and because DP matching must be executed continuously, the amount of computation and memory required for inter-frame distances is large.

The word speech recognition device described above has the problem that the word beginning must be detected in advance by some method, and that misrecognition and rejection occur when the detection error is large.

SUMMARY OF THE INVENTION In view of the above problems of conventional speech recognition devices, an object of the present invention is to provide a speech recognition device using a neural network that can recognize the recognition vocabulary in input speech in a noisy environment.

[Means for Solving the Problems] The above object of the present invention is achieved by a speech recognition device comprising: event nets that receive feature quantities obtained by acoustically analyzing each frame of the input speech; word nets that receive the outputs of the event nets and output a value corresponding to the similarity between the input speech and a specific word of the recognition vocabulary; and a supernet that receives the outputs of the word nets and outputs a value corresponding to the recognized word to which the input speech belongs, wherein the event nets, based on time interval information obtained by analyzing speech samples of many speakers, shift the feature quantities in time within a predetermined range with an arbitrary time taken as the word beginning, select the position at which the output value is maximum among the time-shifted feature quantities, and output a value corresponding to the similarity to a partial phoneme sequence in the specific word.

[Operation] The event nets receive the feature quantities obtained by acoustically analyzing the input speech frame by frame. Based on time interval information between adjacent event nets, obtained by analyzing speech samples of many speakers, the feature quantities input to the individual event nets are shifted in time relative to one another within a predetermined range, with an arbitrary time taken as the word beginning. Each event net discriminates whether or not the input belongs to the word it is responsible for, selects the position at which its output value is maximum among the time-shifted feature quantities, and outputs a value corresponding to the similarity to a partial phoneme sequence in a specific word of the recognition vocabulary. The word nets receive the outputs of the event nets and output a value corresponding to the similarity between the input speech and the specific word. The supernet receives the outputs of the word nets and outputs a value corresponding to the recognized word to which the input speech belongs.

[Embodiment] An embodiment of the speech recognition device of the present invention is described in detail below with reference to the drawings.

FIG. 1 shows the configuration of the speech recognition device according to this embodiment.

The speech recognition device shown in FIG. 1 comprises: a microphone 21; an amplifier 22 connected to the microphone 21; an analog/digital converter (hereinafter, A/D converter) 23 connected to the amplifier 22; an acoustic analysis section 24 connected to the A/D converter 23 and having a plurality of band-pass filters (hereinafter, BPFs) 25 connected in parallel; a compression section 26 connected to the acoustic analysis section 24; a feature vector storage section 27 connected to the compression section 26; a plurality of event net groups 29 connected to the feature vector storage section 27, each having a plurality of event nets 28 connected in parallel; event net output storage sections 30, one provided in each event net group 29 and connected to the event nets 28; a plurality of word nets 31, each connected to the event net groups 29; a plurality of word net output storage sections 32, each connected to a word net 31; a supernet 33 connected to the plurality of word nets 31; a result determination section 34 connected to the supernet 33; and a result display section 35 connected to the result determination section 34.

Next, the operation of the speech recognition device shown in FIG. 1 is described.

First, speech input from the microphone 21 is amplified by the amplifier 22, converted from an analog signal to a digital signal by the A/D converter 23, and then input to the acoustic analysis section 24.

The acoustic analysis section 24 acoustically analyzes the input speech using the BPFs 25 and outputs the output power value of each BPF 25 for every frame.

The acoustic analysis is not limited to analysis by a BPF bank; parameters obtained by linear predictive coding (hereinafter, LPC), cepstral analysis, or the like may also be used.

To reduce the scale of the network, the compression section 26 uses the K-L transform to reduce the dimension of the feature vectors of the input speech.

The feature vector storage section 27 sequentially receives the feature vectors compressed by the K-L transform in the compression section 26.

Immediately after operation starts, however, there is not yet any actual input from the microphone 21, so the feature vector storage section 27 stores, as initial values, T seconds' worth of pseudo feature vectors of a silent section (here the value of T is a number that depends on the recognition vocabulary).

Because the speech recognition device of FIG. 1 does not perform word-beginning detection, all frames output from the feature vector storage section 27 are input to the event nets 28.

As shown in the figure, a plurality of event nets 28 are connected in parallel to form an event net group 29.

The feature vector storage section 27 is a ring buffer, as shown in FIG. 2, and the current storage location for feature vectors is indicated by the W pointer (for writing). The F pointer in the figure indicates the time (frame) of the assumed word beginning. In practice, because duration differs from word to word, processing becomes more efficient if the value of T is set separately for each word r (r = 1, 2, ..., R, where R is the vocabulary size). Here, word r refers to the reference pattern composed of event nets and word nets.
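The ring-buffer behaviour of the feature vector storage section can be sketched as follows. The class name `FeatureRingBuffer`, its methods, and the choice to treat the oldest stored frame as the assumed word beginning (the F pointer) are illustrative assumptions; the source only states that the store is a ring buffer with a W pointer and an F pointer and that it is initially filled with silence.

```python
class FeatureRingBuffer:
    """Fixed-capacity store of feature vectors, pre-filled with a silence vector."""

    def __init__(self, capacity, silence_vector):
        self.capacity = capacity
        # Before any real input arrives, every slot holds a pseudo silence frame.
        self.frames = [list(silence_vector) for _ in range(capacity)]
        self.w = 0  # W pointer: slot that receives the next frame

    def write(self, vector):
        """Store the newest frame and advance the W pointer, wrapping around."""
        self.frames[self.w] = list(vector)
        self.w = (self.w + 1) % self.capacity

    def read_from_oldest(self, offset):
        """Return the frame `offset` steps after the oldest stored frame.

        After a write, the W pointer rests on the oldest frame, which this
        sketch uses as the assumed word beginning (F pointer).
        """
        return self.frames[(self.w + offset) % self.capacity]


buf = FeatureRingBuffer(capacity=3, silence_vector=[0.0])
for value in (1.0, 2.0, 3.0, 4.0):
    buf.write([value])
print(buf.read_from_oldest(0), buf.read_from_oldest(2))
```

The capacity would correspond to T seconds of frames; a per-word T, as the text suggests, would amount to reading a shorter window from the same buffer.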

If the current time is $t_b$, the W pointer is at $t_b$, and the word beginning of word r is denoted $t_f^r$.

The value of T may be set to roughly the maximum duration among the words of the vocabulary; in this embodiment T = 1.2 seconds.

When the current time is $t_b$, the word beginnings assumed for word r are all the frames belonging to the interval $[t_f^r,\ t_f^r + \Delta]$. Here $\Delta$ is given by $\Delta = t_b - t_f^r - T_{\min}^r$, where $T_{\min}^r$ is the minimum conceivable duration of word r.
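As a worked illustration of this interval, the sketch below computes $\Delta$ and the candidate word-beginning frames. All times are expressed in frames, and the function name and the numeric values are assumptions chosen for the example.

```python
def word_beginning_candidates(t_b, t_f, t_min):
    """Return (delta, candidate frames) for one word.

    t_b   : current time (frame index)
    t_f   : earliest assumed word beginning for the word
    t_min : minimum conceivable duration of the word, in frames
    """
    delta = t_b - t_f - t_min
    # Every frame in [t_f, t_f + delta] is assumed to be a word beginning.
    return delta, range(t_f, t_f + delta + 1)


# Hypothetical values: 120 buffered frames, minimum word duration 30 frames.
delta, candidates = word_beginning_candidates(t_b=200, t_f=80, t_min=30)
print(delta, candidates[0], candidates[-1])
```

The latest candidate is exactly `t_b - t_min`: a word that began any later could not yet have reached its minimum duration at the current time.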

FIG. 3 shows the relationship among the current time $t_b$, the word beginning $t_f^r$ of word r, the minimum duration $T_{\min}^r$, and $\Delta$.

Next, the method by which the speech recognition device of FIG. 1 detects the word beginning is described.

First, all frames in the interval $[t_f^r,\ t_f^r + \Delta]$, that is, $t_f^r,\ t_f^r + 1,\ t_f^r + 2,\ \ldots,\ t_f^r + \Delta$, are assumed to be word beginnings.

When $t_f^r$ is the word beginning, and the search range of the first event net $E_{r1}$ of word r is set to K frames before and after it (K generally differs from word to word, but is taken as 3 here), the centers of the frames on which event net $E_{r1}$ operates are $t_f^r - 3,\ t_f^r - 2,\ \ldots,\ t_f^r + 3$.

Likewise, when $t_f^r + 1$ is the word beginning, the centers of the frames on which event net $E_{r1}$ operates are $t_f^r - 2,\ t_f^r - 1,\ \ldots,\ t_f^r + 4$.

Of these frames, however, $t_f^r - 2,\ t_f^r - 1,\ \ldots,\ t_f^r + 3$ were already computed when $t_f^r$ was taken as the word beginning and are already held in the event net output storage section 30, so those results are reused.

Like the feature vector storage section 27, this event net output storage section 30 has a ring buffer structure. An event net output storage section 30 is provided in each event net group 29 corresponding to word r. That is, there are N event net output storage sections 30 for each word r (N is the number of event net groups 29; N = 5 in this embodiment).

As described above, for event net $E_{r1}$, when $t_f^r + 1$ is the word beginning, only the frame $t_f^r + 4$ has to be newly computed.
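The reuse of stored event net outputs can be sketched as a simple cache keyed by frame center. This is an illustration under assumptions: `event_net` stands for any callable that scores a window centered on a frame, and a dictionary replaces the ring-buffer store of the source for brevity.

```python
def make_cached_event_net(event_net):
    """Wrap an event net so that each frame center is evaluated only once."""
    cache = {}      # stands in for the event net output storage section
    computed = []   # records which centers actually triggered a computation

    def evaluate(center):
        if center not in cache:
            cache[center] = event_net(center)
            computed.append(center)
        return cache[center]

    return evaluate, computed


def outputs_for_word_beginning(evaluate, t_f, k=3):
    """Outputs for the frame centers t_f - k, ..., t_f + k."""
    return [evaluate(c) for c in range(t_f - k, t_f + k + 1)]


# Hypothetical event net: any function of the frame center will do here.
evaluate, computed = make_cached_event_net(lambda center: center * 0.01)
outputs_for_word_beginning(evaluate, t_f=10)   # computes centers 7..13
outputs_for_word_beginning(evaluate, t_f=11)   # only center 14 is new
print(computed)
```

Moving the assumed word beginning forward by one frame therefore costs one new evaluation per event net rather than a full window of evaluations.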

Similarly, for the event nets $E_{r2}$, $E_{r3}$, $E_{r4}$, and $E_{r5}$, the overlapping computations are read from the respective event net output storage sections 30. When a new computation is made, the output of the event net 28 is written to the corresponding event net output storage section 30.

In this way, the outputs of the event nets 28 at the current time $t_b$ are obtained for the word beginnings assumed from $t_f^r$ to $t_f^r + \Delta$.

Next, the word beginnings determined by maximum-value selection over the search range of event net $E_{r1}$ in the interval $[t_f^r,\ t_f^r + \Delta]$ are denoted $f_1', f_2', \ldots, f_p'$.

Here p is a value satisfying $p < \Delta$ and is usually 2 to 3.

The word net output storage section 32 stores the output of the word net 31 obtained when event net $E_{r1}$ has selected each of the above word beginnings $f_j'$ (j = 1, 2, ..., p).

The largest of the values stored in the word net output storage section 32 is then selected and output to the supernet 33.

The basic operation of the event nets 28, word nets 31, and supernet 33 is described below.

In FIG. 4, the frame sequence of the feature vector sequence that falls within the range corresponding to the input layer of an event net 28 is input to each event net 28.

For a specific word to be recognized, the event nets 28 comprise N nets (N is a positive integer) whose input-layer feature vector sequences are shifted along the time axis; in this embodiment N = 5.

N may take a different value for each word.

For ordinary words of up to 3 to 4 syllables, N = 5; for long words of 5 or more syllables, N = [m/2 + 3.5] (where m is the number of syllables and [x] denotes the largest integer not exceeding x).
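The rule for choosing N can be written out directly. The function name `num_event_nets` is an assumption; the thresholds and the formula are those stated in the text, with `math.floor` playing the role of the bracket [x].

```python
import math

def num_event_nets(syllables):
    """Number of event nets N for a word with the given number of syllables."""
    if syllables <= 4:
        return 5
    # Long words (5 or more syllables): N = [m/2 + 3.5]
    return math.floor(syllables / 2 + 3.5)


for m in (3, 5, 6, 7, 8):
    print(m, num_event_nets(m))
```

Under this rule N grows by one for every two additional syllables, so longer words are covered by proportionally more partial phoneme sequences.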

Next, the method of shifting the feature vector sequence along the time axis during recognition is described.

Let $E_{ij}$ denote the j-th event net that recognizes the i-th word of the recognition vocabulary. The output layer of event net $E_{ij}$ has two units, $C_{ij}$ and $\bar{C}_{ij}$.

Event net $E_{ij}$ is trained so that $(C_{ij}, \bar{C}_{ij}) = (1, 0)$ when the partial phoneme sequence of the word it is responsible for (the i-th word) is input; with the duration of the word normalized to 1, this partial phoneme sequence lies around j/N from the word beginning.

Conversely, event net $E_{ij}$ is trained so that $(C_{ij}, \bar{C}_{ij}) = (0, 1)$ when anything other than the above partial phoneme sequence is input. That is, unit $C_{ij}$ takes a high value at a specific point in the word for which event net $E_{ij}$ is responsible.

The shift interval along the time axis is one frame of the compressed feature vector sequence. It may be set to two frames when the amount of computation is to be reduced.

Let n be the size of the shift range along the time axis (equal to the number of frames in the search range). The value of n differs from one event net $E_{ij}$ to another; in FIG. 4, n = 5 for event net $E_{i1}$ and n = 7 for event net $E_{i2}$.

The shifted instances of event net $E_{ij}$ are denoted $E_{ij1}, E_{ij2}, \ldots, E_{ijn}$ in order from the front, and their outputs are denoted in general by $C_{ij1}, C_{ij2}, \ldots, C_{ijn}$.

FIG. 4 shows part of these: $E_{i11}$, $E_{i12}$, $E_{i13}$, $E_{i21}$, $E_{i22}$, $C_{i11}$, and $C_{i12}$.

As the input to the word net 31, the maximum of these n values $C_{ij1}, C_{ij2}, \ldots, C_{ijn}$ is selected for each value of i and j.

The search range of event net $E_{i1}$ is a fixed amount, for example three frames, before and after the assumed word beginning. Alternatively, it may be set to a constant multiple of the standard deviation of the duration of the whole word, obtained from statistics over many speakers.

In the figure, the search range of each event net $E_{ij}$ is indicated by a horizontal arrow, and the position selected as the maximum by the maximum-value selection over the units $C_{ijl}$ (j = 1, 2, ..., 5) is drawn with a thick solid line.

For example, $E_{i12}$ is selected for event net $E_{i1}$ and $E_{i25}$ is selected for event net $E_{i2}$.

Next, let event net $E_{i,j-1}$ denote the event net immediately preceding event net $E_{ij}$ (j > 1). For example, the event net immediately preceding $E_{i4}$ is $E_{i,4-1}$, that is, $E_{i3}$. Hereinafter, the minus sign acts only on the subscript j of every symbol.
, (j〉1) (For example, the event net before event net E14 is E, that is, El3. Hereinafter, 714-1ゝ Inus (-) symbol is all ).

The search range of event net $E_{ij}$ (j > 1) is calculated as follows from the mean ($m_j$) and standard deviation ($\sigma_j$) of the time difference between event net $E_{ij}$ and event net $E_{i,j-1}$, obtained in advance from statistics over many speakers. Note that $m_j$ is constant regardless of j.
It is calculated as follows. Note that m is constant regardless of j.

The position of event net $E_{i,j-1}$ is determined by selecting the maximum among the outputs $C_{i,j-1,1}, C_{i,j-1,2}, \ldots, C_{i,j-1,n}$.

The search range of event net $E_{ij}$ extends from $m_j - K\sigma_j$ to $m_j + K\sigma_j$, measured from the maximum position of the output $C_{i,j-1}$. Here K is a constant of about 2 to 3. However, if $m_j - K\sigma_j$ is smaller than the maximum position of $C_{i,j-1}$, the latter position is used as the lower limit.

That is, denoting the search range by $(L_j, R_j)$, it is expressed as $L_j = \max(m_j - K\sigma_j,\ \text{maximum position of } C_{i,j-1})$ and $R_j = m_j + K\sigma_j$.
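The search range can be sketched numerically as follows. This is one reading of the definition, stated as an assumption: positions are absolute frame indices, the mean gap and standard deviation are added to the previous event net's maximum position, and the lower limit is clamped so that it never precedes that position. The function and parameter names are illustrative.

```python
def search_range(prev_max_pos, mean_gap, std_gap, k=2.0):
    """Return (left, right) frame limits of the search range for event net j.

    prev_max_pos : frame where the previous event net's output was maximum
    mean_gap     : mean time difference m_j to the previous event net (frames)
    std_gap      : standard deviation sigma_j of that difference (frames)
    k            : constant K, about 2 to 3 according to the text
    """
    left = max(prev_max_pos + mean_gap - k * std_gap, prev_max_pos)
    right = prev_max_pos + mean_gap + k * std_gap
    return left, right


# Hypothetical statistics, in frames.
print(search_range(prev_max_pos=10, mean_gap=8, std_gap=2, k=2.0))
print(search_range(prev_max_pos=10, mean_gap=3, std_gap=2, k=2.0))
```

In the second call the unclamped lower limit would fall before the previous event net's position, so the clamp keeps the event nets in temporal order.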

As an example, when j = 2, the above relation is used and the output $C_{i25}$ is selected as the maximum from the outputs $C_{i21}, C_{i22}, \ldots, C_{i27}$ (see FIGS. 4 and 5).
1 +22 25 is chosen as the maximum value (see Figures 4 and 5).

In the maximum-value selection, instead of simply taking $\max_l(C_{ijl})$, the following variations may also be considered, depending on the nature of the event net and the amount of computation.

First, when all the outputs $C_{ijl}$ (l = 1, 2, ..., n) are small, the center l = m of the search range is selected without performing maximum-value selection. This avoids unnecessary alignment for inputs other than the word for which event net $E_{ij}$ is responsible, and raises the rejection ability.

Second, when all the outputs $C_{ijl}$ (l = 1, 2, ..., n) are large, l = m is likewise selected, as in the first case. This avoids unnatural alignment when similar feature vectors continue for a long time, as in long vowels.

Third, when all the outputs $C_{ijl}$ (l = 1, 2, ..., m) are small, the search range is expanded by a fixed amount $\alpha$: with m = m + $\alpha$, the outputs $C_{ijl}$ are computed for l = m + 1, m + 2, ..., m + $\alpha$ and maximum-value selection is performed. This is particularly effective for samples with a slow speaking rate.
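The first two variations can be sketched as a selection function that falls back to the center of the search range. The thresholds `low` and `high` are hypothetical, since the source does not quantify "small" or "large", and the third variation (expanding the search range) is an alternative to the first and is not shown.

```python
def select_position(outputs, low=0.2, high=0.8):
    """Return the index within the search range chosen for one event net.

    outputs : values C_ijl for l = 1..n, stored as a 0-based list
    low     : hypothetical bound below which every output counts as small
    high    : hypothetical bound above which every output counts as large
    """
    center = len(outputs) // 2
    all_small = all(c < low for c in outputs)
    all_large = all(c > high for c in outputs)
    if all_small or all_large:
        # No informative peak: keep the center instead of forcing an alignment.
        return center
    return max(range(len(outputs)), key=lambda l: outputs[l])


print(select_position([0.1, 0.9, 0.3, 0.2, 0.1]))      # clear peak
print(select_position([0.1, 0.05, 0.1, 0.0, 0.1]))     # all small
print(select_position([0.9, 0.95, 0.9, 0.99, 0.9]))    # all large
```

Falling back to the center leaves the timing of the following event nets at its statistical mean, which is the behaviour the text describes for both the rejection case and the long-vowel case.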

Next, the training of the event nets 28, word nets 31, and supernet 33 is described.

The event nets 28, word nets 31, and supernet 33 are basically trained by the error backpropagation method for multilayer perceptron neural networks.

However, the event nets 28, word nets 31, and supernet 33 are trained not only on speech samples but also on silent samples, that is, noise sections. When training on a noise section, the teacher signal $(C_{ij}, \bar{C}_{ij}) = (0, 1)$ is given to the event net. In other words, the noise section is treated as not being the partial phoneme sequence for which that event net is responsible.

If the event net is responsible for a long silent section, such as a geminate consonant (sokuon), such noise-section samples are not given.

Whether or not to give a noise sample is decided by searching for samples whose error remains large during training; if such a sample is a noise sample, it is excluded from subsequent training.

For the word nets as well, when a noise sample is input, the teacher signal $(C_i, \bar{C}_i) = (0, 1)$ is given, on the grounds that the input is not the word for which that word net is responsible.

The supernet is trained on such word net outputs by giving 1 to the unit corresponding to rejection.

During actual speech recognition, $t_b$ is kept aligned with the current time and is incremented one frame at a time: $t_b + 1$, $t_b + 2$, and so on. The word beginning $t_f^r$ is accordingly also incremented one frame at a time.
is also incremented by one frame.

When all the word beginnings $t_f^r$ are incremented uniformly by one frame at a time, $t_f^r$ has the same value regardless of the word r.

With reference to the results computed by event net $E_{r1}$ over the interval $[t_f^r,\ t_f^r + \Delta]$ and stored in the event net output storage section 30, frames at which the output $C_{r1}$ of event net $E_{r1}$ is low may be skipped to make the computation more efficient.

Let the threshold be $\theta_l$ (usually 0.1 to 0.2). If $C_{r1} < \theta_l$ at $t_f^r + i$ (where $1 \le i \le \Delta$), the increment is set to i + 1; that is, the next assumed word-beginning frame is $t_f^r + i + 1$.
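One way to read this skipping rule is as a loop that advances past every candidate frame whose first-event-net output is below the threshold. The sketch below follows that reading, which is an interpretation rather than the source's stated procedure; the function name and sample values are assumptions.

```python
def next_word_beginning_offset(current, first_net_outputs, theta=0.15):
    """Return the next offset worth evaluating as a word beginning, or None.

    current           : offset of the word beginning just processed
    first_net_outputs : stored outputs C_r1, one per offset 0..delta
    theta             : threshold theta_l (the text suggests 0.1 to 0.2)
    """
    i = current + 1
    # Skip offsets at which the first event net responds only weakly.
    while i < len(first_net_outputs) and first_net_outputs[i] < theta:
        i += 1
    return i if i < len(first_net_outputs) else None


stored = [0.5, 0.05, 0.1, 0.6, 0.02]
print(next_word_beginning_offset(0, stored))
print(next_word_beginning_offset(3, stored))
```

Because the first event net's outputs are already held in the output store, this check costs only a comparison per skipped frame.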

With the above method, a plurality of word-beginning candidates exist for each word r at the current time $t_b$.

However, as the output of each word net, only the maximum of its plurality of outputs is selected.

The word net outputs selected in this way are input to the supernet, and the supernet output is calculated at every current time $t_b$.

The calculated supernet output is sent to the result determination section 34. The result determination section 34 outputs the recognition result to the result display section 35 by the threshold determination described next.

First, let $C_i$ be the value of the supernet output unit corresponding to the i-th word, and let n be the number of words in the recognition vocabulary. Further, let $C_{n+1}$ be the value of the supernet output unit corresponding to rejection, and let $\theta_a$ and $\theta_d$ be thresholds; in this embodiment $\theta_a = 0.6$ and $\theta_d = 0.1$.

Recognition is then performed according to the following rules:

If $\max_i(C_i) < \theta_a$, reject (Rule 1).

If $\max_i(C_i) - \max_{i \ne I}(C_i) < \theta_d$ (where I is the i satisfying $\max_i(C_i) = C_I$), reject (Rule 2).

If $C_{n+1} > \theta_a$, reject (Rule 3).

In cases other than Rules 1 to 3, the I satisfying $\max_i(C_i) = C_I$ is taken as the recognition result (Rule 4).
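The four rules can be sketched as a single decision function. The function name `decide`, the list layout (n word units followed by the rejection unit), and the use of `None` to signal rejection are assumptions; the default thresholds are the values given for this embodiment.

```python
def decide(outputs, theta_a=0.6, theta_d=0.1):
    """Apply Rules 1 to 4 to the supernet outputs.

    outputs : [C_1, ..., C_n, C_{n+1}], the last value being the rejection unit
    Returns the 0-based index of the recognized word, or None for rejection.
    """
    *words, reject = outputs
    best = max(range(len(words)), key=lambda i: words[i])
    others = [c for i, c in enumerate(words) if i != best]

    if words[best] < theta_a:                            # Rule 1: no confident word
        return None
    if others and words[best] - max(others) < theta_d:   # Rule 2: runner-up too close
        return None
    if reject > theta_a:                                 # Rule 3: rejection unit fires
        return None
    return best                                          # Rule 4: accept the best word


print(decide([0.9, 0.2, 0.1, 0.05]))
print(decide([0.9, 0.85, 0.1, 0.0]))
```

Rule 2 rejects ambiguous inputs even when the best score is high, which complements the explicit rejection unit checked by Rule 3.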

The above recognition result is input to the result display section 35 and displayed.

Speech outside the recognition vocabulary may also be used as training material for the event nets 28, word nets 31, and supernet 33. In that case, the training method is the same as for noise samples.

As the number of training samples increases, the time required for training to converge becomes longer, but the ability to reject inputs outside the recognition vocabulary improves, and words in the recognition vocabulary can also be found in continuously uttered speech.

Accordingly, the device also works effectively against relatively stationary noise.

In addition, when training the event nets 28, speech samples to which stationary noise at several levels has been added may also be included in the training data. Owing to the generalization ability of the neural network, speech can then be recognized correctly under stationary noise of various levels.
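One way to prepare such noise-added training samples is sketched below. The text does not specify the noise type or the levels, so this sketch assumes white Gaussian noise as the stationary noise and uses hypothetical signal-to-noise ratios of 20, 10, and 5 dB; the function names are illustrative only.

```python
import math
import random
from typing import List, Sequence


def add_stationary_noise(speech: Sequence[float], snr_db: float,
                         rng: random.Random) -> List[float]:
    """Add white Gaussian noise scaled to the requested SNR (in dB)."""
    if not speech:
        raise ValueError("speech must contain at least one sample")
    signal_power = sum(x * x for x in speech) / len(speech)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    sigma = math.sqrt(noise_power)
    return [x + rng.gauss(0.0, sigma) for x in speech]


def augment(speech: Sequence[float],
            snr_levels_db: Sequence[float] = (20.0, 10.0, 5.0),  # assumed levels
            seed: int = 0) -> List[List[float]]:
    """Return the clean sample followed by one noisy copy per SNR level."""
    rng = random.Random(seed)
    noisy = [add_stationary_noise(speech, snr, rng) for snr in snr_levels_db]
    return [list(speech)] + noisy
```

Each speech sample then yields one clean copy and several noisy copies, all of which would carry the same training target.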

[Effects of the Invention] The device comprises an event net that receives feature quantities obtained by acoustically analyzing the input speech frame by frame; a word net that receives the output of the event net and outputs a value corresponding to the similarity between the input speech and a specific word in the recognition vocabulary; and a supernet that receives the output of the word net and outputs a value corresponding to the recognition word to which the input speech belongs. The event net is configured so that, based on time interval information obtained by analyzing speech samples from many speakers, the feature quantities are shifted in time within a predetermined range with an arbitrary time taken as the word head, the position at which the output value is maximum among the time-shifted feature quantities is selected, and a value corresponding to the similarity to a partial phoneme sequence in the specific word is output. Word spotting can therefore be performed efficiently, the device does not malfunction on speech outside the recognition vocabulary, and only the words in the recognition vocabulary can be extracted automatically from continuously uttered speech. As a result, recognition of speech under noise, such as ambient noise, is improved.
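The time-shift and maximum selection performed for each event net can be illustrated with a small Python sketch. It is not the patent's implementation: the real event net is a trained neural network, whereas `score` here is a hypothetical stand-in, and `center`, `width`, and `max_shift` are assumed parameters standing for the expected position (derived from the time interval information), the window length in frames, and the permitted shift range.

```python
from typing import Callable, Sequence, Tuple

Frame = Sequence[float]


def best_shift(frames: Sequence[Frame], center: int, width: int,
               max_shift: int,
               score: Callable[[Sequence[Frame]], float]) -> Tuple[int, float]:
    """Slide a window of `width` frames around `center` and keep the best.

    Returns (start_index, score) of the window with the highest score.
    Windows that fall outside the stored frames are skipped.
    """
    best_start, best_value = -1, float("-inf")
    for shift in range(-max_shift, max_shift + 1):
        start = center + shift
        if start < 0 or start + width > len(frames):
            continue
        value = score(frames[start:start + width])
        if value > best_value:  # strict '>' keeps the earliest of equal maxima
            best_start, best_value = start, value
    if best_start < 0:
        raise ValueError("no window fits inside the available frames")
    return best_start, best_value


def toy_score(window: Sequence[Frame]) -> float:
    """Hypothetical scorer: mean of the first feature in each frame."""
    return sum(f[0] for f in window) / len(window)
```

With a toy sequence such as `[[0.0], [0.1], [0.9], [0.8], [0.2], [0.0]]`, calling `best_shift(frames, center=1, width=2, max_shift=2, score=toy_score)` selects the window starting at frame 2, where the toy score peaks.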

[Brief Description of the Drawings]

FIG. 1 is a block diagram showing the configuration of an embodiment of the speech recognition device of the present invention; FIG. 2 is a diagram showing the configuration of the feature vector storage unit in FIG. 1; FIG. 3 is a diagram showing the temporal relationship between the current time and the assumed word heads; FIG. 4 is a diagram showing the configuration of the neural networks in the speech recognition unit of FIG. 1; and FIG. 5 is a diagram illustrating selection of the maximum value of the event net outputs.

21...microphone, 22...amplifier, 23...A/D converter, 24...acoustic analysis unit, 25...band-pass filter, 26...compression unit, 27...feature vector storage unit, 28...event net, 29...event net group, 30...event net output storage unit, 31...word net, 32...word net output storage unit, 33...supernet, 34...result determination unit, 35...result display unit.

Claims

[Claims]

A speech recognition device comprising: an event net that receives feature quantities obtained by acoustically analyzing input speech frame by frame; a word net that receives the output of the event net and outputs a value corresponding to the similarity between the input speech and a specific word in the recognition vocabulary; and a supernet that receives the output of the word net and outputs a value corresponding to the recognition word to which the input speech belongs, wherein the event net is configured to shift the feature quantities in time within a predetermined range, with an arbitrary time taken as the word head, based on time interval information obtained by analyzing speech samples from many speakers, to select the position at which the output value is maximum among the time-shifted feature quantities, and to output a value corresponding to the similarity to a partial phoneme sequence in the specific word.