JPH0546196A

JPH0546196A - Speech recognition device

Info

Publication number: JPH0546196A
Application number: JP3233993A
Authority: JP
Inventors: Haruyuki Hayashi (林 晴之)
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 1991-08-21
Filing date: 1991-08-21
Publication date: 1993-02-26
Anticipated expiration: 2015-02-14
Also published as: JP3008593B2

Abstract

PURPOSE: To correctly recognize input speech to which non-stationary noise has been added, by learning multiple noise patterns, including non-stationary noise, from past input speech patterns. CONSTITUTION: A noise learning unit 5 learns stationary and non-stationary noise from patterns judged to be non-speech sections by a speech detection unit 2 and a recognition unit 3. These are registered in a noise pattern storage unit 6, and the registered noise patterns are used in recognizing the next input speech. Consequently, a high recognition rate can be maintained even in environments with a large amount of non-stationary noise.

Description

Detailed Description of the Invention

[0001]

TECHNICAL FIELD The present invention relates to a speech recognition device, and more particularly to a noise learning method for a speech recognition device used in environments with a large amount of non-stationary noise.

[0002]

BACKGROUND ART Learning noise patterns is known to be effective for speech recognition devices used in the field, and has already been put to practical use.

[0003] A block diagram of this conventional speech recognition device is shown in FIG. 2. In the figure, an input signal a is first converted by an analysis unit 1 into an input pattern b expressed as feature vectors. Next, the input pattern b is divided by a speech detection unit 2 into speech sections and non-speech sections; the former are output to a recognition unit 3 as an input speech pattern c, and the latter are output to a noise learning unit 5 as a non-speech pattern d.

[0004] Next, the noise learning unit 5 either extracts from the non-speech pattern d, for example, the one section with the minimum power level, or calculates the average over all sections, and outputs the result to the recognition unit 3 as a noise pattern f.
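As a non-authoritative illustration of this conventional learning step, the following Python sketch derives a single noise pattern f from the non-speech pattern d. The function name, the use of NumPy, treating one frame as the "section", and taking the first feature component as the power level are all assumptions made for the example; the specification does not fix any of them.

```python
import numpy as np

def learn_single_noise_pattern(non_speech, mode="min_power"):
    """Return one noise pattern f from a non-speech pattern d.

    non_speech: array of shape (frames, dims); column 0 is assumed to hold power.
    mode: "min_power" picks the frame with the lowest power,
          anything else averages all frames.
    """
    frames = np.asarray(non_speech, dtype=float)
    if mode == "min_power":
        return frames[np.argmin(frames[:, 0])]
    return frames.mean(axis=0)
```

Either way only one pattern results, which is the limitation the invention addresses.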

[0005] Finally, the recognition unit 3 matches the input speech pattern c against each standard pattern e output from a standard pattern storage unit 4 with the noise pattern f concatenated before and after it, and outputs the category of the standard pattern with the highest resulting similarity as a recognition result g. Noise subtraction, in which the noise pattern f is subtracted from the input speech pattern c, may also be performed during matching.

[0006] Conventionally, on the premise that the noise is stationary, only a single noise pattern of this kind has been learned from the non-speech sections obtained by speech detection.

[0007] FIG. 6(a) shows an example power waveform of an input pattern without non-stationary noise, FIG. 6(b) shows an example power waveform of an input pattern in which a tongue-click sound is added before the utterance and a breath sound is added after it, and FIG. 6(c) shows an example power waveform of an input pattern in which a telephone ringing sound is added from just before to just after the utterance.

[0008] In the conventional method, only the stationary non-speech section SN is learned as the noise pattern for every one of the power waveforms in FIGS. 6(a) to 6(c). In the field, non-stationary noise causes more misrecognition than stationary noise, and many applications are used in environments with a large amount of non-stationary noise; nevertheless, noise learning in conventional speech recognition devices cannot learn non-stationary noise. Consequently, there is the problem that a high recognition rate cannot be maintained in the field.

[0009]

OBJECT OF THE INVENTION An object of the present invention is to provide a speech recognition device capable of learning non-stationary noise.

[0010]

PRINCIPLE OF THE INVENTION First, although non-stationary noise is generally not expected to repeat, in practice similar non-stationary noise often recurs while a device is used in a given environment.

[0011] For example, in applications with speech input over the telephone, such as banking services, the same speaker provides several dozen speech inputs in the same environment between the time the telephone line is connected and the time it is disconnected. During this period, in an office where, for example, telephone ringing or the sound of passing trains is loud and frequently heard, these non-stationary noises are repeatedly mixed into the input speech.

[0012] Likewise, speaker-specific non-stationary noise is often mixed in repeatedly, as with a person who habitually makes the chair creak while on the telephone, or a person whose tongue clicks, breath sounds, or nasal breathing sounds before and after utterances are loud.

[0013] Next, even when the same non-stationary noise is mixed in, certain words of a given speaker (noise-robust speech) are often recognized correctly, while certain other words (noise-sensitive speech) are often misrecognized. In other words, non-stationary noise added to noise-robust speech can be learned backward from the recognition result, and misrecognition can be prevented by using the noise learned in this way when recognizing noise-sensitive speech.

[0014]

CONSTITUTION OF THE INVENTION According to the present invention, there is provided a speech recognition device comprising: an analysis unit that converts an input signal into an input pattern; a speech detection unit that divides the input pattern into an input speech pattern and a non-speech pattern; a noise learning unit that learns noise patterns from the non-speech pattern; a noise pattern storage unit in which the noise patterns are registered; a standard pattern storage unit in which standard patterns prepared in advance are registered; and a recognition unit that outputs a recognition result from the input speech pattern, the standard patterns, and the noise patterns, wherein the noise learning unit includes: means for calculating the amount of change of the feature vector from the non-speech pattern; means for calculating a moving average of the amount of change over a predetermined number of frames; and means for detecting the section in which the moving average is at its minimum and taking it as a stationary noise pattern, and for detecting a section in which the moving average is larger than a preset value and taking it as a non-stationary noise pattern.

[0015]

EMBODIMENTS Embodiments of the present invention are described in detail below with reference to the drawings.

[0016] FIG. 1 is a block diagram of an embodiment of the present invention, in which parts equivalent to those in FIG. 2 are denoted by the same reference numerals. FIGS. 3 to 5 are processing flowcharts showing the operation of each unit in FIG. 1; the configuration and operation of FIG. 1 are described below with reference to FIGS. 3 to 5.

[0017] First, the input signal a is converted by the analysis unit 1 into an input pattern b expressed as feature vectors. The input pattern b is divided by the speech detection unit 2 into speech sections and non-speech sections (step 21); the former are output to the recognition unit 3 as the input speech pattern c (step 22), and the latter are output to the noise learning unit 5 as the non-speech pattern d (step 23). Here, the speech detection parameters may be set, or a hangover section may be added, so that the input speech pattern fully contains the true speech section.
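The hangover idea can be sketched as follows. This is only an illustration under stated assumptions: the specification does not say how speech is detected, so a simple per-frame power threshold is assumed here, and the function name, the `threshold` and `hangover` parameters, and the choice to extend the speech section on both sides are all hypothetical.

```python
def detect_speech(power, threshold, hangover=2):
    """Return one flag per frame; True marks frames passed on as input speech pattern c.

    power: sequence of per-frame power values (assumed detection feature).
    hangover: number of frames added on each side of every above-threshold frame.
    """
    active = [p > threshold for p in power]
    flags = list(active)
    for t, on in enumerate(active):
        if on:
            lo = max(0, t - hangover)
            hi = min(len(power), t + hangover + 1)
            for k in range(lo, hi):
                flags[k] = True
    return flags
```

Frames left as False would form the non-speech pattern d. A generous hangover makes it more likely that the true speech section is fully contained, at the cost of letting some noise into pattern c, which the later matching step is meant to absorb.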

[0018] Next, the noise learning unit 5 calculates, from the non-speech pattern d, the average of a section in which the change in the feature vector is small (steps 51, 52), and outputs it to the noise pattern storage unit 6 as a stationary noise pattern h (steps 53, 54). If the non-speech pattern d contains a section in which the change in the feature vector is large, that section is output to the noise pattern storage unit 6 as a non-stationary noise pattern i (steps 55, 54).
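A minimal sketch of this learning step is given below, using the moving average of the feature-vector change named in the constitution of the invention. The Euclidean distance between consecutive frames as the "amount of change", the default `window` and `threshold` values, and the function name are assumptions for illustration; the specification leaves them open.

```python
import numpy as np

def learn_noise_patterns(non_speech, window=3, threshold=1.0):
    """Split a non-speech pattern into a stationary and several non-stationary patterns.

    non_speech: array of shape (frames, dims), one feature vector per frame.
    Returns (stationary, non_stationary): a mean vector and a list of frame arrays.
    """
    frames = np.asarray(non_speech, dtype=float)
    if len(frames) <= window:
        raise ValueError("need more than `window` frames")
    # Amount of change: distance between consecutive feature vectors (assumed metric).
    change = np.linalg.norm(np.diff(frames, axis=0), axis=1)
    # Moving average of the change over `window` consecutive values.
    smooth = np.convolve(change, np.ones(window) / window, mode="valid")
    # Stationary noise pattern h: average of the frames in the minimum section.
    k_min = int(np.argmin(smooth))
    stationary = frames[k_min:k_min + window + 1].mean(axis=0)
    # Non-stationary noise patterns i: runs where the moving average exceeds the threshold.
    non_stationary, start = [], None
    for k, loud in enumerate(smooth > threshold):
        if loud and start is None:
            start = k
        elif not loud and start is not None:
            non_stationary.append(frames[start:k + window])
            start = None
    if start is not None:
        non_stationary.append(frames[start:])
    return stationary, non_stationary
```

Because each moving-average value spans several frames, the extracted non-stationary sections include a few quiet frames on either side of a burst; a real implementation might trim these.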

[0019] In FIG. 6, suppose that speech detection is performed accurately, so that everything outside the "VOICE" section becomes the non-speech pattern, and suppose further that the amount of change of the feature vector has roughly the same waveform as the power. In this case, in each of the examples, the "SN" section, where the change in the feature vector is small, is learned as the stationary noise pattern h, and the sections where the change is large (N1, N2, N3, N3') are learned as non-stationary noise patterns i.

[0020] Next, when some noise patterns have already been registered in the noise pattern storage unit 6, any registered noise patterns that are highly similar to the newly learned noise patterns (N1, N2, N3, N3') (call them N8 and N9) are discarded (steps 61, 62).

[0021] Also, because N3 and N3' are highly similar to each other, N3' is not registered.

[0022] Furthermore, when the number of noise patterns registered in the noise pattern storage unit 6 reaches or exceeds a predetermined number (M) (step 63), the noise pattern registered earliest (call it N7) is discarded (step 64).
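The registration and discard rules described for the noise pattern storage unit 6 can be sketched as below. The similarity test is passed in as a hypothetical `is_similar` callback because the specification does not define the similarity measure, and checking the limit M just before each new registration is likewise an assumption.

```python
def register_noise_patterns(store, learned, is_similar, max_patterns):
    """Update the stored noise patterns in place and return the list.

    store: list of registered patterns, oldest first.
    learned: newly learned patterns, in the order they were found.
    is_similar: hypothetical callback (a, b) -> bool for "high similarity".
    max_patterns: the predetermined number M.
    """
    # Steps 61-62: discard registered patterns that resemble a newly learned one.
    store[:] = [old for old in store
                if not any(is_similar(old, new) for new in learned)]
    for new in learned:
        # Skip a new pattern that resembles one already kept (N3' versus N3).
        if any(is_similar(new, kept) for kept in store):
            continue
        # Steps 63-64: at the limit, discard the earliest registered pattern.
        if store and len(store) >= max_patterns:
            store.pop(0)
        store.append(new)
    return store
```

With scalar stand-ins for patterns, a store holding N7, N8, N9 and a limit of M = 3, learning N1, N2, N3, N3' leaves N1, N2 and N3 registered: N8 and N9 are replaced by their newer look-alikes, N3' is skipped, and N7 is evicted as the oldest.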

[0023] Next, the recognition unit 3 matches the input speech pattern c against the standard patterns e output from the standard pattern storage unit 4 and the noise patterns f output from the noise pattern storage unit 6.

[0024] In one possible matching method, matching against the noise patterns starts from the beginning of the input speech pattern. The noise patterns here also include concatenations of a stationary noise pattern and a non-stationary noise pattern. Following the noise pattern that gave the highest similarity, the standard patterns are matched against the subsequent section of the input speech pattern, and finally the noise patterns are again matched against the section from that point to the end of the input speech pattern.
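One way to picture this noise, standard pattern, noise matching is the sketch below. It is a simplification, not the procedure of the specification: instead of matching the three sections one after another, it tries every combination of leading noise, standard pattern, and trailing noise as a single concatenated reference and scores each with plain dynamic time warping. The function names and the Euclidean frame distance are assumptions.

```python
import numpy as np

def dtw_distance(inp, ref):
    """Dynamic time warping distance between two (frames, dims) arrays."""
    n, m = len(inp), len(ref)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(inp[i - 1] - ref[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[n, m]

def recognize(inp, standards, noises):
    """Return the category whose noise + standard + noise reference fits best.

    standards: dict mapping category -> (frames, dims) array (standard patterns e).
    noises: list of (frames, dims) arrays (registered noise patterns f).
    """
    best_dist, best_category = np.inf, None
    for category, std in standards.items():
        for head in noises:
            for tail in noises:
                ref = np.vstack([head, std, tail])
                dist = dtw_distance(inp, ref)
                if dist < best_dist:
                    best_dist, best_category = dist, category
    return best_category
```

The exhaustive search grows with the square of the number of noise patterns, which is one reason a sequential scheme such as the one described above may be preferred in practice.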

[0025] As a result of the matching, the category of the standard pattern with the highest similarity is output as the recognition result g. In matching against the standard patterns, noise subtraction, in which the stationary noise pattern is subtracted from the input speech pattern, may also be performed.
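Noise subtraction of this kind can be illustrated in a few lines. The sketch assumes that the features are non-negative spectral magnitudes and clamps the result at zero; both points, and the function name, are assumptions rather than details from the specification.

```python
import numpy as np

def subtract_noise(speech, stationary_noise):
    """Subtract the stationary noise pattern from every frame of the input speech pattern.

    speech: (frames, dims) array; stationary_noise: (dims,) vector.
    Negative results are clamped to zero (assumed, for magnitude-like features).
    """
    speech = np.asarray(speech, dtype=float)
    noise = np.asarray(stationary_noise, dtype=float)
    return np.maximum(speech - noise, 0.0)
```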

[0026] Next, when the recognition result is correct (step 31), a matching backtrace against the correct standard pattern is performed (step 32). The section of the input speech pattern corresponding to the standard pattern is judged to be the true speech section, the section of the input speech pattern corresponding to the noise patterns is judged to be a non-speech section, and the latter is output to the noise learning unit 5 as a non-speech pattern j (step 33).

[0027] In FIG. 6, suppose that speech detection is not performed accurately and the "DETECT" section becomes the input speech pattern. This section is then the target of matching, and FIG. 7 shows the matching backtrace for the input speech pattern of FIG. 6(b). Here, the section corresponding to the standard pattern is t2 to t3, and this section is judged to be the true speech section. Conversely, the sections corresponding to the noise patterns are t1 to t2 and t3 to t4; these are judged to be non-speech sections and are output to the noise learning unit as the non-speech pattern j.
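The split into t1 to t2, t2 to t3, and t3 to t4 can be sketched as a small post-processing step on the backtrace. The representation of the backtrace as a list of (input frame, reference frame) pairs over a noise + standard + noise reference, and the rule that a frame counts as speech if it is aligned with any standard-pattern frame, are assumptions made for this example.

```python
def split_by_backtrace(path, head_len, std_len):
    """Separate input frames into speech and non-speech using a matching backtrace.

    path: list of (input_frame, reference_frame) index pairs from the backtrace.
    head_len: number of reference frames in the leading noise pattern.
    std_len: number of reference frames in the standard pattern.
    Returns (speech_frames, non_speech_frames) as sorted index lists.
    """
    speech = {i for i, j in path if head_len <= j < head_len + std_len}
    non_speech = {i for i, _ in path} - speech
    return sorted(speech), sorted(non_speech)
```

For a six-frame input aligned to a reference of one noise frame, three standard-pattern frames, and one noise frame, the call `split_by_backtrace([(0, 0), (1, 0), (2, 1), (3, 2), (4, 3), (5, 4)], 1, 3)` marks frames 2 to 4 as speech and frames 0, 1 and 5 as the non-speech pattern j.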

[0028] Next, if the non-speech pattern j contains a section in which the change in the feature vector is small, the noise learning unit 5 averages that section and outputs the result to the noise pattern storage unit 6 as a stationary noise pattern h (steps 51 to 54). If the non-speech pattern j contains a section in which the change in the feature vector is large, that section is output to the noise pattern storage unit 6 as a non-stationary noise pattern i (steps 51, 52, 55).

[0029] In FIG. 6, if the amount of change of the feature vector is again assumed to have roughly the same waveform as the power, there is no section in which the change is small, so no stationary noise pattern is obtained in any of the examples, and the non-stationary noise patterns (N1, N2, N3, N3') are learned.

[0030] Finally, the methods of discarding and registering noise patterns in the noise pattern storage unit 6 are the same as described above.

[0031] In this way, noise patterns including non-stationary noise are learned from past input patterns obtained in the same environment (location, speaker, and so on), and matching using these noise patterns can be performed at the next recognition.

[0032]

EFFECTS OF THE INVENTION As described above, according to the present invention, non-stationary noise is learned and recognized in addition to stationary noise, so matching can be performed using multiple noise patterns that include non-stationary noise. This has the effect of enabling speech recognition with a high recognition rate in applications with a large amount of non-stationary noise.

[Brief description of drawings]

FIG. 1 is a block diagram of an embodiment of the present invention.

FIG. 2 is a block diagram of a conventional speech recognition device.

FIG. 3 is a processing flowchart of the speech detection unit 2 and the noise learning unit 5.

FIG. 4 is a flowchart showing the storage procedure of the noise pattern storage unit 6.

FIG. 5 is a processing flowchart of the recognition unit 3.

FIG. 6(a) is a power waveform diagram of an input pattern without non-stationary noise, and FIGS. 6(b) and 6(c) are power waveform diagrams of input patterns in which non-stationary noise is mixed.

FIG. 7 is a diagram showing the matching backtrace of the input speech pattern of FIG. 6(b) against the standard pattern and the noise patterns.

[Explanation of symbols]

1 analysis unit; 2 speech detection unit; 3 recognition unit; 4 standard pattern storage unit; 5 noise learning unit; 6 noise pattern storage unit

Claims

[Claims]

1. A speech recognition device comprising: an analysis unit that converts an input signal into an input pattern; a speech detection unit that divides the input pattern into an input speech pattern and a non-speech pattern; a noise learning unit that learns a noise pattern from the non-speech pattern; a noise pattern storage unit in which the noise pattern is registered; a standard pattern storage unit in which standard patterns prepared in advance are registered; and a recognition unit that outputs a recognition result from the input speech pattern, the standard patterns, and the noise pattern,
wherein the noise learning unit includes: means for calculating the amount of change of the feature vector from the non-speech pattern; means for calculating a moving average of the amount of change over a predetermined number of frames; and means for detecting the section in which the moving average is at its minimum and taking it as a stationary noise pattern, and for detecting a section in which the moving average is larger than a preset value and taking it as a non-stationary noise pattern.

2. The speech recognition device according to claim 1, wherein the noise learning unit further includes means for newly learning a noise pattern from the section of the input speech pattern that corresponds to the noise pattern as a result of the matching processing, which is the recognition processing performed by the recognition unit.