JPH11327593A - Voice recognition system

Info

Publication number: JPH11327593A
Application number: JP10132208A
Authority: JP
Inventors: Shinichi Tamura; 震一田村
Original assignee: Denso Corp
Current assignee: Denso Corp
Priority date: 1998-05-14
Filing date: 1998-05-14
Publication date: 1999-11-26

Abstract

PROBLEM TO BE SOLVED: To improve the speech recognition rate by optimally correcting, and reducing as far as possible, the distortion that arises when the spectral subtraction method is used to suppress noise in an input signal containing both the speech to be recognized and noise, even when the speech recognition system is used in environments that produce different types of noise. SOLUTION: In a noise suppression device 10, a mapping calculation part 17 built on a neural network model corrects the distortion of the noise-subtracted power spectrum output from a power spectrum clipping calculation part 16. When speech is correctly recognized by the speech recognition device 20, a teacher power spectrum stored in advance in a buffer 26a in a judgement part 26 is output to a learning control part 18 in the device 10. The learning control part 18 adjusts the parameters of the neural network, using the noise-subtracted spectrum stored in a buffer 17a in the calculation part 17 as the input and the teacher spectrum output from the judgement part 26 as the teacher output.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention relates to noise suppression used as preprocessing for speech signal processing such as speech recognition. In particular, it relates to a speech recognition system using a noise suppression device that employs the spectral subtraction method to remove, as far as possible, the noise component from an input signal in which the speech signal to be recognized is mixed with a noise signal.

[0002]

2. Description of the Related Art

Speech recognition devices have been proposed and put into practice that are useful, for example, for entering a destination in a car navigation system by voice. Such a device compares the input speech with a plurality of comparison target pattern candidates stored in advance and takes the candidate with the highest degree of match as the recognition result. With current recognition technology, however, the result is not always accurate. This is true even in a quiet environment, and all the more so in an environment where ambient noise occurs. In particular, considering actual use environments such as the car navigation system mentioned above, it is hard to assume that no noise is present. Therefore, to improve the recognition rate, it is desirable to perform noise suppression as preprocessing of the input to the speech recognition device, removing as much of the noise component as possible from an input signal in which the speech signal needed for recognition is mixed with a noise signal.

[0003] The spectral subtraction method is known as a very effective technique for removing the noise component from an input signal in which speech and noise are mixed. Many research results on the method have been published, beginning with Steven F. Boll, "Suppression of Acoustic Noise in Speech Using Spectral Subtraction", IEEE Transactions on Acoustics, Speech, and Signal Processing, Vol. ASSP-27, No. 2, April 1979, pp. 113-120. The spectral subtraction method achieves noise suppression either by subtracting the amplitude spectrum of the noise from the amplitude spectrum of the noisy speech signal, or by subtracting the power spectrum of the noise from the power spectrum of the noisy speech signal. The power spectrum is the square of the amplitude spectrum. The output of the spectral subtraction method is therefore either a noise-suppressed amplitude spectrum or a noise-suppressed power spectrum.

[0004] As a system configuration that performs speech recognition after such noise suppression, a speech recognition system 200 such as that shown in FIG. 2(a) has been considered. The microphone 201 delivers either a speech signal mixed with noise or a noise signal alone. The input signal from the microphone 201 is fed to the noise suppression device 203, and the speech signal whose noise has been suppressed by the noise suppression device 203 is transferred to the speech recognition device 204. In this configuration, the user inputs speech through the microphone 201 while pressing a PTT (Push-To-Talk) switch 205. Noise suppression in the noise suppression device 203 is performed as follows.

[0005] As shown in FIG. 2(b), the period until the PTT switch 205 is pressed is treated as a noise section, during which the noise suppression device 203 captures the input signal from the microphone 201. Once the PTT switch 205 is pressed, the period is treated as a speech section, and the noise suppression device 203 again captures the input signal from the microphone 201. What is captured in the speech section, however, is "speech signal + noise signal". Therefore, subtracting the "noise signal" captured in the noise section from the "speech signal + noise signal" captured in the speech section yields a speech signal in which the noise signal is suppressed.
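The noise-section and speech-section handling described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the 256-sample frame length, the averaging of noise-section spectra, the synthetic test signals, and all function names are hypothetical choices, not details taken from the patent.

```python
import numpy as np

def power_spectrum(frame):
    """Power spectrum of one real-valued frame (squared FFT magnitude)."""
    return np.abs(np.fft.rfft(frame)) ** 2

def estimate_noise_power(noise_frames):
    """Assumed estimator: average the power spectra captured before the PTT press."""
    return np.mean([power_spectrum(f) for f in noise_frames], axis=0)

def subtract_noise(speech_frame, noise_power):
    """Plain subtraction: (speech + noise) power minus the estimated noise power."""
    return power_spectrum(speech_frame) - noise_power

rng = np.random.default_rng(0)
# Noise section: 20 frames of synthetic noise captured before the PTT switch is pressed.
noise_frames = rng.normal(0.0, 0.1, size=(20, 256))
# Speech section: a synthetic tone (stand-in for speech) plus noise of the same level.
n = np.arange(256)
speech_frame = np.sin(2 * np.pi * 16 * n / 256) + rng.normal(0.0, 0.1, size=256)

noise_power = estimate_noise_power(noise_frames)
residual = subtract_noise(speech_frame, noise_power)
```

Because the noise in the speech section is only estimated from the earlier noise section, individual bins of `residual` may come out negative, which is the issue the following paragraphs address.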

[0006]

Problems to be Solved by the Invention

This technique, however, is fundamentally based on estimated noise. In the speech section shown in FIG. 2(b), the noise mixed into the signal is not detected directly. Instead, the noise in the speech section is estimated from the noise signal captured in the noise section preceding the speech section, and the power spectrum of that estimated noise is subtracted from the power spectrum of the input speech captured in the speech section. In general, the power spectrum of the estimated noise is multiplied by a predetermined coefficient (the subtraction coefficient) before being subtracted from the power spectrum of the input speech, and this subtraction coefficient is often set to a value greater than 1. Setting the subtraction coefficient greater than 1 amounts to subtracting more of the estimated noise power spectrum than is strictly necessary.

[0007] In sections where the speech power is reasonably high, such as the vowel portions of speech, the shape of the speech power spectrum is hardly affected even if somewhat too much of the estimated noise power spectrum is subtracted. Where the speech power is low, however, such as in pauses within speech or in fricative consonants, over-subtraction can produce negative values. As noted above, the power spectrum is the square of the amplitude spectrum, so it cannot theoretically be negative. Portions that become negative through over-subtraction are therefore clipped and set to zero (0) or to a relatively small positive constant. As a result, the noise-suppressed power spectrum of the input speech obtained by the spectral subtraction method contains a characteristic distortion.
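A minimal Python sketch of the over-subtraction and clipping step just described, assuming NumPy arrays for the spectra. The function name, the default coefficient of 2.0, and the toy values are illustrative assumptions; the text only states that the coefficient is often greater than 1 and that negative results are replaced by zero or a small positive constant.

```python
import numpy as np

def spectral_subtract(speech_power, noise_power, alpha=2.0, floor=0.0):
    """Subtract alpha * noise power, then clip negative results to `floor`.

    alpha is the subtraction coefficient (assumed value), and floor is zero
    or a relatively small positive constant.
    """
    diff = speech_power - alpha * noise_power
    return np.where(diff < 0.0, floor, diff)

# Toy spectra: two high-power bins (vowel-like) and two low-power bins (pause-like).
speech_power = np.array([100.0, 4.0, 1.0, 50.0])
noise_power = np.array([2.0, 2.0, 2.0, 2.0])

cleaned = spectral_subtract(speech_power, noise_power, alpha=2.0, floor=0.0)
cleaned_floor = spectral_subtract(speech_power, noise_power, alpha=2.0, floor=0.01)
```

In this toy case the high-power bins keep their shape, while the third bin would be negative (1 - 4 = -3) and is clipped. That clipped bin is where the distortion discussed in the text arises.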

[0008] The same applies when the amplitude spectrum is used instead of the power spectrum. Subtracting the estimated noise amplitude spectrum multiplied by the subtraction coefficient from the amplitude spectrum of the input speech can yield a negative value in the calculation. Here too, the amplitude spectrum itself can never be negative, so those portions are clipped and set to zero (0) or to a relatively small positive constant. The noise-suppressed amplitude spectrum of the input speech obtained by the spectral subtraction method therefore also contains a characteristic distortion.

[0009] The noise suppression device 203 shown in FIG. 2(a) outputs to the speech recognition device 204 either the power spectrum of the input speech whose noise has been suppressed by the spectral subtraction method, or the autocorrelation coefficients obtained by inverse Fourier transform of that power spectrum. As described above, the power spectrum or autocorrelation coefficients input to the speech recognition device 204 are distorted, so the recognition rate of the speech recognition device 204 falls.

[0010] To limit such a drop in the recognition rate, one could correct the output of the spectral subtraction method, that is, the amplitude spectrum from which the estimated noise amplitude spectrum has been subtracted, the power spectrum from which the estimated noise power spectrum has been subtracted, or the autocorrelation coefficients obtained by inverse Fourier transform of that power spectrum, so as to reduce the distortion characteristic of spectral subtraction described above. Since the type of noise differs with the environment in which the speech recognition system is used, it is desirable for the correction to match the noise that occurs in the actual environment of use. If, however, the system is designed to apply a correction matched to the noise of an assumed environment, there is a concern that the recognition rate will fall when the system is used in an environment that was not anticipated.

[0011] An object of the present invention is therefore to optimally correct, and reduce as far as possible, the distortion that arises when noise suppression by the spectral subtraction method is applied to an input signal in which the speech to be recognized and noise are mixed, whatever the environment of use and whatever type of noise it produces, and thereby to contribute to a higher recognition rate in speech recognition.

[0012]

Means for Solving the Problems and Effects of the Invention

The speech recognition system of the present invention, made to achieve the above object, comprises a noise suppression device that performs noise suppression on speech input through a microphone or the like, and a speech recognition device that is connected to the noise suppression device and recognizes the speech input from it.

[0013] In the noise suppression device of the speech recognition system of the present invention, frame dividing means cuts the input signal, received for example through a microphone, into frame signals at every predetermined processing time, and spectrum calculating means calculates a spectrum from each frame signal, for example by Fourier transform. Determination means determines whether the input signal belongs to a speech section, which contains speech, or to a noise section, which does not. Noise spectrum estimating means estimates the noise spectrum using the spectrum calculated from the input signal in the noise section. Subtraction means then subtracts the noise spectrum multiplied by a predetermined subtraction coefficient from the spectrum calculated from the input signal in the speech section, and clipping means calculates a noise-subtracted spectrum in which the portions of the subtraction result that became negative are set to zero or to a relatively small positive constant. Further, mapping calculation means built on a neural network model maps the noise-subtracted spectrum calculated by the clipping means to a spectrum in which the distortion due to clipping is reduced, provided that the parameters of the neural network have been suitably adjusted. The noise-subtracted spectrum calculated by the clipping means is also stored and held in noise-subtracted spectrum storage means.
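The first two stages of this chain, frame division and spectrum calculation, might look as follows in Python. The frame length of 256 samples, the 50% overlap, and the Hamming window are assumptions made for the sketch; the text only requires frames cut at a predetermined processing time and a Fourier-type transform. A 256-sample real frame yields 129 frequency bins, which matches the 129-point example given later in the description.

```python
import numpy as np

def split_frames(signal, frame_len=256, hop=128):
    """Cut a 1-D signal into overlapping frames (frame_len and hop are assumed values)."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    return np.stack([signal[i * hop : i * hop + frame_len] for i in range(n_frames)])

def frame_power_spectra(frames):
    """Window each frame (Hamming, an assumption) and return its power spectrum."""
    window = np.hamming(frames.shape[1])
    return np.abs(np.fft.rfft(frames * window, axis=1)) ** 2

# Synthetic input standing in for the microphone signal.
signal = np.random.default_rng(1).normal(size=1024)
frames = split_frames(signal)
spectra = frame_power_spectra(frames)   # one row per frame, one column per frequency bin
```

Each row of `spectra` would then go through noise estimation or subtraction and clipping, depending on whether its frame falls in a noise section or a speech section.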

[0014] When the spectrum calculated by the mapping calculation means is input to the speech recognition device, recognition means in the speech recognition device compares it with a plurality of comparison target pattern candidates stored in advance and takes the candidate with the highest degree of match as the recognition result. In addition, teacher spectrum storage means in the speech recognition device stores noise-free teacher spectra prepared for each comparison target pattern candidate, and teacher spectrum output means in the speech recognition device outputs the teacher spectrum corresponding to the recognition result of the recognition means to the noise suppression device.

[0015] When the recognition result of the recognition means in the speech recognition device is correct, learning control means in the noise suppression device adjusts the parameters of the neural network. It uses the teacher spectrum output from the speech recognition device in response to the output of the mapping calculation means as the teacher output of the neural network constituting the mapping calculation means, and the noise-subtracted spectrum corresponding to that teacher spectrum, stored and held in the noise-subtracted spectrum storage means as described above, as the input of the neural network.
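The gating behaviour of the learning control means can be illustrated with a small Python class. The class name, method names, and the idea of collecting (input, teacher) pairs in a list are hypothetical; the sketch only shows that a buffered noise-subtracted spectrum is paired with a teacher spectrum when, and only when, the recognition result was correct.

```python
import numpy as np

class LearningControl:
    """Hypothetical sketch: pair a buffered input with a teacher spectrum
    only after a correct recognition."""

    def __init__(self):
        self.buffered_input = None    # plays the role of the noise-subtracted spectrum storage
        self.training_pairs = []      # (noise-subtracted spectrum, teacher spectrum) pairs

    def store_input(self, noise_subtracted):
        """Hold the spectrum that was fed to the neural network."""
        self.buffered_input = np.asarray(noise_subtracted, dtype=float)

    def on_recognition(self, teacher_spectrum, is_correct):
        """Accept the teacher spectrum only if the recognition result was correct."""
        if not is_correct or self.buffered_input is None:
            return False
        teacher = np.asarray(teacher_spectrum, dtype=float)
        self.training_pairs.append((self.buffered_input, teacher))
        return True

ctrl = LearningControl()
ctrl.store_input([0.0, 3.0, 1.0])                      # clipped spectrum (toy values)
accepted_wrong = ctrl.on_recognition([0.5, 3.0, 1.2], is_correct=False)
accepted_right = ctrl.on_recognition([0.5, 3.0, 1.2], is_correct=True)
```

Discarding pairs from misrecognized utterances keeps wrong teacher spectra out of the training data, which appears to be the reason the text ties learning to a correct recognition result.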

[0016] The spectrum calculated by the spectrum calculating means may be an amplitude spectrum or a power spectrum. Fourier transform of the frame signal gives the frequency spectrum S(f). The amplitude spectrum A(f), which is the amplitude component of the frequency spectrum S(f), may be used, or the power spectrum P(f), obtained by squaring the amplitude spectrum A(f), may be used.

[0017] For example, when the spectrum calculating means calculates the amplitude spectrum A(f), the noise spectrum estimating means estimates the noise amplitude spectrum AN(f), and the subtraction means subtracts the noise amplitude spectrum AN(f) multiplied by a predetermined subtraction coefficient from the amplitude spectrum AS(f) calculated from the input signal in the speech section.

[0018] When the spectrum calculating means calculates the power spectrum P(f), the noise spectrum estimating means estimates the noise power spectrum PN(f), and the subtraction means subtracts the noise power spectrum PN(f) multiplied by a predetermined subtraction coefficient from the power spectrum PS(f) calculated from the input signal in the speech section.

[0019] With this subtraction, the power spectrum or amplitude spectrum of the estimated noise, multiplied by the subtraction coefficient, is subtracted from the power spectrum or amplitude spectrum of the input speech. When the subtraction coefficient is large, the result can be negative in the calculation. Since a power spectrum or amplitude spectrum cannot theoretically be negative, those portions are clipped, and a noise-subtracted spectrum is calculated in which they are set to zero (0) or to a relatively small positive constant. This noise-subtracted spectrum therefore contains a characteristic distortion due to clipping. If it is used for speech recognition as it is, the recognition rate falls.

[0020] The present invention therefore reduces the distortion by using a neural network model to compute a mapping from the spectrum distorted by clipping to a spectrum in which the effect of clipping is corrected. A neural network model is capable of mapping an arbitrary input signal (scalar or vector) to an arbitrary output signal (scalar or vector). In other words, with a sufficiently large number of neurons, any input/output relationship can be realized. This has been proved, for example, in Ken-ichi Funahashi, "On the capability of neural networks" (IEICE Technical Report, vol. 88, No. 126, MBE88-52). Accordingly, by adjusting the parameters of the neural network model, that is, by training the neural network model, the spectrum distorted by clipping can be mapped to a spectrum in which the effect of clipping is corrected.

[0021] As the neural network model, a feedforward neural network model may be used, for example as set out in claim 5. Consider a feedforward neural network model with three layers: an input layer, a hidden layer, and an output layer. If the input spectrum consists, for each frame processing time t, of sample data for 129 frequencies f, for example f = 1 to 129, a network with 129 neurons in each of the input and output layers may be used. Sample data for the 129 frequencies f can then be fed to the input layer at each frame processing time t, and outputs for the 129 frequencies f are obtained at the output layer. In this case, if the noise-subtracted spectrum input to the neural network is PSRC(f, t) (f = 1, 2, ..., 129) and the number of neurons in the hidden layer is, for example, 100, the output of the neural network PSRCNN(f, t) (f = 1, 2, ..., 129) is given by the following equation.

[0022]

(Equation 1)

[0023] Here, W(f, h) (f = 1, 2, ..., 129; h = 1, 2, ..., 100) and w(h, i) (h = 1, 2, ..., 100; i = 0, 1, 2, ..., 129) are the parameters of the neural network model. The function S(·) is S(x) = 1/(1 + e^(-x)).
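The image of Equation 1 is not reproduced in this text, so the following Python sketch is a reconstruction from the parameter definitions only, not the patent's formula. It assumes that w(h, 0) is a bias term paired with a constant input of 1, that the hidden layer applies the sigmoid S, and that the output layer is a plain weighted sum, i.e. PSRCNN(f, t) = Σ_h W(f, h) · S(Σ_i w(h, i) · PSRC(i, t)) with PSRC(0, t) = 1. All variable names and the random initial weights are illustrative.

```python
import numpy as np

N_FREQ, N_HIDDEN = 129, 100   # layer sizes from the example in the text

def sigmoid(x):
    """S(x) = 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + np.exp(-x))

def forward(psrc, w_hidden, w_out):
    """Assumed forward pass of the three-layer network.

    psrc:     noise-subtracted spectrum PSRC(f, t), shape (129,)
    w_hidden: w(h, i), shape (100, 130); column 0 is the assumed bias w(h, 0)
    w_out:    W(f, h), shape (129, 100)
    """
    x = np.concatenate(([1.0], psrc))   # constant 1 pairs with w(h, 0)
    hidden = sigmoid(w_hidden @ x)      # hidden-layer activations
    return w_out @ hidden               # assumed linear output layer

rng = np.random.default_rng(0)
w_hidden = rng.normal(0.0, 0.1, size=(N_HIDDEN, N_FREQ + 1))
w_out = rng.normal(0.0, 0.1, size=(N_FREQ, N_HIDDEN))
psrc = rng.random(N_FREQ)               # toy input spectrum

psrcnn = forward(psrc, w_hidden, w_out)
```

If the actual Equation 1 applies S at the output layer as well, only the last line of `forward` would change.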

[0024] By adjusting the parameters W(f, h) and w(h, i), the mapping from the spectrum distorted by clipping to a spectrum in which the effect of clipping is corrected is realized. Since the type of noise differs with the environment in which the speech recognition system is used, it is desirable to correct the effect of clipping in accordance with the noise that occurs in the actual environment of use. For example, when a speech recognition system using such a neural network model is used inside an automobile, the parameters of the neural network would be adjusted in advance using speech containing a large amount of automobile noise. If, however, the system is designed to correct clipping distortion in accordance with the noise of an assumed environment, there is a concern that the difference in noise type will lower the recognition rate when the system is used in an environment that was not anticipated. For example, if a speech recognition system whose neural network parameters were adjusted with speech containing automobile noise, as above, is used in an information terminal installed on a street or in a parking area, the difference in the type of noise that occurs may lower the recognition rate.

[0025] In the speech recognition system of the present invention, by contrast, noise-free teacher spectra are stored in the speech recognition device in correspondence with the plurality of comparison pattern candidates that represent the vocabulary to be recognized, and the noise-subtracted spectrum input to the neural network is held. When the speech recognition device achieves a correct recognition based on the noise-subtracted spectrum, the teacher spectrum corresponding to the recognition result is used as the teacher output of the neural network and the noise-subtracted spectrum corresponding to that teacher spectrum is used as the input of the neural network, and the parameters of the neural network are adjusted so that the network is trained. In other words, the correction is optimal when the parameters of the neural network are adjusted so that the spectrum distorted by clipping is mapped to the teacher spectrum, which is the ideal output. Therefore, whenever the mapping of a distorted spectrum leads to a correct recognition, the parameters are adjusted using the stored teacher spectrum as the teacher output.
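One way to carry out such a parameter adjustment is a gradient-descent (backpropagation) step on the squared error between the network output and the teacher spectrum. The text does not name a learning algorithm, so the update rule, the learning rate, the assumed network form (sigmoid hidden layer, linear output, bias in column 0), and the toy data below are all assumptions made for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_step(w_hidden, w_out, psrc, teacher, lr=0.01):
    """One assumed gradient-descent step on 0.5 * ||output - teacher||^2.

    Updates w_hidden and w_out in place and returns the loss before the update.
    """
    x = np.concatenate(([1.0], psrc))            # constant 1 pairs with the bias column
    hidden = sigmoid(w_hidden @ x)
    err = w_out @ hidden - teacher               # output error per frequency bin
    loss = 0.5 * float(err @ err)
    grad_out = np.outer(err, hidden)             # gradient w.r.t. W(f, h)
    delta_hidden = (w_out.T @ err) * hidden * (1.0 - hidden)
    grad_hidden = np.outer(delta_hidden, x)      # gradient w.r.t. w(h, i)
    w_out -= lr * grad_out
    w_hidden -= lr * grad_hidden
    return loss

rng = np.random.default_rng(0)
w_hidden = rng.normal(0.0, 0.1, size=(100, 130))
w_out = rng.normal(0.0, 0.1, size=(129, 100))
psrc = rng.random(129)       # toy noise-subtracted spectrum held in the buffer
teacher = rng.random(129)    # toy teacher spectrum for the recognized word

# Repeating the step on the same pair only demonstrates that the error shrinks.
losses = [train_step(w_hidden, w_out, psrc, teacher) for _ in range(50)]
```

In operation, one such step (or a few) would be applied per correctly recognized utterance, so the mapping gradually adapts to the noise of the environment in which the system is actually used.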

[0026] Thus, in the speech recognition system of the present invention, the parameters of the neural network are adjusted every time a correct recognition is made while the system is in operation. Consequently, whatever the environment of use and whatever type of noise it produces, the distortion that arises when noise suppression by the spectral subtraction method is applied to an input signal in which the speech to be recognized and noise are mixed can be corrected appropriately. Moreover, because a neural network model can realize an arbitrary mapping, as the parameter adjustment of the neural network, that is, learning, progresses, a distorted spectrum can be mapped to a spectrum close to the teacher spectrum, and a near-optimal correction of the spectral distortion can be achieved. As a result, the invention can contribute to a higher recognition rate in speech recognition.

[0027] As explained above, the present invention uses a neural network model. The reason is that a neural network model is capable of mapping an arbitrary input signal (scalar or vector) to an arbitrary output signal (scalar or vector) and also has the well-known ability to learn. That is, a neural network model can adjust its parameters, that is, learn, from an input signal and the teacher output signal corresponding to that input signal, so there is no need for a human to understand the causal relationship between the input signal and the output signal. As a result, using a neural network model makes it possible to reduce the spectral distortion due to clipping in accordance with whatever noise occurs in the various environments in which the speech recognition system is used.

[0028] Where the subtraction means described above is configured to subtract the noise power spectrum PN(f) multiplied by a predetermined subtraction coefficient from the power spectrum PS(f) calculated from the input signal in the speech section, the arrangement set out in claim 4 may be adopted. That is, the noise suppression device includes autocorrelation coefficient calculating means, which calculates autocorrelation coefficients from the noise-subtracted spectrum calculated by the clipping means. The mapping calculation means then maps the autocorrelation coefficients calculated by the autocorrelation coefficient calculating means to autocorrelation coefficients in which the distortion due to clipping is reduced.

[0029] Using autocorrelation coefficients in this way achieves the same reduction in distortion, and is also effective in reducing the memory capacity and processing load of the speech recognition device that performs speech recognition on the output of the noise suppression device. This relies on the fact that the Fourier transform of the autocorrelation coefficients is the power spectrum, that is, that the inverse Fourier transform of the power spectrum gives the autocorrelation coefficients. With the autocorrelation coefficients denoted C(r, t) and the inverse Fourier transform denoted F^-1, the relationship with the power spectrum P(f, t) is as follows.

[0030]

C(r, t) = F^-1[P(f, t)]

Here r is the index of the autocorrelation coefficient and corresponds to the frequency f in the power spectrum. Because the power spectrum and the autocorrelation coefficients are equivalent in this way, computing a mapping of the autocorrelation coefficients gives the same result as using the power spectrum, namely an output with reduced distortion.
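The relationship C(r, t) = F^-1[P(f, t)] can be checked numerically for a single frame. The sketch below uses NumPy's real FFT pair on a synthetic 256-sample frame; the frame length and the function name are assumptions. Note that for a discrete frame without zero padding, the inverse transform yields the circular autocorrelation.

```python
import numpy as np

def autocorrelation_from_power(power_spectrum, n_fft):
    """Inverse real FFT of a one-sided power spectrum gives the (circular) autocorrelation."""
    return np.fft.irfft(power_spectrum, n=n_fft)

frame = np.random.default_rng(2).normal(size=256)   # synthetic frame
power = np.abs(np.fft.rfft(frame)) ** 2             # P(f) for this frame
c = autocorrelation_from_power(power, n_fft=256)    # C(r), r = 0 .. 255

# Direct circular autocorrelation for comparison: sum over n of x[n] * x[(n + r) mod N].
direct = np.array([np.dot(frame, np.roll(frame, -r)) for r in range(256)])
```

Here `c` and `direct` agree to floating-point precision, and `c[0]` equals the frame energy. Zero-padding the frame before the transform would give the linear rather than the circular autocorrelation.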

[0031] The way in which using autocorrelation coefficients reduces the memory capacity and processing load of the speech recognition device is as follows. Suppose that the speech recognition device performs linear predictive analysis (linear predictive coding: LPC) and that the noise suppression device outputs a power spectrum. The speech recognition device must then first calculate the autocorrelation coefficients from the power spectrum output by the noise suppression device, which increases its processing load and memory requirements. If, instead, the noise suppression device converts the spectrum to autocorrelation coefficients and passes these to the speech recognition device, the processing load and memory capacity of the speech recognition device can be reduced. When the speech recognition device performs LPC of order P, it uses only the autocorrelation coefficients C(r, t) with index r = 0, 1, 2, ..., P, and P is generally about 17.

【００３２】Accordingly, by inverse-Fourier-transforming the power spectrum into autocorrelation coefficients and outputting a mapping of those coefficients, the memory capacity and processing load of the speech recognition device can be reduced. In the speech recognition system of the present invention, the parameters of the neural network are adjusted when the recognition result of the speech recognition device is correct. Whether the result is correct can be determined, for example as set forth in claim 6, by reporting the recognition result of the recognition means and judging correctness from a confirmation signal input from outside in response to that report. The recognition result is reported, for example, by outputting it as synthesized speech or, in a system equipped with a display such as a navigation system, by showing it on the display. The speaker, for example, then judges whether the reported result is correct and presses a confirmation switch or makes a predetermined voice input, and the speech recognition system determines whether the recognition result is correct from the resulting confirmation signal, such as the switch signal or the input speech signal.

【００３３】Alternatively, for example as set forth in claim 7, the recognition result may be judged correct when the degree of match of the pattern judged by the recognition means to match best differs from the degrees of match of the other comparison pattern candidates by at least a predetermined value. The recognition means of the speech recognition device computes, for each of a plurality of pre-stored comparison pattern candidates, its degree of match with the output of the noise suppression device and takes the candidate with the highest degree of match as the recognition result. When the computed degrees of match differ little, misrecognition is likely. The system therefore presumes the recognition correct when the degree of match of the winning pattern is comparatively far above those of the other patterns. This is convenient because the user need not confirm whether the recognition result is correct, unlike the scheme described above.
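
The margin test described above can be sketched as follows. This is an illustration only: the function name, the score scale, and the threshold value are hypothetical, and the text does not specify how the predetermined value is chosen.

```python
def recognize_with_margin(scores, margin):
    """Return (best_word, is_confident).

    scores: dict mapping candidate word -> degree of match (higher is better).
    margin: hypothetical threshold; the result is treated as correct only when
            the best score exceeds the runner-up by at least this much.
    """
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    best_word, best_score = ranked[0]
    if len(ranked) == 1:
        # Assumption: a lone candidate has no competitor and is accepted.
        return best_word, True
    runner_up_score = ranked[1][1]
    return best_word, (best_score - runner_up_score) >= margin
```

For example, with scores of 0.9, 0.4 and 0.35 and a margin of 0.3 the top candidate is accepted, whereas scores of 0.9 and 0.85 are too close and the result is not treated as confirmed.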

【００３４】The determination means described above determines whether the input signal is a speech section containing speech or a noise section containing no speech, and this determination may be based on the power of the input signal. Alternatively, the input period designated by input period designation means, provided so that the speaker can specify when speech is input, may be treated as the speech section. A PTT (Push-To-Talk) switch is one example of such means: when the user speaks while holding the PTT switch, the speech input while the switch is held is accepted for processing. Noise suppression then needs to be executed only on the input signal that is actually to be noise-suppressed, which helps reduce the processing load.

【００３５】Because such a speech recognition system can adaptively correct the distortion peculiar to the spectrum subtraction method under any environment, whatever the type of noise mixed into the speech signal, it has a variety of possible applications. One is a car navigation system, where it is very convenient to be able to input, for example, the destination for route setting by voice. The speech recognition system may also be used for an in-vehicle air-conditioning system, in which case the user inputs air-conditioning commands by voice. It is likewise applicable to portable information terminals and to information terminals installed on streets, in parking areas, and in similar locations.

【００３６】[0036]

DESCRIPTION OF THE PREFERRED EMBODIMENTS FIG. 1 is a block diagram showing the schematic configuration of a speech recognition system according to an embodiment of the present invention. The system comprises a noise suppression device 10, which suppresses noise in speech input via a microphone 30, and a speech recognition device 20, which compares the output of the noise suppression device 10 with a plurality of pre-stored comparison pattern candidates and takes the candidate with the highest degree of match as the recognition result.

【００３７】First, the noise suppression device 10 is described. As shown in FIG. 1, the noise suppression device 10 comprises a voice input unit 11, an input voice cutout unit 12, a frame power spectrum calculation unit 13, a noise power spectrum estimation unit 14, a noise power spectrum subtraction unit 15, a power spectrum clipping calculation unit 16, a mapping calculation unit 17, and a learning control unit 18. The processing in each block is described below.

【００３８】The voice input unit 11 converts the analog speech signal input via the microphone 30 into a digital signal at a sampling frequency of, for example, 10 kHz and outputs it to the input voice cutout unit 12. The input voice cutout unit 12 sequentially cuts the input signal from the voice input unit 11 into frames of a predetermined length and number of samples, for example 256 samples spanning 25.6 milliseconds, and outputs each frame to the frame power spectrum calculation unit 13 and the noise power spectrum estimation unit 14.
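
The framing arithmetic above (256 samples at 10 kHz is 25.6 ms) can be sketched as follows. The text does not say whether frames overlap or how a trailing partial frame is handled, so non-overlapping frames and dropping the remainder are assumptions of this sketch, as are all names.

```python
SAMPLE_RATE_HZ = 10_000   # example sampling frequency from the text
FRAME_LENGTH = 256        # samples per frame from the text

def split_into_frames(samples, frame_length=FRAME_LENGTH):
    """Cut a sample sequence into consecutive, non-overlapping frames.

    Assumption: frames do not overlap and a trailing partial frame is dropped.
    """
    n_frames = len(samples) // frame_length
    return [samples[i * frame_length:(i + 1) * frame_length] for i in range(n_frames)]

frame_duration_ms = 1000.0 * FRAME_LENGTH / SAMPLE_RATE_HZ   # 25.6 ms
frames = split_into_frames([0.0] * 1000)   # 3 full frames; 232 samples are dropped
```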

【００３９】The noise power spectrum estimation unit 14 Fourier-transforms the frame signal from the input voice cutout unit 12 to obtain the spectrum SN(f, t), computes the squared amplitude of SN(f, t) to obtain the power spectrum PN(f, t), and stores it in an internal buffer (not shown). The power spectrum PN(f, t) is calculated as follows.

【００４０】[0040]

【数２】 (Equation 2)
$$\mathrm{PN}(f,t) = \left|\mathrm{SN}(f,t)\right|^{2}$$

【００４１】The buffer does not hold every past power spectrum. It holds only the power spectra going back 5 × 25.6 milliseconds from the present, that is, the power spectra PN(f, t) of the five most recent frames, and is updated as each new frame arrives.

【００４２】In the power spectrum PN(f, t), f is frequency and t is time (here, the time index of frame-by-frame processing). Because a 256-sample frame signal is Fourier-transformed, the symmetry of the resulting spectrum means that PN(f, t) consists of 129 samples in the frequency (f) direction. The index t = 0 denotes the present, t = 1 the immediately preceding frame, t = 2 the frame before that, and so on, so that larger values lie further in the past. The power spectra PN(f, t) for the five most recent frames are therefore PN(f, 0), PN(f, 1), PN(f, 2), PN(f, 3), and PN(f, 4), where f = 1, 2, ..., 129. Older power spectra are discarded from the buffer.

【００４３】The noise power spectrum estimation unit 14 stops estimating the noise power spectrum when it receives a voice input detection signal indicating that speech is being input. In this embodiment, the voice input detection signal is output while a PTT (Push-To-Talk) switch (not shown) is pressed. That is, the user operates this speech recognition system by speaking into the microphone 30 while holding the PTT switch. Because a pressed PTT switch means that the user has deliberately acted to input speech, the system treats that period as the period of speech input (the speech section) without checking whether speech is actually present.

【００４４】On receiving the voice input detection signal, the noise power spectrum estimation unit 14 stops the estimation process, averages the five power spectra PN(f, 0), PN(f, 1), PN(f, 2), PN(f, 3), and PN(f, 4) stored in the buffer to produce the noise power spectrum PNAV(f) (f is frequency) used for subtraction in the spectrum subtraction method, and passes it to the noise power spectrum subtraction unit 15. The noise power spectrum PNAV(f) is calculated as follows.

【００４５】[0045]

【数３】 (Equation 3)
$$\mathrm{PNAV}(f) = \frac{1}{5}\sum_{t=0}^{4}\mathrm{PN}(f,t)$$
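
The five-frame buffer and its average can be sketched as below. This is a minimal illustration, not the patent's implementation: the class and method names are hypothetical, and averaging over fewer than five frames when the buffer is not yet full is an assumption the text does not address.

```python
from collections import deque

class NoiseEstimator:
    """Keeps the power spectra of the most recent frames and averages them."""

    def __init__(self, history=5):
        # deque(maxlen=...) silently discards the oldest entry when full
        self._buffer = deque(maxlen=history)

    def push(self, power_spectrum):
        """Store PN(f, t) for the newest frame (intended for noise sections only)."""
        self._buffer.append(list(power_spectrum))

    def average(self):
        """Return PNAV(f): the per-frequency mean over the buffered frames."""
        if not self._buffer:
            raise ValueError("no noise frames buffered yet")
        n = len(self._buffer)
        return [sum(column) / n for column in zip(*self._buffer)]

est = NoiseEstimator()
for level in range(1, 8):          # seven hypothetical frames, 3 frequency bins each
    est.push([float(level)] * 3)
pnav = est.average()               # only levels 3..7 remain in the buffer
```

After seven pushes only the last five frames remain, so every bin of `pnav` is the mean of 3, 4, 5, 6 and 7.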

【００４６】The frame power spectrum calculation unit 13, by contrast, operates only when it receives the voice input detection signal. It Fourier-transforms the frame signal from the input voice cutout unit 12, that is, the input speech signal of each 25.6-millisecond frame described above, to obtain the spectrum SS(f, t) of the input speech signal, computes the squared amplitude of SS(f, t) to obtain the power spectrum PS(f, t), and passes it to the noise power spectrum subtraction unit 15. The power spectrum PS(f, t) is calculated as follows.

【００４７】[0047]

【数４】 (Equation 4)
$$\mathrm{PS}(f,t) = \left|\mathrm{SS}(f,t)\right|^{2}$$

【００４８】The noise power spectrum subtraction unit 15 subtracts, from the power spectrum PS(f, t) sent by the frame power spectrum calculation unit 13, the noise power spectrum PNAV(f) sent by the noise power spectrum estimation unit 14 multiplied by a predetermined subtraction coefficient, and sends the result to the power spectrum clipping calculation unit 16. Here the subtraction coefficient is 1.2. The output PSR(f, t) of the noise power spectrum subtraction unit 15 is therefore given by the following equation.

【００４９】[0049]

【数５】 (Equation 5)
$$\mathrm{PSR}(f,t) = \mathrm{PS}(f,t) - 1.2\,\mathrm{PNAV}(f)$$

【００５０】The power spectrum clipping calculation unit 16 clips the portions of the output PSR(f, t) of the noise power spectrum subtraction unit 15 that the subtraction has made negative, replacing them with zero or a relatively small positive constant, to compute the power spectrum PSRC(f, t), which it passes to the mapping calculation unit 17. The power spectrum PSRC(f, t) is therefore given by the following equation.

【００５１】[0051]

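【数６】 (Equation 6)
$$\mathrm{PSRC}(f,t) = \begin{cases} \mathrm{PSR}(f,t), & \mathrm{PSR}(f,t) \ge 0 \\ c, & \mathrm{PSR}(f,t) < 0 \end{cases}$$
where c is zero or a relatively small positive constant.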
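
The subtraction and clipping steps can be sketched together as follows. The coefficient 1.2 is taken from the text. The function name, the `floor` parameter, and the sample values are assumptions for illustration, since the text does not give the exact value of the small positive constant.

```python
def subtract_and_clip(ps, pnav, alpha=1.2, floor=0.0):
    """Return (PSR, PSRC) for one frame.

    ps    : power spectrum PS(f, t) of the current speech-section frame
    pnav  : averaged noise power spectrum PNAV(f)
    alpha : subtraction coefficient (1.2 in the embodiment)
    floor : value substituted for negative bins; zero or a small positive
            constant (the exact constant is not given in the text)
    """
    psr = [s - alpha * n for s, n in zip(ps, pnav)]
    psrc = [v if v >= 0.0 else floor for v in psr]
    return psr, psrc

# Hypothetical 3-bin example: the second and third bins go negative and are clipped.
psr, psrc = subtract_and_clip([10.0, 1.0, 4.0], [2.0, 2.0, 5.0])
```

Bins where the scaled noise estimate exceeds the observed power lose their original value entirely, which is the source of the distortion that the mapping stage is meant to correct.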

【００５２】The mapping calculation unit 17 is built from a three-layer feedforward neural network model whose input, hidden, and output layers consist of 129, 100, and 129 neurons, respectively. It takes the power spectrum PSRC(f, t) from the power spectrum clipping calculation unit 16 as input and outputs the power spectrum PSRCNN(f, t). The power spectrum PSRC(f, t), which is frame-by-frame time-series data, has 129 samples in the frequency (f) direction. Feeding it to the neural network model yields the 129-sample time-series data PSRCNN(f, t) at each frame time t. The output PSRCNN(f, t) of the mapping calculation unit 17 is given by the following equation.

【００５３】[0053]

【数７】 (Equation 7)

【００５４】Here W(f, h) (f = 1, 2, ..., 129; h = 1, 2, ..., 100) and w(h, i) (h = 1, 2, ..., 100; i = 0, 1, 2, ..., 129) are the parameters of the neural network model. The function S(·) is S(x) = 1/(1 + e^-x).
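
Equation 7 itself is not reproduced in this text, so the following forward pass is only a plausible sketch consistent with the parameter indices and the sigmoid S(·) given above. It assumes that w(h, 0) acts as a bias for hidden unit h, that the hidden layer applies S, and that the output layer is a plain weighted sum using W(f, h). These structural choices, the random initial weights, and all names are assumptions.

```python
import math
import random

def sigmoid(x):
    """S(x) = 1 / (1 + e^-x)."""
    return 1.0 / (1.0 + math.exp(-x))

def forward(psrc, w, W):
    """Map one frame PSRC(f, t) to PSRCNN(f, t).

    w[h][0] is treated as a bias and w[h][i], i >= 1, as input weights
    (an assumption suggested by the index i starting at 0).
    The output layer is assumed linear: sum over h of W[f][h] * hidden[h].
    """
    hidden = [sigmoid(row[0] + sum(wi * x for wi, x in zip(row[1:], psrc)))
              for row in w]
    return [sum(Wfh * hh for Wfh, hh in zip(row, hidden)) for row in W]

N_IN, N_HID, N_OUT = 129, 100, 129          # layer sizes from the text
rng = random.Random(0)
w = [[rng.uniform(-0.1, 0.1) for _ in range(N_IN + 1)] for _ in range(N_HID)]
W = [[rng.uniform(-0.1, 0.1) for _ in range(N_HID)] for _ in range(N_OUT)]
out = forward([1.0] * N_IN, w, W)           # one 129-bin output frame
```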

【００５５】The mapping calculation unit 17 also sequentially stores in a buffer 17a the power spectra PSRC(f, t) output by the power spectrum clipping calculation unit 16 while the PTT switch is pressed (the speech section). In this way, the power spectrum PSRCNN(f, t) obtained for each 25.6-millisecond frame, the cutout unit of the input voice cutout unit 12, is sent sequentially to the speech recognition device 20.

【００５６】Next, the speech recognition device 20 is described. The speech recognition device 20 comprises an inverse Fourier transform unit 21, an LPC analysis unit 22, a cepstrum calculation unit 23, a standard pattern storage unit 24, a collation unit 25, and a determination unit 26.

【００５７】The inverse Fourier transform unit 21 applies an inverse Fourier transform to the output PSRCNN(f, t) of the mapping calculation unit 17 to obtain autocorrelation coefficients and passes them to the LPC analysis unit 22. The LPC analysis unit 22 performs linear predictive analysis on the output of the inverse Fourier transform unit 21. Linear predictive analysis is a standard technique in speech signal processing and is described in detail in, for example, Furui, "Digital Speech Processing" (Tokai University Press). This embodiment uses the autocorrelation method for the linear predictive analysis and computes 12th-order LPC coefficients from the autocorrelation coefficients.
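
The text names the autocorrelation method but does not give the solver. A common way to solve the resulting normal equations is the Levinson-Durbin recursion, sketched here as an assumption rather than as the patent's procedure. The sign convention (prediction error filter A(z) = 1 + a1 z^-1 + ... + aP z^-P) and all names are also assumptions.

```python
def levinson_durbin(r, order):
    """Solve the autocorrelation-method normal equations.

    r     : autocorrelation coefficients r[0..order], with r[0] > 0
    order : prediction order P
    Returns (a, err) with a[0] = 1 and prediction
    x[n] ~ -(a[1]*x[n-1] + ... + a[P]*x[n-P]); err is the residual energy.
    """
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err                      # reflection coefficient
        prev = a[:]
        for j in range(1, i):
            a[j] = prev[j] + k * prev[i - j]
        a[i] = k
        err *= (1.0 - k * k)
    return a, err

rho = 0.5
r = [rho ** k for k in range(3)]            # autocorrelation of a first-order process
a, err = levinson_durbin(r, 2)              # second coefficient comes out as zero
```

For the first-order example the recursion recovers a single non-zero coefficient, a[1] = -0.5, which is a quick sanity check on the sign convention. The 12th-order analysis of the embodiment would call the same function with `order=12` and thirteen autocorrelation values.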

【００５８】The cepstrum calculation unit 23 then computes, from the LPC coefficients calculated by the LPC analysis unit 22, LPC cepstrum coefficients of orders 1 through 17 as the spectral feature parameters of each frame. The standard pattern storage unit 24 stores precomputed standard patterns (feature parameter sequences) of the recognition target vocabulary, and the collation unit 25 computes the similarity between the standard patterns stored in the standard pattern storage unit 24 and the LPC cepstrum coefficients calculated by the cepstrum calculation unit 23. The similarity for each word stored as dictionary data is obtained by a well-known technique such as DP matching, an HMM (hidden Markov model), or a neural network model. The determination unit 26 then outputs, as the recognition result, the recognition target word with the highest similarity calculated by the collation unit 25. Although the speech recognition device 20 of this embodiment recognizes word by word as described above, it may instead recognize in units of phrases or sentences.
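
The text does not state how the LPC cepstrum is obtained from the LPC coefficients. One standard route is the recursion for the cepstrum of the all-pole model 1/A(z), sketched below under the assumed convention A(z) = 1 + a1 z^-1 + ... + aP z^-P. It also shows how more cepstral orders than LPC orders (17 from 12 in the embodiment) can be produced. The function name and the test model are hypothetical.

```python
def lpc_to_cepstrum(a, n_ceps):
    """Cepstrum c[1..n_ceps] of the all-pole model 1 / A(z).

    a : [1, a1, ..., aP] with A(z) = 1 + a1*z^-1 + ... + aP*z^-P
        (sign convention is an assumption; flip signs for the other one).
    """
    p = len(a) - 1
    c = [0.0] * (n_ceps + 1)               # c[0] (gain term) is not computed here
    for n in range(1, n_ceps + 1):
        # only terms with 1 <= n - k <= p contribute
        acc = sum(k * c[k] * a[n - k] for k in range(max(1, n - p), n))
        a_n = a[n] if n <= p else 0.0
        c[n] = -a_n - acc / n
    return c[1:]

ceps = lpc_to_cepstrum([1.0, -0.5], 4)     # first-order model with a pole at 0.5
```

For a single pole at 0.5 the exact cepstrum is 0.5^n / n, which the recursion reproduces.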

【００５９】Thus, in the speech recognition system of this embodiment, the noise power spectrum subtraction unit 15 of the noise suppression device 10 subtracts, from the power spectrum PS(f, t) calculated from the input signal of the speech section, the noise power spectrum PNAV(f) multiplied by a predetermined subtraction coefficient (here 1.2), and the mapping calculation unit 17 computes PSRCNN(f, t). Because the subtraction coefficient of 1.2 is greater than 1, the result of the subtraction can be negative. Since a power spectrum cannot in theory be negative, the power spectrum clipping calculation unit 16 clips those portions to zero or to a relatively small positive constant. As a result, the noise-suppressed power spectrum output by the power spectrum clipping calculation unit 16 carries a characteristic distortion, and using it directly for speech recognition lowers the recognition rate.

【００６０】Distortion is therefore reduced by using a neural network model to compute a mapping from the power spectrum distorted by clipping to a power spectrum in which the effect of clipping is corrected. A neural network model is capable of mapping an arbitrary input signal (scalar or vector) to an arbitrary output signal (scalar or vector). That is, with a sufficiently large number of neurons, any input-output relationship can be realized. This has been proved in, for example, Kenichi Funahashi, "On the capability of neural networks" (IEICE Technical Report vol. 88, No. 126, MBE88-52). Accordingly, if the parameters of the neural network model are adjusted, that is, if the model is trained appropriately, the mapping calculation unit 17 can compute, from the power spectrum distorted by clipping, a power spectrum in which that distortion is reduced.

【００６１】Next, the adjustment of the parameters of the neural network model constituting the mapping calculation unit 17 of the noise suppression device 10 in the speech recognition system of this embodiment is described. As described above, the determination unit 26 of the speech recognition device 20 outputs, as the recognition result, the recognition target word with the highest similarity calculated by the collation unit 25. This recognition result is output to a speech synthesis unit 41 connected externally to the speech recognition device 20.

【００６２】The speech synthesis unit 41 synthesizes the output of the determination unit 26 as a digital signal of a predetermined frequency, and a speech output unit 42 converts this digital signal into an analog speech signal and outputs it. As a result, the recognized word is output as speech from a speaker 43.

【００６３】The determination unit 26 then waits for a confirmation signal from outside. In this embodiment, the confirmation signal is output when a confirmation switch (not shown) is pressed. That is, by convention the user presses the confirmation switch when the word output from the speaker 43 has been correctly recognized. Input of the confirmation signal to the determination unit 26 therefore means that the user's speech was recognized correctly, and the system proceeds to the learning operation described below. If no confirmation signal is input to the determination unit 26, the user's speech is taken to have been misrecognized, and no learning operation is performed.

【００６４】The learning operation of the speech recognition system of this embodiment is as follows. The determination unit 26 has a buffer 26a, which stores a noise-free teacher power spectrum for each recognition target word. When the confirmation signal is input as described above, the determination unit 26 outputs the teacher power spectrum corresponding to the recognition result to the learning control unit 18 of the noise suppression device 10.

【００６５】As described above, the buffer 17a of the mapping calculation unit 17 holds the power spectra PSRC(f, t) of the period during which the PTT switch was pressed (the speech section). The power spectra PSRC(f, t) stored in the buffer 17a, or a portion of them, therefore correspond to the recognized word.

【００６６】The learning control unit 18 therefore adjusts the parameters of the neural network of the mapping calculation unit 17, using the teacher power spectrum corresponding to the recognition result output from the determination unit 26 as the teacher output and the power spectrum PSRC(f, t) (where t is the time index of the frames making up the recognized word) as the input. This adjustment may be performed, for example, by the backpropagation learning method (see Gijutsu-Hyohron Co., "Neurocomputer", Chapter 2, Backpropagation, pp. 28-84).
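
A confirmation-gated backpropagation update can be sketched as follows. This is an illustration under stated assumptions, not the patent's procedure: the network form (sigmoid hidden layer with bias w[h][0], linear output layer), the squared-error loss, the learning rate, the tiny 2-2-2 network, and the one-to-one pairing of input frames with teacher frames are all assumptions, and every name is hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(x, w, W):
    """Hidden layer: sigmoid with bias w[h][0]; output layer: linear (assumed)."""
    hidden = [sigmoid(row[0] + sum(wi * xi for wi, xi in zip(row[1:], x))) for row in w]
    out = [sum(Wfh * hh for Wfh, hh in zip(row, hidden)) for row in W]
    return hidden, out

def squared_error(x, target, w, W):
    _, out = forward(x, w, W)
    return 0.5 * sum((o - t) ** 2 for o, t in zip(out, target))

def backprop_step(x, target, w, W, lr=0.05):
    """One in-place gradient-descent update on the squared error for one frame."""
    hidden, out = forward(x, w, W)
    delta_out = [o - t for o, t in zip(out, target)]
    # error signal at each hidden unit, computed before W is modified
    delta_hid = [hh * (1.0 - hh) * sum(delta_out[f] * W[f][h] for f in range(len(W)))
                 for h, hh in enumerate(hidden)]
    for f, row in enumerate(W):
        for h in range(len(row)):
            row[h] -= lr * delta_out[f] * hidden[h]
    for h, row in enumerate(w):
        row[0] -= lr * delta_hid[h]
        for i, xi in enumerate(x):
            row[i + 1] -= lr * delta_hid[h] * xi

def learn_if_confirmed(confirmed, frames, teacher_frames, w, W):
    """Adjust parameters only when the recognition result was confirmed correct."""
    if not confirmed:
        return False
    for x, target in zip(frames, teacher_frames):
        backprop_step(x, target, w, W)
    return True

# Tiny hypothetical 2-2-2 network standing in for the 129-100-129 one.
w = [[0.0, 0.1, -0.2], [0.0, 0.3, 0.1]]
W = [[0.2, -0.1], [0.1, 0.4]]
x, target = [1.0, 0.0], [1.0, 2.0]
before = squared_error(x, target, w, W)
learn_if_confirmed(True, [x], [target], w, W)
after = squared_error(x, target, w, W)
```

With `confirmed=False` the parameters are left untouched, mirroring the rule that no learning takes place after a misrecognition.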

【００６７】A specific example of the training procedure for the neural network constituting the mapping calculation unit 17 of the noise suppression device 10 of this embodiment follows. First, the parameters are adjusted so that the neural network outputs its input signal unchanged. To do this, a plurality of power spectra are created from the speech of a plurality of people, for example 10 men and 10 women, and the parameters are adjusted by the backpropagation learning method using each power spectrum as both the input and the teacher output of the neural network.

【００６８】The mapping calculation unit 17 described above is built from the neural network model trained in this way. The noise suppression device 10 consequently outputs the output PSRC(f, t) of the power spectrum clipping calculation unit 16 unchanged: the output PSRCNN(f, t) of the mapping calculation unit 17 equals its input PSRC(f, t). In other words, a power spectrum PSRCNN(f, t) in which the effect of clipping is not corrected at all is output to the speech recognition device 20.

【００６９】Even in this case, although the recognition rate is lowered by the clipping distortion, speech is sometimes recognized correctly. When it is, the parameters are adjusted as described above, with the teacher power spectrum stored in the buffer 26a of the determination unit 26 of the speech recognition device 20 as the teacher output of the neural network and the power spectrum PSRC(f, t) stored in the buffer 17a of the mapping calculation unit 17 as its input. The parameters are adjusted again each time speech input by the user is recognized correctly. As the parameters are adjusted, the recognition rate rises, so learning takes place more and more often.

【００７０】Thus, in the speech recognition system of this embodiment, the parameters of the neural network of the mapping calculation unit 17 are adjusted each time speech is recognized correctly while the system is in operation. Consequently, in any environment, whatever the type of noise generated, the distortion that arises when the spectrum subtraction method is used to suppress noise in an input signal containing both the target speech and noise can be corrected appropriately. Moreover, because a neural network model can realize an arbitrary mapping, as the parameter adjustment (that is, the learning) of the neural network constituting the mapping calculation unit 17 progresses, a distorted spectrum can be mapped to a spectrum close to the teacher spectrum, giving a near-optimal correction of the spectral distortion. This contributes to a higher recognition rate in speech recognition.

[0071] In this embodiment, the cutting-out function of the input speech cutout unit 12 corresponds to the “frame division means”. The input speech cutout unit 12 starts its cutout processing when the voice input detection signal is input, and the noise power spectrum estimation unit 14 stops estimating the noise power spectrum when the voice input detection signal is input; this corresponds to changing the processing according to the result of the “determination means” distinguishing the voice section from the noise section. The frame power spectrum calculation unit 13 corresponds to the “spectrum calculation means”, and the noise power spectrum estimation unit 14 corresponds to the “noise spectrum estimation means”. Further, the noise power spectrum subtraction unit 15 corresponds to the “subtraction means”, the power spectrum clipping calculation unit 16 corresponds to the “clipping means”, and the mapping calculation unit 17 corresponds to the “mapping calculation means”. The buffer 17a of the mapping calculation unit 17 corresponds to the “noise subtraction spectrum storage means”.

[0072] The determination unit 26 of the speech recognition device 20 corresponds to the “recognition means” and the “teacher spectrum output means”, and the buffer 26a of the determination unit 26 corresponds to the “teacher spectrum storage means”. The present invention is in no way limited to the embodiment described above and can be implemented in various forms without departing from the gist of the invention.

[0073] (1) For example, in the above embodiment the power spectrum PSRCNN(f, t) output from the mapping calculation unit 17 is passed to the inverse Fourier transform unit 21 of the speech recognition device 20. Instead, the noise suppression device 10 may be provided with an inverse Fourier transform unit, so that the power spectrum PSRC(f, t) output from the power spectrum clipping calculation unit 16 is inverse-Fourier-transformed into autocorrelation coefficients, after which the mapping calculation unit 17 performs the mapping calculation. Using autocorrelation coefficients in this way reduces distortion just as well, and it also reduces the memory capacity and processing load required in the downstream speech recognition device 20.

[0074] This exploits the fact that the inverse Fourier transform of a power spectrum is the autocorrelation coefficient. That is, with the autocorrelation coefficient denoted C(r, t) and the inverse Fourier transform denoted F^{-1}, the relationship to the power spectrum P(f, t) is:

C(r, t) = F^{-1}[P(f, t)]

Here r is the index of the autocorrelation coefficient and corresponds to the frequency f in the power spectrum.
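
The relation C(r, t) = F^{-1}[P(f, t)] can be checked numerically for a single frame. The sketch below is illustrative only: the function names and the sample frame are assumptions, and a plain discrete Fourier transform written with the standard library is used rather than any particular FFT routine. Note that with a finite DFT the result is the circular autocorrelation of the frame.

```python
import cmath


def dft(x):
    """Discrete Fourier transform of a real or complex sequence."""
    n = len(x)
    return [sum(x[k] * cmath.exp(-2j * cmath.pi * f * k / n) for k in range(n))
            for f in range(n)]


def inverse_dft(spectrum):
    """Inverse discrete Fourier transform (with 1/n normalization)."""
    n = len(spectrum)
    return [sum(spectrum[f] * cmath.exp(2j * cmath.pi * f * r / n) for f in range(n)) / n
            for r in range(n)]


def circular_autocorrelation(x):
    """Autocorrelation computed directly in the time domain, with wrap-around."""
    n = len(x)
    return [sum(x[k] * x[(k + r) % n] for k in range(n)) for r in range(n)]


frame = [0.5, -1.0, 2.0, 0.25, -0.75, 1.5, 0.0, -0.5]   # made-up frame signal
power = [abs(s) ** 2 for s in dft(frame)]                # power spectrum P(f)
c_from_power = [c.real for c in inverse_dft(power)]      # C(r) = F^{-1}[P(f)]
c_direct = circular_autocorrelation(frame)
```

Both routes give the same coefficients up to rounding error, which is why the mapping can operate on either representation.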

[0075] Because the power spectrum and the autocorrelation coefficients are equivalent in this sense, the same result, namely an output with reduced distortion, is obtained whichever is used. If the conversion to autocorrelation coefficients is performed in the noise suppression device 10 as described above, the speech recognition device 20 no longer needs the processing of the inverse Fourier transform unit 21, so its processing load and memory capacity can be reduced.

[0076] (2) In the above embodiment, the power spectrum PS(f), obtained by squaring the amplitude of the frequency spectrum S(f) produced by the Fourier transform, is used together with the corresponding noise power spectrum PN(f). Instead, the amplitude spectrum A(f), which is the amplitude component of the frequency spectrum S(f), may be used as it is. In that case, the noise amplitude spectrum AN(f) is estimated, and the noise amplitude spectrum ANAV(f) multiplied by a predetermined subtraction coefficient is subtracted from the amplitude spectrum AS(f) calculated from the input signal of the voice section.
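
The amplitude-domain subtraction described above, followed by the clipping of negative values, might look like the following sketch. The function name, the coefficient value, the floor value, and the sample spectra are assumptions chosen for illustration; the source does not give concrete numbers.

```python
def subtract_noise_amplitude(speech_amp, noise_amp, coefficient=1.0, floor=0.0):
    """Subtract coefficient * noise amplitude per frequency bin, clipping negatives.

    `floor` stands for the "zero or relatively small positive constant" that
    replaces any negative difference.
    """
    cleaned = []
    for a_s, a_n in zip(speech_amp, noise_amp):
        diff = a_s - coefficient * a_n
        cleaned.append(diff if diff >= 0.0 else floor)
    return cleaned


speech_amp = [4.0, 1.0, 3.0, 0.5]   # made-up AS(f) for one frame
noise_amp = [1.0, 1.0, 0.5, 1.0]    # made-up ANAV(f)
cleaned = subtract_noise_amplitude(speech_amp, noise_amp, coefficient=2.0)
# cleaned == [2.0, 0.0, 2.0, 0.0]
```

The bins that end up at the floor value are where the clipping distortion arises, which is what the mapping calculation is meant to correct.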

[0077] However, as described above, it is the power spectrum P(f, t) to which the autocorrelation coefficient C(r, t) is equivalent. When the amplitude spectrum is used, conversion to autocorrelation coefficients is therefore not possible, and the benefit of using autocorrelation coefficients cannot be obtained. Looked at the other way, the reason for converting to autocorrelation coefficients in the noise suppression device 10 is that passing them to the speech recognition device 20 reduces the processing load and memory capacity of the speech recognition device 20; if that advantage is not needed, the conversion need not be performed.

[0078] (3) In the above embodiment, the mapping calculation unit 17 calculates the mapping of the power spectrum PSRC(f) at each time t corresponding to frame-by-frame processing. Alternatively, the mapping of the power spectrum PSRC(t) over a predetermined period may be calculated for each frequency f. When conversion to autocorrelation coefficients is performed as described in (1), the mapping of the autocorrelation coefficients CSS(r) may be calculated at each time t corresponding to frame-by-frame processing, or the mapping of the autocorrelation coefficients CSS(t) over a predetermined period may be calculated for each index r.
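
The two ways of slicing the data can be illustrated with a small sketch. Everything here is an assumption for illustration: the data layout `spectrogram[t][f]`, the function names, and in particular `smooth`, a simple moving average that stands in for the neural-network mapping only so that the two slicing directions give visibly different results.

```python
def map_per_frame(spectrogram, mapping):
    """Apply `mapping` to each frame's spectrum (all frequencies at one time t)."""
    return [mapping(frame) for frame in spectrogram]


def map_per_frequency(spectrogram, mapping):
    """Apply `mapping` to each frequency's values over the period, keeping [t][f] layout."""
    trajectories = [list(column) for column in zip(*spectrogram)]
    mapped = [mapping(trajectory) for trajectory in trajectories]
    return [list(row) for row in zip(*mapped)]


def smooth(values):
    """Stand-in mapping (not a neural network): 3-point average with edge replication."""
    padded = [values[0]] + list(values) + [values[-1]]
    return [(padded[i] + padded[i + 1] + padded[i + 2]) / 3.0
            for i in range(len(values))]


# Two frames (time t) by three frequency bins (f); made-up values.
spectrogram = [[0.0, 3.0, 6.0],
               [0.0, 3.0, 6.0]]
by_frame = map_per_frame(spectrogram, smooth)          # mixes neighbouring frequencies
by_frequency = map_per_frequency(spectrogram, smooth)  # mixes neighbouring times
```

With these values the per-frame mapping changes the spectrum along the frequency axis, while the per-frequency mapping leaves it unchanged because each bin is constant over time.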

[0079] (4) Furthermore, in the above embodiment, the determination unit 26 of the speech recognition device 20 judges whether the recognition result is correct on the basis of a confirmation signal input from outside. That is, the recognition result is output as synthesized speech from the speaker 43 so that the person who spoke can check it, and the judgment is based on the confirmation signal that is input when that person presses the confirmation switch because the recognition result is correct.

[0080] Alternatively, the determination unit 26 may, for example, compare the similarities calculated by the matching unit 25 for the vocabulary items to be recognized and judge the recognition to be correct when the difference between the highest similarity and the second-highest similarity is at least a predetermined value. For example, with three vocabulary items A, B, and C, if their similarities are 90 percent, 20 percent, and 30 percent, the recognition result A is very likely correct, whereas if their similarities are 50 percent, 45 percent, and 40 percent, the recognition result A is less likely to be correct. Having the determination unit 26 judge the correctness of the recognition result in this way is convenient because the person speaking no longer has to do so as described above.
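
The margin test in this paragraph can be sketched as below, using the two similarity sets given in the text. The function name `judge_recognition` and the threshold of 30 percentage points are assumptions; the source says only that the difference must be at least "a predetermined value".

```python
def judge_recognition(similarities, min_margin):
    """Return (best_word, is_judged_correct) from a word -> similarity mapping.

    The result is judged correct when the best similarity exceeds the
    second-best by at least `min_margin`. Requires at least two candidates.
    """
    ranked = sorted(similarities.items(), key=lambda item: item[1], reverse=True)
    (best_word, best), (_, runner_up) = ranked[0], ranked[1]
    return best_word, (best - runner_up) >= min_margin


clear_case = judge_recognition({"A": 90, "B": 20, "C": 30}, min_margin=30)
close_case = judge_recognition({"A": 50, "B": 45, "C": 40}, min_margin=30)
# clear_case == ("A", True): margin is 90 - 30 = 60
# close_case == ("A", False): margin is 50 - 45 = 5
```

Only the first case would trigger a learning step, since a narrow margin gives little assurance that the teacher spectrum for A is the right target.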

[0081] (5) In the above embodiment, a PTT switch is provided so that the speaker can designate the period during which speech is input: the user inputs speech while holding down the PTT switch, and the period during which the PTT switch is pressed is regarded as the voice section. Instead, the voice section and the noise section may be determined from the actual input signal, for example on the basis of the power of the input signal.
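
One simple way to make the power-based determination is a fixed threshold on the mean power of each frame. This is a minimal sketch under that assumption: the source does not specify the decision rule, and the function names, the threshold, and the sample frames are made up for illustration.

```python
def frame_power(frame):
    """Mean squared amplitude of one frame."""
    return sum(sample * sample for sample in frame) / len(frame)


def label_sections(frames, threshold):
    """Label each frame "voice" if its power reaches the threshold, else "noise"."""
    return ["voice" if frame_power(frame) >= threshold else "noise"
            for frame in frames]


frames = [
    [0.01, -0.02, 0.01, 0.0],   # quiet background
    [0.5, -0.6, 0.4, -0.5],     # loud, speech-like
    [0.02, 0.01, -0.01, 0.0],   # quiet background
]
labels = label_sections(frames, threshold=0.01)
# labels == ["noise", "voice", "noise"]
```

Frames labeled "noise" would feed the noise spectrum estimate, and frames labeled "voice" would go through subtraction, clipping, and mapping.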

[Brief description of the drawings]

FIG. 1 is a block diagram showing the schematic configuration of a speech recognition system according to an embodiment of the present invention.

FIG. 2 is an explanatory diagram showing an outline of a conventional speech recognition system.

[Explanation of symbols]

10: Noise suppression device
11: Voice input unit
12: Input speech cutout unit
13: Frame power spectrum calculation unit
14: Noise power spectrum estimation unit
15: Noise power spectrum subtraction unit
16: Power spectrum clipping calculation unit
17: Mapping calculation unit
17a: Buffer
18: Learning control unit
20: Speech recognition device
21: Inverse Fourier transform unit
22: LPC analysis unit
23: Cepstrum calculation unit
24: Standard pattern storage unit
25: Matching unit
26: Determination unit
26a: Buffer
30: Microphone
41: Speech synthesis unit
42: Speech output unit
43: Speaker
200: Speech recognition system
201: Microphone
203: Noise suppression device
204: Speech recognition device
205: PTT switch

Claims

[Claims]

1. A speech recognition system comprising: a noise suppression device having frame division means for cutting out an input signal as a frame signal at every predetermined processing time, spectrum calculation means for calculating a spectrum from the frame signal, determination means for determining a voice section in which the input signal includes voice and a noise section in which it does not, noise spectrum estimation means for estimating a noise spectrum using the spectrum calculated from the input signal of the noise section determined by the determination means, subtraction means for subtracting, from the spectrum calculated from the input signal of the voice section, the noise spectrum estimated by the noise spectrum estimation means multiplied by a predetermined subtraction coefficient, and clipping means for calculating a noise subtraction spectrum by clipping the negative portions of the subtraction result to zero or a relatively small positive constant; and a speech recognition device having recognition means for comparing the output from the noise suppression device with a plurality of comparison target pattern candidates stored in advance and taking a pattern with a high degree of coincidence as the recognition result, wherein the speech recognition device further comprises teacher spectrum storage means for storing a noise-free teacher spectrum prepared for each of the comparison target pattern candidates, and teacher spectrum output means for outputting the teacher spectrum corresponding to the recognition result of the recognition means to the noise suppression device, and wherein the noise suppression device further comprises noise subtraction spectrum storage means for storing the noise subtraction spectrum, mapping calculation means, configured using a neural network model capable of mapping the noise subtraction spectrum calculated by the clipping means to a spectrum in which the distortion due to the clipping is reduced, for calculating a mapping of the noise subtraction spectrum, and learning control means for, when the recognition result of the recognition means of the speech recognition device is correct, adjusting the parameters of the neural network constituting the mapping calculation means, using the teacher spectrum output from the speech recognition device and corresponding to the output from the mapping calculation means as the teacher output of the neural network, and using the noise subtraction spectrum corresponding to that teacher spectrum as the input to the neural network.

2. The speech recognition system according to claim 1, wherein the spectrum is an amplitude spectrum that is an amplitude component of a frequency spectrum obtained by performing a Fourier transform on a frame signal.

3. The speech recognition system according to claim 1, wherein the spectrum is a power spectrum obtained by squaring an amplitude spectrum that is the amplitude component of a frequency spectrum obtained by performing a Fourier transform on the frame signal.

4. The speech recognition system according to claim 3, further comprising autocorrelation coefficient calculation means for calculating an autocorrelation coefficient based on the noise subtraction spectrum calculated by the clipping means, wherein the mapping calculation means is configured using a neural network model capable of mapping the autocorrelation coefficient calculated by the autocorrelation coefficient calculation means to an autocorrelation coefficient in which the distortion due to the clipping is reduced.

5. The speech recognition system according to claim 1, wherein said mapping calculation means is configured using a feedforward neural network model.

6. The speech recognition system according to claim 1, wherein the recognition result of the recognition means is reported, and whether the recognition result is correct is determined on the basis of a confirmation signal input from outside in response to the report.

7. The speech recognition system according to claim 1, wherein the recognition result of the recognition means is determined to be correct when the degree of coincidence of the pattern judged by the recognition means to have a high degree of coincidence differs from the degree of coincidence of the other comparison target pattern candidates by at least a predetermined value.

8. The speech recognition system according to claim 1, wherein the determination means is configured to determine the voice section and the noise section based on the power of the input signal.

9. The speech recognition system according to claim 1, wherein the noise suppression device further comprises input period specification means provided for the speaker to specify the period during which voice is input, and the determination means is configured to determine the input period specified by the input period specification means to be the voice section.