JP2011033717A

JP2011033717A - Noise suppression device

Info

Publication number: JP2011033717A
Application number: JP2009178117A
Authority: JP
Inventors: Kazuyoshi Fukushi; 和義福士
Original assignee: Secom Co Ltd
Current assignee: Secom Co Ltd
Priority date: 2009-07-30
Filing date: 2009-07-30
Publication date: 2011-02-17

Abstract

PROBLEM TO BE SOLVED: To process acoustic signals input from two microphones and generate a noise-suppressed signal.
SOLUTION: A noise suppression device calculates a cross spectrum from the acoustic signals obtained with the two microphones; measures the degree of time variation of the phase component of the cross spectrum; treats frequency components with small variation as voice components and those with large variation as noise components; calculates a correction coefficient that suppresses the amplitude of the noise components; and uses the calculated correction coefficient to generate a noise-suppressed signal from the acoustic signal.
COPYRIGHT: (C)2011, JPO&INPIT

Description

The present invention relates to a noise suppression device for clearly detecting speech in a noisy environment.

Noise suppression devices have been proposed that suppress the noise component of an acoustic signal input from a microphone in a noisy environment so that the speech component can be detected accurately.
A representative method is spectral subtraction (for example, Non-Patent Document 1; hereinafter, the SS method).
The SS method measures the power spectrum of the noise in a section containing no utterance and subtracts that noise spectrum from the power spectrum observed during speech, thereby estimating the power spectrum of the target speech. Many improvements to the SS method have been proposed (for example, Patent Document 1).
Non-Patent Document 2 discloses an adaptive array technique that removes noise using two-channel microphone input.
This technique emphasizes the speech in a main path formed from the sum of the two microphone signals and generates a reference signal that does not contain the target speech in a sub path formed from their difference. It then suppresses noise by transforming the reference signal from the sub path and subtracting it from the main path, removing the noise component contained there.
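As a rough illustration of the SS method described above, the sketch below subtracts a noise power spectrum from an observed power spectrum bin by bin. The function name, the sample values, and the `floor` parameter are assumptions made for illustration and do not come from the cited documents; flooring at zero is simply one common way to handle bins where the noise estimate exceeds the observation.

```python
def spectral_subtraction(noisy_power, noise_power, floor=0.0):
    """Subtract an estimated noise power spectrum bin by bin.

    `floor` is an assumed safeguard (not from the text) that keeps the
    result from going negative when the noise estimate is too large.
    """
    return [max(p - n, floor) for p, n in zip(noisy_power, noise_power)]


# Hypothetical four-bin spectra.
noise_estimate = [1.0, 1.0, 1.0, 1.0]  # measured while nobody is speaking
observed = [5.0, 0.8, 3.0, 1.2]        # measured during an utterance

estimated = spectral_subtraction(observed, noise_estimate)
print(estimated)
```

In the second bin the noise estimate is larger than the observed power, so the result is clipped to the floor. Bins that are clipped or left with residue in this way are what the error between the measured and the actual noise level produces.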

Patent Document 1: JP-A-8-221092

Non-Patent Document 1: S. F. Boll, "Suppression of acoustic noise in speech using spectral subtraction", IEEE Trans. Acoust. Speech Signal Process., vol. 27, no. 2, pp. 113-120, 1979
Non-Patent Document 2: L. J. Griffiths and C. W. Jim, "An alternative approach to linearly constrained adaptive beamforming", IEEE Transactions on Antennas and Propagation, 1982, vol. 30, p. 27

However, environmental noise fluctuates constantly, so with the SS method an error arises between the noise level measured while no one is speaking and the noise level actually present during an utterance. Spectral subtraction therefore leaves residual noise or subtracts too much, producing the noise known as musical noise, which is unpleasant to the human ear.
To counter this, Patent Document 1 addresses over-subtraction by changing the spectral subtraction coefficient in pause sections within the speech. However, as the noise level rises it becomes difficult to tell whether speech is present, and the noise can no longer be measured correctly. The SS method also cannot cope with sudden noise.

The adaptive array technique of Non-Patent Document 2 has the problem that it is difficult to create the reference signal, obtained as the difference between the two microphone signals, so that it contains no speech signal. This is because a two-channel microphone array can form a null (the blind spot that excludes the speech signal) in only one direction. In an ordinary reverberant environment, the speech signal reaches the microphones not only along the direct path but also as components reflected from walls and other objects.
Even if the high-power direct wave can be cancelled, the reflected waves may not be. As a result, the residual speech component in the reference signal drives the adaptive processing toward removing the speech signal, which is the target signal, when noise is subtracted from the main path, and the quality of the processed signal is severely degraded.
The adaptive array technique also needs a timing signal for running the adaptive processing (whether or not speech is being uttered). In an ordinary noisy environment this signal is wrong fairly often, so the adaptive processing can proceed in an unintended direction.

An object of the present invention is therefore to solve these problems by realizing a noise suppression device that uses the input signals from two microphones and can accurately suppress only the noise component by a simple method.

The present invention provides a noise suppression device that processes acoustic signals acquired by two sound collectors and suppresses noise components. The device comprises a phase difference variation evaluation unit that evaluates, for each frequency component, the degree of temporal variation of the phase difference between the two acoustic signals, and an amplitude correction unit that treats frequency components with a large degree of temporal variation of the phase difference as noise, calculates an amplitude correction coefficient that reduces the amplitude of those frequency components, and applies the amplitude correction coefficient to the acoustic signal to output a signal in which the noise components are suppressed.

In a preferred aspect of the present invention, the phase difference variation evaluation unit includes a cross spectrum calculation unit that calculates the cross spectrum of the two acoustic signals at predetermined intervals, a buffering unit that stores a predetermined number of the calculated cross spectra, and a fluctuation measurement unit that calculates, at predetermined intervals, the degree of temporal variation of the phase component of the cross spectra stored in the buffering unit as the degree of temporal variation of the phase difference between the acoustic signals.

In a preferred aspect of the present invention, the amplitude correction unit multiplies the amplitude component of the cross spectrum by the amplitude correction coefficient to calculate a cross spectrum in which the noise component is suppressed.

In a further preferred aspect of the present invention, the amplitude correction unit includes a whitening unit that whitens the amplitude component of the cross spectrum, and suppresses the noise component by multiplying the whitened amplitude component of the cross spectrum by the amplitude correction coefficient.

In another preferred aspect of the present invention, the amplitude correction unit includes a filter coefficient calculation unit that calculates filter coefficients by applying an inverse Fourier transform to the amplitude correction coefficient, and generates a noise-suppressed acoustic signal by applying the filter coefficients to either one of the two acoustic signals or to a signal synthesized from them.

Applying the noise suppression device of the present invention to an utterance detection device yields an utterance detection device that is unlikely to react to speech or noise arriving from directions other than the target direction.
Furthermore, a sound reproduction device to which the present invention is applied can provide easy-to-hear speech from which the influence of stationary environmental noise has been removed.

FIG. 1 is a block diagram of the noise suppression device according to the present invention applied to an utterance detection device (first embodiment).
FIG. 2 is an arrangement view of an utterance detection device for recognizing an ATM user at a financial institution as a speaker.
FIG. 3 is a block diagram of the noise suppression device according to the first embodiment of the present invention.
FIG. 4 is a block diagram of the noise suppression device according to the present invention applied to a sound reproduction device (second embodiment).
FIG. 5 is an example of the cross spectrum obtained from the acoustic signals input from two microphones.
FIG. 6 is an example of the spectrum obtained for a section of the acoustic signal that contains speech.
FIG. 7 is an operation flow for performing the noise suppression processing.
FIG. 8 is a block diagram of the noise suppression device according to the second embodiment of the present invention.

Embodiments to which the noise suppression device according to the present invention is applied are described below with reference to the drawings.
(First embodiment)
This section describes an example in which the noise suppression device according to the present invention is used in an utterance detection device that detects that an operator is talking on a mobile phone in front of a CD/ATM at a financial institution.
To prevent the bank transfer fraud that has been increasing in recent years, utterance detection devices have been proposed that issue a warning from a speaker or the like when they detect that the operator of a financial institution's CD/ATM is speaking.

In bank transfer fraud, the criminal may guide the victim through the CD/ATM operation over a mobile phone and have the victim transfer money into the perpetrator's account. An operator who may be caught up in such fraud therefore often operates the machine while talking with the other party on a mobile phone.
To prevent bank transfer fraud, this utterance detection device analyzes the acoustic signals from two microphones installed at the upper left and right ends of the CD/ATM and detects the speech uttered by the operator in front of the CD/ATM.
In the environment where such a CD/ATM is installed, the operating sounds of the CD/ATM and the ambient noise inside and outside the booth are loud, and this ambient noise must be suppressed to detect speech accurately.

FIG. 2 shows an example arrangement of the utterance detection device for detecting utterances of a user 4 of an ATM 3 at a financial institution. The main unit of the utterance detection device 1 is installed on a wall, and two microphones 2 are installed at the left and right ends of the upper part of the ATM 3, separated by a predetermined distance. This embodiment uses two microphones 2, but the invention is not limited to this: three or more microphones may be used in any suitable number and arrangement, in which case the processing described later is performed on pairs of microphones.

Next, the configuration of the utterance detection device 1 to which the noise suppression device according to the present invention is applied is described with reference to FIG. 1. The utterance detection device 1 comprises two microphones 2 serving as sound collectors, an amplifier 10, an A/D converter 11, a noise suppression unit 12 that is the noise suppression device of the present invention, a cross-correlation calculation unit 24, and an utterance detection unit 25.

The microphones 2 are omnidirectional, because it is desirable to collect sound from all directions. The microphones 2 are installed a predetermined distance apart (for example, 50 cm). This distance is chosen, according to the sampling period, the assumed range of distances to the speaker, and so on, so that an utterance by the operator in front of the ATM 3 can be identified.
Note that this predetermined distance is what is needed to detect the direction of the speech accurately; it does not restrict the noise suppression device according to the present invention.
The two microphones 2 should have roughly the same sensitivity and characteristics, but they need not be of especially high quality.

The amplifier 10 amplifies the acoustic signals collected by the microphones 2. Its gain is set as appropriate for the input voltage of the A/D converter 11.
The A/D converter 11 samples the amplified analog acoustic signals on both channels simultaneously at a predetermined sampling frequency and converts them into discrete-time (digital) signals.
The amplifier 10 and the A/D converter 11 are both well-known components, so a detailed description is omitted.

The noise suppression unit 12 comprises a phase difference variation evaluation unit 13 and an amplitude correction unit 14. In the first embodiment, the noise suppression unit 12 obtains the cross spectrum of the two-channel signals input from the A/D converter 11, performs noise suppression on the frequency axis, and outputs a cross spectrum in which noise is suppressed.

The phase difference variation evaluation unit 13 measures, on the frequency axis, whether the speech component or the noise component is dominant in each frequency band. Specifically, it analyzes the signals at very fine time intervals, obtains the cross spectrum of the two-channel signals, and measures the degree of temporal variation of its phase component (the phase difference between the two acoustic signals).
The amplitude correction unit 14 treats frequency components for which the phase difference variation evaluation unit 13 found a large variation in the phase difference between the two microphones as noise, and calculates a noise suppression coefficient that reduces the amplitude of those frequency components. It then performs noise suppression by multiplying the cross spectrum of the original signal by the calculated noise suppression coefficient on the frequency axis. The noise suppression processing is described in detail later.

The cross-correlation calculation unit 24 applies an inverse Fourier transform, at predetermined time intervals, to the noise-suppressed cross spectrum of the two microphones 2 to calculate a cross-correlation function, and outputs it to the utterance detection unit 25.
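The relationship between the cross spectrum and the cross-correlation function can be sketched as follows. This is a minimal, standard-library-only illustration that uses a direct O(N^2) DFT rather than an FFT. The function names, the test signals, and the convention of conjugating the right channel are assumptions, not the patent's implementation, and the result is a circular cross-correlation.

```python
import cmath


def dft(x):
    """Direct discrete Fourier transform (O(N^2), for illustration only)."""
    n = len(x)
    return [sum(x[m] * cmath.exp(-2j * cmath.pi * k * m / n) for m in range(n))
            for k in range(n)]


def idft(spec):
    """Direct inverse discrete Fourier transform."""
    n = len(spec)
    return [sum(spec[k] * cmath.exp(2j * cmath.pi * k * m / n) for k in range(n)) / n
            for m in range(n)]


def cross_correlation(left, right):
    """Circular cross-correlation as the inverse DFT of the cross spectrum."""
    x1, x2 = dft(left), dft(right)
    cross = [a * b.conjugate() for a, b in zip(x1, x2)]  # assumed convention
    return [v.real for v in idft(cross)]


def peak_lag(corr):
    """Index of the largest cross-correlation value."""
    return max(range(len(corr)), key=lambda m: corr[m])


pulse = [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
delayed = [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0]  # same pulse, 2 samples later

print(peak_lag(cross_correlation(pulse, pulse)))    # identical channels
print(peak_lag(cross_correlation(delayed, pulse)))  # left channel delayed
```

With this convention, identical channels give a peak at lag 0, and delaying the left channel by two samples moves the peak to lag 2. The peak position therefore reflects the arrival-time difference between the microphones.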

The utterance detection unit 25 evaluates the height, width, and continuity of the peak in the cross-correlation value sequence calculated by the cross-correlation calculation unit 24 and determines whether an utterance came from the designated direction.
In an acoustic frame without speech, disordered acoustic signals appear at the inputs of the left and right microphones 2, so the cross-correlation values are relatively small. In an acoustic frame with speech, for example when the operator 4 of the ATM 3 speaks, sound from the front appears in phase at the inputs of both microphones 2, so the cross-correlation values are relatively large.
The utterance detection unit 25 therefore judges that speech has been uttered when the peak giving the maximum of the cross-correlation value sequence has a height at or above a certain value and a width at or below a certain value, the peak position is close to the predetermined direction, and these conditions hold over a plurality of frames.
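The decision rule above can be sketched as a simple check over per-frame peak measurements. All threshold values, the tuple layout, and the requirement that the frames be consecutive are assumptions chosen for illustration; the text only states that the conditions must hold over a plurality of frames.

```python
def is_utterance(frames, min_height=0.5, max_width=4, target_lag=0,
                 lag_tolerance=1, min_frames=3):
    """Return True when enough consecutive frames look like speech.

    frames: list of (peak_height, peak_width, peak_lag), one tuple per frame.
    All default thresholds are hypothetical.
    """
    run = 0
    for height, width, lag in frames:
        ok = (height >= min_height
              and width <= max_width
              and abs(lag - target_lag) <= lag_tolerance)
        run = run + 1 if ok else 0  # reset the run when any condition fails
        if run >= min_frames:
            return True
    return False


# Hypothetical measurements: a steady frontal peak versus an interrupted one.
speech_like = [(0.9, 3, 0), (0.8, 2, 1), (0.85, 3, 0)]
noise_like = [(0.9, 3, 0), (0.2, 9, 5), (0.9, 3, 0)]

print(is_utterance(speech_like), is_utterance(noise_like))
```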

The configuration in which the noise suppression device according to the present invention is applied to the utterance detection device 1 has been described above.
The noise suppression unit 12, which is the noise suppression device according to the present invention, can be implemented as part of the software constituting the utterance detection device 1. It can also be implemented as a module that has a two-channel signal input function and outputs a noise-suppressed signal.

Next, the noise suppression principle of the noise suppression unit 12, which is the noise suppression device of the present invention, is described in detail.
In general, when a speech signal such as human conversation is analyzed, frequency analysis is performed with an analysis window of 20 ms to 40 ms at an analysis period (shift width) of 10 ms to 20 ms. This is based on the fact that the statistical properties of a speech signal do not change over an interval of about 10 ms.
FIG. 6 shows an example power spectrum obtained by cutting out the voiced part of a female voice in a noisy environment with a 30 ms Hamming window and analyzing it (horizontal axis: frequency [Hz]; vertical axis: intensity [dB]).
Below 1 kHz, the speech component of this signal has sharp peaks at 300 Hz, 600 Hz, and 900 Hz and can be recognized as speech. Above 1 kHz, however, it is buried in noise, and it is hard to tell which bands are speech components. Thus it is difficult to judge from a single frame analysis result in which bands the speech component dominates and in which bands the environmental noise dominates.

The noise suppression unit 12 estimates the dominance of the speech component at each frequency accurately by analyzing the signal at time intervals much finer than the usual analysis period. This is described in detail below.
The 600 Hz peak in FIG. 6 is a speech component. FIG. 5(a) shows the cross spectrum (a complex number) of the left and right channels for this 600 Hz component and its transition over 10 ms.
Similarly, FIG. 5(b) shows the transition over 10 ms of the cross spectrum at 1840 Hz, which is a noise component.
FIG. 5 shows the temporal variation of the cross spectrum at a specific frequency. The cross spectrum is the Fourier transform of the cross-correlation function of the left and right channels.
In FIG. 5, change in the circumferential direction indicates the temporal variation of the phase of the cross spectrum at the specific frequency, that is, the variation of the relative phase difference between the acoustic signals input from the two channels. Change in the radial direction indicates the temporal variation of the amplitude of the cross spectrum at the specific frequency, that is, the variation of the product of the amplitudes of the two channels.

In FIG. 5, the black circle is the value at the analysis center frame, and the thick line is the trajectory obtained by analyzing the five frames before and after it, each shifted by 1 ms.
FIG. 5(a) shows that the 600 Hz cross spectrum component, a speech component, varies hardly at all in either phase or amplitude over the 10 ms. FIG. 5(b), in contrast, shows that for the 1840 Hz component, a noise component, the amplitude (product) of the left and right channels changes little over the 10 ms but the phase difference varies greatly.

For a speech component, the characteristics do not change over 10 ms, the sound is strongly directional, and the reverberation characteristics are constant over 10 ms, so the phase difference between the signals input from the two microphones varies little. A noise component, on the other hand, has low directivity, and signals from various sound sources reach the left and right microphones at random, so the phase difference between the signals input from the two microphones varies greatly.
Therefore, by measuring how much the phase of the cross spectrum, that is, the phase difference between the signals input from the two microphones, varies within a set time (here, 10 ms), it is possible to determine whether the frequency component of interest is dominated by speech or by noise. Reducing the amplitude of the frequency bands judged to be noise components then achieves noise suppression in effect.
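One way to picture this principle is to score how tightly the cross-spectrum phases cluster within the observation interval. The text up to this point does not state the exact variation measure, so the sketch below uses circular variance as an assumed stand-in, and the mapping from variation to a gain (`1 - variation`) is likewise hypothetical. The sample values are synthetic.

```python
import cmath


def phase_variation(cross_values):
    """Circular variance of the phases of complex cross-spectrum samples.

    Returns a value in [0, 1]: near 0 for a steady phase, near 1 for a
    scattered phase. This particular measure is an assumption.
    """
    units = [v / abs(v) for v in cross_values if abs(v) > 0]
    if not units:
        return 1.0
    mean = sum(units) / len(units)
    return min(1.0, max(0.0, 1.0 - abs(mean)))


def suppression_gain(cross_values):
    """Hypothetical gain: keep steady-phase bins, attenuate scattered ones."""
    return 1.0 - phase_variation(cross_values)


# Eleven synthetic samples per bin (for example, one per 1 ms over 10 ms).
steady = [cmath.rect(1.0, 0.5 + 0.01 * i) for i in range(11)]
scattered = [cmath.rect(1.0, 2.0 * cmath.pi * i / 11) for i in range(11)]

print(suppression_gain(steady), suppression_gain(scattered))
```

The steady-phase bin keeps a gain close to 1, while the bin whose phase is spread around the circle is driven toward 0. This is the intended qualitative behavior of reducing the amplitude of bands judged to be noise.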

The noise suppression principle of the noise suppression unit 12 has been described above.
The description here used an example in which the variation of the phase difference between the acoustic signals input from the two microphones is measured from a cross spectrum computed using the FFT. This has the advantage of reducing the amount of computation. If the amount of computation is not a concern, the phase component may instead be calculated directly from the FFT result of each of the two microphone signals, the phase difference obtained from their difference, and its degree of variation measured. Alternatively, the cross spectrum may be calculated by obtaining the cross-correlation function in the time domain and Fourier transforming it, with the same result.
To distinguish speech components from noise components by measuring the variation of the phase difference, it is appropriate, given the nature of speech described above, to estimate from an interval of about 10 ms and to use an analysis period of about 1 ms, shorter than in ordinary speech analysis. These values may be chosen as appropriate by balancing computation against accuracy.
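The two ways of obtaining the phase difference mentioned above agree once the direct difference is wrapped back into a single turn. The following small check uses made-up bin values; the helper `wrap` and the variable names are illustrative assumptions.

```python
import cmath
import math


def wrap(angle):
    """Wrap an angle in radians into the range [-pi, pi]."""
    return math.atan2(math.sin(angle), math.cos(angle))


x1 = cmath.rect(2.0, 2.5)   # hypothetical left-channel FFT bin
x2 = cmath.rect(0.5, -2.0)  # hypothetical right-channel FFT bin

# Route 1: phase of the cross spectrum (one complex multiply per bin).
from_cross = cmath.phase(x1 * x2.conjugate())

# Route 2: phase of each channel, then the difference, then wrapping.
naive = cmath.phase(x1) - cmath.phase(x2)
from_bins = wrap(naive)

print(from_cross, from_bins, naive)
```

Here the unwrapped difference is 4.5 rad, which lies outside the principal range, whereas the cross-spectrum phase is already wrapped. If the phase is taken from each FFT separately, the difference should be wrapped before its variation is measured.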

Next, the noise suppression procedure of the noise suppression unit 12 is described. As shown in FIG. 3, the noise suppression unit 12 comprises the phase difference variation evaluation unit 13 and the amplitude correction unit 14. The phase difference variation evaluation unit 13 comprises a preprocessing unit 15, a frame cutout unit 16, an FFT calculation unit 17, a cross spectrum calculation unit 18, a buffering unit 19, and a fluctuation measurement unit 20. The amplitude correction unit 14 comprises a whitening unit 21, an amplitude correction coefficient calculation unit 22, and a suppression processing unit 23.
The function of each unit and the processing procedure for noise suppression are described below with reference to the flowchart of FIG. 7 and, where appropriate, the configuration diagram of FIG. 3.

First, in step S10, the preprocessing unit 15 preprocesses the discrete signal converted by the A/D converter 11 at a predetermined sampling frequency (for example, 8 kHz). The preprocessing unit 15 consists of a low-cut filter that removes frequency bands not needed for processing the input discrete signal, for example components at or below 70 Hz, and high-frequency emphasis that compresses the dynamic range of the signal to improve numerical accuracy.
Neither process is essential. The same configuration must be used for the left and right channels, but the low-cut filter may be of either the FIR (Finite Impulse Response) or the IIR (Infinite Impulse Response) type.
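A minimal sketch of such preprocessing follows, assuming a first-order IIR low-cut filter and a first-order pre-emphasis filter. The filter orders, the pre-emphasis coefficient 0.97, and the function names are assumptions; the text only specifies the 70 Hz example, the 8 kHz example, and that both channels must use the same configuration.

```python
import math


def low_cut(samples, cutoff_hz=70.0, rate_hz=8000.0):
    """First-order IIR high-pass filter (an assumed, simple low-cut design)."""
    rc = 1.0 / (2.0 * math.pi * cutoff_hz)
    alpha = rc / (rc + 1.0 / rate_hz)
    out, prev_in, prev_out = [], 0.0, 0.0
    for x in samples:
        prev_out = alpha * (prev_out + x - prev_in)
        prev_in = x
        out.append(prev_out)
    return out


def pre_emphasis(samples, coeff=0.97):
    """High-frequency emphasis y[n] = x[n] - coeff * x[n-1] (coeff assumed)."""
    out, prev = [], 0.0
    for x in samples:
        out.append(x - coeff * prev)
        prev = x
    return out


def preprocess(samples):
    """Apply the same chain to either channel."""
    return pre_emphasis(low_cut(samples))


dc_offset = [1.0] * 2000            # a constant (0 Hz) input
filtered = low_cut(dc_offset)       # decays toward zero
emphasized = pre_emphasis([1.0, 1.0, 1.0])
print(filtered[-1], emphasized)
```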

Next, in step S20, the frame cutout unit 16 cuts frames out of the signal processed by the preprocessing unit 15. The frame cutout unit 16 cuts fixed-length frames (for example, 30 ms) out of the acoustic signal at a predetermined shift width. The shift width is the analysis period and, as described above, is much shorter than in ordinary speech analysis; here it is 1 ms. Each frame is cut out by multiplying the acoustic signal by a Hamming window as the window function. The window function is not limited to the Hamming window; a Hanning window or the like may also be used.
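With the example values in the text (8 kHz sampling, 30 ms frames, 1 ms shift), each frame holds 240 samples and successive frames start 8 samples apart. The sketch below illustrates this framing with a Hamming window; the function names and the constant test signal are assumptions for illustration.

```python
import math

RATE_HZ = 8000
FRAME_LEN = RATE_HZ * 30 // 1000  # 30 ms -> 240 samples
SHIFT = RATE_HZ * 1 // 1000       # 1 ms  -> 8 samples


def hamming(n):
    """Hamming window of length n."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * i / (n - 1)) for i in range(n)]


def cut_frames(samples, frame_len=FRAME_LEN, shift=SHIFT):
    """Cut overlapping, windowed frames out of a sample sequence."""
    window = hamming(frame_len)
    frames = []
    for start in range(0, len(samples) - frame_len + 1, shift):
        frames.append([samples[start + i] * window[i] for i in range(frame_len)])
    return frames


frames = cut_frames([1.0] * 400)  # 50 ms of a constant test signal
print(len(frames), len(frames[0]))
```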

Next, in step S30, the FFT calculation unit 17 performs an FFT (Fast Fourier Transform) on the acoustic signal cut out by the frame cutout unit 16 and converts it into frequency components. With 8 kHz sampling and a 30 ms analysis window, each frame contains 240 samples, so an FFT size of 256 is used. The windowed signal fills the first 240 points, and the remaining 16 points are set to 0.
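The zero-padding step can be sketched as follows, using a small recursive radix-2 FFT written with the standard library only. The implementation and names are illustrative assumptions; in practice an optimized FFT routine would normally be used.

```python
import cmath


def fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    even, odd = fft(x[0::2]), fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def frame_spectrum(frame, fft_size=256):
    """Zero-pad a windowed frame to fft_size points and transform it."""
    padded = list(frame) + [0.0] * (fft_size - len(frame))
    return fft(padded)


spectrum = frame_spectrum([1.0] * 240)  # 240 samples + 16 zeros
print(len(spectrum), abs(spectrum[0]))
```

At 8 kHz sampling with an FFT size of 256, adjacent bins are 8000 / 256 = 31.25 Hz apart.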

Next, in step S40, the cross spectrum calculation unit 18 calculates the cross spectrum from the FFT results of the left and right channels using the following formula (Formula 1).

Here, Y(k, t) is the cross spectrum at frequency index k and frame number t, X₁(k, t) is the FFT result of the left channel, and X₂(k, t) is the FFT result of the right channel. The asterisk (*) denotes the complex conjugate.
Here the cross spectrum is calculated from the Fourier transforms of the left and right channel signals, but it may instead be obtained by first computing the cross-correlation function of the two channels and then taking its Fourier transform.
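A minimal sketch of the per-bin cross spectrum follows. Formula 1 itself is not reproduced in this text, so the sketch assumes the usual form Y(k, t) = X₁(k, t) · X₂*(k, t); which channel carries the conjugate is an assumption, and `cross_spectrum` is a hypothetical name.

```python
def cross_spectrum(x1, x2):
    """Per-bin cross spectrum Y[k] = X1[k] * conj(X2[k]).

    x1 and x2 are lists of complex FFT bins for the left and right
    channels. The placement of the conjugate is an assumed convention.
    """
    if len(x1) != len(x2):
        raise ValueError("both channels must have the same number of bins")
    return [a * b.conjugate() for a, b in zip(x1, x2)]
```

With this convention the phase of each bin of Y is the left-channel phase minus the right-channel phase, which is the inter-channel phase difference that the later steps track over time.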

The calculated cross spectrum is stored in the buffering unit 19. The buffering unit 19 is a ring buffer of a predetermined size; when a new cross spectrum arrives after the buffer is full, the oldest entry is discarded.
The ring buffer only needs to be large enough to observe the temporal fluctuation of the phase difference. As described above, the observation time is 10 ms and a cross spectrum is calculated every 1 ms, so cross spectra covering 5 ms before and after the center frame are sufficient, and the buffer size is set to 11.
In step S50, it is determined whether a preset period has elapsed. This period is the interval at which the noise suppression process described later is performed.
Each time this period (here, 10 ms) elapses, the buffering unit 19 outputs the accumulated cross spectrum data to the downstream fluctuation measurement unit 20 so that noise suppression can be performed.
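The buffering and the periodic trigger of step S50 can be sketched with a fixed-length deque. The class name `CrossSpectrumBuffer` and its methods are hypothetical; the sketch additionally assumes that no pass is triggered until the buffer has filled once, which the text does not state.

```python
from collections import deque

class CrossSpectrumBuffer:
    """Ring buffer holding the most recent `size` cross spectra."""

    def __init__(self, size=11, period=10):
        self.frames = deque(maxlen=size)  # oldest entry is dropped automatically
        self.period = period              # frames between suppression passes
        self.count = 0

    def push(self, spectrum):
        """Store one cross spectrum; return True when a pass is due."""
        self.frames.append(spectrum)
        self.count += 1
        full = len(self.frames) == self.frames.maxlen
        return full and self.count % self.period == 0

    def center(self):
        """Return the middle entry (frame t0) of the buffer."""
        return self.frames[len(self.frames) // 2]
```

With size 11 and period 10, one pass is made for every ten 1 ms analysis frames, and `center()` returns the frame that has five buffered frames on either side.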

In step S60, once per predetermined period, the fluctuation measurement unit 20 uses the cross spectrum sequence temporarily stored in the buffering unit 19 to calculate the degree of phase fluctuation of the cross spectrum, which indicates which bands are dominated by speech components and which by noise components.
In the utterance detection device 1, the utterance determination cycle does not need to be very short. Here the determination cycle is 10 ms, and the data in the buffering unit 19 is processed once for every 10 cross spectra input to the buffering unit.

A specific method by which the fluctuation measurement unit 20 calculates the degree of fluctuation of the phase component is described below.
Let the number of cross spectra handled be 2M + 1 (here, M = 5) and let the frame number of the middle frame be t₀. Then, for example, the evaluation value of Formula 2 can be used as the degree of phase fluctuation of the cross spectrum.

Here, D(k, t) is the phase error evaluation value at frequency index k and frame number t, and θ(k, t) is the phase of the cross spectrum Y(k, t) at frequency index k and frame number t. As is apparent from Formula 2, this evaluation value is the average, over a 10 ms range, of the phase change of the cross spectrum relative to the immediately preceding frame. The smaller the fluctuation of the phase component of the cross spectrum, the closer D(k, t) is to 0; the larger the fluctuation, the larger D becomes.
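Formula 2 is not reproduced in this text, so the following is only a sketch consistent with the description: it takes D(k, t₀) to be the mean absolute frame-to-frame phase change over the 2M + 1 buffered frames. Whether the original uses absolute or squared differences, and how it handles phase wrapping, are assumptions; `phase_fluctuation` is a hypothetical name.

```python
import cmath

def phase_fluctuation(y_bin):
    """Mean absolute frame-to-frame phase change for one frequency bin.

    y_bin holds the 2M+1 buffered cross-spectrum values for one bin,
    ordered from frame t0-M to frame t0+M. This is an assumed reading
    of Formula 2, not a confirmed reproduction of it.
    """
    if len(y_bin) < 2:
        raise ValueError("need at least two frames")
    total = 0.0
    for prev, cur in zip(y_bin, y_bin[1:]):
        # phase(cur * conj(prev)) is theta(t) - theta(t-1), wrapped to [-pi, pi]
        total += abs(cmath.phase(cur * prev.conjugate()))
    return total / (len(y_bin) - 1)
```

Taking the phase of `cur * conj(prev)` avoids spurious jumps of 2π that would appear if the two phases were subtracted directly.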

The processing up to this point is the evaluation, by the phase difference fluctuation evaluation unit 13, of the fluctuation of the phase difference between the signals of the two channels.
Although the value of Formula 2 is used here as the evaluation value of the degree of phase difference fluctuation, other quantities, such as the variance of the phase information or the variance of the cross spectrum, can also be used.

Next, the noise suppression process performed by the amplitude correction unit 14 on the basis of the evaluation value D(k, t) calculated in step S60 is described.
First, in step S70, the amplitude correction coefficient calculation unit 22 calculates an amplitude correction coefficient from the evaluation value D(k, t) calculated by the fluctuation measurement unit 20. Here the function of Formula 3 is used as an example of the amplitude correction coefficient.

The function of Formula 3 outputs a value close to 1 for inputs close to 0, and its output decreases as the absolute value of the input increases. Here γ is a parameter that controls the slope of the correction; the larger its value, the stronger the suppression.
The amplitude correction coefficient calculation unit 22 calculates the amplitude correction coefficient at frame t₀ and frequency k as f(D(k, t₀)).
That is, if the component at frame t₀ and frequency k is noise, D(k, t₀) is large and f(D(k, t₀)) is therefore small; if it is speech, D(k, t₀) is small and f(D(k, t₀)) is therefore large. The result is an amplitude correction coefficient that suppresses noise components.
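Formula 3 is not reproduced in this text. The sketch below substitutes a Gaussian-shaped function, exp(−γx²), purely as an assumed stand-in that has the stated properties (close to 1 near 0, decreasing with |x|, steeper for larger γ); the actual function in the patent may differ. The function names are hypothetical.

```python
import math

def amplitude_correction(d, gamma=1.0):
    """Map a phase fluctuation value d to a gain in (0, 1].

    exp(-gamma * d**2) is an assumed stand-in for Formula 3.
    """
    return math.exp(-gamma * d * d)

def correction_coefficients(fluctuations, gamma=1.0):
    """Gain per frequency bin from the per-bin fluctuation values D(k, t0)."""
    return [amplitude_correction(d, gamma) for d in fluctuations]
```

Any monotonically decreasing, even function of D would play the same role; the choice mainly affects how sharply bins with moderate fluctuation are attenuated.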

As an option, the amplitude correction coefficient may additionally be smoothed across frames. For example, the following update is performed.

Here, A(k, t) is the amplitude correction coefficient at frequency index k and frame number t, and A'(k, t) is the smoothed amplitude correction coefficient at frequency index k and frame number t. β is a parameter that controls the smoothing window length; it is a positive number close to 1 that does not exceed 1.
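The update formula is not reproduced in this text. A first-order recursive average, A'(k, t) = β·A'(k, t−1) + (1−β)·A(k, t), matches the description of β and is assumed in the sketch below; `smooth_coefficients` is a hypothetical name.

```python
def smooth_coefficients(previous, current, beta=0.9):
    """One smoothing update per bin: beta * A'(k, t-1) + (1 - beta) * A(k, t).

    `previous` is the smoothed coefficient list from the last frame and
    `current` the newly computed one. The recursive form is an assumption.
    """
    if not 0.0 < beta <= 1.0:
        raise ValueError("beta must be in (0, 1]")
    return [beta * p + (1.0 - beta) * c for p, c in zip(previous, current)]
```

Values of β nearer to 1 average over more past frames, trading responsiveness for a steadier gain.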

Next, in step S80, the suppression processing unit 23 performs noise suppression.
Before noise suppression, the cross spectrum stored in the buffering unit 19 is whitened. One cross spectrum is taken from the buffering unit 19 in each determination cycle (for example, 10 ms) and whitened (flattened) by the whitening unit 21. The cross spectrum taken out is the center one of the 11 cross spectra stored in the buffering unit 19.
Whitening has the effect of making the cross-correlation function described later pulse-like. Several variations of this processing are possible; one example follows.
A power spectrum is calculated from the cross spectrum and transformed by an inverse FFT to obtain the autocorrelation function of the cross-correlation function. LPC (Linear Predictive Coding) coefficients are obtained from an appropriate number of low-order terms of this sequence and then converted into LPC cepstrum coefficients. The spectrum is flattened on the frequency axis on the basis of the spectral envelope given by these LPC cepstrum coefficients. This whitening is also not essential.

Next, the suppression processing unit 23 multiplies, on the frequency axis, the amplitude component of the cross spectrum whitened by the whitening unit 21 by the amplitude correction coefficient calculated by the amplitude correction coefficient calculation unit 22, and outputs a noise-suppressed cross spectrum sequence. As described above, the cross spectrum sequence is output once per determination cycle (here, 10 ms).
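The multiplication on the frequency axis can be sketched as below. Note that `whiten` here is a deliberately crude substitute (normalising every bin to unit amplitude) and is not the LPC-cepstrum envelope flattening described above; both function names are hypothetical.

```python
def whiten(cross_spec, eps=1e-12):
    """Crude whitening: normalise every bin to unit amplitude, keeping phase.

    This is a simplified stand-in, not the LPC-cepstrum method in the text.
    """
    return [y / max(abs(y), eps) for y in cross_spec]

def suppress(cross_spec, gains):
    """Scale each bin's amplitude by its gain; the phase is left unchanged."""
    if len(cross_spec) != len(gains):
        raise ValueError("one gain per frequency bin is required")
    return [g * y for g, y in zip(gains, cross_spec)]
```

Because the gains are real and non-negative, multiplying a complex bin by its gain changes only the amplitude, so the phase information used for later cross-correlation is preserved.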

In the above example, noise suppression is performed by calculating an amplitude correction coefficient corresponding to the degree of fluctuation of the phase component of the cross spectrum, but the scope of the present invention is not limited to this. As another aspect, frequency components whose degree of phase fluctuation is equal to or greater than a predetermined value may be judged to be noise components, and a correction coefficient that suppresses those frequency components may be introduced.
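The thresholded variant can be sketched as a binary gain. The function name `threshold_coefficients` and the `floor` parameter (the gain given to bins judged to be noise) are illustrative assumptions; the text does not specify how strongly such bins are attenuated.

```python
def threshold_coefficients(fluctuations, limit, floor=0.0):
    """Gain 1.0 for bins below the limit, `floor` for bins at or above it."""
    return [floor if d >= limit else 1.0 for d in fluctuations]
```

A non-zero `floor` leaves a little of the noisy bins in place, which is a common way to soften the audible effect of hard switching.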

This concludes the description of the specific noise suppression processing performed by the noise suppression unit 12, which is the noise suppression device of the present invention. As described above, the noise-suppressed cross spectrum is inverse-Fourier-transformed by the cross-correlation calculation unit 24 and returned to a cross-correlation function. Compared with the unprocessed function, the noise-suppressed cross-correlation function has markedly lower correlation values when no utterance is present. This makes it easy to set the threshold for judging the presence or absence of an utterance in the utterance detection process of the utterance detection unit 25, so that utterances are detected accurately even when environmental noise is high.
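The return from cross spectrum to cross-correlation function can be illustrated with a naive inverse DFT. This is a sketch for clarity only: a real system would use an inverse FFT, and the names `cross_correlation` and `peak_lag` are hypothetical.

```python
import cmath

def cross_correlation(cross_spec):
    """Inverse DFT of a cross spectrum (naive O(N^2) version for clarity).

    Only the real part is kept, which is exact when the spectrum comes
    from real-valued input signals.
    """
    n = len(cross_spec)
    corr = []
    for lag in range(n):
        acc = sum(cross_spec[k] * cmath.exp(2j * cmath.pi * k * lag / n)
                  for k in range(n))
        corr.append(acc.real / n)
    return corr

def peak_lag(corr):
    """Index of the largest correlation value."""
    return max(range(len(corr)), key=lambda i: corr[i])
```

A flat-amplitude spectrum with linear phase transforms to a single pulse at the corresponding lag, and scaling the amplitudes down lowers the pulse height in proportion. This is why suppressed bins reduce the correlation peak that the utterance detection threshold is compared against.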

(Second Embodiment)
Next, as a second embodiment, an example in which the noise suppression device according to the present invention is applied to an audio reproduction device is described.
The audio reproduction device described in this embodiment uses the noise suppression device according to the present invention to suppress the noise components contained in the acoustic signals input from the two microphones 2, and reproduces the resulting clear, noise-suppressed acoustic signal from a speaker.

Description of parts shared with the utterance detection device 1 of the first embodiment is omitted as appropriate, and the operation of the audio reproduction device is described below. The noise suppression flowchart of FIG. 7 used in the description of the first embodiment applies basically unchanged to the second embodiment.
FIG. 4 shows a block diagram of the audio reproduction device 5 according to the present embodiment.
The microphones 2, the amplifier 10, and the A/D converter 11 are the same as those in the first embodiment, so their description is omitted.
The noise suppression unit 52 is the noise suppression device of the present application; when the acoustic signal converted into a discrete signal by the A/D converter 11 is input, it outputs a noise-suppressed discrete signal. Whereas the first embodiment outputs a noise-suppressed cross spectrum, the noise suppression unit 52 of the second embodiment differs in that it outputs a noise-suppressed acoustic signal.

The D/A converter 58 converts the discrete signal whose noise has been suppressed by the noise suppression unit 52 into an analog signal.
The analog signal produced by the D/A converter 58 is amplified by the amplifier 59 and reproduced by the speaker 60. The D/A converter 58, the amplifier 59, and the speaker 60 are well known, so their detailed description is omitted.

Next, the detailed block diagram of the noise suppression unit 52, which is the noise suppression device of the present invention, is described with reference to FIG. 8.
The phase difference fluctuation evaluation unit 13 calculates the degree of temporal fluctuation for each frequency component from the fluctuation of the phase of the cross spectrum, and has the same function as in the first embodiment.

The amplitude correction unit 54 includes a waveform synthesis unit 55, an amplitude correction coefficient calculation unit 56, and a suppression processing unit 57.

The waveform synthesis unit 55 combines the left and right channel audio signals into a single-channel signal and outputs the required waveform in step with the processing of the amplitude correction coefficient calculation unit 56.
For sound arriving from the direction perpendicular to the line connecting the two microphones, a simple sum suffices. Otherwise, it is desirable to add the signals after applying a phase shift based on the difference in arrival time between the left and right sound waves. Alternatively, either the left or the right signal alone may be output.
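The channel synthesis can be sketched as a delay-and-sum in the time domain. This is a simplification of the phase shift described above: it assumes the arrival-time difference is a whole number of samples, and it averages rather than sums so that the output level matches a single channel. The name `synthesize` and the `lag` parameter are hypothetical.

```python
def synthesize(left, right, lag=0):
    """Average two channels after advancing `right` by `lag` samples.

    lag > 0 means the sound reaches the right microphone `lag` samples
    later than the left one; lag = 0 is the simple-sum case for sound
    arriving perpendicular to the microphone axis. Integer lags only
    (an assumption of this sketch).
    """
    out = []
    for n in range(len(left)):
        m = n + lag
        r = right[m] if 0 <= m < len(right) else 0.0
        out.append(0.5 * (left[n] + r))
    return out
```

For example, a pulse at sample 1 on the left and sample 2 on the right lines up when `lag=1`, so the two copies add coherently instead of smearing across two samples.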

As in the first embodiment, the amplitude correction coefficient calculation unit 56 calculates the amplitude correction coefficient from the fluctuation evaluation value calculated by the fluctuation measurement unit 20.
The amplitude correction coefficient calculation unit 56 further includes a filter coefficient calculation unit 561, which calculates the coefficients of an FIR (Finite Impulse Response) filter from the calculated amplitude correction coefficient. Unlike the first embodiment, the second embodiment uses these filter coefficients to perform noise suppression on the time axis and outputs a noise-suppressed time waveform.
Here, to output high-quality audio, the update period of the amplitude correction coefficient calculation unit 56 and the filter coefficient calculation unit 561 is set to 1 ms. This corresponds to setting the predetermined period of step S50 in the flowchart of FIG. 7 to 1 ms.
That is, because the analysis period for calculating the cross spectrum is also 1 ms, the noise suppression processing of steps S60 to S80 is performed with the cross spectra stored in the buffering unit 19 every time a cross spectrum is calculated.
Shortening the update period makes it possible to reproduce high-quality audio. The update period may be lengthened if some quality degradation is acceptable.

The processing in the filter coefficient calculation unit 561 is described further.
In general, the calculated spectral correction characteristic has a complicated shape, so the FIR filter tends to require many coefficients. For this reason, in this embodiment the filter coefficients are calculated with an FFT length longer than the one used to calculate the amplitude correction coefficient. Specifically, for example, the 256-point amplitude characteristic is upsampled by a factor of 4 and transformed by an inverse FFT to obtain 1024 filter coefficients, from which a section is cut out with a window function of appropriate length.
Convolving these filter coefficients with the original acoustic signal on the time axis makes it possible to output high-quality audio that follows the changes even when the spectral shape is complicated and varies from moment to moment.
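One possible reading of this filter design is sketched below. Several details are assumptions not given in the text: the gain is treated as a full-length symmetric response (gains[k] equals gains[N−k]), the 4× upsampling is done by circular linear interpolation, the inverse transform is written as a direct cosine sum rather than an inverse FFT, and the cut-out uses a Hamming window centred on the zero-phase response. `design_fir` and its parameters are hypothetical names.

```python
import math

def design_fir(gains, up=4, taps=129):
    """Windowed zero-phase FIR taps from a symmetric per-bin gain.

    gains must satisfy gains[k] == gains[N - k]; taps must be odd,
    greater than 1, and smaller than len(gains) * up.
    """
    n = len(gains)
    size = n * up
    # Upsample the amplitude response by circular linear interpolation.
    fine = []
    for m in range(size):
        i, step = divmod(m, up)
        frac = step / up
        fine.append((1.0 - frac) * gains[i] + frac * gains[(i + 1) % n])
    half = taps // 2
    coeffs = []
    for j in range(taps):
        t = j - half  # tap position relative to the filter centre
        # Inverse DFT of a real, symmetric response reduces to a cosine sum.
        h = sum(fine[m] * math.cos(2.0 * math.pi * m * t / size)
                for m in range(size)) / size
        w = 0.54 - 0.46 * math.cos(2.0 * math.pi * j / (taps - 1))  # Hamming
        coeffs.append(h * w)
    return coeffs
```

With all gains equal to 1 the result is a single unit tap at the centre, that is, a filter that passes the signal unchanged. Working at the longer transform length gives the windowed impulse response room to decay, which reduces time-domain aliasing of a complicated gain curve.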

The suppression processing unit 57 filters the output of the waveform synthesis unit 55 by convolving it, on the time axis, with the filter coefficients calculated by the amplitude correction coefficient calculation unit 56.
The waveform to be processed is the signal from the waveform synthesis unit that corresponds to the central 1 ms of the time window (30 ms) used to calculate the center cross spectrum of the 11 cross spectra stored in the buffering unit 19. This signal is filtered, and a 1 ms (= 8 samples) waveform is output.
Although filtering on the time axis has been described here, when high-quality audio reproduction is not required, the calculated amplitude correction coefficient need not be converted back to the time axis; instead, it may be multiplied on the frequency axis with the FFT result of the original acoustic signal to suppress the noise component.
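The block-wise time-domain filtering can be sketched as a centred convolution that emits 8 samples (1 ms at 8 kHz) per call. The function name `filter_block`, the zero treatment of samples outside the signal, and the centring on the middle tap are assumptions of this sketch.

```python
def filter_block(signal, coeffs, start, n_out=8):
    """Return n_out filtered samples beginning at index `start`.

    coeffs is an odd-length FIR filter centred on its middle tap, so the
    output is time-aligned with the input. Samples outside the signal
    are treated as zero (an assumption of this sketch).
    """
    half = len(coeffs) // 2
    out = []
    for n in range(start, start + n_out):
        acc = 0.0
        for j, c in enumerate(coeffs):
            idx = n + half - j
            if 0 <= idx < len(signal):
                acc += c * signal[idx]
        out.append(acc)
    return out
```

Since the filter coefficients are refreshed every 1 ms in this embodiment, each call would use the most recent coefficient set, and successive 8-sample outputs are concatenated to form the output stream.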

As described above, the noise-suppressed discrete signal is converted into an analog signal by the D/A converter 58, amplified by the amplifier 59, and reproduced from the speaker 60.

This concludes the description of the second embodiment, in which the noise suppression device of the present invention is applied to an audio reproduction device.
The noise suppression device of the present invention can also be applied beyond the above embodiments. For example, if it is used as a front end to speech recognition, noise and speech from outside the target direction can be accurately excluded from the speech to be recognized, which can greatly reduce spurious insertion errors. In addition, because the noise components contained in the speech to be recognized are suppressed, the speech recognition rate can also be greatly improved.

DESCRIPTION OF SYMBOLS
1 ... Main body of the utterance detection device
10 ... Amplifier
11 ... A/D converter
12 ... Noise suppression unit (noise suppression device of the present invention)
13 ... Phase difference fluctuation evaluation unit
14 ... Amplitude correction unit
24 ... Cross-correlation calculation unit
25 ... Utterance estimation unit
58 ... D/A converter
59 ... Amplifier
60 ... Speaker
2 ... Microphone
3 ... ATM
4 ... Talker
5 ... Audio reproduction device

Claims

A noise suppression device that processes acoustic signals acquired by two sound collectors and suppresses noise components, comprising:
a phase difference fluctuation evaluation unit that evaluates, for each frequency component, the degree of temporal fluctuation of the phase difference between the two acoustic signals; and
an amplitude correction unit that treats a frequency component with a large degree of temporal fluctuation of the phase difference as noise, calculates an amplitude correction coefficient that reduces the amplitude component of that frequency component, and outputs a signal in which the noise component is suppressed by applying the amplitude correction coefficient to the acoustic signal.

The noise suppression device according to claim 1, wherein the phase difference fluctuation evaluation unit comprises:
a cross spectrum calculation unit that calculates a cross spectrum of the two acoustic signals for each predetermined period;
a buffering unit that stores a predetermined number of the calculated cross spectra; and
a fluctuation measurement unit that calculates, for each predetermined period, the degree of temporal fluctuation of the phase component of the cross spectra stored in the buffering unit as the degree of temporal fluctuation of the phase difference.

The noise suppression device according to claim 2, wherein the amplitude correction unit calculates a cross spectrum in which a noise component is suppressed by multiplying the amplitude component of the cross spectrum by the amplitude correction coefficient.

The noise suppression device according to claim 3, wherein the amplitude correction unit includes a whitening unit that performs a whitening process on the amplitude component of the cross spectrum, and suppresses the noise component by multiplying the amplitude component of the whitened cross spectrum by the amplitude correction coefficient.

The noise suppression device according to claim 2, wherein the amplitude correction unit includes a filter coefficient calculation unit that calculates a filter coefficient by performing an inverse Fourier transform on the amplitude correction coefficient, and applies the filter coefficient to one of the two acoustic signals, or to an acoustic signal synthesized from them, to generate an acoustic signal in which the noise component is suppressed.