JP2011203414A - Noise and reverberation suppressing device and method therefor

Info

Publication number: JP2011203414A
Application number: JP2010069531A
Authority: JP
Inventors: Tomoya Takatani; 智哉高谷; Hiroshi Saruwatari; 洋猿渡; Jani Even; ジャニエバン
Original assignee: Nara Institute of Science and Technology NUC; Toyota Motor Corp
Current assignee: Nara Institute of Science and Technology NUC; Toyota Motor Corp
Priority date: 2010-03-25
Filing date: 2010-03-25
Publication date: 2011-10-13

Abstract

PROBLEM TO BE SOLVED: To automatically suppress both noise and reverberation while adapting to changes in the environment. SOLUTION: A noise and reverberation suppression device includes: a filter creation unit that creates a separation filter matrix for each frequency from an input signal converted to the frequency domain; a noise estimation unit that calculates estimated noise; a direct speech and early reflection estimation unit that converts each element of the inverse matrix of the separation filter matrix to the time domain to obtain an estimated spatial transfer characteristic, extracts from it the section corresponding to the direct speech and early reflections, converts the extracted section to the frequency domain to create a direct speech and early reflection filter matrix for each frequency, and calculates estimated direct speech and early reflection speech; and a late reverberation generation unit that calculates a filter coefficient of a late reverberation characteristic for each frequency from a given reverberation time of the space and the power of the section extracted by the direct speech and early reflection estimation unit, and calculates pseudo late reverberation. Noise and late reverberation in the mixed observation signal are thereby suppressed.

Description

The present invention relates to a noise and reverberation suppression device and a method therefor.

A hands-free voice command recognition system (hereinafter simply referred to as a speech recognition system), in which a user gives voice commands to a robot, has been developed. As shown in FIG. 7, the sound picked up by the robot R in a real environment includes the direct sound of the voice of the user P (hereinafter referred to as user speech), the late reverberation of the user speech, and noise. In the figure, the direct sound of the user speech is indicated by a solid line, the late reverberation of the user speech by a broken line, and the noise by a dash-dot line.

In a speech recognition system, the robot recognizes the user's speech from the signal observed by a microphone. Factors that degrade speech recognition performance include (1) noise other than the user speech mixed into the signal and (2) the influence of the late reverberation component of the user speech.

Reverberation of the user speech degrades speech recognition performance. However, as shown in the prior art described below, it is known that the influence of the early reverberation component can be removed by using a "reverberation model" in the acoustic model of the speech recognition system.

Conventionally, techniques have been developed that aim to suppress either noise alone or reverberation alone in environments where noise and reverberation occur.
For example, Non-Patent Documents 1 and 2 disclose techniques aimed at suppressing the late reverberation component. FIG. 8 shows the functional configuration of the speech recognition system disclosed in Non-Patent Documents 1 and 2. In the figure, lines of normal thickness indicate that a signal is transmitted, and thicker lines indicate that a filter is transmitted.

As shown in FIG. 8, a late reverberation estimation unit 501 receives the observation signal from a microphone 1 and estimates the late reverberation component from the reverberation characteristic of the environment measured in advance. A correction coefficient is also obtained from the reverberation characteristic measured in advance, and a gain correction unit 502 uses this coefficient to correct the amplitude of the estimated late reverberation component. A reverberation suppression processing unit 503 uses the amplitude-corrected late reverberation component to suppress the late reverberation contained in the observation signal from the microphone 1. In this way, the late reverberation component is estimated from the observed signal and then subtracted, which suppresses the late reverberation component of the user speech.

As another example, Non-Patent Document 3 discloses a technique aimed at noise suppression. FIG. 9 shows the functional configuration of the speech recognition system disclosed in Non-Patent Document 3. The line thicknesses in FIG. 9 have the same meaning as in FIG. 8.

In FIG. 9, a blind source separation (BSS) unit 601, a speech/noise selection unit 602, and a multi-channel noise estimation unit 603 receive the observation signals from a microphone array 2 composed of a plurality of microphone elements, separate the mixed sound, and estimate the noise component. A mask generation unit 604 and a noise suppression processing unit 605 create a mask from the estimated noise component and use it to suppress the noise contained in the observation signal. A direct speech enhancement unit 606 then enhances the direct sound of the user speech contained in the noise-suppressed observation signal. In this way, the mixed-in noise component is estimated with a blind source separation (BSS) (or blind signal extraction (BSE)) algorithm, and the user speech is extracted by Wiener filter processing that takes the observation signal and the noise estimate as inputs, thereby performing speech enhancement for hands-free operation.
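
As a rough illustration of the Wiener-type masking described above, the following sketch computes a per-bin gain from an observed spectrogram and a noise estimate. It is only a minimal sketch under stated assumptions, not the formulation of Non-Patent Document 3: the function names, the gain floor `floor`, and the power-subtraction form of the gain are all illustrative choices.

```python
import numpy as np

def wiener_mask(X, N_hat, floor=0.05):
    """Wiener-type gain for each time-frequency bin (illustrative assumption).

    X     : complex observed spectrogram, shape (freq_bins, frames)
    N_hat : estimated noise spectrogram, same shape as X
    floor : hypothetical lower bound on the gain, limits over-suppression
    """
    obs_power = np.abs(X) ** 2
    noise_power = np.abs(N_hat) ** 2
    # Gain = (observed power - noise power) / observed power, guarded against /0.
    gain = 1.0 - noise_power / np.maximum(obs_power, 1e-12)
    return np.clip(gain, floor, 1.0)

def suppress_noise(X, mask):
    """Apply the real-valued mask to the complex spectrogram."""
    return mask * X
```

Bins where the noise estimate accounts for all of the observed power are attenuated down to the floor, while bins with no estimated noise pass through unchanged.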

Patent Document 1: JP 2003-066986 A
Patent Document 2: JP 2009-509673 A
Patent Document 3: JP 2003-334785 A
Patent Document 4: JP 2003-305670 A
Patent Document 5: JP 2004-098252 A

Non-Patent Document 1: Randy Gomez, Jani Even, Hiroshi Saruwatari, Kiyohiro Shikano, "Robustness in Microphone-speaker Location under Reverberant Conditions for Speech Recognition," Proceedings of the Acoustical Society of Japan, pp. 159-160, March 2008.
Non-Patent Document 2: Randy Gomez, Jani Even, Hiroshi Saruwatari, Kiyohiro Shikano, "Distant-talking Robust Speech Recognition Using Late Reflection Components of Room Impulse Response," ICASS, pp. 4581-4584, 2008.
Non-Patent Document 3: Jani Even, Hiroshi Saruwatari, Kiyohiro Shikano, Tomoya Takatani, "Speech enhancement in presence of diffuse background noise using sparsity based blind signal extraction," Proceedings of the Acoustical Society of Japan, pp. 765-768, September 2009.
Non-Patent Document 4: Jani Even, Randy Gomez, Hiroshi Saruwatari, Kiyohiro Shikano, "Combining blind signal separation and spectral subtraction of late impulse response effect for dereverberation in noisy and highly reverberant environment," Proceedings of the Acoustical Society of Japan, pp. 839-842, September 2008.
Non-Patent Document 5: Takuya Yoshioka, Tomohiro Nakatani, Hiroshi Okuno, "Efficient optimization method of MIMO dereverberation filter in weighted prediction error method," Proceedings of the Acoustical Society of Japan, pp. 651-654, September 2009.
Non-Patent Document 6: Takuya Yoshioka, Tomohiro Nakatani, Masato Miyoshi, "Fast Algorithm for Conditional Separation and Dereverberation," EURASIP, pp. 1432-1436, 2009.

Meanwhile, the inventors of the present application devised the technique shown in FIG. 10 in order to suppress both noise and reverberation.
In the technique shown in FIG. 10, a blind source separation (BSS) unit 701, a speech/noise selection unit 702, a multi-channel noise estimation unit 703, a mask generation unit 704, a noise suppression processing unit 705, and a direct speech enhancement unit 706 first suppress noise and then enhance the direct speech in the same manner as the technique shown in FIG. 9. A late reverberation estimation unit 707, a gain correction unit 708, and a reverberation suppression processing unit 709 then suppress the late reverberation using the reverberation characteristic of the environment measured in advance, as in the technique shown in FIG. 8. Both noise and reverberation can thereby be suppressed.

FIG. 11 is a table showing the functions provided by the techniques shown in FIGS. 8, 9, and 10. As shown in the table, the technique of FIG. 8 cannot suppress noise, and the technique of FIG. 9 cannot suppress late reverberation. The technique of FIG. 10 can suppress both noise and late reverberation, but operating environment data must be collected in advance to obtain the gain correction coefficient, and must be collected again whenever the environment changes. Because the gain correction coefficient is used to suppress the late reverberation component, the technique of FIG. 9 does not need it.

However, even with the technique of FIG. 10, which can suppress both noise and reverberation, the spatial transfer characteristic representing the reverberation of a space such as a room still has to be measured in advance in each space, and the user has to give the robot the reverberation characteristic corresponding to that space.

The model that the robot uses for speech recognition is created in an ideal environment without reverberation. When reverberation mixes sounds other than the direct sound of the user speech into the signal, a mismatch with the model arises.

Such a mismatch can be avoided by acquiring the necessary data in the reverberant environment beforehand and creating the model from it, but preparing in this way for a variety of environments is costly in both money and time and is not realistic. There is therefore a strong demand for a technique that adapts to changes in the environment and automatically suppresses both noise and reverberation without requiring operating environment data to be collected in advance.

Other techniques aimed at suppressing both noise and reverberation are disclosed in Non-Patent Documents 4 to 6, but all of them require operating environment data to be collected in advance, and none discloses automatically creating the reverberation characteristic of the space.

Other noise suppression techniques are disclosed in, for example, Patent Documents 1 to 5. However, the techniques disclosed in Patent Documents 1 and 2, for example, can suppress noise only.

Accordingly, an object of the present invention is to solve the above problems and to provide a noise and reverberation suppression device, and a method therefor, that adapt to changes in the environment and automatically suppress both noise and reverberation.

A noise and reverberation suppression device according to a first aspect of the present invention includes: a filter creation unit that, using an input signal obtained by converting a mixed observation signal containing speech and noise to the frequency domain, creates a separation filter matrix at each frequency for separating the speech from the mixed observation signal; a noise estimation unit that calculates estimated noise using the input signal, the separation filter matrix, and the inverse matrix of the separation filter matrix; a direct speech and early reflection estimation unit that converts each element of the inverse matrix of the separation filter matrix to the time domain to obtain an estimated spatial transfer characteristic, extracts from the estimated spatial transfer characteristic the section corresponding to the direct speech and early reflections of the speech, converts the extracted section to the frequency domain to create a direct speech and early reflection filter matrix at each frequency, and calculates estimated direct speech and early reflection speech using the input signal, the separation filter matrix, and the direct speech and early reflection filter matrix; and a late reverberation generation unit that calculates a filter coefficient of a late reverberation characteristic at each frequency from a given reverberation time of the space and the power of the section extracted by the direct speech and early reflection estimation unit, and calculates pseudo late reverberation using the estimated direct speech and early reflection speech, the separation filter matrix, and the filter coefficient of the late reverberation characteristic. The device suppresses noise and late reverberation in the mixed observation signal using the estimated noise calculated by the noise estimation unit and the pseudo late reverberation calculated by the late reverberation generation unit.

As a result, even when the environment changes, the estimated direct speech and early reflection speech are calculated from the observation signal in the new space, the late reverberation characteristic is created automatically, and the late reverberation is calculated from the two. The device can therefore adapt to changes in the environment and automatically suppress both noise and reverberation.

The device may further include: a noise suppression mask generation unit that calculates a noise suppression mask using the input signal and the estimated noise calculated by the noise estimation unit; a noise suppression processing unit that suppresses the noise in the mixed observation signal using the noise suppression mask; and a gain correction unit that corrects the amplitude of the pseudo late reverberation calculated by the late reverberation generation unit using the noise suppression mask calculated by the noise suppression mask generation unit. The late reverberation in the mixed observation signal may then be suppressed using the pseudo late reverberation whose amplitude has been corrected by the gain correction unit.

The device may further include: a late reverberation suppression mask generation unit that calculates a late reverberation suppression mask using the input signal and the pseudo late reverberation whose amplitude has been corrected by the gain correction unit; and a late reverberation suppression processing unit that suppresses the reverberation in the mixed observation signal using the late reverberation suppression mask.

The filter creation unit may include: a blind source separation unit that performs adaptive learning using the input signal and creates a separation filter matrix at each frequency for separating the speech from the mixed observation signal; a speech/noise selection unit that rearranges the separation filter matrix created by the blind source separation unit so that its first element corresponds to the speech; and an inverse matrix calculation unit that calculates the inverse matrix of the separation filter matrix after the rearrangement by the speech/noise selection unit.

The device may further include a direct speech enhancement unit that enhances the direct speech in the signal obtained after the noise and late reverberation in the mixed observation signal have been suppressed using the estimated noise calculated by the noise estimation unit and the pseudo late reverberation calculated by the late reverberation generation unit.

A noise and reverberation suppression method according to a second aspect of the present invention includes the steps of: converting a mixed observation signal containing speech and noise to the frequency domain to obtain an input signal; creating, using the input signal, a separation filter matrix at each frequency for separating the speech from the mixed observation signal; calculating the inverse matrix of the created separation filter matrix; calculating estimated noise using the input signal, the created separation filter matrix, and its inverse matrix; converting each element of the inverse matrix to the time domain to obtain an estimated spatial transfer characteristic; extracting from the estimated spatial transfer characteristic the section corresponding to the direct speech and early reflections of the speech; converting the extracted section to the frequency domain to create a direct speech and early reflection filter matrix at each frequency; calculating estimated direct speech and early reflection speech using the input signal, the created separation filter matrix, and the created direct speech and early reflection filter matrix; calculating a filter coefficient of a late reverberation characteristic at each frequency from a given reverberation time of the space and the power of the extracted section; calculating pseudo late reverberation using the calculated estimated direct speech and early reflection speech, the created separation filter matrix, and the calculated filter coefficient of the late reverberation characteristic; and suppressing noise and late reverberation in the mixed observation signal using the calculated estimated noise and the calculated pseudo late reverberation.

According to the present invention, it is possible to provide a noise and reverberation suppression device, and a method therefor, that adapt to changes in the environment and automatically suppress both noise and reverberation.

FIG. 1 is a functional configuration diagram of the noise and reverberation suppression device according to Embodiment 1.
FIG. 2 is a functional configuration diagram of the filter creation unit according to Embodiment 1.
FIG. 3 is a diagram for explaining the extraction process according to Embodiment 1.
FIG. 4 is a functional configuration diagram of a noise and reverberation suppression device according to another embodiment.
FIG. 5 is a functional configuration diagram of a noise and reverberation suppression device according to another embodiment.
FIG. 6 is a functional configuration diagram of a noise and reverberation suppression device according to another embodiment.
FIG. 7 is a diagram showing the sound picked up in a real environment.
FIG. 8 is a functional configuration diagram of a reverberation suppression technique related to the present invention.
FIG. 9 is a functional configuration diagram of a noise suppression technique related to the present invention.
FIG. 10 is a functional configuration diagram of a noise and reverberation suppression technique related to the present invention.
FIG. 11 is a table showing the problems of the techniques related to the present invention.
FIG. 12 is a diagram illustrating matrices used in the present invention.
FIG. 13 is a diagram illustrating matrices used in the present invention.

Embodiment 1
Embodiments of the present invention will be described below with reference to the drawings. FIGS. 1 and 2 are block diagrams showing the system configuration of a noise and reverberation suppression device according to an embodiment of the present invention. In the figures, lines of normal thickness indicate that a signal is transmitted, and thicker lines indicate that a filter is transmitted.

The noise and reverberation suppression device 100 according to the present embodiment includes a microphone array 2, a filter creation unit 10, a noise estimation unit 21, a noise suppression mask generation unit 22, a noise suppression processing unit 23, a direct speech and early reflection estimation unit 24, a late reverberation generation unit 25, a gain correction unit 26, a late reverberation suppression mask generation unit 27, a late reverberation suppression processing unit 28, and a direct speech enhancement unit 29.

The noise and reverberation suppression device 100 has, as its main hardware configuration, a microcomputer including a CPU (Central Processing Unit) that performs control processing, arithmetic processing, and the like; a ROM (Read Only Memory) that stores control programs, arithmetic programs, and the like executed by the CPU; and a RAM (Random Access Memory) that temporarily stores processing data and the like. The filter creation unit 10, the noise estimation unit 21, the noise suppression mask generation unit 22, the noise suppression processing unit 23, the direct speech and early reflection estimation unit 24, the late reverberation generation unit 25, the gain correction unit 26, the late reverberation suppression mask generation unit 27, the late reverberation suppression processing unit 28, and the direct speech enhancement unit 29 may be realized, for example, by programs stored in the ROM and executed by the CPU.

In the noise and reverberation suppression device 100, the portion comprising the filter creation unit 10, the noise estimation unit 21, the noise suppression mask generation unit 22, and the noise suppression processing unit 23 performs the processing related to noise suppression. The portion comprising the filter creation unit 10, the direct speech and early reflection estimation unit 24, the late reverberation generation unit 25, the gain correction unit 26, the late reverberation suppression mask generation unit 27, and the late reverberation suppression processing unit 28 performs the processing related to late reverberation suppression. The portion comprising the direct speech enhancement unit 29 performs the processing related to enhancement of the direct sound. Speech recognition is then performed on the signal output from the direct speech enhancement unit 29.

The filter creation unit 10 performs batch processing: it accumulates a predetermined amount of observation signals (audio data) and then creates the filters from the accumulated audio data. The units other than the filter creation unit 10, such as the noise estimation unit 21, perform real-time processing using the filters created by the filter creation unit 10.

An outline of the processing performed by the noise and reverberation suppression device 100 is as follows. The observation signal observed by the microphone array 2 contains user speech and noise, and the user speech in turn contains direct speech, early reflections, and late reverberation. The filter creation unit 10 separates the user speech and the noise in the observation signal and, in doing so, creates a separation filter. The estimated noise separated by this filter is used to suppress the noise in the observation signal. The direct speech and early reflection estimation unit 24 uses the same separation filter to estimate the direct speech and early reflections of the user speech from the observation signal. The late reverberation generation unit 25 creates an artificial late reverberation characteristic and generates pseudo late reverberation from that characteristic and the estimated direct speech and early reflections. The pseudo late reverberation is then used to suppress the late reverberation of the user speech in the observation signal.
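
The step in which the inverse of the separation filter matrix is converted to the time domain, truncated to its direct and early part, and converted back to the frequency domain could be sketched as follows. This is only a sketch under stated assumptions: the function name, the array layout, the one-sided FFT grid, and the segment length `early_len` are hypothetical, and the document's own extraction rule (FIG. 3) is not reproduced here.

```python
import numpy as np

def early_reflection_filters(W, early_len):
    """Build direct/early-reflection filters from separation matrices.

    W         : complex array, shape (freq_bins, n, n), one separation matrix
                per bin of a one-sided (rfft) frequency grid -- an assumption
    early_len : number of leading taps treated as direct sound plus early
                reflections (hypothetical parameter)
    Returns (H_early, power): the frequency-domain filters, shape
    (freq_bins, n, n), and the power of the extracted section, shape (n, n).
    """
    A = np.linalg.inv(W)                   # inverse matrix at every frequency
    n_fft = 2 * (W.shape[0] - 1)
    h = np.fft.irfft(A, n=n_fft, axis=0)   # estimated impulse responses, (n_fft, n, n)
    h_early = np.zeros_like(h)
    h_early[:early_len] = h[:early_len]    # keep only the leading section
    H_early = np.fft.rfft(h_early, axis=0) # back to the frequency domain
    power = np.sum(h_early ** 2, axis=0)   # power of the extracted section
    return H_early, power
```

The returned power is the kind of quantity that, together with a given reverberation time, could feed the calculation of the late reverberation filter coefficients; that calculation itself is not sketched here.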

The microphone array 2 is composed of a plurality of microphone elements and observes the mixed sound in which user speech and noise are mixed. The microphones are provided, for example, on the head of the robot R and are arranged side by side in the horizontal direction. The observation signal of each microphone corresponds to a channel i.

The observation signal from each microphone is converted into digital data (hereinafter referred to as audio data) by an AD converter (not shown). The audio data of each microphone is accumulated for a predetermined time and divided into frames. A discrete Fourier transform is then applied to the audio data of each frame, converting it into the time-frequency domain input signal vector X(f, t).
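
The framing and discrete Fourier transform described above might look like the following minimal sketch. The frame length, hop size, Hann window, and array layout are assumptions chosen for illustration; the document does not specify them.

```python
import numpy as np

def stft_multichannel(x, frame_len=512, hop=256):
    """Convert multichannel audio into input signal vectors X(f, t).

    x : real array, shape (channels, samples), with samples >= frame_len
    Returns X with shape (freq_bins, frames, channels), so that X[f, t]
    is the vector of all microphone channels at frequency f and frame t.
    frame_len, hop, and the Hann window are illustrative assumptions.
    """
    n_ch, n_samples = x.shape
    window = np.hanning(frame_len)
    n_frames = 1 + (n_samples - frame_len) // hop
    X = np.empty((frame_len // 2 + 1, n_frames, n_ch), dtype=complex)
    for t in range(n_frames):
        seg = x[:, t * hop : t * hop + frame_len] * window  # (channels, frame_len)
        X[:, t, :] = np.fft.rfft(seg, axis=1).T             # (freq_bins, channels)
    return X
```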

フィルタ作成部１０は、図２に示すように、ブラインド音源分離（ＢＳＳ）１１と、音声・雑音選択部１２と、逆行列演算部１３と、を備えている。フィルタ作成部１０は、入力信号ベクトルＸ（ｆ，ｔ）を用いて、入力信号ベクトルＸ（ｆ，ｔ）からユーザ音声を分離するための、各周波数での分離フィルタ行列を作成する。以下、図２を参照して、ブラインド音源分離（ＢＳＳ）１１などについて説明する。 As shown in FIG. 2, the filter creation unit 10 includes a blind sound source separation (BSS) 11, a voice / noise selection unit 12, and an inverse matrix calculation unit 13. The filter creation unit 10 creates a separation filter matrix at each frequency for separating user speech from the input signal vector X (f, t) using the input signal vector X (f, t). Hereinafter, the blind sound source separation (BSS) 11 and the like will be described with reference to FIG.

ブラインド音源分離（ＢＳＳ）１１は、入力信号ベクトルＸ（ｆ，ｔ）を用いて適応学習処理を行い、各周波数での分離フィルタ行列Ｗ（ｆ）を作成する。また、ブラインド音源分離（ＢＳＳ）１１は、作成した分離フィルタ行列Ｗ（ｆ）を用いて、出力信号ベクトルＹ（ｆ，ｔ）＝Ｗ（ｆ）Ｘ（ｆ，ｔ）を出力する。
なお、ブラインド音源分離（ＢＳＳ）１１は、ここでは、従来提案されている独立成分分析や主成分分析を用いて、事前情報を用いることなく適応学習が可能な処理を行う。 The blind sound source separation (BSS) 11 performs an adaptive learning process using the input signal vector X (f, t) and creates a separation filter matrix W (f) at each frequency. The blind sound source separation (BSS) 11 outputs an output signal vector Y (f, t) = W (f) X (f, t) using the created separation filter matrix W (f).
Here, the blind sound source separation (BSS) 11 uses a previously proposed method such as independent component analysis or principal component analysis, which can learn adaptively without prior information.

音声・雑音選択部１２は、出力信号ベクトルＹ（ｆ，ｔ）の第１要素（Ｙ^（１）（ｆ，ｔ））が音声データとなるように、分離フィルタ行列Ｗ（ｆ，ｔ）の要素を入れ替える。
これは、ブラインド音源分離（ＢＳＳ）１１の出力信号は、ユーザ音声又は雑音というようにクラスタリングされており、これは周波数ビンごとのクラスタリング結果になっているが、ユーザ音声と雑音とが入れ替わっている可能性があるためである。 The voice / noise selection unit 12 swaps the elements of the separation filter matrix W (f, t) so that the first element (Y ⁽¹⁾ (f, t)) of the output signal vector Y (f, t) is the voice data.
This is necessary because the output signals of the blind sound source separation (BSS) 11 are clustered into user speech and noise separately for each frequency bin, so the user speech and the noise may be swapped in some bins.

なお、音声・雑音選択部１２による処理は、入れ替わり（ｐｅｒｍｕａｔｉｏｎ）解法とも呼ばれ、例えば、従来提案されている解法（出力信号ベクトルＹ（ｆ，ｔ）間の結合密度確率分布を求め、この結合確率密度分布の形状に基づいて、ユーザ音声と雑音を振り分ける手法）を利用すればよい。また、ここで利用する解法は多数存在するため、その他の解法を用いてもよい。例えば、（１）周辺確率密度分布の尖度を利用する解法、（２）音源信号の分離信号の包絡線を利用する解法、（３）位相情報の連続性を利用する解法、（４）空間スペクトルを利用する解法、などを利用してもよい。 The processing by the voice / noise selection unit 12 is also called permutation resolution. For example, a previously proposed method may be used in which the joint probability density distribution of the output signal vectors Y (f, t) is obtained and the user speech and noise are assigned based on the shape of that distribution. Many other methods exist and may be used instead, for example: (1) a method using the kurtosis of the marginal probability density distribution, (2) a method using the envelope of the separated source signals, (3) a method using the continuity of the phase information, or (4) a method using the spatial spectrum.
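The following Python sketch illustrates one simple ordering heuristic in the spirit of alternative (1). It assumes that speech magnitudes are more sharply peaked (higher kurtosis) than noise magnitudes, which the source does not state; the function names are hypothetical, and this is not presented as the method the source actually uses.

```python
def kurtosis(values):
    """Sample kurtosis (fourth standardized moment) of a real sequence."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    if var == 0.0:
        return 0.0  # constant sequence: treat as the least speech-like
    return sum((v - mean) ** 4 for v in values) / (n * var ** 2)


def speech_first(w_rows, outputs):
    """Reorder the rows of W(f) so the most peaked output becomes element 1.

    w_rows  : rows of the separation matrix W(f) for one frequency bin
    outputs : outputs[k] is the time series Y^(k)(f, t) produced by row k
    Assumption: the output with the highest kurtosis is the user speech.
    """
    scores = [kurtosis([abs(y) for y in series]) for series in outputs]
    order = sorted(range(len(w_rows)), key=lambda k: scores[k], reverse=True)
    return [w_rows[k] for k in order], [outputs[k] for k in order]
```

Applying the same reordering in every frequency bin would make the first output consistently correspond to the speech candidate, which is the property the later selector matrices rely on.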

逆行列演算部１３は、出力信号ベクトルＹ（ｆ，ｔ）の算出に用いた分離フィルタ行列Ｗ（ｆ）について、その逆行列Ｈ（ｆ）＝Ｗ^−１（ｆ）を算出する。求めた逆行列Ｈ（ｆ）は、空間伝達特性行列の推定値であり、部屋などの空間の残響特性を意味している。 The inverse matrix calculation unit 13 calculates an inverse matrix H (f) = W ⁻¹ (f) for the separation filter matrix W (f) used for calculating the output signal vector Y (f, t). The obtained inverse matrix H (f) is an estimated value of the spatial transfer characteristic matrix, and means a reverberation characteristic of a space such as a room.

ユーザ音声と雑音とが混合された観測信号が得られる関係は、ユーザ音声及び雑音の各音源から観測信号への写像に相当し、これに対して、雑音が混入された観測信号をユーザ音声と雑音とに分離することは、観測信号から各音源への写像に相当する。ブラインド音源分離（ＢＳＳ）１１において作成する分離行列フィルタ行列Ｗ（ｆ）が観測信号から各音源への写像を意味するため、分離行列フィルタ行列Ｗ（ｆ）の逆行列Ｈ（ｆ）を求めることは、ユーザ音声及び雑音の各音源から観測信号への写像を求めることを意味している。 The process by which the user speech and noise mix into the observation signal corresponds to a mapping from the sound sources (user speech and noise) to the observation signal. Conversely, separating the noisy observation signal into user speech and noise corresponds to a mapping from the observation signal to the sound sources. Because the separation filter matrix W (f) created by the blind sound source separation (BSS) 11 represents the mapping from the observation signal to the sound sources, obtaining its inverse matrix H (f) amounts to obtaining the mapping from the sound sources to the observation signal.
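The two mappings can be written compactly as follows. The symbols A(f) (the true mixing matrix) and S(f, t) (the vector of source signals) are introduced here only for illustration and do not appear in the source. Blind separation generally recovers the sources only up to scaling and ordering, hence the approximation signs.

```latex
X(f,t) = A(f)\,S(f,t), \qquad
Y(f,t) = W(f)\,X(f,t) \approx S(f,t)
\quad\Longrightarrow\quad
H(f) = W^{-1}(f) \approx A(f)
```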

図１に戻って説明を続ける。
雑音推定部２１は、以下の計算式を用いて、入力信号ベクトルＸ（ｆ，ｔ）と、分離フィルタ行列Ｗ（ｆ，ｔ）と、空間伝達特性行列の推定値Ｈ（ｆ）とから、観測信号（入力信号ベクトルＸ（ｆ，ｔ））に含まれる雑音信号ベクトルＸ_Ｎ（ｆ，ｔ）を算出する。すなわち、雑音推定部２１は、観測信号（入力信号）に含まれる雑音成分を推定する。
Ｘ_Ｎ（ｆ，ｔ）＝Ｈ（ｆ）Ｄ_Ｎ（ｆ）Ｗ（ｆ）Ｘ（ｆ，ｔ）
ただし、Ｄ_Ｎ（ｆ）はｉ行ｉ列（１＜ｉ）目の要素が１である対角行列を示す。なお、一般的な対角行列では全ての対角要素が１であるが、ここで用いるＤ_Ｎ（ｆ）は、ユーザ音声を求めないようにするため、１行１列目の要素を０とし、残りの対角要素は全て１とする。このため、例えば３行３列の場合には図１２（ａ）に示す行列となり、４行４列の場合には、図１２（ｂ）に示す行列となる。 Returning to FIG. 1, the description will be continued.
The noise estimation unit 21 uses the following formula to calculate the noise signal vector X _N (f, t) contained in the observation signal (input signal vector X (f, t)) from the input signal vector X (f, t), the separation filter matrix W (f), and the estimated value H (f) of the spatial transfer characteristic matrix. That is, the noise estimation unit 21 estimates the noise component contained in the observation signal (input signal).
X _N (f, t) = H (f) D _N (f) W (f) X (f, t)
Here, D _N (f) denotes a diagonal matrix whose elements in the i-th row and i-th column (1 < i) are 1. Unlike the identity matrix, in which every diagonal element is 1, the D _N (f) used here has 0 in the first row and first column, so that the user speech is not extracted, and 1 in all remaining diagonal elements. For example, the 3-by-3 case is the matrix shown in FIG. 12(a), and the 4-by-4 case is the matrix shown in FIG. 12(b).
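The noise projection can be checked numerically on a two-microphone case. The sketch below uses made-up values for W (f) and X (f, t) and hypothetical helper names. It also forms the complementary speech projection with a selector D_S (first diagonal element 1, all others 0) to show that the two projections add back up to the observation.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def invert_2x2(w):
    """Closed-form inverse of a 2x2 complex matrix."""
    (a, b), (c, d) = w
    det = a * d - b * c
    if abs(det) < 1e-12:
        raise ValueError("W(f) is singular; no inverse exists")
    return [[d / det, -b / det], [-c / det, a / det]]


def project(w, d, x):
    """Return H D W x with H = W^-1, for one frequency bin and one frame."""
    h = invert_2x2(w)
    column = [[v] for v in x]
    result = matmul(h, matmul(d, matmul(w, column)))
    return [row[0] for row in result]


D_N = [[0, 0], [0, 1]]  # drops output 1 (speech), keeps the noise output
D_S = [[1, 0], [0, 0]]  # complementary selector: keeps only the speech

w = [[1 + 1j, 0.5], [0.2j, 1]]  # made-up W(f), for illustration only
x = [0.3 - 0.1j, 1.2 + 0.4j]    # made-up X(f, t), for illustration only
x_noise = project(w, D_N, x)    # X_N(f, t)
x_speech = project(w, D_S, x)   # speech image at the microphones
```

Because D_N + D_S is the identity and H (f) W (f) is the identity, x_noise and x_speech sum to x element by element. Note that x_speech here still contains all of the reverberation, since it uses the full H (f).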

直接音声・初期反射音声推定部２４は、逆行列演算部１３により求めた行列Ｈ（ｆ）の各要素について逆高速フーリエ変換処理を行い、時間領域に変換後の空間伝達特性の推定値を得る。そして、この推定値の中から、各マイクロホンへの直接音声・初期反射音声に相当する区間を窓関数により切り出し、切り出した区間についてフーリエ変換処理を行って周波数領域へと変換することで、各周波数での直接音声・初期反射音声推定フィルタ行列を作成する。ここで得たフーリエ変換処理後の行列をＨ_Ｅ（ｆ）とする。なお、切り出しの際に用いる窓の長さｈ_Ｅはパラメータとして与えられ、例えば、非特許文献１のＦｉｇ．３で求められたｈ_Ｅの値（７０［ｔａｐｓ］）を用いればよい。直接音声・初期反射音声推定部２４による切り出し処理は、例えば、空間伝達特性の推定値を時間領域に変換後の波形を図３の左図に示した場合に、これをｈ_Ｅの長さの窓により、図３の右図に示すように切り出すことである。 The direct speech / early reflection speech estimation unit 24 applies an inverse fast Fourier transform to each element of the matrix H (f) obtained by the inverse matrix calculation unit 13, yielding the estimated spatial transfer characteristic in the time domain. From this estimate, the section corresponding to the direct speech and early reflections arriving at each microphone is cut out with a window function, and the cut-out section is Fourier-transformed back into the frequency domain to create the direct speech / early reflection speech estimation filter matrix at each frequency. The matrix obtained after this Fourier transform is denoted H _E (f). The window length h _E used for the cut-out is given as a parameter; for example, the value of h _E obtained in Fig. 3 of Non-Patent Document 1 (70 [taps]) may be used. As an illustration of the cut-out, if the time-domain waveform of the estimated spatial transfer characteristic is as shown on the left of FIG. 3, a window of length h _E cuts it out as shown on the right of FIG. 3.
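The cut-out can be sketched for a single element of H (f) as follows. The rectangular window, the plain O(N²) transform, and the function names are assumptions for illustration; the source says only that a window function of length h_E is used. The sketch also returns the power kept and the power discarded by the window, which the description of the late reverberation filter refers to.

```python
import cmath
import math


def dft(seq, inverse=False):
    """Plain O(N^2) DFT or inverse DFT, adequate for a small illustration."""
    n = len(seq)
    sign = 1 if inverse else -1
    out = [sum(seq[k] * cmath.exp(sign * 2j * math.pi * m * k / n)
               for k in range(n)) for m in range(n)]
    return [v / n for v in out] if inverse else out


def early_part(h_freq, h_e):
    """Keep the first h_e taps of the impulse response behind h_freq.

    h_freq : one element H_ij(f) sampled on all N DFT bins
    h_e    : window length in taps (rectangular window assumed here)
    Returns (H_E spectrum, kept power, discarded power).
    """
    taps = dft(h_freq, inverse=True)
    kept = [t if n < h_e else 0.0 for n, t in enumerate(taps)]
    kept_power = sum(abs(t) ** 2 for t in taps[:h_e])
    cut_power = sum(abs(t) ** 2 for t in taps[h_e:])
    return dft(kept), kept_power, cut_power
```

Repeating this for every element of H (f) would give the matrix H_E (f); a tapered window could be substituted for the rectangular one without changing the structure.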

さらに、直接音声・初期反射音声推定部２４は、以下の計算式を用いて、フーリエ変換後の直接音声・初期反射音声推定フィルタ行列Ｈ_Ｅ（ｆ）と、分離行列フィルタ行列Ｗ（ｆ）と、観測信号（入力信号ベクトルＸ（ｆ，ｔ））とから、観測信号に含まれるユーザ音声の直接音・初期反射音信号ベクトルＸ_Ｅ（ｆ，ｔ）を算出する。すなわち、直接音声・初期反射音声推定部２４は、観測信号に含まれるユーザ音声の直接音・初期反射音成分を推定する。
Ｘ_Ｅ（ｆ，ｔ）＝Ｈ_Ｅ（ｆ）Ｄ_Ｓ（ｆ）Ｗ（ｆ）Ｘ（ｆ，ｔ）
ただし、Ｄ_Ｓ（ｆ）はｉ行ｉ列（１＜ｉ）目の要素のみが１で、他の要素は全て０である行列を示す。ここで用いるＤ_Ｓ（ｆ）は、雑音を求めないようにするため、１行１列目の要素のみを１とし、残りの要素は全て０とする。このため、例えば３行３列の場合には図１３（ａ）に示す行列となり、４行４列の場合には、図１３（ｂ）に示す行列となる。 Further, the direct speech / initial reflection speech estimation unit 24 uses the following calculation formula to calculate the direct speech / initial reflection speech estimation filter matrix H _E (f) after the Fourier transform and the separation matrix filter matrix W (f): From the observation signal (input signal vector X (f, t)), the direct sound / initial reflected sound signal vector X _E (f, t) of the user voice included in the observation signal is calculated. That is, the direct sound / initial reflected sound estimation unit 24 estimates the direct sound / initial reflected sound component of the user sound included in the observation signal.
X _E (f, t) = H _E (f) D _S (f) W (f) X (f, t)
Here, D _S (f) denotes a matrix in which only the element in the first row and first column is 1 and all other elements are 0. The element in the first row and first column alone is set to 1, and all remaining elements are set to 0, so that the noise is not extracted. For example, the 3-by-3 case is the matrix shown in FIG. 13(a), and the 4-by-4 case is the matrix shown in FIG. 13(b).
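FIG. 12 and FIG. 13 are not reproduced in this text. Reading the verbal definitions of the two selector matrices literally, the 3-by-3 cases would be as below. The decomposition in the second line follows from D_S + D_N = I together with H(f) W(f) = I, and is added here as a consistency check rather than taken from the source.

```latex
D_S = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 0 \end{pmatrix}, \qquad
D_N = \begin{pmatrix} 0 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}, \qquad
D_S + D_N = I

X(f,t) = H(f)\,D_S\,W(f)\,X(f,t) + \underbrace{H(f)\,D_N\,W(f)\,X(f,t)}_{X_N(f,t)}

X_E(f,t) = H_E(f)\,D_S\,W(f)\,X(f,t)
```

The last line differs from the speech term of the decomposition only in that the windowed matrix H_E(f) replaces H(f), which is what removes the late part of the room response from the speech estimate.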

後期残響生成部２５は、以下の計算式を用いて、ユーザ音声の直接音・初期反射音信号ベクトルＸ_Ｅ（ｆ，ｔ）と、後述するＨ_Ｅ（ｆ）とから、後期残響推定信号ベクトルＸ_Ｌ（ｆ，ｔ）を算出する。すなわち、後期残響生成部２５は、擬似的な後期残響成分を生成する。なお、Ｘ_Ｌ ^（ｉ）（ｆ，ｔ）は、ベクトルＸ_Ｌ（ｆ，ｔ）のｉ番目の要素を示す。また、各マイクロホンに対して同じＨ_Ｌ（ｆ）を用いる。
Ｘ_Ｌ ^（ｉ）（ｆ，ｔ）＝Ｘ_Ｅ ^（ｉ）（ｆ，ｔ）Ｈ_Ｌ（ｆ）
ここで、Ｈ_Ｌ（ｆ）は後期残響特性のフィルタ係数を示し、ユーザにより与えられる部屋の残響時間（Ｔ６０）と、直接音声・初期反射音声推定部２５でカットしたパワー量（すなわち、切出した区間のパワー量）と、から決定される。より具体的には、直接音声及び初期反射音声のパワー量と後期残響のパワー量との比は、直接音声・初期反射音声推定部２５において切出した区間のパワー量とその他の切出されなかった区間のパワー量との比に相当するため、この比率に基づいて残響時間から求める後期残響の振幅を補正することで、後期残響特性のフィルタ係数を算出することができる。なお、Ｈ_Ｌ（ｆ）の算出は、従来知られた公知の計算式を用いて行えばよいため、ここでは、その詳細な説明を省略する。 The late reverberation generation unit 25 uses the following formula to calculate the late reverberation estimation signal vector X _L (f, t) from the direct sound / early reflected sound signal vector X _E (f, t) of the user speech and H _L (f), which is described below. That is, the late reverberation generation unit 25 generates a pseudo late reverberation component. X _L ⁽ⁱ⁾ (f, t) denotes the i-th element of the vector X _L (f, t). The same H _L (f) is used for every microphone.
X _L ⁽ⁱ⁾ (f, t) = X _E ⁽ⁱ⁾ (f, t) H _L (f)
Here, H _L (f) denotes the filter coefficient of the late reverberation characteristic. It is determined from the reverberation time (T60) of the room, which is given by the user, and the amount of power cut out by the direct speech / early reflection speech estimation unit 24 (that is, the power of the extracted section). More specifically, the ratio of the power of the direct speech and early reflections to the power of the late reverberation corresponds to the ratio of the power of the section extracted by the direct speech / early reflection speech estimation unit 24 to the power of the remaining section that was not extracted. The filter coefficient of the late reverberation characteristic can therefore be calculated by correcting, based on this ratio, the amplitude of the late reverberation obtained from the reverberation time. Since H _L (f) may be calculated with a conventionally known formula, a detailed description is omitted here.
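The source does not give the formula for H _L (f). Purely as an illustration of the idea, the sketch below builds a synthetic late tail as exponentially decaying noise whose level falls 60 dB over T60, then rescales it so that its total power matches the power discarded by the h_E window. The decaying-noise model, the function name, and all parameters are assumptions and should not be read as the calculation the source has in mind.

```python
import math
import random


def synthetic_late_tail(t60, fs, start_tap, n_taps, target_power, seed=0):
    """Build a decaying noise tail whose total power equals target_power.

    t60          : reverberation time in seconds (level falls 60 dB in t60)
    fs           : sampling rate in Hz
    start_tap    : first tap of the tail (the window length h_E)
    n_taps       : number of tail taps to generate
    target_power : power of the section discarded by the h_E window
    """
    rng = random.Random(seed)
    # Amplitude falls by 60 dB, i.e. by a factor of 10**-3, over t60 seconds.
    raw = [rng.gauss(0.0, 1.0) * 10 ** (-3.0 * (start_tap + n) / (t60 * fs))
           for n in range(n_taps)]
    raw_power = sum(v * v for v in raw)
    gain = math.sqrt(target_power / raw_power)
    return [gain * v for v in raw]
```

Transforming such a tail to the frequency domain would give one candidate for a per-frequency coefficient, but whether that matches the known formula referred to in the source cannot be determined from the text.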

雑音抑圧マスク生成部２２は、以下の計算式を用いて、入力信号ベクトルＸ（ｆ，ｔ）と雑音信号ベクトルＸ_Ｎ（ｆ，ｔ）とから、雑音抑圧マスクベクトルＭ_Ｎ（ｆ，ｔ）を算出する。なお、Ｍ_Ｎ ^（ｉ）（ｆ，ｔ）は、ベクトルＭ_Ｎ（ｆ，ｔ）のｉ番目の要素を示す。また、係数α_１は、雑音抑圧の程度を調整するパラメータであり、ユーザにより適切な値が与えられる。
Ｍ_Ｎ ^（ｉ）（ｆ，ｔ）＝ｓｑｒｔ（｜Ｘ^（ｉ）（ｆ，ｔ）｜^２／（｜Ｘ^（ｉ）（ｆ，ｔ）｜^２＋α_１｜Ｘ_Ｎ ^（ｉ）（ｆ，ｔ）｜^２）） The noise suppression mask generation unit 22 uses the following formula to calculate the noise suppression mask vector M _N (f, t) from the input signal vector X (f, t) and the noise signal vector X _N (f, t). M _N ⁽ⁱ⁾ (f, t) denotes the i-th element of the vector M _N (f, t). The coefficient α ₁ is a parameter that adjusts the degree of noise suppression; an appropriate value is given by the user.
M _N ⁽ⁱ⁾ (f, t) = sqrt (| X ⁽ⁱ⁾ (f, t) | ² / (| X ⁽ⁱ⁾ (f, t) | ² + α ₁ | X _N ⁽ⁱ⁾ (f, t) | ² ))

雑音抑圧処理部２３は、以下の計算式を用いて、観測信号（入力信号ベクトルＸ（ｆ，ｔ））と、雑音抑圧マスクベクトルＭ_Ｎ（ｆ，ｔ）と、から雑音抑圧後の中間出力信号ベクトルＶ（ｆ，ｔ）を算出する。すなわち、雑音抑圧処理部２３は、観測信号に含まれる環境の雑音を抑圧する。なお、Ｖ^（ｉ）（ｆ，ｔ）は、ベクトルＶ（ｆ，ｔ）のｉ番目の要素を示す。
Ｖ^（ｉ）（ｆ，ｔ）＝Ｍ_Ｎ ^（ｉ）（ｆ，ｔ）Ｘ^（ｉ）（ｆ，ｔ） The noise suppression processing unit 23 uses the following calculation formula to calculate an intermediate output after noise suppression from the observation signal (input signal vector X (f, t)) and the noise suppression mask vector M _N (f, t). A signal vector V (f, t) is calculated. That is, the noise suppression processing unit 23 suppresses environmental noise included in the observation signal. V ⁽ⁱ⁾ (f, t) indicates the i-th element of the vector V (f, t).
V ⁽ⁱ⁾ (f, t) = M _N ⁽ⁱ⁾ (f, t) X ⁽ⁱ⁾ (f, t)
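The mask formula and its application to one frame can be sketched directly in Python. The function names and the choice to return 0 when both the signal and the noise estimate are zero are assumptions; the source does not say how that case is handled.

```python
import math


def suppression_mask(x, interference, alpha):
    """sqrt(|x|^2 / (|x|^2 + alpha * |interference|^2)), guarded against 0/0."""
    num = abs(x) ** 2
    den = num + alpha * abs(interference) ** 2
    return math.sqrt(num / den) if den > 0.0 else 0.0


def suppress_noise(x_vec, xn_vec, alpha1=1.0):
    """Apply the per-channel mask M_N to one frame X(f, t).

    x_vec  : X^(i)(f, t) for every channel i
    xn_vec : X_N^(i)(f, t) for every channel i
    Returns (mask values M_N^(i), noise-suppressed values V^(i)).
    """
    masks = [suppression_mask(x, xn, alpha1) for x, xn in zip(x_vec, xn_vec)]
    v_vec = [m * x for m, x in zip(masks, x_vec)]
    return masks, v_vec
```

The mask is real and lies between 0 and 1, so it scales the magnitude of each bin and leaves the phase of the observation unchanged. A larger α₁ pushes the mask toward 0 wherever noise is estimated to be present.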

ゲイン補正部２６は、以下の計算式を用いて、後期残響推定信号ベクトルＸ_Ｌ（ｆ，ｔ）と、雑音抑圧マスクベクトルＭ_Ｎ（ｆ，ｔ）とから、中間出力信号ベクトルＶ_Ｌ ^（ｉ）（ｆ，ｔ）を算出する。すなわち、ゲイン補正部２６は、雑音抑圧マスクを用いて、後期残響推定成分の振幅を補正する。雑音抑圧マスク生成部２２により求められた雑音抑圧マスクベクトルＭ_Ｎ（ｆ，ｔ）を、雑音抑圧処理部２３と、ゲイン補正部２６とで共通して用いることで、雑音抑圧処理部２３での抑圧に応じて、ゲイン補正部２６において後期残響推定成分の振幅を補正することができる。なお、Ｖ_Ｌ ^（ｉ）（ｆ，ｔ）は、ベクトルＶ_Ｌ（ｆ，ｔ）のｉ番目の要素を示す。
Ｖ_Ｌ ^（ｉ）（ｆ，ｔ）＝Ｍ_Ｎ ^（ｉ）（ｆ，ｔ）Ｘ_Ｌ ^（ｉ）（ｆ，ｔ） The gain correction unit 26 uses the following formula to calculate the intermediate output signal vector V _L (f, t) from the late reverberation estimation signal vector X _L (f, t) and the noise suppression mask vector M _N (f, t). That is, the gain correction unit 26 uses the noise suppression mask to correct the amplitude of the estimated late reverberation component. Because the noise suppression mask vector M _N (f, t) obtained by the noise suppression mask generation unit 22 is shared by the noise suppression processing unit 23 and the gain correction unit 26, the gain correction unit 26 can correct the amplitude of the estimated late reverberation component in step with the suppression applied by the noise suppression processing unit 23. V _L ⁽ⁱ⁾ (f, t) denotes the i-th element of the vector V _L (f, t).
V _L ⁽ⁱ⁾ (f, t) = M _N ⁽ⁱ⁾ (f, t) X _L ⁽ⁱ⁾ (f, t)

後期残響音抑圧マスク生成部２７は、以下の計算式を用いて、雑音抑圧後の中間出力信号ベクトルＶ（ｆ，ｔ）と、中間出力信号ベクトルＶ_Ｌ（ｆ，ｔ）とから、残響抑圧マスクベクトルＭ_Ｌ（ｆ，ｔ）を算出する。なお、Ｍ_Ｌ ^（ｉ）（ｆ，ｔ）は、ベクトルＭ_Ｌ（ｆ，ｔ）のｉ番目の要素を示す。また、係数α_２は、後期残響音抑圧の程度を調整するパラメータであり、ユーザにより適切な値が与えられる。
Ｍ_Ｌ ^（ｉ）（ｆ，ｔ）＝ｓｑｒｔ（｜Ｖ^（ｉ）（ｆ，ｔ）｜^２／（｜Ｖ^（ｉ）（ｆ，ｔ）｜^２＋α_２｜Ｖ_Ｌ ^（ｉ）（ｆ，ｔ）｜^２）） The late reverberation suppression mask generation unit 27 uses the following formula to calculate the reverberation suppression mask vector M _L (f, t) from the noise-suppressed intermediate output signal vector V (f, t) and the intermediate output signal vector V _L (f, t). M _L ⁽ⁱ⁾ (f, t) denotes the i-th element of the vector M _L (f, t). The coefficient α ₂ is a parameter that adjusts the degree of late reverberation suppression; an appropriate value is given by the user.
M _L ⁽ⁱ⁾ (f, t) = sqrt (| V ⁽ⁱ⁾ (f, t) | ² / (| V ⁽ⁱ⁾ (f, t) | ² + α ₂ | V _L ⁽ⁱ⁾ (f, t) | ² ))

後期残響抑圧処理部２８は、以下の計算式を用いて、雑音抑圧後の中間出力信号ベクトルＶ（ｆ，ｔ）と、残響抑圧マスクベクトルＭ_Ｌ（ｆ，ｔ）と、から、後期残響抑圧後の出力信号ベクトルＹ（ｆ，ｔ）を算出する。すなわち、後期残響抑圧処理部２８は、残響抑圧マスクを用いて、雑音抑圧後の中間出力に含まれる後期残響を抑圧する。なお、Ｙ^（ｉ）（ｆ，ｔ）は、ベクトルＹ（ｆ，ｔ）のｉ番目の要素を示す。
Ｙ^（ｉ）（ｆ，ｔ）＝Ｍ_Ｌ ^（ｉ）（ｆ，ｔ）Ｖ^（ｉ）（ｆ，ｔ） The late reverberation suppression processing unit 28 uses the following formula to calculate the output signal vector Y (f, t) after late reverberation suppression from the noise-suppressed intermediate output signal vector V (f, t) and the reverberation suppression mask vector M _L (f, t). That is, the late reverberation suppression processing unit 28 uses the reverberation suppression mask to suppress the late reverberation contained in the noise-suppressed intermediate output. Y ⁽ⁱ⁾ (f, t) denotes the i-th element of the vector Y (f, t).
Y ⁽ⁱ⁾ (f, t) = M _L ⁽ⁱ⁾ (f, t) V ⁽ⁱ⁾ (f, t)
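Taken together, the noise mask, gain correction, late reverberation mask, and final suppression form a short cascade for each channel and each time-frequency bin. The Python sketch below chains them in the order given in the description; the function names and the zero-denominator guard are assumptions for illustration.

```python
import math


def mask(target, interference, alpha):
    """sqrt(|target|^2 / (|target|^2 + alpha * |interference|^2))."""
    num = abs(target) ** 2
    den = num + alpha * abs(interference) ** 2
    return math.sqrt(num / den) if den > 0.0 else 0.0


def suppress_bin(x, x_n, x_e, h_l, alpha1=1.0, alpha2=1.0):
    """Run one channel of one time-frequency bin through both mask stages.

    x   : observed value X^(i)(f, t)
    x_n : estimated noise X_N^(i)(f, t)
    x_e : estimated direct and early speech X_E^(i)(f, t)
    h_l : late reverberation filter coefficient H_L(f)
    """
    m_n = mask(x, x_n, alpha1)   # noise suppression mask M_N
    v = m_n * x                  # noise-suppressed signal V
    x_l = x_e * h_l              # pseudo late reverberation X_L
    v_l = m_n * x_l              # gain-corrected late reverberation V_L
    m_l = mask(v, v_l, alpha2)   # late reverberation suppression mask M_L
    return m_l * v               # output Y
```

Scaling X_L by the same M_N that was applied to X keeps the reverberation estimate at the level of the already noise-suppressed signal, which is the stated purpose of the gain correction.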

直接音声強調部２９は、以下の算出式を用いて、出力信号ベクトルＹ（ｆ，ｔ）から、直接音声強調後の出力信号ベクトルＯ（ｆ，ｔ）を算出する。すなわち、直接音声強調部２９は、ユーザ方位θにビームを向け、直接音声を強調する。なお、ユーザ方位θは、音声・雑音選択部１２で雑音成分を推定する際に得られる。
Ｏ（ｆ，ｔ）＝Σ_ｉＹ^（ｉ）（ｆ，ｔ）Ｈ_ＤＳ ^（ｉ）（ｆ，ｔ）
なお、Ｈ_ＤＳ ^（ｉ）（ｆ，ｔ）は、ＤｅｌａｙａｎｄＳｕｍのフィルタ係数であり、Σ_ｉは全てのチャネルｉ（全てのマイクロホン素子）についての平均化処理を行うことを示す。また、ＤｅｌａｙａｎｄＳｕｍは、出力信号ベクトルＹ（ｆ，ｔ）から推定されたユーザ方位θを用いてマイクロホン素子間の到来時間差を補正し、ユーザ方位にビームを形成する手法である。 The direct speech enhancement unit 29 uses the following formula to calculate the output signal vector O (f, t) after direct speech enhancement from the output signal vector Y (f, t). That is, the direct speech enhancement unit 29 steers a beam toward the user direction θ and enhances the direct speech. The user direction θ is obtained when the voice / noise selection unit 12 estimates the noise component.
O (f, t) = Σ _i Y ⁽ⁱ⁾ (f, t) H _DS ⁽ⁱ⁾ (f, t)
H _DS ⁽ⁱ⁾ (f, t) is the Delay-and-Sum filter coefficient, and Σ _i indicates averaging over all channels i (all microphone elements). Delay-and-Sum is a method that uses the user direction θ estimated from the output signal vector Y (f, t) to compensate for the arrival time differences between the microphone elements and form a beam in the user direction.
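A minimal Delay-and-Sum sketch for one frequency bin is given below. It assumes a linear array, a far-field source, an angle θ measured from broadside, and a speed of sound of 343 m/s; none of these details are given in the source, and the function names are hypothetical. The weights computed inside delay_and_sum play the role of H _DS ⁽ⁱ⁾, including the 1/M averaging factor.

```python
import cmath
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed room-temperature value


def steering_delays(positions, theta):
    """Far-field arrival delay (s) at each microphone of a linear array.

    positions : microphone coordinates along the array axis in metres
    theta     : source direction in radians, measured from broadside
    Convention assumed here: a microphone further along the positive
    axis hears a source at positive theta earlier (negative delay).
    """
    return [-p * math.sin(theta) / SPEED_OF_SOUND for p in positions]


def delay_and_sum(y_vec, freq_hz, delays):
    """Average the channels after undoing each channel's arrival delay."""
    weights = [cmath.exp(2j * math.pi * freq_hz * tau) / len(delays)
               for tau in delays]
    return sum(y * w for y, w in zip(y_vec, weights))
```

A plane wave arriving from the steered direction is reproduced unchanged at the output, while a wave from another direction is summed with mismatched phases and is attenuated. The same averaging also smooths channel-specific distortion left by the masks.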

以上説明したように、本実施の形態にかかる雑音及び残響抑圧装置１００によれば、新たな空間の観測信号に基づいて推定直接音声及び初期反射音声を算出し、後期残響特性を空間の残響特性から自動的に作成して、算出した推定直接音声及び初期反射音声と、作成した後期残響特性と、から後期残響を算出することができるため、環境が変化した場合においても、環境の変化に適応して、雑音及び残響の両方を自動的に抑圧することができる。 As described above, the noise and dereverberation apparatus 100 according to the present embodiment calculates the estimated direct speech and early reflection speech from the observation signal of a new space, automatically creates the late reverberation characteristic from the reverberation characteristics of that space, and calculates the late reverberation from the estimated direct speech and early reflection speech and the created late reverberation characteristic. Even when the environment changes, the apparatus can therefore adapt to the change and automatically suppress both noise and reverberation.

その他の実施の形態．
上述した実施の形態では、フィルタ作成部１０がブラインド音源分離（ＢＳＳ）を行う例を説明したが、本発明はこれに限定されない。例えば、図４に示すように、ブラインド音源分離（ＢＳＳ）に代えて、ブラインド信号抽出（ＢＳＥ）を適用するものとしてもよい。すなわち、図４に示すように、雑音及び残響抑圧装置２００は、マイクロホンアレイ２と、ブラインド信号抽出（ＢＳＥ）２０１と、射影ベクトル推定部２０２と、雑音推定部２０３と、雑音抑圧マスク生成部２０４と、雑音抑圧処理部２０５と、直接音声・初期反射音声推定部２０６と、後期残響生成部２０７と、ゲイン補正部２０８と、後期残響音抑圧マスク生成部２０９と、後期残響音抑圧処理部２１０と、直接音声強調部２１１と、を備える構成としてもよい。 Other embodiments.
In the above-described embodiment, the example in which the filter creation unit 10 performs blind sound source separation (BSS) has been described, but the present invention is not limited to this. For example, as shown in FIG. 4, blind signal extraction (BSE) may be applied instead of blind sound source separation (BSS). That is, as illustrated in FIG. 4, the noise and dereverberation apparatus 200 includes a microphone array 2, a blind signal extraction (BSE) 201, a projection vector estimation unit 202, a noise estimation unit 203, and a noise suppression mask generation unit 204. A noise suppression processing unit 205, a direct speech / early reflection speech estimation unit 206, a late reverberation generation unit 207, a gain correction unit 208, a late reverberation suppression mask generation unit 209, and a late reverberation suppression unit 210. And a direct voice emphasis unit 211.

図４に示したブラインド信号抽出（ＢＳＥ）２０１では、観測信号からユーザ音声を抽出して出力し、射影ベクトル推定部２０２では、これに基づいて、空間伝達特性行列の推定値を出力する。また、雑音推定部２０３で推定したユーザ方位θが、音声強調部２１１に出力される。なお、ブラインド信号抽出（ＢＳＥ）２０１や射影ベクトル推定部２０２で行う処理は公知であるため、ここではその詳細な説明を省略する。 In the blind signal extraction (BSE) 201 shown in FIG. 4, user speech is extracted from the observation signal and output, and the projection vector estimation unit 202 outputs an estimated value of the spatial transfer characteristic matrix based on this. In addition, the user orientation θ estimated by the noise estimation unit 203 is output to the speech enhancement unit 211. Since the processing performed by the blind signal extraction (BSE) 201 and the projection vector estimation unit 202 is known, detailed description thereof is omitted here.

また、上述した実施の形態では、雑音抑圧マスクと後期残響抑圧マスクとを別々のマスク生成部で生成する例を説明したが、本発明はこれに限定されない。例えば、図５に示すように、雑音抑圧マスクと後期残響抑圧マスクとを一つのマスク生成部で生成し、雑音抑圧処理と後期残響抑圧処理とを一つの抑圧処理部で行うものとしてもよい。すなわち、図５に示すように、雑音及び残響抑圧装置３００は、マイクロホンアレイ２と、ブラインド音源分離（ＢＳＳ）３０１と、音声・雑音選択部３０２と、逆行列演算部３０３と、雑音推定部３０４と、直接音声・初期反射音声推定部３０５と、後期残響生成部３０６と、雑音・後期残響音抑圧マスク生成部３０７と、雑音・後期残響音抑圧処理部３０８と、直接音声強調部３０９と、を備える構成としてもよい。 In the above-described embodiment, an example in which a noise suppression mask and a late dereverberation suppression mask are generated by separate mask generation units has been described, but the present invention is not limited to this. For example, as shown in FIG. 5, the noise suppression mask and the late dereverberation suppression mask may be generated by one mask generation unit, and the noise suppression process and the late dereverberation suppression process may be performed by one suppression processing unit. That is, as shown in FIG. 5, the noise and dereverberation apparatus 300 includes a microphone array 2, a blind sound source separation (BSS) 301, a speech / noise selection unit 302, an inverse matrix calculation unit 303, and a noise estimation unit 304. A direct speech / early reflection speech estimation unit 305, a late reverberation generation unit 306, a noise / late reverberation suppression mask generation unit 307, a noise / late reverberation suppression unit 308, a direct speech enhancement unit 309, It is good also as a structure provided with.

また、例えば、図６に示すように、ブラインド音源分離（ＢＳＳ）に代えて、ブラインド信号抽出（ＢＳＥ）を適用すると共に、雑音抑圧マスクと後期残響抑圧マスクとを一つのマスク生成部で生成し、雑音抑圧処理と後期残響抑圧処理とを一つの抑圧処理部で行うものとしてもよい。すなわち、図６に示すように、雑音及び残響抑圧装置４００は、マイクロホンアレイ２と、ブラインド信号抽出（ＢＳＥ）４０１と、射影ベクトル推定部４０２と、雑音推定部４０３と、直接音声・初期反射音声推定部４０４と、後期残響生成部４０５と、雑音・後期残響音抑圧マスク生成部４０６と、雑音・後期残響音抑圧処理部４０７と、直接音声強調部４０８と、を備える構成としてもよい。 Further, for example, as shown in FIG. 6, blind signal extraction (BSE) is applied instead of blind sound source separation (BSS), and a noise suppression mask and a late reverberation suppression mask are generated by one mask generation unit. The noise suppression process and the late dereverberation suppression process may be performed by one suppression processing unit. That is, as shown in FIG. 6, the noise and dereverberation apparatus 400 includes a microphone array 2, a blind signal extraction (BSE) 401, a projection vector estimation unit 402, a noise estimation unit 403, a direct voice and an early reflection voice. The estimation unit 404, the late reverberation generation unit 405, the noise / late reverberation suppression mask generation unit 406, the noise / late reverberation suppression processing unit 407, and the direct speech enhancement unit 408 may be provided.

ここで、図１０やなどに示した本発明に関連する技術と比較した場合に、本発明と相違する点及び有利な効果についてさらに説明する。
（１）後期残響特性について
図１０に示した技術では、空間の残響特性を予め与える必要がある。
これに対して本発明では、部屋などの残響時間から、自動的にその残響特性を作成することができる。
（２）残響音の推定方法について
図１０に示した技術では、ユーザ音声の直接音と、初期反射音と、後期残響音とが含まれた信号を対象として、残響音の推定を行っている。
これに対して本発明では、ユーザ音声の直接音と、初期反射音と、が含まれた信号を対象として残響音の推定を行っている。これは、直接音及び初期反射音の推定機能を更に備えたことで実現している。
（３）後期残響抑圧処理前のゲイン補正について
図１０に示した技術では、ゲイン補正係数は予め与えられ、補正係数を自動的に推定することができないため、予め部屋の特性を計測し、補正係数を求めておく必要がある。
これに対して本発明では、生成された後期残響は、自動的に補正される。本発明では、直接音声・初期反射音声を推定する際にカットされた伝達特性のパワー量（すなわち、切出されなかった区間のパワー量）と、作成する後期残響特性のパワー量とが、同量となるように補正する。
（４）直接音声強調処理による処理歪みの緩和について
図１０に示した技術及び本発明ともに、直接音声強調処理ではＤｅｌａｙａｎｄＳｕｍ（ＤＳ）処理を採用している。ＤＳ処理には平均化処理が含まれており、その処理により各チャネルで生じていた抑圧処理歪みが緩和されるという副作用がある。
図１０に示した技術では、直接音声強調処理後に残響抑圧処理を行うため、各チャネルの残響抑圧処理の歪みは緩和されない。
これに対して本発明では、雑音及び残響の両方を抑圧した後に直接音声強調処理を実施するため、各チャネルの残響除去処理の歪みを緩和することができる。 Here, when compared with the technique related to the present invention shown in FIG. 10 and the like, the points different from the present invention and advantageous effects will be further described.
(1) Late reverberation characteristics
In the technique shown in FIG. 10, the reverberation characteristics of the space must be given in advance.
In the present invention, by contrast, the reverberation characteristic can be created automatically from the reverberation time of the room or other space.
(2) Method of estimating the reverberation
In the technique shown in FIG. 10, the reverberation is estimated from a signal containing the direct sound, early reflections, and late reverberation of the user speech.
In the present invention, by contrast, the reverberation is estimated from a signal containing the direct sound and early reflections of the user speech. This is made possible by the added function for estimating the direct sound and early reflections.
(3) Gain correction before late reverberation suppression
In the technique shown in FIG. 10, the gain correction coefficient is given in advance and cannot be estimated automatically, so the characteristics of the room must be measured beforehand to obtain the coefficient.
In the present invention, by contrast, the generated late reverberation is corrected automatically. The correction makes the power of the created late reverberation characteristic equal to the power of the transfer characteristic that was cut away when estimating the direct speech and early reflection speech (that is, the power of the section that was not extracted).
(4) Reduction of processing distortion by direct speech enhancement
Both the technique shown in FIG. 10 and the present invention use Delay-and-Sum (DS) processing for direct speech enhancement. DS processing includes averaging, which has the side effect of reducing the suppression-induced distortion in each channel.
In the technique shown in FIG. 10, reverberation suppression is performed after direct speech enhancement, so the distortion caused by reverberation suppression in each channel is not reduced.
In the present invention, by contrast, direct speech enhancement is performed after both noise and reverberation have been suppressed, so the distortion caused by reverberation suppression in each channel can be reduced.

なお、本発明は上記実施の形態に限られたものではなく、趣旨を逸脱しない範囲で適宜変更することが可能である。 Note that the present invention is not limited to the above-described embodiment, and can be changed as appropriate without departing from the spirit of the present invention.

１００、２００、３００、４００雑音及び残響抑圧装置、
１マイクロホン、２マイクロホンアレイ、１０フィルタ作成部、
１１ブラインド音源分離（ＢＳＳ）、１２音声・雑音選択部、
１３逆行列演算部、２１雑音推定部、２２雑音抑圧マスク生成部、
２３雑音抑圧処理部、２４直接音声・初期反射音声推定部、
２５後期残響生成部、２６ゲイン補正部、２７後期残響音抑圧マスク生成部、
２８後期残響音抑圧処理部、２９直接音声強調部、

２０１ブラインド信号抽出（ＢＳＥ）、２０２射影ベクトル推定部、
２０３雑音推定部、２０４雑音抑圧マスク生成部、
２０５雑音抑圧処理部、２０６直接音声・初期反射音声推定部、
２０７後期残響生成部、２０８ゲイン補正部、
２０９後期残響音抑圧マスク生成部、２１０後期残響音抑圧処理部、
２１１直接音声強調部、

３０１ブラインド音源分離（ＢＳＳ）、３０２音声・雑音選択部、
３０３逆行列演算部、３０４雑音推定部、
３０５直接音声・初期反射音声推定部、３０６後期残響生成部、
３０７雑音・後期残響音抑圧マスク生成部、３０８雑音・後期残響音抑圧処理部、
３０９直接音声強調部、

４０１ブラインド信号抽出（ＢＳＥ）、４０２射影ベクトル推定部、
４０３雑音推定部、４０４直接音声・初期反射音声推定部、
４０５後期残響生成部、４０６雑音・後期残響音抑圧マスク生成部、
４０７雑音・後期残響音抑圧処理部、４０８直接音声強調部、

５０１後期残響推定部、５０２ゲイン補正部、５０３残響抑圧処理部、

６０１ブラインド音原分離（ＢＳＳ）、６０２音声・雑音選択部、
６０３多チャンネル雑音推定部、６０４マスク生成部、
６０５雑音抑圧処理部、６０６直接音声強調部、

７０１ブラインド音源分離（ＢＳＳ）、７０２音声・雑音選択部、
７０３多チャンネル雑音推定部、７０４マスク生成部、
７０５雑音抑圧処理部、７０６直接音声強調部、７０７後期残響推定部、
７０８ゲイン補正部、７０９残響抑圧処理部、

Ｐユーザ、Ｒロボット 100, 200, 300, 400 Noise and dereverberation device,
1 microphone, 2 microphone array, 10 filter creation unit,
11 Blind sound source separation (BSS), 12 Voice / noise selection unit,
13 inverse matrix calculation unit, 21 noise estimation unit, 22 noise suppression mask generation unit,
23 noise suppression processing unit, 24 direct speech / early reflection speech estimation unit,
25 late reverberation generation unit, 26 gain correction unit, 27 late reverberation suppression mask generation unit,
28 late reverberation suppression processing unit, 29 direct speech enhancement unit,

201 Blind signal extraction (BSE), 202 Projection vector estimation unit,
203 noise estimation unit, 204 noise suppression mask generation unit,
205 Noise suppression processing unit, 206 Direct speech / early reflection speech estimation unit,
207 late reverberation generation unit, 208 gain correction unit,
209 Late reverberation suppression mask generation unit, 210 Late reverberation suppression processing unit,
211 Direct speech enhancement unit,

301 blind sound source separation (BSS), 302 voice / noise selection unit,
303 Inverse matrix calculation unit, 304 Noise estimation unit,
305 direct speech / early reflection speech estimation unit, 306 late reverberation generation unit,
307 noise / late reverberation suppression mask generation unit, 308 noise / late reverberation suppression unit,
309 Direct speech enhancement unit,

401 Blind signal extraction (BSE), 402 Projection vector estimation unit,
403 noise estimation unit, 404 direct speech / early reflection speech estimation unit,
405 Late reverberation generator, 406 Noise / late reverberation suppression mask generator,
407 noise / late reverberation suppression processing unit, 408 direct speech enhancement unit,

501 late reverberation estimation unit, 502 gain correction unit, 503 reverberation suppression processing unit,

601 Blind sound source separation (BSS), 602 voice / noise selection unit,
603 multi-channel noise estimation unit, 604 mask generation unit,
605 noise suppression processing unit, 606 direct speech enhancement unit,

701 Blind sound source separation (BSS), 702 Voice / noise selection unit,
703 multi-channel noise estimation unit, 704 mask generation unit,
705 noise suppression processing unit, 706 direct speech enhancement unit, 707 late reverberation estimation unit,
708 gain correction unit, 709 dereverberation processing unit,

P user, R robot

Claims

A filter creation unit that creates a separation filter matrix at each frequency for separating the speech from the mixed observation signal, using an input signal in which the mixed observation signal including speech and noise is converted into a frequency domain;
A noise estimation unit that calculates an estimated noise using the input signal, the separation filter matrix, and an inverse matrix of the separation filter matrix;
Each element of the inverse matrix of the separation filter matrix is converted into the time domain and acquired as an estimated spatial transfer characteristic, and a section corresponding to the direct voice and initial reflection of the voice is extracted from the estimated spatial transfer characteristic, and the extracted section Is converted into the frequency domain to create a direct speech and initial reflection filter matrix at each frequency, and the estimated direct speech is estimated using the input signal, the separation filter matrix, and the direct speech and initial reflection filter matrix. And a direct speech / initial reflection speech estimator for calculating early reflection speech,
From the reverberation time of the given space and the power amount of the section extracted by the direct speech / initial reflected speech estimation unit, the filter coefficient of the late reverberation characteristics at each frequency is calculated, and the estimated direct speech and early reflected speech are calculated. A late reverberation generation unit that calculates pseudo late reverberation using the separation filter matrix and the filter coefficient of the late reverberation characteristic,
A noise and dereverberation apparatus comprising the above units, wherein noise and late reverberation in the mixed observation signal are suppressed using the estimated noise calculated by the noise estimation unit and the pseudo late reverberation calculated by the late reverberation generation unit.

The noise and reverberation suppressing device according to claim 1, further comprising:
a noise suppression mask generation unit that calculates a noise suppression mask using the input signal and the estimated noise calculated by the noise estimation unit;
a noise suppression processing unit that suppresses noise in the mixed observation signal using the noise suppression mask; and
a gain correction unit that corrects the amplitude of the pseudo late reverberation calculated by the late reverberation generation unit using the noise suppression mask calculated by the noise suppression mask generation unit,
wherein late reverberation in the mixed observation signal is suppressed using the pseudo late reverberation whose amplitude has been corrected by the gain correction unit.
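The relationship between the noise suppression mask and the gain correction can be sketched as follows. The claim does not state the form of the mask; this sketch assumes a Wiener-type power-ratio gain with a spectral floor, and assumes that gain correction means scaling the pseudo late reverberation by the same mask so that its amplitude matches the noise-suppressed signal. All function names and the `floor` and `eps` parameters are hypothetical.

```python
import numpy as np

def noise_suppression_mask(X, N_est, floor=0.1, eps=1e-12):
    """Assumed Wiener-type gain in [floor, 1] per time-frequency bin."""
    power_x = np.abs(X) ** 2        # power of the input signal
    power_n = np.abs(N_est) ** 2    # power of the estimated noise
    return np.maximum(1.0 - power_n / (power_x + eps), floor)

def suppress_noise(X, mask):
    """Apply the mask to the input spectrum."""
    return mask * X

def correct_reverb_gain(R_late, mask):
    """Scale the pseudo late reverberation by the noise suppression mask
    (an assumed form of the gain correction)."""
    return mask * R_late
```

With this choice, bins dominated by noise are attenuated to the floor in both the signal and the pseudo late reverberation, so the later reverberation suppression does not over-subtract there.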

The noise and reverberation suppressing device according to claim 2, further comprising:
a late reverberation suppression mask generation unit that calculates a late reverberation suppression mask using the input signal and the pseudo late reverberation whose amplitude has been corrected by the gain correction unit; and
a late reverberation suppression unit that suppresses reverberation in the mixed observation signal using the late reverberation suppression mask.

The noise and reverberation suppressing device according to any one of claims 1 to 3, wherein the filter creation unit comprises:
a blind source separation unit that performs adaptive learning using the input signal to create, at each frequency, a separation filter matrix that separates the speech from the mixed observation signal;
a speech/noise selection unit that reorders the separation filter matrix created by the blind source separation unit so that its first element corresponds to the speech; and
an inverse matrix calculation unit that calculates the inverse matrix of the separation filter matrix after the reordering by the speech/noise selection unit.
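The reordering and inversion steps of the filter creation unit can be sketched as below. The claim does not say how the speech output is identified; this sketch assumes, purely for illustration, that the separated output with the highest kurtosis of its magnitude is the speech, and that the decision is made independently at each frequency. The function name, array shapes, and the kurtosis criterion are all assumptions.

```python
import numpy as np

def select_speech_first(W, X, eps=1e-12):
    """Reorder each separation matrix so that output 0 is the assumed speech.

    W : complex array, shape (F, K, K), separation matrices from blind
        source separation (hypothetical layout).
    X : complex array, shape (F, T, K), input spectra over T frames.
    """
    # Separated outputs y = W x at every frequency and frame.
    Y = np.einsum('fkm,ftm->ftk', W, X)
    mag = np.abs(Y)
    # Normalised fourth moment per frequency and output (kurtosis-like score).
    m2 = np.mean(mag ** 2, axis=1)
    m4 = np.mean(mag ** 4, axis=1)
    score = m4 / (m2 ** 2 + eps)
    speech_idx = np.argmax(score, axis=1)
    W_perm = W.copy()
    for f, k in enumerate(speech_idx):
        # Swap rows 0 and k so the speech output comes first.
        W_perm[f, [0, k]] = W[f, [k, 0]]
    # Inverse of the reordered separation matrix at each frequency.
    A = np.linalg.inv(W_perm)
    return W_perm, A
```

A real system would also need the selection to be consistent across frequencies; that is outside the scope of this sketch.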

The noise and reverberation suppressing device according to any one of claims 1 to 4, further comprising a direct speech enhancement unit that enhances the direct speech in the signal obtained by suppressing the noise and the late reverberation in the mixed observation signal using the estimated noise calculated by the noise estimation unit and the pseudo late reverberation calculated by the late reverberation generation unit.

A noise and reverberation suppression method comprising:
converting a mixed observation signal including speech and noise into the frequency domain to obtain an input signal;
creating, using the input signal, a separation filter matrix at each frequency that separates the speech from the mixed observation signal;
calculating an inverse matrix of the created separation filter matrix;
calculating estimated noise using the input signal, the created separation filter matrix, and the inverse matrix of the created separation filter matrix;
converting each element of the inverse matrix of the created separation filter matrix into the time domain to obtain an estimated spatial transfer characteristic;
extracting a section corresponding to the direct speech and the early reflections of the speech from the obtained estimated spatial transfer characteristic;
converting the extracted section into the frequency domain to create a direct speech and early reflection filter matrix at each frequency;
calculating estimated direct speech and early reflected speech using the input signal, the created separation filter matrix, and the created direct speech and early reflection filter matrix;
calculating a filter coefficient of the late reverberation characteristic at each frequency from the reverberation time of a given space and the power of the extracted section;
calculating pseudo late reverberation using the calculated estimated direct speech and early reflected speech, the created separation filter matrix, and the calculated filter coefficient of the late reverberation characteristic; and
suppressing the noise and the late reverberation in the mixed observation signal using the calculated estimated noise and the calculated pseudo late reverberation.
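The step of calculating late reverberation filter coefficients from the reverberation time and the power of the extracted section can be sketched as follows. The claim does not give a formula; this sketch assumes an exponential decay in which the energy falls by 60 dB over the reverberation time, scaled at each frequency by the square root of the power of the extracted section. The function name and the parameters `hop`, `fs`, and `n_delays` are hypothetical.

```python
import numpy as np

def late_reverb_coefficients(t60, early_power, hop, fs, n_delays):
    """Amplitude coefficients of an assumed exponentially decaying tail.

    t60         : reverberation time of the space in seconds.
    early_power : power of the extracted direct/early section, one value
                  per frequency bin, shape (F,).
    hop, fs     : frame shift in samples and sampling rate in Hz.
    n_delays    : number of frame delays to generate.
    Returns an array of shape (n_delays, F).
    """
    # Amplitude decay rate such that energy drops by 60 dB after t60 seconds.
    delta = 3.0 * np.log(10.0) / t60
    # Delay of each frame in seconds.
    delays = np.arange(n_delays) * hop / fs
    envelope = np.exp(-delta * delays)
    scale = np.sqrt(np.asarray(early_power, dtype=float))
    return envelope[:, None] * scale[None, :]
```

For example, with `t60 = 0.5`, `fs = 16000`, and `hop = 800`, the coefficient at frame delay 10 corresponds to 0.5 s and is 60 dB below the coefficient at delay 0.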