JP6840302B2

JP6840302B2 - Information processing device, program, and information processing method

Info

Publication number: JP6840302B2
Application number: JP2020557460A
Authority: JP
Inventors: 訓古田; 松岡　文啓; 文啓松岡
Original assignee: Mitsubishi Electric Corp
Current assignee: Mitsubishi Electric Corp
Priority date: 2018-11-28
Filing date: 2018-11-28
Publication date: 2021-03-10
Anticipated expiration: 2038-11-28
Also published as: JPWO2020110228A1; WO2020110228A1

Description

The present invention relates to an information processing device, a program, and an information processing method.

With recent advances in digital signal processing technology, hands-free voice operation by speech recognition in automobiles and in living rooms at home, and hands-free calling, which allows a phone call to be made without holding a handset, have become widespread. Abnormal sound monitoring systems that capture and detect sounds such as an abnormal sound emitted by a machine or a human scream have also been developed.

In these hands-free voice operation systems, hands-free calling systems, and abnormal sound monitoring systems, a microphone is installed to collect a target sound, such as speech or an abnormal sound, in various noisy environments such as inside a moving automobile, a factory, an office, or a living room at home. However, such a microphone collects not only the target sound but also ambient noise and other voices (hereinafter referred to as disturbing sounds).

Methods for extracting the target sound individually from speech include, when a plurality of microphones is used, beamforming, in which signal processing steers directivity toward the target sound or steers a null toward the disturbing sound, and methods that estimate a mixing matrix by independent component analysis. However, although beamforming is excellent at suppressing noise, it is not very effective at separating speech, and the performance of independent component analysis degrades under reverberation or noise. Furthermore, in a real environment the number of disturbing sound sources is generally not limited to one, and it is difficult to separate more sound sources than there are microphones.

To address these problems, a method called binary masking has been proposed. Under the sparseness assumption that the target sound signal and the disturbing sound signal do not overlap each other in the time-frequency domain, it separates the sound source signals by masking the frequency components other than those of the target sound. Binary masking is easy to implement and is effective for suppressing directional disturbing sounds.
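The sparseness assumption can be illustrated with a minimal sketch. The function names and the toy single-frame spectra below are hypothetical and are not taken from this publication; the mask is built from known source magnitudes purely to show that, when the two sources occupy disjoint bins, a binary mask recovers the target exactly.

```python
def binary_mask(target_mag, interferer_mag):
    """Return 1.0 for bins where the target dominates, else 0.0."""
    return [1.0 if t > i else 0.0 for t, i in zip(target_mag, interferer_mag)]


def apply_mask(mask, spectrum):
    """Multiply each spectral bin by its mask coefficient."""
    return [m * x for m, x in zip(mask, spectrum)]


# Toy single-frame spectra with disjoint support (the sparseness assumption).
target = [2 + 1j, 0j, 0j, 1 - 1j]
interferer = [0j, 3 + 0j, 1j, 0j]
mixture = [t + i for t, i in zip(target, interferer)]

mask = binary_mask([abs(t) for t in target], [abs(i) for i in interferer])
separated = apply_mask(mask, mixture)
```

In practice the source magnitudes are not known, so the mask has to be estimated from the observed signals; when the sources overlap in a bin, the sparseness assumption fails and the binary decision introduces distortion.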

One method based on binary masking is the technique disclosed in Patent Document 1. Patent Document 1 discloses a method of improving the accuracy of binary masking for mixed speech whose sparseness is not guaranteed, by intentionally creating an amplitude difference between power spectra.

Patent Document 1: JP-A-2010-239424

However, because the conventional method intentionally creates a power difference between the power spectra of the main microphone input signal and the sub microphone input signal, it has the problem that errors arise in the mask coefficients.

One or more aspects of the present invention have been made to solve this problem, and an object thereof is to make it possible to easily obtain a high-quality target signal.

An information processing device according to a first aspect of the present invention includes: an analog/digital conversion unit that receives a first observed analog signal generated by a first microphone based on an observed sound, the observed sound including a target sound arriving from a first direction and a disturbing sound arriving from a second direction different from the first direction, and a second observed analog signal generated by a second microphone based on the observed sound, and that converts each of the first observed analog signal and the second observed analog signal into a digital signal, thereby generating a first observed digital signal and a second observed digital signal; a time/frequency conversion unit that converts each of the first observed digital signal and the second observed digital signal into a frequency-domain signal, thereby generating a first spectral component and a second spectral component; a mask generation unit that uses a cross-correlation function of the first spectral component and the second spectral component to calculate, based on the time difference between the time at which the observed sound arrives at the first microphone and the time at which it arrives at the second microphone, a filtering coefficient for masking spectral components of sound arriving from a direction different from the first direction; a masking filter unit that separates a spectral component by masking the first spectral component with the filtering coefficient; and a time/frequency inverse conversion unit that converts the separated spectral component into a time-domain signal, thereby generating an output digital signal. The mask generation unit includes: a mask coefficient calculation unit that uses the cross-correlation function of the first spectral component and the second spectral component to distinguish, based on a first time difference between the time at which the target sound arrives at the first microphone and the time at which it arrives at the second microphone, and a second time difference between the time at which the disturbing sound arrives at the first microphone and the time at which it arrives at the second microphone, sound in the observed sound that arrives from a first range including the first direction from sound that arrives from a second range including the second direction and not overlapping the first range, and that calculates a mask coefficient for separating spectral components of sound arriving from the first range from spectral components of sound arriving from the second range; an utterance amount ratio calculation unit that calculates the ratio of the amount of spectral components of sound arriving from the first range, among the first spectral component, to the amount of spectral components of sound arriving from the second range; a gain calculation unit that calculates a correction gain for correcting the mask coefficient such that the higher the ratio, the lower the strength of the masking; and a mask correction unit that calculates the filtering coefficient by correcting the mask coefficient with the correction gain.
An information processing device according to a second aspect of the present invention includes: an analog/digital conversion unit that receives a first observed analog signal generated by a first microphone based on an observed sound, the observed sound including a target sound arriving from a first direction and a disturbing sound arriving from a second direction different from the first direction, and a second observed analog signal generated by a second microphone based on the observed sound, and that converts each of the first observed analog signal and the second observed analog signal into a digital signal, thereby generating a first observed digital signal and a second observed digital signal; a time/frequency conversion unit that converts each of the first observed digital signal and the second observed digital signal into a frequency-domain signal, thereby generating a first spectral component and a second spectral component; a mask generation unit that uses a cross-correlation function of the first spectral component and the second spectral component to calculate, based on the time difference between the time at which the observed sound arrives at the first microphone and the time at which it arrives at the second microphone, a filtering coefficient for masking spectral components of sound arriving from a direction different from the first direction; a masking filter unit that separates a spectral component by masking the first spectral component with the filtering coefficient; and a time/frequency inverse conversion unit that converts the separated spectral component into a time-domain signal, thereby generating an output digital signal. The mask generation unit includes: a mask coefficient calculation unit that uses the cross-correlation function of the first spectral component and the second spectral component to distinguish, based on a first time difference between the time at which the target sound arrives at the first microphone and the time at which it arrives at the second microphone, and a second time difference between the time at which the disturbing sound arrives at the first microphone and the time at which it arrives at the second microphone, sound in the observed sound that arrives from a first range including the first direction from sound that arrives from a second range including the second direction and not overlapping the first range, and that calculates a mask coefficient for separating spectral components of sound arriving from the first range from spectral components of sound arriving from the second range; an utterance amount ratio calculation unit that sequentially calculates, over time, the ratio of the amount of spectral components of sound arriving from the first range, among the first spectral component, to the amount of spectral components of sound arriving from the second range, and that smooths the most recently calculated ratio using previously calculated ratios; a gain calculation unit that calculates a correction gain for correcting the mask coefficient such that the higher the smoothed ratio, the lower the strength of the masking; and a mask correction unit that calculates the filtering coefficient by correcting the mask coefficient with the correction gain.
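As a minimal sketch of how the units of the mask generation unit could fit together, the following code works on one frame of spectral components. Several points are assumptions made for illustration and are not stated in the text above: the per-bin arrival time difference is estimated from the phase of the cross-spectrum, the mask coefficient is binary, the amount of spectral components is measured as summed power, the correction gain is the hypothetical mapping ratio / (1 + ratio), and the correction raises blocked bins toward the gain. All function names are hypothetical. The DC bin is assumed to be excluded, since the delay estimate divides by the bin frequency.

```python
import cmath
import math


def bin_delay(x1, x2, freq_hz):
    """Estimate the inter-microphone delay (seconds) of one frequency bin
    from the phase of the cross-spectrum x1 * conj(x2)."""
    cross = x1 * x2.conjugate()
    return cmath.phase(cross) / (2.0 * math.pi * freq_hz)


def classify_bins(spec1, spec2, freqs, first_range, second_range):
    """Label each bin 'first', 'second', or None according to which
    (non-overlapping) delay range its estimated delay falls into."""
    labels = []
    for x1, x2, f in zip(spec1, spec2, freqs):
        d = bin_delay(x1, x2, f)
        if first_range[0] <= d <= first_range[1]:
            labels.append("first")
        elif second_range[0] <= d <= second_range[1]:
            labels.append("second")
        else:
            labels.append(None)
    return labels


def mask_coefficients(labels):
    """Assumed binary mask: pass bins from the first range, block the rest."""
    return [1.0 if label == "first" else 0.0 for label in labels]


def utterance_ratio(spec1, labels, eps=1e-12):
    """Power in first-range bins divided by power in second-range bins."""
    first = sum(abs(x) ** 2 for x, l in zip(spec1, labels) if l == "first")
    second = sum(abs(x) ** 2 for x, l in zip(spec1, labels) if l == "second")
    return first / (second + eps)


def correction_gain(ratio):
    """Hypothetical mapping to [0, 1) that grows with the ratio."""
    return ratio / (1.0 + ratio)


def correct_mask(mask, gain):
    """Raise blocked bins toward 1 as the gain grows (weaker masking)."""
    return [m + (1.0 - m) * gain for m in mask]


# Toy frame: bin 0 (500 Hz) arrives with zero delay, bin 1 (1000 Hz) is
# delayed by 0.2 ms at the second microphone.
freqs = [500.0, 1000.0]
spec1 = [2 + 0j, 1 + 0j]
spec2 = [spec1[0], spec1[1] * cmath.exp(-2j * math.pi * freqs[1] * 2e-4)]

labels = classify_bins(spec1, spec2, freqs, (-1e-4, 1e-4), (1.5e-4, 2.5e-4))
ratio = utterance_ratio(spec1, labels)
coeffs = correct_mask(mask_coefficients(labels), correction_gain(ratio))
```

With this mapping, a frame dominated by sound from the first range yields a gain near 1, so the bins attributed to the second range are only lightly attenuated; a frame dominated by the second range yields a gain near 0 and the mask stays close to binary.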

A program according to the first aspect of the present invention causes a computer to function as: an analog/digital conversion unit that receives a first observed analog signal generated by a first microphone based on an observed sound, the observed sound including a target sound arriving from a first direction and a disturbing sound arriving from a second direction different from the first direction, and a second observed analog signal generated by a second microphone based on the observed sound, and that converts each of the first observed analog signal and the second observed analog signal into a digital signal, thereby generating a first observed digital signal and a second observed digital signal; a time/frequency conversion unit that converts each of the first observed digital signal and the second observed digital signal into a frequency-domain signal, thereby generating a first spectral component and a second spectral component; a mask generation unit that uses a cross-correlation function of the first spectral component and the second spectral component to calculate, based on the time difference between the time at which the observed sound arrives at the first microphone and the time at which it arrives at the second microphone, a filtering coefficient for masking spectral components of sound arriving from a direction different from the first direction; a masking filter unit that separates a spectral component by masking the first spectral component with the filtering coefficient; and a time/frequency inverse conversion unit that converts the separated spectral component into a time-domain signal, thereby generating an output digital signal. The mask generation unit includes: a mask coefficient calculation unit that uses the cross-correlation function of the first spectral component and the second spectral component to distinguish, based on a first time difference between the time at which the target sound arrives at the first microphone and the time at which it arrives at the second microphone, and a second time difference between the time at which the disturbing sound arrives at the first microphone and the time at which it arrives at the second microphone, sound in the observed sound that arrives from a first range including the first direction from sound that arrives from a second range including the second direction and not overlapping the first range, and that calculates a mask coefficient for separating spectral components of sound arriving from the first range from spectral components of sound arriving from the second range; an utterance amount ratio calculation unit that calculates the ratio of the amount of spectral components of sound arriving from the first range, among the first spectral component, to the amount of spectral components of sound arriving from the second range; a gain calculation unit that calculates a correction gain for correcting the mask coefficient such that the higher the ratio, the lower the strength of the masking; and a mask correction unit that calculates the filtering coefficient by correcting the mask coefficient with the correction gain.
A program according to the second aspect of the present invention causes a computer to function as: an analog/digital conversion unit that receives a first observed analog signal generated by a first microphone based on an observed sound, the observed sound including a target sound arriving from a first direction and a disturbing sound arriving from a second direction different from the first direction, and a second observed analog signal generated by a second microphone based on the observed sound, and that converts each of the first observed analog signal and the second observed analog signal into a digital signal, thereby generating a first observed digital signal and a second observed digital signal; a time/frequency conversion unit that converts each of the first observed digital signal and the second observed digital signal into a frequency-domain signal, thereby generating a first spectral component and a second spectral component; a mask generation unit that uses a cross-correlation function of the first spectral component and the second spectral component to calculate, based on the time difference between the time at which the observed sound arrives at the first microphone and the time at which it arrives at the second microphone, a filtering coefficient for masking spectral components of sound arriving from a direction different from the first direction; a masking filter unit that separates a spectral component by masking the first spectral component with the filtering coefficient; and a time/frequency inverse conversion unit that converts the separated spectral component into a time-domain signal, thereby generating an output digital signal. The mask generation unit includes: a mask coefficient calculation unit that uses the cross-correlation function of the first spectral component and the second spectral component to distinguish, based on a first time difference between the time at which the target sound arrives at the first microphone and the time at which it arrives at the second microphone, and a second time difference between the time at which the disturbing sound arrives at the first microphone and the time at which it arrives at the second microphone, sound in the observed sound that arrives from a first range including the first direction from sound that arrives from a second range including the second direction and not overlapping the first range, and that calculates a mask coefficient for separating spectral components of sound arriving from the first range from spectral components of sound arriving from the second range; an utterance amount ratio calculation unit that sequentially calculates, over time, the ratio of the amount of spectral components of sound arriving from the first range, among the first spectral component, to the amount of spectral components of sound arriving from the second range, and that smooths the most recently calculated ratio using previously calculated ratios; a gain calculation unit that calculates a correction gain for correcting the mask coefficient such that the higher the smoothed ratio, the lower the strength of the masking; and a mask correction unit that calculates the filtering coefficient by correcting the mask coefficient with the correction gain.

An information processing method according to the first aspect of the present invention includes: receiving a first observed analog signal generated by a first microphone based on an observed sound, the observed sound including a target sound arriving from a first direction and a disturbing sound arriving from a second direction different from the first direction, and a second observed analog signal generated by a second microphone based on the observed sound, and converting each of the first observed analog signal and the second observed analog signal into a digital signal, thereby generating a first observed digital signal and a second observed digital signal; converting each of the first observed digital signal and the second observed digital signal into a frequency-domain signal, thereby generating a first spectral component and a second spectral component; using a cross-correlation function of the first spectral component and the second spectral component to calculate, based on the time difference between the time at which the observed sound arrives at the first microphone and the time at which it arrives at the second microphone, a filtering coefficient for masking spectral components of sound arriving from a direction different from the first direction; separating a spectral component by masking the first spectral component with the filtering coefficient; and converting the separated spectral component into a time-domain signal, thereby generating an output digital signal. Calculating the filtering coefficient includes: using the cross-correlation function of the first spectral component and the second spectral component to distinguish, based on a first time difference between the time at which the target sound arrives at the first microphone and the time at which it arrives at the second microphone, and a second time difference between the time at which the disturbing sound arrives at the first microphone and the time at which it arrives at the second microphone, sound in the observed sound that arrives from a first range including the first direction from sound that arrives from a second range including the second direction and not overlapping the first range, and calculating a mask coefficient for separating spectral components of sound arriving from the first range from spectral components of sound arriving from the second range; calculating the ratio of the amount of spectral components of sound arriving from the first range, among the first spectral component, to the amount of spectral components of sound arriving from the second range; calculating a correction gain for correcting the mask coefficient such that the higher the ratio, the lower the strength of the masking; and calculating the filtering coefficient by correcting the mask coefficient with the correction gain.
An information processing method according to the second aspect of the present invention includes: receiving a first observed analog signal generated by a first microphone based on an observed sound, the observed sound including a target sound arriving from a first direction and a disturbing sound arriving from a second direction different from the first direction, and a second observed analog signal generated by a second microphone based on the observed sound, and converting each of the first observed analog signal and the second observed analog signal into a digital signal, thereby generating a first observed digital signal and a second observed digital signal; converting each of the first observed digital signal and the second observed digital signal into a frequency-domain signal, thereby generating a first spectral component and a second spectral component; using a cross-correlation function of the first spectral component and the second spectral component to calculate, based on the time difference between the time at which the observed sound arrives at the first microphone and the time at which it arrives at the second microphone, a filtering coefficient for masking spectral components of sound arriving from a direction different from the first direction; separating a spectral component by masking the first spectral component with the filtering coefficient; and converting the separated spectral component into a time-domain signal, thereby generating an output digital signal. Calculating the filtering coefficient includes: using the cross-correlation function of the first spectral component and the second spectral component to distinguish, based on a first time difference between the time at which the target sound arrives at the first microphone and the time at which it arrives at the second microphone, and a second time difference between the time at which the disturbing sound arrives at the first microphone and the time at which it arrives at the second microphone, sound in the observed sound that arrives from a first range including the first direction from sound that arrives from a second range including the second direction and not overlapping the first range, and calculating a mask coefficient for separating spectral components of sound arriving from the first range from spectral components of sound arriving from the second range; sequentially calculating, over time, the ratio of the amount of spectral components of sound arriving from the first range, among the first spectral component, to the amount of spectral components of sound arriving from the second range, and smoothing the most recently calculated ratio using previously calculated ratios; calculating a correction gain for correcting the mask coefficient such that the higher the smoothed ratio, the lower the strength of the masking; and calculating the filtering coefficient by correcting the mask coefficient with the correction gain.
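The second aspect only states that the most recent ratio is smoothed using previously calculated ratios. One common way to do this is first-order recursive (exponential) smoothing, sketched below as an assumption; the function name and the smoothing constant `alpha` are hypothetical and do not come from this publication.

```python
def smooth_ratios(ratios, alpha=0.9):
    """Recursively smooth a sequence of per-frame ratios.

    Each output is alpha * (previous smoothed value) + (1 - alpha) * (new ratio);
    the first frame is passed through unchanged.
    """
    smoothed = []
    prev = None
    for r in ratios:
        prev = r if prev is None else alpha * prev + (1.0 - alpha) * r
        smoothed.append(prev)
    return smoothed


# With alpha = 0.5 the sudden drop in the second frame is halved.
example = smooth_ratios([4.0, 0.0, 2.0], alpha=0.5)
```

A larger `alpha` makes the smoothed ratio, and therefore the correction gain derived from it, react more slowly to short pauses or bursts in either speaker's utterances.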

According to one or more aspects of the present invention, a high-quality target signal can be easily obtained.

A block diagram schematically showing the configuration of the sound source separation device according to Embodiments 1 and 3.
A block diagram schematically showing the internal configuration of the mask generation unit in Embodiments 1 to 3.
A schematic diagram for explaining the arrangement of the first microphone and the second microphone and the arrival direction of the target sound.
Graphs (A) to (C) for explaining the utterance amount ratio when a target-sound speaker and a disturbing-sound speaker speak.
Graphs (A) and (B) for explaining the effect of Embodiment 1.
A block diagram showing a first hardware configuration example of the sound source separation device.
A block diagram showing a second hardware configuration example of the sound source separation device.
A flowchart showing the operation of the sound source separation device.
A block diagram schematically showing the configuration of an information processing system including the sound source separation device according to Embodiment 2.
A schematic diagram showing an example of a method of excluding the influence of noise other than the target sound and the disturbing sound.

Embodiment 1.
FIG. 1 is a block diagram schematically showing the configuration of a sound source separation device 100 serving as an information processing device according to Embodiment 1.
The sound source separation device 100 includes an analog/digital conversion unit (hereinafter, A/D conversion unit) 103, a time/frequency conversion unit (hereinafter, T/F conversion unit) 104, a mask generation unit 105, a masking filter unit 110, a time/frequency inverse conversion unit (hereinafter, T/F inverse conversion unit) 111, and a digital/analog conversion unit (hereinafter, D/A conversion unit) 112.
The sound source separation device 100 is connected to a first microphone 101 and a second microphone 102.

FIG. 2 is a block diagram schematically showing the internal configuration of the mask generation unit 105.
The mask generation unit 105 includes a mask coefficient calculation unit 106, an utterance amount ratio calculation unit 107, a gain calculation unit 108, and a mask correction unit 109.
The configuration and operating principle of the sound source separation device 100 of Embodiment 1 are described below with reference to FIGS. 1 and 2. The sound source separation device 100 forms a masking filter based on frequency-domain signals generated from the time-domain signals acquired by the first microphone 101 and the second microphone 102, and multiplies the frequency-domain signal corresponding to the signal acquired by the first microphone 101 by that filter, thereby obtaining an output signal of the target sound from which the disturbing sound has been removed.

Here, the first observed analog signal acquired by the first microphone 101 is also referred to as the first channel Ch1, and the second observed analog signal acquired by the second microphone 102 is also referred to as the second channel Ch2.
To simplify the following description, as shown in FIG. 3, the first microphone 101 and the second microphone 102 are assumed to lie in the same horizontal plane, and their positions are assumed to be known and not to change over time. The range of directions from which the target sound and the disturbing sound can arrive is likewise assumed not to change over time. The direction from which the target sound arrives is also referred to as the first direction, and the direction from which the disturbing sound arrives is also referred to as the second direction.
Here, the target sound and the disturbing sound are each described as speech from a different single speaker.

The first microphone 101 generates the first observed analog signal by converting the observed sound into an electric signal. The first observed analog signal is given to the A/D conversion unit 103.
The second microphone 102 generates the second observed analog signal by converting the observed sound into an electric signal. The second observed analog signal is given to the A/D conversion unit 103.

The A/D conversion unit 103 performs analog/digital conversion (hereinafter, A/D conversion) on each of the first observed analog signal given from the first microphone 101 and the second observed analog signal given from the second microphone 102, converting each into a digital signal and thereby generating a first observed digital signal and a second observed digital signal.

For example, the A/D conversion unit 103 samples the first observed analog signal given from the first microphone 101 at a predetermined sampling frequency and converts it into a digital signal divided into frames, thereby generating the first observed digital signal. Similarly, the A/D conversion unit 103 samples the second observed analog signal given from the second microphone 102 at the predetermined sampling frequency and converts it into a digital signal divided into frames, thereby generating the second observed digital signal. Here, the sampling frequency is, for example, 16 kHz, and the frame length is, for example, 16 ms.
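The example figures above imply 256 samples per frame (16,000 samples per second × 0.016 s). The framing step can be sketched as follows. The function names and the choice to drop a trailing partial frame are illustrative assumptions, not part of the described device.

```python
from typing import List, Sequence

SAMPLING_RATE_HZ = 16_000   # example value from the description
FRAME_LENGTH_MS = 16        # example value from the description


def samples_per_frame(rate_hz: int = SAMPLING_RATE_HZ,
                      frame_ms: int = FRAME_LENGTH_MS) -> int:
    """Number of samples contained in one frame."""
    return rate_hz * frame_ms // 1000


def split_into_frames(signal: Sequence[float],
                      frame_len: int) -> List[List[float]]:
    """Split a sampled signal into consecutive, non-overlapping frames.

    Assumption: a trailing partial frame is dropped.
    """
    n_frames = len(signal) // frame_len
    return [list(signal[i * frame_len:(i + 1) * frame_len])
            for i in range(n_frames)]
```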

The first observed digital signal generated from the first observed analog signal in the frame interval corresponding to sample number t is denoted x_1(t), and the second observed digital signal generated from the second observed analog signal in the frame interval corresponding to sample number t is denoted x_2(t).
The first observed digital signal x_1(t) and the second observed digital signal x_2(t) are given to the T/F conversion unit 104.

The T/F conversion unit 104 receives the first observed digital signal x_1(t) and the second observed digital signal x_2(t) and converts these time-domain signals into a first short-time spectral component X_1(ω, τ) and a second short-time spectral component X_2(ω, τ) in the frequency domain, where ω is the spectrum number, that is, the discrete frequency, and τ is the frame number.

Specifically, the T/F conversion unit 104 generates the first short-time spectral component X_1(ω, τ) by applying, for example, a 512-point fast Fourier transform to the first observed digital signal x_1(t). Similarly, the T/F conversion unit 104 generates the second short-time spectral component X_2(ω, τ) from the second observed digital signal x_2(t).
In the following, unless otherwise noted, the short-time spectral component of the current frame is referred to simply as the spectral component.
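A minimal sketch of the time/frequency conversion of one frame is given below. The description does not say how a frame is mapped onto the 512 transform points, so zero-padding is assumed here. A direct O(n²) discrete Fourier transform is used in place of a fast Fourier transform so that the block needs only the standard library; both produce the same spectrum. The function name is hypothetical.

```python
import cmath
from typing import List, Sequence


def short_time_spectrum(frame: Sequence[float], n_fft: int = 512) -> List[complex]:
    """Return the n_fft-point DFT of one frame.

    Assumption: frames shorter than n_fft are zero-padded.
    A direct DFT stands in for the FFT named in the description.
    """
    if len(frame) > n_fft:
        raise ValueError("frame is longer than the transform size")
    padded = list(frame) + [0.0] * (n_fft - len(frame))
    spectrum = []
    for k in range(n_fft):
        acc = 0j
        for n, x in enumerate(padded):
            acc += x * cmath.exp(-2j * cmath.pi * k * n / n_fft)
        spectrum.append(acc)
    return spectrum
```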

The mask generation unit 105 receives the first spectral component X_1(ω, τ) and the second spectral component X_2(ω, τ) and calculates a time-frequency filter coefficient b_mod(ω, τ), which is a filtering coefficient for the masking that separates the target sound. For example, using the cross-correlation function of X_1(ω, τ) and X_2(ω, τ), the mask generation unit 105 calculates, from the difference between the time at which the observed sound arrives at the first microphone 101 and the time at which it arrives at the second microphone 102, a filtering coefficient for masking the spectral components of sound arriving from directions other than the first direction, from which the target sound arrives.

To obtain the time-frequency filter coefficient b_mod(ω, τ), it is assumed, as shown in FIG. 3, that in the horizontal plane in which the first microphone 101 and the second microphone 102 are placed, the target sound arrives from a direction within a predetermined angle θ of the perpendicular direction V_1 of the first microphone 101 and the perpendicular direction V_2 of the second microphone 102. The disturbing sound is assumed to arrive from the side opposite the target sound with respect to V_1 and V_2.

Here, the perpendicular direction V_1 of the first microphone 101 and the perpendicular direction V_2 of the second microphone 102 are assumed to be perpendicular to the straight line connecting the first microphone 101 and the second microphone 102. Note that V_1 and V_2 are predetermined reference directions and need not necessarily be perpendicular to that line.
The distance between the first microphone 101 and the second microphone 102 is assumed to be d.

To determine whether the sound picked up by the first microphone 101 and the second microphone 102 is the target sound or the disturbing sound, it is necessary to estimate, from the signals of the two microphones, whether the direction of arrival falls within the desired range. Because the time difference between the signals from the first microphone 101 and the second microphone 102 is determined by the angle θ, the direction of arrival can be estimated from this time difference. This is described below with reference to FIGS. 2 and 3.

First, as shown in equation (1) below, the mask coefficient calculation unit 106 calculates the cross spectrum D(ω, τ) from the cross-correlation function of the first spectral component X_1(ω, τ) and the second spectral component X_2(ω, τ). The mask coefficient calculation unit 106 then gives the calculated cross spectrum D(ω, τ) to the utterance amount ratio calculation unit 107.
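Equation (1) is not reproduced in this text. Under the usual definition of a cross spectrum, and writing K and Q for its real and imaginary parts as the description does, it would take the following form. Which of the two channels is conjugated is an assumption; swapping them flips the sign of Q.

```latex
D(\omega,\tau) = X_1(\omega,\tau)\,X_2^{*}(\omega,\tau) = K(\omega,\tau) + j\,Q(\omega,\tau)
```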

Next, the mask coefficient calculation unit 106 obtains the phase Θ_D(ω, τ) of the cross spectrum D(ω, τ) using equation (2) below.
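Equation (2) is likewise missing from this text. With Q(ω, τ) and K(ω, τ) the imaginary and real parts of D(ω, τ), the phase would be expected to have the standard arctangent form below; this is a reconstruction, not the original typesetting.

```latex
\Theta_D(\omega,\tau) = \tan^{-1}\!\left(\frac{Q(\omega,\tau)}{K(\omega,\tau)}\right)
```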

Here, Q(ω, τ) and K(ω, τ) denote the imaginary part and the real part, respectively, of the cross spectrum D(ω, τ).

The phase Θ_D(ω, τ) obtained by equation (2) above is the phase angle between the first channel Ch1 and the second channel Ch2 for each spectral component, and dividing it by the discrete frequency ω gives the time delay between the two signals. That is, the time difference δ(ω, τ) between the first channel Ch1 and the second channel Ch2 can be expressed as in equation (3) below.
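A reconstruction of equation (3), which is missing here, follows directly from the preceding sentence (phase divided by frequency). For the quotient to be a time in seconds, ω has to be read as the angular frequency of the bin, which is an assumption.

```latex
\delta(\omega,\tau) = \frac{\Theta_D(\omega,\tau)}{\omega}
```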

Next, the theoretical value δ_θ of the time difference observed when sound arrives from the direction of angle θ can be expressed, using the distance d, as in equation (4) below, where c is the speed of sound.
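Equation (4) is also missing. For two microphones a distance d apart and a plane wave arriving at angle θ measured from the perpendicular to the line joining them (a far-field assumption), the path difference is d sin θ, which suggests the following reconstruction.

```latex
\delta_\theta = \frac{d \sin\theta}{c}
```

As an illustration with assumed values d = 0.1 m, θ = 30°, and c = 340 m/s, this gives δ_θ ≈ 0.147 ms.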

Here, if the set of angles θ satisfying θ > θ_th is taken as the desired range of directions, whether sound is arriving from the desired range can be estimated by comparing the theoretical time difference δ_θth with the time difference δ(ω, τ) of the observed analog signals.
The mask coefficient b(ω, τ) for the masking that separates the target sound can therefore be expressed as in equation (5) below.
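Equation (5) is missing as well. The description states that the coefficient is 1 for sound estimated to be the target sound and M otherwise, and that the decision compares δ(ω, τ) with the threshold value, so a plausible reconstruction is the one below. The direction of the inequality depends on the sign convention for δ and is an assumption.

```latex
b(\omega,\tau) =
\begin{cases}
1, & \delta(\omega,\tau) > \delta_{\theta_{th}} \\
M, & \text{otherwise}
\end{cases}
```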

In other words, using the cross-correlation function of the first spectral component X_1(ω, τ) and the second spectral component X_2(ω, τ), the mask coefficient calculation unit 106 calculates a mask coefficient from a first time difference, between the time at which the target sound arrives at the first microphone 101 and the time at which it arrives at the second microphone 102, and a second time difference, between the time at which the disturbing sound arrives at the first microphone 101 and the time at which it arrives at the second microphone 102. The mask coefficient distinguishes, among the observed sounds, sound arriving from a first range including the first direction, from which the target sound arrives, from sound arriving from a second range including the second direction, from which the disturbing sound arrives, and is used to separate the spectral components of sound arriving from directions in the first range from the spectral components of sound arriving from directions in the second range.

The mask coefficient b(ω, τ) given by equation (5) is 1 when the sound is estimated to be the target sound and M when it is estimated to be the disturbing sound. When M = 0, the mask coefficient takes one of the two values 1 or 0, and a filter with such mask coefficients is called a binary mask. Fractional values other than these two may also be used as filter coefficients, in which case the filter is also called a soft mask. However, the filter coefficients for both the target sound and the disturbing sound are values less than 1. In this embodiment, M = 0.5 is used as an example.
The mask coefficient calculation unit 106 gives the mask coefficient b(ω, τ) to the mask correction unit 109.
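The chain from cross spectrum to mask coefficient can be sketched per frequency bin as follows. This is an illustration under stated assumptions, not the patented implementation: the cross spectrum is taken as X_1 times the conjugate of X_2, a larger delay is taken to indicate the target side, the speed of sound is set to 340 m/s, and `omegas` is a hypothetical list holding the non-zero angular frequency of each bin. `cmath.phase` computes the four-quadrant arctangent of Q and K.

```python
import cmath
import math
from typing import List, Sequence

SPEED_OF_SOUND_M_S = 340.0  # assumed value of c


def mask_coefficients(x1: Sequence[complex], x2: Sequence[complex],
                      omegas: Sequence[float], mic_distance_m: float,
                      theta_th_rad: float, m: float = 0.5) -> List[float]:
    """Per-bin mask b: 1 where the delay indicates the target side, else m.

    omegas holds the angular frequency (rad/s) of each bin; entries must be non-zero.
    Assumptions: D = X1 * conj(X2), and a larger delay means the target side.
    """
    delta_th = mic_distance_m * math.sin(theta_th_rad) / SPEED_OF_SOUND_M_S
    mask = []
    for c1, c2, omega in zip(x1, x2, omegas):
        cross = c1 * c2.conjugate()      # cross spectrum D
        phase = cmath.phase(cross)       # phase of D, atan2(Q, K)
        delay = phase / omega            # time difference in seconds
        mask.append(1.0 if delay > delta_th else m)
    return mask
```

The sketch ignores phase wrapping, so it is only meaningful for bins whose true phase difference stays within ±π.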

The utterance amount ratio calculation unit 107 receives the first spectral component X_1(ω, τ) of the first channel Ch1, the second spectral component X_2(ω, τ) of the second channel Ch2, and the cross spectrum D(ω, τ), and calculates the utterance amount ratio, which is the ratio of the utterance amount of the target speaker to the utterance amount of the disturbing speaker. In other words, the utterance amount ratio is the ratio, within the first spectral component X_1(ω, τ), of the amount of spectral components of sound arriving from the first range, which includes the first direction from which the target sound arrives, to the amount of spectral components of sound arriving from the second range, which includes the second direction from which the disturbing sound arrives.

First, the utterance amount ratio calculation unit 107 obtains the first power spectrum P_1(ω, τ) of the first channel Ch1 from the first spectral component X_1(ω, τ) of the first channel Ch1 using equation (6) below.
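Equation (6) is missing from this text. With X_Re and X_Im the real and imaginary parts of X_1(ω, τ), the power spectrum would be expected to be the squared magnitude, as reconstructed below.

```latex
P_1(\omega,\tau) = X_{Re}^{2}(\omega,\tau) + X_{Im}^{2}(\omega,\tau)
```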

Here, X_Re is the real part of the first spectral component X_1(ω, τ), and X_Im is its imaginary part.

Next, the utterance amount ratio calculation unit 107 uses the sign of the imaginary part Q(ω, τ) of the cross spectrum D(ω, τ) given by equation (1) above to determine whether the observed analog signal of the sound in question is arriving from the target sound side or from the disturbing sound side. Then, as shown in equation (7) below, the utterance amount ratio calculation unit 107 sums the first power spectrum P_1(ω, τ) of the first channel Ch1 according to the result of the sign determination, obtaining the utterance amount s_Tgt(τ) of the target speaker and the utterance amount s_Int(τ) of the disturbing speaker.

Here, N is the total number of discrete frequency spectra, for example, N = 256.
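Equation (7) is not reproduced in this text, but the summation it describes can be sketched as follows. Which sign of Q corresponds to the target side is not recoverable from the text; the sketch assumes Q > 0 means the target side. The function name is hypothetical.

```python
from typing import Sequence, Tuple


def utterance_amounts(x1: Sequence[complex],
                      cross: Sequence[complex]) -> Tuple[float, float]:
    """Return (s_tgt, s_int) for one frame.

    x1: first-channel spectrum; cross: cross spectrum D, same length.
    Assumption: Q > 0 marks the target side, everything else the disturbing side.
    """
    s_tgt = 0.0
    s_int = 0.0
    for x, d in zip(x1, cross):
        power = x.real ** 2 + x.imag ** 2   # power spectrum P_1 of this bin
        if d.imag > 0:
            s_tgt += power
        else:
            s_int += power
    return s_tgt, s_int
```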

The utterance amount ratio calculation unit 107 then obtains the utterance amount ratio SR(τ) from the two utterance amounts s_Tgt(τ) and s_Int(τ) by equation (8) below.

FIGS. 4(A) to 4(C) are graphs for explaining the utterance amount ratio SR(τ) when the target speaker and the disturbing speaker speak.
FIG. 4(A) is a graph showing an example of the time waveform of the observed analog signal acquired by the first microphone 101.
FIG. 4(B) is a graph showing an example of the variation over time of the utterance amounts of the target speaker and the disturbing speaker.
FIG. 4(C) is a graph showing an example of the variation over time of the utterance amount ratio SR(τ) obtained from the utterance amount of the target speaker and the utterance amount of the disturbing speaker.

As shown in FIG. 4(C), a frame satisfying SR(τ) < 0.3 is likely to contain only the disturbing sound, whereas a frame satisfying SR(τ) > 0.5 is likely to contain only the target sound.
When 0.3 ≤ SR(τ) ≤ 0.5, both the target sound and the disturbing sound can be regarded as present.

Therefore, by using the utterance amount ratio SR(τ) obtained by equation (8) above to control the masking strength according to the state of the observed analog signal, the target sound can be separated with high separation accuracy and little distortion. More specifically, for example, in frames where SR(τ) is small, the numerical value of the masking filter coefficient is increased so that the disturbing sound is strongly suppressed and separation performance is improved, and in frames where SR(τ) is large, the numerical value of the masking filter coefficient is decreased so that distortion of the target sound is reduced.

Returning to FIG. 2, the gain calculation unit 108 uses the utterance amount ratio SR(τ) obtained by equation (8) above to calculate, by equation (9) below, a correction gain g(ω, τ) that corrects the constant M in the mask coefficient b(ω, τ) of equation (5) above.

Here, G_Tgt, G_Int, and G_DT are predetermined correction gain constants: G_Tgt is the constant used when the observed analog signal is likely to contain only the target sound, G_Int is the constant used when it is likely to contain only the disturbing sound, and G_DT is the constant used when both the target sound and the disturbing sound are likely to be present. In this embodiment, G_Tgt = 1.5, G_DT = 0.99, and G_Int = 0.01 are a suitable example.

When the sound is likely to be the target sound, control is performed so that M in equation (5) above becomes larger, in other words, so that the amount of suppression by the mask becomes smaller. The corrected M is, however, limited to a value of 1 or less.
Conversely, when the sound is likely to be the disturbing sound, control is performed so that M in equation (5) above becomes still smaller, in other words, so that the amount of suppression of the disturbing sound becomes still larger.
That is, the gain calculation unit 108 calculates the correction gain for correcting the mask coefficient such that the higher the utterance amount ratio, the lower the intensity of the masking.

Calculating this correction gain requires only the utterance amount ratio, which is obtained from a simple power calculation on the observed analog signal, and conditional expressions that compare the utterance amount ratio, so the computational cost is low and the mask coefficient can be corrected efficiently.

K(ω) is a frequency correction coefficient expressed as a positive number of 1 or less and, as shown in equation (10) below, is set so that its value increases as the frequency increases.

Applying the frequency correction K(ω) relaxes the masking strength at high frequencies, so distortion of the target sound caused by masking can be suppressed.

The frequency correction coefficient of equation (10) is set so that its value increases as the frequency increases, but it is not limited to this example and can be changed as appropriate according to the characteristics of the observed analog signal. For example, when the acoustic signal subject to sound source separation is speech, the correction may weaken the suppression of the formants, which are the important frequency band components of speech, and strengthen the suppression of the other band components. This improves the accuracy of mask control for speech, so the target sound can be separated efficiently.
If the target of sound source separation is an abnormal sound from a machine, changing the frequency correction coefficient of equation (10) according to the frequency characteristics of that acoustic signal allows the abnormal sound to be separated efficiently.

A further effect of frequency-dependent correction is that, when environmental noise is mixed into the observed sound, masking has less effect on acoustic signals other than the target speech or abnormal sound (for example, noise or music). This reduces the unpleasant artificial noise (musical tones) produced by unnecessary masking of environmental noise, which as a secondary effect reduces malfunctions of the voice recognition device or abnormal sound monitoring device caused by such artificial noise and reduces unpleasant noise during hands-free calls.

The constant values of the correction gain and the constant thresholds for the utterance amount ratio SR(τ) described above are not limited to those of equation (9) and can be adjusted as appropriate according to the nature of the target sound or the disturbing sound. The conditions for determining the correction gain are also not limited to the three levels of equation (9) and may be set in more levels.
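Equation (9) is missing from this text, but the three-level selection it describes can be sketched from the thresholds and constants quoted in the description. How the frequency correction K(ω) enters equation (9) is not recoverable; the sketch assumes it is a simple multiplier, and all names are hypothetical.

```python
G_TGT = 1.5    # target sound only (example value from the description)
G_DT = 0.99    # both sounds present (example value from the description)
G_INT = 0.01   # disturbing sound only (example value from the description)
SR_LOW = 0.3   # below this, the frame is treated as disturbing sound only
SR_HIGH = 0.5  # above this, the frame is treated as target sound only


def correction_gain(sr: float, k_omega: float = 1.0) -> float:
    """Pick the correction gain for one frame from the utterance amount ratio.

    Assumption: the frequency correction K(omega) enters as a multiplier.
    """
    if sr > SR_HIGH:
        base = G_TGT
    elif sr < SR_LOW:
        base = G_INT
    else:
        base = G_DT
    return base * k_omega
```

Adding further levels, as the description allows, would amount to extending the chain of comparisons.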

As shown in equation (11) below, the mask correction unit 109 corrects the mask coefficient b(ω, τ) obtained by equation (5) above using the correction gain g(ω, τ) obtained by equation (9), and obtains the time-frequency filter coefficient b_mod(ω, τ).
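Equation (11) is not reproduced here. The description says that the gain corrects the constant M and that the corrected M is limited to 1 or less, so one reading is sketched below. Leaving target-sound entries at 1 and multiplying M by the gain are both assumptions, and the function name is hypothetical.

```python
from typing import List, Sequence


def correct_mask(mask: Sequence[float], gain: float, m: float = 0.5) -> List[float]:
    """Apply the correction gain to the disturbing-sound entries of a mask.

    Assumptions: entries equal to 1 (target sound) are left unchanged, and
    all other entries become gain * m, capped at 1.
    """
    corrected = []
    for b in mask:
        if b == 1.0:
            corrected.append(1.0)
        else:
            corrected.append(min(gain * m, 1.0))
    return corrected
```

With the example constants, M = 0.5 would become 0.75 in target-dominated frames and 0.005 in frames dominated by the disturbing sound.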

Returning to FIG. 1, as shown in equation (12) below, the masking filter unit 110 multiplies the first spectral component X_1(ω, τ) on the first microphone 101 side by the time-frequency filter coefficient b_mod(ω, τ) obtained by equation (11) above to calculate the spectral component Y(ω, τ). The masking filter unit 110 then sends the calculated spectral component Y(ω, τ) to the T/F inverse conversion unit 111. The spectral component Y(ω, τ) separated here is also referred to as the target spectral component. The target spectral component is a spectral component containing the target sound.
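Equation (12) is missing from this text, but the paragraph above states it in words: a per-bin multiplication of the first spectral component by the filter coefficient. Written out, that is:

```latex
Y(\omega,\tau) = b_{mod}(\omega,\tau)\,X_1(\omega,\tau)
```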

The T/F inverse conversion unit 111 applies, for example, an inverse fast Fourier transform to the spectral component Y(ω, τ) to calculate the output digital signal y(t). The T/F inverse conversion unit 111 gives the calculated output digital signal y(t) to the D/A conversion unit 112.

The D/A conversion unit 112 generates an output signal by converting the output digital signal y(t) into an analog signal. The generated output signal is output to an external device such as a voice recognition device, a hands-free call device, or an abnormal sound monitoring device.

FIGS. 5(A) and 5(B) are graphs for explaining the effect of Embodiment 1.
FIG. 5(A), like FIG. 4(A), is a graph showing an example of the time waveform of the observed analog signal acquired by the first microphone 101.
FIG. 5(B) is a graph showing an example of the variation over time of the output signal output from the D/A conversion unit 112.
As is clear from FIGS. 5(A) and 5(B), most of the disturbing sound has been removed from the output signal, and only the target sound has been separated.

The hardware configuration of the sound source separation device 100 described above can be realized by a computer with a built-in CPU (Central Processing Unit), such as a tablet-type portable computer or a microcomputer for embedded use in equipment such as a car navigation system. Alternatively, it may be realized by an LSI (Large Scale Integrated circuit) such as a DSP (Digital Signal Processor), an ASIC (Application Specific Integrated Circuit), or an FPGA (Field-Programmable Gate Array).

FIG. 6 is a block diagram showing an example hardware configuration of the sound source separation device 100 configured using an LSI such as a DSP, ASIC, or FPGA.

In the example of FIG. 6, the sound source separation device 100 consists of a signal input/output unit 131, a signal processing circuit 132, a recording medium 133, and a signal path 134 such as a bus.
The signal input/output unit 131 is an interface circuit that provides the connection to a microphone circuit 140 and an external device 141. The microphone circuit 140 corresponds to the first microphone 101 and the second microphone 102; for example, a device that captures acoustic vibration and converts it into an electric signal can be used.

The functions of the T/F conversion unit 104, the mask generation unit 105, the masking filter unit 110, and the T/F inverse conversion unit 111 shown in FIG. 1 can be realized by the signal processing circuit 132 and the recording medium 133.
The A/D conversion unit 103 and the D/A conversion unit 112 of FIG. 1 can be realized by the signal input/output unit 131.

The recording medium 133 is used to store various data such as the setting data and signal data of the signal processing circuit 132. As the recording medium 133, for example, a volatile memory such as an SDRAM (Synchronous Dynamic Random Access Memory) or a non-volatile memory such as an HDD (Hard Disk Drive) or an SSD (Solid State Drive) can be used. The recording medium 133 can store the initial state of the sound source separation processing, various setting data, constant data for control, and the like.

The output digital signal produced by the sound source separation processing in the signal processing circuit 132 is sent from the signal input/output unit 131 to the external device 141. The external device 141 is, for example, a voice recognition device, a hands-free communication device, or an abnormal sound monitoring device.

FIG. 7 is a block diagram showing a hardware configuration example of the sound source separation device 100 configured using an arithmetic unit such as a computer.
In the example of FIG. 7, the sound source separation device 100 is composed of a signal input/output unit 131, a processor 136 incorporating a CPU 135, a memory 137, a recording medium 138, and a signal path 134 such as a bus.
The signal input/output unit 131 is an interface circuit that provides the connection to the microphone circuit 140 and the external device 141.

The memory 137 is storage means such as a ROM (Read Only Memory) and a RAM (Random Access Memory). It is used as a program memory that stores the various programs for realizing the sound source separation processing, as a work memory that the processor 136 uses for data processing, as a memory into which signal data is loaded, and the like.

The functions of the T/F conversion unit 104, the mask generation unit 105, the masking filter unit 110, and the T/F inverse conversion unit 111 can be realized by the processor 136, the memory 137, and the recording medium 138.
The A/D conversion unit 103 and the D/A conversion unit 112 can be realized by the signal input/output unit 131.

The recording medium 138 is used to store various data such as the setting data and signal data of the processor 136. As the recording medium 138, for example, a volatile memory such as an SDRAM or a non-volatile memory such as an HDD or an SSD can be used. It can store programs including an OS (Operating System), various setting data, and various data such as acoustic signal data. The data in the memory 137 can also be stored in the recording medium 138.

The processor 136 uses the memory 137 as a working memory and operates according to a computer program read from the memory 137, and can thereby function as the T/F conversion unit 104, the mask generation unit 105, the masking filter unit 110, and the T/F inverse conversion unit 111.

The output signal generated by the sound source separation processing in the processor 136 is sent from the signal input/output unit 131 to the external device 141. The external device 141 is, for example, a voice recognition device, a hands-free communication device, or an abnormal sound monitoring device.

The program executed by the processor 136 may be stored in a storage device inside the computer that executes the software program, or may be distributed on a storage medium such as a CD-ROM. The program can also be acquired from another computer through a wireless or wired network such as a LAN (Local Area Network). Such a program may be provided, for example, as a program product.

Further, the microphone circuit 140 and the external device 141 may also transmit and receive various data as digital signals over a wireless or wired network, without conversion between analog and digital signals.

The program executed by the processor 136 may be combined in software with a program executed by the external device 141, for example a program that causes a computer to function as a voice recognition device, a hands-free communication device, or an abnormal sound monitoring device. The combined programs may run on the same computer or be distributed across a plurality of computers.

The external device 141 may include the sound source separation device 100. That is, a voice recognition device, a hands-free communication device, or an abnormal sound monitoring device may be configured so as to include the sound source separation device 100.

Next, the operation of the sound source separation device 100 according to the first embodiment will be described.
FIG. 8 is a flowchart showing the operation of the sound source separation device 100.
First, the A/D conversion unit 103 captures the first observation analog signal and the second observation analog signal, input from the first microphone 101 and the second microphone 102 respectively, at a predetermined frame interval. It A/D-converts each of them to generate a first observation digital signal x_1(t) and a second observation digital signal x_2(t), and gives them to the T/F conversion unit 104 (S10).
The output from the A/D conversion unit 103 is repeated while the sample number t is smaller than a predetermined value T (No in S11).

In step S12, the T/F conversion unit 104 performs, for example, a 512-point fast Fourier transform on each of the first observation digital signal x_1(t) and the second observation digital signal x_2(t) to calculate a first spectral component X_1(ω, τ) and a second spectral component X_2(ω, τ). The T/F conversion unit 104 then gives the first spectral component X_1(ω, τ) and the second spectral component X_2(ω, τ) to the mask generation unit 105, and gives the first spectral component X_1(ω, τ) to the masking filter unit 110.
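
Step S12 can be illustrated with a minimal sketch of a 512-point transform of one frame. The text specifies only the FFT size; the Hann window, the function names `fft` and `spectrum`, and the use of pure Python rather than an optimized FFT library are assumptions made for illustration.

```python
import cmath
import math

FFT_SIZE = 512  # transform length given in the text


def fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        tw = cmath.exp(-2j * math.pi * k / n) * odd[k]
        out[k] = even[k] + tw
        out[k + n // 2] = even[k] - tw
    return out


def spectrum(frame):
    """Return the spectral component X(omega, tau) of one 512-sample frame."""
    assert len(frame) == FFT_SIZE
    # Periodic Hann window (an assumption; the text does not name a window).
    window = [0.5 - 0.5 * math.cos(2 * math.pi * i / FFT_SIZE)
              for i in range(FFT_SIZE)]
    return fft([s * w for s, w in zip(frame, window)])
```

Calling `spectrum` once per channel and per frame would yield X_1(ω, τ) and X_2(ω, τ) for the later steps.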

The mask generation unit 105 calculates, from the first spectral component X_1(ω, τ) and the second spectral component X_2(ω, τ), a time-frequency filter coefficient b_mod(ω, τ) that performs the masking for separating the target sound (S13). The detailed processing of step S13 is described below as steps S13A to S13D.

In step S13A, the mask coefficient calculation unit 106 calculates a cross spectrum D(ω, τ) from the cross-correlation function of the first spectral component X_1(ω, τ) and the second spectral component X_2(ω, τ), and calculates a mask coefficient b(ω, τ) based on the obtained cross spectrum D(ω, τ). The mask coefficient calculation unit 106 gives the cross spectrum D(ω, τ) to the utterance amount ratio calculation unit 107, and gives the mask coefficient b(ω, τ) to the mask correction unit 109. The process then proceeds to step S13B.
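
As a rough illustration of step S13A, the sketch below forms a cross spectrum as X_1 multiplied by the complex conjugate of X_2 and derives a binary mask from the sign of its imaginary part. This is a simplified stand-in, not the document's mask formula: the conjugate convention, the binary rule, the choice of which sign corresponds to the target side, and the names `cross_spectrum` and `sign_mask` are all assumptions.

```python
def cross_spectrum(x1, x2):
    """D(omega, tau) = X1 * conj(X2), computed bin by bin (assumed convention)."""
    return [a * b.conjugate() for a, b in zip(x1, x2)]


def sign_mask(d, target_sign=1):
    """Binary mask: keep a bin when the imaginary part of D has the target's sign.

    target_sign is hypothetical; which sign means "target side" depends on
    the microphone geometry.
    """
    return [1.0 if q.imag * target_sign > 0 else 0.0 for q in d]
```

A binary mask was chosen here only because it is the simplest rule that uses the sign of the imaginary part; a soft mask would follow the same data flow.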

In step S13B, the utterance amount ratio calculation unit 107 calculates, from the first spectral component X_1(ω, τ), the second spectral component X_2(ω, τ), and the cross spectrum D(ω, τ), the utterance amount ratio SR(τ), which is the ratio between the utterance amount of the target sound speaker and the utterance amount of the disturbing sound speaker. The utterance amount ratio calculation unit 107 gives the utterance amount ratio SR(τ) to the gain calculation unit 108. The process then proceeds to step S13C.

In step S13C, the gain calculation unit 108 uses the utterance amount ratio SR(τ) to calculate a correction gain g(ω, τ) for correcting the mask coefficient b(ω, τ). The gain calculation unit 108 gives the correction gain g(ω, τ) to the mask correction unit 109. The process then proceeds to step S13D.

In step S13D, the mask correction unit 109 corrects the mask coefficient b(ω, τ) using the correction gain g(ω, τ) to obtain the time-frequency filter coefficient b_mod(ω, τ). The mask correction unit 109 then gives the time-frequency filter coefficient b_mod(ω, τ) to the masking filter unit 110.

The masking filter unit 110 multiplies the first spectral component X_1(ω, τ) by the time-frequency filter coefficient b_mod(ω, τ) to calculate the spectral component Y(ω, τ) of the output digital signal y(t) (S14). The masking filter unit 110 then gives the spectral component Y(ω, τ) to the T/F inverse conversion unit 111.

The T/F inverse conversion unit 111 converts the spectral component Y(ω, τ) into the time-domain output digital signal y(t) by performing an inverse fast Fourier transform on it (S15).
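
Steps S14 and S15 amount to a bin-by-bin multiplication followed by an inverse transform. The sketch below shows that data flow under stated assumptions: a naive inverse DFT stands in for the inverse FFT to keep the example short, and the names `apply_mask`, `idft`, and `separate_frame` are hypothetical.

```python
import cmath


def apply_mask(x1, b_mod):
    """S14: Y(omega, tau) = b_mod(omega, tau) * X1(omega, tau), bin by bin."""
    return [b * x for b, x in zip(b_mod, x1)]


def idft(spec):
    """Naive O(N^2) inverse DFT; a real system would use an inverse FFT."""
    n = len(spec)
    return [
        sum(spec[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
        for t in range(n)
    ]


def separate_frame(x1, b_mod):
    """S14 then S15: mask the spectrum and return real time-domain samples."""
    y_spec = apply_mask(x1, b_mod)
    return [v.real for v in idft(y_spec)]
```

With all coefficients equal to 1 the frame passes through unchanged, and coefficients near 0 suppress the corresponding bins.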

The D/A conversion unit 112 D/A-converts the output digital signal y(t) into an analog output signal and outputs it to the outside (S16).
The output from the D/A conversion unit 112 is repeated while the sample number t is smaller than the predetermined value T (Yes in S17).

Next, when the sound source separation processing is to be continued (Yes in S18), the process returns to step S10. When it is not to be continued (No in S18), the sound source separation processing ends.

As described above, the sound source separation device 100 of the first embodiment can create a masking filter with high separation performance at a low calculation cost. The target sound can therefore be acquired accurately, which makes it possible to provide a high-precision voice recognition device, a high-quality hands-free communication device, and an abnormal sound monitoring device with high detection accuracy.

Embodiment 2.
The first embodiment illustrated a configuration for speech. An embodiment that can also be applied when noise other than disturbing speech is present will be described as the second embodiment.

FIG. 9 is a block diagram schematically showing the configuration of an information processing system 250 including a sound source separation device 200 according to the second embodiment. The information processing system 250 shown here is an example of a car navigation system, and shows a case where a speaker seated in the driver's seat and a speaker seated in the passenger seat speak in a moving automobile. In the second embodiment, the speaker seated in the driver's seat is treated as the target sound speaker and the speaker seated in the passenger seat as the disturbing sound speaker.

As shown in FIG. 9, the information processing system 250 includes a first microphone 101, a second microphone 102, the sound source separation device 200, and an external device 141.
The first microphone 101 and the second microphone 102 in the second embodiment are the same as the first microphone 101 and the second microphone 102 in the first embodiment. The external device 141 is the same as the external device 141 described with reference to FIG. 6 or FIG. 7.

The inputs in the second embodiment are, in addition to the voices of the target sound speaker and the disturbing sound speaker captured through the first microphone 101 and the second microphone 102, noise such as automobile running noise, and acoustic echo that arises when the received voice of a far-end speaker emitted from a loudspeaker during a hands-free call, guidance voice emitted by the car navigation system, car audio music, or the like wraps around into the microphones. Sounds other than the voices of the target sound speaker and the disturbing sound speaker are regarded as noise, and the signal of the noise is referred to as a noise signal. In the second embodiment, the influence of noise is excluded by calculating the utterance amount ratio while excluding the spectral components of sound arriving from directions that fall in neither a first range including the first direction, from which the target sound arrives, nor a second range including the second direction, from which the disturbing sound arrives.

As described above, the external device 141 is, for example, a voice recognition device, a hands-free communication device, or an abnormal sound monitoring device. The external device 141 performs, for example, voice recognition processing, hands-free call processing, or abnormal sound detection processing, and obtains an output result corresponding to that processing.

The sound source separation device 200 includes an A/D conversion unit 103, a T/F conversion unit 104, a mask generation unit 205, a masking filter unit 110, and a T/F inverse conversion unit 111.
The A/D conversion unit 103, the T/F conversion unit 104, the masking filter unit 110, and the T/F inverse conversion unit 111 of the sound source separation device 200 according to the second embodiment are the same as the A/D conversion unit 103, the T/F conversion unit 104, the masking filter unit 110, and the T/F inverse conversion unit 111 of the sound source separation device 100 of the first embodiment.
However, in the sound source separation device 200 according to the second embodiment, the output digital signal y(t) generated by the T/F inverse conversion unit 111 is given to the external device 141.

As shown in FIG. 2, the mask generation unit 205 includes a mask coefficient calculation unit 106, an utterance amount ratio calculation unit 207, a gain calculation unit 108, and a mask correction unit 109.
The mask coefficient calculation unit 106, the gain calculation unit 108, and the mask correction unit 109 of the mask generation unit 205 in the second embodiment are the same as the mask coefficient calculation unit 106, the gain calculation unit 108, and the mask correction unit 109 of the mask generation unit 105 in the first embodiment.

The utterance amount ratio calculation unit 207 excludes the disturbing sound signal from the calculation of the utterance amount ratio SR(τ) by using equation (13), which is a modification of equation (7) described in the first embodiment.
In the first embodiment, the arrival direction of the target sound was determined from the sign of the imaginary part Q(ω, τ) of the cross spectrum D(ω, τ) of equation (1). As in equation (13), by additionally using in the conditional expression the time difference δ(ω, τ) between the first channel Ch1 and the second channel Ch2, which represents the angle of the arrival direction, the influence of noise other than the target sound speaker and the disturbing sound speaker can be excluded from the calculation of the utterance amount.

Here, δ_θDT and δ_θDN are thresholds on the time difference of the observed analog signals used to exclude components from the calculation of the utterance amount. Each is a predetermined constant obtained by converting an arrival direction angle into a time difference.
δ_θDT is a threshold for excluding from the utterance amount calculation the cases where the arrival time difference of the observed analog signals is so small that it is difficult to tell whether the arrival direction is the target sound direction or the disturbing sound direction, or where noise is arriving from the front.
δ_θDN is a threshold for excluding from the utterance amount calculation the cases where the sound is likely to deviate from the assumed arrival directions of the target sound and the disturbing sound, in other words, where the observed analog signal is likely to be directional noise such as wind noise entering through a window, or music emitted from a loudspeaker.
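
The exclusion logic described for the two thresholds can be sketched as follows. Equation (13) itself is not reproduced in this text, so this is only one plausible reading: the time difference is estimated from the phase of the cross spectrum divided by the angular frequency, bins whose absolute time difference is below δ_θDT or above δ_θDN are skipped, and the remaining power of the first channel is accumulated per side according to the sign of the imaginary part. The function name, the sign convention, and the use of |X_1|² as the accumulated quantity are assumptions.

```python
import cmath
import math


def utterance_volumes(x1, x2, omegas, delta_dt, delta_dn):
    """Accumulate |X1|^2 per side, skipping bins inside the exclusion ranges.

    omegas holds the angular frequency (rad/s) of each bin; delta_dt and
    delta_dn are the time-difference thresholds in seconds.
    """
    target = disturb = 0.0
    for a, b, w in zip(x1, x2, omegas):
        if w == 0.0:
            continue  # the DC bin carries no phase-delay information
        d = a * b.conjugate()        # cross spectrum D(omega, tau)
        delta = cmath.phase(d) / w   # inter-channel time difference
        if abs(delta) < delta_dt or abs(delta) > delta_dn:
            continue  # near-frontal arrival, or outside the expected directions
        if d.imag > 0:               # assumed: positive sign = target side
            target += abs(a) ** 2
        else:
            disturb += abs(a) ** 2
    return target, disturb
```

The two returned sums could then be combined into the utterance amount ratio in whatever form the earlier equations define.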

FIG. 10 is a schematic diagram showing an example of the method in equation (13) for excluding the influence of noise other than the target sound and the disturbing sound.
In the example of FIG. 10, the exclusion ranges are drawn with the first channel Ch1 as the reference.
As shown in FIG. 10, setting exclusion ranges in the calculation of the utterance amount removes the influence of noise other than the target sound and the disturbing sound. This improves the accuracy of the utterance amount ratio and makes it possible to configure a sound source separation device of even higher quality.

Because the sound source separation device 200 according to the second embodiment is configured as described above, it can create a masking filter with high separation performance at a low calculation cost even under various noise conditions. The target sound can therefore be acquired accurately even under the noise inside an automobile, which makes it possible to provide a high-precision voice recognition device, a high-quality hands-free communication device, or an abnormal sound monitoring device that detects abnormal sounds inside an automobile.

Embodiment 3.
In the first and second embodiments, only the information of the current frame is used to calculate the utterance amount ratio. The embodiments are not limited to this example, and the ratio can also be calculated using information from past frames.

As shown in FIG. 1, the sound source separation device 300 according to the third embodiment includes an A/D conversion unit 103, a T/F conversion unit 104, a mask generation unit 305, a masking filter unit 110, a T/F inverse conversion unit 111, and a D/A conversion unit 112.
The A/D conversion unit 103, the T/F conversion unit 104, the masking filter unit 110, the T/F inverse conversion unit 111, and the D/A conversion unit 112 of the sound source separation device 300 according to the third embodiment are the same as the A/D conversion unit 103, the T/F conversion unit 104, the masking filter unit 110, the T/F inverse conversion unit 111, and the D/A conversion unit 112 of the sound source separation device 100 according to the first embodiment.

As shown in FIG. 2, the mask generation unit 305 in the third embodiment includes a mask coefficient calculation unit 106, an utterance amount ratio calculation unit 307, a gain calculation unit 108, and a mask correction unit 109.
The mask coefficient calculation unit 106, the gain calculation unit 108, and the mask correction unit 109 of the mask generation unit 305 in the third embodiment are the same as the mask coefficient calculation unit 106, the gain calculation unit 108, and the mask correction unit 109 of the mask generation unit 105 in the first embodiment.

The utterance amount ratio calculation unit 307 calculates the utterance amount ratio SR(τ) using equation (8) above, and then smooths the calculated SR(τ) with the utterance amount ratio SR(τ−1) of one frame earlier, using equation (14) below.

Here, α is a smoothing coefficient; in the third embodiment, α = 0.9 is a suitable example.
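
The smoothing step can be sketched as a first-order recursive average. Equation (14) is not reproduced in this text, so the form below, in which α weights the previous smoothed value and (1 − α) weights the newly calculated value, is an assumption, as are the function names.

```python
def smooth_utterance_ratio(sr_current, sr_previous, alpha=0.9):
    """Blend the new ratio with the previous frame's smoothed ratio.

    Assumed form: SR(tau) <- alpha * SR(tau - 1) + (1 - alpha) * SR(tau).
    """
    return alpha * sr_previous + (1.0 - alpha) * sr_current


def smooth_sequence(raw_ratios, alpha=0.9):
    """Apply the smoothing frame by frame; the first frame is used as-is."""
    smoothed = []
    prev = None
    for sr in raw_ratios:
        prev = sr if prev is None else smooth_utterance_ratio(sr, prev, alpha)
        smoothed.append(prev)
    return smoothed
```

With α = 0.9 a sudden change in the raw ratio moves the smoothed value by only a tenth of the change per frame, which is what makes the estimate robust to short noise bursts.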

By smoothing the most recently calculated utterance amount ratio with the previously calculated ratio in this way, the utterance amount ratio can be calculated stably even when noise is mixed into the observed analog signals, enabling even more accurate sound source separation.

Further, in the second embodiment, the utterance amount ratio calculation unit 207 calculates the utterance amount of each signal using equation (13). As a modification, the utterance amount ratio calculation unit 207 can extend this calculation to a predetermined frame interval, in other words calculate the integrated value of the power spectrum over a predetermined frame interval, and thereby analyze the occupancy of the target sound and the disturbing sound in that interval, specifically which speaker has been speaking longer or which is louder. This makes it possible to determine which voice is dominant during double talk between the target sound and the disturbing sound, enabling more accurate sound source separation.
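
The modification that integrates power over a frame interval can be sketched with a sliding window. The class name `DominanceTracker`, the window length parameter, and the simple comparison of the two sums are hypothetical choices for illustration; the text does not specify how the integrated values are compared.

```python
from collections import deque


class DominanceTracker:
    """Keep per-frame power sums for the last num_frames frames."""

    def __init__(self, num_frames):
        # deque(maxlen=...) drops the oldest frame automatically.
        self.target = deque(maxlen=num_frames)
        self.disturb = deque(maxlen=num_frames)

    def update(self, target_power, disturb_power):
        """Add one frame's powers and return the integrated values."""
        self.target.append(target_power)
        self.disturb.append(disturb_power)
        return sum(self.target), sum(self.disturb)

    def dominant(self):
        """Report which side holds more integrated power in the window."""
        t, d = sum(self.target), sum(self.disturb)
        if t == d:
            return "tie"
        return "target" if t > d else "disturbing"
```

Feeding the per-frame target and disturbing powers into `update` gives a running view of which speaker dominates during double talk.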

The second embodiment was described for the case where the information processing system 250 is a car navigation system, but the second embodiment is not limited to this. For example, the information processing system 250 is also applicable to a remote voice recognition system such as a smart speaker or a television installed in a home or an office, a loudspeaker call system of a video conference system, a voice recognition dialogue system of a robot, or an abnormal sound monitoring system in a factory. In these cases as well, the effects described in the second embodiment are obtained for the noise or acoustic echo that arises in those acoustic environments.

In the first to third embodiments described above, the frequency bandwidth of the input signal is 16 kHz, but the embodiments are not limited to this example. For example, they are also applicable to wider-band acoustic signals such as 24 kHz.

In addition to the above, any component of the first to third embodiments may be modified or omitted.

As described above, the sound source separation devices 100 to 300 according to the first to third embodiments can separate sound sources with high quality at a low calculation cost, and can therefore be introduced into a voice recognition system, a voice communication system, or an abnormal sound monitoring system. This can improve the recognition rate of remote voice recognition systems such as car navigation systems or televisions, and improve the quality of hands-free call systems such as mobile phones or intercoms, video conference systems, and abnormal sound monitoring systems.

100, 200, 300: sound source separation device; 101: first microphone; 102: second microphone; 103: A/D conversion unit; 104: T/F conversion unit; 105, 205, 305: mask generation unit; 106: mask coefficient calculation unit; 107, 207, 307: utterance amount ratio calculation unit; 108: gain calculation unit; 109: mask correction unit; 110: masking filter unit; 111: T/F inverse conversion unit; 112: D/A conversion unit; 250: information processing system.

Claims

An analog/digital conversion unit that receives a first observation analog signal, generated by a first microphone based on an observed sound including a target sound arriving from a first direction and a disturbing sound arriving from a second direction different from the first direction, and a second observation analog signal, generated by a second microphone based on the observed sound, and that generates a first observation digital signal and a second observation digital signal by converting each of the first observation analog signal and the second observation analog signal into a digital signal,
A time / frequency conversion unit that generates a first spectral component and a second spectral component by converting each of the first observed digital signal and the second observed digital signal into a signal in the frequency domain.
A mask generation unit that uses the cross-correlation function of the first spectral component and the second spectral component to calculate, from the time difference between the time when the observed sound arrives at the first microphone and the time when the observed sound arrives at the second microphone, a filtering coefficient for masking the spectral components of sound arriving from a direction different from the first direction,
A masking filter unit that separates spectral components by masking the first spectral component using the filtering coefficient.
A time / frequency inverse conversion unit that generates an output digital signal by converting the separated spectral components into a signal in the time domain is provided .
The mask generator
Using the intercorrelation function of the first spectral component and the second spectral component, the first of the time when the target sound arrives at the first microphone and the time when the target sound arrives at the second microphone. From the time difference between the above and the second time difference between the time when the disturbing sound arrives at the first microphone and the time when the disturbing sound arrives at the second microphone, the first direction of the observed sound is included. A sound arriving from the first range is distinguished from a sound arriving from the second range including the second direction and not overlapping the first range. A mask coefficient calculation unit that calculates a mask coefficient for separating the spectrum component from the spectrum component of the sound coming from the second range, and a mask coefficient calculation unit.
With the utterance amount ratio calculation unit that calculates the ratio of the amount of the spectral component of the sound coming from the first range to the amount of the spectral component of the sound coming from the second range among the first spectral components. ,
A gain calculation unit that calculates a correction gain for correcting the mask coefficient so that the higher the ratio, the lower the strength at which the masking is performed.
An information processing apparatus including a mask correction unit that calculates the filtering coefficient by correcting the mask coefficient with the correction gain.
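
The mask generation recited above can be illustrated with a minimal per-frame sketch. It is not the formulation of the embodiments: the use of the cross-spectrum phase as the frequency-domain counterpart of the cross-correlation function, the tolerance `tol` that defines the two ranges, the attenuation `floor`, the power-based ratio, and the gain mapping `ratio / (1 + ratio)` are all assumptions chosen for illustration, as are the function names. The phase-based delay is only meaningful while the inter-microphone phase difference stays within ±π.

```python
import numpy as np

EPS = 1e-12  # guards against division by zero (assumption)


def per_bin_delay(X1, X2, freqs):
    """Estimate the inter-microphone time difference of each frequency bin
    from the phase of the cross-spectrum X1 * conj(X2)."""
    phase = np.angle(X1 * np.conj(X2))
    delay = np.zeros(len(freqs))
    nz = freqs > 0  # the DC bin carries no phase information
    delay[nz] = phase[nz] / (2.0 * np.pi * freqs[nz])
    return delay


def filtering_coefficients(X1, X2, freqs, tau_target, tau_interf, tol, floor=0.1):
    """Return (filtering coefficients, ratio, correction gain) for one frame.

    tau_target / tau_interf are the first and second time differences;
    tol is a hypothetical half-width that defines the first and second ranges.
    """
    delay = per_bin_delay(X1, X2, freqs)
    in_first = np.abs(delay - tau_target) <= tol   # first range (target side)
    in_second = np.abs(delay - tau_interf) <= tol  # second range (disturbing side)

    # Mask coefficient: pass the first range, attenuate everything else.
    mask = np.where(in_first, 1.0, floor)

    # Ratio of first-range power to second-range power in the first spectrum.
    power = np.abs(X1) ** 2
    ratio = power[in_first].sum() / (power[in_second].sum() + EPS)

    # Correction gain in [0, 1): a higher ratio gives a gain closer to 1.
    gain = ratio / (1.0 + ratio)

    # Corrected mask: as the gain approaches 1, attenuated bins move toward 1,
    # so the masking becomes weaker.
    coeff = mask + gain * (1.0 - mask)
    return coeff, ratio, gain
```

With this mapping, a frame dominated by the target sound is masked only lightly, which is the monotonic behaviour the gain calculation unit is required to have; any other decreasing relation between ratio and masking strength would satisfy the same wording.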

An information processing apparatus comprising:
an analog/digital conversion unit that receives a first observation analog signal generated by a first microphone based on an observed sound, the observed sound including a target sound arriving from a first direction and a disturbing sound arriving from a second direction different from the first direction, and a second observation analog signal generated by a second microphone based on the observed sound, and that generates a first observation digital signal and a second observation digital signal by converting each of the first observation analog signal and the second observation analog signal into a digital signal;
a time/frequency conversion unit that generates a first spectral component and a second spectral component by converting each of the first observation digital signal and the second observation digital signal into a signal in the frequency domain;
a mask generation unit that, using a cross-correlation function of the first spectral component and the second spectral component, calculates, from the time difference between the time when the observed sound arrives at the first microphone and the time when the observed sound arrives at the second microphone, a filtering coefficient for masking spectral components of sound arriving from a direction different from the first direction;
a masking filter unit that separates a spectral component by masking the first spectral component using the filtering coefficient; and
a time/frequency inverse conversion unit that generates an output digital signal by converting the separated spectral component into a signal in the time domain,
wherein the mask generation unit includes:
a mask coefficient calculation unit that, using the cross-correlation function of the first spectral component and the second spectral component, distinguishes, within the observed sound, sound arriving from a first range including the first direction from sound arriving from a second range that includes the second direction and does not overlap the first range, based on a first time difference between the time when the target sound arrives at the first microphone and the time when the target sound arrives at the second microphone and a second time difference between the time when the disturbing sound arrives at the first microphone and the time when the disturbing sound arrives at the second microphone, and that calculates a mask coefficient for separating the spectral components of the sound arriving from the first range from the spectral components of the sound arriving from the second range;
an utterance amount ratio calculation unit that calculates over time, among the first spectral components, the ratio of the amount of the spectral components of the sound arriving from the first range to the amount of the spectral components of the sound arriving from the second range, and that smooths the most recently calculated ratio using ratios calculated in the past;
a gain calculation unit that calculates a correction gain for correcting the mask coefficient such that the higher the smoothed ratio, the lower the strength of the masking; and
a mask correction unit that calculates the filtering coefficient by correcting the mask coefficient with the correction gain.
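
The temporal smoothing of the ratio can be sketched as follows. The claim only requires that the most recent ratio be smoothed using past ratios; the first-order recursive (exponential) average, the smoothing factor `alpha`, the class name `RatioSmoother`, and the gain mapping in `correction_gain` are assumptions made for this example.

```python
class RatioSmoother:
    """First-order recursive smoothing of the per-frame ratio (assumed form)."""

    def __init__(self, alpha=0.9):
        if not 0.0 <= alpha < 1.0:
            raise ValueError("alpha must be in [0, 1)")
        self.alpha = alpha      # weight given to the past smoothed value
        self.smoothed = None    # no history before the first frame

    def update(self, ratio):
        """Blend the newest ratio with the smoothed history and return it."""
        if self.smoothed is None:
            self.smoothed = float(ratio)
        else:
            self.smoothed = self.alpha * self.smoothed + (1.0 - self.alpha) * ratio
        return self.smoothed


def correction_gain(smoothed_ratio):
    """Map a smoothed ratio to a gain in [0, 1); larger means weaker masking."""
    return smoothed_ratio / (1.0 + smoothed_ratio)
```

Smoothing of this kind keeps the correction gain from jumping between frames when the instantaneous ratio fluctuates, at the cost of a slower reaction to genuine changes in who is speaking.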

The information processing apparatus according to claim 1 or 2, wherein the utterance amount ratio calculation unit calculates the ratio after excluding spectral components of sound arriving from directions included in neither the first range nor the second range.
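
The exclusion described in this claim can be illustrated by computing the ratio only over frequency bins whose estimated time difference falls in one of the two ranges. The per-bin delay input, the tolerance `tol`, the use of power as the "amount", and the function name `utterance_ratio` are assumptions made for this sketch.

```python
import numpy as np


def utterance_ratio(power, delay, tau_target, tau_interf, tol, eps=1e-12):
    """Ratio of first-range power to second-range power.

    Bins whose delay lies in neither range are selected by neither boolean
    index, so they contribute to neither the numerator nor the denominator.
    """
    in_first = np.abs(delay - tau_target) <= tol
    in_second = np.abs(delay - tau_interf) <= tol
    return power[in_first].sum() / (power[in_second].sum() + eps)
```

In this sketch a loud component arriving from some third direction leaves the ratio unchanged, so it cannot by itself weaken or strengthen the masking.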

A program that causes a computer to function as:
an analog/digital conversion unit that receives a first observation analog signal generated by a first microphone based on an observed sound, the observed sound including a target sound arriving from a first direction and a disturbing sound arriving from a second direction different from the first direction, and a second observation analog signal generated by a second microphone based on the observed sound, and that generates a first observation digital signal and a second observation digital signal by converting each of the first observation analog signal and the second observation analog signal into a digital signal;
a time/frequency conversion unit that generates a first spectral component and a second spectral component by converting each of the first observation digital signal and the second observation digital signal into a signal in the frequency domain;
a mask generation unit that, using a cross-correlation function of the first spectral component and the second spectral component, calculates, from the time difference between the time when the observed sound arrives at the first microphone and the time when the observed sound arrives at the second microphone, a filtering coefficient for masking spectral components of sound arriving from a direction different from the first direction;
a masking filter unit that separates a spectral component by masking the first spectral component using the filtering coefficient; and
a time/frequency inverse conversion unit that generates an output digital signal by converting the separated spectral component into a signal in the time domain,
wherein the mask generation unit includes:
a mask coefficient calculation unit that, using the cross-correlation function of the first spectral component and the second spectral component, distinguishes, within the observed sound, sound arriving from a first range including the first direction from sound arriving from a second range that includes the second direction and does not overlap the first range, based on a first time difference between the time when the target sound arrives at the first microphone and the time when the target sound arrives at the second microphone and a second time difference between the time when the disturbing sound arrives at the first microphone and the time when the disturbing sound arrives at the second microphone, and that calculates a mask coefficient for separating the spectral components of the sound arriving from the first range from the spectral components of the sound arriving from the second range;
an utterance amount ratio calculation unit that calculates, among the first spectral components, the ratio of the amount of the spectral components of the sound arriving from the first range to the amount of the spectral components of the sound arriving from the second range;
a gain calculation unit that calculates a correction gain for correcting the mask coefficient such that the higher the ratio, the lower the strength of the masking; and
a mask correction unit that calculates the filtering coefficient by correcting the mask coefficient with the correction gain.

A program that causes a computer to function as:
an analog/digital conversion unit that receives a first observation analog signal generated by a first microphone based on an observed sound, the observed sound including a target sound arriving from a first direction and a disturbing sound arriving from a second direction different from the first direction, and a second observation analog signal generated by a second microphone based on the observed sound, and that generates a first observation digital signal and a second observation digital signal by converting each of the first observation analog signal and the second observation analog signal into a digital signal;
a time/frequency conversion unit that generates a first spectral component and a second spectral component by converting each of the first observation digital signal and the second observation digital signal into a signal in the frequency domain;
a mask generation unit that, using a cross-correlation function of the first spectral component and the second spectral component, calculates, from the time difference between the time when the observed sound arrives at the first microphone and the time when the observed sound arrives at the second microphone, a filtering coefficient for masking spectral components of sound arriving from a direction different from the first direction;
a masking filter unit that separates a spectral component by masking the first spectral component using the filtering coefficient; and
a time/frequency inverse conversion unit that generates an output digital signal by converting the separated spectral component into a signal in the time domain,
wherein the mask generation unit includes:
a mask coefficient calculation unit that, using the cross-correlation function of the first spectral component and the second spectral component, distinguishes, within the observed sound, sound arriving from a first range including the first direction from sound arriving from a second range that includes the second direction and does not overlap the first range, based on a first time difference between the time when the target sound arrives at the first microphone and the time when the target sound arrives at the second microphone and a second time difference between the time when the disturbing sound arrives at the first microphone and the time when the disturbing sound arrives at the second microphone, and that calculates a mask coefficient for separating the spectral components of the sound arriving from the first range from the spectral components of the sound arriving from the second range;
an utterance amount ratio calculation unit that calculates over time, among the first spectral components, the ratio of the amount of the spectral components of the sound arriving from the first range to the amount of the spectral components of the sound arriving from the second range, and that smooths the most recently calculated ratio using ratios calculated in the past;
a gain calculation unit that calculates a correction gain for correcting the mask coefficient such that the higher the smoothed ratio, the lower the strength of the masking; and
a mask correction unit that calculates the filtering coefficient by correcting the mask coefficient with the correction gain.

An information processing method comprising:
receiving a first observation analog signal generated by a first microphone based on an observed sound, the observed sound including a target sound arriving from a first direction and a disturbing sound arriving from a second direction different from the first direction, and a second observation analog signal generated by a second microphone based on the observed sound, and generating a first observation digital signal and a second observation digital signal by converting each of the first observation analog signal and the second observation analog signal into a digital signal;
generating a first spectral component and a second spectral component by converting each of the first observation digital signal and the second observation digital signal into a signal in the frequency domain;
calculating, using a cross-correlation function of the first spectral component and the second spectral component, from the time difference between the time when the observed sound arrives at the first microphone and the time when the observed sound arrives at the second microphone, a filtering coefficient for masking spectral components of sound arriving from a direction different from the first direction;
separating a spectral component by masking the first spectral component using the filtering coefficient; and
generating an output digital signal by converting the separated spectral component into a signal in the time domain,
wherein calculating the filtering coefficient includes:
distinguishing, using the cross-correlation function of the first spectral component and the second spectral component, within the observed sound, sound arriving from a first range including the first direction from sound arriving from a second range that includes the second direction and does not overlap the first range, based on a first time difference between the time when the target sound arrives at the first microphone and the time when the target sound arrives at the second microphone and a second time difference between the time when the disturbing sound arrives at the first microphone and the time when the disturbing sound arrives at the second microphone, and calculating a mask coefficient for separating the spectral components of the sound arriving from the first range from the spectral components of the sound arriving from the second range;
calculating, among the first spectral components, the ratio of the amount of the spectral components of the sound arriving from the first range to the amount of the spectral components of the sound arriving from the second range;
calculating a correction gain for correcting the mask coefficient such that the higher the ratio, the lower the strength of the masking; and
calculating the filtering coefficient by correcting the mask coefficient with the correction gain.
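
The outer steps of the method (time/frequency conversion, masking of the first spectral component, and inverse conversion) can be sketched for a single frame as below. The Hann window, the use of the real FFT, and the callback `coeff_fn` that supplies the filtering coefficients are assumptions made for illustration; framing with overlap-add, which a continuous signal would need, is omitted.

```python
import numpy as np


def separate_frame(x1, x2, coeff_fn):
    """Process one frame of the two digitized microphone signals.

    coeff_fn(X1, X2) is a hypothetical callback that returns one real
    filtering coefficient per frequency bin.
    """
    window = np.hanning(len(x1))
    X1 = np.fft.rfft(x1 * window)   # first spectral component
    X2 = np.fft.rfft(x2 * window)   # second spectral component
    coeff = coeff_fn(X1, X2)        # filtering coefficients for this frame
    separated = coeff * X1          # mask only the first spectral component
    return np.fft.irfft(separated, n=len(x1))  # back to the time domain
```

Applying the coefficients to the first spectral component alone mirrors the wording of the masking step; the second microphone contributes only to the estimation of the coefficients.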

An information processing method comprising:
receiving a first observation analog signal generated by a first microphone based on an observed sound, the observed sound including a target sound arriving from a first direction and a disturbing sound arriving from a second direction different from the first direction, and a second observation analog signal generated by a second microphone based on the observed sound, and generating a first observation digital signal and a second observation digital signal by converting each of the first observation analog signal and the second observation analog signal into a digital signal;
generating a first spectral component and a second spectral component by converting each of the first observation digital signal and the second observation digital signal into a signal in the frequency domain;
calculating, using a cross-correlation function of the first spectral component and the second spectral component, from the time difference between the time when the observed sound arrives at the first microphone and the time when the observed sound arrives at the second microphone, a filtering coefficient for masking spectral components of sound arriving from a direction different from the first direction;
separating a spectral component by masking the first spectral component using the filtering coefficient; and
generating an output digital signal by converting the separated spectral component into a signal in the time domain,
wherein calculating the filtering coefficient includes:
distinguishing, using the cross-correlation function of the first spectral component and the second spectral component, within the observed sound, sound arriving from a first range including the first direction from sound arriving from a second range that includes the second direction and does not overlap the first range, based on a first time difference between the time when the target sound arrives at the first microphone and the time when the target sound arrives at the second microphone and a second time difference between the time when the disturbing sound arrives at the first microphone and the time when the disturbing sound arrives at the second microphone, and calculating a mask coefficient for separating the spectral components of the sound arriving from the first range from the spectral components of the sound arriving from the second range;
calculating over time, among the first spectral components, the ratio of the amount of the spectral components of the sound arriving from the first range to the amount of the spectral components of the sound arriving from the second range, and smoothing the most recently calculated ratio using ratios calculated in the past;
calculating a correction gain for correcting the mask coefficient such that the higher the smoothed ratio, the lower the strength of the masking; and
calculating the filtering coefficient by correcting the mask coefficient with the correction gain.