JP5294603B2

JP5294603B2 - Acoustic signal estimation device, acoustic signal synthesis device, acoustic signal estimation synthesis device, acoustic signal estimation method, acoustic signal synthesis method, acoustic signal estimation synthesis method, program using these methods, and recording medium

Info

Publication number: JP5294603B2
Application number: JP2007259797A
Authority: JP
Inventors: 健弘守谷; 登原田; 優鎌本
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2007-10-03
Filing date: 2007-10-03
Publication date: 2013-09-18
Anticipated expiration: 2027-10-03
Also published as: JP2009089315A

Description

The present invention relates to an acoustic signal estimation device, an acoustic signal synthesis device, an acoustic signal estimation and synthesis device, an acoustic signal estimation method, an acoustic signal synthesis method, an acoustic signal estimation and synthesis method, a program using these methods, and a recording medium, which estimate the position or direction, intensity, and phase of sound sources from acoustic signals of a plurality of channels and synthesize the acoustic signal at an arbitrary position.

Techniques that pick up three-dimensional acoustic signals with a plurality of microphones in order to separate sound sources or suppress noise are well known. The position of a sound source can be obtained with a sensor. Sounds can also be separated and collected individually with a microphone array. The SAFIA method (Non-patent document 1) and the CSCC method (Non-patent document 2) are known as such means.
Non-patent document 1: Mariko Aoki, Yoshikazu Yamaguchi, Kenichi Furuya, Akitoshi Kataoka, "Separation and extraction of closely located sound sources under high noise using the sound source separation method SAFIA", Journal of IEICE A, Vol. J88-A, No. 4, pp. 468-479, 2005.
Non-patent document 2: Kyosuke Matsumoto, Nobutaka Ono, Shigeki Sagayama, "A study of noise suppression by the phase-difference-constrained complex spectrum circle centroid (CSCC) method", Proceedings of the Acoustical Society of Japan, 3-1-11, pp. 499-500, 2006.

In general, a plurality of microphones are installed at positions away from the sound sources and pick up sound continuously. However, the positions and number of the sound sources are not clearly known and may change over time. In such a case, a method that separates sound sources by assuming several parameters cannot provide the sound that would be picked up at an arbitrary position.
An object of the present invention is to solve this problem by providing a method that estimates sound sources from the sound picked up by a plurality of microphones and synthesizes the sound at an arbitrary position.

The acoustic signal estimation device of the present invention comprises a band dividing unit and a sound source estimating unit. The band dividing unit divides the acoustic signals of a plurality of channels, picked up by a plurality of microphones, into predetermined frequency bands for each channel to generate band signals. The sound source estimating unit estimates the position or direction, intensity, and phase of the sound sources for each frequency band, and then obtains a residual band signal for each channel by removing the signals from the sound sources from the band signal. Specifically, in a frequency band in which one or more sound sources could be estimated, the residual band signal of each channel is obtained by removing the signals from the sound sources from the band signal of that channel. In a frequency band in which no sound source could be estimated, the band signal of each channel is used as the residual band signal.

The acoustic signal synthesis device of the present invention comprises a band signal component estimating unit, a band signal component adding unit, and a band integrating unit. Its inputs are the position or direction of each sound source, the intensity and phase of each sound source for each frequency band, the residual band signal of each channel, and the position at which the sound is to be synthesized. The band signal component estimating unit estimates the band signal from each sound source at the designated position from the position or direction of each sound source and its intensity and phase for each frequency band. The band signal component adding unit obtains the band signal at the designated position by a weighted addition of the estimated band signals from the sound sources and the residual band signals of the channels. The band integrating unit converts the band signal at the designated position into a time-domain signal.

The acoustic signal estimation and synthesis device of the present invention comprises the acoustic signal estimation device described above, a recording unit, and an acoustic signal synthesis device. The recording unit records the position or direction of each sound source, the intensity and phase of each sound source for each frequency band, and the residual band signal of each channel, all of which are output from the acoustic signal estimation device. The inputs of the acoustic signal synthesis device are the estimated position or direction of each sound source and its intensity and phase for each frequency band as recorded in the recording unit, the residual band signal of each channel, and the position for which the picked-up sound is to be synthesized.
The acoustic signal estimation device or the acoustic signal synthesis device may contain the recording unit internally.

According to the acoustic signal estimation device of the present invention, the position or direction of one or more sound sources and their intensity and phase for each frequency band are estimated from the acoustic signals of a plurality of channels picked up by a plurality of microphones, and the residual band signal of each channel is obtained. The sound can therefore be divided into sound whose source could be estimated and sound, such as noise, whose source cannot be estimated.
According to the acoustic signal synthesis device of the present invention, for sound whose source could be estimated, the sound that would be picked up at the designated position can be calculated from the position or direction of the sound source. For sound whose source cannot be estimated, the sound that would be picked up at the designated position can be calculated from the residual band signal of each channel (the part of the band signal whose source cannot be identified). Because these are combined by weighted addition, the sound at the designated position can be synthesized.

The acoustic signal estimation and synthesis device of the present invention has the effects of both the acoustic signal estimation device and the acoustic signal synthesis device described above, so it can synthesize the sound at a designated position from the acoustic signals of a plurality of channels picked up by a plurality of microphones.
Because of these effects it becomes possible, for example, to synthesize acoustic signals for a free-viewpoint video system that synthesizes images and video for an arbitrary viewpoint from cameras at a plurality of locations.

The principle and embodiments of the present invention are described below with reference to the drawings.
Principle
FIG. 1 shows an example with four microphones and sound sources far enough away that the propagated sound can be approximated as a plane wave. In general, the plane-wave approximation holds when a sound source is at least ten times farther away than the spacing between the two most widely separated microphones. In FIG. 1, the four microphones 501 to 504 are arranged in a straight line. The sound from sound source A is assumed to arrive from the direction perpendicular to the line of microphones. In this case the wavefront reaches all microphones at the same time, so the input signal from sound source A is identical at every microphone. The sound from sound source B is assumed to arrive from a direction that is not perpendicular to the line of microphones. In this case the arrival time of the sound from sound source B differs from microphone to microphone, and the phase of the band signal component in each band differs as well. FIG. 2 shows an example of the spectrum of the sound propagated from sound source A at locations 501 to 504, where microphones 501 to 504 are installed. FIG. 3 shows examples of the spectrum of the sound propagated from sound source B at locations 501 to 504: FIG. 3(A) is the spectrum at location 501, FIG. 3(B) at location 502, FIG. 3(C) at location 503, and FIG. 3(D) at location 504. FIGS. 4 to 6 show the spectra of the sound from sound sources A and B together: FIG. 4 at location 501, FIG. 5 at location 502, and FIG. 6 at location 503.
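The arrival-time behaviour described for FIG. 1 can be illustrated with a small sketch. It is not taken from the patent: the function name `plane_wave_delays`, the microphone spacing, the 30 degree angle, and the speed of sound are all assumptions chosen for illustration, and the angle is measured from the direction perpendicular to the array.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s; assumed value for air at roughly 20 degrees C


def plane_wave_delays(mic_positions, angle_deg):
    """Arrival delay in seconds at each microphone of a linear array,
    relative to the first microphone, for a plane wave.

    mic_positions: coordinates (m) of the microphones along the array axis.
    angle_deg: arrival angle measured from the array normal (0 = perpendicular).
    With this sign convention, a positive angle makes microphones at larger
    coordinates receive the wavefront later.
    """
    s = math.sin(math.radians(angle_deg))
    x0 = mic_positions[0]
    return [(x - x0) * s / SPEED_OF_SOUND for x in mic_positions]


mics = [0.0, 0.1, 0.2, 0.3]               # four microphones, 10 cm apart (hypothetical)
delays_a = plane_wave_delays(mics, 0.0)   # like source A: perpendicular arrival
delays_b = plane_wave_delays(mics, 30.0)  # like source B: oblique arrival
```

With perpendicular arrival every delay is zero, which corresponds to identical input signals at all microphones. With oblique arrival the delays grow along the array, which appears as a band-dependent phase difference in the frequency domain.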

In the present invention, the direction and spectrum of each sound source are estimated from the spectra of the sound picked up by the plurality of microphones in this way. When spherical waves are assumed, as in FIG. 7, the position of the sound source is estimated instead of its direction. The spectrum of the sound from each estimated sound source at each microphone is then calculated, and the signal that remains as a residual (the residual signal) is treated as noise whose source cannot be identified. Next, from the estimated position and spectrum of each sound source, the spectrum of the sound from each sound source is obtained at the position where the acoustic waveform is wanted (the designated position). The spectrum of the residual signal at the designated position is obtained by a weighted addition of the residual signals of the microphones near the designated position, with weights that take into account the distance between the designated position and each microphone. Adding these together synthesizes the acoustic waveform at the designated position.

An existing method may be used to estimate the direction and spectrum of a sound source. One example is a method that uses the phase differences of the sound picked up by the microphones. With two microphones, for instance, if the cross-correlation function shows a clear peak at some time difference, it can be judged that there is one sound source. With two or more microphones, solving simultaneous equations under the assumption of a single sound source, or evaluating the phase difference in the frequency domain, makes it possible to judge whether the result can be regarded as a single sound source. In other words, in general, if there are two or more microphones, the direction of a sound source can be estimated from the phase differences of the picked-up sound in the individual frequency bands.
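The two-microphone case can be sketched as follows. This is only an illustrative brute-force search for the cross-correlation peak, not the estimation method of the patent. The function names and the test signals are hypothetical.

```python
def cross_correlation(a, b, lag):
    """Sum of a[t] * b[t + lag] over the samples where both are defined."""
    return sum(a[t] * b[t + lag] for t in range(len(a)) if 0 <= t + lag < len(b))


def estimate_delay(a, b, max_lag):
    """Return (lag, peak): the lag in samples, within [-max_lag, max_lag],
    at which the cross-correlation of a and b is largest, and its value.
    A positive lag means b is a delayed copy of a."""
    scores = {lag: cross_correlation(a, b, lag) for lag in range(-max_lag, max_lag + 1)}
    best = max(scores, key=scores.get)
    return best, scores[best]


# Hypothetical signals: the second microphone receives the same pulse 3 samples later.
mic1 = [0, 0, 1, 2, 1, 0, 0, 0, 0, 0, 0, 0]
mic2 = [0, 0, 0, 0, 0, 1, 2, 1, 0, 0, 0, 0]
lag, peak = estimate_delay(mic1, mic2, max_lag=5)
```

Dividing the lag by the sampling frequency gives the time difference. Whether the peak is "clear" enough to conclude that a single source is present would need a threshold, which is left out of this sketch.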

The SAFIA method assumes that each individual band contains only one dominant sound source component, and obtains the position of the sound source and the sound from it. The spectrum of a sound source has strong and weak parts, and when attention is paid to one band, it is relatively rare for the main components to come from more than one sound source. For example, as shown in FIGS. 4 to 6, most of the frequencies at which spectral components exist differ between the sound from sound source A and the sound from sound source B (see, for example, bands of interest a and c in FIGS. 4 to 6). Therefore, when the signal is divided into bands, a given band is dominated by the sound of either sound source A or sound source B, with almost nothing from the other. The SAFIA method exploits this property.

In the CSCC method, when the input spectrum from the other sound sources is constant, or has been converted so that it is constant, the direction of a single sound source and its signal component are separated and estimated from the arrangement on the complex plane of the spectra that the source produces at the plurality of microphones. When there is almost no component from sound source A, as in band of interest a, or when the signal component from sound source A can be converted so that it is common to all microphones, for example by applying a delay to each signal, the direction of sound source B can be estimated accurately from the components of sound source B at locations 501 to 504. The accuracy of the estimated source position depends on how much other sound is present. In band of interest c there is almost no component from sound source B, so the direction of sound source A can be estimated accurately from the components of sound source A at locations 501 to 504. Here the spectrum is the same at every location, which shows that sound source A lies in the direction perpendicular to the line of microphones. In band of interest b the components of both sound source A and sound source B are strong, so simple separation is difficult. In this case, the component from sound source A and the component from sound source B are estimated using the source positions estimated in the bands where the direction estimate is highly reliable (for example, bands of interest a and c). In this example the component from sound source A does not depend on the microphone location, so it can be regarded as a constant.

There are also techniques that separate a plurality of sound sources from the signals of microphones whose number is equal to or greater than the number of sound sources (Japanese Patent Application Laid-Open No. 2006-243664). In addition, dividing the signal into bands makes the frequency components produced by each sound source unevenly distributed across bands, so separation becomes possible even with a small number of microphones (Japanese Patent Application Laid-Open No. 2007-198977).

The present invention also estimates the direction (or position) and spectrum of each sound source by separating, source by source, the signals picked up by a plurality of microphones on the premise that there are a plurality of sound sources. It therefore shares the use of the signal separation methods described above, or similar methods, and the method to use may be chosen as appropriate. However, the object of the present invention is to synthesize the sound at an arbitrary position, not to separate the sound of each source. In other words, what matters in the present invention is that the result can be synthesized so that it resembles the sound at the designated position, rather than that the sounds are separated accurately. The present invention therefore estimates the position or direction of each sound source and its source band signal (intensity and phase for each frequency band) as far as is possible with any of the methods described above, and treats the remaining signal as a residual signal whose source position cannot be identified. A residual signal is obtained for each microphone. The direction of each sound source, the source band signal (complex spectrum) for each frequency band, and the residual signal for each frequency band of each microphone (channel), that is, the residual band signal, are then recorded.
To synthesize the sound at the designated position, the band signal from each sound source at the designated position is estimated from the position or direction of each sound source and its source band signal (intensity and phase for each frequency band). The band signal at the designated position is then obtained by a weighted addition of the estimated band signals from the sound sources and the residual band signals of the channels. Finally, the band signal at the designated position is converted into a time-domain signal.

[First Embodiment]
FIG. 8 shows an example of the functional configuration of the acoustic signal estimation and synthesis device of the present invention, and FIG. 9 shows an example of its processing flow. The acoustic signal estimation and synthesis device 100 comprises a band dividing unit 110, a sound source estimating unit 120, a recording unit 130, a band signal component estimating unit 140, a band signal component adding unit 150, and a band integrating unit 160. The band dividing unit 110 divides the K-channel acoustic signals x_1(t), x_2(t), …, x_K(t), picked up by K microphones (K is an integer of 2 or more), into predetermined frequency bands ω for each channel to generate band signals X_1(ω), X_2(ω), …, X_K(ω) (S110). The acoustic signal x_1(t) is one sample value (a scalar) in a frame of T samples, where t takes the values 0, …, T−1. From this acoustic signal x_1(t), a band signal X_1(ω) is obtained for each predetermined frequency band. The band signal X_1(ω) is, for example, a complex spectrum. It may instead be a band-divided complex signal, but it is described below as a complex spectrum. As in the following expression, each frame of T points in the time domain is subjected to a complex Fourier transform, and the resulting T/2 complex Fourier coefficients are used as the band signal X_1(ω).

$$X_1(\omega) = \sum_{t=0}^{T-1} x_1(t)\, e^{-j 2 \pi \omega t / T}$$

Here, ω = 0, …, T/2, j is the imaginary unit, and π is the ratio of a circle's circumference to its diameter.
The band signal X_1(ω) represents the amplitude and phase, for each frequency band ω, of the signal at the position of the first microphone (the first channel). When the sampling frequency is f [Hz], X_1(ω) can be regarded as the band signal whose center frequency is ωf/T [Hz]. The input to the band dividing unit 110 may instead be an analog acoustic signal, with the values sampled inside the band dividing unit 110 used as the acoustic signal x_1(t). In either case the output is the same.
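As a rough illustration of step S110, the band division can be sketched as a direct discrete Fourier transform of one frame. The function names are hypothetical, the negative sign in the exponent is an assumed (conventional) choice, and a practical implementation would normally use an FFT and a window function, neither of which is discussed in the text.

```python
import cmath
import math


def band_signals(frame):
    """Complex spectrum X(w) for w = 0, ..., T/2 of one real-valued frame
    of T samples, computed by a direct DFT (O(T^2), for illustration only)."""
    T = len(frame)
    return [
        sum(frame[t] * cmath.exp(-2j * cmath.pi * w * t / T) for t in range(T))
        for w in range(T // 2 + 1)
    ]


def center_frequency(w, fs, T):
    """Center frequency in Hz of band w for sampling frequency fs and frame length T."""
    return w * fs / T


# Hypothetical frame: a cosine that completes exactly 2 cycles in T = 8 samples.
T = 8
frame = [math.cos(2 * math.pi * 2 * t / T) for t in range(T)]
X = band_signals(frame)
```

For this frame the energy falls in band ω = 2, which at a sampling frequency of 8000 Hz would correspond to a center frequency of 2000 Hz.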

The sound source estimating unit 120 uses an existing method to estimate, for each frequency band ω, the positions or directions D_{ω,1}, D_{ω,2}, …, D_{ω,M_ω} of the sound sources and the source band signals S_{ω,1}, S_{ω,2}, …, S_{ω,M_ω} (M_ω is the number of sound sources in frequency band ω and is an integer of 0 or more). The source band signal S_{ω,1} is the intensity and phase information (for example, a complex spectrum) needed to calculate the signal produced near the microphones by the sound propagated from the first sound source in frequency band ω. For example, if D_{ω,1} indicates the position of the sound source and the sound is treated as a spherical wave, the source band signal S_{ω,1} may be a complex spectrum indicating the intensity and phase at the position of the sound source. If D_{ω,1} indicates the direction of the sound source and the sound is approximated as a plane wave, the source band signal S_{ω,1} may be a complex spectrum indicating the intensity and phase at some reference position, which need not be the position of the sound source.
In the course of this estimation, the signal U_{k,ω,m} from each sound source at the position of each microphone is also obtained (k is the microphone number, an integer from 1 to K). The signal U_{k,ω,m} is the signal from the m-th sound source in frequency band ω at the position of the k-th microphone (m is the source number assigned within each frequency band ω, an integer from 1 to M_ω). For example, under the plane-wave approximation, the phase difference between the reference position of the source band signal S_{ω,1} and the position of microphone k is obtained from the inner product of the vector connecting those two positions with the unit vector in the direction of sound propagation (a value that indicates how far apart the two positions are along the propagation direction). The signal obtained by shifting the phase of S_{ω,1} by that phase difference is used as the signal U_{k,ω,1} at the position of microphone k.
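The plane-wave phase shift described above might be sketched as follows. The function name `signal_at`, the speed of sound, and the sign convention (a position farther along the propagation direction lags in phase) are assumptions for illustration. The propagation direction is assumed to be a unit vector.

```python
import cmath

SPEED_OF_SOUND = 343.0  # m/s; assumed value


def signal_at(S, ref_pos, target_pos, prop_dir, freq_hz):
    """Shift the phase of the complex band value S, defined at ref_pos,
    to obtain the plane-wave band value at target_pos.

    prop_dir: unit vector in the direction of sound propagation (assumed normalised).
    freq_hz: center frequency of the band.
    """
    # Inner product: how far target_pos lies beyond ref_pos along the propagation direction.
    offset = sum((p - r) * u for p, r, u in zip(target_pos, ref_pos, prop_dir))
    phase = 2 * cmath.pi * freq_hz * offset / SPEED_OF_SOUND
    return S * cmath.exp(-1j * phase)


# Hypothetical values: a 343 Hz band has a 1 m wavelength, so 0.25 m is a quarter period.
S = 1 + 0j
U_same_wavefront = signal_at(S, (0.0, 0.0), (0.5, 0.0), (0.0, 1.0), 343.0)
U_quarter_wave = signal_at(S, (0.0, 0.0), (0.0, 0.25), (0.0, 1.0), 343.0)
```

A microphone on the same wavefront as the reference position receives S unchanged, while one a quarter wavelength farther along the propagation direction receives S with a 90 degree phase lag. The same function could be reused for a designated position P in place of a microphone position.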

For a frequency band ω in which one or more sound sources could be estimated, the residual band signals N_1(ω), N_2(ω), …, N_K(ω) are obtained by removing, for each channel, the signals from the sound sources from the band signal, as in the following expression.

$$N_k(\omega) = X_k(\omega) - \sum_{m=1}^{M_\omega} U_{k,\omega,m}$$

For a frequency band ω in which no sound source could be estimated, N_k(ω) = X_k(ω) is computed for every microphone k in that band, so that the band signal X_k(ω) of each channel becomes the residual band signals N_1(ω), N_2(ω), …, N_K(ω) (S120). In other words, for each channel, the residual band signals N_1(ω), N_2(ω), …, N_K(ω) are obtained by subtracting from the band signal the signals of the estimated sound sources at the microphone position. Because the present invention treats any signal whose source position could not be estimated as a residual band signal, there is no need to force such a signal onto one of the sound sources.
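The residual computation of step S120 can be sketched in a few lines. The nested-list data layout and the function name are hypothetical. A band with no estimated source is represented by an empty list, so the band signal passes through unchanged.

```python
def residual_band_signals(X, U):
    """Residual band signal N[k][w] = X[k][w] minus the sum of the estimated
    source signals at microphone k in band w.

    X[k][w]: complex band signal of channel k in band w.
    U[k][w]: list of complex source signals at microphone k in band w
             (empty when no source was estimated in that band).
    """
    return [
        [X[k][w] - sum(U[k][w]) for w in range(len(X[k]))]
        for k in range(len(X))
    ]


# Hypothetical data: 2 channels, 2 bands; one source estimated in band 0, none in band 1.
X = [[1 + 1j, 2 + 0j], [0.5 + 0j, 2j]]
U = [[[1 + 0j], []], [[0.5 + 0j], []]]
N = residual_band_signals(X, U)
```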

Whether the position or the direction of a sound source is estimated depends on whether the sound is assumed to be a spherical wave or a plane wave. This assumption is fixed in advance. The method used to estimate the position or direction, intensity, and phase of the sound sources may be selected as appropriate, for example from the methods described above. As noted above, in the present invention it is more important that the finally synthesized sound resembles the sound at the designated position than that the position (or direction) and spectrum of each sound source are estimated accurately. The position or direction of each sound source, its intensity and phase for each frequency band, and the residual band signal of each channel, as estimated in step S120, are recorded in the recording unit 130. The recorded information may be encoded.

When a position P is designated, the band signal component estimating unit 140 estimates the band signal Z(ω), which combines the sound from all sound sources at the designated position P, from the positions or directions D_{ω,1}, D_{ω,2}, …, D_{ω,M_ω} and the source band signals S_{ω,1}, S_{ω,2}, …, S_{ω,M_ω} of the sound sources in each frequency band ω (S140). For example, the signal U_{P,ω,m} from each sound source at position P is obtained for each frequency band ω (m is the source number assigned within each frequency band ω, an integer from 1 to M_ω). The signal U_{P,ω,m} may be obtained in the same way as the signal U_{k,ω,m} from a sound source at the position of each microphone in the sound source estimating unit 120. Adding the signals U_{P,ω,m} from the sound sources at position P for each frequency band ω, as follows, gives the band signal Z(ω).

$$Z(\omega) = \sum_{m=1}^{M_\omega} U_{P,\omega,m}$$

The band signal component adding unit 150 obtains the band signal Y(ω) at the designated position P by a weighted addition of the band signal Z(ω), which combines the sound from all estimated sound sources, and the residual band signals N_1(ω), N_2(ω), …, N_K(ω) of the channels (S150). For example, as in the following expression, the band signal Z(ω) is multiplied by a weight of 1, and each residual band signal is multiplied by a weight set according to the distance between the microphone of that channel and the designated position P (for example, inversely proportional to it) so that the weights over all channels sum to 1. The weighted terms are then added.

$$Y(\omega) = Z(\omega) + \sum_{k=1}^{K} \frac{1/d_k}{\sum_{i=1}^{K} 1/d_i}\, N_k(\omega)$$

Here, d_k is the distance between the k-th microphone and the position P.
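Steps S140 and S150 for a single band can be sketched as below. Inverse-distance weighting is only the example mentioned in the text, and other distance-dependent weights are possible. The function names and numerical values are hypothetical, and the sketch assumes P does not coincide exactly with a microphone position, since that would cause a division by zero.

```python
import math


def inverse_distance_weights(mic_positions, P):
    """Weights inversely proportional to the distance d_k between each
    microphone and the designated position P, normalised to sum to 1.
    Assumes P differs from every microphone position."""
    inv = [1.0 / math.dist(m, P) for m in mic_positions]
    total = sum(inv)
    return [v / total for v in inv]


def synthesize_band(U_P, N, weights):
    """Band value Y at position P: the source signals at P summed with
    weight 1 (giving Z), plus the weighted residuals of all channels."""
    Z = sum(U_P)
    return Z + sum(w * n for w, n in zip(weights, N))


# Hypothetical single band: two sources, two microphones, P between the microphones.
mic_positions = [(0.0, 0.0), (3.0, 0.0)]
P = (1.0, 0.0)
weights = inverse_distance_weights(mic_positions, P)   # distances 1 m and 2 m
Y = synthesize_band([1 + 0j, 1j], [3 + 0j, 3j], weights)
```

The closer microphone contributes two thirds of the residual and the farther one contributes one third, so the residual component changes smoothly as P moves between microphones.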

帯域統合部１６０は、指定された位置Ｐでの帯域信号Ｙ（ω）を、時間領域の信号ｙ（ｔ）に変換する（Ｓ１６０）。例えば、信号ｙ（ｔ）は、Ｔサンプルからなるフレーム内の１つのサンプル値であり、ｔは０，…，Ｔ−１の値を取る。
本発明の音響信号推定合成装置１００はこのような構成なので、音源が推定できた音と雑音などの音源が推定できない音に分けることができる。そして、音源が推定できた音については、音源の位置または方向から指定された位置Ｐでの音を計算できる。また、音源が推定できない音については、各チャネルの残差帯域信号（帯域信号に含まれる音源が特定できない信号）から指定された位置Ｐでの音を計算できる。そして、これらを重み付け加算するので、指定された位置Ｐでの音を合成できる。このような効果があるので、例えば、複数の場所のカメラから任意の視点の画像・映像を合成する自由視点映像システムに対応した音響信号の合成も可能となる。 The band integration unit 160 converts the band signal Y (ω) at the designated position P into a time domain signal y (t) (S160). For example, the signal y (t) is one sample value in a frame composed of T samples, and t takes values of 0,.
Because the acoustic signal estimation/synthesis apparatus 100 of the present invention is configured in this way, it can separate the input into sound whose source can be estimated and sound, such as noise, whose source cannot be estimated. For sound whose source can be estimated, the sound at the designated position P can be calculated from the position or direction of the source. For sound whose source cannot be estimated, the sound at the designated position P can be calculated from the residual band signal of each channel (the part of the band signal whose source cannot be identified). Because these two components are combined by weighted addition, the sound at the designated position P can be synthesized. This makes it possible, for example, to synthesize acoustic signals for a free-viewpoint video system, which synthesizes images and video from an arbitrary viewpoint using cameras at multiple locations.

[Modification]
The first embodiment described the acoustic signal estimation/synthesis apparatus 100. However, the portion up to the estimation of the position or direction of each sound source, the sound source band signal for each frequency band, and the residual band signal of each channel may be implemented as one device (an acoustic signal estimation device). Likewise, the portion that synthesizes the sound at the designated position P from the position or direction of each sound source, the sound source band signal for each frequency band, and the residual band signal of each channel may be implemented as one device (an acoustic signal synthesis device).

The acoustic signal estimation device 200 includes, for example, the band division unit 110 and the sound source estimation unit 120. The recording unit 130 may be provided inside the acoustic signal estimation device 200 or outside it. The acoustic signal synthesis device 300 includes, for example, the band signal component estimation unit 140, the band signal component addition unit 150, and the band integration unit 160.
Thus, even when the acoustic signal estimation/synthesis apparatus is formed as a whole from several separate devices, the same effects as in the first embodiment can be obtained.
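Splitting the apparatus into an estimation device and a synthesis device implies that per-band data is handed from one to the other, directly or through the recording unit. A minimal sketch of such a record is shown below; the class name, field names, and layout are hypothetical and do not appear in the patent.

```python
from dataclasses import dataclass, field


@dataclass
class BandEstimate:
    """Data for one frequency band passed from estimation (200) to synthesis (300)."""
    source_positions: list = field(default_factory=list)  # D_{w,m}, one entry per source
    source_signals: list = field(default_factory=list)    # S_{w,m}, complex intensity and phase
    residuals: list = field(default_factory=list)         # N_k(w), one entry per channel

    def num_sources(self):
        """M_w: number of sources estimated in this band."""
        return len(self.source_positions)
```

A synthesis device built this way would need only a list of such records plus the designated position P.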

FIG. 10 shows an example of the functional configuration of a computer. The acoustic signal estimation and synthesis method, the acoustic signal estimation method, and the acoustic signal synthesis method of the present invention can be executed by a computer by loading, into the recording unit 2020 of the computer 2000, a program that causes the computer 2000 to operate as each component of the present invention, and then operating the control unit 2010, the input unit 2030, the output unit 2040, and so on. The program can be loaded into the computer by, for example, recording it on a computer-readable recording medium and reading it from that medium, or by reading a program stored on a server or the like through a telecommunication line or the like.

A diagram showing four microphones and the plane-wave sound propagated from a distant sound source.
A diagram showing examples of the spectra of the sound propagated from sound source A at locations 501 to 504.
A diagram showing examples of the spectra of the sound propagated from sound source B at locations 501 to 504.
A diagram showing the spectra of the sounds from sound source A and sound source B at location 501.
A diagram showing the spectra of the sounds from sound source A and sound source B at location 502.
A diagram showing the spectra of the sounds from sound source A and sound source B at location 503.
A diagram showing four microphones and the spherical-wave sound propagated from a sound source.
A diagram showing an example of the functional configuration of the acoustic signal estimation/synthesis apparatus.
A diagram showing an example of the processing flow of the acoustic signal estimation/synthesis apparatus.
A diagram showing an example of the functional configuration of a computer.

Claims

An acoustic signal estimation device comprising:
a band dividing unit that divides a K-channel acoustic signal picked up by K microphones (K is an integer equal to or greater than 2) into predetermined frequency bands ω for each channel k (k = 1, 2, ..., K) to generate band signals X_k(ω); and
a sound source estimation unit that, for each frequency band ω, estimates the signal U_{k,ω,m} from each sound source m (m = 1, 2, ..., Mω, where Mω is the number of sound sources in the frequency band ω) at the position of each microphone k, and obtains a residual band signal N_k(ω) from the band signal X_k(ω) and the signal U_{k,ω,m}.

An acoustic signal synthesis device that receives as input the position or direction D_{ω,m} of each sound source m (m = 1, 2, ..., Mω, where Mω is the number of sound sources in the frequency band ω) for each frequency band ω, the intensity and phase S_{ω,m} for each frequency band ω associated with the position or direction of the sound source, a residual band signal N_k(ω), which is a signal not associated with the position or direction of a sound source and is associated with each channel k (k = 1, 2, ..., K, where K is the number of channels) and frequency band ω, and a position at which sound is to be synthesized, the device comprising:
a band signal component estimation unit that, for each frequency band ω, estimates a band signal Z(ω) combining the signals from the sound sources at the designated position, from the position or direction D_{ω,m} of each sound source and the intensity and phase S_{ω,m} for each frequency band;
a band signal component adding unit that, taking α_k as a weight corresponding to the distance between each microphone k and the designated position, obtains, for each frequency band ω, a band signal Y(ω) at the designated position from the band signal Z(ω) and the residual band signals N_k(ω); and
a band integration unit that converts the band signal Y(ω) at the designated position into a time-domain signal.

An acoustic signal estimation and synthesis device comprising:
a band dividing unit that divides a K-channel acoustic signal picked up by K microphones (K is an integer equal to or greater than 2) into predetermined frequency bands ω for each channel k (k = 1, 2, ..., K) to generate band signals X_k(ω);
a sound source estimation unit that, for each frequency band ω, estimates the position or direction D_{ω,m} of each sound source m (m = 1, 2, ..., Mω, where Mω is the number of sound sources in the frequency band ω), the intensity and phase S_{ω,m}, and the signal U_{k,ω,m} from each sound source m at the position of each microphone k, and obtains a residual band signal N_k(ω) from the band signal X_k(ω) and the signal U_{k,ω,m};
a recording unit that records the position or direction D_{ω,m} of each sound source and the intensity and phase S_{ω,m} for each frequency band ω, and the residual band signal N_k(ω) of each channel;
a band signal component estimation unit that, for each frequency band ω, estimates a band signal Z(ω) combining the signals from the sound sources at a designated position, from the position or direction D_{ω,m} of each sound source and the intensity and phase S_{ω,m} for each frequency band;
a band signal component adding unit that, taking α_k as a weight corresponding to the distance between each microphone k and the designated position, obtains, for each frequency band ω, a band signal Y(ω) at the designated position from the band signal Z(ω) and the residual band signals N_k(ω); and
a band integration unit that converts the band signal Y(ω) at the designated position into a time-domain signal.

An acoustic signal estimation method comprising:
a band dividing step in which a band dividing unit divides a K-channel acoustic signal picked up by K microphones (K is an integer equal to or greater than 2) into predetermined frequency bands ω for each channel k (k = 1, 2, ..., K) to generate band signals X_k(ω); and
a sound source estimation step in which a sound source estimation unit, for each frequency band ω, estimates the signal U_{k,ω,m} from each sound source m (m = 1, 2, ..., Mω, where Mω is the number of sound sources in the frequency band ω) at the position of each microphone k, and obtains a residual band signal N_k(ω) from the band signal X_k(ω) and the signal U_{k,ω,m}.

An acoustic signal synthesis method that receives as input the position or direction D_{ω,m} of each sound source m (m = 1, 2, ..., Mω, where Mω is the number of sound sources in the frequency band ω) for each frequency band ω, the intensity and phase S_{ω,m} for each frequency band ω associated with the position or direction of the sound source, a residual band signal N_k(ω), which is a signal not associated with the position or direction of a sound source and is associated with each channel k (k = 1, 2, ..., K, where K is the number of channels) and frequency band ω, and a position at which sound is to be synthesized, the method comprising:
a band signal component estimation step in which a band signal component estimation unit, for each frequency band ω, estimates a band signal Z(ω) combining the signals from the sound sources at the designated position, from the position or direction D_{ω,m} of each sound source and the intensity and phase S_{ω,m} for each frequency band;
a band signal component adding step in which a band signal component adding unit, taking α_k as a weight corresponding to the distance between each microphone k and the designated position, obtains, for each frequency band ω, a band signal Y(ω) at the designated position from the band signal Z(ω) and the residual band signals N_k(ω); and
a band integration step in which a band integration unit converts the band signal Y(ω) at the designated position into a time-domain signal.

An acoustic signal estimation and synthesis method comprising:
a band dividing step in which a band dividing unit divides a K-channel acoustic signal picked up by K microphones (K is an integer equal to or greater than 2) into predetermined frequency bands ω for each channel k (k = 1, 2, ..., K) to generate band signals X_k(ω);
a sound source estimation step in which a sound source estimation unit, for each frequency band ω, estimates the position or direction D_{ω,m} of each sound source m (m = 1, 2, ..., Mω, where Mω is the number of sound sources in the frequency band ω), the intensity and phase S_{ω,m}, and the signal U_{k,ω,m} from each sound source m at the position of each microphone k, and obtains a residual band signal N_k(ω) from the band signal X_k(ω) and the signal U_{k,ω,m};
a band signal component estimation step in which a band signal component estimation unit, for each frequency band ω, estimates a band signal Z(ω) combining the signals from the sound sources at a designated position, from the position or direction D_{ω,m} of each sound source and the intensity and phase S_{ω,m} for each frequency band;
a band signal component adding step in which a band signal component adding unit, taking α_k as a weight corresponding to the distance between each microphone k and the designated position, obtains, for each frequency band ω, a band signal Y(ω) at the designated position from the band signal Z(ω) and the residual band signals N_k(ω); and
a band integration step in which a band integration unit converts the band signal Y(ω) at the designated position into a time-domain signal.

A program that causes a computer to execute the method according to any one of claims 4 to 6.

A computer-readable recording medium on which the program according to claim 7 is recorded.