JP2009282536A

JP2009282536A - Method and device for removing known acoustic signal

Info

Publication number: JP2009282536A
Application number: JP2009170514A
Authority: JP
Inventors: Masataka Goto; 真孝後藤
Original assignee: National Institute of Advanced Industrial Science and Technology AIST
Current assignee: National Institute of Advanced Industrial Science and Technology AIST
Priority date: 2003-05-30
Filing date: 2009-07-21
Publication date: 2009-12-03

Abstract

PROBLEM TO BE SOLVED: To provide a known acoustic signal removal device that receives a mixed acoustic signal in which a plurality of acoustic signals are mixed and, when given a known acoustic signal similar to one of them, can remove that known acoustic signal.

SOLUTION: The known acoustic signal removal device converts an input mixed acoustic signal m(t) and a known acoustic signal b'(t) into amplitude spectra M(ω, t) and B'(ω, t) in the time-frequency domain, and subtracts from M(ω, t) the component corresponding to B'(ω, t) to obtain the amplitude spectrum S(ω, t) after removal. The component of M(ω, t) that corresponds to B'(ω, t) has been deformed by various factors such as a shift in temporal position, temporal change of frequency characteristics, and temporal change of volume. B'(ω, t) is therefore corrected for these factors to obtain B(ω, t), and B(ω, t) is what is subtracted. Finally, S(ω, t) and the phase of m(t) are used to perform the inverse transform into the time domain, yielding the desired acoustic signal s(t) after removal.

COPYRIGHT: (C)2010, JPO&INPIT

Description

The present invention relates to a method and a device for removing the component of a known acoustic signal from a mixed acoustic signal in which a plurality of acoustic signals are mixed, and to an interface and a program used with the device.

A method called spectral subtraction (Non-Patent Document 1) has long been known. Conventional spectral subtraction obtains a target sound by removing stationary noise (noise whose spectrum does not change over time and whose frequency characteristics, volume, and so on are nearly constant) from an acoustic signal (mixed sound) in which the stationary noise and the desired sound (target sound) are mixed. In this method, the spectrum of the stationary noise is learned in advance by a simple procedure such as averaging the stationary spectrum, and that spectrum is then subtracted from the spectrum of the input mixed sound (in other words, the average of the noise is subtracted). For acoustic signal removal in general, many methods that use input from a plurality of microphones have also been proposed. Various improvements to spectral subtraction have been made as well (Patent Documents).
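For orientation only, the conventional procedure described above can be sketched in a few lines of Python operating on amplitude spectra stored as lists of frames. The function names and the toy numbers are hypothetical, and clipping negative results to zero is a common convention assumed here rather than something stated in the text.

```python
def average_noise_spectrum(noise_frames):
    """Average the amplitude spectrum over frames that contain only stationary noise."""
    n_frames = len(noise_frames)
    n_bins = len(noise_frames[0])
    return [sum(frame[k] for frame in noise_frames) / n_frames for k in range(n_bins)]


def spectral_subtraction(mixed_frames, noise_mean):
    """Subtract the learned noise spectrum from every frame (assumed floor at zero)."""
    return [[max(0.0, m - n) for m, n in zip(frame, noise_mean)]
            for frame in mixed_frames]


# Toy amplitude spectra with 3 frequency bins per frame; the values are made up.
noise_only = [[1.0, 2.0, 1.0], [3.0, 2.0, 1.0]]
mixed = [[5.0, 2.5, 0.5], [2.0, 6.0, 1.0]]
noise_mean = average_noise_spectrum(noise_only)
cleaned = spectral_subtraction(mixed, noise_mean)
```

Because a single averaged spectrum is subtracted from every frame, the sketch makes the limitation discussed next easy to see: it has no way to follow a noise spectrum that changes from frame to frame.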

Non-Patent Document 1: Steven Boll, "Suppression of Acoustic Noise in Speech Using Spectral Subtraction", IEEE Transactions on Acoustics, Speech, and Signal Processing, Vol. ASSP-27, No. 2, April 1979.

Patent Documents:
Japanese Patent Laid-Open No. 2002-175099, "Noise Suppression Method and Noise Suppression Device"
Japanese Patent Laid-Open No. 2002-014694, "Voice Recognition Device"
Japanese Patent Laid-Open No. 2001-228892, "Noise Removal Device, Noise Removal Method, and Recording Medium"
Japanese Patent Laid-Open No. 2001-215992, "Voice Recognition Device"
Japanese Patent Laid-Open No. 11-003094, "Noise Removal Device"
Japanese Patent Laid-Open No. 10-240294, "Noise Reduction Method and Noise Reduction Device"
Japanese Patent Laid-Open No. 08-221092, "Noise Removal System Using Spectral Subtraction"

However, conventional spectral subtraction presupposes stationary noise and cannot be applied to non-stationary noise (noise whose spectrum changes greatly over time and whose frequency characteristics, volume, and so on also change). In particular, it has been impossible to remove non-stationary noise that varies greatly over time, such as music, because the spectrum of non-stationary noise changes too much to be learned. Moreover, even if one tried to use a conventional method in a setting where the non-stationary noise is given in advance, the subtraction could not be performed properly because of changes in the frequency characteristics and volume of the non-stationary noise and expansion or contraction of its amplitude spectrum along the time axis and the frequency axis. Methods that use input from a plurality of microphones cannot be applied to monaural acoustic signals. Furthermore, all of the improved spectral subtraction methods are intended mainly as preprocessing for speech recognition, so they could not be used for applications in which the non-stationary noise is given in advance and is to be removed.

An object of the present invention is to provide a known acoustic signal removal method and device, and a program used with the device, that can remove the component of a known acoustic signal (which may be stationary or non-stationary) from a mixed acoustic signal in which a plurality of acoustic signals are mixed, using the known acoustic signal of the corresponding original sound source.

Another object of the present invention is to provide a known acoustic signal removal method and device, and a program used with the device, that can remove background music (BGM) from a mixed sound in which the known acoustic signal is music used as BGM behind human speech or other sounds, using the known acoustic signal of the original sound source corresponding to that music (an acoustic signal of the same music obtained separately from a CD, a record, or the like).

Another object of the present invention is to provide a known acoustic signal removal method and device, and a program used with the device, that, when removing the component of a known acoustic signal from an acoustic signal in which a plurality of acoustic signals are mixed (mixed sound), can automatically estimate the exact position of the known acoustic signal within the mixed sound and remove the known acoustic signal at that position.

Another object of the present invention is to provide a known acoustic signal removal device having an interface through which a human can specify the exact position of the known acoustic signal within the mixed sound when the component of the known acoustic signal is removed from an acoustic signal in which a plurality of acoustic signals are mixed (mixed sound).

Another object of the present invention is to provide a known acoustic signal removal method and device, and a program used with the device, that, when removing the component of a known acoustic signal from an acoustic signal in which a plurality of acoustic signals are mixed (mixed sound), can automatically estimate and correct for temporal changes in the frequency characteristics and volume of the known acoustic signal within the mixed sound while removing it.

Another object of the present invention is to provide a known acoustic signal removal device having an interface through which a human can specify temporal changes in the frequency characteristics and volume of the known acoustic signal within the mixed sound when the component of the known acoustic signal is removed from an acoustic signal in which a plurality of acoustic signals are mixed (mixed sound).

Another object of the present invention is to provide a known acoustic signal removal method and device, and a program used with the device, that, when removing the component of a known acoustic signal from an acoustic signal in which a plurality of acoustic signals are mixed (mixed sound), can automatically estimate and correct for expansion or contraction of the known acoustic signal along the time axis or the frequency axis within the mixed sound while removing it.

Another object of the present invention is to provide a known acoustic signal removal device having an interface through which a human can specify expansion or contraction of the known acoustic signal along the time axis or the frequency axis within the mixed sound when the component of the known acoustic signal is removed from an acoustic signal in which a plurality of acoustic signals are mixed (mixed sound).

Another object of the present invention is to provide a known acoustic signal removal method and device, and a program used with the device, that can remove known acoustic signals repeatedly, one at a time, when the components of a plurality of known acoustic signals are to be removed from an acoustic signal in which a plurality of acoustic signals are mixed.

The present invention is directed to a known acoustic signal removal method that removes the component of a known acoustic signal (which may be stationary or non-stationary) from a mixed acoustic signal in which a plurality of acoustic signals are mixed, using the known acoustic signal of the corresponding original sound source. In the method of the present invention, the mixed acoustic signal is first converted into a time-frequency representation to obtain the amplitude spectrum and the phase of the mixed acoustic signal (mixed acoustic signal conversion step). Any known transform, such as the Fourier transform or the wavelet transform, can be used to convert an acoustic signal into a time-frequency representation. Next, a known acoustic signal that corresponds to (is similar to) the known acoustic signal contained in the mixed acoustic signal (an acoustic signal of the same music obtained separately from a CD, a record, or the like) is converted into a time-frequency representation to obtain the amplitude spectrum of the known acoustic signal (known acoustic signal conversion step). Then, based on the amplitude spectrum of the mixed acoustic signal, a corrected amplitude spectrum of the known acoustic signal is obtained by correcting at least one of the following relative to the amplitude spectrum of the mixed acoustic signal: a shift in the temporal position of the amplitude spectrum of the known acoustic signal, temporal change of its frequency characteristics, temporal change of its volume, expansion or contraction along the time axis, and expansion or contraction along the frequency axis (correction step). Next, the corrected amplitude spectrum of the known acoustic signal is removed from the amplitude spectrum of the mixed acoustic signal (removal step). The post-removal amplitude spectrum obtained in the removal step and the phase of the mixed acoustic signal are then inverse-transformed into a time-domain representation to obtain unit waveforms (inverse transform step). Finally, the unit waveforms are synthesized by any of various synthesis methods, such as the overlap-add method, to obtain an acoustic signal from which the component of the known acoustic signal has been removed (synthesis step).
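The sequence of steps can be illustrated with a minimal Python sketch that uses only the standard library. It is one possible reading under stated assumptions, not the implementation of the invention: the Fourier transform is chosen as the time-frequency transform, a periodic Hann window with 50% overlap is assumed so that plain overlap-add reconstructs the signal, negative amplitudes after subtraction are clipped to zero, and the two signals are assumed to have equal length. All names (`dft`, `stft`, `remove_known`, `correct`) are hypothetical.

```python
import cmath
import math


def dft(x):
    """Discrete Fourier transform of a list of samples (O(n^2), kept simple for clarity)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spec):
    """Inverse DFT, returning the real part of each sample."""
    n = len(spec)
    return [(sum(spec[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n).real
            for t in range(n)]


def stft(signal, size, hop):
    """Return (amplitude spectra, phases), one list per frame, using a periodic Hann window."""
    window = [0.5 - 0.5 * math.cos(2 * math.pi * i / size) for i in range(size)]
    amps, phases = [], []
    for start in range(0, len(signal) - size + 1, hop):
        spec = dft([signal[start + i] * window[i] for i in range(size)])
        amps.append([abs(c) for c in spec])
        phases.append([cmath.phase(c) for c in spec])
    return amps, phases


def remove_known(mixed, known, correct=None, size=16, hop=8):
    """Remove `known` from `mixed` (equal-length sample lists) in the amplitude domain."""
    m_amp, m_phase = stft(mixed, size, hop)      # mixed acoustic signal conversion step
    b_amp, _ = stft(known, size, hop)            # known acoustic signal conversion step
    if correct is not None:                      # correction step (hypothetical hook)
        b_amp = correct(m_amp, b_amp)
    out = [0.0] * len(mixed)
    for f in range(len(m_amp)):
        # Removal step: subtract amplitudes, clipping at zero (an assumption).
        s_amp = [max(0.0, m - b) for m, b in zip(m_amp[f], b_amp[f])]
        # Inverse transform step: reuse the phase of the mixed signal.
        unit = idft([cmath.rect(a, p) for a, p in zip(s_amp, m_phase[f])])
        # Synthesis step: overlap-add of the unit waveforms.
        for i, v in enumerate(unit):
            out[f * hop + i] += v
    return out


mixed = [math.sin(0.3 * t) + 0.5 * math.sin(1.1 * t) for t in range(64)]
unchanged = remove_known(mixed, [0.0] * 64)   # nothing to remove
emptied = remove_known(mixed, mixed)          # known signal equals the mixture
```

With a silent known signal the interior of the output matches the input, and with a known signal identical to the mixture the output is silent. The first and last half-frames are covered by only one window and are therefore attenuated in this simplified version.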

In the present invention, the correction step obtains a corrected amplitude spectrum of the known acoustic signal in which at least one of the shift in temporal position, the temporal change of frequency characteristics, the temporal change of volume, the expansion or contraction along the time axis, and the expansion or contraction along the frequency axis of the amplitude spectrum of the known acoustic signal relative to that of the mixed acoustic signal has been corrected, and this corrected amplitude spectrum is removed from the amplitude spectrum of the mixed acoustic signal. A known acoustic signal contained in the mixed acoustic signal as non-stationary noise can therefore be removed with high accuracy. Ideally, every one of these phenomena or changes that actually occurred in the mixed acoustic signal should be corrected. However, correcting even one of them improves the removal accuracy compared with correcting nothing, so not all of the necessary corrections have to be performed. All of the necessary corrections may, of course, be performed.

In the correction step, the temporal position of the known acoustic signal contained in the mixed acoustic signal can be estimated, and the shift in temporal position of the amplitude spectrum of the known acoustic signal can be corrected based on the estimated position. For example, the distance (similarity) between a given section of the amplitude spectrum of the mixed acoustic signal and a given section of the amplitude spectrum of the known acoustic signal is computed, and the section with the smallest distance is taken as the temporal position of the known acoustic signal within the mixed acoustic signal. Any estimation method may be used.
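A minimal sketch of such a distance-based search follows. The text leaves the distance measure open, so the squared Euclidean distance used here is an assumption, as are the function names and the toy spectra.

```python
def section_distance(a_frames, b_frames):
    """Sum of squared differences between two equally long runs of amplitude spectra."""
    return sum((a - b) ** 2
               for fa, fb in zip(a_frames, b_frames)
               for a, b in zip(fa, fb))


def estimate_position(mixed_amp, known_amp):
    """Return the frame offset in mixed_amp at which known_amp fits best."""
    length = len(known_amp)
    best_offset, best_dist = 0, float("inf")
    for offset in range(len(mixed_amp) - length + 1):
        d = section_distance(mixed_amp[offset:offset + length], known_amp)
        if d < best_dist:
            best_offset, best_dist = offset, d
    return best_offset


# Toy spectra (2 bins per frame): the known section appears, slightly altered,
# starting at frame 2 of the mixed signal.
known_amp = [[1.0, 0.0], [0.0, 2.0], [3.0, 1.0]]
mixed_amp = [[0.0, 0.0], [0.2, 0.1], [1.1, 0.0], [0.0, 2.2], [3.0, 1.3], [0.0, 0.0]]
offset = estimate_position(mixed_amp, known_amp)
```

An exhaustive search like this is quadratic in the signal lengths; it is shown only because it maps directly onto the description of comparing sections and keeping the closest one.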

In the correction step, the change in the frequency characteristics of the known acoustic signal contained in the mixed acoustic signal can also be estimated, and the temporal change of the frequency characteristics of the amplitude spectrum of the known acoustic signal can be corrected based on the estimate. The change in frequency characteristics can be estimated, for example, by identifying a section of the mixed acoustic signal that contains only the known acoustic signal and comparing the frequency characteristics of that section with those of the corresponding section of the known acoustic signal. Any estimation method may be used.
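One way to picture this comparison is a per-bin gain computed over a section in which only the known signal sounds. The sketch below assumes that the frequency characteristic can be modeled as one multiplicative gain per frequency bin within the section; this model, the function names, and the numbers are illustrative assumptions. To follow a characteristic that changes over time, the estimate could be repeated for several sections.

```python
def estimate_frequency_response(mixed_section, known_section, eps=1e-12):
    """Per-bin gain g[k] such that mixed is roughly g[k] * known over a known-only section."""
    n_bins = len(known_section[0])
    gains = []
    for k in range(n_bins):
        num = sum(frame[k] for frame in mixed_section)
        den = sum(frame[k] for frame in known_section)
        # Leave bins with no energy in the known signal untouched.
        gains.append(num / den if den > eps else 1.0)
    return gains


def apply_frequency_response(known_amp, gains):
    """Scale every frame of the known amplitude spectrum by the per-bin gains."""
    return [[g * b for g, b in zip(gains, frame)] for frame in known_amp]


# Toy section: the first bin is attenuated by half in the mixture.
known_section = [[2.0, 4.0], [2.0, 4.0]]
mixed_section = [[1.0, 4.0], [1.0, 4.0]]
gains = estimate_frequency_response(mixed_section, known_section)
corrected = apply_frequency_response(known_section, gains)
```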

In the correction step, the temporal change of the volume of the known acoustic signal contained in the mixed acoustic signal can also be estimated, and the temporal change of the volume of the amplitude spectrum of the known acoustic signal can be corrected based on the estimate. After the frequency characteristics have been corrected, the temporal change of volume can be estimated, for example, by identifying at each time a frequency band in which the amplitude of the mixed acoustic signal corresponds to the known acoustic signal, and comparing the amplitude of the mixed acoustic signal with that of the known acoustic signal in that band. Any estimation method may be used.
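The text does not say how the frequency band is chosen. The sketch below uses one hypothetical heuristic: in each frame, among the bins where the known signal has non-negligible amplitude, the bin with the smallest ratio of mixed amplitude to known amplitude is assumed to be the one least affected by other sounds, and that ratio is taken as the volume. The heuristic, the `floor` threshold, and the names are assumptions for illustration only.

```python
def estimate_volume(mixed_amp, known_amp, floor=0.1):
    """Per-frame gain: the smallest mixed/known ratio over bins where known exceeds floor."""
    gains = []
    for m_frame, b_frame in zip(mixed_amp, known_amp):
        ratios = [m / b for m, b in zip(m_frame, b_frame) if b > floor]
        # If the known signal is silent in this frame, there is nothing to scale.
        gains.append(min(ratios) if ratios else 0.0)
    return gains


def apply_volume(known_amp, gains):
    """Scale each frame of the known amplitude spectrum by its estimated gain."""
    return [[g * b for b in frame] for g, frame in zip(gains, known_amp)]


# Toy spectra: the known signal is at half volume in the mixture, and another
# sound adds energy to some bins.
known_amp = [[2.0, 4.0, 0.0], [1.0, 2.0, 0.0]]
mixed_amp = [[1.0, 5.0, 3.0], [1.0, 1.0, 0.5]]
volume = estimate_volume(mixed_amp, known_amp)
scaled = apply_volume(known_amp, volume)
```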

In the correction step, the expansion or contraction along the time axis of the known acoustic signal contained in the mixed acoustic signal can also be estimated, and the expansion or contraction along the time axis of the amplitude spectrum of the known acoustic signal can be corrected based on the estimate. The expansion or contraction along the time axis can be estimated, for example, by identifying a section of the mixed acoustic signal that contains only the known acoustic signal and comparing its time axis with that of the corresponding section of the known acoustic signal. Alternatively, the time axis may be divided into short sections and the estimate made by comparing all of the sections. Any estimation method may be used.
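As a simple illustration of the first approach, the stretch factor can be taken as the ratio of the two section lengths, and the known amplitude spectrum can then be resampled along the time axis. The sketch below assumes a uniform stretch and resamples by picking the source frame at the floored index; both choices, and the names, are assumptions.

```python
def estimate_time_stretch(mixed_start, mixed_end, known_start, known_end):
    """Ratio of the section length in the mixed signal to that in the known signal."""
    return (mixed_end - mixed_start) / (known_end - known_start)


def stretch_time(known_amp, ratio):
    """Resample frames along time; output frame i copies source frame floor(i / ratio)."""
    n_out = max(1, round(len(known_amp) * ratio))
    last = len(known_amp) - 1
    return [known_amp[min(last, int(i / ratio))] for i in range(n_out)]


# Toy example: a 4-frame section of the known signal spans 6 frames in the mixture.
ratio = estimate_time_stretch(10, 16, 0, 4)
stretched = stretch_time([[1.0], [2.0], [3.0], [4.0]], ratio)
```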

In the correction step, the expansion or contraction along the frequency axis of the known acoustic signal contained in the mixed acoustic signal can also be estimated, and the expansion or contraction along the frequency axis of the amplitude spectrum of the known acoustic signal can be corrected based on the estimate. The expansion or contraction along the frequency axis can be estimated, for example, by identifying a section of the mixed acoustic signal that contains only the known acoustic signal and comparing its frequency axis with that of the corresponding section of the known acoustic signal. Any estimation method may be used.
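One hypothetical way to carry out this comparison is to warp the known amplitude spectrum by each of several candidate stretch factors and keep the factor whose warped spectrum is closest to the mixed section. The candidate search, the linear interpolation between bins, the squared-difference distance, and all names below are assumptions, not details taken from the text.

```python
def stretch_frequency(frame, ratio):
    """Warp one amplitude spectrum so that source bin k moves to bin k * ratio."""
    out = []
    for k in range(len(frame)):
        src = k / ratio
        lo = int(src)
        frac = src - lo
        # Bins beyond the end of the source spectrum are treated as zero.
        a = frame[lo] if lo < len(frame) else 0.0
        b = frame[lo + 1] if lo + 1 < len(frame) else 0.0
        out.append((1.0 - frac) * a + frac * b)
    return out


def estimate_frequency_stretch(mixed_section, known_section, candidates):
    """Return the candidate ratio whose warped known spectrum is closest to the mixture."""
    def dist(ratio):
        return sum((m - w) ** 2
                   for m_frame, b_frame in zip(mixed_section, known_section)
                   for m, w in zip(m_frame, stretch_frequency(b_frame, ratio)))
    return min(candidates, key=dist)


# Toy example: a peak at bin 2 of the known signal appears at bin 4 in the mixture.
known_section = [[0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]]
mixed_section = [stretch_frequency(known_section[0], 2.0)]
best_ratio = estimate_frequency_stretch(mixed_section, known_section, [0.5, 1.0, 2.0])
```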

The method of the present invention may further include an image display step of displaying the amplitude spectrum of the mixed acoustic signal and the amplitude spectrum of the known acoustic signal as images so that they can be recognized visually. In this case, a human determines, based on the displayed images, the section of the mixed acoustic signal that contains the known acoustic signal, and the correction step, the removal step, the inverse transform step, or the synthesis step is executed for this section.

The method of the present invention may further include a sound reproduction step of reproducing the mixed acoustic signal, the known acoustic signal, and the output signal of the synthesis step as sound. In this case, a human determines, based on the sound reproduced in the sound reproduction step, the section of the mixed acoustic signal that contains the known acoustic signal, and the correction step, the removal step, the inverse transform step, and the synthesis step are executed for this section.

The section of the mixed acoustic signal that contains the known acoustic signal can also be estimated automatically based on the amplitude spectrum of the mixed acoustic signal, and the correction step, the removal step, the inverse transform step, and the synthesis step can be executed for this section. When the known acoustic signal is contained relatively clearly in the mixed acoustic signal (for example, when there is a section of the mixed acoustic signal in which the known acoustic signal sounds alone), the section can be identified by automatic estimation. If automatic estimation can be used, the work of removing the known acoustic signal can be carried out quickly. When the presence of the known acoustic signal in the mixed acoustic signal is not very clear, a human may of course specify the section instead.

Furthermore, when there are a plurality of known acoustic signals corresponding to acoustic signals contained in the mixed acoustic signal, the known acoustic signal conversion step and the correction step are executed for all of these known acoustic signals, the removal step is executed to remove all of the corrected amplitude spectra of the known acoustic signals from the amplitude spectrum of the mixed acoustic signal, and the inverse transform step and the synthesis step are executed using the resulting post-removal amplitude spectrum. In this way, all of the plurality of known acoustic signals can be removed from the mixed acoustic signal.
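The removal step for several known signals can be sketched as a loop over their corrected amplitude spectra. The sketch assumes that all spectra are already aligned frame by frame with the mixed spectrum and that the result is clipped at zero after each subtraction; the function name and the numbers are hypothetical.

```python
def remove_all(mixed_amp, corrected_known_amps):
    """Subtract every corrected known amplitude spectrum from the mixed one, in turn."""
    result = [list(frame) for frame in mixed_amp]
    for known_amp in corrected_known_amps:
        result = [[max(0.0, s - b) for s, b in zip(s_frame, b_frame)]
                  for s_frame, b_frame in zip(result, known_amp)]
    return result


# Toy example with one frame, two bins, and two known signals.
mixed_amp = [[5.0, 3.0]]
known_a = [[1.0, 1.0]]
known_b = [[2.0, 4.0]]
remaining = remove_all(mixed_amp, [known_a, known_b])
```

The post-removal spectrum returned here would then be passed once to the inverse transform and synthesis steps, using the phase of the mixed acoustic signal.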

When the correction step is executed, an interface can be used that allows a human to specify manually the correction of at least one of the shift in temporal position, the temporal change of frequency characteristics, the temporal change of volume, the expansion or contraction along the time axis, and the expansion or contraction along the frequency axis.

This interface can be configured so that a human can specify the exact position of the known acoustic signal within the mixed acoustic signal when the component of the known acoustic signal is removed from a mixed acoustic signal in which a plurality of acoustic signals are mixed.

This interface can also be configured so that, when the frequency characteristics of the known acoustic signal change over time within the mixed acoustic signal, a human can specify those changes. Likewise, it can be configured so that, when the volume of the known acoustic signal changes over time within the mixed acoustic signal, a human can specify those changes.

Furthermore, this interface can be configured so that, when the known acoustic signal is expanded or contracted along the time axis or the frequency axis within the mixed acoustic signal, a human can specify that expansion or contraction.

This interface can also be configured so that a human can specify the corresponding sections of the mixed acoustic signal and the known acoustic signal.

The known acoustic signal removal device of the present invention comprises: mixed acoustic signal conversion means for converting the mixed acoustic signal into a time-frequency representation to obtain the amplitude spectrum and the phase of the mixed acoustic signal; known acoustic signal conversion means for converting a known acoustic signal corresponding to an acoustic signal contained in the mixed acoustic signal into a time-frequency representation to obtain the amplitude spectrum of the known acoustic signal; correction means for obtaining, based on the amplitude spectrum of the mixed acoustic signal, a corrected amplitude spectrum of the known acoustic signal in which at least one of the shift in temporal position, the temporal change of frequency characteristics, the temporal change of volume, the expansion or contraction along the time axis, and the expansion or contraction along the frequency axis of the amplitude spectrum of the known acoustic signal relative to that of the mixed acoustic signal has been corrected; removal means for removing the corrected amplitude spectrum of the known acoustic signal from the amplitude spectrum of the mixed acoustic signal; inverse transform means for obtaining unit waveforms by performing the inverse transform into a time-domain representation based on the post-removal amplitude spectrum obtained by the removal means and the phase of the mixed acoustic signal; and synthesis means for synthesizing the unit waveforms to obtain an acoustic signal from which the component of the known acoustic signal has been removed.

The correction means can be provided with an interface that allows a human to specify manually the correction of at least one of the shift in temporal position, the temporal change of frequency characteristics, the temporal change of volume, the expansion or contraction along the time axis, and the expansion or contraction along the frequency axis of the amplitude spectrum of the known acoustic signal relative to that of the mixed acoustic signal. This interface preferably includes an image display unit that displays the amplitude spectrum of the mixed acoustic signal and the amplitude spectrum of the known acoustic signal as images so that they can be compared visually, and a sound reproduction unit that reproduces the mixed acoustic signal, the known acoustic signal, and the output signal of the synthesis means as sound. With this interface, based on the images of the two amplitude spectra shown on the image display unit and/or the sound reproduced by the sound reproduction unit, a human can not only specify the section of the mixed acoustic signal that contains the known acoustic signal but also manually specify, for that section, the correction of at least one of the shift in temporal position, the temporal change of frequency characteristics, the temporal change of volume, the expansion or contraction along the time axis, and the expansion or contraction along the frequency axis of the amplitude spectrum of the known acoustic signal. As a result, the known acoustic signal can be removed with high accuracy even when the way it is contained in the mixed acoustic signal is somewhat complicated.

The image display unit is preferably configured so that the amplitude spectrum of the section of the mixed acoustic signal that contains the known acoustic signal and the corrected amplitude spectrum, obtained by correcting at least one of the shift in temporal position, the temporal change of frequency characteristics, the temporal change of volume, the expansion or contraction along the time axis, and the expansion or contraction along the frequency axis of the amplitude spectrum of the corresponding section of the known acoustic signal, can be displayed aligned on the time axis. The state of the corrected amplitude spectrum can then be checked visually, so the user can judge from the images how the corrected spectrum should be adjusted to improve the removal accuracy, which speeds up the removal work.

The image display unit is also preferably configured to display the amplitude spectrum of the acoustic signal obtained by removing the corrected amplitude spectrum from the amplitude spectrum of the mixed acoustic signal. Because the effect of a correction can then be checked in the image, the known acoustic signal can be removed from the mixed acoustic signal as far as possible while corrections are made by trial and error.

The program of the present invention is configured to cause a computer used in the known acoustic signal removal apparatus to execute: a mixed acoustic signal conversion step of converting the mixed acoustic signal into a time-frequency representation to obtain the amplitude spectrum and the phase of the mixed acoustic signal; a known acoustic signal conversion step of converting a known acoustic signal, corresponding to an acoustic signal contained in the mixed acoustic signal, into a time-frequency representation to obtain the amplitude spectrum of the known acoustic signal; a correction step of obtaining, based on the amplitude spectrum of the mixed acoustic signal, a corrected amplitude spectrum of the known acoustic signal in which at least one of the temporal position shift, the temporal change in frequency characteristics, the temporal change in volume, the time-axis expansion/contraction, and the frequency-axis expansion/contraction of the amplitude spectrum of the known acoustic signal relative to that of the mixed acoustic signal has been corrected; a removal step of removing the corrected amplitude spectrum of the known acoustic signal from the amplitude spectrum of the mixed acoustic signal; an inverse conversion step of performing an inverse conversion to the time representation, based on the post-removal amplitude spectrum obtained in the removal step and the phase of the mixed acoustic signal, to obtain unit waveforms; and a synthesis step of synthesizing the unit waveforms to obtain an acoustic signal from which the known acoustic signal component has been removed.

A block diagram showing the configuration of an example embodiment of the known acoustic signal removal apparatus of the present invention.
A diagram showing the steps performed when carrying out the known acoustic signal removal method of the present invention.
A flowchart showing an example of the algorithm of a program used when the main part of the known acoustic signal removal apparatus of the present invention is implemented on a computer.
A flowchart showing the detailed steps within step ST103.
A flowchart showing the details of the steps when estimation can be performed both with human involvement and automatically.
A diagram showing the screen layout of the editor interface.
A diagram showing the temporal change in the power of the mixed acoustic signal.
A diagram showing the temporal change in the amplitude spectrum of the mixed acoustic signal.
A diagram showing the temporal change in the power of the known acoustic signal of the sound source from which the BGM originates.
A diagram showing the temporal change in the amplitude spectrum of the known acoustic signal of the sound source from which the BGM originates.
A diagram showing the temporal change in the power of the desired acoustic signal after removal of the known acoustic signal.
A diagram showing the temporal change in the amplitude spectrum of the desired acoustic signal after removal of the known acoustic signal.

An example embodiment of the present invention is described in detail below with reference to the drawings. FIG. 1 is a block diagram showing the configuration of an embodiment of the known acoustic signal removal apparatus of the present invention, which carries out the known acoustic signal removal method of the present invention. The apparatus comprises mixed acoustic signal conversion means 1, known acoustic signal conversion means 2, correction means 3, an interface 4, removal means 5, inverse conversion means 6, and synthesizing means 7.

The mixed acoustic signal conversion means 1 receives a mixed acoustic signal m(t), in which an acoustic signal b(t) such as BGM is mixed with an acoustic signal s(t) such as a desired voice or other sound (t is the time axis), converts it into a time-frequency representation, and obtains the amplitude spectrum M(ω, t) and the phase θm(ω, t) of the mixed acoustic signal. At this point s(t) and b(t) are unknown, and only m(t) is input. The known acoustic signal conversion means 2 converts the known acoustic signal b'(t) of the sound source from which the acoustic signal b(t) to be removed originates into a time-frequency representation and obtains the amplitude spectrum B'(ω, t) of the known acoustic signal.

Based on M(ω, t), the correction means 3 obtains the corrected amplitude spectrum B(ω, t) of the known acoustic signal, in which the temporal position shift, the temporal change in frequency characteristics, the temporal change in volume, the time-axis expansion/contraction, and the frequency-axis expansion/contraction of B'(ω, t) relative to M(ω, t) have been corrected. For automation, the correction means 3 can be configured to estimate and correct all of these automatically. In this embodiment, however, the correction means 3 is configured so that a human can manually specify all of these corrections through the interface 4. As described in detail later, the interface 4 includes an image display unit that displays the amplitude spectrum of the mixed acoustic signal and that of the known acoustic signal so that they can be compared visually. The interface 4 is configured so that, based on these two amplitude spectra, a human can specify the section of the mixed acoustic signal that contains the known acoustic signal and can specify the corrections described above.

The removal means 5 removes the corrected amplitude spectrum B(ω, t) of the known acoustic signal from the amplitude spectrum M(ω, t) of the mixed acoustic signal. The inverse conversion means 6 performs an inverse conversion to the time representation, based on the post-removal amplitude spectrum S(ω, t) obtained by the removal means 5 and the phase θm(ω, t) of the mixed acoustic signal, to obtain unit waveforms s'(t). Finally, the synthesizing means 7 synthesizes the unit waveforms s'(t) output from the inverse conversion means 6 to obtain the acoustic signal s(t) from which the known acoustic signal component has been removed. The interface 4 displays the post-removal amplitude spectrum S(ω, t) output from the removal means 5 on the image display unit (see FIG. 6). The interface 4 also has a built-in sound reproduction unit, which plays back the mixed acoustic signal, the known acoustic signal, and the synthesized acoustic signal output from the synthesizing means 7. With this configuration, the effect of a correction can be checked visually on the image display unit and aurally through the sound reproduction unit, so a human can specify the necessary corrections by trial and error while watching the display of the interface 4, and the known acoustic signal can be removed from the mixed acoustic signal as far as possible.

Next, a more detailed example embodiment of the present invention is described with reference to FIGS. 2 and 3. FIG. 2 shows the steps performed when carrying out the known acoustic signal removal method of the present invention, and FIG. 3 is a flowchart showing an example of the algorithm of a program used when the main part of the known acoustic signal removal apparatus of the present invention is implemented on a computer.

FIG. 4 is a flowchart showing the detailed steps within step ST103. FIG. 5 is a flowchart showing the details of the steps when estimation can be performed both with human involvement and automatically. The signal removal operation of the method and apparatus of the present invention is described below with reference to FIGS. 1 to 5.

In the following description, it is assumed that a mixed acoustic signal m(t) is observed, in which an acoustic signal b(t) such as BGM is mixed with an acoustic signal s(t) such as a desired voice or other sound (t is the time axis).

The problem solved here is to obtain the unknown s(t) when m(t) is given, under the condition that the acoustic signal b'(t) of the sound source from which b(t) originates is known. For example, the input is the acoustic signal m(t) of a television program or the like in which BGM plays together with human voices and other sounds. When the BGM piece is known and its acoustic signal b'(t) can be prepared separately, that music signal is used to remove the BGM from the program and obtain an acoustic signal s(t) containing only the voices and other sounds.

Because b(t) and b'(t) do not match exactly, the processing that corresponds to subtracting b(t) from m(t) requires estimating the component corresponding to b(t) from b'(t) in order to obtain s(t). Specifically, the known acoustic signal b'(t) usually appears in the mixed sound m(t) with deformations such as those listed below, so the component corresponding to b(t) is estimated by correcting for them. The main targets of correction are the temporal position shift, the temporal change in frequency characteristics, the temporal change in volume, and expansion/contraction along the time axis or the frequency axis.

(Temporal position shift)
The known acoustic signal b'(t) does not necessarily start playing at the beginning of the mixed sound m(t). It is therefore necessary to shift b'(t) along the time axis so that the relative positions of the two signals match, and then subtract the known acoustic signal from the mixed sound.

(Temporal change in frequency characteristics)
When the known acoustic signal b'(t) plays in the mixed sound m(t), its frequency characteristics have often been changed by a graphic equalizer or similar processing. For example, the low or high range may be boosted or attenuated. It is therefore necessary to change the frequency characteristics of b'(t) in the same way as a correction, and then subtract the known acoustic signal from the mixed sound.

(Temporal change in volume)
When the known acoustic signal b'(t) plays in the mixed sound m(t), the mixing ratio has often been changed by operating a mixer fader or the like when the mix was made, so the volume varies over time. It is therefore necessary to vary the volume of b'(t) over time in the same way as a correction, and then subtract the known acoustic signal from the mixed sound.

(Expansion/contraction along the time axis or frequency axis)
When the known acoustic signal b'(t) plays in the mixed sound m(t), it may have been expanded or contracted along the time axis or the frequency axis, for example because of a difference in the rotation speed of a record. It is therefore necessary to expand or contract b'(t) along the time axis or the frequency axis as a correction, and then subtract the known acoustic signal from the mixed sound.

In the method of the present invention, as shown in FIG. 2, the mixed acoustic signal is first Fourier transformed in step ST1 to obtain the phase of the mixed acoustic signal (step ST2) and the amplitude spectrum of the mixed acoustic signal (step ST3) (mixed acoustic signal conversion step). Next, in step ST4, the known acoustic signal corresponding to an acoustic signal contained in the mixed acoustic signal is Fourier transformed to obtain the amplitude spectrum of the known acoustic signal (step ST5) (known acoustic signal conversion step). In step ST6, based on the amplitude spectrum of the mixed acoustic signal, the corrected amplitude spectrum of the known acoustic signal (step ST7) is obtained by correcting at least one of the temporal position shift, the temporal change in frequency characteristics, the temporal change in volume, the time-axis expansion/contraction, and the frequency-axis expansion/contraction of the amplitude spectrum of the known acoustic signal relative to that of the mixed acoustic signal (correction step). In step ST8, the corrected amplitude spectrum of the known acoustic signal is removed from the amplitude spectrum of the mixed acoustic signal to obtain the post-removal amplitude spectrum (step ST9) (removal step). In step ST10, an inverse Fourier transform is performed based on the post-removal amplitude spectrum and the phase of the mixed acoustic signal to obtain unit waveforms (inverse conversion step). Finally, in step ST11, the unit waveforms are synthesized by the overlap-add method to obtain an acoustic signal from which the known acoustic signal component has been removed (synthesis step).
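The steps ST1 to ST11 can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the patented implementation: the helper names (`stft`, `overlap_add`, `remove_known`) are hypothetical, the corrected spectrum B is taken as an already computed input, and flooring the difference at zero is an assumption borrowed from conventional spectral subtraction.

```python
import numpy as np

def stft(x, win, hop):
    """Complex STFT frames, shape (n_frames, n_bins), of a 1-D signal."""
    n = len(win)
    n_frames = 1 + (len(x) - n) // hop
    frames = np.stack([x[i * hop:i * hop + n] * win for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)

def overlap_add(unit_waveforms, hop):
    """Place each unit waveform at its frame position and sum the overlaps."""
    n_frames, n = unit_waveforms.shape
    out = np.zeros((n_frames - 1) * hop + n)
    for i, w in enumerate(unit_waveforms):
        out[i * hop:i * hop + n] += w
    return out

def remove_known(m, B, win, hop):
    """Subtract a corrected amplitude spectrum B from the mixture m.

    B must have the same shape as the STFT of m. The result carries the
    gain of the summed analysis windows; divide by it if unity gain is needed.
    """
    X_m = stft(m, win, hop)                          # ST1
    M, theta_m = np.abs(X_m), np.angle(X_m)          # ST3, ST2
    S = np.maximum(M - B, 0.0)                       # ST8/ST9 (zero floor assumed)
    X_s = S * np.exp(1j * theta_m)                   # reuse the mixture phase
    units = np.fft.irfft(X_s, n=len(win), axis=1)    # ST10
    return overlap_add(units, hop)                   # ST11
```

With B set to zero the function reconstructs the input up to the window-sum gain, which is a convenient sanity check before any correction is applied.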

In the algorithm of FIG. 3, the mixed acoustic signal is Fourier transformed in step ST101 to obtain the amplitude spectrum and the phase of the mixed acoustic signal. In step ST102, the known acoustic signal corresponding to an acoustic signal contained in the mixed acoustic signal is Fourier transformed to obtain the amplitude spectrum of the known acoustic signal. In step ST103, based on the amplitude spectrum of the mixed acoustic signal, the corrected amplitude spectrum of the known acoustic signal is obtained by correcting at least one of the temporal position shift, the temporal change in frequency characteristics, the temporal change in volume, the time-axis expansion/contraction, and the frequency-axis expansion/contraction of the amplitude spectrum of the known acoustic signal relative to that of the mixed acoustic signal. In step ST104, the corrected amplitude spectrum of the known acoustic signal is removed from the amplitude spectrum of the mixed acoustic signal to obtain the post-removal amplitude spectrum. In step ST105, an inverse Fourier transform is performed based on the post-removal amplitude spectrum obtained in step ST104 and the phase of the mixed acoustic signal to obtain unit waveforms, and in step ST106 the unit waveforms are synthesized by the overlap-add method to obtain an acoustic signal from which the known acoustic signal component has been removed. In step ST107, it is then determined whether the user judges the acoustic signal after removal to be satisfactory. If not, the process returns to step ST103 and the correction is redone. Steps ST103 to ST107 are repeated until the user is satisfied.

The processing performed in each step is described in more detail below. In the method of this embodiment, the subtraction is not performed on waveforms in the time domain but on amplitude spectra in the time-frequency domain. The short-time Fourier transforms (STFT) Xm(ω, t) and Xb'(ω, t) of the acoustic signals m(t) and b'(t) at time t are computed using a window function h(t), and the amplitude spectra M(ω, t) and B'(ω, t) are obtained from them.
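For reference, conventional definitions consistent with this description can be written as follows. The integral form and the sign convention are assumptions (a standard STFT with window h), not quoted from the source; the amplitude spectrum is taken as the magnitude and the phase as the argument of the STFT.

```latex
\begin{aligned}
X_m(\omega, t) &= \int_{-\infty}^{\infty} m(\tau)\, h(\tau - t)\, e^{-j\omega\tau}\, d\tau, &
X_{b'}(\omega, t) &= \int_{-\infty}^{\infty} b'(\tau)\, h(\tau - t)\, e^{-j\omega\tau}\, d\tau, \\
M(\omega, t) &= \lvert X_m(\omega, t) \rvert, &
B'(\omega, t) &= \lvert X_{b'}(\omega, t) \rvert, \\
\theta_m(\omega, t) &= \arg X_m(\omega, t).
\end{aligned}
```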

In the current implementation, the acoustic signal is A/D converted at a sampling frequency of 44.1 kHz with 16-bit quantization, and the short-time Fourier transform (STFT), using a Hanning window 8192 points wide as the window function h(t), is computed with the fast Fourier transform (FFT). The FFT frame is shifted by 441 points at a time, so the frame shift time (one frame shift) is 10 ms. This frame shift is used as the time unit of the processing.
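The figures quoted above follow directly from the sampling parameters. The short calculation below makes the arithmetic explicit; the variable names are illustrative, and the bin count assumes a one-sided (real-input) FFT.

```python
fs = 44_100        # sampling frequency in Hz
win_len = 8192     # Hanning window width in samples
hop = 441          # frame shift in samples

frame_shift_ms = 1000 * hop / fs     # 441 / 44100 s = 10 ms per frame
window_ms = 1000 * win_len / fs      # about 185.8 ms of signal per frame
n_bins = win_len // 2 + 1            # 4097 bins for a one-sided FFT (assumption)
bin_hz = fs / win_len                # about 5.38 Hz between adjacent bins
```

Because the window is far longer than the shift, consecutive frames overlap by roughly 95 percent, which gives fine frequency resolution while keeping a 10 ms time grid.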

The amplitude spectrum S(ω, t) of the desired acoustic signal s(t) after removal of the known acoustic signal is obtained from the amplitude spectra M(ω, t) and B'(ω, t) by [Equation 11], in which B(ω, t), given by [Equation 12], is the amplitude spectrum obtained by correcting B'(ω, t).
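A hedged sketch of the two formulas referred to here. The expression for B is the one quoted later in the description of the p(ω), q(t) estimation. The expression for S, with the difference floored at zero and scaled by c(ω, t), is an assumption in the style of conventional spectral subtraction and may differ from the published formula.

```latex
\begin{aligned}
B(\omega, t) &= a(t)\, g(\omega, t)\, B'\bigl(p(\omega),\, q(t) + r(t)\bigr), \\
S(\omega, t) &= c(\omega, t)\, \max\bigl(M(\omega, t) - B(\omega, t),\ 0\bigr).
\end{aligned}
```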

The parameter functions a(t), g(ω, t), p(ω), q(t), r(t), and c(ω, t) in the above formulas are described in turn.

a(t) is a function of arbitrary shape for making the final adjustment to the amount by which the component corresponding to the amplitude spectrum of the known acoustic signal is subtracted from the amplitude spectrum of the mixed sound. Normally a(t) ≥ 1. The larger a(t) is, the larger the amount subtracted.

g(ω, t) is a function for correcting the temporal change in frequency characteristics and the temporal change in volume, and is defined by [Equation 13] in terms of gω(ω, t), gt(t), and gr(t). Here gω(ω, t) represents the temporal change in frequency characteristics, and gω(ω, t) = 1 when the frequency characteristics do not change. gt(t) represents the temporal change in volume and is a constant when the volume does not change. The volume difference between M(ω, t) and B'(ω, t) is basically corrected by gt(t). gr(t) is a function mainly for raising the value of g(ω, t) as a whole and is used for fine adjustment during correction. When it is not used, gr(t) = 0.
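One form of the definition that is consistent with the roles described here is given below. It is an assumption: a product of the frequency-characteristic term and the volume term, plus an additive offset that lifts g as a whole and vanishes when gr(t) = 0.

```latex
g(\omega, t) = g_{\omega}(\omega, t)\, g_{t}(t) + g_{r}(t)
```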

p(ω) is a function for correcting expansion/contraction in the frequency-axis direction. By transforming the frequency axis ω of the amplitude spectrum B'(ω, t), it allows linear or nonlinear expansion/contraction along the frequency axis. B'(ω, t) takes the value 0 outside the original domain of ω, and is interpolated as appropriate in a discretized implementation.

q(t) is a function for correcting expansion/contraction in the time-axis direction. By transforming the time axis t of the amplitude spectrum B'(ω, t), it allows linear or nonlinear expansion/contraction along the time axis. B'(ω, t) takes the value 0 outside the original domain of t, and is interpolated as appropriate in a discretized implementation.

r(t) is a function for correcting the temporal position shift. Normally it is set to a constant, which corrects a fixed shift. When the shift varies over time, a function that corrects the shift at each time is set. B'(ω, t) takes the value 0 outside the original domain of t, and is interpolated as appropriate in a discretized implementation. Although q(t) and r(t) could be expressed as a single combined function, here q(t) is set to represent continuous expansion/contraction and r(t) is set to represent discontinuous position shifts.

c(ω, t) is a function of arbitrary shape for equalizing and fader-operation processing applied to the amplitude spectrum. Its shape in the ω direction adjusts the frequency characteristics after removal of the known acoustic signal, like a graphic equalizer. Its shape in the t direction adjusts the volume change after removal of the known acoustic signal, like operating the volume fader of a mixer. When it is not used, c(ω, t) = 1.
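The way a, g, p, q, and r act together on B'(ω, t) can be sketched on a discretized spectrogram. The function names and the choice of separable linear interpolation are assumptions for illustration; the only relation taken from the description is B(ω, t) = a(t) g(ω, t) B'(p(ω), q(t) + r(t)), with B' equal to 0 outside its original domain.

```python
import numpy as np

def warp_spectrogram(B_prime, p, q_plus_r):
    """Sample B'(p(w), q(t) + r(t)) by linear interpolation, zero outside the domain.

    B_prime:  (n_frames, n_bins) amplitude spectrogram of the known signal
    p:        (n_bins_out,) fractional source bin index for each output bin
    q_plus_r: (n_frames_out,) fractional source frame index for each output frame
    """
    n_frames, n_bins = B_prime.shape
    bins = np.arange(n_bins)
    frames = np.arange(n_frames)
    # Frequency-axis warp, applied to every source frame.
    Bf = np.stack([np.interp(p, bins, row, left=0.0, right=0.0) for row in B_prime])
    # Time-axis warp (stretch q plus shift r), applied to every output bin.
    return np.stack(
        [np.interp(q_plus_r, frames, col, left=0.0, right=0.0) for col in Bf.T],
        axis=1,
    )

def corrected_spectrum(B_prime, a, g, p, q, r):
    """B(w, t) = a(t) * g(w, t) * B'(p(w), q(t) + r(t))."""
    return a[:, None] * g * warp_spectrogram(B_prime, p, q + r)
```

With identity maps (p and q equal to the index grids, r = 0) and unit gains, the function returns B' unchanged; a constant r shifts the known signal along the time axis and fills the uncovered frames with zeros.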

Xs(ω, t) is obtained from the amplitude spectrum S(ω, t) found in this way and the phase θm(ω, t) of the mixed sound m(t), and its inverse Fourier transform (IFFT) gives the unit waveform s'(t).
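Combining an amplitude with a borrowed phase is conventionally written in polar form. The expression below is that standard form, given as an assumption about the omitted formula rather than a quotation of it.

```latex
X_s(\omega, t) = S(\omega, t)\, e^{\,j\,\theta_m(\omega, t)}
```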

The unit waveforms s'(t) are placed by the overlap-add method to synthesize the desired acoustic signal s(t) after removal of the known acoustic signal.

The case in which the mixed acoustic signal m(t) contains one kind of known acoustic signal b'(t) has been described above. When it contains several, b'1(t), b'2(t), ..., b'N(t), the processing can be extended so that S(ω, t) is obtained using B1(ω, t), B2(ω, t), ..., BN(ω, t), each computed by [Equation 12] from the corresponding amplitude spectrum B'1(ω, t), B'2(ω, t), ..., B'N(ω, t) with its own parameter function settings. In that case, the parameter functions of each Bn(ω, t) are set one after another, or the parameter functions of several Bn(ω, t) are set in parallel while keeping the overall balance.
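A minimal sketch of the extension to several known signals, assuming the corrected spectra are simply summed before the subtraction and the result is floored at zero. Both choices, and the function name `remove_multiple`, are assumptions for illustration.

```python
import numpy as np

def remove_multiple(M, B_list, c=1.0):
    """Subtract the sum of corrected spectra B_1..B_N from M, floored at zero.

    M and every B_n share the shape (n_frames, n_bins); c may be a scalar
    or an array broadcastable to that shape.
    """
    total = np.sum(B_list, axis=0)
    return c * np.maximum(M - total, 0.0)
```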

The description above deals with monaural signals. For a stereo signal, the method may be applied after mixing the left and right channels down to a monaural signal, or it may be applied to the left and right channels separately. It may also be applied using the sound source direction in the stereo signal.

The setting of the parameter functions is described next. When the method of the present invention is applied, the shapes of the parameter functions a(t), g(ω, t) (gω(ω, t), gt(t), gr(t)), p(ω), q(t), r(t), and c(ω, t) of [Equation 11], [Equation 12], and [Equation 13] may be estimated automatically or set manually by a human. Alternatively, a human may revise them after automatic estimation. The following describes a specific automatic estimation method and the use of the interface 4 of a known acoustic signal removal editor, which allows manual revision by a human.

First, a method of estimating the shapes of the parameter functions g(ω, t) (gω(ω, t), gt(t)), p(ω), q(t), and r(t) of [Equation 11], [Equation 12], and [Equation 13] is described with reference to FIG. 4. In step ST201, the set Ψ of BGM sections ψ is specified or automatically estimated. In step ST202, p(ω) and q(t) are automatically estimated, and in step ST203, gω(ω, t), gt(t), and r(t) are automatically estimated. These steps are repeated until the estimated parameter functions converge (step ST204). From step ST205 onward, the correction operation is carried out using the interface 4.

In estimating g(ω, t), the temporal change in frequency characteristics gω(ω, t) is estimated first, and then the temporal change in volume gt(t). However, p(ω), q(t), and r(t) must have been determined before g(ω, t) is estimated. For convenience, B'(p(ω), q(t) + r(t)) is written here as B'(ω, t).

As a rule, the temporal change in frequency characteristics gω(ω, t) is estimated using sections that contain almost none of the acoustic signal s(t) of human voices and other sounds (hereinafter called BGM sections). Several BGM sections may be used. In a BGM section, the amplitude spectrum M(ω, t) of the mixed sound m(t) consists almost entirely of components derived from the amplitude spectrum B'(ω, t), which corresponds to the BGM produced by the known acoustic signal b'(t). Therefore, when the frequency characteristics can be assumed to be stationary, that is, gω(ω, t) = g'ω(ω), g'ω(ω) is estimated from M(ω, t) and B'(ω, t) over the BGM sections, where ψ denotes one BGM section (a region on the time axis) and Ψ denotes the set of all ψ. When the frequency characteristics change over time, an estimate is instead obtained from each BGM section ψ near the time t of gω(ω, t), and gω(ω, t) is estimated by interpolation or extrapolation (when there are BGM sections on both sides, it is interpolated from both sides). Finally, gω(ω, t) is smoothed in the frequency-axis direction. The smoothing width can be set freely, and smoothing may be omitted.
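The stationary case can be illustrated with a short NumPy sketch. The estimator used here, the per-bin ratio of summed amplitudes over all BGM frames, is an assumption chosen for simplicity; the published formula may differ. The function name, the `(start, stop)` section format, and the moving-average smoothing are likewise illustrative.

```python
import numpy as np

def estimate_g_omega_stationary(M, B_prime, bgm_sections, smooth_bins=1, eps=1e-12):
    """Estimate a time-invariant frequency response g'_w(w) from BGM-only frames.

    M, B_prime:   (n_frames, n_bins) amplitude spectrograms, already time-aligned
    bgm_sections: list of (start, stop) frame ranges that contain almost no s(t)
    smooth_bins:  width of an optional moving average along the frequency axis
    """
    idx = np.concatenate([np.arange(start, stop) for start, stop in bgm_sections])
    ratio = M[idx].sum(axis=0) / (B_prime[idx].sum(axis=0) + eps)
    if smooth_bins > 1:
        kernel = np.ones(smooth_bins) / smooth_bins
        # mode="same" pads with zeros, so the edge bins are slightly attenuated.
        ratio = np.convolve(ratio, kernel, mode="same")
    return ratio
```

Frames outside the BGM sections are ignored, so energy from the target sound does not bias the estimate. For time-varying characteristics, the same function could be called per section and the results interpolated over time.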

The time variation of the volume gt(t) is estimated by comparing, at each time, the amplitude of M(ω, t) with that of gω(ω, t)B'(ω, t), the known signal after frequency characteristic correction. However, M(ω, t) contains components derived from s(t) in addition to those derived from B'(ω, t). The frequency axis ω is therefore divided into a plurality of frequency bands, and for each band φ (φ ∈ Φ) the quantity g't(φ, t)

[equation not reproduced in this text]

is obtained (Φ denotes the set of bands φ). Any division may be used for Φ; for example, dividing at every octave of the equal temperament used in music (at equal intervals on the logarithmic frequency axis) works well. Then gt(t) is estimated either as min(g't(φ, t)) or by

[equation not reproduced in this text]

With min(g't(φ, t)), the amplitudes are compared in the frequency band in which M(ω, t) and gω(ω, t)B'(ω, t) are closest. Finally, gt(t) is smoothed along the time axis. The smoothing width can be set arbitrarily, and smoothing may be omitted.
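As a rough illustration of the band-wise comparison, the Python sketch below shows only the min(g't(φ, t)) variant. Because the defining equation is not reproduced in this text, the sketch assumes that g't(φ, t) is the ratio of the summed M(ω, t) to the summed gω(ω, t)B'(ω, t) within band φ. The helper names, the octave split on bin indices, and the exclusion of bin 0 are all hypothetical choices.

```python
def octave_bands(n_bins, first_bin=1):
    """Split bin indices into octave-wide bands: [1,2), [2,4), [4,8), ..."""
    bands = []
    lo = first_bin
    while lo < n_bins:
        hi = min(2 * lo, n_bins)
        bands.append((lo, hi))
        lo = hi
    return bands


def estimate_volume_gain(M, B_corr, bands, eps=1e-12):
    """Per-frame volume gain: minimum over bands of sum(M) / sum(B_corr).

    M: mixture amplitude frames.
    B_corr: known-signal amplitude frames after frequency-characteristic
            correction, i.e. g_omega * B'.
    """
    g_t = []
    for m_frame, b_frame in zip(M, B_corr):
        ratios = []
        for lo, hi in bands:
            b_sum = sum(b_frame[lo:hi])
            if b_sum > eps:  # skip bands where the known signal is silent
                ratios.append(sum(m_frame[lo:hi]) / b_sum)
        # The band least affected by s(t) gives the smallest ratio.
        g_t.append(min(ratios) if ratios else 0.0)
    return g_t
```

Taking the minimum reflects the reasoning in the text: extra energy from s(t) can only raise a band's ratio, so the smallest ratio comes from the band where the mixture is closest to the corrected known signal.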

In estimating p(ω) and q(t), p(ω) and q(t) are adjusted so as to minimize the distance (for example, the log-spectral distance) between M(ω, t) and B(ω, t). For this purpose, a(t) is set to 1 on the right-hand side of B(ω, t) = a(t)g(ω, t)B'(p(ω), q(t) + r(t)), and the following two steps are repeated iteratively until appropriate p(ω) and q(t) are obtained:
1. With the current estimates of p(ω) and q(t) temporarily fixed, estimate g(ω, t) and r(t).
2. With the current estimates of g(ω, t) and r(t) temporarily fixed, estimate p(ω) and q(t).
Rather than running this over the entire acoustic signal at once, it is better to divide the time axis and proceed section by section. Initial values are chosen with the continuity of adjacent sections in mind. It is also advisable to use the set Ψ of BGM sections ψ and estimate p(ω) and q(t) so that the time axes of M(ω, t) and B(ω, t) line up across those sections.

In estimating r(t), the set Ψ of BGM sections ψ is, as a rule, used to find the r(t) that aligns the time axes of M(ω, t) and B(ω, t) in those sections. r(t) is often a constant, but when the known acoustic signal b'(t) was mixed in piecemeal, with some of its sections left unused, r(t) becomes a discontinuous function that skips those sections.
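One simple way to picture the alignment is a search over candidate frame offsets within a BGM section, keeping the offset that minimizes a spectral distance. The Python sketch below takes that approach. It is an assumption made for illustration rather than the procedure stated in the patent, and the function names, the mean-squared log-amplitude distance, and the search range `max_lag` are hypothetical.

```python
import math


def mean_log_distance(M, B, start, end, lag, floor=1e-10):
    """Mean squared log-amplitude difference between M[t] and B[t + lag]."""
    total, count = 0.0, 0
    for t in range(start, end):
        for m, b in zip(M[t], B[t + lag]):
            d = math.log(max(m, floor)) - math.log(max(b, floor))
            total += d * d
            count += 1
    return total / count


def estimate_offset(M, B_known, section, max_lag):
    """Return the frame offset r that best aligns B_known with M in `section`.

    section: (start, end) frame range of one BGM section, end exclusive.
    Offsets that would read outside B_known are skipped.
    """
    start, end = section
    best_lag, best_dist = None, math.inf
    for lag in range(-max_lag, max_lag + 1):
        if start + lag < 0 or end + lag > len(B_known):
            continue
        dist = mean_log_distance(M, B_known, start, end, lag)
        if dist < best_dist:
            best_lag, best_dist = lag, dist
    return best_lag
```

Running `estimate_offset` separately on each BGM section would yield a piecewise-constant r(t), which matches the discontinuous case described above when different sections return different offsets.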

The estimates of g(ω, t), r(t), and the other functions above rely on the set Ψ of BGM sections ψ. This set may be specified manually by a person, or sections found by automatic estimation may be added to a manually specified set. FIG. 5 is a flowchart of the algorithm of a program that handles both manual specification and automatic estimation. For automatic estimation, steps ST302 to ST313 of FIG. 5 are executed. Automatic estimation of Ψ basically starts from a single BGM section ψ1 and uses it as a clue to find the remaining BGM sections. The first section ψ1 is either specified manually or found by dividing the time axis of the acoustic signals finely and determining which short segments correspond. When ψ1 is not specified manually, B(ω, t) is computed provisionally (step ST302), and the distance (which corresponds to a similarity) between the amplitude spectra of finely divided time windows of M(ω, t) and B(ω, t) is calculated (step ST303). The correspondence of the time windows with the minimum distance is then examined (step ST304), and the section containing that result is set as ψ1, which forms the initial Ψ (step ST305).

Next, based on the Ψ containing ψ1, the parameter functions of B(ω, t) are estimated (steps ST306 to ST309), and B(ω, t) is calculated (step ST310). Whether the parameter estimates have converged is then checked. If they have not, the distance between the amplitude spectra of M(ω, t) and B(ω, t) is obtained over all sections in Ψ, and a constant multiple of its maximum (or mean) is taken as the BGM section determination threshold (step ST312). Sections whose distance is at or below this threshold are detected and added to Ψ (step ST313); an upper limit may be placed on how much is added. Repeating this estimation and addition updates Ψ, and the parameter functions converge to appropriate values. A useful distance between M(ω, t) and B(ω, t) is, for example, the mean-square log-spectral distance

[equation not reproduced in this text]
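The thresholding part of this loop (steps ST312 and ST313) can be sketched in Python as follows. The distance equation is not reproduced in this text, so `log_spectral_distance` assumes a plain mean of squared natural-log amplitude differences. The fixed window length, the `factor` multiplier, and the function names are likewise hypothetical, and the re-estimation of the parameter functions between passes is left out.

```python
import math


def log_spectral_distance(M, B, start, end, floor=1e-10):
    """Mean of squared log-amplitude differences over frames [start, end)."""
    total, count = 0.0, 0
    for t in range(start, end):
        for m, b in zip(M[t], B[t]):
            d = math.log(max(m, floor)) - math.log(max(b, floor))
            total += d * d
            count += 1
    return total / count


def grow_bgm_sections(M, B, sections, window, factor=1.5, max_new=None):
    """One pass: derive a threshold from current sections, add matching windows.

    sections: current set of BGM sections as (start, end) frame ranges.
    window:   length in frames of each candidate section.
    max_new:  optional upper limit on the number of sections added.
    """
    known = [log_spectral_distance(M, B, s, e) for s, e in sections]
    threshold = factor * max(known)  # constant multiple of the maximum
    added = []
    for start in range(0, len(M) - window + 1, window):
        candidate = (start, start + window)
        if any(start < e and s < start + window for s, e in sections):
            continue  # overlaps a section already in the set
        if log_spectral_distance(M, B, *candidate) <= threshold:
            added.append(candidate)
        if max_new is not None and len(added) >= max_new:
            break
    return sections + added
```

In a full loop, the parameter functions would be re-estimated from the enlarged set after each pass, and the pass repeated until the estimates stop changing.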

Next, the adjustment of the parameter functions through the interface of the known acoustic signal removal editor is described.

The following describes a known acoustic signal removal editor in which a person can manually set the shapes of all the parameter functions in [Equation 11] to [Equation 13]: a(t), g(ω, t) (that is, gω(ω, t), gt(t), and gr(t)), p(ω), q(t), r(t), and c(ω, t). The user of the editor may draw an arbitrary function shape from scratch, or may run automatic estimation first and then correct the result.

FIG. 6 shows the screen layout of the editor's interface 4. The editor consists of three main sub-windows: sub-window W1 for operating on the mixed acoustic signal m(t), sub-window W2 for operating on the known acoustic signal b'(t), and sub-window W3 for operating on the desired acoustic signal s(t) obtained after the known acoustic signal is removed. When there are several known acoustic signals b'(t), the changeover switch W2S selects which one is operated on in sub-window W2. This interface executes steps ST205 to ST219 shown in FIG. 4.

First, the functions common to all sub-windows are described. The operation range slider P1 shows which part of the acoustic signal is currently displayed. The cursor P2 shows the position on the time axis of the current operation target. Pressing the iconize (fold) button P3 temporarily folds the sub-window to which the button belongs and makes it smaller; hiding sub-windows that are not currently in use makes good use of a small screen. Pressing the float (enlarge) button P4 temporarily detaches the sub-window to which the button belongs from the parent window (floats it) and enlarges it, which makes operation and editing easier. Where only a float (enlarge) button P4 is drawn, pressing it makes the associated sub-window appear as a new floating window.

Sub-window W1 displays a graph E1 of the power of the mixed acoustic signal m(t) and a graph E2 of its amplitude spectrum M(ω, t). Sub-window W2 displays a graph E3 of the power of the known acoustic signal b'(t) and a graph E4 of its amplitude spectrum B'(ω, t). Sub-window W3 displays a graph E5 of the power of the acoustic signal s(t) after removal of the known acoustic signal and a graph E6 of its amplitude spectrum S(ω, t). In each amplitude spectrum display, the left side shows the amplitude as shading (horizontal axis: time; vertical axis: frequency), and the right side shows the amplitude at the cursor position (horizontal axis: power; vertical axis: frequency).

The playback control operation panel P51 has a row of buttons for playing, stopping, fast-forwarding, and rewinding the mixed acoustic signal so that a person can check it by ear. When panel P51 is operated, the interface 4 plays the mixed acoustic signal through its built-in sound playback unit.

Sub-window W2, for operating on the known acoustic signal b'(t), is the central window for operation. In it, the shapes of all the parameter functions in [Equation 12] and [Equation 13], namely a(t), g(ω, t) (that is, gω(ω, t), gt(t), and gr(t)), p(ω), q(t), and r(t), can be set freely. Each operation panel is described below.

1. Operation panel C1 for correcting the time variation of the frequency characteristic (right of E7)
This panel displays and edits gω(ω, t). It shows gω(ω, t) at the time t of the cursor position (horizontal axis: magnitude; vertical axis: frequency). The result of a setting operation is immediately reflected in the display panel E7 for g(ω, t) (steps ST205 and ST206). E7 shows the magnitude of g(ω, t) as shading (horizontal axis: time; vertical axis: frequency).

2. Operation panel C2 for correcting the time variation of the volume (below E7)
This panel displays and edits gt(t). The result of a setting operation is immediately reflected in the display panel E7 for g(ω, t) (steps ST207 and ST208).

3. Operation panel C3 for raising the value of g(ω, t) as a whole (below E7)
This panel displays and edits gr(t). The result of a setting operation is immediately reflected in the display panel E7 for g(ω, t) (steps ST209 and ST210).

4. Operation panel C4 for making the final adjustment to the amount subtracted from the amplitude spectrum of the mixed sound as the component corresponding to the amplitude spectrum of the known acoustic signal
This panel displays and edits a(t). Operating it immediately reflects the change in a(t) in the display (steps ST211 and ST212).

5. Operation panel C5 for correcting expansion and contraction along the frequency axis
This panel displays and edits p(ω). Operating it immediately reflects the change in p(ω) in the display (steps ST213 and ST214).

6. Operation panel C6 for correcting expansion and contraction along the time axis
This panel displays and edits q(t). Operating it immediately reflects the change in q(t) in the display (steps ST215 and ST216).

7. Operation panel C7 for correcting the temporal position shift
This panel displays and edits r(t). Operating it immediately reflects the change in r(t) in the display (steps ST217 and ST218).

The playback control operation panel P52 has a row of buttons for playing, stopping, fast-forwarding, and rewinding the known acoustic signal so that a person can check it by ear. When panel P52 is operated, the interface 4 plays the known acoustic signal through its built-in sound playback unit.

Next, in sub-window W3, for operating on the acoustic signal s(t) after removal of the known acoustic signal, the shape of the parameter function c(ω, t) in [Equation 11] can be set freely. Each operation panel is described below.

1. Graphic equalizer (GEQ) operation panel C8 (right of E8)
This panel displays and edits the shape of c(ω, t) in the ω direction. It shows c(ω, t) at the time t of the cursor position (horizontal axis: magnitude; vertical axis: frequency). The result of a setting operation is immediately reflected in the display panel E8 for c(ω, t). E8 shows the magnitude of c(ω, t) as shading (horizontal axis: time; vertical axis: frequency).

2. Volume fader operation panel C9 (below E8)
This panel displays and edits the shape of c(ω, t) in the t direction. The result of a setting operation is immediately reflected in the display panel E8 for c(ω, t).

The playback control operation panel P53 has a row of buttons for playing, stopping, fast-forwarding, and rewinding the synthesized acoustic signal (the output of the synthesis means 7) so that a person can check it by ear. When panel P53 is operated, the interface 4 plays the synthesized acoustic signal through its built-in sound playback unit.

Next, an implementation of this embodiment is described. A program was implemented on several operating systems (Linux 2.4, SGI IRIX 6.5, and Microsoft Windows XP; registered trademarks) that, when a mixed acoustic signal m(t) is observed in which an acoustic signal b(t) such as BGM has been added to an acoustic signal s(t) such as voices or other sounds, obtains the unknown s(t) under the condition that the source acoustic signal b'(t) from which b(t) originates is known. Given audio files containing m(t) and b'(t), the program produces an audio file of s(t).
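To make the overall signal flow concrete, the following self-contained Python sketch runs the core chain on short sample lists: convert to a time-frequency representation, subtract amplitude spectra, invert using the phase of the mixture, and overlap-add the resulting unit waveforms. It is a toy illustration under simplifying assumptions, not the implemented program. The known signal is assumed to be already aligned and corrected (a(t) = 1, g(ω, t) = 1), negative differences are clipped to zero, c(ω, t) is omitted, and a naive DFT with a periodic Hann window at 50% overlap stands in for whatever transform the actual implementation used. All function names are hypothetical.

```python
import cmath
import math


def dft(x):
    """Naive discrete Fourier transform of a real or complex sequence."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * math.pi * k * j / n) for j in range(n))
            for k in range(n)]


def idft(X):
    """Naive inverse DFT, returning the real part of each sample."""
    n = len(X)
    return [(sum(X[k] * cmath.exp(2j * math.pi * k * j / n)
                 for k in range(n)) / n).real for j in range(n)]


def stft(signal, size, hop):
    """Windowed frames -> complex spectra (periodic Hann window)."""
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / size) for n in range(size)]
    return [dft([signal[start + n] * window[n] for n in range(size)])
            for start in range(0, len(signal) - size + 1, hop)]


def remove_known(mixture, known_corrected, size=16, hop=8):
    """Subtract |B| from |M| per bin, keep the mixture phase, overlap-add."""
    M = stft(mixture, size, hop)
    B = stft(known_corrected, size, hop)
    out = [0.0] * len(mixture)
    for i, (m_frame, b_frame) in enumerate(zip(M, B)):
        s_frame = []
        for m, b in zip(m_frame, b_frame):
            amp = max(abs(m) - abs(b), 0.0)  # assumption: clip at zero
            s_frame.append(cmath.rect(amp, cmath.phase(m)))
        unit = idft(s_frame)                 # one unit waveform
        for n, v in enumerate(unit):         # synthesis by overlap-add
            out[i * hop + n] += v
    return out
```

With the periodic Hann window at 50% overlap the analysis windows sum to one, so samples covered by two frames are reconstructed exactly when nothing is subtracted; the first and last `hop` samples are covered by only one frame and are attenuated in this toy version.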

Experiments on a variety of mixed sounds in which background music (BGM) had been added to human voices and other sounds confirmed that the BGM in the mixture can be removed using the acoustic signal of the original BGM piece, yielding the voices and other sounds. Removal was possible for BGM from a range of genres, including popular and classical music, and for pieces both with and without drums.

As an example of the experimental results, FIGS. 7 to 12 show the result of actually processing a mixed sound in which classical music plays as BGM behind a dialogue between a man and a woman. With the mixed acoustic signal m(t) shown in FIGS. 7 and 8 as input, removing the BGM component using the known acoustic signal b'(t) of the original source shown in FIGS. 9 and 10 yields the acoustic signal s(t) after removal shown in FIGS. 11 and 12. The mixed sound in this example was made by adding a classical music signal excerpted from the RWC Music Database (ＲＷＣ研究用音楽データベース) to a dialogue signal between a man and a woman excerpted from the RWCP speech dialogue database (ＲＷＣＰ音声対話データベース).

According to the present invention, the correction step obtains a corrected amplitude spectrum of the known acoustic signal in which at least one of the following is corrected relative to the amplitude spectrum of the mixed acoustic signal: temporal position shift, time variation of the frequency characteristic, time variation of the volume, expansion or contraction along the time axis, and expansion or contraction along the frequency axis. Because this corrected amplitude spectrum is removed from the amplitude spectrum of the mixed acoustic signal, a known acoustic signal contained in the mixed acoustic signal as non-stationary noise can be removed with high accuracy.

Also according to the present invention, when the input is the acoustic signal of a television program, film, or the like in which BGM plays behind human voices and other sounds, the BGM in the program can be removed using a separately prepared music signal of that BGM, yielding an acoustic signal of the voices and other sounds alone.

Furthermore, adding different music as BGM to the acoustic signal after BGM removal makes it possible to reuse television programs, films, and the like with their music replaced.

Because the known acoustic signal may be any acoustic signal, the invention applies regardless of musical genre and regardless of whether vocals or accompaniment are present. Nor is it limited to music: it applies to any known noise, whether stationary or non-stationary.

In addition, manual correction by a person through the interface of the known acoustic signal removal editor enables higher-quality removal in practical production work.

W1, W2, W3: sub-windows
P1: operation range slider
P2: cursor
P3, P4: buttons
P51 to P53: playback control operation panels
E1 to E6: graphs
E7, E8: display panels
C1 to C9: operation panels

Claims

A known acoustic signal removal method for removing a component of a known acoustic signal from a mixed acoustic signal in which a plurality of acoustic signals are mixed, the method comprising:
a mixed acoustic signal conversion step of converting the mixed acoustic signal into a time-frequency representation to obtain an amplitude spectrum of the mixed acoustic signal and a phase of the mixed acoustic signal;
a known acoustic signal conversion step of converting a known acoustic signal, which corresponds to the known acoustic signal contained in the mixed acoustic signal, into a time-frequency representation to obtain an amplitude spectrum of the known acoustic signal;
a correction step of obtaining, based on the amplitude spectrum of the mixed acoustic signal, a corrected amplitude spectrum of the known acoustic signal in which at least one of the following is corrected relative to the amplitude spectrum of the mixed acoustic signal: temporal position shift, time variation of the frequency characteristic, time variation of the volume, expansion or contraction along the time axis, and expansion or contraction along the frequency axis;
a removal step of removing the corrected amplitude spectrum of the known acoustic signal from the amplitude spectrum of the mixed acoustic signal;
an inverse transformation step of obtaining unit waveforms by inverse transformation to a time-domain representation based on the post-removal amplitude spectrum obtained in the removal step and the phase of the mixed acoustic signal; and
a synthesis step of synthesizing the unit waveforms to obtain an acoustic signal from which the component of the known acoustic signal has been removed.