JP2005173055A

JP2005173055A - Device, method and program for removing acoustic signal

Info

Publication number: JP2005173055A
Application number: JP2003410959A
Authority: JP
Inventors: Masataka Goto; 真孝後藤; Yasumasa Nakada; 安優中田; Tomoyuki Okamura; 智之岡村; Hironobu Takahashi; 裕信高橋
Original assignee: National Institute of Advanced Industrial Science and Technology AIST; Fuji Television Network Inc
Current assignee: National Institute of Advanced Industrial Science and Technology AIST; Fuji Television Network Inc
Priority date: 2003-12-09
Filing date: 2003-12-09
Publication date: 2005-06-30
Anticipated expiration: 2023-12-09
Also published as: JP4274419B2; WO2005057551A1

Abstract

PROBLEM TO BE SOLVED: To exclude stationary sound, which causes erroneous estimation, during removal processing on a mixed sound, and to predict changes in the sound to be removed automatically and with high precision so that it can be removed appropriately.

SOLUTION: The acoustic signal removal device comprises a stationary sound detection part 18 that detects stationary sound in a mixed acoustic signal in which a known acoustic signal to be removed is mixed with other acoustic signals; an acoustic amplitude extraction part 200 that extracts amplitude spectra from the acoustic signals; and a removal processing part 104 that selects, from the extracted mixed acoustic amplitude spectra, those at frequencies that do not coincide with the frequency of the stationary sound, and removes the known acoustic amplitude spectrum from the mixed acoustic amplitude spectra at the selected frequencies.

COPYRIGHT: (C)2005, JPO&NCIPI

Description

The present invention relates to an acoustic signal removal device, an acoustic signal removal method, and an acoustic signal removal program that remove sound such as background music (BGM) or speech mixed into content when that content is reused, for example when a program that has already been broadcast is rebroadcast.

In the broadcasting industry in recent years, content is sometimes reused, for example by rebroadcasting a program that has already been broadcast. When content is reused, video material in which previously broadcast speech and music are mixed may be turned into new material by erasing only the music used in it. One technique for subtracting a specific sound from an existing mixed sound in this way is disclosed in Patent Document 1.
Patent Document 1: JP 2000-312395 A

However, when the audio of a broadcast program is produced, frequency characteristics and volume are often adjusted to suit the production intent. Because the phase of the sound therefore changes unpredictably, simple electronic subtraction cannot erase the sound properly.

More specifically, even when the sound to be removed is known music such as the BGM of a program, its frequency characteristics may have changed, either because the bass or treble of the BGM was emphasized or attenuated for effect when the program was produced, or because recording and playback were repeated. Simple subtraction therefore cannot be used.

In addition, the video data may contain stationary noise at a specific frequency (stationary sound), such as hiss from an analog tape recorder or various beat tones, and the estimation goes badly wrong in the frequency channels around this stationary sound. Because the stationary sound is not contained in the known sound at all, it causes the frequency characteristics to be misestimated.

Furthermore, the known sound is removed only where it is mixed in, but when only part of an audio file is erased, the volume changes between the erased part and the rest.

The present invention has been made to solve these problems. Its object is to provide an acoustic signal removal device, an acoustic signal removal method, and an acoustic signal removal program that exclude the stationary sound that causes erroneous estimation during removal processing on a mixed sound, predict changes in the sound to be removed automatically and with high precision, and remove it appropriately.

To solve the above problems, the present invention detects stationary sound in a mixed acoustic signal in which a known acoustic signal to be removed is mixed with other acoustic signals; extracts a known acoustic amplitude spectrum from the known acoustic signal and a mixed acoustic amplitude spectrum for each frequency from the mixed acoustic signal; selects, from the extracted mixed acoustic amplitude spectra, those at frequencies that do not coincide with the frequency of the stationary sound; and removes the known acoustic amplitude spectrum from the mixed acoustic amplitude spectra at the selected frequencies.

In the above invention, the stationary sound is preferably detected as the frequency at which the amplitude in the mixed acoustic signal is the minimum, the frequency giving the n-th value when the values are sorted in order, or the frequency giving the largest n for which the standard deviation of the values up to the n-th does not exceed a fixed value.

In another invention, when a known acoustic signal is removed from a mixed acoustic signal in which the known acoustic signal to be removed is mixed with other acoustic signals, a pseudo-subtraction process is performed on a constant-amplitude acoustic signal serving as the mixed sound, with the known acoustic signal set to 0; the difference in volume between the constant-amplitude acoustic signal and the acoustic signal after the pseudo-subtraction process is measured; the signal strength of the known acoustic signal for each time is set on the basis of the measurement result; a known acoustic amplitude spectrum is extracted from the known acoustic signal and a mixed acoustic amplitude spectrum is extracted from the mixed acoustic signal; the known acoustic amplitude spectrum is converted on the basis of the setting; and the known acoustic amplitude spectrum is removed from the mixed acoustic amplitude spectrum. Like the removal process finally performed, this pseudo-subtraction process can be a subtraction of amplitude spectra.

According to the present invention, because the amplitude spectrum of an acoustic signal does not depend on phase, the frequency characteristics and volume changes of the known acoustic signal within the mixed acoustic signal can be estimated properly without being affected by phase changes. As a result, music alone can be erased accurately from, for example, the audio signal of a program in which speech and music are mixed, using the sound data of the music CD or other source used when the program was produced. The invention is not limited to music: background noise mixed in during production can also be erased using sound data in which only that noise was recorded at the same time.

Also according to the present invention, stationary noise at a specific frequency, such as hiss from an analog tape recorder or various beat tones, is detected, and the removal process ignores the frequency channels containing that noise, so the frequency characteristics of the known sound can be estimated more accurately.

Furthermore, according to the other invention, a single tone of constant amplitude (for example, 480 Hz) is input as the mixed sound, a pseudo-subtraction process is performed with the amount of known sound to subtract set to zero, the difference in output volume is measured, and the settings are made so that the values match. The volume can therefore be kept the same even when only part of a particular audio file is erased.

[Configuration of the known sound removal system]
Embodiments of the present invention will be described with reference to the drawings. FIG. 1 is a block diagram showing the overall configuration of the known sound removal system according to this embodiment.

As shown in FIG. 1, the known sound removal system according to this embodiment includes an input I/F 1 and a DV capture 2 for inputting mixed sound and known sound. Files input through the input I/F 1 and the DV capture 2 (for example, AVI and WAV files) are stored in a storage device 5. The input I/F 1 is an interface that takes in acoustic signals from a playback device such as a CD player or MD player. The DV capture 2 is an interface that extracts the MIX audio to be erased, a mixed acoustic signal in which video and audio are mixed.

The known sound removal system also includes an audio conversion unit (PreWav/PostWav) 3 and an audio data extraction unit (DVReMix) 4, which perform audio conversion and audio data extraction on the various data stored in the storage device 5. These units read a designated file (AVI or WAV) from the storage device 5, apply the prescribed processing, and store the processed file (WAV) in the storage device 5.

The audio conversion unit (PreWav) 3 performs sampling-frequency conversion and splits stereo into mono channels. To match the format expected by the sound removal engine program 100, it separates the WAV file into left and right channels, converts the sampling rate to 48 kHz, generates two WAV files (output file names: MIX-L.WAV for the left channel and MIX-R.WAV for the right channel), and stores them in the storage device 5.

The audio data extraction unit (DVReMix) 4 is a module that extracts only the audio data from content consisting of video data and audio data. In this embodiment it extracts the audio data from an AVI file in WAV format. This WAV file is stereo, and its sampling rate is 32 kHz or 48 kHz, the same as the DV audio. The extracted WAV file is stored in the storage device 5.

The known sound removal system also includes the sound removal engine program 100, which removes the known acoustic signal from the mixed acoustic signal. The sound removal engine program 100 reads the audio files (WAV files) stored in the storage device 5, stores the removal result and various data related to the removal process in the storage device 5 via a temporary memory 7, and outputs them to a monitor 10 and a speaker 11 through an output I/F 8. The monitor 10 displays an editor 400, a GUI that shows operations made through a user interface 6 and the processing results, and the speaker 11 outputs the mixed sound, the known sound, and the sound after removal in response to user operations on the user interface 6.

The sound removal engine program 100 also acquires, through the user interface 6, operation signals generated by user operations on input devices such as a keyboard 6a and a mouse 6b, and performs various processes on the basis of these signals. The known sound removal process performed by the sound removal engine program 100 is described later.

The known sound removal system further includes a synchronization control unit 9, which synchronizes the reading of data from the storage device 5, the removal process performed by the sound removal engine program 100, and the input and output of data through the memory 7 and the output I/F 8. The video displayed in the editor 400 and the audio output from the speaker 11 can thereby be kept in sync with the processing of the sound removal engine program 100 and with user operations on the user interface 6.

In addition, the sound removal engine program 100 includes a simulation unit 14, which supports the user's work by determining default parameter values through simulation.

Specifically, the simulation unit 14 inputs a single tone of constant amplitude (480 Hz) as the mixed sound, performs the removal process with the known sound set to zero, compares the output volume with that of the mixed sound before processing to measure the difference, and sets the default value of the removal strength in the user interface 6 so that the difference becomes 0.
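The measurement step can be sketched as follows. This is only an illustration under stated assumptions: `process` stands in for the removal process run with the known sound set to zero (its internals are not described in this passage), volume is taken to be the RMS level, and the function names, the 48 kHz rate, and the one-second duration are hypothetical choices. How the measured difference is mapped to the removal-strength default is not specified here; the compensation gain is one simple possibility.

```python
import math

def make_tone(freq_hz=480.0, amplitude=0.5, rate=48000, seconds=1.0):
    """Generate a constant-amplitude sine tone as a list of float samples."""
    n = int(rate * seconds)
    return [amplitude * math.sin(2.0 * math.pi * freq_hz * i / rate) for i in range(n)]

def rms(samples):
    """Root-mean-square level of a sample sequence."""
    return math.sqrt(sum(x * x for x in samples) / len(samples))

def level_difference_db(process, tone):
    """Level change in dB introduced by `process` when fed the test tone.

    `process` is a hypothetical stand-in for the removal process run with
    the known sound set to zero; 0 dB means the volume is unchanged.
    """
    return 20.0 * math.log10(rms(process(tone)) / rms(tone))

def compensation_gain(process, tone):
    """Linear gain that would bring the processed tone back to the input level."""
    return rms(tone) / rms(process(tone))
```

For a process that leaves the tone untouched, `level_difference_db` returns 0 and no compensation is needed.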

The simulation unit 14 also runs its simulation using the "phase-independent subtraction function" described below, which assumes that the phase difference between the mixed sound and the known sound is uniformly distributed from 0 to 360 degrees. Specifically, the simulation unit 14 fixes the amplitude of the other acoustic signal at a predetermined value, combines it with the known acoustic signal while varying the phase difference between them over the range 0 to 360 degrees, and calculates the mean amplitude of the resulting mixed acoustic signal. From the ratio of the known acoustic signal's amplitude to this mean, it calculates an approximate value of the ratio of the other acoustic signal's amplitude to the mean, and it sets the removal strength for the known signal on the basis of the amplitude of the other acoustic signal obtained by multiplying this approximate value by the mean. The derivation of the phase-independent subtraction function is described below.

First, in this embodiment the calculation is performed for each frequency channel. Taking the channel frequency as f (Hz), the mixed sound, the known sound, and the audio output after erasure (the other sound) are related by

"mixed sound" = "audio output after erasure" + "known sound"

With $\omega = 2\pi f$, the three sounds are expressed respectively as

$$\text{mixed sound} = M\sin(\omega t + \theta_m),\quad \text{audio output after erasure} = A\sin(\omega t + \theta_a),\quad \text{known sound} = B\sin(\omega t + \theta_b) \tag{1}$$

From the relationship between the mixed sound, the known sound, and the audio output after erasure,

$$M\sin(\omega t + \theta_m) = A\sin(\omega t + \theta_a) + B\sin(\omega t + \theta_b) \tag{2}$$

Because Equation (2) holds at every time t, the coefficients of sin(ωt) and cos(ωt) on the two sides must be equal. Therefore

$$M\cos\theta_m = A\cos\theta_a + B\cos\theta_b,\qquad M\sin\theta_m = A\sin\theta_a + B\sin\theta_b \tag{3}$$

From this relationship, M can be expressed in terms of A, B, and the phase difference (θa − θb) between the known sound and the audio output:

$$M^2 = A^2 + B^2 + 2AB\cos(\theta_a - \theta_b) \tag{4}$$

Replacing the phase difference (θa − θb) with θδ gives

$$M = \sqrt{A^2 + B^2 + 2AB\cos\theta_\delta} \tag{5}$$

In this embodiment the phase difference θδ is assumed to occur with uniform probability, and the mean of the measured values of M is calculated by integrating over θδ from 0 to 2π radians:

$$\bar{M} = \frac{1}{2\pi}\int_0^{2\pi}\sqrt{A^2 + B^2 + 2AB\cos\theta_\delta}\,d\theta_\delta \tag{6}$$

For example, setting A = 1.0 and simulating the case where B takes various values yields the following table.

[Table 1: A, B, and the mean of M]

Here, the mean $\bar{M}$ is normalized by defining

$$R_a = \frac{A}{\bar{M}},\qquad R_b = \frac{B}{\bar{M}}$$

As Equation (6) shows, multiplying A and B by a constant multiplies $\bar{M}$ by the same constant, so the relationship is preserved even when every value in Table 1 is divided by $\bar{M}$ (multiplied by the reciprocal of $\bar{M}$), as in the following table.

[Table 2: the values of Table 1 divided by the mean of M]

In the removal process of this embodiment, the amplitude of the mixed sound (M) and the amplitude of the known sound (B) are obtained, so Rb, the value of B divided by M, can be calculated. Ra can then be found from table data corresponding to Table 2, or from an approximate expression for it.

Next, the amplitude A of the other sound is obtained from

$$A = R_a\,\bar{M}$$

In practice, B is varied from 0 to 100 in steps of 0.01, an approximate expression is generated for the table values, and A is calculated from the mean value $\bar{M}$ of M.

In theory the amplitude of the known sound never exceeds the amplitude of the mixed sound, but in practice this does occur, for example through estimation errors or when the erasure strength is set above 1. In such cases (Rb greater than 1), Ra is set to 0.
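The table lookup described above can be sketched numerically. The code below is a minimal illustration, not the patent's implementation: it evaluates the phase-averaged amplitude with a midpoint rule, tabulates (Rb, Ra) for A = 1, and recovers A by linear interpolation in place of the approximate expression, which is not reproduced in the text. All function names, the interpolation scheme, and the appended limit point (Rb = 1, Ra = 0) are assumptions made for this sketch.

```python
import math
from bisect import bisect_left

def mean_mixed_amplitude(a, b, steps=720):
    """Mean of sqrt(A^2 + B^2 + 2AB cos(theta)) for theta uniform on [0, 2*pi)."""
    total = 0.0
    for k in range(steps):
        theta = 2.0 * math.pi * (k + 0.5) / steps  # midpoint rule
        total += math.sqrt(max(a * a + b * b + 2.0 * a * b * math.cos(theta), 0.0))
    return total / steps

def build_ratio_table(b_max=100.0, b_step=0.01, steps=360):
    """Tabulate Rb = B / mean(M) and Ra = A / mean(M) for A = 1 and B on a grid."""
    rbs, ras = [], []
    count = int(round(b_max / b_step))
    for i in range(count + 1):
        b = i * b_step
        m_bar = mean_mixed_amplitude(1.0, b, steps)
        rbs.append(b / m_bar)
        ras.append(1.0 / m_bar)
    # Assumed end point: as B grows without bound, Rb tends to 1 and Ra to 0.
    rbs.append(1.0)
    ras.append(0.0)
    return rbs, ras

def estimate_other_amplitude(m_bar, b, table):
    """Estimate A from the mean mixed amplitude and the known amplitude B."""
    if m_bar <= 0.0:
        return 0.0
    rb = b / m_bar
    if rb >= 1.0:  # the text sets Ra = 0 when Rb exceeds 1
        return 0.0
    rbs, ras = table
    j = bisect_left(rbs, rb)
    if j == 0:
        ra = ras[0]
    else:
        x0, x1 = rbs[j - 1], rbs[j]
        y0, y1 = ras[j - 1], ras[j]
        ra = y0 if x1 == x0 else y0 + (y1 - y0) * (rb - x0) / (x1 - x0)
    return ra * m_bar
```

A coarser grid, for example `build_ratio_table(b_max=10.0, b_step=0.05)`, keeps the table small in pure Python; the default arguments follow the 0 to 100 range and 0.01 step mentioned in the text.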

The sound removal engine program 100 also includes a stationary sound detection unit 18. As shown in FIG. 6(b), the stationary sound detection unit 18 is a module that detects stationary sound in the mixed acoustic signal subject to removal processing by the removal engine program 100 and notifies the removal engine program 100 of the detected stationary sound.

On the basis of this notification, the removal engine program 100 selects, from the mixed acoustic amplitude spectra, those at frequencies that do not coincide with the frequency of the stationary sound, and removes the known acoustic amplitude spectrum from the mixed acoustic amplitude spectra at the selected frequencies.
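A minimal sketch of this channel selection for a single analysis frame is shown below. It makes several assumptions that the passage does not state: channels flagged as stationary are passed through unchanged, the removal is a plain amplitude subtraction clamped at zero rather than the phase-independent subtraction function, and the function and parameter names are hypothetical.

```python
def remove_known_spectrum(mixed_amp, known_amp, stationary_channels):
    """Subtract the known amplitude spectrum, skipping stationary-sound channels.

    mixed_amp, known_amp: amplitudes per frequency channel for one frame.
    stationary_channels:  set of channel indices reported by the detector.
    """
    if len(mixed_amp) != len(known_amp):
        raise ValueError("spectra must have the same number of channels")
    result = []
    for k, (m, b) in enumerate(zip(mixed_amp, known_amp)):
        if k in stationary_channels:
            result.append(m)                # channel not selected: left unchanged
        else:
            result.append(max(m - b, 0.0))  # an amplitude cannot be negative
    return result
```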

To detect the stationary sound, the amplitude of each frequency channel in the mixed acoustic signal is detected in association with time, the amplitudes are sorted, and their minimum or a representative value is calculated. The stationary sound can be calculated by the following three methods.

(1) Minimum value method
FFT processing is applied to the mixed sound to extract the amplitude data of each frequency channel in association with time. Suppose, for example, that the following values are detected in the 15.7 kHz channel:

Time       0.00 s  0.01 s  0.02 s  ...  1456.345 s
Amplitude  1.5     2.5     1.5     ...  4.5

In the minimum value method, the smallest of the values over all times is taken as the stationary sound.

(2) Ranking method
With the minimum value method described above, the amplitude can become small when the noise is not truly stationary, or because of its phase relationship with other speech or the mixed-in music, so the minimum may fall below the true stationary level. The ranking method therefore uses the following value instead of the minimum.

The amplitude values described above are sorted in ascending order.

Rank       1     2     3     4     5     ...  345   346
Amplitude  0.34  0.35  0.36  0.40  0.40  ...  18.9  19.5

For example, with one value every 10 milliseconds, 3 seconds gives 300 samples, which are numbered from 1 to 300. Then, on the assumption that at least 10% of the interval contains nothing but noise, the value at 10% of 300, that is, the 30th value, is adopted as the stationary sound.

(3) Improved ranking method
In the ranking method, the interval containing only noise may be too long or too short, and a threshold such as 10% has to be chosen, so the result can depend on how the threshold is set. As an improvement, statistics are taken over the amplitude values from the smallest up to the n-th to obtain the standard deviation σ(n), the largest n for which σ(n) does not exceed a fixed value is found, and the amplitude value at that n is taken as the stationary sound. The fixed value is preferably determined empirically.
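The three methods can be sketched for one frequency channel as follows. The function names are hypothetical, the use of the population standard deviation is an assumption (the text does not say which estimator is used), and `sigma_limit` is the empirically chosen fixed value.

```python
import math

def stationary_min(amplitudes):
    """(1) Minimum value method: smallest amplitude over all times."""
    return min(amplitudes)

def stationary_rank(amplitudes, fraction=0.10):
    """(2) Ranking method: the value at the given fraction of the sorted list."""
    ordered = sorted(amplitudes)
    n = max(1, int(len(ordered) * fraction + 1e-9))  # e.g. 30 for 300 samples
    return ordered[n - 1]

def stationary_improved(amplitudes, sigma_limit):
    """(3) Improved ranking: largest n whose sigma(n) stays within sigma_limit."""
    ordered = sorted(amplitudes)
    total = 0.0
    total_sq = 0.0
    best_n = 1
    for n, value in enumerate(ordered, start=1):
        total += value
        total_sq += value * value
        mean = total / n
        variance = max(total_sq / n - mean * mean, 0.0)  # population variance
        if math.sqrt(variance) <= sigma_limit:
            best_n = n
    return ordered[best_n - 1]
```

Running sums keep the improved method at one pass over the sorted values instead of recomputing σ(n) from scratch for every n.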

[Operation of the known sound removal system]
The known sound removal system configured as described above operates as follows. FIG. 2 is a flowchart showing the operation of the known sound removal system. This embodiment is described using an example in which a video file (DV) with video and stereo audio is the mixed sound (MIX audio), an audio file containing the original tune is the known sound, and the original tune, included in the video file as BGM, is removed. The processing in this embodiment is broadly divided into (1) preprocessing, (2) music erasure, and (3) postprocessing. Each is described in detail below.

(1) Preprocessing
In preprocessing, the MIX audio to be erased is extracted from the DV, and the BGM audio (the original tune) is prepared. Specifically, video is captured from the DV capture 2 using DV video editing software (S101), and the captured file is stored in the storage device 5 as a Type 1 AVI file (output file name: MIX.AVI).

Next, the audio data extraction unit (DVReMix) 4 extracts the audio data from the AVI file in WAV format (output file name: MIX.WAV) (S102). This WAV file is stereo, and its sampling rate is 32 kHz or 48 kHz, the same as the DV audio. The extracted WAV file is stored in the storage device 5.

The audio conversion unit (PreWav) 3 then performs sampling-frequency conversion and splits the stereo signal into mono channels (S103). That is, to match the format expected by the sound removal engine program 100, the WAV file is separated into left and right channels, its sampling rate is converted to 48 kHz, and the two resulting WAV files (output file names: MIX-L.WAV for the left channel and MIX-R.WAV for the right channel) are stored in the storage device 5. In step S103, the offset of the video start time is also output to a settings file (file name: MIX.time) at the same time as the audio conversion and stored in the storage device 5.

In parallel with steps S101 to S103, the original tune is imported (S104). Specifically, the original tune (BGM tune) is imported from a CD or similar source and stored in the storage device 5 as a 44.1 kHz stereo WAV file (output file name: BGM.WAV). The audio conversion unit (PreWav) 3 then performs sampling-frequency conversion and splits the stereo signal into mono channels (S105). That is, to match the format expected by the sound removal engine program 100, the WAV file imported in step S104 is separated into left and right channels, its sampling rate is converted to 48 kHz, and the two resulting WAV files (output file names: BGM-L.WAV for the left channel and BGM-R.WAV for the right channel) are stored in the storage device 5.
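The channel split and rate conversion performed in steps S103 and S105 can be illustrated on raw sample lists. This is a sketch under assumptions: the text does not say how PreWav resamples, so linear interpolation is used here purely for illustration (a production converter would normally use a band-limited filter), and the function names are hypothetical. File reading and writing are omitted.

```python
def split_stereo(interleaved):
    """Split interleaved stereo samples [L0, R0, L1, R1, ...] into (left, right)."""
    if len(interleaved) % 2:
        raise ValueError("interleaved stereo data must have an even length")
    return list(interleaved[0::2]), list(interleaved[1::2])

def resample_linear(samples, src_rate, dst_rate=48000):
    """Resample one channel by linear interpolation (illustrative only)."""
    if not samples:
        return []
    n_out = int(len(samples) * dst_rate / src_rate)
    out = []
    for i in range(n_out):
        pos = i * src_rate / dst_rate          # position in source samples
        j = int(pos)
        frac = pos - j
        nxt = samples[min(j + 1, len(samples) - 1)]
        out.append(samples[j] * (1.0 - frac) + nxt * frac)
    return out
```

For example, a 44.1 kHz BGM channel would be passed through `resample_linear(left, 44100)` to obtain the 48 kHz data that the engine expects.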

(2) Music erasure
In music erasure, the removal engine program (GEQ) 100 erases the BGM audio from the MIX audio (S106). The audio output after erasure is stored in the memory 7 or the storage device 5 as mono 48 kHz WAV files, one for each of the left and right channels (output file names: ERASE-L.WAV for the left channel and ERASE-R.WAV for the right channel).

(3) Postprocessing
In postprocessing, the audio erased by the removal engine program is converted into DV audio and restored to the DV (AVI file). First, the audio conversion unit (PostWav) 3 performs sampling-frequency conversion and mono-to-stereo conversion (S107). That is, the left and right channel WAV files output from the sound removal engine program 100 are combined into stereo, converted if necessary to the same sampling rate as the original DV audio, and stored in the storage device 5 as a WAV file (file name: ERASE.WAV). Next, the audio data extraction unit (DVReMix) 4 replaces the audio of the captured AVI file (MIX.AVI) with the audio after erasure (ERASE.WAV), and the result is stored in the storage device 5 as the post-removal file (file name: ERASE.AVI).

If, for example, the first pass works on the left audio (L) of a stereo broadcast, the parameter settings used for the left audio are saved, recalled when the right audio is subsequently processed, and set as the default values in the user interface 6.

[Theory of the sound removal process]
Next, the sound removal engine program 100 described above is explained in detail, beginning with the theory behind its sound removal process.

（基本概念）
所望の音声や物音等の音響信号ｓ（ｔ）(ｔは時間軸)に、ＢＧＭ等の非定常音響信号ｂ（ｔ）が混合された、混合音響信号ｍ（ｔ）が観測されるものとする。

ここでは、ｂ（ｔ）の元となる音源の音響信号ｂ’（ｔ）が既知という条件下で、ｍ（ｔ）が与えられたときに、未知のｓ（ｔ）を求める。例えば、人間の声や物音と共にＢＧＭが鳴っているテレビ番組等の音響信号ｍ（ｔ）を入力とし、そのＢＧＭの楽曲が既知でその音響信号ｂ’（ｔ）が別途用意できるときに、そのＢＧＭの音楽音響信号を用いて番組中のＢＧＭを除去し、人間の声や物音だけの音響信号ｓ（ｔ）を得る処理を実現する。 (Basic concept)
A mixed acoustic signal m (t) in which a non-stationary acoustic signal b (t) such as BGM is mixed with an acoustic signal s (t) (t is a time axis) such as a desired voice or a physical sound is observed. To do.

Here, when m (t) is given under the condition that the acoustic signal b ′ (t) of the sound source that is the source of b (t) is known, the unknown s (t) is obtained. For example, when an acoustic signal m (t) of a TV program or the like in which a BGM is sounding together with a human voice or a sound is input, the music of the BGM is known and the acoustic signal b ′ (t) can be prepared separately. The BGM in the program is removed using the BGM music sound signal, and the process of obtaining the sound signal s (t) of only human voice or sound is realized.

Because b(t) and b'(t) do not match exactly, the processing corresponding to the subtraction

$$s(t) = m(t) - b(t) \tag{2}$$

must estimate the component corresponding to b(t) from b'(t) in order to obtain s(t). Specifically, the known acoustic signal b'(t) is deformed in the following ways within the mixed sound m(t), and the component corresponding to b(t) is estimated by correcting for these deformations.

・Temporal position shift
The known acoustic signal b'(t) does not necessarily start at the beginning of the mixed sound m(t). The known acoustic signal b'(t) is therefore shifted along the time axis so that the relative positions of the two signals match, and the known acoustic signal is then subtracted from the mixed sound.

・Temporal change in frequency characteristics
When the known acoustic signal b'(t) plays in the mixed sound m(t), its frequency characteristics have often been changed by a graphic equalizer or the like; for example, the low or high range may be boosted or attenuated. The frequency characteristics of b'(t) are therefore changed in the same way before the known acoustic signal is subtracted from the mixed sound.

・Temporal change in volume
When the known acoustic signal b'(t) plays in the mixed sound m(t), the mixing ratio has often been changed by operating a mixer fader or the like when the mixed sound was created, so the volume varies over time. The volume of b'(t) is therefore varied over time in the same way before the known acoustic signal is subtracted from the mixed sound.

FIG. 3 shows the processing flow of this program. The program does not subtract waveforms in the time domain; instead, it performs the subtraction on amplitude spectra in the time-frequency domain. The short-time Fourier transforms (STFT) X_m(ω, t) and X_b(ω, t) of the acoustic signals m(t) and b'(t) at time t, using a window function h(t), are defined by

$$X_m(\omega,t) = \int_{-\infty}^{\infty} m(\tau)\,h(\tau - t)\,e^{-j\omega\tau}\,d\tau, \qquad X_b(\omega,t) = \int_{-\infty}^{\infty} b'(\tau)\,h(\tau - t)\,e^{-j\omega\tau}\,d\tau \tag{3}$$

and their amplitude spectra M(ω, t) and B'(ω, t) are given by

$$M(\omega,t) = |X_m(\omega,t)|, \qquad B'(\omega,t) = |X_b(\omega,t)| \tag{4}$$

In this embodiment, the acoustic signal is A/D converted at a sampling frequency of 48 kHz with 16-bit quantization, and the STFT is computed by fast Fourier transform (FFT) using a Hanning window with a width of 8192 points as the window function h(t). The FFT frame is shifted by 480 points at a time, so the frame shift time (one frame shift) is 10 ms. This frame shift is the time unit of processing. The program can readily accommodate other sampling frequencies (16 kHz, 44 kHz, etc.), window widths, and frame shifts.
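As a rough illustration of this analysis step, the sketch below computes Hanning-windowed amplitude spectra frame by frame. It is not the program's implementation: a naive DFT stands in for the FFT, and the function names (`hann`, `amplitude_spectrum`, `stft_amplitude`) are assumptions made for this example.

```python
import cmath
import math


def hann(n_points):
    """Periodic Hanning window of n_points samples."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * n / n_points) for n in range(n_points)]


def amplitude_spectrum(frame):
    """Amplitude |X(k)| of one Hanning-windowed frame for bins 0 .. N/2 - 1.

    Uses a naive O(N^2) DFT for clarity; a real implementation would use an FFT.
    """
    n_points = len(frame)
    windowed = [x * w for x, w in zip(frame, hann(n_points))]
    spectrum = []
    for k in range(n_points // 2):
        acc = sum(x * cmath.exp(-2j * math.pi * k * n / n_points)
                  for n, x in enumerate(windowed))
        spectrum.append(abs(acc))
    return spectrum


def stft_amplitude(signal, window_size, hop):
    """Amplitude spectra of successive frames, advancing by `hop` samples."""
    return [amplitude_spectrum(signal[start:start + window_size])
            for start in range(0, len(signal) - window_size + 1, hop)]
```

With the values of this embodiment the call would be `stft_amplitude(signal, 8192, 480)`, although the naive DFT used here is far too slow for windows of that size.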

The amplitude spectrum S(ω, t) of the desired acoustic signal s(t) after removal of the known acoustic signal is obtained from the amplitude spectra M(ω, t) and B'(ω, t) as follows.

$$S(\omega,t) = \begin{cases} c(\omega,t)\,\{M(\omega,t) - B(\omega,t)\} & \text{if } M(\omega,t) > B(\omega,t) \\ 0 & \text{otherwise} \end{cases}$$
$$B(\omega,t) = a(t)\,g(\omega,t)\,B'(\omega,\,t + r(t)) \tag{5}$$

The parameter functions a(t), g(ω, t), r(t), and c(ω, t) in the above equations are described in order below.

・a(t) is a function of arbitrary shape for making the final adjustment to the amount by which the component corresponding to the amplitude spectrum of the known acoustic signal is subtracted from the amplitude spectrum of the mixed sound. Normally a(t) ≥ 1; the larger its value, the larger the amount subtracted.

・g(ω, t) is a function for correcting the temporal change in frequency characteristics and the temporal change in volume, and is obtained by

$$g(\omega,t) = g_\omega(\omega,t)\,g_t(t) + g_r(t) \tag{6}$$

Here, g_ω(ω, t) represents the temporal change in frequency characteristics; g_ω(ω, t) = 1 when the frequency characteristics do not change. g_t(t) represents the temporal change in volume and is constant when the volume does not change. The volume difference between M(ω, t) and B'(ω, t) is basically corrected by g_t(t). g_r(t) is a function mainly for raising the value of g(ω, t) as a whole and is used for fine adjustment of the correction; when it is not used, g_r(t) = 0.

・r(t) is a function for correcting the temporal position shift. Normally it is set to a constant, which corrects a fixed offset (this program supports only a constant).

・c(ω, t) is a function of arbitrary shape for equalizing and fader operations on the amplitude spectrum. Its shape in the ω direction allows the frequency characteristics after removal of the known acoustic signal to be adjusted in the manner of a graphic equalizer. Its shape in the t direction allows the volume change after removal of the known acoustic signal to be adjusted in the manner of a mixer volume fader. When it is not used, c(ω, t) = 1. This program supports only c(ω, t) = 1, but processing for c(ω, t) may be added.
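The roles of a(t), g(ω, t), r(t), and c(ω, t) can be sketched on discrete frames and frequency channels as follows. This is an illustration under stated assumptions, not the program's code: it assumes the subtraction has the form S = c · max(M − a · g · B', 0) with g = g_ω · g_t + g_r, that r is a constant offset in frames, and that B' is zero outside its time range. All function names are hypothetical.

```python
def correction_gain(g_w, g_t, g_r=0.0):
    """g(w, t) for one frame: frequency response times volume gain, plus offset."""
    return [gw * g_t + g_r for gw in g_w]


def subtract_frame(m_frame, b_frame, gain, a=1.0, c=1.0):
    """One frame of S: subtract the corrected known spectrum, clamped at zero.

    The clamp is an assumption of this sketch (an amplitude cannot be negative).
    """
    return [c * max(m - a * g * b, 0.0) for m, g, b in zip(m_frame, gain, b_frame)]


def remove_known_sound(M, B_known, g_w, g_t, r=0, a=1.0):
    """Apply the subtraction to every frame of the mixed amplitude spectra M.

    M, B_known: lists of frames (each a list of channel amplitudes).
    g_w: per-channel frequency response; g_t: per-frame volume gain.
    r: constant time offset in frames between M and B_known.
    """
    out = []
    for t, m_frame in enumerate(M):
        src = t + r
        if 0 <= src < len(B_known):
            b_frame = B_known[src]
        else:
            b_frame = [0.0] * len(m_frame)  # B' treated as zero outside its range
        gain = correction_gain(g_w, g_t[t])
        out.append(subtract_frame(m_frame, b_frame, gain, a))
    return out
```

Raising `a` above 1 subtracts more aggressively, matching the description of a(t) as the final adjustment of the subtraction amount.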

Using the amplitude spectrum S(ω, t) obtained in this way and the phase θ_m(ω, t) of the mixed sound m(t), X_s(ω, t) is obtained as

$$X_s(\omega,t) = S(\omega,t)\,e^{j\theta_m(\omega,t)} \tag{7}$$

and a unit waveform is obtained by applying the inverse Fourier transform (IFFT) to it.

The desired acoustic signal s(t) after removal of the known acoustic signal is synthesized by arranging these unit waveforms with the overlap-add method.
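The resynthesis step can be sketched as below: each amplitude spectrum is combined with the phase of the mixed sound, inverted, and the resulting unit waveforms are overlap-added. This is a minimal sketch with a naive DFT on full N-bin spectra; the helper names (`analyse`, `unit_waveform`, `overlap_add`) are assumptions for this example and do not come from the program.

```python
import cmath
import math


def dft(x):
    """Naive forward DFT."""
    n = len(x)
    return [sum(x[m] * cmath.exp(-2j * math.pi * k * m / n) for m in range(n))
            for k in range(n)]


def idft_real(spectrum):
    """Naive inverse DFT, returning the real part."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * math.pi * k * m / n)
                for k in range(n)).real / n
            for m in range(n)]


def analyse(signal, n, hop):
    """Per-frame (amplitude, phase) of Hanning-windowed frames."""
    window = [0.5 - 0.5 * math.cos(2 * math.pi * i / n) for i in range(n)]
    frames = []
    for start in range(0, len(signal) - n + 1, hop):
        spec = dft([signal[start + i] * window[i] for i in range(n)])
        frames.append(([abs(c) for c in spec], [cmath.phase(c) for c in spec]))
    return frames


def unit_waveform(amplitude, phase):
    """Combine an amplitude spectrum with the mixture phase and invert it."""
    return idft_real([s * cmath.exp(1j * th) for s, th in zip(amplitude, phase)])


def overlap_add(unit_waveforms, hop):
    """Place unit waveforms `hop` samples apart and sum the overlaps."""
    n = len(unit_waveforms[0])
    out = [0.0] * (hop * (len(unit_waveforms) - 1) + n)
    for i, wave in enumerate(unit_waveforms):
        for j, v in enumerate(wave):
            out[i * hop + j] += v
    return out
```

In a real run the amplitude passed to `unit_waveform` would be S(ω, t) rather than the unmodified mixture amplitude. With a Hanning window and a shift of one quarter of the window, as in the small test case used to check this sketch, the overlapped windows sum to the constant 2, so an unmodified spectrum comes back as the input scaled by 2.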

(Setting the parameter functions)
When the above processing is executed, the shapes of the parameter functions a(t), g(ω, t) (g_ω(ω, t), g_t(t), g_r(t)), r(t), and c(ω, t) in Equations 5 and 6 may be set manually by the user, or those that can be estimated may be estimated automatically. Alternatively, the user may correct the result of automatic estimation. This program supports automatic estimation only for part of the shapes of the parameter functions g(ω, t) (g_ω(ω, t), g_t(t)) and r(t) of Equations (11), (12), and (13), so the specific automatic estimation methods implemented are described below.

・To estimate g(ω, t), the temporal change in frequency characteristics g_ω(ω, t) is estimated first, followed by the temporal change in volume g_t(t). However, r(t) must be determined before g(ω, t) is estimated. For convenience, B'(ω, t + r(t)) is written as B'(ω, t) below.

In principle, the temporal change in frequency characteristics g_ω(ω, t) is estimated using sections that contain almost none of the acoustic signal s(t) of human voices and other sounds (hereinafter, BGM sections). Multiple BGM sections may be used. In a BGM section, the amplitude spectrum M(ω, t) of the mixed sound m(t) consists almost entirely of components derived from the amplitude spectrum B'(ω, t) corresponding to the BGM of the known acoustic signal b'(t). Therefore, when the frequency characteristics can be assumed to be stationary rather than time-varying, that is, g_ω(ω, t) = g'_ω(ω), g'_ω(ω) is estimated by

$$g'_\omega(\omega) = \frac{\sum_{\psi\in\Psi}\int_{t\in\psi} M(\omega,t)\,dt}{\sum_{\psi\in\Psi}\int_{t\in\psi} B'(\omega,t)\,dt} \tag{8}$$

where ψ denotes one BGM section (a region on the time axis) and Ψ is the set of ψ. When the frequency characteristics change over time, on the other hand,

$$g'_\omega(\omega,\psi) = \frac{\int_{t\in\psi} M(\omega,t)\,dt}{\int_{t\in\psi} B'(\omega,t)\,dt} \tag{9}$$

is obtained from the BGM sections ψ close to time t, and g_ω(ω, t) is estimated by interpolation or extrapolation. Finally, g_ω(ω, t) is smoothed along the frequency axis. The smoothing width can be set arbitrarily, and smoothing may be omitted.
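For the stationary case, the estimate can be sketched as a per-channel ratio of amplitudes summed over the BGM sections. This is an assumption-laden sketch: the ratio-of-sums form, the frame-index representation of sections, the fallback value of 1 for channels with no known-sound energy, and the name `estimate_freq_response` are all choices made for this example.

```python
def estimate_freq_response(M, B_known, bgm_sections):
    """Per-channel frequency response g'_w from BGM-only sections.

    M, B_known: lists of frames (lists of channel amplitudes), already time-aligned.
    bgm_sections: list of (start, end) frame ranges, end exclusive.
    """
    n_channels = len(M[0])
    num = [0.0] * n_channels
    den = [0.0] * n_channels
    for start, end in bgm_sections:
        for t in range(start, end):
            for w in range(n_channels):
                num[w] += M[t][w]
                den[w] += B_known[t][w]
    # Assumption: where the known sound has no energy the ratio is undefined,
    # so fall back to 1.0 ("no change in frequency characteristics").
    return [n / d if d > 0 else 1.0 for n, d in zip(num, den)]
```

Frames outside the listed sections are ignored, so voices elsewhere in the mixture do not bias the estimate.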

To estimate the temporal change in volume g_t(t), the amplitudes of M(ω, t) and of g_ω(ω, t) B'(ω, t), the known spectrum after frequency-characteristic correction, are compared at each time. However, M(ω, t) contains components derived from s(t) in addition to those derived from B'(ω, t). The frequency axis ω is therefore divided into multiple frequency bands, and for each band φ (φ ∈ Φ, where Φ denotes the set of bands)

$$g'_t(\phi,t) = \frac{\int_{\omega\in\phi} M(\omega,t)\,d\omega}{\int_{\omega\in\phi} g_\omega(\omega,t)\,B'(\omega,t)\,d\omega} \tag{10}$$

is obtained. Any division may be used for Φ; for example, the axis may be divided into one-octave bands of the equal-tempered scale used in music (equal intervals on a logarithmic frequency axis). g_t(t) is then estimated either as min(g'_t(φ, t)), the minimum over the bands, or by Equation 11.

Finally, g_t(t) is smoothed along the time axis. The smoothing width can be set arbitrarily, and smoothing may be omitted.
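The band-wise volume estimate can be sketched as follows, taking the minimum ratio over octave bands so that bands dominated by voices or other sounds do not inflate the result. The per-band ratio of summed amplitudes, the channel-index octave split, and the names `octave_bands` and `estimate_volume_gain` are assumptions of this sketch.

```python
def octave_bands(n_channels, first_channel=1):
    """Split channel indices into octave-wide bands [lo, hi), skipping DC."""
    bands = []
    lo = first_channel
    while lo < n_channels:
        hi = min(2 * lo, n_channels)
        bands.append((lo, hi))
        lo = hi
    return bands


def estimate_volume_gain(m_frame, b_frame, g_w, bands):
    """g_t for one frame: minimum over bands of mixture / corrected known sound."""
    ratios = []
    for lo, hi in bands:
        num = sum(m_frame[lo:hi])
        den = sum(g_w[w] * b_frame[w] for w in range(lo, hi))
        if den > 0:  # skip bands where the known sound is silent
            ratios.append(num / den)
    return min(ratios) if ratios else 0.0
```

For example, if the BGM is mixed at half volume and a voice adds energy only in channels 4 to 7, the band covering those channels yields a large ratio, while the remaining bands yield 0.5, which the minimum selects.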

・In principle, r(t) is estimated using the set Ψ of BGM sections ψ so that the time axes of M(ω, t) and B(ω, t) are aligned in those sections. In this program, the set Ψ of BGM sections used to estimate r(t) is limited to a single element, and r(t) is supported only as a constant.
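A constant offset of this kind could be found by an exhaustive search over candidate shifts within one BGM section, as in the sketch below. The squared-difference criterion, the search range `max_lag`, and the name `estimate_offset` are assumptions of this example; the text does not state how the program measures the alignment.

```python
def estimate_offset(M, B_known, section, max_lag):
    """Constant frame offset r that best aligns B_known[t + r] with M[t].

    section: (start, end) frame range of one BGM section in M, end exclusive.
    Candidate offsets that would read outside B_known are skipped.
    """
    start, end = section
    best_r, best_err = None, float("inf")
    for r in range(-max_lag, max_lag + 1):
        if start + r < 0 or end + r > len(B_known):
            continue
        err = sum((m - b) ** 2
                  for t in range(start, end)
                  for m, b in zip(M[t], B_known[t + r]))
        if err < best_err:
            best_r, best_err = r, err
    return best_r
```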

The estimation of g(ω, t), r(t), and so on described above uses the set Ψ of BGM sections ψ. This set may be specified manually by the user or estimated automatically as follows. Automatic estimation of Ψ basically starts from a single BGM section ψ_1 and uses it as a clue to find the remaining BGM sections. First, the parameter functions of B(ω, t) are estimated from ψ_1 and provisionally fixed. The distance between the amplitude spectra M(ω, t) and B(ω, t) within ψ_1 is then computed, and a constant multiple of its maximum value is taken as the BGM section determination threshold (this program uses the maximum, but an implementation using the mean is also possible). Next, the distance between the amplitude spectra M(ω, t) and B(ω, t) is computed over the entire signal, and sections whose distance is at or below the threshold are detected and added to Ψ. This program does not repeat the estimation, but Ψ may also be obtained by repeating it. As the distance between M(ω, t) and B(ω, t), the root-mean-square log spectral distance, for example, is effective. The first section ψ_1 is specified manually by the user.
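The detection procedure can be sketched per frame as follows. The distance here is one common form of the root-mean-square log spectral distance (natural logarithm, mean over channels, with a small floor to avoid log of zero); the exact definition used by the program is not given in the text, and the function names and the `factor` parameter are hypothetical.

```python
import math


def log_spectral_distance(m_frame, b_frame, floor=1e-10):
    """RMS of the per-channel log-amplitude difference between two spectra."""
    diffs = [(math.log(max(m, floor)) - math.log(max(b, floor))) ** 2
             for m, b in zip(m_frame, b_frame)]
    return math.sqrt(sum(diffs) / len(diffs))


def detect_bgm_frames(M, B, seed_section, factor):
    """Frames whose distance is within `factor` times the seed section's maximum.

    M: mixed amplitude spectra; B: corrected known spectra, frame-aligned with M.
    seed_section: (start, end) of the user-specified first BGM section, end exclusive.
    Returns (list of frame indices, threshold).
    """
    start, end = seed_section
    dist = [log_spectral_distance(m, b) for m, b in zip(M, B)]
    threshold = factor * max(dist[start:end])
    return [t for t, d in enumerate(dist) if d <= threshold], threshold
```

Consecutive detected frames can then be merged into sections and added to Ψ.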

(Estimating expansion and contraction of the known sound)
In this embodiment, the following deformation can also be handled by extending the framework of Equations 5 and 6.

・Expansion and contraction along the time or frequency axis
When the known acoustic signal b'(t) plays in the mixed sound m(t), it may be stretched or compressed along the time or frequency axis, for example because of a difference in the rotational speed of a record. It is then necessary to correct b'(t) by stretching or compressing it along the time or frequency axis before subtracting the known acoustic signal from the mixed sound.

To handle this, the second expression in Equation 5 is defined as follows.

$$B(\omega,t) = a(t)\,g(\omega,t)\,B'\bigl(p(\omega),\,q(t) + r(t)\bigr) \tag{13}$$

The parameter functions p(ω) and q(t) in the above equation are described below.

・p(ω) is a function for correcting expansion and contraction along the frequency axis. By transforming the frequency axis ω of the amplitude spectrum B'(ω, t), it allows linear or nonlinear stretching along the frequency axis. B'(ω, t) takes the value 0 outside the original domain of ω, and is interpolated as appropriate in a discretized implementation.

・q(t) is a function for correcting expansion and contraction along the time axis. By transforming the time axis t of the amplitude spectrum B'(ω, t), it allows linear or nonlinear stretching along the time axis. B'(ω, t) takes the value 0 outside the original domain of t, and is interpolated as appropriate in a discretized implementation. q(t) and r(t) could be expressed as a single combined function, but here q(t) is intended to represent continuous expansion and contraction, while r(t) is intended to represent discontinuous position shifts.

A method for automatically estimating the shapes of the parameter functions p(ω) and q(t) newly introduced in Equation 13 is described below.

・p(ω) and q(t) are estimated by varying them so that the distance between M(ω, t) and B(ω, t) (for example, the log spectral distance) is minimized. In doing so, a(t) = 1 is assumed on the right-hand side of B(ω, t) = a(t) g(ω, t) B'(p(ω), q(t) + r(t)), and the following two steps are repeated iteratively to estimate appropriate p(ω) and q(t):
1. With the current estimates of p(ω) and q(t) provisionally fixed, estimate g(ω, t) and r(t).
2. With the current estimates of g(ω, t) and r(t) provisionally fixed, estimate p(ω) and q(t).
Rather than being run over the whole acoustic signal at once, this is best done piecewise by dividing the time axis into segments. Initial values are chosen taking into account continuity with the adjacent segments. It is also advisable to use the set Ψ of BGM sections ψ and estimate p(ω) and q(t) so that the time axes of M(ω, t) and B(ω, t) are aligned across those sections. When parts of the known acoustic signal b'(t) are unused, so that it is mixed in with gaps, r(t) is made a discontinuous function that skips those parts.
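Step 2 of this iteration can be illustrated for the frequency axis alone with a deliberately simple model: a linear stretch p(ω) = αω chosen by grid search, with g and r held fixed and folded out of the picture. This is a sketch under those assumptions, not the program's method; linear interpolation, the squared-error criterion, and the names `warp_frequency` and `estimate_linear_stretch` are all choices made for this example.

```python
def warp_frequency(b_frame, p):
    """Evaluate B'(p(w)) on the channel grid with linear interpolation.

    Values are taken as 0 outside the original channel range, as described
    for p(w) in the text.
    """
    n = len(b_frame)
    out = []
    for w in range(n):
        src = p(w)
        if src < 0 or src > n - 1:
            out.append(0.0)
            continue
        lo = int(src)
        hi = min(lo + 1, n - 1)
        frac = src - lo
        out.append((1 - frac) * b_frame[lo] + frac * b_frame[hi])
    return out


def estimate_linear_stretch(m_frame, b_frame, candidates):
    """Pick alpha in `candidates` so that B'(alpha * w) is closest to M(w)."""
    best_alpha, best_err = None, float("inf")
    for alpha in candidates:
        warped = warp_frequency(b_frame, lambda w: alpha * w)
        err = sum((m - b) ** 2 for m, b in zip(m_frame, warped))
        if err < best_err:
            best_alpha, best_err = alpha, err
    return best_alpha
```

In a full alternation, the frequency response and offset would be re-estimated after each update of α, and the two steps repeated until the distance stops decreasing.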

In automatically estimating the set Ψ of BGM sections ψ, this program requires the user to specify the first section ψ_1 manually. Alternatively, the time axis of the acoustic signals can be divided finely and the correspondence between the resulting short segments examined in order to find it.

(Handling multiple known acoustic signals, etc.)
This program handles the case in which the mixed acoustic signal m(t) contains one kind of known acoustic signal b'(t). When it contains several, b'_1(t), b'_2(t), ..., b'_N(t), the processing can be extended to obtain S(ω, t) as

$$S(\omega,t) = \begin{cases} c(\omega,t)\,\Bigl\{M(\omega,t) - \sum_{n=1}^{N} B_n(\omega,t)\Bigr\} & \text{if } M(\omega,t) > \sum_{n=1}^{N} B_n(\omega,t) \\ 0 & \text{otherwise} \end{cases} \tag{14}$$

using B_1(ω, t), B_2(ω, t), ..., B_N(ω, t) obtained from their amplitude spectra B'_1(ω, t), B'_2(ω, t), ..., B'_N(ω, t) by the second expression in Equation 5. In that case, the parameter functions of each B_n(ω, t) are either set one after another, or set in parallel for several B_n(ω, t) while keeping the overall balance.
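The extension to several known signals amounts to subtracting the sum of the corrected spectra. The short sketch below assumes, as an illustration only, that the result is clamped at zero and that each B_n has already been corrected by its own parameter functions; `subtract_multiple` is a hypothetical name.

```python
def subtract_multiple(m_frame, corrected_frames, c=1.0):
    """One frame of S when N known signals are present.

    corrected_frames: list of N frames B_n, each already corrected for
    offset, frequency response, and volume.
    """
    total = [sum(channel) for channel in zip(*corrected_frames)]
    return [c * max(m - b, 0.0) for m, b in zip(m_frame, total)]
```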

Although this program targets monaural signals, a stereo signal may be processed by mixing the left and right channels down to a monaural signal, or by applying the program to each of the left and right channels separately. The program may also be extended to make use of the sound-source direction in the stereo signal.

(Experimental results)
Experimental results for the sound removal engine program of this embodiment are shown below. Here, when a mixed acoustic signal m(t) was observed in which an acoustic signal b(t) such as BGM had been added to an acoustic signal s(t) such as voices or other sounds, the unknown s(t) was obtained under the condition that the acoustic signal b'(t) of the sound source underlying b(t) was known. Given audio files containing m(t) and b'(t), an audio file of s(t) is obtained.

In experiments on mixed sounds in which background music (BGM) had been added to human speech, it was confirmed that the BGM in the mixed sound could be removed using the acoustic signal of the original BGM piece, leaving the human voices and other sounds.

As an example, FIGS. 4(a) to 4(f) show the result of actually processing a mixed sound in which classical music plays as BGM behind a dialogue between a man and a woman. With the mixed sound m(t) shown in FIGS. 4(a) and 4(b) as input, the BGM component was removed using the known acoustic signal b'(t) of the original sound source shown in FIGS. 4(c) and 4(d), yielding the acoustic signal s(t) after removal of the known acoustic signal shown in FIGS. 4(e) and 4(f).

Thus, given as input the acoustic signal of a television program, film, or the like in which BGM plays behind human voices and other sounds, the BGM in the program can be removed using a separately prepared music signal of the BGM, and an acoustic signal containing only the voices and other sounds can be obtained. Different music may then be added as BGM to the acoustic signal after BGM removal.

[Configuration of the sound removal engine]
The configuration of the sound removal engine program 100, which is based on the theory described above, is explained next. FIG. 5 is a block diagram showing the functions of the sound removal engine program 100.

As shown in FIG. 5, the sound removal engine program 100 has, as signal input means, a mixed sound input unit 101 that receives the mixed acoustic signal and a known acoustic signal input unit 102 that receives the known acoustic signal to be removed. As output means for the acoustic signal after removal processing, it has a post-removal acoustic signal output unit 107.

The sound removal engine program 100 also includes an amplitude spectrum extraction unit 200 that extracts an amplitude spectrum from an input acoustic signal. Specifically, the amplitude spectrum extraction unit 200 includes a data division unit 201, a window function processing unit 202, and a Fourier transform unit 203.

The data division unit 201 divides the mixed acoustic signal into sections of a specific length (the window size). In general speech recognition and the like, one section is about 20 ms long. Because the same sound tends to last longer in music than in speech, this embodiment uses a section roughly ten times longer: 8192 samples, a power of two (8192 ÷ 48,000 ≈ 0.1707 s, about 170 ms).

The window function processing unit 202 multiplies the audio signal data of each window-size section (about 170 ms) produced by the data division unit 201 by a Hanning function, converting it into a waveform that tapers smoothly to zero at the beginning and end of the data.

The Fourier transform unit 203 Fourier-transforms the data of the mixed acoustic signal and of the known acoustic signal, and outputs the phase and the amplitude spectrum of each frequency channel separately. The data consisting only of the amplitude spectrum is output as "time-frequency data".

In detail, the Fourier transform unit 203 applies the fast Fourier transform (FFT) to the audio data processed with the Hanning function. The input audio data is purely real, with no imaginary part, whereas the FFT computes with complex input and output. The data of two windows is therefore placed in the real part and the imaginary part of the input, respectively, a single FFT is performed, and the two results are separated afterwards using the complex-conjugate relationship, which doubles the speed. This system also uses the SSE2 instructions available on processors such as the Intel Pentium 4 (registered trademark) to speed up processing.
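The packing of two real windows into one complex transform relies on a standard identity: if z = x + jy with x and y real, then X(k) = (Z(k) + conj(Z(N−k))) / 2 and Y(k) = (Z(k) − conj(Z(N−k))) / (2j). The sketch below demonstrates the identity with a naive DFT; it illustrates the principle only and says nothing about how the program's FFT is actually coded. The function names are assumptions.

```python
import cmath
import math


def dft(x):
    """Naive forward DFT (stand-in for an FFT)."""
    n = len(x)
    return [sum(x[m] * cmath.exp(-2j * math.pi * k * m / n) for m in range(n))
            for k in range(n)]


def two_real_dfts(x, y):
    """Spectra of two real sequences from a single complex transform."""
    n = len(x)
    Z = dft([complex(a, b) for a, b in zip(x, y)])
    X, Y = [], []
    for k in range(n):
        zc = Z[(n - k) % n].conjugate()
        X.append((Z[k] + zc) / 2)    # conjugate-symmetric part -> spectrum of x
        Y.append((Z[k] - zc) / 2j)   # conjugate-antisymmetric part -> spectrum of y
    return X, Y
```

Only one transform is computed for the two windows, which is where the roughly twofold saving comes from; the separation step is linear in N.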

The amplitude spectrum extraction unit 200 then moves the section to be Fourier-transformed in steps of 480 samples (480 ÷ 48,000 = 0.01 s, i.e., 10 ms), repeating the Hanning-window multiplication and the Fourier transform by the window function processing unit 202 and the Fourier transform unit 203. From the data obtained every 10 ms in this way, "time-frequency data", which represents only the amplitude of the audio signal for each frequency channel, is acquired. The resulting frequency channels are 0 Hz, 5.86 Hz, 11.72 Hz, 17.58 Hz, ..., 23,994.14 Hz: 4096 channels spaced about 5.86 Hz apart, from 0 Hz (DC) to about 24 kHz.
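The figures quoted for this embodiment follow directly from the sampling frequency, window width, and frame shift. The small helper below, whose name `stft_layout` is an assumption for this example, reproduces the arithmetic.

```python
def stft_layout(sample_rate, window, hop):
    """Derived timing and frequency-channel figures for an STFT configuration."""
    bin_hz = sample_rate / window      # spacing between frequency channels
    n_channels = window // 2           # channels from DC up to just below Nyquist
    return {
        "window_ms": 1000.0 * window / sample_rate,
        "hop_ms": 1000.0 * hop / sample_rate,
        "bin_hz": bin_hz,
        "n_channels": n_channels,
        "last_channel_hz": (n_channels - 1) * bin_hz,
    }


layout = stft_layout(48_000, 8192, 480)
```

For these values the window lasts about 170.7 ms, the frame shift is 10 ms, the channel spacing is 5.859375 Hz, and the highest of the 4096 channels lies at 23,994.140625 Hz.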

When the input signal is a mixed acoustic signal, the amplitude spectrum extraction unit 200 functions as a mixed acoustic amplitude extraction unit that extracts a mixed acoustic amplitude spectrum from it; when the input signal is the known acoustic signal to be removed, it functions as a known acoustic amplitude extraction unit that extracts a known acoustic amplitude spectrum from it.

The sound removal engine program 100 also includes a parameter estimation unit 300 that automatically estimates changes of the known sound within the mixed sound on the basis of the amplitude spectrum of the known sound extracted by the amplitude spectrum extraction unit 200, and that allows the automatic estimation result to be corrected by user operation.

On the basis of the per-frequency-channel "time-frequency data" extracted by the amplitude spectrum extraction unit 200, the parameter estimation unit 300 either automatically estimates the shapes of all the parameter functions of Equations 5 and 6 described above, namely a(t), g(ω, t) (g_ω(ω, t), g_t(t), g_r(t)), p(ω), q(t), r(t), and c(ω, t), or sets them according to user operation. The user may draw and specify an arbitrary function shape from the start, or may run automatic estimation first and then correct the result.

The parameter estimation unit 300 includes a calibration unit 304 that calibrates the mixed sound against the known sound. Using a section that contains almost none of the acoustic signal s(t) of human voices and other sounds (a BGM section), the calibration unit 304 automatically estimates part of the shapes of the parameter functions g(ω, t) (g_ω(ω, t), g_t(t)) and r(t) of Equations 5 and 6.

Specifically, the user manually selects a section of one to several seconds of the mixed sound in which only the known sound is heard, and selects approximately the same portion of the known sound. The calibration unit 304 then sums the volume of each frequency channel over the selected section and compares the values obtained from the mixed sound and from the sound to be removed. Because the mixed sound in this section contains only the known sound, their ratio gives the frequency characteristic.

The parameter estimation unit 300 further includes a frequency characteristic change correction unit 301, a volume change correction unit 302, and a time position correction unit 303, which estimate (1) the temporal position shift between the mixed sound and the known sound, (2) the frequency characteristics of the known sound, and (3) the temporal change in the volume of the known sound.

The frequency characteristic change correction unit 301 is a module that estimates the frequency distribution. In estimating it, the unit changes the shape in the ω direction of c(ω, t), the function of arbitrary shape for equalizing and fader operations on the amplitude spectrum, and thereby adjusts the frequency characteristics after removal of the known acoustic signal in the manner of a graphic equalizer.

The frequency characteristic change correction unit 301 also smooths the frequency characteristics, because in portions of the audio channel where the BGM volume is low, the values obtained become unstable due to noise and the like.

In detail, estimation is impossible for a frequency channel in which the known sound has no energy, so the frequency distribution is preferably estimated from a so-called "rich" portion in which the known sound contains everything from low to high tones.

However, with channels subdivided as finely as 4096, it is close to impossible for every channel to contain sound. In addition, when the known sound is quiet in a channel while the mixed sound contains noise, the division can yield an unreasonable estimate.

To deal with this, this embodiment smooths the frequency distribution. The smoothing is achieved by averaging each channel with its neighboring channels. In this embodiment, the number of channels averaged is set with the slider "SmoothingFreq.Weight" of the editor 400; the larger the value, the smoother the result.
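Averaging each channel with its neighbors can be sketched as a simple moving average along the frequency axis. How the slider value maps to a number of channels is not stated, so the `half_width` parameter and the name `smooth_channels` are assumptions of this example, as is the shortened window at the edges.

```python
def smooth_channels(values, half_width):
    """Moving average over frequency channels.

    Each channel is replaced by the mean of itself and up to `half_width`
    neighbors on either side; the window is shortened at the edges.
    """
    n = len(values)
    out = []
    for k in range(n):
        lo = max(0, k - half_width)
        hi = min(n, k + half_width + 1)
        out.append(sum(values[lo:hi]) / (hi - lo))
    return out
```

A single outlier such as an unreasonable ratio in one channel is spread over its neighbors and reduced accordingly; a larger `half_width` flattens it further.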

また、本実施形態では、別の平滑化の機構（Blur_freq_mode）が、エディタ４００のボタン「BAFAM」により実装されている。音がない周波数チャンネルでは推定ができず、隣接する周波数チャンネルに比べて極端に落ち込んでいる場合があるため、隣接した周波数チャンネルと同じ値に持ち上げることにより、こうした予測不能の場合を回避することができる。 In this embodiment, another smoothing mechanism (Blur_freq_mode) is implemented by the button “BAFAM” of the editor 400. The frequency channel without sound cannot be estimated, and may be extremely depressed compared to the adjacent frequency channel. By raising the value to the same value as the adjacent frequency channel, it is possible to avoid such an unpredictable case. it can.

Furthermore, to keep frequency channels that are exactly zero from affecting the estimation, this embodiment provides the slider "ShiftFreqWeight" in the editor 400.

The frequency characteristic change correction unit 301 according to this embodiment also has a stationary sound processing function that ignores stationary noise at specific frequencies, such as hiss from analog tape recorders and various beat tones, when performing its processing. If the mixed sound contains a stationary sound such as the horizontal frequency (15.75 kHz) contained in video data, the estimate is badly wrong in the surrounding frequency channels. Because a stationary sound is entirely absent from the known sound, it causes the frequency characteristic to be misestimated.

The volume change correction unit 302 estimates and smooths the temporal change of the volume. In this estimation, correcting the shape of c(ω, t) in the t direction makes it possible to adjust the volume change after removal of the known acoustic signal, much like operating a volume fader on a mixer.

More specifically, when estimating the volume along the time axis, if the mixed sound at a given time contains components across all frequency regions, the known sound tends to be estimated as louder than it actually is. Simply subtracting in this case removes sound that should not be removed, and the result may sound "thin".

In this embodiment, the volume change correction unit 302 detects the temporal change of the known sound's volume over the entire duration of the mixed sound. Because the mixed sound contains speech and other sounds in addition to the known sound, the frequency channels of the mixed sound and of the known sound corrected by the frequency characteristic are grouped and summed octave by octave (each doubling of frequency). The sums are compared at each time, and the octave with the smallest ratio of mixed sound to known sound is selected. This reflects the possibility that, in at least one octave band, only the known sound is present. The selected ratio is taken as the volume ratio between the known sound and the mixed sound at that time.
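The octave-by-octave comparison can be sketched for a single time frame as follows. The names `octave_bands` and `volume_ratio`, the exclusion of channel 0, and the rule for skipping silent bands are assumptions made for illustration; the text does not specify these details.

```python
from typing import List, Tuple


def octave_bands(num_channels: int) -> List[Tuple[int, int]]:
    """Split channel indices 1..num_channels-1 into bands [lo, 2*lo)."""
    bands = []
    lo = 1
    while lo < num_channels:
        hi = min(2 * lo, num_channels)
        bands.append((lo, hi))
        lo = hi
    return bands


def volume_ratio(mixed: List[float], known: List[float], eps: float = 1e-12) -> float:
    """Smallest mixed/known ratio over the octave bands of one time frame."""
    ratios = []
    for lo, hi in octave_bands(len(mixed)):
        known_sum = sum(known[lo:hi])
        if known_sum > eps:  # skip bands where the known sound is silent
            ratios.append(sum(mixed[lo:hi]) / known_sum)
    return min(ratios) if ratios else 0.0


# Known sound at half level everywhere, plus extra "speech" in channels 2-3.
known = [1.0] * 8
mixed = [0.5, 0.5, 2.5, 2.5, 0.5, 0.5, 0.5, 0.5]
print(volume_ratio(mixed, known))
```

Taking the minimum means the band least contaminated by other sounds determines the estimate, which is the reasoning given in the paragraph above.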

In this embodiment, the user identifies from the graph display any point where the volume is clearly too large and corrects it manually. An automatic method, such as a robust statistical technique, may be used for this estimation instead.

The volume change correction unit 302 also smooths the estimated temporal change by averaging the known sound's volume over neighboring times. The averaging width is implemented as the slider "SmoothingTimeWeight" in the editor 400; increasing this value gives a smoother result.

This embodiment also implements another smoothing mechanism (Blur_time_mode), enabled by the "BATAM" button. When a time with no sound cannot be estimated and shows an extreme dip relative to neighboring times, it is raised to the same value as the adjacent time, which avoids such unestimable cases.

At times when the known sound is silent, the volume would be estimated as zero; the slider "ShiftTimeWeight" in the editor 400 is provided to prevent this.

Furthermore, regarding the relative volume of the mixed sound and the sound after removal: when only part of a given audio file is processed, the two volumes must be kept identical. In this embodiment, therefore, a single tone (480 Hz) of constant amplitude is synthesized and input as the mixed sound, removal is performed with the subtraction amount of the known sound set to zero, the difference in output volume is measured, and the setting is adjusted so that the two values match.

The time position correction unit 303 estimates the temporal displacement between the start point of the mixed sound and the start point of the known sound, and corrects a constant displacement by setting the constant of the function r(t) described above. More specifically, when the designated mixed sound and the known sound are displaced in time, the known sound is shifted relative to the mixed sound in 1-millisecond steps, forward and backward by up to 100 milliseconds, and the processing by the correction units 301 and 302 is repeated for each shift. The volume difference between the mixed sound and the known sound in the designated section is computed, and the shift that minimizes this difference is judged to be the best match and taken as the temporal displacement between the mixed sound and the known sound.
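The search over shifts can be sketched as a brute-force loop. This is a simplified illustration under stated assumptions: the "volume difference" is replaced here by a mean absolute level difference computed directly on samples, whereas the text re-runs the correction units 301 and 302 for every shift; the function names and the sign convention (a positive offset means the known sound appears later in the mix) are hypothetical.

```python
from typing import List


def volume_difference(mixed: List[float], known: List[float], shift: int) -> float:
    """Mean absolute level difference when `known` is delayed by `shift` samples."""
    total, count = 0.0, 0
    for i, m in enumerate(mixed):
        j = i - shift
        if 0 <= j < len(known):
            total += abs(abs(m) - abs(known[j]))
            count += 1
    return total / count if count else float("inf")


def estimate_offset_ms(mixed: List[float], known: List[float],
                       sample_rate: int = 48000,
                       max_ms: int = 100, step_ms: int = 1) -> int:
    """Try every shift from -max_ms to +max_ms and keep the best match."""
    samples_per_ms = sample_rate // 1000
    best_ms, best_diff = 0, float("inf")
    for ms in range(-max_ms, max_ms + 1, step_ms):
        diff = volume_difference(mixed, known, ms * samples_per_ms)
        if diff < best_diff:
            best_ms, best_diff = ms, diff
    return best_ms


# Toy data at 1 kHz (one sample per millisecond): the mix is the known
# sound delayed by 7 samples.
known = [((i * i * 7919) % 1009) / 1009.0 for i in range(400)]
mixed = [0.0] * 7 + known[:-7]
print(estimate_offset_ms(mixed, known, sample_rate=1000))
```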

In this system, the user can either have the displacement estimated automatically by the above procedure or specify a predetermined displacement and skip automatic estimation. Alternatively, the mixed sound and the known sound may be output from separate left and right speakers so that the user can compare them by ear and align them by hearing.

The sound removal engine program 100 further includes a removal processing unit 104 that removes the known acoustic amplitude spectrum from the mixed acoustic amplitude spectrum extracted by the amplitude spectrum extraction unit 200, an inverse Fourier transform unit 105 that restores the sound after removal by inverse Fourier transform, and an arrangement processing unit 106.

The removal processing unit 104 transforms the known sound according to the estimation data generated by the parameter estimation unit 300 and removes the transformed signal from the "time-frequency data" of the mixed sound. For this removal, the present embodiment also implements a "phase-independent subtraction algorithm" using the simulation unit 14. That is, using the "phase-independent subtraction function" described above, a simulation is run on the assumption that the phase is uniformly distributed from 0 to 360 degrees, and the erasure strength is set automatically so as to match the simulation result.

The inverse Fourier transform unit 105 uses the "time-frequency data" obtained by the subtraction, together with the phase data of the mixed acoustic signal, to restore by inverse Fourier transform audio data from which the known sound has been removed. Specifically, the inverse Fourier transform unit 105 obtains Xs(ω, t) from the amplitude spectrum S(ω, t) computed by the sound removal engine program 100 and the phase θm(ω, t) of the mixed sound m(t), and applies the inverse Fourier transform (IFFT) to it to obtain a unit waveform.

Here, the frequency channel data after subtraction at each time is inverse Fourier transformed, and the phase of each channel is set to the same value as the phase of the known sound or the mixed sound before removal. This preserves the phase of the audio before removal and prevents crackling noise from occurring at each section. The IFFT is accelerated with the same technique used when creating the time-frequency data.
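Reusing the mixed sound's phase can be sketched for one frame as Xs = S · exp(jθm) followed by an inverse transform. The sketch below uses a naive O(N²) DFT from the standard library purely for readability; the text uses an FFT/IFFT, and the function names here are assumptions.

```python
import cmath
from typing import List


def dft(x: List[float]) -> List[complex]:
    """Naive discrete Fourier transform (stand-in for the FFT)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spectrum: List[complex]) -> List[float]:
    """Naive inverse DFT, keeping only the real part of each sample."""
    n = len(spectrum)
    return [(sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n)
                 for k in range(n)) / n).real
            for t in range(n)]


def rebuild_frame(amplitude: List[float], mixed_spectrum: List[complex]) -> List[float]:
    """Combine the subtracted amplitude S with the phase of the mixed frame."""
    xs = [a * cmath.exp(1j * cmath.phase(m))
          for a, m in zip(amplitude, mixed_spectrum)]
    return idft(xs)


frame = [0.0, 1.0, 0.5, -0.25, -1.0, 0.3, 0.2, -0.6]
spectrum = dft(frame)
# Subtracting nothing (amplitude unchanged) should give the frame back.
restored = rebuild_frame([abs(c) for c in spectrum], spectrum)
print(max(abs(a - b) for a, b in zip(frame, restored)))
```

Because only the amplitude is modified, a frame whose amplitude is left untouched is reproduced up to rounding error, which is a quick sanity check for this step.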

The arrangement processing unit 106 takes the audio at each time, which has a width of 170 milliseconds (the width of the Hanning window), and superimposes these equal-width window outputs using the overlap-add method, finally restoring the audio from which the music has been removed.
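The overlap-add step can be sketched as summing equal-width frames into an output buffer, each placed one hop later than the previous one. The function names are assumptions, the window and hop sizes in the demonstration are deliberately tiny, and any gain normalization needed for the 8192-sample window and 480-sample hop of the text is left out.

```python
import math
from typing import List


def hann(n: int) -> List[float]:
    """Periodic Hann (Hanning) window of length n."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * i / n) for i in range(n)]


def overlap_add(frames: List[List[float]], hop: int) -> List[float]:
    """Sum equal-width frames into one signal, each placed `hop` samples later."""
    if not frames:
        return []
    width = len(frames[0])
    out = [0.0] * (hop * (len(frames) - 1) + width)
    for index, frame in enumerate(frames):
        start = index * hop
        for i, value in enumerate(frame):
            out[start + i] += value
    return out


# With a hop of half the window, overlapping Hann windows sum to 1,
# so a constant signal is reproduced away from the edges.
frames = [hann(8) for _ in range(5)]
print(overlap_add(frames, 4)[4:20])
```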

[Operation of the sound removal engine]
The sound removal engine program 100 having the above-described configuration operates as follows. FIG. 6(a) is a flowchart showing this operation.

As shown in the figure, when the data of the known sound (original music) and the mixed sound are input (step S301), the data dividing unit 201 first divides the mixed acoustic signal into sections of a specific length (window size). Here the length is 8192 samples, a power of two (8192 ÷ 48,000 ≈ 0.170 s, about 170 milliseconds).

Next, in step S302, time-frequency data is acquired. Specifically, the window function processing unit 202 multiplies the audio signal data of one window-size section (170 milliseconds) by a Hanning function (S302a), and the Fourier transform unit 203 performs a fast Fourier transform (FFT) (S302b). The section to be transformed is then advanced by 480 samples (480 ÷ 48,000 = 0.01 s, or 10 milliseconds) (S302c), and steps S302a to S302c are repeated in a loop.
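The framing loop of steps S302a and S302c can be sketched as below, using the window and hop sizes stated in the text. The function names are assumptions, and the FFT of step S302b is omitted; each returned frame would be passed to an FFT routine.

```python
import math
from typing import List

SAMPLE_RATE = 48_000
WINDOW_SIZE = 8192   # about 170 ms at 48 kHz
HOP_SIZE = 480       # 10 ms at 48 kHz


def hann(n: int) -> List[float]:
    """Periodic Hann (Hanning) window of length n."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * i / n) for i in range(n)]


def windowed_frames(signal: List[float], window_size: int, hop: int) -> List[List[float]]:
    """Cut the signal into overlapping sections and apply the window to each."""
    window = hann(window_size)
    frames = []
    for start in range(0, len(signal) - window_size + 1, hop):
        segment = signal[start:start + window_size]
        frames.append([s * w for s, w in zip(segment, window)])
    return frames


print(WINDOW_SIZE / SAMPLE_RATE)  # window length in seconds
print(HOP_SIZE / SAMPLE_RATE)     # hop length in seconds
print(len(windowed_frames([1.0] * SAMPLE_RATE, WINDOW_SIZE, HOP_SIZE)))  # frames in 1 s
```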

Various parameters are then estimated from the "time-frequency data" obtained every 10 milliseconds in this way. Specifically, in the parameter estimation unit 300, a section of one to several seconds in which only the music (BGM) is playing in the mixed sound (the "BGM section" described above) is selected based on a user operation, and calibration (S304) is performed. Approximately the same portion is selected in the original music.

Next, within this section, the values are summed for each frequency channel, the sums obtained from the mixed sound and from the sound to be removed (original music) are compared, the frequency characteristic is obtained from their ratio (S305), and smoothing is performed (S306).
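The per-channel ratio of step S305 can be sketched as follows, given amplitude spectra for the frames of the BGM-only section. The function name, the `eps` threshold, and the choice to mark unestimable channels with 0.0 are assumptions for illustration.

```python
from typing import List


def frequency_characteristic(mixed_frames: List[List[float]],
                             known_frames: List[List[float]],
                             eps: float = 1e-9) -> List[float]:
    """Per-channel ratio of summed mixed amplitude to summed known amplitude."""
    num_channels = len(mixed_frames[0])
    response = []
    for k in range(num_channels):
        mixed_sum = sum(frame[k] for frame in mixed_frames)
        known_sum = sum(frame[k] for frame in known_frames)
        # A silent known channel cannot be estimated; mark it as 0.0 here.
        response.append(mixed_sum / known_sum if known_sum > eps else 0.0)
    return response


# Two frames, three channels; the known sound is silent in the last channel.
mixed_frames = [[2.0, 1.0, 0.5], [2.0, 1.0, 0.5]]
known_frames = [[1.0, 2.0, 0.0], [1.0, 2.0, 0.0]]
print(frequency_characteristic(mixed_frames, known_frames))
```

Channels marked 0.0 are the ones that the smoothing of step S306 would later fill in from their neighbors.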

Thereafter, the temporal change of the BGM volume is detected over the entire duration of the mixed sound, the volume ratio of the BGM to the mixed sound is obtained for each frequency (S307), and on this basis it is determined whether the BGM in the mixed sound and the original music are displaced in time (S309).

If the BGM in the mixed sound designated in step S01 above and the original music are displaced in time, the original music is shifted relative to the mixed sound in 1-millisecond steps, forward and backward by up to 100 milliseconds (S310), and the processing from step S305 to step S308 is repeated. The volume difference in this section is computed, and the shift at which the difference is smallest is regarded as the best match; that shift is the temporal displacement between the BGM in the mixed sound and the original music.

After the displacement has been corrected, the removal processing unit 104 subtracts the estimated known sound from the mixed sound (S311), and the Fourier transform unit 203 applies the inverse Fourier transform to the frequency channel data after subtraction at each time (S312). The phase of each channel is set to the same value as the phase of the BGM in the mixed sound before removal. The inverse-transformed data of each window size is then superimposed by the overlap-add method in the arrangement processing unit (S313), the audio from which the music has been removed is finally restored, and the data is output (S314).

[Editor]
(Editor configuration)
Next, the editor 400, which is the GUI of the known sound removal system, will be described. FIG. 7 is an explanatory diagram showing the screen configuration of the editor 400.

As shown in the figure, the editor 400 includes a window 401 that displays the waveform of the MIX file, a window 402 that displays the waveform of the BGM file, a window 403 that displays the spectrum of the MIX file, a window 404 that displays the spectrum of the BGM file, a window 405 that displays the spectrum of the audio after removal and of the frequency-time characteristic, and a window 406 for displaying and manipulating the erasure strength and the like. Each of these windows functions as a range setting unit that sets the range of the mixed acoustic signal to be subjected to the removal processing.

The windows 401 and 402 display the waveforms of the MIX file and the BGM file. Moving the mouse up or down while holding the right button enlarges or reduces the display (down to enlarge, up to reduce), and moving it left or right while holding the right button scrolls horizontally. Moving the mouse left or right while holding the left button selects a section; the selected section changes color, and the start and end times of the selection are displayed. These operations can also be performed with the arrow keys of the keyboard.

In addition, the operation buttons 401a and 402a at the right end can be used to scroll along the time axis, to play or stop the selected section, and to display the spectrum of the selected section.

The windows 403 and 404 display the spectrum of the MIX file or the BGM file for the section selected in the window 401 or 402, and support almost the same operations as the windows 401 and 402. In this spectrum display, intensity is shown in 16 color levels, the horizontal axis represents time, and the vertical axis represents frequency.

In particular, when estimating the frequency distribution during calibration, the user selects in the window 403 a section in which only the BGM is audible and presses the "SR" button; the start and end positions of the selected section are then entered into the CalibrationStartTime (start of the BGM-only section) and CalibrationEndTime (end of the BGM-only section) fields of the text box section 407b. Values can also be typed directly into the text box section 407b.

The window 405 is a display unit that shows the frequency distribution (frequency characteristic and amplitude) of the acoustic signal at each time as lines or figures, and shows the signal strength (volume) at each frequency in a thermograph style by coloring the lines or figures in graded steps. Checking the radio button section 407e switches between the spectrum display after removal and the spectrum display of the frequency-time characteristic, as shown in FIG. 10. FIG. 10(a) is the spectrum display after removal, and FIG. 10(b) is the spectrum display with the time-frequency characteristic normalized. The window 405 supports the same mouse and keyboard operations as the windows 401 to 404 described above.

In the window 406, checking the radio button section 407d switches the display among the erasure strength curve (FIG. 9(a)), the frequency characteristic curve (FIG. 9(b)), the time characteristic curve (FIG. 9(c)), and the time-frequency characteristic (FIG. 9(d)). The window 406 functions as a setting unit that sets, by deforming the curve, the signal strength and the like of the known acoustic amplitude spectrum removed at the corresponding time. Moving the mouse to the right while holding the left button deforms the curve, so that the function curve can be adjusted finely and freely. Moving the mouse to the left does not change the curve.

A time code is written along the time axis of each of these windows. The time code is automatically adjusted so that 0 falls 15 seconds after the end time of an acoustic signal of a predetermined frequency (here, the 1 kHz tone of the color bars) detected in the mixed acoustic signal.

The editor 400 also has, in its lower part, two operation panels 407 and 408, on the left and right, for displaying files and setting various parameters.

The left operation panel 407 is provided with a window 407a that displays the MIX file, a text box section 407b for setting parameters, a check box section 407c, radio button sections 407d and 407e, and a button section 407f for executing removal. The right operation panel 408 includes a window 408a that displays the BGM file and a slider section 408b for setting parameters.

Dragging a MIX file or a BGM file onto the window 407a or 408a displays its waveform in the window 401 or 402; if the dragged file is a video file, the video is shown in the window itself. A slide bar is arranged below each of the windows 407a and 408a, and sliding it changes the playback start position of the file. In this embodiment, the section displayed as a waveform is set to 5 minutes from the playback start position. Below the slide bar are a play button, a pause button, a stop button, a volume adjustment bar, and a text box for entering the playback start position numerically.

As shown in FIG. 8(a), the text box section 407b includes "EraseRatio", a field for entering the erasure strength of the BGM to be removed; "Calibration Start Time" and "Calibration End Time", fields for entering the start and end positions of the BGM-only section of the MIX file used to estimate the frequency distribution; and "Offset Between Target and BGM", a field for entering the displacement between the start positions of the selected sections of the MIX file and the BGM file. For "EraseRatio", the basic value is 1, and values from 0.1 to 5 can be set. Setting "Offset Between Target and BGM" to -1 selects the mode in which the alignment is calculated automatically.
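The parameters of the text box section can be pictured as a small settings record with the stated range check. This is only an illustrative sketch: the class name `EraseSettings`, the field names, and the use of seconds for the positions are assumptions; only the default of 1, the range 0.1 to 5, and the -1 convention for automatic alignment come from the text.

```python
from dataclasses import dataclass

AUTO_OFFSET = -1.0  # value the text gives for the automatic alignment mode


@dataclass
class EraseSettings:
    erase_ratio: float = 1.0        # "EraseRatio", basic value 1
    calibration_start: float = 0.0  # start of the BGM-only section (unit assumed)
    calibration_end: float = 0.0    # end of the BGM-only section (unit assumed)
    offset: float = AUTO_OFFSET     # "Offset Between Target and BGM"

    def validate(self) -> None:
        """Reject values outside the ranges described for the text boxes."""
        if not 0.1 <= self.erase_ratio <= 5.0:
            raise ValueError("EraseRatio must be between 0.1 and 5")
        if self.calibration_end < self.calibration_start:
            raise ValueError("calibration section ends before it starts")

    @property
    def auto_align(self) -> bool:
        """True when the offset field requests automatic alignment."""
        return self.offset == AUTO_OFFSET


settings = EraseSettings(erase_ratio=1.5, calibration_start=12.0, calibration_end=14.5)
settings.validate()
print(settings.auto_align)
```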

In the check box section 407c, as shown in FIG. 8(b), the user can select the following modes: the re-estimation mode "PEM", which improves the estimation accuracy of the frequency distribution by automatically finding other BGM-only portions and adding them to the data in addition to the section designated by the user; the frequency blur mode "BAFAM", which corrects a frequency channel whose estimate has come out erroneously small by estimating it from the values of adjacent channels; and the time-axis blur mode "BATAM", which corrects a time whose estimate has come out erroneously small by estimating it from the values at neighboring times.

In the radio button section 407d, as shown in FIG. 8(c), checking one of the mutually exclusive options switches the display in the window 406 among EraseRatio (strength curve), FreqWeight (frequency characteristic curve), TimeWeight (time characteristic curve), and BGMWeight (time-frequency characteristic curve), as shown in FIGS. 9(a) to 9(d). Checking "e" displays the strength curve, "f" the frequency characteristic curve, "t" the time characteristic curve, and "BGM" the time-frequency characteristic curve in the window 406, and the parameters can be reset on each of these displays.

In the radio button section 407e, as shown in FIG. 8(d), checking one of the mutually exclusive options switches the display in the window 405 between the spectrum after removal and the spectrum of the time-frequency characteristic, as shown in FIGS. 10(a) and 10(b). That is, checking "Res" displays the spectrum of the data after removal in the window 405, and checking "BGM" displays the spectrum of the time-frequency characteristic.

In the slider section 408b, sliding a bar left or right changes the bias amount of the frequency characteristic curve with "Shift Freq Weight", the bias amount of the time characteristic curve with "Shift Time Weight", the bias amount of the time-frequency characteristic with "Shift Global Weight", and the smoothing width in the frequency direction with "Smoothing Time Weight". In this embodiment, changing any of these parameters resets the TimeWeight setting.

In the button section 407f, clicking the "DEL_Music" button executes the music removal processing, and clicking the "RDEL_Music" button, after music has been removed, executes the music removal processing again with the readjusted characteristics.

(Modified example of the editor)
The editor 400, which is the GUI described above, can also take the following form. FIG. 11 is a configuration diagram showing a modified example of the editor 400.

Like the editor 400 described above, the editor according to this modified example calls some of the functions of the sound removal engine program 100 in response to user operations, so that the user can manually set the shapes of all the parameter functions a(t), g(ω, t) (gω(ω, t), gt(t), gr(t)), p(ω), q(ω), r(t), and c(ω, t) in Equations 5 and 6 above. The user of the editor may draw an arbitrary function shape from scratch, or may first run the automatic estimation and then modify the result.

This editor consists broadly of three sub-windows: a sub-window W1 for operating the mixed acoustic signal m(t), a sub-window W2 for operating the known acoustic signal b'(t), and a sub-window W3 for operating the desired acoustic signal s(t) after removal of the known acoustic signal. When there are a plurality of known acoustic signals b'(t), the changeover switch W2S selects which known acoustic signal b'(t) is operated in the sub-window W2.

First, the functions common to all sub-windows are described. The operation range slider P1 indicates which part of the acoustic signal is currently displayed. The cursor P2 indicates the position on the time axis of the current operation target. Pressing the iconify (fold) button P3 temporarily folds the sub-window to which the button belongs and makes it small; hiding unused sub-windows other than the current operation target makes effective use of a narrow screen. Pressing the float (enlarge) button P4 temporarily detaches the sub-window to which the button belongs from the parent window (floats it) and enlarges it, which makes operation and editing easier. Where only a float (enlarge) button P4 is drawn, pressing it makes the associated sub-window appear as a floating window. The playback control panel P5 contains a group of buttons for playing, stopping, fast-forwarding, and rewinding the acoustic signal so that a person can listen and check it.

The sub-windows W1, W2, and W3 display a graph E1 of the power of the mixed acoustic signal m(t) and a graph E2 of its amplitude spectrum M(ω, t), a graph E3 of the power of the known acoustic signal b'(t) and a graph E4 of its amplitude spectrum B'(ω, t), and a graph E5 of the power of the acoustic signal s(t) after removal of the known acoustic signal and a graph E6 of its amplitude spectrum S(ω, t). In each amplitude spectrum graph, the amplitude is drawn as shading on the left (horizontal axis: time, vertical axis: frequency), and the amplitude at the cursor position is drawn on the right (horizontal axis: power, vertical axis: frequency).

The sub-window W2 for operating the known acoustic signal b'(t) is the central window for operation; in it, the shapes of all the parameter functions a(t), g(ω, t) (gω(ω, t), gt(t), gr(t)), p(ω), q(t), and r(t) in Equations 5 and 6 can be set freely. Each operation panel is described below.

１．周波数特性の時間変化の補正用操作パネルＣ１（Ｅ７の右側）
ｇω（ω，ｔ）を表示・操作するためのパネルで、カーソル位置の時刻ｔでのｇω（ω，ｔ）が描かれている（横軸が大きさ、縦軸が周波数軸）。設定操作結果は、ｇ（ω，ｔ）の表示パネルＥ７に即座に反映される。Ｅ７には、濃淡でｇ（ω，ｔ）の値の大きさが描かれている（横軸が時間軸、縦軸が周波数軸）。 1. Operation panel C1 for correcting the time variation of the frequency characteristics (right side of E7)
On the panel for displaying and operating gω (ω, t), gω (ω, t) at time t at the cursor position is drawn (the horizontal axis is the size and the vertical axis is the frequency axis). The setting operation result is immediately reflected on the display panel E7 of g (ω, t). In E7, the magnitude of the value of g (ω, t) is depicted in shading (the horizontal axis is the time axis, and the vertical axis is the frequency axis).

2. Operation panel C2 for correcting the time variation of the volume (below E7)
This panel displays and edits gt(t). Any setting made here is immediately reflected in the display panel E7 for g(ω, t).

3. Operation panel C3 for raising the value of g(ω, t) as a whole (below E7)
This panel displays and edits gr(t). Any setting made here is immediately reflected in the display panel E7 for g(ω, t).

4. Operation panel C4 for making the final adjustment to the amount by which the component corresponding to the amplitude spectrum of the known acoustic signal is subtracted from the amplitude spectrum of the mixed sound
This panel displays and edits a(t).

5. Operation panel C5 for correcting expansion and contraction along the frequency axis
This panel displays and edits p(ω).

6. Operation panel C6 for correcting expansion and contraction along the time axis
This panel displays and edits q(t).

7. Operation panel C7 for correcting the offset in temporal position
This panel displays and edits r(t).

In sub-window W3, used to operate on the acoustic signal s(t) after removal of the known acoustic signal, the shape of the parameter function c(ω, t) in Equation 5 can be set freely. Each operation panel is described below.

1. Graphic equalizer (GEQ) operation panel C8 (right side of E8)
This panel displays and edits the shape of c(ω, t) in the ω direction; it shows c(ω, t) at the cursor time t (horizontal axis: magnitude; vertical axis: frequency). Any setting made here is immediately reflected in the display panel E8 for c(ω, t). E8 shows the magnitude of c(ω, t) as shading (horizontal axis: time; vertical axis: frequency).

2. Volume fader operation panel C9 (below E8)
This panel displays and edits the shape of c(ω, t) in the t direction. Any setting made here is immediately reflected in the display panel E8 for c(ω, t).

In addition to reading and writing audio files, this editor can read and write the shapes of the various parameter functions to and from files, so that removal work can be interrupted and resumed. To give the user a fast response, the GUI, the signal processing, and the sound playback are preferably implemented as separate threads, which hides the latency of time-consuming operations such as signal processing.
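The thread split suggested above can be sketched as follows. This is only a minimal illustration in Python, assuming a single worker thread fed through a queue; the names `removal_worker`, `jobs`, and `results`, and the clamped per-bin subtraction standing in for the slow signal processing, are hypothetical and are not taken from the specification.

```python
import queue
import threading


def removal_worker(jobs, results):
    """Run removal jobs off the GUI thread; a None job shuts the worker down."""
    while True:
        job = jobs.get()
        if job is None:
            break
        job_id, mixed_amp, known_amp = job
        # Stand-in for the time-consuming signal processing (hypothetical):
        # per-bin amplitude subtraction, clamped at zero.
        removed = [max(m - b, 0.0) for m, b in zip(mixed_amp, known_amp)]
        results.put((job_id, removed))


jobs = queue.Queue()
results = queue.Queue()
worker = threading.Thread(target=removal_worker, args=(jobs, results), daemon=True)
worker.start()

# The GUI thread only enqueues work and returns to event handling at once.
jobs.put((1, [3.0, 2.0, 1.0], [1.0, 2.5, 0.0]))
jobs.put(None)  # ask the worker to stop once the queue is drained
worker.join()

job_id, spectrum = results.get()
```

In a real editor the GUI thread would poll `results` from its event loop instead of joining the worker, so that drawing and playback stay responsive while a job is running.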

[Computer-readable recording medium on which the program is recorded]
By installing the known sound removal program according to the above embodiment and its modifications on a computer such as a user terminal or a Web server, or on an IC chip, a device or system having the functions described above can be built easily. The program can be distributed, for example, over a communication line, and can also be supplied as a packaged application that runs on a stand-alone computer.

Such a program can be recorded on recording media 116 to 119 readable by a general-purpose computer 120, as shown in FIG. 12. Specifically, as shown in the figure, it can be recorded on various recording media, including magnetic recording media such as a flexible disk 116 and a cassette tape 119, optical disks such as a CD-ROM and a DVD-ROM 117, and a RAM card 118.

With a computer-readable recording medium on which this program is recorded, the content display system and method described above can be implemented using a general-purpose or dedicated computer, and the program can easily be stored, transported, and installed.

[Operation and effects of the embodiment]
According to the embodiment described above, processing is based on amplitude data, which does not change when the phase changes, so the processing is independent of phase. For example, from the audio signal of a program in which speech and music are mixed, the music alone can be erased using the sound data, such as a music CD, that was used when the program was produced.
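The phase independence of amplitude data can be illustrated with a small Python sketch. It is a simplified illustration, not the removal processing of Equations 5 and 6: it uses a naive DFT on a single frame, a constant scale factor `a` in place of the parameter functions, and synthetic tones (`music_cd`, `music_in_mix`, `voice`) that are assumptions made for the example.

```python
import cmath
import math


def amplitude_spectrum(frame):
    """Return |DFT| of a real frame (naive O(N^2) DFT, for illustration only)."""
    n = len(frame)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(frame)))
        for k in range(n // 2 + 1)
    ]


def subtract_known(mixed_amp, known_amp, a=1.0):
    """Subtract the scaled known amplitude spectrum, clamping negatives to zero."""
    return [max(m - a * b, 0.0) for m, b in zip(mixed_amp, known_amp)]


N = 64


def tone(k, phase, amp=1.0):
    """A sinusoid that falls exactly on DFT bin k."""
    return [amp * math.sin(2 * math.pi * k * i / N + phase) for i in range(N)]


music_cd = tone(4, 0.0)        # known signal as stored on the CD
music_in_mix = tone(4, 1.0)    # the same music, phase-shifted in the program audio
voice = tone(9, 0.3, amp=0.5)  # the sound to be kept
mixed = [m + v for m, v in zip(music_in_mix, voice)]

# Subtraction in the amplitude domain removes the music despite the phase shift.
residual = subtract_known(amplitude_spectrum(mixed), amplitude_spectrum(music_cd))

# Sample-by-sample subtraction in the time domain leaves most of the music behind.
time_domain = amplitude_spectrum([m - c for m, c in zip(mixed, music_cd)])
```

Here `residual[4]` (the music bin) is essentially zero while `residual[9]` (the voice bin) is unchanged, whereas `time_domain[4]` remains large because the waveforms no longer cancel once the phase differs.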

In particular, even when the frequency characteristics and volume were adjusted to suit the production intent at the time the program audio was created, and the phase of the sound has changed unpredictably, the change of the music within the mixed sound can be predicted automatically and with high accuracy, and the music can be erased independently of phase.

In this embodiment, when the range of the sound to be subjected to removal processing is set, the synchronization control unit 9 synchronizes the video and audio and outputs them from the monitor 10 and the speaker 11. The operator can therefore check the audio before and after music erasure against the video while working, which improves work efficiency.

In this embodiment, the time-variation graphs are displayed and can be corrected intuitively by drawing over them with the mouse. The effect of the music erasure can therefore be adjusted according to the user's intent, for example by taking into account each scene of the program and how the program will be reused.

[Modifications]
The following functions, for example, can be added to the system according to the embodiment described above.

(Modification 1)
In the standard format used by commercial broadcasters, the audio start position is fixed, and work must proceed on the basis of information such as a cue sheet that lists the times within the program and the pieces of music used. A function for linking to this information is therefore required.

In this modification, as shown in FIG. 13(a), an automatic time adjustment function 12 is added to the system. As shown in FIG. 13(b), when processing is performed from a TV on-air tape and the range of the mixed acoustic signal to be processed is set through the windows of the editor 400 or the like, this module uses a detection unit to detect a predetermined frequency in the mixed acoustic data (here, the 1 kHz of the color bars), uses a timer to measure 15 seconds from the end time of the acoustic signal at that frequency, and uses a setting unit to set that point as 0 on the time code. The video and audio of the target portion can then be extracted simply by entering the time of the music to be erased as listed on the cue sheet (Q sheet).

With this automatic time adjustment function 12, when processing from a TV on-air tape, for example, the time code can be adjusted from the color bar signal of the commercial broadcasters' standard format, so the video and audio of the target portion can be extracted simply by entering the time of the music to be erased as listed on the Q sheet.
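The time-code adjustment described for Modification 1 can be sketched in Python as follows. The specification does not say how the 1 kHz signal is detected; this sketch assumes the Goertzel algorithm applied frame by frame, and the function names, frame length, threshold, sample rate, and test signal are all hypothetical choices made for the example.

```python
import math


def goertzel_power(frame, sample_rate, freq):
    """Squared magnitude of the `freq` component of `frame` (Goertzel algorithm)."""
    coeff = 2.0 * math.cos(2.0 * math.pi * freq / sample_rate)
    s1 = s2 = 0.0
    for x in frame:
        s1, s2 = x + coeff * s1 - s2, s1
    return s1 * s1 + s2 * s2 - coeff * s1 * s2


def tone_end_time(samples, sample_rate, freq=1000.0, frame_len=800, threshold=0.1):
    """Return the time in seconds at which the first run of the tone ends.

    Returns None if no tone is found or the tone never stops within the data.
    """
    in_tone = False
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        # Normalised so that a full-scale sine at `freq` scores about 0.25.
        level = goertzel_power(frame, sample_rate, freq) / frame_len ** 2
        if level >= threshold:
            in_tone = True
        elif in_tone:
            return start / sample_rate
    return None


def timecode_zero(samples, sample_rate, lead_seconds=15.0):
    """Place time-code zero `lead_seconds` after the reference tone ends."""
    end = tone_end_time(samples, sample_rate)
    return None if end is None else end + lead_seconds


RATE = 8000
tape = [math.sin(2 * math.pi * 1000 * i / RATE) for i in range(2 * RATE)]  # 2 s of 1 kHz
tape += [0.0] * RATE                                                       # then silence
zero = timecode_zero(tape, RATE)
cue_sheet_time = 90.0  # hypothetical cue-sheet entry: music starts at 00:01:30
tape_position = zero + cue_sheet_time
```

With the time-code origin fixed this way, a time read from the cue sheet maps to a tape position by a single addition, which is what allows the target portion to be located from the cue-sheet entry alone.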

(Modification 2)
As described in the embodiment above, some broadcast programs use a stereo format with left and right audio. Because removal processing must be performed on the left and right channels separately, as if each were monaural, the removal work, which requires many parameters to be set, is doubled for stereo broadcasts.

In this modification, as shown in FIG. 14(a), a parameter storage unit 13 is added to the system. As shown in FIG. 14(b), when, for example, the left audio (L) of a stereo broadcast is processed in the first pass, the parameter storage unit 13 stores the parameter settings made for the left audio. That is, when parameters are set through the user interface 6 for the left audio separated and extracted by the audio data extraction unit 3, the settings are stored in the parameter storage unit 13; they are then recalled in the subsequent removal processing of the right audio and set as the default values in the user interface 6.

With the parameter storage unit 13, when content data containing left and right audio data is processed, the settings for one channel are stored and then reused when the other channel is processed, which improves work efficiency.
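The behaviour of the parameter storage unit can be sketched as a small Python class. The class name `ParameterStore`, its methods, and the representation of each parameter function as a list of (time, value) breakpoints are assumptions made for this illustration; the specification does not define a data format.

```python
import copy


class ParameterStore:
    """Keeps the settings made for one channel so they can seed the other."""

    def __init__(self):
        self._saved = {}

    def save(self, channel, settings):
        # Deep-copy so later edits in the editor do not alter the stored snapshot.
        self._saved[channel] = copy.deepcopy(settings)

    def defaults_for(self, channel, fallback):
        """Return the settings stored for the opposite channel, else `fallback`."""
        other = "R" if channel == "L" else "L"
        return copy.deepcopy(self._saved.get(other, fallback))


store = ParameterStore()

# First pass: settings made for the left channel (hypothetical breakpoint lists).
left = {"a": [(0.0, 1.0), (12.5, 0.8)], "r": [(0.0, 0.02)]}
store.save("L", left)

# Second pass: the right channel starts from the left-channel settings.
right = store.defaults_for("R", fallback={"a": [(0.0, 1.0)], "r": [(0.0, 0.0)]})
right["a"].append((20.0, 0.5))  # fine-tuning the right channel leaves the left intact
```

Returning copies instead of the stored objects means the operator can adjust the second channel freely without silently changing what was saved for the first.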

(Modification 3)
There is a temporal offset between the known sound source and the BGM within the mixed sound, and it is difficult to align the mixed sound and the BGM precisely in time.

In this modification, as shown in FIG. 15, a range setting unit 15 is added to the system. When the temporal offset between the BGM in the mixed sound and the known sound to be removed is set, the range setting unit 15 narrows the range in stages. It is linked to the user interface 6 and operates when the temporal offset is estimated automatically. Alignment is first performed over a wide range, and the range is then progressively zoomed in, which makes fine adjustment possible.
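The staged narrowing of the range can be sketched in Python as a coarse-to-fine search. The specification does not state what quantity is compared or how the stages are chosen; this sketch assumes smooth per-frame loudness envelopes, plain correlation as the matching score, and step sizes of 16, 4, and 1 frames, all of which are hypothetical.

```python
def match_score(mixed_env, known_env, offset):
    """Correlation between the known envelope and the mix starting at `offset`."""
    return sum(k * mixed_env[offset + i] for i, k in enumerate(known_env))


def coarse_to_fine_offset(mixed_env, known_env, steps=(16, 4, 1)):
    """Search every offset at a coarse step, then re-search around the best hit."""
    last = len(mixed_env) - len(known_env)
    lo, hi = 0, last
    best = 0
    for step in steps:
        best = max(range(lo, hi + 1, step),
                   key=lambda off: match_score(mixed_env, known_env, off))
        # Narrow the range to one coarse step either side of the current best.
        lo, hi = max(best - step, 0), min(best + step, last)
    return best


known_env = [float(min(i, 39 - i)) for i in range(40)]  # a smooth rise and fall
mixed_env = [0.1] * 200                                 # constant background level
for i, v in enumerate(known_env):
    mixed_env[73 + i] += v                              # the BGM enters at frame 73

offset = coarse_to_fine_offset(mixed_env, known_env)
```

The coarse pass only works when the score varies smoothly with the offset, which is why envelopes are compared here instead of raw waveforms; with rapidly oscillating data the first stage could skip over the true position.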

The range setting unit 15 also has a melody detection function, with which the frequency distribution can be estimated. That is, when the known acoustic signal is a music signal, the range setting unit 15 detects, in the mixed acoustic signal, the frequency distribution corresponding to the melody contained in the music, and sets the relative time position based on the detected signal start position. Thus, when the known sound to be removed is music or the like, detecting its melody makes relative positioning easier.

Furthermore, in this modification, headphones 16 are connected to the output I/F 8. The headphones 16 output the BGM source, which is the known sound, from the left speaker 16a, and the MIX source, which is the mixed sound, from the right speaker 16b. The user listens to the two, compares them, and aligns them by ear.

With the headphones 16, the mixed sound and the known sound are output from, for example, the left and right speakers of the headphones, so the user can judge the temporal offset by ear, which improves work efficiency.

(Modification 4)
Erasing a known sound requires calibration to detect the characteristics of the known sound within the mixed sound. However, the known sound in the mix is emphasized or attenuated during program production, so its characteristics change from moment to moment. Consequently, if a sample is taken at only one location during calibration, removal may not be performed properly. In addition, when the known sound is music or the like, parts of it may be stretched or compressed along the time axis or the frequency axis because of differences in, for example, the rotation speed of a record; in this case too, calibration at a single location does not yield an adequate sample.

In this modification, as shown in FIG. 16(a), a calibration setting unit 17 is added to the system. As shown in FIG. 16(b), the calibration setting unit 17 is a module that acquires calibration samples for the removal engine program 100 described above from multiple locations chosen freely by the user. The samples acquired by the calibration setting unit 17 are also used in the estimation of the known sound described above: using them, the known sound is corrected by stretching or compressing it along the time axis or the frequency axis, and the known acoustic signal is then subtracted from the mixed sound.

With the calibration setting unit 17, calibration samples are acquired from multiple locations in the mixed sound, so removal processing can be made more accurate by computing the mean, standard deviation, representative value, and so on from the data obtained at those locations.
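The statistics named above can be computed as in the following Python sketch. It assumes that each calibration location yields a single gain estimate for the known sound and uses the median as one possible representative value; the function name `summarise_calibration` and the sample values are hypothetical.

```python
import statistics


def summarise_calibration(gain_samples):
    """Summarise gain estimates taken at several user-selected locations."""
    return {
        "mean": statistics.mean(gain_samples),
        "stdev": statistics.stdev(gain_samples),
        "median": statistics.median(gain_samples),  # one possible representative value
    }


# Hypothetical gain of the known sound in the mix, measured at four locations;
# the last is an outlier, e.g. a location where loud speech overlaps the music.
gains = [0.78, 0.80, 0.82, 1.45]
summary = summarise_calibration(gains)
```

With these numbers the single outlier pulls the mean well above the median, and the large standard deviation signals that the locations disagree, which is the kind of information a single-location calibration cannot provide.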

(Modification 5)
When the audio of a broadcast program is created, the frequency characteristics and volume are often adjusted to suit the production intent. This changes the phase of the sound unpredictably, so the sound cannot be erased properly by simple electronic subtraction alone.

In this modification, as shown in FIG. 17(a), a sound collecting microphone 19 can be connected to the input I/F 1, and a video camera 20 can be connected to the DV capture 2. The sound collecting microphone 19 is directional and serves as a first sound collecting device that acquires the known acoustic signal to be removed; the video camera 20 serves as a second sound collecting device that acquires the mixed acoustic signal. The sound collecting microphone 19 can be detached from the video camera 20 and carried separately; while the video camera 20 is shooting, it is placed near the source of the noise to be erased and records only that noise.

The sound collecting microphone 19 and the video camera 20 may also be integrated, as indicated by 21 in FIG. 17(b). In that case, the second sound collecting device 21a, which acquires the mixed acoustic signal, is placed at the front end of the integrated microphone 21, and the first sound collecting device 21b, which acquires the known acoustic signal, is placed at its rear. The sound collecting devices 21a and 21b are spaced apart: 21a is directional toward the front of the device for normal shooting, and 21b is directional in directions other than the front.

According to this modification, for example, from an audio signal recorded by the sound collecting device 21a in which speech and noise are mixed, the noise alone can be erased accurately using the noise data recorded by the sound collecting device 21b.

FIG. 1 is a block diagram showing the configuration of the known sound removal system according to the embodiment.
FIG. 2 is a flowchart showing the operation of the known sound removal system according to the embodiment.
FIG. 3 is a flowchart showing the basic theory of the known sound removal method according to the embodiment.
FIG. 4 is an explanatory diagram showing the effect of the known sound removal method according to the embodiment.
FIG. 5 is a functional block diagram of the sound removal engine program according to the embodiment.
FIG. 6 is a flowchart showing the operation of the sound removal engine program according to the embodiment.
FIG. 7 is an explanatory diagram showing the configuration of the editor according to the embodiment.
FIG. 8 is an explanatory diagram showing the operation panels of the editor according to the embodiment.
FIG. 9 is an explanatory diagram showing a window of the editor according to the embodiment.
FIG. 10 is an explanatory diagram showing a window of the editor according to the embodiment.
FIG. 11 is an explanatory diagram showing the configuration of the editor according to a modification.
FIG. 12 is a perspective view showing computer-readable recording media on which the program according to the embodiment is recorded.
FIG. 13 is an explanatory diagram showing the configuration and operation of the known sound removal system according to Modification 1.
FIG. 14 is an explanatory diagram showing the configuration and operation of the known sound removal system according to Modification 2.
FIG. 15 is an explanatory diagram showing the configuration and operation of the known sound removal system according to Modification 3.
FIG. 16 is an explanatory diagram showing the configuration and operation of the known sound removal system according to Modification 4.
FIG. 17 is an explanatory diagram showing the configuration and operation of the known sound removal system according to Modification 5.

Explanation of symbols

1: input I/F; 2: DV capture; 3: audio data extraction unit; 4: audio conversion unit; 5: storage device; 6: user interface; 6a: keyboard; 6b: mouse; 7: memory; 8: output I/F; 9: synchronization control unit; 10: monitor; 11: speaker; 12: time adjustment unit; 13: parameter storage unit; 14: simulation unit; 15: range setting unit; 16: headphones; 17: calibration setting unit; 18: stationary sound detection unit; 19: sound collecting microphone; 20: video camera; 21: integrated microphone; 100: sound removal engine program; 101: mixed sound input unit; 102: known acoustic signal input unit; 103: control unit; 104: removal processing unit; 105: inverse Fourier transform unit; 106: arrangement processing unit; 107: post-removal acoustic signal output unit; 116 to 119: recording media; 120: general-purpose computer; 200: amplitude spectrum extraction unit; 201: data division unit; 202: window function processing unit; 203: Fourier transform unit; 300: parameter estimation unit; 301: frequency characteristic change correction unit; 302: volume change correction unit; 303: time position correction unit; 304: calibration unit; 400: editor

Claims

An acoustic signal removing device comprising:
a stationary sound detection unit that detects a stationary sound in a mixed acoustic signal in which a known acoustic signal to be removed is mixed with another acoustic signal;
a known acoustic amplitude extraction unit that extracts a known acoustic amplitude spectrum from the known acoustic signal;
a mixed acoustic amplitude extraction unit that extracts a mixed acoustic amplitude spectrum for each frequency from the mixed acoustic signal;
a frequency selection unit that selects, from the extracted mixed acoustic amplitude spectra, a mixed acoustic amplitude spectrum at a frequency that does not coincide with the frequency of the stationary sound; and
a removal processing unit that removes the known acoustic amplitude spectrum from the mixed acoustic amplitude spectrum at the selected frequency.

The acoustic signal removing device according to claim 1, wherein the stationary sound detection unit detects a frequency having the minimum amplitude in the mixed acoustic signal as the stationary sound.

The acoustic signal removing device according to claim 1, wherein the stationary sound detection unit acquires the amplitude at each time in the mixed acoustic signal, sorts the amplitudes in order of value, and detects the frequency that gives the n-th value as the stationary sound.

The acoustic signal removing device according to claim 1, wherein the stationary sound detection unit acquires the amplitude at each time in the mixed acoustic signal, sorts the amplitudes in order of value, obtains the maximum n for which the standard deviation of the values up to the n-th does not exceed a certain value, and detects the frequency that gives the n-th value as the stationary sound.

An acoustic signal removing device that removes a known acoustic signal from a mixed acoustic signal in which the known acoustic signal to be removed is mixed with another acoustic signal, the device comprising:
a simulation unit that performs a pseudo subtraction process on a constant-amplitude acoustic signal used as the mixed sound, with the known acoustic signal set to 0, and measures the difference in volume between the constant-amplitude acoustic signal and the acoustic signal after the pseudo subtraction process;
a removal intensity setting unit that sets a signal intensity for each time of the known acoustic signal based on the measurement result obtained by the simulation unit;
a known acoustic amplitude extraction unit that extracts a known acoustic amplitude spectrum from the known acoustic signal;
a mixed acoustic amplitude extraction unit that extracts a mixed acoustic amplitude spectrum from the mixed acoustic signal; and
a removal processing unit that converts the known acoustic amplitude spectrum based on the setting made by the removal intensity setting unit and removes the known acoustic amplitude spectrum from the mixed acoustic amplitude spectrum.

An acoustic signal removing method comprising:
detecting a stationary sound in a mixed acoustic signal in which a known acoustic signal to be removed is mixed with another acoustic signal;
extracting a known acoustic amplitude spectrum from the known acoustic signal and extracting a mixed acoustic amplitude spectrum for each frequency from the mixed acoustic signal;
selecting, from the extracted mixed acoustic amplitude spectra, a mixed acoustic amplitude spectrum at a frequency that does not coincide with the frequency of the stationary sound; and
removing the known acoustic amplitude spectrum from the mixed acoustic amplitude spectrum at the selected frequency.

The acoustic signal removing method according to claim 6, wherein a frequency having the minimum amplitude in the mixed acoustic signal is detected as the stationary sound.

The acoustic signal removing method according to claim 6, wherein the amplitude at each time in the mixed acoustic signal is acquired, the amplitudes are sorted in order of value, and the frequency that gives the n-th value is detected as the stationary sound.

The acoustic signal removing method according to claim 6, wherein the amplitude at each time in the mixed acoustic signal is acquired, the amplitudes are sorted in order of value, the maximum n for which the standard deviation of the values up to the n-th does not exceed a certain value is obtained, and the frequency that gives the n-th value is detected as the stationary sound.

An acoustic signal removing method for removing a known acoustic signal from a mixed acoustic signal in which the known acoustic signal to be removed is mixed with another acoustic signal, the method comprising:
performing a pseudo subtraction process on a constant-amplitude acoustic signal used as the mixed sound, with the known acoustic signal set to 0, and measuring the difference in volume between the constant-amplitude acoustic signal and the acoustic signal after the pseudo subtraction process;
setting a signal intensity for each time of the known acoustic signal based on the measurement result;
extracting a known acoustic amplitude spectrum from the known acoustic signal and extracting a mixed acoustic amplitude spectrum from the mixed acoustic signal; and
converting the known acoustic amplitude spectrum based on the setting and removing the known acoustic amplitude spectrum from the mixed acoustic amplitude spectrum.

An acoustic signal removing program that causes a computer to execute processing comprising:
detecting a stationary sound in a mixed acoustic signal in which a known acoustic signal to be removed is mixed with another acoustic signal;
extracting a known acoustic amplitude spectrum from the known acoustic signal and extracting a mixed acoustic amplitude spectrum for each frequency from the mixed acoustic signal;
selecting, from the extracted mixed acoustic amplitude spectra, a mixed acoustic amplitude spectrum at a frequency that does not coincide with the frequency of the stationary sound; and
removing the known acoustic amplitude spectrum from the mixed acoustic amplitude spectrum at the selected frequency.

The acoustic signal removing program according to claim 11, wherein a frequency having the minimum amplitude in the mixed acoustic signal is detected as the stationary sound.

The acoustic signal removing program according to claim 11, wherein the amplitude at each time in the mixed acoustic signal is acquired, the amplitudes are sorted in order of value, and the frequency that gives the n-th value is detected as the stationary sound.

The acoustic signal removing program according to claim 11, wherein the amplitude at each time in the mixed acoustic signal is acquired, the amplitudes are sorted in order of value, the maximum n for which the standard deviation of the values up to the n-th does not exceed a certain value is obtained, and the frequency that gives the n-th value is detected as the stationary sound.

An acoustic signal removing program for removing a known acoustic signal from a mixed acoustic signal in which the known acoustic signal to be removed is mixed with another acoustic signal, the program causing a computer to execute processing comprising:
performing a pseudo subtraction process on a constant-amplitude acoustic signal used as the mixed sound, with the known acoustic signal set to 0, and measuring the difference in volume between the constant-amplitude acoustic signal and the acoustic signal after the pseudo subtraction process;
setting a signal intensity for each time of the known acoustic signal based on the measurement result;
extracting a known acoustic amplitude spectrum from the known acoustic signal and extracting a mixed acoustic amplitude spectrum from the mixed acoustic signal; and
converting the known acoustic amplitude spectrum based on the setting and removing the known acoustic amplitude spectrum from the mixed acoustic amplitude spectrum.