JP5971646B2

JP5971646B2 - Multi-channel signal processing apparatus, method, and program

Info

Publication number: JP5971646B2
Application number: JP2012070301A
Authority: JP
Inventors: 造田邉; 利博古川; 隆廣名取
Original assignee: Tokyo University of Science
Current assignee: Tokyo University of Science
Priority date: 2012-03-26
Filing date: 2012-03-26
Publication date: 2016-08-17
Anticipated expiration: 2032-03-26
Also published as: JP2013201722A

Description

The present invention relates to a multi-channel signal processing apparatus, method, and program, and more particularly to a multi-channel signal processing apparatus, method, and program that extract or suppress a specific signal contained in a multi-channel signal.

Conventionally, a stereo acoustic signal processing apparatus has been proposed that divides each channel of a stereo signal into a plurality of frequency bands, calculates the similarity between channels for each frequency band, calculates from that similarity an attenuation coefficient for suppressing or emphasizing a sound source signal localized near the center, multiplies each frequency band signal by the attenuation coefficient, and resynthesizes and outputs the frequency band signals of each channel (see, for example, Patent Document 1).

The stereo acoustic signal processing apparatus described in Patent Document 1 is effective when the acoustic signal input to the stereo signal input unit is a stereo signal recorded so that the target sound source signal to be emphasized or suppressed is localized near the center. Specifically, each of the stereo signals input to the stereo signal input unit (left-channel signal sL and right-channel signal sR) is converted into frequency-domain signals with N band divisions (fL(k) and fR(k), k = 0, ..., N-1), and the similarity a(k) between fL(k) and fR(k) is calculated for each frequency band. An attenuation coefficient g(k) is calculated for each frequency band from the similarity a(k). Within each frequency band, the same attenuation coefficient g(k), shared between the left and right channels, is multiplied by each frequency band signal fL(k), and the result is resynthesized. The output component sets sL' and sR' therefore contain only components with high inter-channel similarity, so that only the sound source signal localized near the center remains.
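The per-band processing attributed to Patent Document 1 can be sketched as follows. This is a minimal illustration under stated assumptions, not the cited apparatus: the similarity measure (a magnitude ratio), the hard 0/1 gain rule, the `threshold` parameter, and the function name `center_gain_filter` are all hypothetical. The text only states that a similarity a(k) is computed per band and turned into an attenuation coefficient g(k) shared by both channels.

```python
import numpy as np

def center_gain_filter(s_l, s_r, threshold=0.9):
    """Keep bands whose L/R spectra are similar (hypothetical similarity rule)."""
    f_l = np.fft.rfft(s_l)                      # fL(k)
    f_r = np.fft.rfft(s_r)                      # fR(k)
    mag_l, mag_r = np.abs(f_l), np.abs(f_r)
    eps = 1e-12                                 # avoids division by zero
    # Assumed similarity a(k) in [0, 1]: close to 1 when the magnitudes match
    a = np.minimum(mag_l, mag_r) / (np.maximum(mag_l, mag_r) + eps)
    # Assumed attenuation coefficient g(k): pass similar bands, mute the rest
    g = (a >= threshold).astype(float)
    # The same g(k) is applied to both channels before resynthesis
    out_l = np.fft.irfft(g * f_l, n=len(s_l))
    out_r = np.fft.irfft(g * f_r, n=len(s_r))
    return out_l, out_r, g

n = np.arange(256)
center = np.sin(2 * np.pi * 8 * n / 256)    # present in both channels (bin 8)
side = np.sin(2 * np.pi * 20 * n / 256)     # present in the left channel only (bin 20)
out_l, out_r, g = center_gain_filter(center + side, center)
```

In this toy input the shared tone passes through both outputs and the left-only tone is muted. A hard gain of this kind leaves isolated spectral components, which is the situation the document associates with musical noise.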

In this way, the stereo acoustic signal processing apparatus described in Patent Document 1 processes all frequency bands to obtain the target sound source signal localized near the center.

An acoustic signal processing apparatus has also been proposed that generates spectrum data for each of two channels of input acoustic signals; reduces the power of the data in each frequency bin belonging to a set frequency band corresponding to a specific acoustic signal (the voice of a vocal signal) when that data satisfies a predetermined approximation condition between the two channels; and generates the two channels of output acoustic signals that constitute a stereo acoustic signal by combining the time-domain corrected acoustic signal, based on the corrected spectrum data, with the difference signal of each channel relative to the other channel (see, for example, Patent Document 2).

In the acoustic signal processing apparatus described in Patent Document 2, a difference signal is generated for each of the two channels L and R by subtracting the input acoustic signal of the other channel: ΔXL(t) = XL(t) - XR(t) and ΔXR(t) = XR(t) - XL(t). Then, for each of the L and R channels, the time-domain corrected acoustic signals XL'(t) and XR'(t) are combined with the difference signals ΔXL(t) and ΔXR(t), for example by weighted addition, to generate the two-channel output acoustic signals YL(t) and YR(t) that constitute the stereo acoustic signal.
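The difference signals and the weighted addition described for Patent Document 2 can be written out directly. The sketch below is illustrative only: the weights `w_corr` and `w_delta`, the helper names, and the toy signals are assumptions, and the uncorrected inputs stand in for the corrected signals XL'(t) and XR'(t).

```python
import numpy as np

def difference_signals(x_l, x_r):
    """Return dXL(t) = XL(t) - XR(t) and dXR(t) = XR(t) - XL(t)."""
    return x_l - x_r, x_r - x_l

def mix_output(x_corr, delta, w_corr=1.0, w_delta=0.5):
    """Weighted addition of a corrected signal and a difference signal.

    The weight values are illustrative assumptions, not taken from the source.
    """
    return w_corr * x_corr + w_delta * delta

t = np.arange(8, dtype=float)
vocal = np.sin(0.3 * t)          # identical in both channels (center-localized)
x_l = vocal + 0.2 * t            # plus a left-only component
x_r = vocal - 0.1 * t            # plus a right-only component
d_l, d_r = difference_signals(x_l, x_r)
y_l = mix_output(x_l, d_l)       # x_l used here in place of XL'(t)
y_r = mix_output(x_r, d_r)       # x_r used here in place of XR'(t)
```

The component common to both channels cancels in the difference, and the two difference signals differ only in sign, so a center-localized signal contributes nothing to either of them.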

Patent Document 1: JP 2002-78100 A
Patent Document 2: JP 2008-72600 A

However, in the technique described in Patent Document 1, suppressing specific frequency components leaves isolated portions in the frequency spectrum, and these are heard as tonal musical noise when the signal is converted back into the time domain.

The technique described in Patent Document 2 prevents musical noise by combining in the difference signal to fill in the lost frequency bands. Although its computational cost is lower than that of the technique described in Patent Document 1, the same difference signal is combined with both the left and right signals for correction, so the stereo impression of the generated acoustic signal is reduced and the realism of the sound source signal is impaired. Furthermore, when the signal to be extracted is a sound source signal localized near the center, such as a vocal signal, that signal is monaural, so no difference signal can be generated to correct it.

Thus, the techniques described in Patent Documents 1 and 2 have the problem that the stereo impression is lost and reproducibility deteriorates.

The present invention has been made in view of the above problems, and its object is to provide a multi-channel signal processing apparatus, method, and program capable of outputting a multi-channel signal with good reproducibility, without impairing the stereo impression, when extracting or suppressing a specific signal contained in a multi-channel input signal, including stereo and two-channel signals.

To achieve the above object, a multi-channel signal processing apparatus according to a first invention includes: extraction means that converts each of the time-domain observation signals of a plurality of channels, each containing a first signal common to the channels and a second signal that differs for each channel, into a frequency-domain spectrum signal, extracts an estimated first spectrum signal, estimated to be the spectrum signal of the first signal, based on the ratio of the spectrum signals, and converts the estimated first spectrum signal from the frequency domain into the time domain to extract a time-domain estimated first signal estimated to be the first signal; and estimation means that estimates the first signal or the second signal by applying a Kalman filter with a colored driving source to a state-space model, using the variance of the time-domain estimated first signal extracted by the extraction means, the variance of the second signal obtained by subtracting the variance of the estimated first signal from the variance of the observation signal, and the observation signals of the plurality of channels, the state-space model being expressed by a state equation that is composed only of the second signal and includes elements corresponding to the plurality of channels, and an observation equation that is composed of the first signal and the second signal and includes elements corresponding to the plurality of channels.

In the multi-channel signal processing apparatus according to the first invention, the extraction means converts each of the time-domain observation signals of a plurality of channels, each containing a first signal common to the channels and a second signal that differs for each channel, into a frequency-domain spectrum signal, extracts an estimated first spectrum signal, estimated to be the spectrum signal of the first signal, based on the ratio of the spectrum signals, and converts the estimated first spectrum signal from the frequency domain into the time domain to extract a time-domain estimated first signal estimated to be the first signal. Then, using the variance of the time-domain estimated first signal extracted by the extraction means, the variance of the second signal obtained by subtracting the variance of the estimated first signal from the variance of the observation signal, and the observation signals of the plurality of channels, the first signal or the second signal is estimated by applying a Kalman filter with a colored driving source to a state-space model expressed by a state equation that is composed only of the second signal and includes elements corresponding to the plurality of channels, and an observation equation that is composed of the first signal and the second signal and includes elements corresponding to the plurality of channels.
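The extraction step and the variance subtraction described above can be sketched as follows. This is a minimal illustration under assumptions, not the patented procedure: the decision rule (a magnitude ratio within `tol` of 1 marks a bin as belonging to the shared first signal), the averaging of the two spectra, and all function names are hypothetical. The variance subtraction also presumes that the first and second signals are uncorrelated.

```python
import numpy as np

def extract_common_by_ratio(x_l, x_r, tol=0.1):
    """Estimate the shared (first) signal from per-bin spectral ratios."""
    X_l, X_r = np.fft.rfft(x_l), np.fft.rfft(x_r)
    eps = 1e-12                                  # avoids division by zero
    ratio = np.abs(X_l) / (np.abs(X_r) + eps)
    # Assumed rule: a ratio close to 1 marks a bin dominated by the shared signal
    mask = np.abs(ratio - 1.0) < tol
    D_hat = np.where(mask, 0.5 * (X_l + X_r), 0.0)   # estimated first spectrum
    return np.fft.irfft(D_hat, n=len(x_l))           # back to the time domain

def second_signal_variance(x, d_hat):
    """var(second) = var(observation) - var(estimated first), floored at zero."""
    return max(float(np.var(x) - np.var(d_hat)), 0.0)

n = np.arange(512)
d = np.sin(2 * np.pi * 10 * n / 512)            # first signal, common to both channels
i_l = 0.5 * np.sin(2 * np.pi * 30 * n / 512)    # second signal, left channel
i_r = 0.5 * np.sin(2 * np.pi * 45 * n / 512)    # second signal, right channel
x_l, x_r = d + i_l, d + i_r                     # assumed additive observations
d_hat = extract_common_by_ratio(x_l, x_r)
var_i_l = second_signal_variance(x_l, d_hat)
```

In this toy case the two variances come out as the variance of `d` and of `i_l`. These are the two quantities the text says are passed, together with the observation signals, to the Kalman filter stage.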

A Kalman filter with a colored driving source is a Kalman filter that remains applicable when the driving source is a colored signal; it estimates the target state quantity (here, the first signal or the second signal) from the observation signals.
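To show where the two variances enter such a filter, the sketch below runs a generic textbook Kalman recursion on a single channel. It is not the colored-driving-source algorithm of this document, whose details are not given in this section. The shift-matrix state transition, the observation vector, the `order` parameter, and the function name `shift_state_kalman` are assumptions made for illustration. The state stands for the second signal and the first signal plays the role of observation noise, mirroring the roles stated in the text.

```python
import numpy as np

def shift_state_kalman(y, var_state, var_obs, order=4):
    """Generic Kalman recursion with a shift-structured state (illustrative only).

    y         : observed samples (second signal plus first signal)
    var_state : variance of the signal held in the state (second signal)
    var_obs   : variance of the signal treated as observation noise (first signal)
    """
    Phi = np.eye(order, k=-1)            # shift matrix: moves each element down by one
    m = np.zeros(order); m[0] = 1.0      # observation picks the newest state element
    Q = np.zeros((order, order)); Q[0, 0] = var_state   # driving-source covariance
    x = np.zeros(order)                  # state estimate
    P = np.zeros((order, order))         # error covariance
    out = np.empty(len(y))
    for n, yn in enumerate(y):
        x_pred = Phi @ x                              # predict state
        P_pred = Phi @ P @ Phi.T + Q                  # predict covariance
        k = P_pred @ m / (m @ P_pred @ m + var_obs)   # Kalman gain
        x = x_pred + k * (yn - m @ x_pred)            # update with the innovation
        P = (np.eye(order) - np.outer(k, m)) @ P_pred
        out[n] = x[0]                                 # current estimate of the state signal
    return out

y = np.array([1.0, -2.0, 0.5, 3.0, -1.5])
est = shift_state_kalman(y, var_state=0.125, var_obs=0.5)
```

In this simplified setting the recursion reduces to a fixed gain of var_state / (var_state + var_obs) applied to each observation, which makes the dependence on the two variance estimates explicit.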

As a result, when a specific signal contained in a multi-channel signal is extracted or suppressed, a stereo signal with good reproducibility can be output without impairing the stereo impression, in contrast to approaches that fill in each channel with the same difference signal. In addition, one transform from the time domain to the frequency domain, or one inverse transform from the frequency domain to the time domain, is eliminated.

The multi-channel signal processing apparatus of the first invention may further include: subsequent-stage extraction means that extracts a time-domain estimated second signal, estimated to be the second signal, based on the autocorrelation peak value of each of the time-domain observation signals of the plurality of channels containing the first signal or the second signal estimated by the estimation means; and subsequent-stage estimation means that estimates the first signal or the second signal with a Kalman filter with a colored driving source, using the variance of the time-domain estimated second signal extracted by the subsequent-stage extraction means, the variance of the first signal obtained by subtracting the variance of the estimated second signal from the variance of the observation signal, and the observation signals of the plurality of channels.

Alternatively, the multi-channel signal processing apparatus of the first invention may further include: subsequent-stage extraction means that converts each of the time-domain observation signals of the plurality of channels containing the first signal or the second signal estimated by the estimation means into a frequency-domain spectrum signal, extracts an estimated second spectrum signal, estimated to be the spectrum signal of the second signal, based on the spectral entropy obtained from each spectrum signal, and converts the estimated second spectrum signal from the frequency domain into the time domain to extract a time-domain estimated second signal estimated to be the second signal; and subsequent-stage estimation means that estimates the first signal or the second signal with a Kalman filter with a colored driving source, using the variance of the time-domain estimated second signal extracted by the subsequent-stage extraction means, the variance of the first signal obtained by subtracting the variance of the estimated second signal from the variance of the observation signal, and the observation signals of the plurality of channels.

A multi-channel signal processing apparatus according to a second invention includes: extraction means that extracts a time-domain estimated second signal, estimated to be the second signal, based on the autocorrelation peak value of each of the time-domain observation signals of a plurality of channels, each containing a first signal common to the channels and a second signal that differs for each channel; and estimation means that estimates the first signal or the second signal by applying a Kalman filter with a colored driving source to a state-space model, using the variance of the time-domain estimated second signal extracted by the extraction means, the variance of the first signal obtained by subtracting the variance of the estimated second signal from the variance of the observation signal, and the observation signals of the plurality of channels, the state-space model being expressed by a state equation that is composed only of the second signal and includes elements corresponding to the plurality of channels, and an observation equation that is composed of the first signal and the second signal and includes elements corresponding to the plurality of channels.

In the multi-channel signal processing apparatus according to the second invention, the extraction means extracts a time-domain estimated second signal, estimated to be the second signal, based on the autocorrelation peak value of each of the time-domain observation signals of a plurality of channels, each containing a first signal common to the channels and a second signal that differs for each channel. Unlike the extraction means of the first invention, which extracts the estimated first signal, the extraction means of the second invention extracts the estimated second signal. Then, using the variance of the time-domain estimated second signal extracted by the extraction means, the variance of the first signal obtained by subtracting the variance of the estimated second signal from the variance of the observation signal, and the observation signals of the plurality of channels, the first signal or the second signal is estimated by applying a Kalman filter with a colored driving source to a state-space model expressed by a state equation that is composed only of the second signal and includes elements corresponding to the plurality of channels, and an observation equation that is composed of the first signal and the second signal and includes elements corresponding to the plurality of channels.
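Measuring an autocorrelation peak in the time domain can be sketched as below. This section does not say how the peak value is turned into an extraction decision, so only the measurement is shown. The lag search range (`min_lag`, `max_lag`), the normalization by the zero-lag value, and the function name `autocorr_peak` are assumptions.

```python
import numpy as np

def autocorr_peak(x, min_lag=20, max_lag=200):
    """Return (lag, normalized value) of the largest autocorrelation peak.

    Lag 0 and its neighborhood are excluded through min_lag (an assumed choice).
    """
    x = x - np.mean(x)                                   # remove any DC offset
    r = np.correlate(x, x, mode="full")[len(x) - 1:]     # lags 0 .. N-1
    r = r / r[0]                                         # normalize so that r[0] == 1
    lag = min_lag + int(np.argmax(r[min_lag:max_lag + 1]))
    return lag, float(r[lag])

n = np.arange(1000)
tone = np.sin(2 * np.pi * n / 50)      # periodic signal, period of 50 samples
lag, value = autocorr_peak(tone)
```

For a strongly periodic frame the peak sits at the period and its normalized value is close to 1. Because everything stays in the time domain, no forward or inverse spectral transform is needed, which matches the reduced computational cost the text claims for this variant.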

As a result, when a specific signal contained in a multi-channel signal is extracted or suppressed, a stereo signal with good reproducibility can be output without impairing the stereo impression. In addition, because all signal processing is performed in the time domain, the computational cost is lower than in the first invention.

A multi-channel signal processing apparatus according to a third invention includes: extraction means that converts each of the time-domain observation signals of a plurality of channels, each containing a first signal common to the channels and a second signal that differs for each channel, into a frequency-domain spectrum signal and calculates the power spectral density of each, extracts an estimated second spectrum signal, estimated to be the spectrum signal of the second signal, based on the spectral entropy obtained from each power spectral density, and converts the estimated second spectrum signal from the frequency domain into the time domain to extract a time-domain estimated second signal estimated to be the second signal; and estimation means that estimates the first signal or the second signal by applying a Kalman filter with a colored driving source to a state-space model, using the variance of the time-domain estimated second signal extracted by the extraction means, the variance of the first signal obtained by subtracting the variance of the estimated second signal from the variance of the observation signal, and the observation signals of the plurality of channels, the state-space model being expressed by a state equation that is composed only of the second signal and includes elements corresponding to the plurality of channels, and an observation equation that is composed of the first signal and the second signal and includes elements corresponding to the plurality of channels.

In the multi-channel signal processing apparatus according to the third invention, the extraction means converts each of the time-domain observation signals of a plurality of channels, each containing a first signal common to the channels and a second signal that differs for each channel, into a frequency-domain spectrum signal and calculates the power spectral density of each, extracts an estimated second spectrum signal, estimated to be the spectrum signal of the second signal, based on the spectral entropy obtained from each power spectral density, and converts the estimated second spectrum signal from the frequency domain into the time domain to extract a time-domain estimated second signal estimated to be the second signal. Unlike the extraction means of the first invention, which extracts the estimated first signal, the extraction means of the second invention extracts the estimated second signal. Then, using the variance of the time-domain estimated second signal extracted by the extraction means, the variance of the first signal obtained by subtracting the variance of the estimated second signal from the variance of the observation signal, and the observation signals of the plurality of channels, the first signal or the second signal is estimated by applying a Kalman filter with a colored driving source to a state-space model expressed by a state equation that is composed only of the second signal and includes elements corresponding to the plurality of channels, and an observation equation that is composed of the first signal and the second signal and includes elements corresponding to the plurality of channels.
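The spectral entropy used by this variant can be illustrated with the standard Shannon-entropy definition applied to a normalized power spectral density. This section does not give the exact definition or the decision rule built on it, so the periodogram-style PSD estimate, the small regularization constants, and the function name `spectral_entropy` are assumptions.

```python
import numpy as np

def spectral_entropy(x):
    """Shannon entropy of the normalized power spectral density of one frame."""
    psd = np.abs(np.fft.rfft(x)) ** 2           # periodogram-style PSD estimate
    p = psd / (np.sum(psd) + 1e-20)             # normalize to a probability mass
    return float(-np.sum(p * np.log(p + 1e-12)))

N = 256
n = np.arange(N)
tone = np.sin(2 * np.pi * 8 * n / N)            # energy concentrated in one bin
impulse = np.zeros(N)
impulse[0] = 1.0                                # flat spectrum over all bins
h_tone = spectral_entropy(tone)
h_impulse = spectral_entropy(impulse)
```

A frame whose energy sits in a few bins gives an entropy near zero, while a flat spectrum gives the maximum value, the logarithm of the number of bins. A threshold on this quantity would be one plausible way to separate tonal from noise-like content, though the source does not state the rule here.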

As a result, when a specific signal contained in a multi-channel signal is extracted or suppressed, a stereo signal with good reproducibility can be output without impairing the stereo impression.

A multi-channel signal processing method according to a fourth invention is a method that: converts each of the time-domain observation signals of a plurality of channels, each containing a first signal common to the channels and a second signal that differs for each channel, into a frequency-domain spectrum signal; extracts an estimated first spectrum signal, estimated to be the spectrum signal of the first signal, based on the ratio of the spectrum signals; converts the estimated first spectrum signal from the frequency domain into the time domain to extract a time-domain estimated first signal estimated to be the first signal; and estimates the first signal or the second signal by applying a Kalman filter with a colored driving source to a state-space model, using the variance of the extracted time-domain estimated first signal, the variance of the second signal obtained by subtracting the variance of the estimated first signal from the variance of the observation signal, and the observation signals of the plurality of channels, the state-space model being expressed by a state equation that is composed only of the second signal and includes elements corresponding to the plurality of channels, and an observation equation that is composed of the first signal and the second signal and includes elements corresponding to the plurality of channels.

A multi-channel signal processing method according to a fifth invention is a method that: extracts a time-domain estimated second signal, estimated to be the second signal, based on the autocorrelation peak value of each of the time-domain observation signals of a plurality of channels, each containing a first signal common to the channels and a second signal that differs for each channel; and estimates the first signal or the second signal by applying a Kalman filter with a colored driving source to a state-space model, using the variance of the extracted time-domain estimated second signal, the variance of the first signal obtained by subtracting the variance of the estimated second signal from the variance of the observation signal, and the observation signals of the plurality of channels, the state-space model being expressed by a state equation that is composed only of the second signal and includes elements corresponding to the plurality of channels, and an observation equation that is composed of the first signal and the second signal and includes elements corresponding to the plurality of channels.

A multi-channel signal processing method according to a sixth invention is a method that: converts each of the time-domain observation signals of a plurality of channels, each containing a first signal common to the channels and a second signal that differs for each channel, into a frequency-domain spectrum signal and calculates the power spectral density of each; extracts an estimated second spectrum signal, estimated to be the spectrum signal of the second signal, based on the spectral entropy obtained from each power spectral density; converts the estimated second spectrum signal from the frequency domain into the time domain to extract a time-domain estimated second signal estimated to be the second signal; and estimates the first signal or the second signal by applying a Kalman filter with a colored driving source to a state-space model, using the variance of the extracted time-domain estimated second signal, the variance of the first signal obtained by subtracting the variance of the estimated second signal from the variance of the observation signal, and the observation signals of the plurality of channels, the state-space model being expressed by a state equation that is composed only of the second signal and includes elements corresponding to the plurality of channels, and an observation equation that is composed of the first signal and the second signal and includes elements corresponding to the plurality of channels.

A multi-channel signal processing program according to a seventh invention is a program for causing a computer to function as each of the means constituting the multi-channel signal processing apparatus described above.

As described above, the multi-channel signal processing apparatus, method, and program of the present invention make it possible to output a multi-channel signal with good reproducibility, without impairing the stereo impression, when extracting or suppressing a specific signal contained in a multi-channel input signal, including stereo and two-channel signals.

FIG. 1 is a block diagram showing the schematic configuration of the stereo signal processing apparatus according to the first embodiment.
FIG. 2 is a schematic diagram for explaining how the observation signals are observed.
FIG. 3 is a diagram for explaining the processing of the frequency domain conversion unit.
FIG. 4 is a diagram for explaining the processing of the vocal signal extraction unit.
FIG. 5 is a diagram for explaining the processing of the time domain conversion unit.
FIG. 6 is a block diagram representing the state-space model.
FIG. 7 is a flowchart showing the stereo signal processing in the first embodiment.
FIG. 8 is a flowchart showing the Kalman algorithm with a colored driving source.
FIG. 9 is a block diagram showing the schematic configuration of the stereo signal processing apparatus according to the second embodiment.
FIG. 10 is a flowchart showing the stereo signal processing in the second embodiment.
FIG. 11 is a block diagram showing the schematic configuration of the stereo signal processing apparatus according to the third embodiment.
FIG. 12 is a diagram for explaining the processing of the autocorrelation processing unit.
FIG. 13 is a diagram for explaining the processing of the peak value detection unit.
FIG. 14 is a flowchart showing the stereo signal processing in the third embodiment.
FIG. 15 is a block diagram showing the schematic configuration of the stereo signal processing apparatus according to the fourth embodiment.
FIG. 16 is a flowchart showing the stereo signal processing in the fourth embodiment.
FIG. 17 is a diagram for explaining the reduced-computation Kalman filter with a colored driving source.
FIG. 18 is a diagram for explaining the reduced-computation Kalman filter with a colored driving source.
FIG. 19 is a diagram for explaining the reduced-computation Kalman filter with a colored driving source.
FIG. 20 is a flowchart showing the reduced-computation Kalman algorithm with a colored driving source.

Hereinafter, embodiments of the present invention will be described in detail with reference to the drawings.
<First Embodiment>
In the first embodiment, a case is described in which an example of the first signal of the present invention is a vocal signal whose sound source is located near the midpoint between an L-channel microphone and an R-channel microphone, and an example of the second signal of the present invention is a music signal whose sound source is, for example, a musical instrument.

As shown in FIG. 1, the stereo signal processing apparatus 10 according to the first embodiment includes A/D conversion units 12L and 12R, frequency domain conversion units 14L and 14R, a spectrum ratio calculation unit 16, a vocal signal extraction unit 18, a time domain conversion unit 20, a music signal estimation unit 22, and D/A conversion units 24L and 24R. The stereo signal processing apparatus 10 can be implemented as a semiconductor integrated circuit such as an ASIC (Application Specific Integrated Circuit).

The A/D conversion units 12L and 12R convert the observation signals $x_L(n)$ and $x_R(n)$, which are analog signals input from outside (labeled "observation signal L" and "observation signal R" in FIG. 1; the same applies to FIGS. 9, 11, and 15), into digital signals, and output the digitized observation signals $x_L(n)$ and $x_R(n)$ to the frequency domain conversion units 14L and 14R, respectively.

Here, as shown in FIG. 2, the observation signals $x_L(n)$ and $x_R(n)$ are obtained by observing a music signal (L-channel signal $i_L(n)$ and R-channel signal $i_R(n)$) together with a vocal signal $d(n)$. At time $n$, the L-channel observation signal picked up by the L-channel microphone is $x_L(n)$, and the R-channel observation signal picked up by the R-channel microphone is $x_R(n)$. The observation signals $x_L(n)$ and $x_R(n)$ are expressed by the following equations (1) and (2).

$$x_L(n) = d(n) + i_L(n) \tag{1}$$
$$x_R(n) = d(n) + i_R(n) \tag{2}$$
The frequency domain conversion units 14L and 14R convert the time-domain observation signals $x_L(n)$ and $x_R(n)$ input from the A/D conversion units 12L and 12R into the frequency-domain observation signals $X_L(l, k)$ and $X_R(l, k)$, respectively, and output them to the spectrum ratio calculation unit 16 and the vocal signal extraction unit 18. Specifically, as shown in FIG. 3, the frequency domain conversion units 14L and 14R apply the Fourier transform of the following equations (3) and (4) to the observation signals $x_L(l, n)$ and $x_R(l, n)$ within a frame of a predetermined length, converting them into a spectrum for each frequency bin. Here, $2M$ is the number of samples per frame, $l$ is the frame number, and $k$ is the frequency bin number. Hereinafter, an observation signal converted into the frequency domain is also called an "observation spectrum".
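The framing and transform step can be sketched as follows. Equations (3) and (4) are not reproduced in this text, so this sketch assumes a plain discrete Fourier transform with no analysis window and a hop of $M$ samples between frames of $2M$ samples (the hop is inferred from the overlap-add description later in this section). The function names are hypothetical.

```python
import cmath

def split_frames(x, M):
    """Split x into frames of 2*M samples, advancing M samples per frame (assumed hop)."""
    return [x[s:s + 2 * M] for s in range(0, len(x) - 2 * M + 1, M)]

def dft(frame):
    """Naive O(N^2) discrete Fourier transform of one frame (no window: an assumption)."""
    N = len(frame)
    return [
        sum(frame[n] * cmath.exp(-2j * cmath.pi * k * n / N) for n in range(N))
        for k in range(N)
    ]

def observation_spectra(x, M):
    """Return X(l, k) as a list indexed by frame l, each entry a list over bins k."""
    return [dft(frame) for frame in split_frames(x, M)]
```

Calling `observation_spectra` once per channel gives the two sets of complex bins whose magnitudes are the observation spectra used in the following steps. A practical implementation would use an FFT routine instead of the quadratic-time loop shown here.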

The spectrum ratio calculation unit 16 calculates the spectrum ratio between the observation spectra $|X_L(l, k)|$ and $|X_R(l, k)|$ input from the frequency domain conversion units 14L and 14R, and outputs it to the vocal signal extraction unit 18. As equations (1) and (2) show, the vocal signal is contained equally in the L-channel observation signal $x_L(n)$ and the R-channel observation signal $x_R(n)$. Therefore, as shown in the following equations (5) and (6), the vocal spectrum $|D(l, k)|$ is also contained equally in the L-channel observation spectrum $|X_L(l, k)|$ and the R-channel observation spectrum $|X_R(l, k)|$. Here, $|I_L(l, k)|$ and $|I_R(l, k)|$ are the spectra of the L-channel and R-channel music signals.

$$|X_L(l, k)| = |D(l, k)| + |I_L(l, k)| \tag{5}$$
$$|X_R(l, k)| = |D(l, k)| + |I_R(l, k)| \tag{6}$$
Consequently, a component can be judged to be a vocal signal when the spectrum ratio between the L-channel and R-channel observation spectra is small, and a music signal when the ratio is large. The spectrum ratio calculation unit 16 therefore calculates this spectrum ratio. Whereas Patent Documents 1 and 2 calculate, for each frequency band, the similarity between the left-channel and right-channel signals converted into the frequency domain, the present embodiment calculates the spectrum ratio $A_e(l, k)$ between the L-channel and R-channel observation spectra by the following equation (7).
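To make the "small ratio means vocal" reasoning concrete, here is a minimal sketch. Equation (7) is not reproduced in this text, so the formula below is only a hypothetical stand-in and not the patent's definition: it is chosen to be 0 when the two magnitudes are equal and to approach 1 when one channel dominates, which is compatible with a threshold in the range 0 to 1.

```python
def spectral_ratio(mag_l, mag_r, eps=1e-12):
    """Hypothetical stand-in for equation (7), NOT the patent's formula.

    mag_l, mag_r: magnitudes |X_L(l, k)| and |X_R(l, k)| for one frame.
    Returns one value per bin in [0, 1]; eps guards against division by zero.
    """
    return [abs(a - b) / max(a, b, eps) for a, b in zip(mag_l, mag_r)]
```

A bin where both channels carry the same magnitude (as a centered vocal would, under equations (5) and (6)) scores 0, while a bin present in only one channel scores 1.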

Based on the spectrum ratio $A_e(l, k)$ input from the spectrum ratio calculation unit 16, the vocal signal extraction unit 18 extracts, from the observation spectra $|X_L(l, k)|$ and $|X_R(l, k)|$ input from the frequency domain conversion units 14L and 14R, the spectrum (hereinafter, "estimated vocal spectrum") of the signal estimated to be the vocal signal (hereinafter, "estimated vocal signal"), and outputs it to the time domain conversion unit 20. Specifically, for each frequency bin of the observation spectrum of each frame, it determines from $A_e(l, k)$ whether the bin is a vocal signal or a music signal. Then, as shown in the following equation (8), the observation spectrum is extracted unchanged when the bin is judged to be a vocal signal and is suppressed when the bin is judged to be a music signal, which yields the estimated vocal spectrum $|\hat{D}(l, k)|$. Note that Patent Document 2 extracts the music signal as the target sound source signal, whereas here the estimated vocal spectrum is extracted rather than the music signal that is the final extraction target.

Here, $\alpha$ is a threshold that determines how much of the sound source signals (here, the music signal) other than the one localized near the midpoint between the L-channel and R-channel microphones (here, the vocal signal) is tolerated, with $0 \le \alpha \le 1$. $k_0$ is a coefficient for adjusting the degree of suppression of the music signal, with $0 \le k_0 \le 1$. As shown in FIG. 4, the music signal is completely suppressed when $k_0 = 0$. Equation (8) shows the case where the estimated vocal spectrum $|\hat{D}(l, k)|$ is extracted from the observation spectrum $|X_L(l, k)|$, but the observation spectrum $|X_R(l, k)|$ may be used instead.
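The bin-wise decision described for equation (8) can be sketched as below. The rule (keep the bin when the ratio is at most $\alpha$, otherwise scale it by $k_0$) follows the description of steps 106 to 110 later in this section; the function name and default values are hypothetical.

```python
def extract_vocal_spectrum(mag, ratio, alpha=0.1, k0=0.0):
    """Per-bin selection in the spirit of equation (8).

    mag:   observation spectrum magnitudes for one frame (either channel).
    ratio: spectrum ratio A_e(l, k) for the same frame.
    Bins with ratio <= alpha are treated as vocal and kept unchanged;
    the others are treated as music and scaled by k0.
    """
    if not (0.0 <= alpha <= 1.0 and 0.0 <= k0 <= 1.0):
        raise ValueError("alpha and k0 must lie in [0, 1]")
    return [x if a <= alpha else k0 * x for x, a in zip(mag, ratio)]
```

With `k0=0.0` the bins judged to be music are removed entirely, which corresponds to the complete suppression case mentioned for FIG. 4.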

As shown in FIG. 4, the above processing may be performed only on the vocal band $W_0$. $W_0$ is a coefficient that specifies the band of the vocal signal contained in the observation signal. For a male vocalist the vocal signal is concentrated in a low band, and for a female vocalist it is concentrated in a high band. Providing a processing band such as $W_0$ therefore reduces the amount of computation compared with processing the entire band of the observation signal, as in the method of Patent Document 1. In the present embodiment the first signal is a vocal signal, so the processing band $W_0$ is set according to the characteristics of vocal signals; for another choice of first signal, $W_0$ may be set according to the characteristics of that signal.

The time domain conversion unit 20 applies the inverse Fourier transform of the following equation (9) to the estimated vocal spectrum $|\hat{D}(l, k)|$ input from the vocal signal extraction unit 18, converting it into the time-domain estimated vocal signal $\hat{d}(l, n)$ (see also FIG. 5). Compared with the methods of Patent Documents 1 and 2, only one inverse Fourier transform is required.

Next, by the overlap-add method, the time-domain estimated vocal signal $\hat{d}(l-1, n+M)$ from the latter $M$ samples of the previous frame and the time-domain estimated vocal signal $\hat{d}(l, n)$ from the first $M$ samples of the current frame are added to obtain the $M$-sample time-domain estimated vocal signal $\hat{d}(n)$ ($1 \le n \le M$) of the current frame. The overlap-add method can be expressed mathematically as follows.
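The overlap-add step, $\hat{d}(n) = \hat{d}(l-1, n+M) + \hat{d}(l, n)$ for $1 \le n \le M$, can be sketched as follows. The treatment of the very first frame (nothing to add from a previous frame) is an assumption, as is the function name.

```python
def overlap_add(frames, M):
    """Combine time-domain frames of length 2*M that advance by M samples.

    For each frame, the first M samples are added to the last M samples
    of the previous frame. The first frame has no predecessor, so zeros
    are used there (an assumption of this sketch).
    """
    out = []
    prev_tail = [0.0] * M
    for frame in frames:
        out.extend(p + c for p, c in zip(prev_tail, frame[:M]))
        prev_tail = frame[M:]
    return out
```

Each frame contributes $M$ output samples, so the output has `len(frames) * M` samples; the tail of the last frame is left unused in this sketch.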

The music signal estimation unit 22 extracts the signal estimated to be the music signal (hereinafter, "estimated music signal") based on the estimated vocal signal $\hat{d}(n)$ input from the time domain conversion unit 20 and the observation signals $x_L(n)$ and $x_R(n)$. In the present embodiment, a specific signal contained in the observation signal (here, the music signal) is extracted by a Kalman filter with a colored driving source that does not rely on estimating AR coefficients.

Specifically, the observation signal is replaced by the state space model of the following equation (10), which consists of a state equation composed only of the music signal and an observation equation composed of the vocal signal and the music signal.

The quantities $x_{p2}$, $\delta_{p2}$, $y_{p2}$, $\varepsilon_{p2}$, $\Phi_{p2}$, and $M_{p2}$ in equation (10) are each defined by the following equation (11). $x_{p2}$ is the $2L_{p2} \times 1$ state vector composed of the desired music signal, $\delta_{p2}$ is the $2L_{p2} \times 1$ driving source vector, $y_{p2}$ is the $2 \times 1$ observation signal vector, and $\varepsilon_{p2}$ is the $2 \times 1$ vocal signal vector. $\Phi_{p2}$ is the state transition matrix, composed only of 0s and 1s, and $M_{p2}$ is the $2 \times 2L_{p2}$ observation transition matrix. FIG. 6 is a block diagram of this state space model. $2L_{p2}$ is the size of the state transition matrix, and the subscript $p2$ indicates a variable of the state equation and observation equation to which the Kalman filter with a colored driving source is applied.
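As an illustration of matrices with these shapes, the sketch below builds one possible $\Phi_{p2}$ and $M_{p2}$. Equation (11) is not reproduced in this text, so the layout is an assumption: the state is taken to stack the $L$ most recent L-channel samples followed by the $L$ most recent R-channel samples, newest first, so that $\Phi$ is a block-diagonal shift matrix and $M$ selects rows 1 and $L+1$. This matches the stated sizes and the later read-out of rows 1 and $(L_{p2}+1)$, but the patent's actual definition may differ.

```python
def shift_matrix(L):
    """L x L matrix with ones on the first subdiagonal (delays entries by one slot)."""
    return [[1 if r == c + 1 else 0 for c in range(L)] for r in range(L)]

def build_model(L):
    """Return (Phi, M) for an assumed stacked L/R state of size 2L.

    Phi: 2L x 2L block-diagonal shift matrix, entries only 0 and 1.
    M:   2 x 2L selector picking the newest L sample and the newest R sample.
    """
    S = shift_matrix(L)
    zero = [0] * L
    Phi = [row + zero for row in S] + [zero + row for row in S]
    M = [
        [1 if c == 0 else 0 for c in range(2 * L)],
        [1 if c == L else 0 for c in range(2 * L)],
    ]
    return Phi, M
```

Under this layout the driving source vector would carry the new music samples in rows 1 and $L+1$, with the shift matrix moving older samples down by one position at each time step.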

The state equation in equation (10) describes the system to be estimated (here, the music signal) as a state space model and represents the time evolution of the internal state, that is, the state variable (here, the state vector $x_{p2}$). The observation equation in equation (10) describes the process of observation through some observation device and shows how the observation result (here, the observation signal vector $y_{p2}$) evolves over time depending on the observed quantity, that is, the input (here, the state vector $x_{p2}$). The "state vector $x_{p2}(n)$ at time $n$" means the state vector composed of the music signal up to time $n$.

The L/R channel coupled Kalman algorithm shown below is derived from the state equation and observation equation of equation (10).

The above algorithm is broadly divided into an initialization process [Initialization] and an iteration process [Iteration]; in the iteration process, procedures 1 to 5 are repeated in sequence. The detailed processing flow of each process and procedure is described later; here, only an outline is given.

In the initialization process, the following are each set as shown above: the initial value $\hat{x}_{p2}(0|0)$ of the optimum estimate of the state vector representing the music signal to be estimated (hereinafter, "optimum estimate vector"); the initial value $P_{p2}(0|0)$ of the covariance matrix, which is the error when the state vector is estimated; the variance $R_{\varepsilon p2}(n)[i, j]$ of the vocal signal; and the variance $R_{\delta p2}(n)[i, j]$ of the music signal. The variance of the music signal is the variance of the observation signal minus the variance of the vocal signal. $*[i, j]$ denotes the element in row $i$, column $j$ of the variable $*$, and $I$ denotes the identity matrix.
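The relation "music variance = observation variance minus vocal variance" can be sketched as below. How the variances are estimated and which matrix elements they populate is not reproduced in this text, so the sample-variance estimator, the flooring at zero, and the function names are assumptions of this sketch.

```python
def variance(x):
    """Population variance of a sequence of samples."""
    mean = sum(x) / len(x)
    return sum((v - mean) ** 2 for v in x) / len(x)

def music_variance(observed, vocal_estimate):
    """Variance of the music signal as observation variance minus vocal variance.

    Floored at 0.0 so that estimation noise cannot produce a negative
    variance (the floor is an addition of this sketch, not from the source).
    """
    return max(variance(observed) - variance(vocal_estimate), 0.0)
```

Here `observed` would be a stretch of one channel's observation signal and `vocal_estimate` the matching stretch of the estimated vocal signal $\hat{d}(n)$.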

In the iteration process, procedure 1 calculates the covariance matrix $P_{p2}(n+1|n)$, which is the error when the state vector at time $n+1$ is estimated from information up to time $n$. Procedure 2 then calculates the Kalman gain matrix $K_{p2}(n+1)$ such that the estimation error of the observation signal vector multiplied by the Kalman gain matrix, added to the optimum estimate vector $\hat{x}_{p2}(n+1|n)$ at time $n+1$ based on information up to time $n$, gives the optimum estimate vector $\hat{x}_{p2}(n+1|n+1)$ at time $n+1$ based on information up to time $n+1$.

Next, procedure 3 calculates the optimum estimate vector $\hat{x}_{p2}(n+1|n)$ at time $n+1$ from information up to time $n$. Procedure 4 then calculates the optimum estimate vector $\hat{x}_{p2}(n+1|n+1)$ at time $n+1$ from information up to time $n+1$. Procedures 3 and 4 update the state. Finally, procedure 5 updates the covariance matrix at time $n+1$ using information up to that time.

The music signal estimation unit 22 repeats the above iteration process a predetermined number of times and outputs the element in row 1, column 1 of the optimum estimate vector $\hat{x}_{p2}(n+1|n+1)$ obtained in procedure 4 as the L-channel estimated music signal $\hat{i}_L(n)$, and the element in row $(L_{p2}+1)$, column 1 as the R-channel estimated music signal $\hat{i}_R(n)$, to the D/A conversion units 24L and 24R, respectively.
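Procedures 1 to 5 can be sketched as one function. The algorithm listing itself is not reproduced in this text, so the update formulas below are the textbook Kalman recursions, chosen because they use exactly the inputs named for each procedure (the music variance in procedure 1, the vocal variance in procedure 2, and $I$, $K$, $M$, and the predicted covariance in procedure 5). Treat them as an assumption rather than the patent's exact equations; all helper names are hypothetical.

```python
def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(r) for r in zip(*A)]

def add(A, B, sign=1.0):
    """Element-wise A + sign * B."""
    return [[a + sign * b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def inv2(A):
    """Inverse of a 2 x 2 matrix (the observation vector has two channels)."""
    (a, b), (c, d) = A
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def identity(n):
    return [[1.0 if r == c else 0.0 for c in range(n)] for r in range(n)]

def kalman_step(x, P, y, Phi, M, R_delta, R_eps):
    """One pass through procedures 1-5.

    x: 2L x 1 estimate at time n, P: 2L x 2L covariance at time n,
    y: 2 x 1 observation at time n+1, R_delta: 2L x 2L, R_eps: 2 x 2.
    """
    # Procedure 1: predicted covariance P(n+1|n).
    P_pred = add(matmul(matmul(Phi, P), transpose(Phi)), R_delta)
    # Procedure 2: Kalman gain K(n+1).
    S = add(matmul(matmul(M, P_pred), transpose(M)), R_eps)
    K = matmul(matmul(P_pred, transpose(M)), inv2(S))
    # Procedure 3: predicted state x(n+1|n).
    x_pred = matmul(Phi, x)
    # Procedure 4: corrected state x(n+1|n+1).
    innovation = add(y, matmul(M, x_pred), sign=-1.0)
    x_new = add(x_pred, matmul(K, innovation))
    # Procedure 5: updated covariance P(n+1|n+1).
    P_new = matmul(add(identity(len(P)), matmul(K, M), sign=-1.0), P_pred)
    return x_new, P_new
```

After each call, `x_new[0][0]` and `x_new[L][0]` (zero-based indices for rows 1 and $L_{p2}+1$) would be read out as the L-channel and R-channel estimates.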

The D/A conversion units 24L and 24R convert the estimated music signals $\hat{i}_L(n)$ and $\hat{i}_R(n)$, which are digital signals input from the music signal estimation unit 22, into analog signals and output them as the final output signals L and R.

Next, the operation of the stereo signal processing apparatus 10 according to the first embodiment will be described with reference to FIG. 7.

In step 100, the A/D conversion units 12L and 12R convert the observation signals $x_L(n)$ and $x_R(n)$, which are analog signals input from outside, into digital signals. Next, in step 102, the frequency domain conversion units 14L and 14R convert the time-domain observation signals $x_L(n)$ and $x_R(n)$ digitized in step 100 into the frequency-domain observation spectra $|X_L(l, k)|$ and $|X_R(l, k)|$, respectively.

Next, in step 104, the spectrum ratio calculation unit 16 calculates the spectrum ratio $A_e(l, k)$ between the observation spectra $|X_L(l, k)|$ and $|X_R(l, k)|$ obtained in step 102. In step 106, the vocal signal extraction unit 18 determines whether the spectrum ratio $A_e(l, k)$ calculated in step 104 is greater than the predetermined threshold $\alpha$. If $A_e(l, k) > \alpha$, the process proceeds to step 108, where the bin is regarded as a music signal and is suppressed by multiplying the observation spectrum $|X_L(l, k)|$ or $|X_R(l, k)|$ by the coefficient $k_0$ ($0 \le k_0 \le 1$) for adjusting the degree of suppression of the music signal, set for example to $k_0 = 0$. If $A_e(l, k) \le \alpha$, the process proceeds to step 110, where the bin is regarded as a vocal signal and, with for example $k_0 = 1$, the observation spectrum $|X_L(l, k)|$ or $|X_R(l, k)|$ is extracted as the estimated vocal spectrum $|\hat{D}(l, k)|$.

Next, in step 112, the time domain conversion unit 20 applies the inverse Fourier transform of equation (9) to the estimated vocal spectrum $|\hat{D}(l, k)|$ extracted through steps 108 and 110, converting it into the time-domain estimated vocal signal $\hat{d}(l, n)$. Then, by the overlap-add method, the time-domain estimated vocal signal $\hat{d}(l-1, n+M)$ from the latter $M$ samples of the previous frame and the time-domain estimated vocal signal $\hat{d}(l, n)$ from the first $M$ samples of the current frame are added to obtain the $M$-sample time-domain estimated vocal signal $\hat{d}(n)$ ($1 \le n \le M$) of the current frame.

Next, in step 114, the music signal estimation unit 22 extracts the estimated music signal by executing the music signal estimation process based on the estimated vocal signal $\hat{d}(n)$ and the observation signals $x_L(n)$ and $x_R(n)$. The music signal estimation process corresponds to the Kalman algorithm with a colored driving source shown in FIG. 8, whose flow is described below with reference to FIG. 8.

In step 1140, the state space model is defined by the state equation and observation equation of equation (10), and the initial value $\hat{x}_{p2}(0|0)$ of the optimum estimate vector, the initial value $P_{p2}(0|0)$ of the covariance matrix, which is the error when the state vector is estimated, the variance $R_{\varepsilon p2}(n)[i, j]$ of the vocal signal, and the variance $R_{\delta p2}(n)[i, j]$ of the music signal are set to the initial state shown in the initialization process [Initialization] described above. The variable $n$ indicating time is set to 0.

Next, in step 1142, the covariance matrix $P_{p2}(n+1|n)$, which is the error when the state vector at time $n+1$ is estimated from information up to time $n$, is calculated (procedure 1 of the iteration process [Iteration] described above). The calculation uses the state transition matrix $\Phi_{p2}$ of the state space model defined in step 1140; either the initial value $P_{p2}(0|0)$ of the covariance matrix set in step 1140 (when $n = 0$) or the covariance matrix $P_{p2}(n|n)$ updated one time step earlier in step 1150, described later (when $n \ge 1$); and the variance $R_{\delta p2}(n+1)[i, j]$ of the music signal.

Next, in step 1144, the Kalman gain matrix $K_{p2}(n+1)$ is calculated (procedure 2) using the covariance matrix $P_{p2}(n+1|n)$ calculated in step 1142, the observation transition matrix $M_{p2}$ of the state space model defined in step 1140, and the variance $R_{\varepsilon p2}(n)[i, j]$ of the vocal signal.

Next, in step 1146, the optimum estimate vector $\hat{x}_{p2}(n+1|n)$ at time $n+1$ based on information up to time $n$ is calculated (procedure 3) using the state transition matrix $\Phi_{p2}$ and either the initial value $\hat{x}_{p2}(0|0)$ of the optimum estimate vector set in step 1140 (when $n = 0$) or the optimum estimate vector $\hat{x}_{p2}(n|n)$ obtained in this step one time step earlier (when $n \ge 1$). Then, using the calculated $\hat{x}_{p2}(n+1|n)$, the Kalman gain matrix $K_{p2}(n+1)$ calculated in step 1144, the observation vector $y_{p2}(n+1)$, and the observation transition matrix $M_{p2}$, the optimum estimate vector $\hat{x}_{p2}(n+1|n+1)$ at time $n+1$ based on information up to time $n+1$ is calculated (procedure 4).

Next, in step 1148, it is determined whether to end the process. The process may be judged to have ended when the time $n$ reaches a predetermined number of samples $N$, or when no samples remain. If the process is not to end, the flow proceeds to step 1150; if it is to end, the flow proceeds to step 1154.

In step 1150, the covariance matrix $P_{p2}(n+1|n+1)$ at time $n+1$ based on information up to time $n+1$ is updated using the identity matrix $I$, the Kalman gain matrix $K_{p2}(n+1)$, the observation transition matrix $M_{p2}$, and the covariance matrix $P_{p2}(n+1|n)$ calculated in step 1142. Next, in step 1152, $n$ is incremented by 1, and the flow returns to step 1142.

In step 1154, on the other hand, the element in row 1, column 1 of the optimum estimate vector $\hat{x}_{p2}(n+1|n+1)$ calculated in step 1146 is output as the L-channel estimated music signal $\hat{i}_L(n)$, and the element in row $(L_{p2}+1)$, column 1 as the R-channel estimated music signal $\hat{i}_R(n)$, and the flow returns to the processing of FIG. 7.

Next, in step 116, the D/A conversion units 24L and 24R convert the estimated music signals $\hat{i}_L(n)$ and $\hat{i}_R(n)$, which are the digital signals output by the processing of step 114, into analog signals and output them as the final output signals L and R, and the process ends.

As described above, the stereo signal processing apparatus of the first embodiment extracts the estimated music signal by applying a Kalman filter with a colored driving source to the observation signals and to the estimated vocal signal extracted based on the ratio of the L-channel and R-channel observation spectra. It can therefore output a stereo signal with good reproducibility without impairing the stereo impression.

Moreover, only one inverse Fourier transform is needed to convert the extracted vocal spectrum into a time-domain signal.
<Second Embodiment>
In the second embodiment, a case is described in which an example of the first signal of the present invention is a vocal signal whose sound source is located near the midpoint between an L-channel microphone and an R-channel microphone, and an example of the second signal of the present invention is a music signal whose sound source is, for example, a musical instrument.

The second embodiment describes a case where either the vocal signal or the music signal is selectively extracted. In the stereo signal processing apparatus of the second embodiment, parts identical to those of the stereo signal processing apparatus 10 of the first embodiment are given the same reference numerals, and their detailed description is omitted.

As shown in FIG. 9, the stereo signal processing apparatus 210 according to the second embodiment includes A/D conversion units 12L and 12R, frequency domain conversion units 14L and 14R, a spectrum ratio calculation unit 16, a vocal signal extraction unit 18, a time domain conversion unit 20, a specific signal estimation unit 222, and D/A conversion units 24L and 24R.

In accordance with a selection signal for choosing whether to extract the music signal or the vocal signal, the specific signal estimation unit 222 extracts the estimated music signal or the estimated vocal signal based on the estimated vocal signal $\hat{d}(n)$ input from the time domain conversion unit 20 and the observation signals $x_L(n)$ and $x_R(n)$. When the selection signal indicates that the music signal is to be extracted, the specific signal estimation unit 222 extracts the estimated music signal by the same processing as the music signal estimation unit 22 of the first embodiment.

When the selection signal indicates that the vocal signal is to be extracted, on the other hand, the estimated vocal signal is extracted by the L/R channel coupled Kalman algorithm with a colored driving source shown below. The initialization process [Initialization] is the same as in the first embodiment, so its description is omitted.

To extract the estimated vocal signal, the variance $R_{\delta p2}(n+1)[i, j]$ of the music signal in procedure 1 of the iteration process [Iteration] of the first embodiment and the variance $R_{\varepsilon p2}(n+1)[i, j]$ of the vocal signal in procedure 2 are swapped. As a result, the element in row 1, column 1 or in row $(L_{p2}+1)$, column 1 of the optimum estimate vector $\hat{x}_{p2}(n+1|n+1)$ calculated in procedure 4 is obtained as the estimated vocal signal $\hat{d}'_{p2}(n)$. The estimated vocal signal obtained in this way is free of musical noise.
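The swap described above amounts to choosing which variance feeds procedure 1 and which feeds procedure 2. A minimal sketch follows, with a hypothetical function name and string labels standing in for the selection signal; the variance arguments can be any representation the surrounding filter code uses.

```python
def assign_variances(target, r_music, r_vocal):
    """Return (variance for procedure 1, variance for procedure 2).

    target "music": music variance in procedure 1, vocal variance in
    procedure 2 (first embodiment). target "vocal": the two are swapped,
    so the state estimate tracks the vocal signal instead.
    """
    if target == "music":
        return r_music, r_vocal
    if target == "vocal":
        return r_vocal, r_music
    raise ValueError("target must be 'music' or 'vocal'")
```

The rest of the iteration is unchanged, which is why a single estimation unit can serve both extraction modes.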

次に、図１０を参照して、第２の実施の形態に係るステレオ信号処理装置１０の作用について説明する。なお、第１の実施の形態における処理と同一の処理については、同一符号を付して詳細な説明を省略する。 Next, the operation of the stereo signal processing apparatus 10 according to the second embodiment will be described with reference to FIG. In addition, about the process same as the process in 1st Embodiment, the same code | symbol is attached | subjected and detailed description is abbreviate | omitted.

ステップ１００〜１１２を経て、スペクトル比に基づいて抽出された推定ボーカルスペクトル｜Ｄ^(ｌ，ｋ)｜を、時間領域の推定ボーカル信号ｄ^(ｎ)に変換する。 Through steps 100 to 112, the estimated vocal spectrum | D ^ (l, k) | extracted based on the spectrum ratio is converted into an estimated vocal signal d ^ (n) in the time domain.

次に、ステップ２００で、特定信号推定部２２２が、選択信号に基づいて楽曲信号またはボーカル信号のいずれを抽出するかを判定する。楽曲信号を抽出すると判定された場合には、ステップ１１４へ移行して、楽曲信号推定部２２が、推定ボーカル信号ｄ^(ｎ)と、観測信号ｘ_Ｌ(ｎ)，ｘ_Ｒ(ｎ)とに基づいて、楽曲信号推定処理を実行することにより、推定楽曲信号を抽出する。 Next, in step 200, the specific signal estimation unit 222 determines whether to extract a music signal or a vocal signal based on the selection signal. If it is determined that the music signal is to be extracted, the process proceeds to step 114 where the music signal estimation unit 22 determines the estimated vocal signal d ^ (n) and the observation signals x _L (n), x _R (n). Based on the above, an estimated music signal is extracted by executing a music signal estimation process.

一方、ボーカル信号を抽出すると判定された場合には、ステップ２０２へ移行し、楽曲信号推定部２２が、推定ボーカル信号ｄ^(ｎ)と、観測信号ｘ_Ｌ(ｎ)，ｘ_Ｒ(ｎ)とに基づいて、ボーカル信号推定処理を実行することにより、推定ボーカル信号を抽出する。 On the other hand, if it is determined that the vocal signal is to be extracted, the process proceeds to step 202, where the music signal estimation unit 22 determines the estimated vocal signal d ^ (n) and the observed signals x _L (n), x _R (n). Based on the above, the vocal signal estimation process is executed to extract the estimated vocal signal.

ボーカル信号推定処理は、第１の実施の形態と同様に、図８に示す有色駆動原付カルマンアルゴリズムに相当する。ここでは、楽曲信号推定処理として実行される有色駆動原付カルマンアルゴリズムのフローと異なる処理について説明する。 As in the first embodiment, the vocal signal estimation process corresponds to the Kalman algorithm with a colored driving source shown in FIG. 8. Only the processing that differs from the flow of the Kalman algorithm with a colored driving source executed as the music signal estimation process is described here.

ボーカル信号推定処理として実行されるカルマンアルゴリズムでは、ステップ１１４２で、楽曲信号の分散値Ｒ_δｐ２(ｎ＋１) [ｉ，ｊ]を、ボーカル信号の分散値Ｒ_εｐ２(ｎ) [ｉ，ｊ]に入れ替えて、時刻ｎまでの情報により時刻ｎ＋１の状態ベクトルを推定した場合の誤差である共分散行列Ｐ_ｐ２(ｎ＋１｜ｎ)を計算する（上述の反復の過程［Iteration］の手順１）。 In the Kalman algorithm executed as the vocal signal estimation process, in step 1142 the variance R_δp2(n+1)[i, j] of the music signal is replaced with the variance R_εp2(n)[i, j] of the vocal signal, and the covariance matrix P_p2(n+1|n), which is the error of the state vector at time n+1 estimated from the information up to time n, is calculated (procedure 1 of the iteration process [Iteration] described above).

また、ステップ１１４４で、ボーカル信号の分散値Ｒ_εｐ２(ｎ) [ｉ，ｊ]を、楽曲信号の分散値Ｒ_δｐ２(ｎ＋１)に入れ替えて、カルマンゲイン行列Ｋ_ｐ２(ｎ＋１)を計算する（同手順２）。 In step 1144, the variance value R _εp2 (n) [i, j] of the vocal signal is replaced with the variance value R _δp2 (n + 1) of the music signal to calculate the Kalman gain matrix K _p2 (n + 1) (same as above). Procedure 2).

また、ステップ１１５４では、上記ステップ１１４６で計算された最適推定値ベクトルｘ^_ｐ２(ｎ＋１｜ｎ＋１) の１行１列目または（Ｌ_ｐ２＋１）行１列目を推定ボーカル信号ｄ'^_ｐ２(ｎ)として出力し、図１０の処理へリターンする。ここで得られる推定ボーカル信号は、ミュージカルノイズのない信号となる。 In step 1154, the element in row 1, column 1 or in row (L_p2+1), column 1 of the optimum estimated value vector x^_p2(n+1|n+1) calculated in step 1146 is output as the estimated vocal signal d'^_p2(n), and the process returns to the process of FIG. 10. The estimated vocal signal obtained here is free of musical noise.

以上説明したように、第２の実施の形態のステレオ信号処理装置によれば、第１の実施の形態の効果に加え、所望の信号（ボーカル信号または楽曲信号）を選択的に抽出することができる。 As described above, according to the stereo signal processing apparatus of the second embodiment, in addition to the effects of the first embodiment, a desired signal (vocal signal or music signal) can be selectively extracted.
＜第３の実施の形態＞ <Third Embodiment>
第３の実施の形態では、本発明の第１信号の一例を、例えばＬチャネルマイクとＲチャネルマイクとの中央付近を音源位置とするボーカル信号（音声信号）とし、本発明の第２信号の一例を、例えば白色雑音に近い雑音信号とする場合について説明する。 In the third embodiment, a case is described in which an example of the first signal of the present invention is a vocal signal (voice signal) whose sound source is located near the center between the L channel microphone and the R channel microphone, and an example of the second signal of the present invention is a noise signal close to white noise.

第３の実施の形態では、図２に示すような状況において観測された観測信号から、雑音信号を抑圧する場合について説明する。なお、第３の実施の形態のステレオ信号処理装置について、第１の実施の形態のステレオ信号処理装置１０と同一の部分については、同一符号を付して詳細な説明を省略する。 In the third embodiment, a case will be described in which a noise signal is suppressed from an observation signal observed in the situation shown in FIG. Note that in the stereo signal processing device of the third embodiment, the same parts as those of the stereo signal processing device 10 of the first embodiment are denoted by the same reference numerals, and detailed description thereof is omitted.

図１１に示すように、第３の実施の形態に係るステレオ信号処理装置３１０は、Ａ／Ｄ変換部１２Ｌ，１２Ｒと、自己相関処理部２６Ｌ，２６Ｒと、ピーク値検出部２８Ｌ，２８Ｒと、雑音判定部３０Ｌ，３０Ｒと、雑音抑圧部３２２と、Ｄ／Ａ変換部２４Ｌ，２４Ｒとを含んで構成されている。 As shown in FIG. 11, the stereo signal processing apparatus 310 according to the third embodiment includes A / D conversion units 12L and 12R, autocorrelation processing units 26L and 26R, peak value detection units 28L and 28R, The noise determination units 30L and 30R, the noise suppression unit 322, and the D / A conversion units 24L and 24R are configured.

自己相関処理部２６Ｌ，２６Ｒは、Ａ／Ｄ変換部１２Ｌ，１２Ｒから入力された時間領域の信号である観測信号ｘ_Ｌ(ｎ)，ｘ_Ｒ(ｎ)各々の自己相関関数を計算し、ピーク値検出部２８Ｌ，２８Ｒへ各々出力する。具体的には、自己相関処理部２６Ｌ，２６Ｒは、図１２に示すように、観測信号をＬサンプルでフレーム分割する。ｌフレーム目のｉ番目のサンプルに関する観測信号ｘ_Ｌ(ｌ，ｉ)，ｘ_Ｒ(ｌ，ｉ)は、下記（１２）式及び（１３）式で表される。 The autocorrelation processing units 26L and 26R calculate the autocorrelation functions of the observation signals x _L (n) and x _R (n), which are time domain signals input from the A / D conversion units 12L and 12R, Output to the value detectors 28L and 28R, respectively. Specifically, the autocorrelation processing units 26L and 26R divide the observation signal into frames by L samples as shown in FIG. Observation signals x _L (l, i), x _R (l, i) relating to the i-th sample in the l frame are expressed by the following equations (12) and (13).

ｘ_Ｌ(ｌ，ｉ)＝ｄ(ｌ，ｉ)＋ｉ_Ｌ(ｌ，ｉ) （１２） x_L(l, i) = d(l, i) + i_L(l, i) (12)
ｘ_Ｒ(ｌ，ｉ)＝ｄ(ｌ，ｉ)＋ｉ_Ｒ(ｌ，ｉ) （１３） x_R(l, i) = d(l, i) + i_R(l, i) (13)
自己相関処理部２６Ｌ，２６Ｒは、遅れ時間をτとして、下記（１４）式及び（１５）式により、Ｌチャネル及びＲチャネル観測信号各々の自己相関関数Ｒ_ｘＬ(ｌ，τ)，Ｒ_ｘＲ(ｌ，τ)（τ＝０，・・・，Ｌ−１）を計算する。 With τ denoting the delay time, the autocorrelation processing units 26L and 26R calculate the autocorrelation functions R_xL(l, τ), R_xR(l, τ) (τ = 0, ..., L−1) of the L channel and R channel observation signals by equations (14) and (15) below.

ピーク値検出部２８Ｌ，２８Ｒは、自己相関処理部２６Ｌ，２６Ｒから各々入力された自己相関関数Ｒ_ｘＬ(ｌ，τ)，Ｒ_ｘＲ(ｌ，τ)におけるピーク値を検出し、各々雑音判定部３０Ｌ，３０Ｒへ出力する。具体的には、自己相関関数Ｒ_ｘＬ(ｌ，τ)，Ｒ_ｘＲ(ｌ，τ)において、τ＝０以外におけるピーク値を、下記（１６）式及び（１７）式により検出する（図１３も参照）。なお、ｍａｘ{＊}は、関数＊の最大値を見つける処理である。 The peak value detectors 28L and 28R detect the peak values in the autocorrelation functions R _xL (l, τ) and R _xR (l, τ) respectively input from the autocorrelation processing units 26L and 26R, and each noise determination unit Output to 30L and 30R. Specifically, in autocorrelation functions R _xL (l, τ), R _xR (l, τ), peak values other than τ = 0 are detected by the following equations (16) and (17) (FIG. 13). See also). Note that max {*} is processing for finding the maximum value of the function *.
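The per-frame autocorrelation and the peak search outside τ = 0 can be sketched as follows. This is a minimal illustration, not the patent's equations (14) to (17), which are not reproduced in this text: the sum-of-lagged-products form and the normalization of the peak by R(0) are assumptions, and the function names `frame_autocorrelation` and `peak_excluding_zero_lag` are hypothetical.

```python
def frame_autocorrelation(frame):
    """Autocorrelation R(tau), tau = 0..L-1, of one L-sample frame.

    Assumption: plain sum of lagged products within the frame.
    """
    length = len(frame)
    return [
        sum(frame[i] * frame[i + tau] for i in range(length - tau))
        for tau in range(length)
    ]


def peak_excluding_zero_lag(autocorr):
    """Largest autocorrelation value over tau = 1..L-1.

    Assumption: the peak is normalized by R(0) so that it can be compared
    with a fixed threshold; the source does not state this.
    """
    if len(autocorr) < 2 or autocorr[0] == 0:
        return 0.0
    return max(autocorr[1:]) / autocorr[0]
```

A strongly periodic frame (voice-like) yields a peak close to 1, whereas an impulse-like or white frame yields a peak close to 0, which is what makes the peak usable for the voice/noise decision.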

雑音判定部３０Ｌ，３０Ｒは、ピーク値検出部２８Ｌ，２８Ｒから入力されたピーク値ｐ_Ｌ(ｌ)，ｐ_Ｒ(ｌ)各々に基づいて、フレーム毎に雑音信号と推定される信号（以下、「推定雑音信号」という）を各々判定して、雑音抑圧部３２２へ出力する。具体的には、下記（１８）式及び（１９）式に従って、ピーク値ｐ_Ｌ(ｌ)，ｐ_Ｒ(ｌ)各々と閾値σ_１とを比較し、ピーク値が閾値σ_１より大きい場合には、フレームｌをボーカル信号（音声信号）と判定し、１フレーム前の推定雑音信号をコピーして、フレームｌの推定雑音信号ｉ^_Ｌ(ｌ，ｉ)，ｉ^_Ｒ(ｌ，ｉ)とする。一方、ピーク値が閾値σ_１より小さい場合には、フレームｌを雑音信号と判定し、そのまま推定雑音信号ｉ^_Ｌ(ｌ，ｉ) ，ｉ^_Ｒ(ｌ，ｉ)とする。なお、閾値σ_１は観測信号のＳＮＲによって決まる値である。 Based on the peak values p_L(l), p_R(l) input from the peak value detection units 28L and 28R, the noise determination units 30L and 30R each determine, frame by frame, a signal estimated to be the noise signal (hereinafter referred to as the "estimated noise signal") and output it to the noise suppression unit 322. Specifically, in accordance with equations (18) and (19) below, each of the peak values p_L(l), p_R(l) is compared with a threshold σ_1. When the peak value is greater than σ_1, frame l is determined to be a vocal signal (voice signal), and the estimated noise signal of the preceding frame is copied and used as the estimated noise signals i^_L(l, i), i^_R(l, i) of frame l. When the peak value is smaller than σ_1, frame l is determined to be a noise signal and is used as is as the estimated noise signals i^_L(l, i), i^_R(l, i). The threshold σ_1 is determined by the SNR of the observation signal.
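The frame-wise decision rule above can be sketched for one channel as follows. This is only an illustrative sketch: the function name `estimate_noise_frames` is hypothetical, and starting from an all-zero noise estimate when the very first frame is judged to be voice is an assumption not stated in the source.

```python
def estimate_noise_frames(frames, peaks, threshold):
    """Return one estimated noise frame per observed frame.

    frames:    list of equal-length sample lists (one channel)
    peaks:     autocorrelation peak value of each frame
    threshold: sigma_1 in the text
    """
    estimates = []
    # Assumption: before any noise frame has been seen, the estimate is zero.
    previous = [0.0] * len(frames[0]) if frames else []
    for frame, peak in zip(frames, peaks):
        if peak > threshold:
            # Voice frame: reuse the estimated noise of the preceding frame.
            estimate = list(previous)
        else:
            # Noise frame: the observed frame itself is the noise estimate.
            estimate = list(frame)
        estimates.append(estimate)
        previous = estimate
    return estimates
```

Holding the last noise frame during voiced frames keeps a noise reference available for the later Kalman stage even while the voice is active.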

雑音抑圧部３２２は、雑音判定部３０Ｌ，３０Ｒから入力された推定雑音信号ｉ^_Ｌ(ｌ，ｉ)，ｉ^_Ｒ(ｌ，ｉ)と、観測信号ｘ_Ｌ(ｎ)，ｘ_Ｒ(ｎ)とに基づいて、雑音信号を抑圧する。具体的には、下記に示すＬ・Ｒチャネル結合型有色駆動原付カルマンアルゴリズムにより、雑音信号を抑圧する。なお、反復の過程［Iteration］については、第１の実施の形態と同様であるため記載を省略する。 The noise suppression unit 322 suppresses the noise signal based on the estimated noise signals i^_L(l, i), i^_R(l, i) input from the noise determination units 30L and 30R and the observation signals x_L(n), x_R(n). Specifically, the noise signal is suppressed by the L/R channel combined Kalman algorithm with a colored driving source shown below. The iteration process [Iteration] is the same as in the first embodiment, and its description is therefore omitted.

初期設定の過程では、最適推定値ベクトルの初期値ｘ^_ｐ２(０｜０)、状態ベクトルを推定した場合の誤差である共分散行列の初期値Ｐ_ｐ２(０｜０)、雑音信号の分散値行列Ｒ_δｐ２(ｎ)のｉ行ｊ列の要素Ｒ_δｐ２(ｎ)[ｉ，ｊ]、及びボーカル信号の分散値行列Ｒ_εｐ２(ｎ)の値を、上記のようにそれぞれ設定する。なお、ボーカル信号の分散値は、観測信号の分散値から雑音信号の分散値を差し引いたものである。以下、第１の実施の形態と同様に、反復の過程を実行し、反復の過程の手順４において計算される最適推定値ベクトルｘ^_ｐ２(ｎ＋１｜ｎ＋１)の１行１列目または（Ｌ_ｐ２＋１）行１列目を、推定ボーカル信号ｄ^(ｎ)として得ることができる。すなわち、観測信号において雑音信号が抑圧された信号が得られる。 In the course of initialization, the initial value x ^ _p2 optimum estimate vector (0 | 0), the initial value P _p2 of the covariance matrix is the error in estimating the state vector (0 | 0), the variance of the noise signal The values of the element R _δp2 (n) [i, j] in the i row and j column of the value matrix R _δp2 (n) and the variance value matrix R _εp2 (n) of the vocal signal are set as described above. The variance value of the vocal signal is obtained by subtracting the variance value of the noise signal from the variance value of the observation signal. Thereafter, as in the first embodiment, the iterative process is executed, and the first row or the first column of the optimum estimated value vector x ^ _p2 (n + 1 | n + 1) calculated in step 4 of the iterative process or (L _p2 + 1) row 1st column can be obtained as the estimated vocal signal d ^ (n). That is, a signal in which the noise signal is suppressed in the observation signal is obtained.
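The relation stated above, in which the variance of the vocal signal is the variance of the observation signal minus the variance of the noise signal, can be illustrated with a short sketch. The helper names are hypothetical, the use of the population variance is an assumption, and clamping the result at zero is an added safeguard that does not appear in the source.

```python
def variance(samples):
    """Population variance of a list of samples."""
    mean = sum(samples) / len(samples)
    return sum((s - mean) ** 2 for s in samples) / len(samples)


def vocal_variance(observed, noise_estimate):
    """Variance of the vocal signal: observation variance minus noise variance.

    Assumption: the difference is clamped at zero so that an over-estimated
    noise variance never produces a negative variance.
    """
    return max(variance(observed) - variance(noise_estimate), 0.0)
```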

次に、図１４を参照して、第３の実施の形態に係るステレオ信号処理装置３１０の作用について説明する。なお、第１の実施の形態における処理と同一の処理については、同一符号を付して詳細な説明を省略する。 Next, the operation of the stereo signal processing apparatus 310 according to the third embodiment will be described with reference to FIG. In addition, about the process same as the process in 1st Embodiment, the same code | symbol is attached | subjected and detailed description is abbreviate | omitted.

ステップ１００で、Ａ／Ｄ変換部１２Ｌ，１２Ｒが、観測信号ｘ_Ｌ(ｎ)，ｘ_Ｒ(ｎ)を各々ディジタル信号に変換する。次に、ステップ３００で、自己相関処理部２６Ｌ，２６Ｒが、遅れ時間をτとして、（１４）式及び（１５）式により、Ｌチャネル及びＲチャネル各々の自己相関関数Ｒ_ｘＬ(ｌ，τ)，Ｒ_ｘＲ(ｌ，τ)（τ＝０，・・・，Ｌ−１）を計算する。 In step 100, the A / D converters 12L and 12R convert the observation signals x _L (n) and x _R (n) into digital signals, respectively. Next, in step 300, the autocorrelation processing units 26L and 26R set the delay time as τ, and the autocorrelation functions R _xL (l, τ) of the L channel and the R channel according to the equations (14) and (15), respectively. , R _xR (1, τ) (τ = 0,..., L−1).

次に、ステップ３０２で、ピーク値検出部２８Ｌ，２８Ｒが、上記ステップ３００で計算された自己相関関数Ｒ_ｘＬ(ｌ，τ)，Ｒ_ｘＲ(ｌ，τ)各々のτ＝０以外におけるピーク値ｐ_Ｌ(ｌ)，ｐ_Ｒ(ｌ)を検出する。 Next, in step 302, the peak value detectors 28L and 28R determine the peak values of the autocorrelation functions R _xL (l, τ) and R _xR (l, τ) calculated in step 300 other than τ = 0. p _L (l) and p _R (l) are detected.

次に、ステップ３０４で、雑音判定部３０Ｌが、ピーク値ｐ_Ｌ(ｌ)が閾値σ_１より大きいか否かを判定する。ｐ_Ｌ(ｌ)＞σ_１の場合には、ステップ３０６へ移行し、フレームｌをボーカル信号と判定し、１フレーム前の推定雑音信号をコピーして、フレームｌの推定雑音信号ｉ^_Ｌ(ｌ，ｉ)とする。一方、ｐ_Ｌ(ｌ)≦σ_１の場合には、ステップ３０８へ移行し、フレームｌを雑音信号と判定し、そのまま推定雑音信号ｉ^_Ｌ(ｌ，ｉ)とする。 Next, in step 304, the noise determination unit 30L determines whether or not the peak value p _L (l) is greater than the threshold σ ₁ . If p _L (l)> σ ₁ , the process proceeds to step 306, where the frame l is determined to be a vocal signal, the estimated noise signal of the previous frame is copied, and the estimated noise signal i ^ _L (1) of the frame l is copied. l, i). On the other hand, if p _L (l) ≦ σ ₁ , the process proceeds to step 308, where the frame l is determined as a noise signal and is directly used as the estimated noise signal i ^ _L (l, i).

Ｒチャネルについても同様に、雑音判定部３０Ｒが、ステップ３０４〜３０８を実行して、フレームｌの推定雑音信号ｉ^_Ｒ(ｌ，ｉ)を判定する。 Similarly, for the R channel, the noise determination unit 30R executes steps 304 to 308 to determine the estimated noise signal i ^ _R (l, i) of the frame l.

次に、ステップ３１０で、雑音抑圧部３２２が、推定雑音信号ｉ^_Ｌ(ｌ，ｉ)，ｉ^_Ｒ(ｌ，ｉ)と、観測信号ｘ_Ｌ(ｎ)，ｘ_Ｒ(ｎ)とに基づいて、雑音抑圧処理を実行することにより、雑音信号を抑圧する。 Next, in step 310, the noise suppression unit 322 suppresses the noise signal by executing the noise suppression process based on the estimated noise signals i^_L(l, i), i^_R(l, i) and the observation signals x_L(n), x_R(n).

雑音抑圧処理は、第１の実施の形態と同様に、図８に示す有色駆動原付カルマンアルゴリズムに相当する。ここでは、第１の実施の形態の楽曲信号推定処理として実行される有色駆動原付カルマンアルゴリズムのフローと異なる処理について説明する。 As in the first embodiment, the noise suppression process corresponds to the Kalman algorithm with a colored driving source shown in FIG. 8. Only the processing that differs from the flow of the Kalman algorithm with a colored driving source executed as the music signal estimation process of the first embodiment is described here.

雑音抑圧処理として実行される有色駆動原付カルマンアルゴリズムでは、ステップ１１４０で、最適推定値ベクトルの初期値ｘ^_ｐ２(０｜０)、状態ベクトルを推定した場合の誤差である共分散行列の初期値Ｐ_ｐ２(０｜０)、雑音信号の分散値行列Ｒ_δｐ２(ｎ)のｉ行ｊ列の要素Ｒ_δｐ２(ｎ)[ｉ，ｊ]、及びボーカル信号の分散値行列Ｒ_εｐ２(ｎ)の値を、上記のようにそれぞれ設定する。 In the colored drive moped Kalman algorithm executed as noise suppression processing, in step 1140, the initial value x ^ _p2 (0 | 0) of the optimum estimated value vector and the initial value of the covariance matrix that is an error when the state vector is estimated P _p2 (0 | 0), noise signal variance value matrix R _δp2 (n), i row j column element R _δp2 (n) [i, j], and vocal signal variance value matrix R _εp2 (n) Set the values as described above.

また、ステップ１１５４では、上記ステップ１１４６で計算された最適推定値ベクトルｘ^_ｐ２(ｎ＋１｜ｎ＋１) の１行１列目または（Ｌ_ｐ２＋１）行１列目を推定ボーカル信号ｄ^(ｎ)、すなわち、観測信号において雑音信号が抑圧された推定音声信号として出力し、図１４の処理へリターンする。 In step 1154, the estimated vocal signal d ^ (n) in the first row and first column or the (L _p2 +1) row and first column of the optimum estimated value vector x ^ _p2 (n + 1 | n + 1) calculated in step 1146 is used. That is, an estimated speech signal in which the noise signal is suppressed in the observation signal is output, and the process returns to the process of FIG.

以上説明したように、第３の実施の形態のステレオ信号処理装置によれば、第１の実施の形態の効果に加え、自己相関を用いて推定された推定雑音信号を用いて有色駆動源型のカルマンフィルタを適用することにより、白色性の雑音に対して抑圧効果を高めることができる。また、時間領域の信号処理のみであるため、演算量を削減できる。 As described above, according to the stereo signal processing apparatus of the third embodiment, in addition to the effects of the first embodiment, applying the colored-driving-source Kalman filter with the estimated noise signal obtained from the autocorrelation enhances the suppression of white noise. Moreover, because only time-domain signal processing is performed, the amount of computation can be reduced.
＜第４の実施の形態＞ <Fourth Embodiment>
第４の実施の形態では、本発明の第１信号の一例を、例えばＬチャネルマイクとＲチャネルマイクとの中央付近を音源位置とするボーカル信号（音声信号）とし、本発明の第２信号の一例を、例えば雑音信号とする場合について説明する。 In the fourth embodiment, a case is described in which an example of the first signal of the present invention is a vocal signal (voice signal) whose sound source is located near the center between the L channel microphone and the R channel microphone, and an example of the second signal of the present invention is a noise signal.

第４の実施の形態について説明する。第４の実施の形態では、図２に示すような状況において観測された観測信号から、雑音信号を抑圧する場合について説明する。なお、第４の実施の形態のステレオ信号処理装置について、第１の実施の形態のステレオ信号処理装置１０及び第３の実施の形態のステレオ信号処理装置３１０と同一の部分については、同一符号を付して詳細な説明を省略する。 A fourth embodiment will be described. In the fourth embodiment, a case will be described in which a noise signal is suppressed from an observation signal observed in the situation shown in FIG. For the stereo signal processing apparatus of the fourth embodiment, the same reference numerals are used for the same parts as the stereo signal processing apparatus 10 of the first embodiment and the stereo signal processing apparatus 310 of the third embodiment. Detailed description will be omitted.

図１５に示すように、第４の実施の形態に係るステレオ信号処理装置４１０は、Ａ／Ｄ変換部１２Ｌ，１２Ｒと、周波数領域変換部１４Ｌ，１４Ｒと、スペクトル密度演算部３２Ｌ，３２Ｒと、スペクトルエントロピー演算部３４Ｌ，３４Ｒと、雑音判定部４３０Ｌ，４３０Ｒと、時間領域変換部２０Ｌ，２０Ｒと、雑音抑圧部３２２と、Ｄ／Ａ変換部２４Ｌ，２４Ｒとを含んで構成されている。 As shown in FIG. 15, the stereo signal processing apparatus 410 according to the fourth embodiment includes A / D conversion units 12L and 12R, frequency domain conversion units 14L and 14R, spectral density calculation units 32L and 32R, Spectral entropy calculation units 34L and 34R, noise determination units 430L and 430R, time domain conversion units 20L and 20R, a noise suppression unit 322, and D / A conversion units 24L and 24R are configured.

スペクトル密度演算部３２Ｌ，３２Ｒは、周波数領域変換部１４Ｌ，１４Ｒから入力された観測スペクトル｜Ｘ_Ｌ(ｌ，ｋ)｜，｜Ｘ_Ｒ(ｌ，ｋ)｜に基づいて、下記（２０）式及び（２１）式により、Ｌチャネル及びＲチャネル観測信号各々のパワースペクトル密度Ｐ_Ｌ(ｌ，ｋ)，Ｐ_Ｒ(ｌ，ｋ)を演算し、スペクトルエントロピー演算部３４Ｌ，３４Ｒへ入力する。ｌはフレーム番号、ｋは周波数ビン番号である。 Based on the observed spectra |X_L(l, k)|, |X_R(l, k)| input from the frequency domain conversion units 14L and 14R, the spectral density calculation units 32L and 32R calculate the power spectral densities P_L(l, k), P_R(l, k) of the L channel and R channel observation signals by equations (20) and (21) below, and input them to the spectral entropy calculation units 34L and 34R. Here, l is the frame number and k is the frequency bin number.

ここで、ボーカル信号（音声信号）のスペクトルは、２５０〜４０００Ｈｚの周波数帯域内に存在することを考慮し、ｋ≦２５０Ｈｚまたはｋ≧４０００Ｈｚの場合には、｜Ｘ_Ｌ(ｌ，ｋ)｜＝｜Ｘ_Ｒ(ｌ，ｋ)｜＝０とする。 Here, considering that the spectrum of a vocal signal (voice signal) lies within the 250 to 4000 Hz frequency band, |X_L(l, k)| = |X_R(l, k)| = 0 is set when k ≤ 250 Hz or k ≥ 4000 Hz.

スペクトルエントロピー演算部３４Ｌ，３４Ｒは、スペクトル密度演算部３２Ｌ，３２Ｒから入力されたスペクトル密度Ｐ_Ｌ(ｌ，ｋ)，Ｐ_Ｒ(ｌ，ｋ)に基づいて、下記（２２）式及び（２３）式により、Ｌチャネル及びＲチャネル観測信号各々のスペクトルエントロピーＨ_Ｌ(ｌ)，Ｈ_Ｒ(ｌ)を演算し、雑音判定部４３０Ｌ，４３０Ｒへ入力する。 Based on the spectral densities P_L(l, k), P_R(l, k) input from the spectral density calculation units 32L and 32R, the spectral entropy calculation units 34L and 34R calculate the spectral entropies H_L(l), H_R(l) of the L channel and R channel observation signals by equations (22) and (23) below, and input them to the noise determination units 430L and 430R.
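The band limiting, power spectral density, and spectral entropy steps can be sketched for one frame of one channel as follows. Equations (20) to (23) are not reproduced in this text, so the sketch assumes the commonly used definitions: a power spectrum normalized to sum to one, and the Shannon entropy of that distribution with the natural logarithm. The function names and the `bin_hz` parameter (frequency spacing between bins) are hypothetical.

```python
import math


def normalized_psd(magnitudes, bin_hz, low_hz=250.0, high_hz=4000.0):
    """Power spectral density of one frame, normalized to sum to 1.

    Bins at or below low_hz and at or above high_hz are zeroed, following
    the band limit described in the text.
    """
    powers = []
    for k, magnitude in enumerate(magnitudes):
        frequency = k * bin_hz
        in_band = low_hz < frequency < high_hz
        powers.append(magnitude * magnitude if in_band else 0.0)
    total = sum(powers)
    if total == 0:
        return [0.0] * len(powers)
    return [p / total for p in powers]


def spectral_entropy(psd):
    """Shannon entropy of a normalized PSD (assumed form; natural log)."""
    return -sum(p * math.log(p) for p in psd if p > 0.0)
```

Under these assumptions a flat in-band spectrum (noise-like) gives the largest entropy, while a spectrum concentrated in a few bins (voice-like) gives a small entropy, which matches the direction of the threshold comparison used by the noise determination units.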

雑音判定部４３０Ｌ，４３０Ｒは、スペクトルエントロピー演算部３４Ｌ，３４Ｒから入力されたスペクトルエントロピーＨ_Ｌ(ｌ)，Ｈ_Ｒ(ｌ)各々に基づいて、フレーム毎に推定雑音信号のスペクトル（以下、「推定雑音スペクトル」という）を各々判定して、時間領域変換部２０Ｌ，２０Ｒへ出力する。具体的には、下記（２４）式及び（２５）式に従って、スペクトルエントロピーＨ_Ｌ(ｌ)，Ｈ_Ｒ(ｌ)各々と閾値σ_２とを比較し、スペクトルエントロピーが閾値σ_２より小さい場合には、フレームｌをボーカル信号（音声信号）と判定し、１フレーム前の推定雑音スペクトルをコピーして、フレームｌの推定雑音スペクトル｜Ｉ^_Ｌ(ｌ，ｋ)｜，｜Ｉ^_Ｒ(ｌ，ｋ)｜とする。一方、スペクトルエントロピーが閾値σ_２より大きい場合には、フレームｌを雑音信号と判定し、そのまま推定雑音スペクトル｜Ｉ^_Ｌ(ｌ，ｋ)｜，｜Ｉ^_Ｒ(ｌ，ｋ)｜とする。 The noise determination units 430L and 430R are based on the spectrum entropy H _L (l) and H _R (l) input from the spectrum entropy calculation units 34L and 34R, respectively, and the spectrum of the estimated noise signal (hereinafter referred to as “estimation”). Each of which is referred to as “noise spectrum” and output to the time domain conversion units 20L and 20R. Specifically, according to the following formulas (24) and (25), the spectral entropy H _L (l), H _R (l) is compared with the threshold σ ₂ and the spectral entropy is smaller than the threshold σ _2. Determines that the frame l is a vocal signal (voice signal), copies the estimated noise spectrum of the previous frame, and estimates the estimated noise spectrum | I ^ _L (l, k) |, | I ^ _R (l , K) | On the other hand, when the spectral entropy is larger than the threshold σ ₂ , the frame l is determined as a noise signal, and the estimated noise spectrum | I ^ _L (l, k) |, | I ^ _R (l, k) | .

ここで、閾値σ_２は以下のようにして決定する。まずＮフレーム分のスペクトルエントロピーの平均値を用いて閾値σ’_２(ｌ)を下式のように導出する。 Here, the threshold value σ ₂ is determined as follows. First, the threshold value σ ′ ₂ (l) is derived as follows using the average value of the spectral entropy for N frames.

次に閾値σ’_２(ｌ)と現フレームのスペクトルエントロピーとを比較し、閾値σ’_２(ｌ)よりも現フレームのスペクトルエントロピーの方が小さい場合は閾値σ’_２(ｌ)をα倍する。 Next, the threshold σ'_2(l) is compared with the spectral entropy of the current frame, and when the spectral entropy of the current frame is smaller than σ'_2(l), the threshold σ'_2(l) is multiplied by α.

そして過去３フレームが連続して音声信号か否かを判定した後に最終的な閾値σ_２(ｌ)を得る。 Then, after determining whether or not the past three frames are audio signals in succession, a final threshold σ ₂ (l) is obtained.

もし音声信号が連続していない場合は、過去３フレーム雑音信号が連続したか否かを判定した後に最終的な閾値σ_２（ｌ）を得る。 If the audio signal is not continuous, the final threshold σ ₂ (l) is obtained after determining whether or not the past three frame noise signals have been continuous.

時間領域変換部２０Ｌ，２０Ｒは、雑音判定部４３０Ｌ，４３０Ｒから入力された周波数領域の信号である推定雑音スペクトル｜Ｉ^_Ｌ(ｌ，ｋ)｜，｜Ｉ^_Ｒ(ｌ，ｋ)｜を逆フーリエ変換して、時間領域の信号である推定雑音信号ｉ^_Ｌ(ｌ，ｎ)，ｉ^_Ｒ(ｌ，ｎ)に変換する。次いで、オーバーラップアド法を用いて１フレーム前の後半Ｍサンプルを用いた時間領域推定楽曲信号ｉ^_Ｌ(ｌ−１，ｎ＋Ｍ)，ｉ^_Ｒ(ｌ−１，ｎ＋Ｍ)と現フレームの前半Ｍサンプルを用いた時間領域推定楽曲信号ｉ^_Ｌ(ｌ，ｎ)，ｉ^_Ｒ(ｌ，ｎ)とを足し合わせて、現フレームのＭサンプル時間領域推定楽曲信号ｉ^_Ｌ(ｎ)，ｉ^_Ｒ(ｎ)（１≦ｎ≦Ｍ）を得る。 The time domain transforming units 20L and 20R receive the estimated noise spectrums | I ^ _L (l, k) |, | I ^ _R (l, k) |, which are frequency domain signals input from the noise determination units 430L and 430R. Inverse Fourier transform is performed to convert the estimated noise signals i ^ _L (l, n) and i ^ _R (l, n), which are time domain signals. Next, using the overlap add method, the time domain estimated music signal i ^ _L (l-1, n + M), i ^ _R (l-1, n + M) using the latter half M samples of the previous frame and the first half of the current frame are used. The time-domain estimated music signal i ^ _L (l, n), i ^ _R (l, n) using M samples is added together to obtain the M-sample time-domain estimated music signal i ^ _L (n), i ^ _R (n) (1≤n≤M) is obtained.
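The overlap-add step described above can be sketched as follows, assuming frames of 2M samples that overlap by M samples, so that each output block is the sum of the second half of the preceding frame and the first half of the current frame. The function name `overlap_add`, the parameter name `hop` (standing for M), and the zero contribution assumed before the first frame are illustrative choices, not taken from the source.

```python
def overlap_add(frames, hop):
    """Overlap-add frames of length 2 * hop with 50 % overlap.

    For every frame, `hop` output samples are produced by adding the first
    half of the current frame to the second half of the preceding frame.
    """
    output = []
    previous_tail = [0.0] * hop  # assumption: nothing precedes the first frame
    for frame in frames:
        head, tail = frame[:hop], frame[hop:]
        output.extend(h + t for h, t in zip(head, previous_tail))
        previous_tail = tail
    return output
```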

次に、図１６を参照して、第４の実施の形態に係るステレオ信号処理装置４１０の作用について説明する。なお、第１の実施の形態における処理と同一の処理については、同一符号を付して詳細な説明を省略する。 Next, the operation of the stereo signal processing device 410 according to the fourth embodiment will be described with reference to FIG. In addition, about the process same as the process in 1st Embodiment, the same code | symbol is attached | subjected and detailed description is abbreviate | omitted.

ステップ１００で、Ａ／Ｄ変換部１２Ｌ，１２Ｒが、観測信号ｘ_Ｌ(ｎ)，ｘ_Ｒ(ｎ)を各々ディジタル信号に変換し、次に、ステップ１０２で、周波数領域変換部１４Ｌ，１４Ｒが、周波数領域の信号である観測スペクトル｜Ｘ_Ｌ(ｌ，ｋ)｜，｜Ｘ_Ｒ(ｌ，ｋ)｜に変換する。 In step 100, the A / D converters 12L and 12R convert the observation signals x _L (n) and x _R (n) into digital signals, respectively, and then in step 102, the frequency domain converters 14L and 14R , The observed spectrum | X _L (l, k) |, | X _R (l, k) |

次に、ステップ４００で、スペクトル密度演算部３２Ｌ，３２Ｒが、観測スペクトル｜Ｘ_Ｌ(ｌ，ｋ)｜，｜Ｘ_Ｒ(ｌ，ｋ)｜に基づいて、Ｌチャネル及びＲチャネル観測信号各々のパワースペクトル密度Ｐ_Ｌ(ｌ，ｋ)，Ｐ_Ｒ(ｌ，ｋ)を演算する。 Next, in step 400, the spectral density calculation units 32L and 32R calculate the power spectral densities P_L(l, k), P_R(l, k) of the L channel and R channel observation signals based on the observed spectra |X_L(l, k)|, |X_R(l, k)|.

次に、ステップ４０２で、スペクトルエントロピー演算部３４Ｌ，３４Ｒが、上記ステップ４００で演算されたパワースペクトル密度Ｐ_Ｌ(ｌ，ｋ)，Ｐ_Ｒ(ｌ，ｋ)に基づいて、Ｌチャネル及びＲチャネル観測信号各々のスペクトルエントロピーＨ_Ｌ(ｌ)，Ｈ_Ｒ(ｌ)を演算する。 Next, in step 402, the spectral entropy calculation units 34L and 34R calculate the spectral entropies H_L(l), H_R(l) of the L channel and R channel observation signals based on the power spectral densities P_L(l, k), P_R(l, k) calculated in step 400.

次に、ステップ４０４で、雑音判定部４３０Ｌが、上述のように閾値σ_２を決定し、決定した閾値σ_２を用いて、スペクトルエントロピーＨ_Ｌ(ｌ)が閾値σ_２より小さいか否かを判定する。Ｈ_Ｌ(ｌ)＜σ_２の場合には、ステップ４０６へ移行し、フレームｌをボーカル信号と判定し、１フレーム前の推定雑音スペクトルをコピーして、フレームｌの推定雑音スペクトル｜Ｉ^_Ｌ(ｌ，ｋ)｜とする。一方、Ｈ_Ｌ(ｌ)≧σ_２の場合には、ステップ４０８へ移行し、フレームｌを雑音信号と判定し、そのまま推定雑音スペクトル｜Ｉ^_Ｌ(ｌ，ｋ)｜とする。 Next, in step 404, the noise determination unit 430L determines the threshold σ_2 as described above and uses it to determine whether the spectral entropy H_L(l) is smaller than σ_2. If H_L(l) < σ_2, the process proceeds to step 406, where frame l is determined to be a vocal signal and the estimated noise spectrum of the preceding frame is copied and used as the estimated noise spectrum |I^_L(l, k)| of frame l. If H_L(l) ≥ σ_2, the process proceeds to step 408, where frame l is determined to be a noise signal and is used as is as the estimated noise spectrum |I^_L(l, k)|.

Ｒチャネルについても同様に、雑音判定部４３０Ｒが、ステップ４０４〜４０８を実行して、フレームｌの推定雑音スペクトル｜Ｉ^_Ｒ(ｌ，ｋ)｜を判定する。 Similarly, for the R channel, the noise determination unit 430R executes steps 404 to 408 to determine the estimated noise spectrum | I ^ _R (l, k) |

次に、ステップ１１２で、時間領域変換部２０Ｌ，２０Ｒが、推定雑音スペクトル｜Ｉ^_Ｌ(ｌ，ｋ)｜，｜Ｉ^_Ｒ(ｌ，ｋ)｜を、フーリエ逆変換を用いて時間領域の信号である推定雑音信号ｉ^_Ｌ(ｌ，ｎ)，ｉ^_Ｒ(ｌ，ｎ)に変換する。次いで、オーバーラップアド法を用いて１フレーム前の後半Ｍサンプルを用いた時間領域推定楽曲信号ｉ^_Ｌ(ｌ−１，ｎ＋Ｍ)，ｉ^_Ｒ(ｌ−１，ｎ＋Ｍ)と現フレームの前半Ｍサンプルを用いた時間領域推定楽曲信号ｉ^_Ｌ(ｌ，ｎ)，ｉ^_Ｒ(ｌ，ｎ)とを足し合わせて、現フレームのＭサンプル時間領域推定楽曲信号ｉ^_Ｌ(ｎ)，ｉ^_Ｒ(ｎ)（１≦ｎ≦Ｍ）を得る。 Next, in step 112, the time domain transforming units 20L and 20R convert the estimated noise spectrum | I ^ _L (l, k) |, | I ^ _R (l, k) | into the time domain using Fourier inverse transform. Are converted into estimated noise signals i ^ _L (l, n), i ^ _R (l, n). Next, using the overlap add method, the time domain estimated music signal i ^ _L (l-1, n + M), i ^ _R (l-1, n + M) using the latter half M samples of the previous frame and the first half of the current frame are used. The time-domain estimated music signal i ^ _L (l, n), i ^ _R (l, n) using M samples is added together to obtain the M-sample time-domain estimated music signal i ^ _L (n), i ^ _R (n) (1≤n≤M) is obtained.

次に、ステップ４１０で、雑音抑圧部３２２が、推定雑音信号ｉ^_Ｌ(ｎ)，ｉ^_Ｒ(ｎ)と、観測信号ｘ_Ｌ(ｎ)，ｘ_Ｒ(ｎ)とに基づいて、雑音抑圧処理を実行することにより、雑音信号を抑圧する。雑音抑圧処理は、第３の実施の形態と同様である。 Next, in step 410, the noise suppression unit 322 suppresses the noise signal by executing the noise suppression process based on the estimated noise signals i^_L(n), i^_R(n) and the observation signals x_L(n), x_R(n). The noise suppression process is the same as in the third embodiment.

以上説明したように、第４の実施の形態のステレオ信号処理装置によれば、第１の実施の形態の効果に加え、スペクトルエントロピーを用いて推定された推定雑音信号を用いて有色駆動源型のカルマンフィルタを適用することにより、白色性及び有色性の様々な雑音に対して抑圧効果を高めることができる。 As described above, according to the stereo signal processing device of the fourth embodiment, in addition to the effects of the first embodiment, the color drive source type is used by using the estimated noise signal estimated using the spectral entropy. By applying this Kalman filter, it is possible to enhance the suppression effect against various white and colored noises.

なお、上記第１〜第４の実施の形態で用いた有色駆動源付カルマンフィルタの演算量を軽減した演算量軽減型有色駆動源付カルマンフィルタを用いてもよい。演算量軽減型有色駆動源付カルマンフィルタでは、所望の信号の推定に必要な処理だけを取り出す。 In addition, you may use the Kalman filter with a calculation amount reduction type color drive source which reduced the calculation amount of the Kalman filter with a color drive source used in the said 1st-4th embodiment. In the Kalman filter with a colored drive source that reduces the amount of computation, only processing necessary for estimating a desired signal is extracted.

詳細には、図１７に示すように、手順４の状態量の更新において、Ｌチャネル及びＲチャネルの推定楽曲信号を示す部分のみ取り出すと、手順２におけるカルマンゲイン行列の４つの要素が必要であることがわかる。そこで、図１８に示すように、この必要な４つの要素の部分のみを取り出すと、手順１における共分散行列の４つの要素が必要であることがわかる。そこで、図１９に示すように、この必要な４つの要素の部分のみを取り出すと、楽曲信号の分散値が必要であることがわかる。 Specifically, as shown in FIG. 17, when only the portions indicating the estimated music signal of the L channel and the R channel are extracted in the state quantity update in the procedure 4, four elements of the Kalman gain matrix in the procedure 2 are necessary. I understand that. Therefore, as shown in FIG. 18, when only the necessary four element portions are taken out, it is understood that the four elements of the covariance matrix in the procedure 1 are necessary. Therefore, as shown in FIG. 19, when only the necessary four element portions are taken out, it can be seen that the dispersion value of the music signal is necessary.

以上をまとめると、演算量軽減型有色駆動源付カルマンアルゴリズムは、下記に示すとおりとなり、ステップが減ったことにより演算量が軽減できる。なお、ｐ３は演算量軽減型有色駆動原付カルマンフィルタが適用される状態方程式及び観測方程式の変数であることを表す添え字である。 In summary, the reduced-computation Kalman algorithm with a colored driving source is as shown below, and the amount of computation is reduced because the number of steps is smaller. The subscript p3 indicates a variable of the state equation and the observation equation to which the reduced-computation Kalman filter with a colored driving source is applied.

ここで、第１の実施の形態における楽曲信号推定処理（図７のステップ１１４）に、上記の演算量軽減型有色駆動原付カルマンフィルタを適用した場合に実行される演算量軽減型有色駆動原付カルマンアルゴリズムのフローについて、図２０を参照して説明する。 The flow of the reduced-computation Kalman algorithm with a colored driving source, which is executed when the reduced-computation Kalman filter with a colored driving source described above is applied to the music signal estimation process of the first embodiment (step 114 in FIG. 7), is described below with reference to FIG. 20.

ステップ２１４０で、（１０）式に示す状態方程式及び観測方程式により状態空間モデルを定義し、最適推定値ベクトルの初期値ｘ^_ｐ３(０｜０)、状態ベクトルを推定した場合の誤差である共分散行列の初期値Ｐ_ｐ３(０｜０)、ボーカル信号の分散値Ｒ_εｐ３(ｎ) [ｉ，ｊ]、及び楽曲信号の分散値Ｒ_δｐ３(ｎ)[ｉ，ｊ]を、上述の初期設定の過程［Initialization］に示した初期状態に設定する。また、時刻を示す変数ｎを０に設定する。 In step 2140, the state space model is defined by the state equation and the observation equation shown in equation (10), and the initial value x^_p3(0|0) of the optimum estimated value vector, the initial value P_p3(0|0) of the covariance matrix that is the error of the estimated state vector, the variance R_εp3(n)[i, j] of the vocal signal, and the variance R_δp3(n)[i, j] of the music signal are set to the initial state shown in the initialization process [Initialization] described above. The variable n indicating time is also set to 0.

次に、ステップ２１４２で、楽曲信号の分散値Ｒ_δｐ３(ｎ＋１)[ｉ，ｊ]の値を用いて、時刻ｎまでの情報により時刻ｎ＋１の状態ベクトルを推定した場合の誤差である共分散行列Ｐ_ｐ３(ｎ＋１｜ｎ)を計算する（上述の反復の過程［Iteration］の手順１）。 Next, in step 2142, a covariance matrix that is an error when the state vector at time n + 1 is estimated from information up to time n using the value of the variance value R _δp3 (n + 1) [i, j] of the music signal. P _p3 (n + 1 | n) is calculated (step 1 of the above iteration process [Iteration]).

次に、ステップ２１４４で、上記ステップ２１４２で計算した共分散行列Ｐ_ｐ３(ｎ＋１｜ｎ)、及びボーカル信号の分散値Ｒ_εｐ３(ｎ) [ｉ，ｊ]を用いて、カルマンゲイン行列Ｋ_ｐ３(ｎ＋１)を計算する（同手順２）。 Next, in step 2144, using the covariance matrix P _p3 (n + 1 | n) calculated in step 2142 and the variance value R _εp3 (n) [i, j] of the vocal signal, the Kalman gain matrix K _p3 ( n + 1) is calculated (same procedure 2).

次に、ステップ２１４６で、上記ステップ２１４４で計算したカルマンゲイン行列Ｋ_ｐ３(ｎ＋１)、及び観測ベクトルｙ_ｐ３(ｎ＋１)を用いて、時刻ｎ＋１までの情報によるその時刻での最適推定値ベクトルｘ^_ｐ３(ｎ＋１｜ｎ＋１)を計算する（同手順３）。 Next, in step 2146, using the Kalman gain matrix K _p3 (n + 1) calculated in step 2144 and the observation vector y _p3 (n + 1), the optimum estimated value vector x ^ at that time by the information up to time n + 1. _p3 (n + 1 | n + 1) is calculated (same procedure 3).

次に、ステップ２１４８で、処理を終了するか否かを判定する。この判定は、時刻ｎが所定のサンプル数Ｎに達した場合を処理終了と判定してもよいし、サンプルがなくなった時点で処理終了と判定してもよい。処理を終了しない場合には、ステップ２１５２へ移行し、ｎを１インクリメントして、ステップ１１４２へ戻る。処理を終了する場合には、ステップ２１５４へ移行し、上記ステップ２１４６で計算された最適推定値ベクトルｘ^_ｐ３(ｎ＋１｜ｎ＋１) の１行１列目をＬチャネルの推定楽曲信号ｉ^_Ｌ(ｎ)として、（Ｌ_ｐ３＋１）行１列目をＲチャネルの推定楽曲信号ｉ^_Ｒ(ｎ)として出力し、図７の処理へリターンする。 Next, in step 2148, it is determined whether or not to end the process. In this determination, when the time n reaches a predetermined number N of samples, it may be determined that the process is ended, or may be determined when the sample is exhausted. If the process is not terminated, the process proceeds to step 2152, n is incremented by 1, and the process returns to step 1142. When the processing is ended, the process proceeds to step 2154, and the L-channel estimated music signal i ^ _L (in the first row and the first column of the optimum estimated value vector x ^ _p3 (n + 1 | n + 1) calculated in step 2146 above. n), (L _p3 +1) row 1st column is output as an R channel estimated music signal i ^ _R (n), and the process returns to the process of FIG.

In the third and fourth embodiments, the case of suppressing the noise signal has been described. However, as in the second embodiment, the variance of the noise signal and the variance of the vocal signal may be interchanged in the Kalman algorithm with a colored drive source, so that a signal in which the vocal signal is suppressed, that is, an estimated noise signal, is extracted. Specifically, in procedure 1 of the iteration process [Iteration] of the Kalman algorithm with a colored drive source, the variance R_δp2(n+1)[i, j] of the noise signal is replaced with the variance R_εp2(n)[i, j] of the vocal signal, and the covariance matrix P_p2(n+1|n), which is the error covariance when the state vector at time n+1 is estimated from the information up to time n, is calculated. In procedure 2, the variance R_εp2(n)[i, j] of the vocal signal is replaced with the variance R_δp2(n+1)[i, j] of the noise signal, and the Kalman gain matrix K_p2(n+1) is calculated. The elements in the first row, first column and in the (L_p2+1)-th row, first column of the optimum estimate vector x^_p2(n+1|n+1) calculated in procedure 4 can then be obtained as the estimated noise signals i'^_L(l, i) and i'^_R(l, i).

Likewise, when the computation-reduced Kalman algorithm described above is applied to the first embodiment (or to the third and fourth embodiments), the variance of the music signal (or noise signal) and the variance of the vocal signal can be interchanged, as in the second embodiment, to extract a signal in which the vocal signal is suppressed, that is, an estimated music signal (or estimated noise signal). Specifically, in procedure 1 of the iteration process [Iteration] of the computation-reduced Kalman algorithm with a colored drive source, the variance R_δp3(n+1)[i, j] of the music signal (or noise signal) is replaced with the variance R_εp3(n)[i, j] of the vocal signal, and the covariance matrix P_p3(n+1|n), which is the error covariance when the state vector at time n+1 is estimated from the information up to time n, is calculated. In procedure 2, the variance R_εp3(n)[i, j] of the vocal signal is replaced with the variance R_δp3(n+1)[i, j] of the music signal (or noise signal), and the Kalman gain matrix K_p3(n+1) is calculated. The elements in the first row, first column and in the (L_p3+1)-th row, first column of the optimum estimate vector x^_p3(n+1|n+1) calculated in procedure 3 can then be obtained as the estimated noise signals i'^_L(l, i) and i'^_R(l, i).
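
The interchange of variances described above can be pictured as a small selection step performed before procedures 1 and 2. The following Python sketch is illustrative only; the function name `select_variances` and the flag `swap` are hypothetical and do not appear in the source.

```python
def select_variances(r_delta, r_eps, swap=False):
    """Return (variance used in procedure 1, variance used in procedure 2).

    r_delta: variance of the music (or noise) signal.
    r_eps:   variance of the vocal signal.
    swap:    hypothetical flag; True applies the interchange described in the text.
    """
    if swap:
        # Interchanged roles: the vocal variance enters the covariance
        # calculation and the music (or noise) variance enters the gain.
        return r_eps, r_delta
    # Default roles: music (or noise) variance in procedure 1,
    # vocal variance in procedure 2.
    return r_delta, r_eps
```

With `swap=False` the filter keeps its original roles; with `swap=True` the same recursion is run with the two variances exchanged, which is all the modification described in the text requires.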

The embodiments described above can also be combined as appropriate. For example, a desired signal may be extracted according to the first or second embodiment, and noise may then be suppressed according to the third or fourth embodiment.

In the first and second embodiments, the first signal is a vocal signal and the second signal is a music signal; in the third and fourth embodiments, the first signal is a vocal signal (speech signal) and the second signal is a noise signal. However, the present invention is not limited to these cases. In the input signals of a plurality of channels, it suffices that the first signal is a signal included in common across the channels and the second signal is a signal that differs for each channel.

In the above embodiments, the case where each unit is configured by hardware has been described, but the processing of each unit may instead be implemented as a program executed by a computer. The program may be installed in the apparatus in advance, may be provided stored on a computer-readable recording medium, or may be provided via a network.

10, 210, 310, 410 Stereo signal processing apparatus
12L, 12R A/D conversion unit
14L, 14R Frequency domain conversion unit
16 Spectrum ratio calculation unit
18 Vocal signal extraction unit
20, 20L, 20R Time domain conversion unit
22 Music signal estimation unit
24L, 24R D/A conversion unit
26L, 26R Autocorrelation processing unit
28L, 28R Peak value detection unit
30L, 30R, 430L, 430R Noise determination unit
32L, 32R Spectral density calculation unit
34L, 34R Spectral entropy calculation unit
222 Specific signal estimation unit
322 Noise suppression unit

Claims

A multi-channel signal processing apparatus comprising:
extraction means for converting each of the time-domain observation signals of a plurality of channels, which include a first signal contained in common in every channel and a second signal that differs for each channel, into a frequency-domain spectrum signal, extracting, based on the ratio of the spectrum signals, an estimated first spectrum signal that is estimated to be the spectrum signal of the first signal, and converting the estimated first spectrum signal, which is a frequency-domain signal, into a time-domain signal, thereby extracting a time-domain estimated first signal that is estimated to be the first signal; and
estimation means for estimating the first signal or the second signal by applying a Kalman filter with a colored drive source to a state space model, using the variance of the time-domain estimated first signal extracted by the extraction means, the variance of the second signal obtained by subtracting the variance of the estimated first signal from the variance of the observation signal, and the observation signals of the plurality of channels, the state space model being represented by a state equation composed only of the second signal including elements corresponding to the plurality of channels, and by an observation equation composed of the first signal and the second signal including elements corresponding to the plurality of channels.

The multi-channel signal processing apparatus according to claim 1, comprising:
subsequent-stage extraction means for extracting a time-domain estimated second signal based on the peak value of the autocorrelation of each of the time-domain observation signals of a plurality of channels that include the first signal or the second signal estimated by the estimation means; and
subsequent-stage estimation means for estimating the first signal or the second signal by a Kalman filter with a colored drive source, using the variance of the time-domain estimated second signal extracted by the subsequent-stage extraction means, the variance of the first signal obtained by subtracting the variance of the estimated second signal from the variance of the observation signal, and the observation signals of the plurality of channels.

The multi-channel signal processing apparatus according to claim 1, comprising:
subsequent-stage extraction means for converting each of the time-domain observation signals of a plurality of channels that include the first signal or the second signal estimated by the estimation means into a frequency-domain spectrum signal, extracting, based on the spectral entropy obtained from each spectrum signal, an estimated second spectrum signal that is estimated to be the spectrum signal of the second signal, and converting the estimated second spectrum signal, which is a frequency-domain signal, into a time-domain signal, thereby extracting a time-domain estimated second signal that is estimated to be the second signal; and
subsequent-stage estimation means for estimating the first signal or the second signal by a Kalman filter with a colored drive source, using the variance of the time-domain estimated second signal extracted by the subsequent-stage extraction means, the variance of the first signal obtained by subtracting the variance of the estimated second signal from the variance of the observation signal, and the observation signals of the plurality of channels.

A multi-channel signal processing apparatus comprising:
extraction means for extracting a time-domain estimated second signal that is estimated to be a second signal, based on the peak value of the autocorrelation of each of the time-domain observation signals of a plurality of channels, which include a first signal contained in common in every channel and the second signal that differs for each channel; and
estimation means for estimating the first signal or the second signal by applying a Kalman filter with a colored drive source to a state space model, using the variance of the time-domain estimated second signal extracted by the extraction means, the variance of the first signal obtained by subtracting the variance of the estimated second signal from the variance of the observation signal, and the observation signals of the plurality of channels, the state space model being represented by a state equation composed only of the second signal including elements corresponding to the plurality of channels, and by an observation equation composed of the first signal and the second signal including elements corresponding to the plurality of channels.

A multi-channel signal processing apparatus comprising:
extraction means for converting each of the time-domain observation signals of a plurality of channels, which include a first signal contained in common in every channel and a second signal that differs for each channel, into a frequency-domain spectrum signal and calculating the power spectral density of each, extracting, based on the spectral entropy obtained from each power spectral density, an estimated second spectrum signal that is estimated to be the spectrum signal of the second signal, and converting the estimated second spectrum signal, which is a frequency-domain signal, into a time-domain signal, thereby extracting a time-domain estimated second signal that is estimated to be the second signal; and
estimation means for estimating the first signal or the second signal by applying a Kalman filter with a colored drive source to a state space model, using the variance of the time-domain estimated second signal extracted by the extraction means, the variance of the first signal obtained by subtracting the variance of the estimated second signal from the variance of the observation signal, and the observation signals of the plurality of channels, the state space model being represented by a state equation composed only of the second signal including elements corresponding to the plurality of channels, and by an observation equation composed of the first signal and the second signal including elements corresponding to the plurality of channels.

The multi-channel signal processing apparatus according to claim 5, wherein the extraction means converts the observation signal into a frequency-domain spectrum signal for each frame of a predetermined frame length and obtains the spectral entropy for each frame; takes, where σ′ is the average of the spectral entropy over a first predetermined number of frames, σ′ when the spectral entropy of the current frame is smaller than σ′ and, when it is larger, the value σ″ = ασ′ obtained by multiplying σ′ by a predetermined coefficient α; sets the threshold σ to σ″ when the first signal has continued for the past second predetermined number of frames, to σ′ when the first signal has not continued for the past second predetermined number of frames and the second signal has continued for the past second predetermined number of frames, and to σ″ when neither the first signal nor the second signal has continued for the past second predetermined number of frames; and determines the current frame to be the first signal when the spectral entropy of the current frame is less than the threshold σ, and to be the second signal when the spectral entropy of the current frame is equal to or greater than the threshold σ.
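
The frame-by-frame decision in this claim can be illustrated with a short Python sketch. It is a sketch under stated assumptions rather than the claimed implementation: the spectral entropy is computed with a common definition (Shannon entropy of the normalized power spectrum), which may differ from the one used in the description, and the names `spectral_entropy`, `select_threshold`, `classify_frame`, `first_run`, and `second_run` are hypothetical. The average σ′ (`sigma_avg`) and the coefficient α (`alpha`) are taken as given inputs, and only the three-case threshold selection and the final comparison are covered.

```python
import numpy as np

def spectral_entropy(frame):
    """Shannon entropy of the normalized power spectrum of one frame (assumed definition)."""
    psd = np.abs(np.fft.rfft(frame)) ** 2
    total = psd.sum()
    if total == 0:
        return 0.0  # silent frame: no spectral content to measure
    p = psd / total
    p = p[p > 0]  # skip empty bins to avoid log(0)
    return float(-np.sum(p * np.log(p)))

def select_threshold(sigma_avg, alpha, first_run, second_run):
    """Pick the threshold sigma from sigma' and sigma'' = alpha * sigma'.

    first_run / second_run: True when the first / second signal has continued
    for the past second predetermined number of frames (hypothetical flags).
    """
    sigma_scaled = alpha * sigma_avg  # sigma''
    if first_run:
        return sigma_scaled
    if second_run:
        return sigma_avg
    return sigma_scaled

def classify_frame(entropy, sigma):
    """Label a frame 'first' below the threshold, otherwise 'second'."""
    return "first" if entropy < sigma else "second"
```

A frame dominated by a few spectral peaks yields a low entropy and tends to fall below the threshold, while a frame with a flat spectrum yields a high entropy and tends to be labeled as the second signal.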

A multi-channel signal processing apparatus comprising:
extraction means for converting each of the time-domain observation signals of a plurality of channels, which include a first signal contained in common in every channel and a second signal that differs for each channel, into a frequency-domain spectrum signal, extracting, based on the ratio of the spectrum signals, an estimated first spectrum signal that is estimated to be the spectrum signal of the first signal, and converting the estimated first spectrum signal, which is a frequency-domain signal, into a time-domain signal, thereby extracting a time-domain estimated first signal that is estimated to be the first signal; and
estimation means for estimating the first signal or the second signal by a computation-reduced Kalman filter with a colored drive source, using the variance of the time-domain estimated first signal extracted by the extraction means, the variance of the second signal obtained by subtracting the variance of the estimated first signal from the variance of the observation signal, and the observation signals of the plurality of channels,
wherein the computation-reduced Kalman filter with a colored drive source is a Kalman filter obtained by extracting, from the elements of the Kalman filter with a colored drive source that estimates the first signal or the second signal at time n+1 using the observation signals up to time n, only the elements relating to time n+1.

A multi-channel signal processing method comprising:
converting each of the time-domain observation signals of a plurality of channels, which include a first signal contained in common in every channel and a second signal that differs for each channel, into a frequency-domain spectrum signal, extracting, based on the ratio of the spectrum signals, an estimated first spectrum signal that is estimated to be the spectrum signal of the first signal, and converting the estimated first spectrum signal, which is a frequency-domain signal, into a time-domain signal, thereby extracting a time-domain estimated first signal that is estimated to be the first signal; and
estimating the first signal or the second signal by applying a Kalman filter with a colored drive source to a state space model, using the variance of the extracted time-domain estimated first signal, the variance of the second signal obtained by subtracting the variance of the estimated first signal from the variance of the observation signal, and the observation signals of the plurality of channels, the state space model being represented by a state equation composed only of the second signal including elements corresponding to the plurality of channels, and by an observation equation composed of the first signal and the second signal including elements corresponding to the plurality of channels.

A multi-channel signal processing method comprising:
extracting a time-domain estimated second signal that is estimated to be a second signal, based on the peak value of the autocorrelation of each of the time-domain observation signals of a plurality of channels, which include a first signal contained in common in every channel and the second signal that differs for each channel; and
estimating the first signal or the second signal by applying a Kalman filter with a colored drive source to a state space model, using the variance of the extracted time-domain estimated second signal, the variance of the first signal obtained by subtracting the variance of the estimated second signal from the variance of the observation signal, and the observation signals of the plurality of channels, the state space model being represented by a state equation composed only of the second signal including elements corresponding to the plurality of channels, and by an observation equation composed of the first signal and the second signal including elements corresponding to the plurality of channels.

A multi-channel signal processing method comprising:
converting each of the time-domain observation signals of a plurality of channels, which include a first signal contained in common in every channel and a second signal that differs for each channel, into a frequency-domain spectrum signal and calculating the power spectral density of each, extracting, based on the spectral entropy obtained from each power spectral density, an estimated second spectrum signal that is estimated to be the spectrum signal of the second signal, and converting the estimated second spectrum signal, which is a frequency-domain signal, into a time-domain signal, thereby extracting a time-domain estimated second signal that is estimated to be the second signal; and
estimating the first signal or the second signal by applying a Kalman filter with a colored drive source to a state space model, using the variance of the extracted time-domain estimated second signal, the variance of the first signal obtained by subtracting the variance of the estimated second signal from the variance of the observation signal, and the observation signals of the plurality of channels, the state space model being represented by a state equation composed only of the second signal including elements corresponding to the plurality of channels, and by an observation equation composed of the first signal and the second signal including elements corresponding to the plurality of channels.

A multi-channel signal processing program for causing a computer to function as each of the means constituting the multi-channel signal processing apparatus according to any one of claims 1 to 7.