JP2007240605A

JP2007240605A - Sound source separating method and sound source separation system using complex wavelet transformation

Info

Publication number: JP2007240605A
Application number: JP2006059516A
Authority: JP
Inventors: Satoru Hongo; 哲本郷; Yoichi Suzuki; 陽一鈴木
Original assignee: Tohoku University NUC; Institute of National Colleges of Technologies Japan
Current assignee: Tohoku University NUC; Institute of National Colleges of Technologies Japan
Priority date: 2006-03-06
Filing date: 2006-03-06
Publication date: 2007-09-20

Abstract

PROBLEM TO BE SOLVED: To provide a method and a system that extract a specified sound by estimating the direction of a sound source in an environment in which a plurality of sounds are generated.

SOLUTION: The system has a reception unit 101 that collects sound signals generated by the plurality of sound sources with left and right receivers at two different points (two channels); a transformation unit 102 that takes the two-channel signals from the reception unit 101 and divides them into logarithmic frequency bands with an octave structure (divides the two-channel signals by complex wavelet scale through a complex wavelet transform); an estimation unit 103 that calculates difference data between the two channels (the differences in argument and in level of the complex wavelet coefficients) for each logarithmic frequency band and estimates the directions of the sound sources by DOA estimation based on the difference data; a separation unit 104 that emphasizes the sound from a specified sound source direction obtained through the DOA estimation; and a database 105 in which the measurement data needed for the DOA estimation are recorded in advance. The estimation precision of the sound source direction can be improved and a target sound source can be extracted.

COPYRIGHT: (C) 2007, JPO & INPIT

Description

The present invention relates to a technique for estimating the direction of a sound source and separating that source, using a selective binaural hearing algorithm that selectively picks up a particular sound in an environment where multiple sounds are presented from different directions.

Selective binaural hearing algorithms (cocktail-party-effect algorithms), which selectively pick up a particular sound in an environment where multiple sounds are presented from different directions, have been studied in various ways from the viewpoint of realizing the human auditory mechanism.

Patent Document 1 proposes a method for estimating the directions of multiple sound sources located two-dimensionally (left-right and up-down) in an environment where multiple sounds are generated, using a frequency domain binaural model (FDBM). Acoustic signals generated by multiple sound sources are input through left and right sound receiving units, and the left and right input signals are divided into frequency bands. The interaural phase difference (IPD) for each frequency band is obtained from the cross spectrum of the two input signals, and the interaural level difference (ILD) from the level difference of their power spectra. The IPD/ILD obtained for each band across all frequency bands is compared with the values in a database to obtain sound source direction candidates for each band, and the direction that appears most frequently among these candidates is taken as the estimated sound source direction.

On the other hand, human auditory processing is known to operate in units of 1/4 to 1/6 octave on a logarithmic frequency axis, and the wavelet transform is well suited to analyzing such an octave structure. With the approach of Patent Document 1, which Fourier-transforms onto the frequency axis, the analysis results are spaced linearly along the frequency axis. For example, a 512-point frequency analysis at a sampling frequency of 16 kHz yields results at intervals of 16 kHz / 512 = 31.25 Hz. Because the human auditory mechanism has exponential frequency resolution, 31.25 Hz is a coarse resolution in the low-frequency region around 100 Hz, whereas in the high-frequency region around 4 kHz a difference of 31.25 Hz is hard to distinguish by ear. In other words, a linear analysis method such as the Fourier transform carries redundant data in the high-frequency region and lacks resolution in the low-frequency region.
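The resolution argument above can be checked with a few lines of arithmetic. The following is a minimal sketch, not part of the patent: the function names are hypothetical, and the "1/4-octave step" is taken here to mean the distance from one band center to the next on a 1/4-octave grid, which is an assumption about how the comparison is meant.

```python
def fft_bin_spacing(fs_hz: float, n_points: int) -> float:
    """Uniform spacing between FFT analysis frequencies, in Hz."""
    return fs_hz / n_points


def log_band_spacing(center_hz: float, bands_per_octave: int) -> float:
    """Distance in Hz from one band center to the next on a 1/N-octave grid."""
    return center_hz * (2.0 ** (1.0 / bands_per_octave) - 1.0)


if __name__ == "__main__":
    linear = fft_bin_spacing(16_000, 512)  # 31.25 Hz, as in the text
    for center in (100.0, 4000.0):
        quarter = log_band_spacing(center, 4)
        print(f"{center:7.1f} Hz: FFT step {linear:.2f} Hz, "
              f"1/4-octave step {quarter:.2f} Hz")
```

Near 100 Hz the fixed FFT step is wider than the 1/4-octave step, and near 4 kHz it is far narrower, which is the mismatch the paragraph describes.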

Compared with the Fourier transform, which divides the frequency axis linearly, the wavelet transform, whose frequency-axis decomposition has an octave structure, is consistent with the human auditory filter, and Non-Patent Document 1 reports that it improves quality.

Patent Document 1: JP 2004-325284 A, "Method for estimating sound source direction, system therefor, method for separating a plurality of sound sources, and system therefor"
Non-Patent Document 1: Nishimura et al., "Speech Enhancement by Spectral Subtraction Using Wavelet Transform", IEICE Transactions, J79-A(12), pp. 1986-1993 (1996)

However, because Non-Patent Document 1 separates speech signals on the basis of the real wavelet transform, it basically cannot perform DOA (direction of arrival) estimation, and the spectral distribution of the sound to be extracted must be known in advance. For example, in an application that separates speech in an environment where speech from 30 degrees to the right is mixed with speech from 30 degrees to the left, the method of Non-Patent Document 1 requires the spectrum of the speech itself to be supplied beforehand.

To solve the above problem, the present invention focuses on the complex wavelet transform, which preserves the time-difference information (argument) needed for DOA estimation, and aims to provide a method that performs DOA estimation using complex wavelet coefficients and separates sounds on the basis of the result. The method uses a continuous complex wavelet transform or complex wavelet packet analysis, whose resolution is finer than one octave, to perform a division corresponding to a logarithmic frequency division of about 1/4 to 1/6 octave (division by complex wavelet scale). Because the DOA is estimated using the wavelet transform structure and speech is extracted using that estimate, the spectrum of the speech to be extracted need not be known in advance; speech can be extracted by supplying only the direction of the desired sound source.

To achieve the above object, the sound source separation method according to claim 1 comprises: a reception process of recording acoustic signals generated by multiple sound sources with left and right receivers at two different points (two channels); a transformation process of dividing the two-channel signals into logarithmic frequency bands with an octave structure; an estimation process of calculating difference data between the two channels from the divided data and estimating the direction of a sound source by DOA (direction of arrival) estimation based on the difference data; and a separation process of emphasizing the sound from a specific sound source direction obtained by the estimation.

In the transformation process according to claim 2, the two-channel signals are each subjected to a complex wavelet transform and thereby divided by complex wavelet scale, and the argument and level of the complex wavelet coefficients are calculated.

In the estimation process according to claim 3, the argument difference and level difference of the complex wavelet coefficients are calculated for each complex wavelet scale as the difference data between the two channels; the calculated argument difference and level difference are compared with measurement data recorded in advance in a database; and the direction in the database that gives the closest argument difference and level difference is taken as the DOA estimate.

The measurement data according to claim 4 are recorded in advance as data representing, for each complex wavelet scale, the relationship between the direction angle and the argument difference and level difference of the complex wavelet coefficients, these differences being calculated by a complex wavelet transform from signal data obtained by recording acoustic signals generated from various angles at two different points (two channels).

In the separation process according to claim 5, based on the DOA estimate obtained by the estimation, the complex wavelet coefficients oriented toward the sound source direction are emphasized and the other coefficients are suppressed to reconstruct the target sound waveform, and the inverse wavelet transform is used for the reconstruction.

The sound source separation system according to claim 6 comprises: reception means for recording acoustic signals generated by multiple sound sources with left and right receivers at two different points (two channels); transformation means for dividing the two-channel signals into logarithmic frequency bands with an octave structure; estimation means for calculating difference data between the two channels from the divided data and estimating the direction of a sound source by DOA estimation based on the difference data; and separation means for emphasizing the sound from a specific sound source direction obtained by the estimation.

The transformation means according to claim 7 subjects each of the two-channel signals to a complex wavelet transform, thereby dividing them by complex wavelet scale, and calculates the argument and level of the complex wavelet coefficients.

The estimation means according to claim 8 calculates, for each complex wavelet scale, the argument difference and level difference of the complex wavelet coefficients as the difference data between the two channels, compares the calculated argument difference and level difference with measurement data recorded in advance in a database, and takes the direction in the database that gives the closest argument difference and level difference as the DOA estimate.

In the database according to claim 9, data representing, for each complex wavelet scale, the relationship between the direction angle and the argument difference and level difference of the complex wavelet coefficients are recorded in advance, these differences being calculated by a complex wavelet transform from signal data obtained by recording acoustic signals generated from various angles at two different points (two channels).

The separation means according to claim 10, based on the DOA estimate obtained by the estimation, emphasizes the complex wavelet coefficients oriented toward the sound source direction and suppresses the other coefficients to reconstruct the target sound waveform, and uses the inverse wavelet transform for the reconstruction.

According to the invention of claim 1 or claim 6, it is possible to estimate the direction of a sound source and to extract a specific sound in an environment where multiple sounds are generated, with better accuracy than conventional methods. That is, dividing the acoustic signal into logarithmic frequency bands with an octave structure reflects the fact that human auditory processing operates on a logarithmic frequency axis, and it improves the accuracy with which the sound source direction is estimated by DOA estimation.

According to the invention of claim 2 or claim 7, the analysis uses the complex wavelet transform, exploiting the fact that a wavelet transform whose frequency-axis decomposition has an octave structure is consistent with the human auditory filter. This makes it possible to store a database for DOA estimation that has the least waste from the standpoint of human hearing. In addition, whereas DOA estimation is basically impossible when a real wavelet transform is used for the analysis, the complex wavelet transform makes DOA estimation possible through the argument of the complex wavelet coefficients.
With the conventional approach of Fourier-transforming onto the frequency axis, the analysis results are spaced linearly along the frequency axis. For example, a 512-point frequency analysis at a sampling frequency of 16 kHz yields results at intervals of 16 kHz / 512 = 31.25 Hz. Because the human auditory mechanism has exponential frequency resolution, 31.25 Hz is a coarse resolution in the low-frequency region around 100 Hz, whereas in the high-frequency region around 4 kHz a difference of 31.25 Hz is hard to distinguish by ear. In other words, a linear analysis method such as the Fourier transform carries redundant data in the high-frequency region and lacks resolution in the low-frequency region.
By contrast, with the complex wavelet transform, if the database is built in 1/4-octave steps over the band from 63 Hz to 8000 Hz, for example, the resolution at low frequencies improves to about 15 Hz, while the resolution at high frequencies becomes sparser as the frequency rises, which eliminates the redundancy. That is, the present invention has the advantage that redundant data need not be stored in the high-frequency region and that resolution improves in the low-frequency region.

According to the inventions of claims 3 and 4 or claims 8 and 9, the amount of data stored in the database for DOA estimation is smaller than before. This makes it possible to build an efficient, small-capacity database, and it speeds up the search for the closest values in the database during DOA estimation.
The reduction in database size is explained below with a concrete example.
In the conventional method, a 512-point frequency analysis at a sampling frequency of 16 kHz, for example, yields results at intervals of 16 kHz / 512 = 31.25 Hz. With a resolution of 31.25 Hz, the database therefore needs 512 × 2 numerical values for each angular direction.
In the present invention, by contrast, if the database is built in 1/4-octave steps over the band from 63 Hz to 8000 Hz, for example, the resolution at low frequencies improves to about 15 Hz, while the resolution at high frequencies becomes sparser as the frequency rises, which eliminates the redundancy. The database then needs only 28 × 2 values for each angular direction. Even with 1/6-octave resolution, only 42 × 2 values are needed, so the database capacity can be reduced to 1/10 or less of that of the conventional method.
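The band counts quoted above can be reproduced with a short calculation. This is a sketch under an assumption that the patent does not state explicitly: band centers are placed at 63 Hz × 2^(k/N) and counted while they do not exceed 8000 Hz. The function names are hypothetical.

```python
def count_bands(f_low_hz: float, f_high_hz: float, bands_per_octave: int) -> int:
    """Count band centers f_low * 2**(k / N) that do not exceed f_high."""
    k = 0
    while f_low_hz * 2.0 ** (k / bands_per_octave) <= f_high_hz:
        k += 1
    return k


def entries_per_direction(n_bands: int) -> int:
    """Two stored values per band: level difference and argument difference."""
    return 2 * n_bands


if __name__ == "__main__":
    fft_entries = entries_per_direction(512)  # conventional method: 512 x 2
    for n in (4, 6):
        bands = count_bands(63.0, 8000.0, n)
        ratio = entries_per_direction(bands) / fft_entries
        print(f"1/{n} octave: {bands} bands, {ratio:.3f} of the FFT database size")
```

Under this assumption the counts come out as 28 and 42 bands, and both database sizes stay below one tenth of the 512 × 2 baseline.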

According to the invention of claim 5 or claim 10, the complex wavelet coefficients oriented toward the sound source direction are emphasized on the basis of the DOA estimate, the other coefficients are suppressed, and the target sound waveform is reconstructed (by the inverse wavelet transform). This emphasizes the sound from the specific sound source direction obtained by DOA estimation and makes it possible to extract the target sound source.

Next, a sound source separation system according to an embodiment of the present invention is described with reference to the drawings. The present invention is not limited by this embodiment.

FIG. 1 is a block diagram showing the configuration of the sound source separation system according to the embodiment of the present invention. As shown in FIG. 1, the sound source separation system has a reception unit 101 that records acoustic signals generated by multiple sound sources with left and right receivers at two different points (two channels); a transformation unit 102 that takes the two-channel signals from the reception unit 101 and divides them into logarithmic frequency bands with an octave structure; an estimation unit 103 that calculates difference data between the two channels for each logarithmic frequency band and estimates the direction of a sound source by DOA estimation based on the difference data; a separation unit 104 that emphasizes the sound from a specific sound source direction obtained by the DOA estimation; and a database 105 in which the measurement data needed for the DOA estimation are recorded in advance.

The reception unit 101 has one microphone on the left and one on the right (two in total) and records acoustic signals generated by multiple sound sources at these two points (two channels). The recorded two-channel acoustic signals are each converted into electrical signal data and passed to the transformation unit 102.

The transformation unit 102 divides each of the two-channel signal data passed from the reception unit 101 into logarithmic frequency bands with an octave structure. As the dividing means, it uses a continuous complex wavelet transform or complex wavelet packet analysis, whose resolution is finer than one octave, to perform a division corresponding to a logarithmic frequency division of, for example, about 1/4 to 1/6 octave (division by complex wavelet scale). That is, the complex wavelet transform divides the two-channel signal data by complex wavelet scale, and the "argument" and "level" of the complex wavelet coefficients are calculated. The calculated "argument" and "level" data for the two channels are passed to the estimation unit 103.

Furthermore, because the transformation unit 102 uses the complex wavelet transform for the analysis, exploiting the fact that a wavelet transform whose frequency-axis decomposition has an octave structure is consistent with the human auditory filter, it becomes possible to store a database for DOA estimation that has the least waste from the standpoint of human hearing.

The estimation unit 103 calculates, for each complex wavelet scale, the "argument difference" and "level difference" as difference data from the two-channel "argument" and "level" data passed from the transformation unit 102. It compares the calculated "argument difference" and "level difference" with the measurement data recorded in advance in the database 105 and takes the direction in the database that gives the closest "argument difference" and "level difference" as the DOA estimate.
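The database lookup performed by the estimation unit can be sketched as a nearest-neighbor search over the stored directions for one scale. The patent does not say how "closest" is measured, so the weighted squared distance below, the phase wrapping, the weights, and the toy database values are all assumptions made for illustration.

```python
import math


def wrap_angle(x: float) -> float:
    """Wrap an angle in radians to the interval (-pi, pi]."""
    return math.atan2(math.sin(x), math.cos(x))


def estimate_doa(ild_db, iad_rad, entries, level_weight=1.0, arg_weight=1.0):
    """Return the stored direction whose (level, argument) differences are closest.

    entries: iterable of (angle_deg, ref_ild_db, ref_iad_rad) for ONE scale.
    The distance measure and weights are hypothetical; the source only says "closest".
    """
    best_angle, best_cost = None, math.inf
    for angle_deg, ref_ild, ref_iad in entries:
        cost = (level_weight * (ild_db - ref_ild) ** 2
                + arg_weight * wrap_angle(iad_rad - ref_iad) ** 2)
        if cost < best_cost:
            best_angle, best_cost = angle_deg, cost
    return best_angle


# Toy database for a single scale (made-up numbers, not measured data).
TOY_ENTRIES = [(-30, -4.0, -0.6), (0, 0.0, 0.0), (30, 4.0, 0.6)]

if __name__ == "__main__":
    print(estimate_doa(3.5, 0.5, TOY_ENTRIES))
```

In the system described, one such lookup would run per complex wavelet scale, giving one DOA estimate per scale and frame.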

In the database 105, data representing, for each complex wavelet scale, the relationship between the direction angle and the "argument difference" and "level difference" of the complex wavelet coefficients are recorded in advance. These differences are calculated by a complex wavelet transform from signal data obtained by recording acoustic signals generated from various angles at two different points (two channels).

Based on the DOA estimate obtained by the estimation, the separation unit 104 emphasizes the complex wavelet coefficients oriented toward the sound source direction and suppresses the other coefficients to reconstruct the target sound waveform. The inverse wavelet transform is used as the means of reconstructing the target sound waveform.

Next, the transformation unit 102 is described in detail using the complex Gaussian wavelet and the continuous wavelet transform, for which localization in frequency and time is clear and which allow a basic study.
The complex Gaussian continuous wavelet transform is given by the following equation (Non-Patent Documents 2 and 3):

[Equation (1)]

where

[Equation]

and f(t), C_p, a, b, p, t, and j are the signal under analysis, a normalization coefficient, the scale, the shift, the order, continuous time, and the imaginary unit, respectively. The symbol * denotes the complex conjugate and (p) the p-th derivative. WT in FIG. 1 is computed by equation (1), and a is set so that the analysis bandwidth is a 1/4-octave band.
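The equations themselves do not survive in this text. As a hedged reconstruction, the standard form of the continuous wavelet transform with a complex Gaussian wavelet, written with the symbols listed above, is sketched below. It is an assumption that the patent's equation (1) takes exactly this form, in particular the 1/√a normalization.

```latex
% Assumed standard form; not copied from the patent.
W(b,a) = \frac{1}{\sqrt{a}} \int_{-\infty}^{\infty} f(t)\,
         \psi^{*}\!\left(\frac{t-b}{a}\right) dt \tag{1}

\psi(t) = C_p \,\frac{d^{p}}{dt^{p}}\!\left( e^{-jt}\, e^{-t^{2}} \right)
```

Here ψ is the complex Gaussian mother wavelet of order p, and C_p normalizes it to unit energy.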

Using equation (1), the coefficient W_L of the left signal and the coefficient W_R of the right signal are obtained, and the level difference and argument difference between the two channels are calculated.
The inter-channel level difference D_ILD and argument difference D_IAD are given as follows:

[Equation]

where * denotes the complex conjugate.
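Because the defining equation is missing here, the following sketch uses the conventional definitions: the level difference as the ratio of coefficient magnitudes in decibels, and the argument difference as the phase of W_L multiplied by the complex conjugate of W_R. The sign convention (left relative to right) and the function names are assumptions.

```python
import cmath
import math


def level_difference_db(w_left: complex, w_right: complex) -> float:
    """Level difference in dB between two complex wavelet coefficients."""
    return 20.0 * math.log10(abs(w_left) / abs(w_right))


def argument_difference(w_left: complex, w_right: complex) -> float:
    """Phase of W_L * conj(W_R), in radians, within (-pi, pi]."""
    return cmath.phase(w_left * w_right.conjugate())


if __name__ == "__main__":
    w_l = cmath.rect(2.0, 0.5)   # magnitude 2, phase 0.5 rad
    w_r = cmath.rect(1.0, 0.2)   # magnitude 1, phase 0.2 rad
    print(level_difference_db(w_l, w_r), argument_difference(w_l, w_r))
```

Multiplying by the conjugate yields the phase difference directly and keeps it wrapped, which avoids subtracting two separately wrapped phases.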

Next, the separation unit 104 is described in detail using the inverse wavelet transform.
Suppose the complex coefficient W(b,a) has been obtained from the original signal f(t) by the transform (1). The component of f(t) corresponding to time shift b and scale a,

[Symbol]

can then be expressed as follows:

[Equation]

Mathematically, the original signal is restored by the following equation:

[Equation (3)]

where C is a coefficient that corrects the amplitude. Equation (3) is the mathematical analytic solution. In practice, a waveform that approximates the original signal well enough for practical use is obtained by setting the required frequency band and the time interval to be separated and computing as follows:

[Equation (3)']

Equation (3)' reproduces the original signal. For this equation, it is decided for each time shift b and scale a whether the component comes from the desired direction, and speech components from undesired directions are attenuated. Let α be the factor by which a coefficient is attenuated, taking a value from 0 to 1 according to the difference between the desired angle and the estimated angle. The speech with undesired directions suppressed can then be expressed as follows:

[Equation]

The functional form of α depends on the purpose. For example, if δ is the angular difference between the estimated DOA and the direction of the sound source to be extracted, setting

[Equation]

reduces the coefficient by 20 dB for every 10 degrees of deviation.
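One gain function consistent with "20 dB less for every 10 degrees of deviation" is α(δ) = 10^(−|δ|/10), since 20·log10(α) = −2|δ| dB. The sketch below generalizes this to an arbitrary slope; the exact form used in the patent is not recoverable from this text, so treat the formula and the function names as assumptions.

```python
def directional_gain(delta_deg: float, db_per_10_deg: float = 20.0) -> float:
    """Amplitude gain in (0, 1] for a coefficient whose DOA is off by delta_deg."""
    attenuation_db = db_per_10_deg * abs(delta_deg) / 10.0
    return 10.0 ** (-attenuation_db / 20.0)


def suppress(coeffs, doa_estimates_deg, target_deg, db_per_10_deg=20.0):
    """Scale each complex coefficient by the gain for its estimated DOA."""
    return [w * directional_gain(doa - target_deg, db_per_10_deg)
            for w, doa in zip(coeffs, doa_estimates_deg)]


if __name__ == "__main__":
    print(directional_gain(10.0))  # 20 dB down at 10 degrees off target
    print(suppress([1 + 1j, 2 + 0j], [30, -30], target_deg=30))
```

The weighted coefficients would then go through the inverse wavelet transform to rebuild the waveform; that step is not shown here.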

Next, a practical example from the experiment conducted here is shown. For example, when the scale a is computed for 1/4-octave analysis of digital data sampled at 16 kHz, the values are as shown in Table 1, and the time shift b can be realized by using a frame length that causes no perceptual discomfort (256 points at 16 kHz, that is, every 0.016 seconds).

[Table 1]

The table above can be computed as follows. The relationship between the center frequency and the scale a is known to be given analytically by

[Equation]

where a is the scale, Δ is the sampling period, F_C is the center frequency of the mother wavelet in hertz, and F_a is the frequency corresponding to scale a in hertz. In the table above, for Δ = 1/16000, F_C = 1.929375. That is, once the center frequency of the mother wavelet, the sampling frequency, and the analysis width on the logarithmic frequency axis are determined, the scale values to be used can be calculated.
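The symbol definitions above match the commonly used scale-to-frequency relation F_a = F_C / (a·Δ). The sketch below assumes that relation, since the equation itself is missing from this text. The function names are hypothetical, and the mother-wavelet center frequency is left as a caller-supplied parameter rather than taken from the patent.

```python
def scale_for_frequency(freq_hz: float, sample_period_s: float,
                        wavelet_center_freq: float) -> float:
    """Scale a such that F_a = F_C / (a * delta) equals freq_hz (assumed relation)."""
    return wavelet_center_freq / (freq_hz * sample_period_s)


def frequency_for_scale(scale: float, sample_period_s: float,
                        wavelet_center_freq: float) -> float:
    """Frequency F_a associated with a given scale (same assumed relation)."""
    return wavelet_center_freq / (scale * sample_period_s)


def octave_grid_scales(f_low_hz, n_bands, sample_period_s,
                       wavelet_center_freq, bands_per_octave=4):
    """Scales for band centers f_low * 2**(k / N), k = 0 .. n_bands - 1."""
    return [scale_for_frequency(f_low_hz * 2.0 ** (k / bands_per_octave),
                                sample_period_s, wavelet_center_freq)
            for k in range(n_bands)]


if __name__ == "__main__":
    # 0.5 is an illustrative center frequency, NOT a value from the patent.
    for a in octave_grid_scales(63.0, 4, 1.0 / 16000.0, 0.5):
        print(round(a, 3))
```

Because frequency and scale are inversely proportional, a 1/4-octave frequency grid corresponds to scales that shrink by a constant factor of 2^(1/4) from one band to the next.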

Next, the DOA estimation and sound source separation experiments and their evaluation are described.
A database was created using as input head-related transfer functions (impulse responses) measured at fs = 44.1 kHz in the anechoic chamber of the Research Institute of Electrical Communication, Tohoku University. The frontal 180-degree horizontal plane was analyzed every 10 degrees, for a total of 19 directions, in 1/4-octave bands from 74.33 Hz to 8000 Hz, and the level differences and argument differences were stored. The number of bands is 28. Part of the database is shown in FIG. 2. Both the level difference and the argument difference can be seen to vary with the DOA.

Next, DOA estimation and sound extraction are described. A single-channel target sound and a single-channel interfering sound, both recorded at a sampling frequency of 44.1 kHz, were convolved with the head-related transfer functions described above to synthesize, as shown in FIG. 3, a target sound arriving from 30° and an interfering sound arriving from −30°. The result was input to the system with a frame length of 1024. The same calculation as in the creation of the database was performed, and the angle in the DOA database whose level difference and argument difference are closest to the obtained values was taken as the DOA estimate. A DOA estimate is obtained for each transform scale.

To obtain the separated sound, the direction of the target sound was given to the system. For scales in which this direction and the estimated DOA matched, the amplitude was retained; for scales in which they differed, the amplitude of the coefficient was attenuated by 2.5 dB for every 10 degrees of difference from the estimated DOA. The analysis wavelet was convolved with the coefficients corrected in this way to obtain the reconstructed separated sound. FIG. 4 shows the DOA estimation results and the sound waveforms. Panels (e) and (f) of FIG. 4 show, for each frame, the number of the 28 scales assigned to each DOA, with the count indicated by shading density. When there was no interfering sound, stable DOA estimation was obtained (FIG. 4(f)). For the mixed sound, the estimated DOA moved toward the direction of the target sound in frames where the target sound was dominant (FIG. 4(e)).
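The attenuation rule can be written as a per-scale gain. This is a minimal sketch assuming that the 2.5 dB figure is an amplitude level (20·log10) and that the attenuation grows linearly with the angular difference; the function names are hypothetical.

```python
def scale_gain(estimated_doa_deg, target_doa_deg, db_per_10deg=2.5):
    """Linear gain for one scale: 2.5 dB of attenuation per 10 degrees of
    difference between the estimated DOA and the target direction."""
    diff_deg = abs(estimated_doa_deg - target_doa_deg)
    attenuation_db = db_per_10deg * diff_deg / 10.0
    # Assumes amplitude decibels, i.e. gain = 10^(-dB / 20).
    return 10.0 ** (-attenuation_db / 20.0)


def apply_direction_mask(coeffs, estimated_doas_deg, target_doa_deg):
    """Scale each complex wavelet coefficient by the gain for its scale."""
    return [c * scale_gain(d, target_doa_deg)
            for c, d in zip(coeffs, estimated_doas_deg)]


# Two scales: one estimated at the target direction, one at the interferer.
coeffs = [1.0 + 1.0j, 2.0 + 0.0j]
masked = apply_direction_mask(coeffs, [30, -30], target_doa_deg=30)
print(masked)
```

With the target at 30° and a scale estimated at −30°, the 60° difference corresponds to 15 dB of attenuation under these assumptions, while a matching scale passes unchanged.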

The above results of the DOA estimation and sound source separation experiments show that the sound source separation method using the complex wavelet transform is effective for estimating the direction of a sound source by DOA estimation and extracting the target sound source.

C. K. Chui, "An Introduction to Wavelets", Academic Press (1992) (Japanese translation: "Introduction to Wavelets", Sakurai and Arai, Tokyo Denki University Press (1993))
Toda, Kawabata, Akira, C MAGAZINE, 2003, 6, p. 34

FIG. 1 is a block diagram showing the configuration of a sound source separation system according to an embodiment of the present invention.
FIG. 2 is a diagram showing a part of the database used in the sound source separation system according to the embodiment of the present invention.
FIG. 3 is a diagram showing the experimental environment in which sounds from a target sound at 30° and an interfering sound at −30° are synthesized.
FIG. 4 is a diagram showing the DOA estimation results and sound waveforms.

Explanation of symbols

101 receiving unit
102 converting unit
103 estimating unit
104 separating unit
105 database

Claims

1. A sound source separation method comprising: a reception process of recording acoustic signals generated from a plurality of sound sources at two different points (two channels) with left and right receivers; a conversion process of dividing the two-channel signals into logarithmic frequency bands of an octave structure; an estimation process of calculating difference data between the two channels from the divided data and estimating the direction of a sound source using DOA (Direction of Arrival) estimation based on the difference data; and a separation process of emphasizing the sound from the specific sound source direction obtained by the estimation.

2. The sound source separation method according to claim 1, wherein the conversion process divides the two-channel signals into complex wavelet scales by performing a complex wavelet transform, and calculates the argument and level of the complex wavelet coefficients.

3. The sound source separation method according to claim 1, wherein the estimation process calculates, for each complex wavelet scale, the argument difference and level difference of the complex wavelet coefficients as the difference data between the two channels, compares the calculated argument difference and level difference with measurement data recorded in advance in a database, and takes the direction in the database that gives the closest argument difference and level difference as the DOA estimate.

4. The sound source separation method according to claim 3, wherein the measurement data are obtained by using the complex wavelet transform to calculate the argument difference and level difference of the complex wavelet coefficients from signal data of acoustic signals generated from various angles and recorded at two different points (two channels), and are recorded in advance as data representing, for each complex wavelet scale, the relationship between the argument difference and level difference and the direction angle.

5. The sound source separation method according to claim 1, wherein the separation process emphasizes the complex wavelet coefficients corresponding to the sound source direction based on the DOA estimate obtained by the estimation, suppresses the other coefficients, and reconstructs the target sound waveform, an inverse wavelet transform being used to reconstruct the target sound waveform.

6. A sound source separation system comprising: receiving means for recording acoustic signals generated from a plurality of sound sources at two different points (two channels) with left and right receivers; conversion means for dividing the two-channel signals into logarithmic frequency bands of an octave structure; estimation means for calculating difference data between the two channels from the divided data and estimating the direction of a sound source using DOA estimation based on the difference data; and separation means for emphasizing the sound from the specific sound source direction obtained by the estimation.

7. The sound source separation system according to claim 6, wherein the conversion means divides the two-channel signals into complex wavelet scales by performing a complex wavelet transform, and calculates the argument and level of the complex wavelet coefficients.

8. The sound source separation system according to claim 6, wherein the estimation means calculates, for each complex wavelet scale, the argument difference and level difference of the complex wavelet coefficients as the difference data between the two channels, compares the calculated argument difference and level difference with measurement data recorded in advance in a database, and takes the direction in the database that gives the closest argument difference and level difference as the DOA estimate.

9. The sound source separation system according to claim 8, wherein the database records in advance, for each complex wavelet scale, data representing the relationship between the argument difference and level difference and the direction angle, the data being obtained by using the complex wavelet transform to calculate the argument difference and level difference of the complex wavelet coefficients from signal data of acoustic signals generated from various angles and recorded at two different points (two channels).

10. The sound source separation system according to claim 6, wherein the separation means emphasizes the complex wavelet coefficients corresponding to the sound source direction based on the DOA estimate obtained by the estimation, suppresses the other coefficients, and reconstructs the target sound waveform, an inverse wavelet transform being used to reconstruct the target sound waveform.