KR20110089782A

KR20110089782A - Target speech enhancement method based on degenerate unmixing and estimation technique

Info

Publication number: KR20110089782A
Application number: KR1020100009326A
Authority: KR
Inventors: 박형민
Original assignee: 서강대학교산학협력단
Priority date: 2010-02-01
Filing date: 2010-02-01
Publication date: 2011-08-09
Also published as: KR101161248B1

Abstract

PURPOSE: A target speech enhancement method based on the degenerate unmixing and estimation technique (DUET) is provided that suits real applications because it requires neither the number of sound source signals in advance nor parameter estimates for the noise sources. CONSTITUTION: A first channel signal and a second channel signal are converted into time-frequency functions (200). The parameters of the first and second channel signals are estimated, and a histogram of the parameters is generated (210). Initial values of the parameters for the sound source of interest are set from the histogram (220).

Description

Target speech enhancement method based on degenerate unmixing and estimation technique

The present invention relates to a method of enhancing a sound source of interest based on the degenerate unmixing and estimation technique (DUET). More specifically, it relates to a DUET-based method that estimates only the parameters of the sound source of interest, rather than the parameters of all sound sources, so that it can be used in real applications and enhances the sound source of interest more efficiently.

Separating the individual source signals from an acoustic signal in which several sources are mixed is called blind source separation or blind signal separation (BSS). "Blind" means that no information is available about the original source signals or about the mixing environment. The final step of separating the signals is called demixing or unmixing. A representative BSS technique that operates in the time-frequency domain is DUET.

DUET is briefly described below. FIG. 1 is a diagram illustrating the model of signals from a plurality of sound sources that are mixed and input to two sensors, and FIG. 2 is a block diagram schematically illustrating the general DUET method.

Referring to FIGS. 1 and 2, when the signals of N sound sources, s₁(t), s₂(t), ..., s_N(t), are mixed and input to two separate sensors, DUET models the signals x₁(t) and x₂(t) received at the sensors as in Equation 1.

Here, x₁(t) is the first channel signal input to the first sensor, x₂(t) is the second channel signal input to the second sensor, s_j(t) is the j-th sound source signal, a_j is the attenuation variable representing the level difference between the first and second channel signals for the j-th source, δ_j is the time delay variable representing the difference in arrival time between the first and second channel signals for the j-th source, and n₁(t) and n₂(t) are the noise added to each channel. FIG. 3 illustrates the level difference and time difference between signals that originate from the same source and reach the two sensors. When sound arrives from a particular direction as in FIG. 3, the difference in path length produces the level difference and delay of Equation 1. DUET builds on this stereo signal model and simplifies it by assuming that there is no noise, that is, n₁(t) = n₂(t) = 0.
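For reference, the standard anechoic DUET mixing model consistent with these symbol definitions can be written as below. This is a sketch of the usual textbook form, not a reproduction of the patent's Equation 1, and the summation range j = 1, ..., N is an assumption.

```latex
\begin{aligned}
x_1(t) &= \sum_{j=1}^{N} s_j(t) + n_1(t), \\
x_2(t) &= \sum_{j=1}^{N} a_j\, s_j(t - \delta_j) + n_2(t).
\end{aligned}
```

With the noise terms dropped, each source reaches the second sensor as a scaled and delayed copy of what reaches the first.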

First, a windowed Fourier transform is applied to the sensor inputs x₁(t) and x₂(t) as in Equation 2 (100). The windowed Fourier transform converts a signal in time (t) into a signal in time frame (τ) and frequency (ω), which yields the frequency components of each time frame.
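The transform step can be sketched as follows. This is a minimal illustration, assuming a periodic Hann window, a frame length of 8 samples, and a hop of 4 samples; none of these values, nor the function name `stft`, come from the patent. A direct DFT is used for readability; a practical implementation would use an FFT routine.

```python
import cmath
import math


def stft(signal, frame_len=8, hop=4):
    """Return X[frame][bin]: the DFT of each Hann-windowed, overlapping frame."""
    # Periodic Hann window (an assumed choice of analysis window).
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_len)
              for n in range(frame_len)]
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        chunk = [signal[start + n] * window[n] for n in range(frame_len)]
        # Keep only the non-negative frequency bins 0 .. frame_len / 2.
        spectrum = [
            sum(chunk[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                for n in range(frame_len))
            for k in range(frame_len // 2 + 1)
        ]
        frames.append(spectrum)
    return frames


# Toy input: a cosine that sits exactly on frequency bin 2.
tone = [math.cos(2 * math.pi * 2 * n / 8) for n in range(32)]
X = stft(tone)
```

Each entry `X[tau][k]` plays the role of X_j(ω,τ) for one sensor; the same function would be applied to both channels.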

Here, S_i(ω,τ) is the time-frequency representation of the i-th source signal, and X_j(ω,τ) is the time-frequency representation of the signal input to the j-th sensor.

Next, DUET assumes that the signals before mixing are W-disjoint orthogonal (WDO). This is the assumption that the sources do not overlap anywhere in the time-frequency domain before they are mixed: if i ≠ j, the windowed Fourier transforms of s_i(t) and s_j(t) have disjoint supports. In other words, the time-frequency components in which each source dominates do not overlap, so every time-frequency component of the mixture is associated with only one source. The WDO assumption is expressed as in Equation 3. It does not hold exactly for real acoustic signals, but studies have shown that it holds to a good approximation for speech signals.
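In the DUET literature the WDO condition is usually stated as a pointwise product of the source transforms. The form below is that standard statement, given as an assumption about what Equation 3 expresses.

```latex
S_i(\omega,\tau)\, S_j(\omega,\tau) = 0, \qquad \forall\, \omega, \tau, \quad i \neq j
```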

The attenuation variable (a) and the time delay variable (δ) are obtained by applying the WDO assumption described above. Applying it to the sensor signals of Equation 2, the sensor signals at a particular time frame (τ_n) and a particular frequency (ω_m) reduce to Equation 4, because under the assumption every source signal except one particular j-th source vanishes at that point.

From the simultaneous equations of Equation 4, the attenuation variable (a_j), which represents the relative magnitude between channels, and the time delay variable (δ_j), which represents the relative time delay between channels, are obtained for the j-th source signal as in Equation 5.
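A minimal sketch of this estimate, assuming the noise-free model in which, at a point dominated by source j, the second channel equals the first multiplied by a_j·exp(−iωδ_j). Under that assumption the magnitude of the channel ratio gives the attenuation and its phase gives the delay. The function name `estimate_params` and the toy values are illustrative and not taken from the patent.

```python
import cmath


def estimate_params(x1, x2, omega):
    """Estimate (a, delta) at one time-frequency point.

    x1, x2: complex transform values of the two channels at that point.
    omega:  angular frequency of the bin (must be nonzero).
    """
    ratio = x2 / x1
    a = abs(ratio)                       # relative magnitude between channels
    delta = -cmath.phase(ratio) / omega  # relative delay, valid while |omega * delta| < pi
    return a, delta


# Toy point: one source with a = 0.8 and delta = 2.0 at omega = 0.5.
omega = 0.5
x1 = 1.0 + 2.0j
x2 = 0.8 * cmath.exp(-1j * omega * 2.0) * x1
a_hat, d_hat = estimate_params(x1, x2, omega)
```

The phase is only known modulo 2π, so the delay estimate is unambiguous only while |ωδ| stays below π.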

An energy histogram is built over all time-frequency components from the attenuation and time delay variables estimated with Equation 5 (110), and the locations of its peaks are used as the initial values of the attenuation and time delay variables of all sound sources (120).
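The histogram step can be sketched as below. The grid size, the value ranges, and the names `energy_histogram` and `highest_peak` are assumptions made for illustration, and taking the single largest cell is a simplification of real peak picking. Passing a weight of 1 for every point gives a plain count histogram instead of an energy-weighted one.

```python
def energy_histogram(points, a_range, d_range, n_bins):
    """Accumulate the weights of (a, delta, weight) points on an n_bins x n_bins grid."""
    a_lo, a_hi = a_range
    d_lo, d_hi = d_range
    hist = [[0.0] * n_bins for _ in range(n_bins)]
    for a, d, weight in points:
        if not (a_lo <= a < a_hi and d_lo <= d < d_hi):
            continue  # skip estimates outside the histogram range
        i = min(int((a - a_lo) / (a_hi - a_lo) * n_bins), n_bins - 1)
        j = min(int((d - d_lo) / (d_hi - d_lo) * n_bins), n_bins - 1)
        hist[i][j] += weight
    return hist


def highest_peak(hist, a_range, d_range):
    """Return the (a, delta) bin centre of the largest histogram cell."""
    n_bins = len(hist)
    i, j = max(
        ((i, j) for i in range(n_bins) for j in range(n_bins)),
        key=lambda ij: hist[ij[0]][ij[1]],
    )
    a_lo, a_hi = a_range
    d_lo, d_hi = d_range
    a_centre = a_lo + (i + 0.5) * (a_hi - a_lo) / n_bins
    d_centre = d_lo + (j + 0.5) * (d_hi - d_lo) / n_bins
    return a_centre, d_centre


# Toy estimates: two points near (0.8, 2.0) and one near (1.2, -1.0).
points = [(0.8, 2.0, 5.0), (0.82, 2.1, 4.0), (1.2, -1.0, 3.0)]
hist = energy_histogram(points, (0.0, 2.0), (-4.0, 4.0), 8)
peak = highest_peak(hist, (0.0, 2.0), (-4.0, 4.0))
```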

Next, assuming that the source signals other than the source of interest to be separated behave as Gaussian noise and that the signals are mutually independent, the likelihood function of Equation 6 is obtained.

Next, for WDO source signals satisfying Equation 3, when the i-th source signal dominates at (ω,τ), the normalized exponent term of the likelihood function, ρ_i(ω,τ), can be expressed as Equation 7.
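In the standard DUET maximum-likelihood formulation, the exponent term for source i has the form below. It is shown as an assumption about Equation 7, written with the document's symbols; the sign of the delay term follows a mixing model in which the second channel is the delayed one.

```latex
\rho_i(\omega,\tau) = \frac{\left| a_i\, e^{-\mathrm{i}\,\delta_i \omega}\, X_1(\omega,\tau) - X_2(\omega,\tau) \right|^2}{1 + a_i^2}
```

The term is zero when the point (ω,τ) is explained exactly by source i and grows as the observed inter-channel ratio departs from (a_i, δ_i).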

The parameters (a_i, δ_i) of all source signals can be estimated by minimizing a cost function J, which is computed by Equation 8.

As described above, the parameters (a_i, δ_i) of all source signals are obtained by minimizing the cost function, according to Equation 9 (130).

Next, the desired signal can be separated with a binary mask whose value is 1 where the likelihood of the source of interest is larger than that of every other source and 0 otherwise. The binary mask is generated by Equation 10 (140).
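A minimal sketch of this mask rule at a single time-frequency point. It assumes the usual DUET exponent term, |a·exp(−iδω)·X₁ − X₂|² / (1 + a²), for which a larger Gaussian likelihood corresponds to a smaller value, so the target wins where its term is the smallest. The names `rho` and `binary_mask_point` and the toy parameters are illustrative.

```python
import cmath


def rho(x1, x2, a, delta, omega):
    """Assumed DUET exponent term: mismatch between the point and (a, delta)."""
    return abs(a * cmath.exp(-1j * delta * omega) * x1 - x2) ** 2 / (1.0 + a * a)


def binary_mask_point(x1, x2, omega, params, target):
    """Return 1 if the target source has the smallest rho (largest likelihood), else 0.

    params: list of (a, delta) pairs, one per source; target: index into params.
    """
    costs = [rho(x1, x2, a, d, omega) for a, d in params]
    others = [c for k, c in enumerate(costs) if k != target]
    return 1 if all(costs[target] < c for c in others) else 0


# Toy point generated by source 0 (a = 0.8, delta = 2.0) at omega = 0.5.
params = [(0.8, 2.0), (1.2, -1.0)]
omega = 0.5
x1 = 1.0 + 2.0j
x2 = 0.8 * cmath.exp(-1j * omega * 2.0) * x1
mask_for_source_0 = binary_mask_point(x1, x2, omega, params, target=0)
mask_for_source_1 = binary_mask_point(x1, x2, omega, params, target=1)
```

Note that this rule needs the parameters of every source, which is the requirement the invention sets out to remove.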

Next, the i-th source signal, which is the source of interest, is estimated with the binary mask as in Equation 11.

Finally, the desired source signal is obtained by inverting the windowed Fourier transform (170).

DUET needs only two mixtures to separate WDO source signals, but to separate them successfully the number of sources must be known in advance, and the attenuation variable (a) and time delay variable (δ) must be estimated for every source. In practice the number of sources is very hard to determine in advance. Moreover, if some sources dominate at only a few time-frequency points, their parameters may be estimated inaccurately. Many speech enhancement applications, such as speech recognition, however, only require a single target signal to be enhanced. For such cases, the present invention proposes a method that enhances the signal of the source of interest efficiently by estimating only the parameters (a and δ) of the source of interest.

To solve the problems described above, an object of the present invention is to provide a DUET-based method of enhancing a sound source of interest that suits real applications because it needs neither the number of source signals in advance nor estimates of the parameters of the noise sources.

A feature of the present invention for achieving this object is a method of enhancing a sound source of interest using two sensor signals, a first channel signal x₁(t) and a second channel signal x₂(t), obtained when N sound source signals s₁(t), s₂(t), ..., s_N(t) are input to two sensors, the method comprising:
(a) converting the first channel signal and the second channel signal into time-frequency functions;
(b) applying the W-disjoint orthogonality assumption to estimate parameters of the converted first channel signal and second channel signal at every time-frequency component, and generating a histogram of the parameters;
(c) setting initial values of the parameters of the sound source of interest using the histogram;
(d) generating an initial continuous mask using the initial values of the parameters;
(e) learning and estimating the parameters of the sound source of interest using a preset first threshold value and the initial continuous mask;
(f) generating a continuous mask using the estimated parameters of the sound source of interest;
(g) generating a binary mask using a preset second threshold value and the continuous mask; and
(h) extracting the sound source of interest using the first channel signal and the binary mask.

In the method according to the above feature, the parameters preferably consist of an attenuation variable representing the level difference between the first and second channel signals and a time delay variable representing the difference in arrival time between the first and second channel signals.

In the method according to the above feature, the first and second channel signals in step (a) are preferably converted into time-frequency functions using a windowed Fourier transform.

In the method according to the above feature, step (c) preferably sets the parameters corresponding to the largest peak of the histogram as the initial values of the parameters of the sound source of interest or, when the direction of the sound source of interest is known in advance, sets the parameters of the peak closest to the parameters corresponding to that direction as the initial values.

In the method according to the above feature, the initial continuous mask of step (d) is preferably generated by the following equation,

where a_target and δ_target are the attenuation variable and the time delay variable of the sound source of interest, respectively.

In the method according to the above feature, the second threshold value is preferably set generally lower than the first threshold value.

In the method according to the above feature, the binary mask of step (g) is preferably generated by the following equation,

where Th(ω) is the second threshold value and the quantity compared against it is the continuous mask value.

In the method according to the above feature, step (e) preferably learns and estimates the parameters using the following equation,

where a_target is the attenuation variable of the sound source of interest, δ_target is the time delay variable of the sound source of interest, and J is a cost function.

The cost function J is obtained by accumulating, for each frequency bin, the likelihood exponent term ρ_target(ω,τ) over the time-frequency components whose initial continuous mask value is greater than the first threshold value. The likelihood exponent term is computed by its own equation (Equation 14 in the detailed description).

Although the method of enhancing a sound source of interest according to the present invention is based on DUET, it does not need the number of source signals in advance and does not need to estimate the attenuation and time delay variables of the noise sources, which makes it well suited to real applications.

To test the effectiveness of the method, the first threshold value was set, for each frequency bin, to the value that selects the top 10% of time-frame components, and the second threshold value was set, for each frequency bin, to the value that selects the top 35% of time-frame components. Under these settings the SNR gain and the retained speech ratio (RSR) of the present invention and of the conventional DUET method were compared for input signals with an SNR of 5 dB; the results are shown in Table 1. SNR gain and RSR are defined by Equations 18 and 19.

Table 1 shows the average SNR gains and RSRs as a function of the azimuth difference between the positions of the sound source of interest and the noise source. The mixtures used in the experiment were created by computing the attenuation variables (a) and time delay variables (δ) of the direct paths from the source positions to the microphone positions, attenuating and delaying the source signals accordingly, and summing them.

The method according to the present invention yields higher SNR gains than DUET because it does not estimate the parameters of the noise sources and therefore suffers no degradation from their estimation errors. Its RSRs are lower, because only 35% of the time-frequency components are used to enhance the source of interest. All RSRs nevertheless exceed 89.5%, so the enhanced speech is clean and clear enough to be understood. Furthermore, because only the attenuation variable (a) and the time delay variable (δ) of the source of interest need to be estimated, the method runs faster than DUET.

Table 2 shows the results of an experiment with acoustic signals recorded directly by two microphones in a real office. The recorded sources were speech signals played through loudspeakers at two selected positions. Because a real room is reverberant, the SNR gains and RSRs are lower overall. The relative results of the two methods nevertheless match the experiment without reverberation, with the present method performing better, except for a slight reversal at 30°.

The method according to the present invention not only resolves constraints that are problematic in real applications, such as the number of microphones and mixed source signals and whether each source signal is continuously active, but also processes faster because only the parameters of the source of interest need to be estimated, so it is better suited to practical environments.

The experimental results also show that the method provides a better SNR gain than the conventional DUET method together with sufficient RSR.

FIG. 1 is a diagram illustrating the general model of signals from a plurality of sound sources that are mixed and input to two sensors, and FIG. 2 is a block diagram schematically illustrating the general DUET method.
FIG. 3 illustrates the level difference and time difference between signals that originate from the same sound source and reach the two sensors.
FIG. 4 is a block diagram showing the overall method of enhancing a sound source of interest according to a preferred embodiment of the present invention.
FIG. 5 is a graph showing an example of a histogram generated according to a preferred embodiment of the present invention.

Hereinafter, a DUET-based method of enhancing a sound source of interest according to a preferred embodiment of the present invention is described in detail with reference to the accompanying drawings.

FIG. 4 is a block diagram showing the overall method of enhancing a sound source of interest according to a preferred embodiment of the present invention. Referring to FIG. 4, the method first applies a windowed Fourier transform to the signals x₁(t) and x₂(t) input to the two sensors (200).

Next, under the WDO assumption, the attenuation variable (a) and the time delay variable (δ) are estimated for every time-frequency component according to Equation 12, and a two-dimensional histogram is generated from the estimates (210). FIG. 5 is a graph showing an example of a histogram generated according to the present invention; four peaks can be seen in it. Estimating the parameters under the WDO assumption and generating the histogram are the same as in the conventional technique described above, so the description is not repeated.

Next, either the most dominant peak of the histogram or the peak closest to the parameters corresponding to a previously predicted direction of the source of interest is selected, and the attenuation and time delay variables of the source of interest are initialized from the selected peak (220). If a two-dimensional histogram over (a, δ) is built by counting, over all time-frequency components, how often the parameters estimated with Equation 12 fall into each bin, each peak corresponds to the parameters (a, δ) of one source signal. When the ambient noise grows louder, a speaker generally raises his or her voice involuntarily so that listeners can hear better; this is called the Lombard effect. As a result, the parameters (a, δ) of the target speaker usually correspond to the primary peak. The present invention therefore initializes the parameters from the values at the primary peak. If prior information about the approximate direction of the source of interest is available, the parameters can instead be initialized from the values at the peak closest to the (a, δ) corresponding to that direction.
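The two initialization options can be sketched as follows. The peak list format, the function name `choose_initial_peak`, and the use of an unweighted squared distance in the (a, δ) plane are assumptions for illustration; the patent does not state how closeness between a peak and the prior is measured.

```python
def choose_initial_peak(peaks, prior=None):
    """Pick initial (a, delta) for the target from histogram peaks.

    peaks: list of (a, delta, height) tuples.
    prior: optional (a, delta) expected for the target direction.
    """
    if prior is None:
        # No direction information: take the most dominant peak.
        a, d, _ = max(peaks, key=lambda p: p[2])
        return a, d
    # Direction known: take the peak nearest to the expected parameters.
    # Assumption: plain squared distance, with a and delta weighted equally.
    a, d, _ = min(
        peaks,
        key=lambda p: (p[0] - prior[0]) ** 2 + (p[1] - prior[1]) ** 2,
    )
    return a, d


peaks = [(0.8, 2.0, 9.0), (1.2, -1.0, 3.0), (1.0, 0.0, 5.0)]
init_dominant = choose_initial_peak(peaks)
init_from_prior = choose_initial_peak(peaks, prior=(1.1, -0.8))
```

Because a is dimensionless and δ is a time, a real system would likely scale the two axes before comparing distances.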

Next, an initial continuous mask is generated from the initialized attenuation and time delay variables (a_target, δ_target) of the source of interest, as expressed in Equation 13.

If the source of interest dominates at a particular time-frequency point, the initial continuous mask value there is close to 1; its minimum value is 0. The value is therefore a measure of how dominant the source of interest is at that point. To estimate the parameters of the source of interest accurately, a cost function corresponding to J in DUET must be defined. Because the method needs only the parameters of the source of interest and measures the share of the source of interest at each time-frequency point by the continuous mask value, the cost function can be obtained by accumulating ρ_target(ω,τ) over the time-frequency points that fall in a given top percentage of the continuous mask values in each frequency bin. ρ_target(ω,τ) is given by Equation 14.

Accordingly, a first threshold value is set that, applied to the initial continuous mask, selects the time-frequency components in the given top percentage (224). Next, the attenuation and time delay variables are learned using the first threshold value and the initial continuous mask, which yields estimates of the attenuation and time delay variables of the source of interest (230). Equation 15 expresses the attenuation and time delay variables estimated in this way.
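One way to derive such a per-bin threshold is sketched below, assuming the mask is stored as `mask[frame][bin]` and that a point is selected when its value is at or above the threshold. The function name `per_bin_thresholds`, the tie convention, and the toy fraction of 0.5 are assumptions; the patent only states that a top percentage is selected in each frequency bin.

```python
import math


def per_bin_thresholds(mask, fraction):
    """Return one threshold per frequency bin.

    In each bin, about `fraction` of the time frames have a mask value
    at or above the returned threshold.
    """
    n_frames = len(mask)
    n_bins = len(mask[0])
    keep = max(1, math.ceil(fraction * n_frames))  # frames to keep per bin
    thresholds = []
    for k in range(n_bins):
        column = sorted((mask[t][k] for t in range(n_frames)), reverse=True)
        thresholds.append(column[keep - 1])  # value of the keep-th largest entry
    return thresholds


# Toy mask with 4 frames and 2 frequency bins; keep the top half in each bin.
mask = [[0.9, 0.1], [0.2, 0.8], [0.7, 0.3], [0.4, 0.6]]
thresholds = per_bin_thresholds(mask, 0.5)
```

Setting the threshold separately for each bin keeps the selected share constant across frequency even when the mask values are distributed differently from bin to bin.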

A continuous mask is generated from the attenuation and time delay variables of the source of interest estimated with Equation 15 (240). Next, a second threshold value Th(ω) is set (242), and a binary mask is generated from the continuous mask and the second threshold value according to Equation 16 (250).

Next, according to Equation 17, the binary mask obtained above is multiplied by the first channel signal to obtain the source signal S_j(ω,τ) (260). Finally, the desired source signal s_j(t) is obtained by the inverse windowed Fourier transform (270).
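The masking step can be sketched as below, taking the continuous mask and the per-bin second thresholds as given inputs. The function name `apply_binary_mask` and the choice of "at or above the threshold" for a mask value of 1 are assumptions, since the exact inequality of Equation 16 is not available in this text.

```python
def apply_binary_mask(x1, cont_mask, thresholds):
    """Keep X1 where the continuous mask reaches the bin threshold, zero elsewhere.

    x1:         x1[frame][bin], complex transform of the first channel.
    cont_mask:  cont_mask[frame][bin], values between 0 and 1.
    thresholds: thresholds[bin], the second threshold Th(omega) per bin.
    """
    enhanced = []
    for frame, mask_row in zip(x1, cont_mask):
        enhanced.append([
            value if m >= th else 0j  # binary mask of 1 or 0 times X1
            for value, m, th in zip(frame, mask_row, thresholds)
        ])
    return enhanced


x1 = [[1 + 1j, 2 + 0j], [3 + 0j, 4 - 1j]]
cont_mask = [[0.9, 0.1], [0.2, 0.8]]
S_hat = apply_binary_mask(x1, cont_mask, [0.5, 0.5])
```

The result would then be passed to an inverse windowed Fourier transform (overlap-add) to obtain the time-domain signal.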

The method according to the present invention can thus enhance the signal of the source of interest efficiently even when the number of sources is unknown.

Claims

A DUET-based method of enhancing a sound source of interest using two sensor signals, a first channel signal x₁(t) and a second channel signal x₂(t), obtained when N sound source signals s₁(t), s₂(t), ..., s_N(t) are input to two sensors, the method comprising:
(a) converting the first channel signal and the second channel signal into time-frequency functions;
(b) applying the W-disjoint orthogonality assumption to estimate parameters of the converted first channel signal and second channel signal at every time-frequency component, and generating a histogram of the parameters;
(c) setting initial values of the parameters of the sound source of interest using the histogram;
(d) generating an initial continuous mask using the initial values of the parameters;
(e) learning and estimating the parameters of the sound source of interest using a preset first threshold value and the initial continuous mask;
(f) generating a continuous mask using the estimated parameters of the sound source of interest;
(g) generating a binary mask using a preset second threshold value and the continuous mask; and
(h) extracting the sound source of interest using the first channel signal and the binary mask.

The method of claim 1, wherein the parameters consist of an attenuation variable representing the level difference between the first channel signal and the second channel signal, and a time delay variable representing the difference in arrival time between the first channel signal and the second channel signal.

The method of claim 1, wherein the first channel signal and the second channel signal of step (a) are transformed into a time-frequency function using a Windowed-Fourier Transform.

The method of claim 1, wherein step (c) sets the parameters corresponding to the largest peak of the histogram as the initial values of the parameters of the sound source of interest or, when the direction of the sound source of interest is known in advance, sets the parameters of the peak closest to the parameters corresponding to that direction as the initial values.

The method of claim 1, wherein the initial continuous mask of step (d) is generated by the following equation,

where a_target and δ_target are the attenuation variable and the time delay variable of the sound source of interest, respectively.

The method according to claim 1, wherein the second threshold value is set lower than the first threshold value.

The method of claim 1, wherein the binary mask of step (g) is generated by the following equation,

where Th(ω) is the second threshold value and the quantity compared against it is the continuous mask value.

The method of claim 1, wherein step (e) learns and estimates the parameters using the following equation,

where a_target is the attenuation variable of the sound source of interest, δ_target is the time delay variable of the sound source of interest, and J is a cost function.

The method of claim 8, wherein the cost function J is obtained by accumulating, for each frequency bin, a likelihood exponent term ρ_target(ω,τ) over the time-frequency components whose initial continuous mask value is greater than the first threshold value, and the likelihood exponent term is calculated by the following equation.