KR100901367B1

KR100901367B1 - Speech enhancement method based on minima controlled recursive averaging technique incorporating conditional map

Info

Publication number: KR100901367B1
Application number: KR1020080099002A
Authority: KR
Inventors: 장준혁; 금종모
Original assignee: 인하대학교 산학협력단
Priority date: 2008-10-09
Filing date: 2008-10-09
Publication date: 2009-06-05

Abstract

Provided is a speech enhancement method based on minima controlled recursive averaging incorporating a conditional maximum a posteriori (MAP) criterion, which improves speech quality by modifying the speech presence probability so that it is conditioned on the presence or absence of speech in the previous frame. A threshold is varied according to whether a speech signal was present in the previous frame (S10). The ratio of the local energy of the noisy speech signal to its minimum is compared with this threshold to determine whether a speech signal is present in the current frame (S20). The speech presence probability is estimated according to the determined presence or absence of the speech signal (S30). The noise is estimated using the estimated speech presence probability (S40). The quality of the noisy speech is then improved using the estimated noise.

Description

Speech Enhancement Method Using a Conditional Maximum A Posteriori Based Minima Controlled Recursive Averaging Technique {SPEECH ENHANCEMENT METHOD BASED ON MINIMA CONTROLLED RECURSIVE AVERAGING TECHNIQUE INCORPORATING CONDITIONAL MAP}

The present invention relates to a speech enhancement method, and more particularly to a speech enhancement method that can greatly improve speech quality by adding a conditional maximum a posteriori (MAP) criterion to the minima controlled recursive averaging technique, that is, by modifying the speech presence probability so that the signal in the current frame is conditioned on the presence or absence of speech in the previous frame.

Accurate noise estimation is one of the most important elements of a practical speech enhancement system, which must in particular be able to handle non-stationary noise. Because the noise estimate strongly affects the quality of the enhanced speech, an estimate that is too low leaves unnatural residual noise, while an estimate that is too high makes the speech sound muffled and reduces intelligibility. Traditional noise estimation most commonly averages the noise over speech-absent intervals, but an estimate that relies on such a voice activity detector (VAD) is difficult to tune and outputs distorted speech when used in applications with a low signal-to-noise ratio (SNR). Recent studies mainly apply soft-decision methods to estimate the noise power spectrum even during speech activity, and the Minimum Statistics Noise Estimation (MSNE) technique, proposed as an alternative to VAD-based methods, is distinctive in that it does not use a VAD.

Specifically, MSNE is based on optimally smoothing the power spectrum of the signal and applying minimum statistics. The power spectrum smoothing algorithm uses a first-order recursive system with a time- and frequency-dependent smoothing parameter. The smoothing parameter is optimized to minimize the conditional mean squared error so that non-stationary noise can be tracked. However, this noise estimate responds sensitively only at outliers and has a variance about twice that of traditional methods. Moreover, the method tends to attenuate low-energy phonemes further, particularly when the minimum search window is too short. Computationally lighter and more efficient minimum tracking methods have also been proposed, but when the noise energy rises abruptly, the noise estimate adapts slowly and the signal is cancelled.

Meanwhile, the minima controlled recursive averaging (MCRA) technique recently proposed by Cohen addresses these drawbacks by averaging the power spectrum using a smoothing parameter that is adjusted by the signal presence probability in each subband. The presence of a signal in each subband is determined from the ratio of the local energy of the noisy signal to its minimum within a given window. This ratio is compared with a fixed threshold, and if it is smaller than the threshold the speech signal is regarded as absent. Averaging is also performed along the time axis to reduce fluctuations between speech-present and speech-absent segments, because speech presence is highly correlated between frames.

However, the MCRA algorithm also has several problems. Although its efficient local minimum tracking reduces computational complexity, a delay arises when noise appears abruptly. In addition, reliability suffers because signal presence in each subband is decided solely by comparing the ratio of the local energy of the noisy signal to its minimum within a given window against a single fixed threshold. In general, speech activity is strongly correlated across adjacent frames, so the frame immediately before or after a speech-active frame is itself likely to contain speech, and the converse also holds.

In view of these findings, it is worth attempting to improve the quality of noisy speech by applying a hang-over scheme using a hidden Markov model (HMM), built on the existing statistical assumptions, to a conditional maximum a posteriori (MAP) criterion and combining it with MCRA.

The present invention arises from recognition of this need. Its object is to propose a speech enhancement method that, instead of the conventional MCRA approach of estimating the presence of a speech signal in each subband using only a single fixed threshold, adds a condition on whether a speech signal was present in the subband of the previous frame. The more reliable comparison improves the speech presence probability estimate and yields a better noise estimation method, with which the noise spectrum can be estimated and the quality of noisy speech improved.

To achieve the above object, a speech enhancement method using a conditional MAP based minima controlled recursive averaging technique according to a feature of the present invention comprises:

(1) a first step of varying a threshold according to the presence or absence of a speech signal in the previous frame;

(2) a second step of determining the presence or absence of a speech signal in the current frame by comparing the ratio of the local energy of the noisy speech signal to the minimum of the local energy with the threshold obtained in the first step;

(3) a third step of estimating a speech presence probability according to the presence or absence of the speech signal determined in the second step;

(4) a fourth step of estimating the noise using the speech presence probability estimated in the third step; and

(5) a fifth step of improving the quality of the noisy speech using the noise estimated in the fourth step.

Preferably, in the first step, the threshold is varied according to the presence or absence of a speech signal in the previous frame using the decision rule that the detailed description derives as Equation 18.

If there is no speech in the previous frame, the rule takes the form of Equation 19, and if there is speech in the previous frame, it takes the form of Equation 20.

Expressed compactly, these two cases are given by Equation 22, whose symbols are defined in the detailed description.

According to the speech enhancement method of the present invention, speech quality can be greatly improved by adding a conditional MAP criterion to the minima controlled recursive averaging technique, that is, by modifying the speech presence probability so that the signal in the current frame is conditioned on the presence or absence of speech in the previous frame.

Hereinafter, embodiments of the present invention are described in detail with reference to the accompanying drawings.

FIG. 1 is a diagram showing the configuration of a speech enhancement method according to an embodiment of the present invention. As shown in FIG. 1, the method includes a threshold updating step (S10), a step of determining the presence or absence of a speech signal in the current frame (S20), a speech presence probability estimation step (S30), a noise estimation step (S40), and a speech quality improvement step (S50).

First, in the threshold updating step (S10), the threshold is varied according to the presence or absence of a speech signal in the previous frame. The principle and the equations by which the threshold is varied are described in detail later.

Next, in step S20, the ratio of the local energy of the noisy speech signal to the minimum of the local energy is compared with the threshold obtained in step S10 to determine whether a speech signal is present in the current frame. In the speech presence probability estimation step (S30), the speech presence probability is estimated according to the decision made in step S20, and in the noise estimation step (S40), the noise is estimated using the speech presence probability estimated in step S30.

Finally, in the speech quality improvement step (S50), the quality of the noisy speech is improved using the noise estimated in step S40.

The following first describes how the time-varying smoothing parameter, adjusted by the speech presence probability, is obtained for noise estimation. It then describes in detail how the ratio of the local energy to its minimum, used to obtain the speech presence probability, is compared with a threshold that depends on the presence or absence of speech in the previous frame.

1. Noise estimation by minima controlled recursive averaging

Let x(n) and d(n) denote the speech signal and an uncorrelated additive noise signal, respectively, where n is the discrete-time index. The observed signal is y(n) = x(n) + d(n). Dividing it into partially overlapping frames, applying a window, and taking the short-time Fourier transform (STFT) gives Equation 1 below.

Y(k,l) = X(k,l) + D(k,l)

Here, Y(k,l) is the k-th frequency component of the l-th frame. Denoting the basic hypotheses used in speech enhancement for speech absence and speech presence by H₀(k,l) and H₁(k,l), respectively, they can be expressed as Equation 2 below.

H₀(k,l): Y(k,l) = D(k,l)

H₁(k,l): Y(k,l) = X(k,l) + D(k,l)
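As a minimal sketch of Equation 1, the following Python snippet splits a toy signal into overlapping frames, applies a window, and takes a DFT of each frame, then checks that the mixture spectrum equals the sum of the speech and noise spectra. The Hann window, the frame length of 8, the hop of 4, and the toy signals are illustrative assumptions, not values from this document.

```python
import cmath
import math


def stft(signal, frame_len=8, hop=4):
    """Return a list of frames; each frame is a list of DFT coefficients."""
    # Hann window (an assumed choice; the source does not name the window).
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_len)
              for n in range(frame_len)]
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        seg = [signal[start + n] * window[n] for n in range(frame_len)]
        # Naive DFT of the windowed segment: one coefficient per bin k.
        spectrum = [
            sum(seg[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                for n in range(frame_len))
            for k in range(frame_len)
        ]
        frames.append(spectrum)
    return frames


# Toy signals: x plays the role of speech, d of noise, y is the observation.
x = [math.sin(0.3 * n) for n in range(32)]
d = [0.1 * ((n * 7) % 5 - 2) for n in range(32)]
y = [xn + dn for xn, dn in zip(x, d)]

X, D, Y = stft(x), stft(d), stft(y)

# The STFT is linear, so Y(k,l) = X(k,l) + D(k,l) holds in every bin.
max_err = max(abs(Y[l][k] - (X[l][k] + D[l][k]))
              for l in range(len(Y)) for k in range(8))
```

Because the transform is linear, `max_err` is zero up to floating-point rounding, which is why the hypotheses of Equation 2 can be stated per bin.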

Here, X(k,l) and D(k,l) are the Fourier transform coefficients of the clean speech signal and the noise signal, respectively. To estimate the variance of the noise signal in the k-th subband, temporal recursive smoothing is applied to the observed signal during speech-absent intervals, as expressed in Equation 3 below.
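As a reading aid, the recursion described here can be written out under the assumption that it follows the standard MCRA formulation. The symbol \hat{\lambda}_d(k,l) for the noise variance estimate is an assumed notation, and the form below is a sketch rather than a transcription of the source's Equation 3: the estimate is smoothed toward the observed power when speech is hypothesized absent and held when speech is hypothesized present.

```latex
\begin{aligned}
H'_0(k,l) &:\quad \hat{\lambda}_d(k,l+1) = \alpha_d\,\hat{\lambda}_d(k,l) + (1-\alpha_d)\,|Y(k,l)|^2 \\
H'_1(k,l) &:\quad \hat{\lambda}_d(k,l+1) = \hat{\lambda}_d(k,l)
\end{aligned}
```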

Here, α_d (0 < α_d < 1) is a smoothing parameter, and H'₀ and H'₁ denote the hypothesized absence and presence of speech, respectively, for the purpose of updating the noise power. The hypotheses of Equation 2, used for estimating the speech signal, must be distinguished from those of Equation 3, used to control the update of the noise spectrum. Deciding that speech is absent (H₀) when it is actually present (H₁) is more harmful when estimating the speech signal than when estimating the noise signal. Different decision rules are therefore used, and in general H₁ is decided with higher confidence than H'₁, that is, P(H₁|Y) ≥ P(H'₁|Y).

Letting p(k,l) = P(H'₁(k,l)|Y(k,l)) denote the conditional speech presence probability, Equation 3 becomes Equation 4 below.

Here, the time-varying smoothing parameter, which is adjusted by the speech presence probability, is given by Equation 5 below.
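The role of the time-varying smoothing parameter can be illustrated with a short Python sketch. It assumes the standard MCRA form, in which the parameter equals α_d + (1 − α_d)·p(k,l) and the noise estimate is a weighted average of its previous value and the observed power. The function names are hypothetical, and α_d = 0.95 is the value reported in the experimental section.

```python
def smoothing_parameter(p, alpha_d=0.95):
    """Time-varying smoothing parameter (assumed form): alpha_d + (1 - alpha_d) * p."""
    return alpha_d + (1.0 - alpha_d) * p


def update_noise(noise_prev, y_power, p, alpha_d=0.95):
    """One recursive-averaging step of the noise estimate for a single bin."""
    a = smoothing_parameter(p, alpha_d)
    return a * noise_prev + (1.0 - a) * y_power


# Speech certainly present (p = 1): the noise estimate is held.
frozen = update_noise(noise_prev=1.0, y_power=9.0, p=1.0)
# Speech certainly absent (p = 0): the estimate moves toward the observed power.
adapted = update_noise(noise_prev=1.0, y_power=9.0, p=0.0)
```

With these inputs the estimate stays at about 1.0 when p = 1 and moves to about 1.4 when p = 0, so a higher speech presence probability slows the noise update.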

p(k,l) is estimated using Equation 6 below.

Here, α_p (0 < α_p < 1) is a smoothing parameter and δ is the threshold for speech presence.
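A minimal sketch of this estimate follows, assuming the standard MCRA form in which a hard indicator (1 when the energy ratio exceeds δ, otherwise 0) is recursively smoothed with α_p. The function name, the ratio sequence, and δ = 5.0 are illustrative assumptions, while α_p = 0.2 is the value reported in the experimental section.

```python
def update_presence_prob(p_prev, s_r, delta, alpha_p=0.2):
    """Smooth the hard speech/non-speech decision into a probability."""
    indicator = 1.0 if s_r > delta else 0.0
    return alpha_p * p_prev + (1.0 - alpha_p) * indicator


p = 0.0
trace = []
for s_r in [1.0, 8.0, 8.0, 1.0]:   # toy energy ratios for one bin
    p = update_presence_prob(p, s_r, delta=5.0)
    trace.append(p)
```

For this toy sequence the probability follows roughly 0.0, 0.8, 0.96, 0.192: it rises quickly when the ratio exceeds the threshold and decays when it falls below it.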

In particular, the local energy is given by Equation 7 below.

S(k,l) = α_S S(k,l-1) + (1-α_S)|Y(k,l)|²

Here, α_S (0 < α_S < 1) is a parameter.
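Equation 7 is a first-order recursive average over frames. The sketch below applies it to one frequency bin. The function name and the input powers are illustrative, the initialization S(k,0) = |Y(k,0)|² is an assumption, and α_S = 0.45 is the value reported in the experimental section.

```python
def smooth_local_energy(power_frames, alpha_s=0.45):
    """S(k,l) = alpha_s * S(k,l-1) + (1 - alpha_s) * |Y(k,l)|^2 for one bin k."""
    s = power_frames[0]          # assumed initialization: S(k,0) = |Y(k,0)|^2
    out = [s]
    for y_power in power_frames[1:]:
        s = alpha_s * s + (1.0 - alpha_s) * y_power
        out.append(s)
    return out


# A single energy burst in frame 2 is spread over the following frames.
local_energy = smooth_local_energy([1.0, 1.0, 5.0, 1.0])
```

The burst of 5.0 yields a smoothed value of about 3.2 in its own frame and about 1.99 in the next, showing how the smoothing limits frame-to-frame fluctuation.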

The minimum of the local energy, S_min(k,l), and a temporary variable, S_tmp(k,l), are initialized as S_min(k,0) = S(k,0) and S_tmp(k,0) = S(k,0). Then, as in Equation 8 below, the minimum for the current frame is obtained by comparing the local energy with the minimum of the previous frame.

S_min(k,l) = min{S_min(k,l-1), S(k,l)}

S_tmp(k,l) = min{S_tmp(k,l-1), S(k,l)}

Each time L frames have been processed, the variables are re-initialized by Equation 9 below.

S_min(k,l) = min{S_tmp(k,l-1), S(k,l)}

S_tmp(k,l) = S(k,l)

By repeating the procedures of Equations 8 and 9, the minimum is tracked as in Equation 10 below.
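The minimum tracking of Equations 8 and 9 can be sketched for a single frequency bin as follows. The function name, the window length L = 3, and the input values are illustrative assumptions. Resetting whenever the frame index is a multiple of L is one plausible reading of "after L frames have been processed".

```python
def track_minimum(s_values, window_len):
    """Track the minimum of the local energy S(k,l) for one bin k."""
    s_min = s_tmp = s_values[0]      # S_min(k,0) = S_tmp(k,0) = S(k,0)
    mins = [s_min]
    for l in range(1, len(s_values)):
        s = s_values[l]
        if l % window_len == 0:
            # Equation 9: restart from the temporary minimum every L frames.
            s_min = min(s_tmp, s)
            s_tmp = s
        else:
            # Equation 8: running minimum of both variables.
            s_min = min(s_min, s)
            s_tmp = min(s_tmp, s)
        mins.append(s_min)
    return mins


# The low value 1.0 in frame 1 is forgotten once two resets have passed.
mins = track_minimum([2.0, 1.0, 3.0, 4.0, 5.0, 6.0, 7.0], window_len=3)
```

Here `mins` is `[2.0, 1.0, 1.0, 1.0, 1.0, 1.0, 4.0]`: the temporary variable lets the tracked minimum rise again after the noise floor increases, within at most 2L frames.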

2. Conditional MAP based minima controlled recursive averaging

The conventional MAP criterion follows the decision rule of Equation 11 below.

Here, H_n denotes the correct hypothesis for the n-th frame. By Bayes' rule, the criterion of Equation 11 can be rewritten as Equation 12 below.

To compensate for the bias, Equation 12 can be modified as Equation 13 below.

Here, α ≥ 1.
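For orientation, one common way to write such a rule is sketched below: a MAP decision between H₀ and H₁ given an observation X, its likelihood-ratio form under Bayes' rule, and a variant with a bias-compensating factor α ≥ 1. These are standard forms offered as an assumption, and the exact layout of the source's Equations 11 to 13 may differ.

```latex
\begin{aligned}
\text{MAP rule:}\quad
  & \frac{P(H_n = H_1 \mid X)}{P(H_n = H_0 \mid X)}
    \underset{H_0}{\overset{H_1}{\gtrless}} 1 \\
\text{Bayes form:}\quad
  & \frac{p(X \mid H_n = H_1)}{p(X \mid H_n = H_0)}
    \underset{H_0}{\overset{H_1}{\gtrless}}
    \frac{P(H_n = H_0)}{P(H_n = H_1)} \\
\text{With bias compensation:}\quad
  & \frac{P(H_n = H_1 \mid X)}{P(H_n = H_0 \mid X)}
    \underset{H_0}{\overset{H_1}{\gtrless}} \alpha
    \;\Longleftrightarrow\;
    \frac{p(X \mid H_n = H_1)}{p(X \mid H_n = H_0)}
    \underset{H_0}{\overset{H_1}{\gtrless}}
    \alpha\,\frac{P(H_n = H_0)}{P(H_n = H_1)},
    \qquad \alpha \ge 1
\end{aligned}
```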

In the present invention, whereas MCRA estimates signal presence in each subband using only a single threshold, a condition on whether the signal in the subband of the previous frame contained speech is added, and two thresholds are used according to this condition, so that a more reliable comparison improves the signal presence probability estimate. The following derives the new definition of the speech presence probability that incorporates this condition.

Rewriting Equation 13 by Bayes' rule gives Equation 14 below.

In general, speech activity is strongly correlated across adjacent frames. That is, the frame immediately before or after a speech-active frame is likely to contain speech, and the converse also holds. Exploiting this correlation through an HMM-based hang-over scheme effectively reduces the errors of a statistical model based VAD. Because of this strong inter-frame correlation, instead of the conventional a posteriori probability P(H_n|X), the a posteriori probability P(H_n|X, H_{n-1}), conditioned on both the observation of the current frame and the decision for the previous frame, can be computed. Adding the previous-frame condition to Equation 14 and applying Bayes' rule gives Equation 15 below.

Equation 15 can in turn be rewritten as Equation 16 below.

Although speech activity in the current frame depends on the previous frame, it is influenced predominantly by the distribution of the DFT coefficients of the noisy speech observed in the current frame. Moreover, the parameters of the distributions p(S_r|H_n=H₁, H_{n-1}=H₀) and p(S_r|H_n=H₀, H_{n-1}=H₁) cannot be estimated reliably owing to the scarcity of sample data. For these reasons, the simplifying assumption of Equation 17 below can be made.

p(S_r|H_n=H_j, H_{n-1}=H_i) = p(S_r|H_n=H_j),  i = 0, 1,  j = 0, 1

Using this, Equation 16 can be simplified as Equation 18 below.

In Equation 18, if there is no speech in the previous frame, Equation 19 below is obtained,

and if there is speech in the previous frame, Equation 20 below is obtained.
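The structure of this simplification can be sketched as follows. Starting from a posterior ratio conditioned on the previous decision, P(H_n=H₁|S_r, H_{n-1}=H_i) / P(H_n=H₀|S_r, H_{n-1}=H_i) compared with α, Bayes' rule together with the assumption of Equation 17 moves the dependence on the previous frame entirely into the threshold. This is a reconstruction under those assumptions, not a transcription of the source's Equations 18 to 20.

```latex
\begin{aligned}
\frac{p(S_r \mid H_n = H_1)}{p(S_r \mid H_n = H_0)}
  &\underset{H_0}{\overset{H_1}{\gtrless}}
  \alpha\,\frac{P(H_n = H_0 \mid H_{n-1} = H_i)}{P(H_n = H_1 \mid H_{n-1} = H_i)},
  \qquad i = 0, 1 \\
H_{n-1} = H_0:\quad \text{threshold}
  &= \alpha\,\frac{P(H_n = H_0 \mid H_{n-1} = H_0)}{P(H_n = H_1 \mid H_{n-1} = H_0)} \\
H_{n-1} = H_1:\quad \text{threshold}
  &= \alpha\,\frac{P(H_n = H_0 \mid H_{n-1} = H_1)}{P(H_n = H_1 \mid H_{n-1} = H_1)}
\end{aligned}
```

The left-hand side no longer depends on the previous frame, so the only effect of the condition is to select one of two thresholds.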

Increasing the number of thresholds in this way provides an additional degree of freedom with which VAD performance can be improved. Owing to the strong correlation of speech activity between adjacent frames, Equation 21 below generally holds.

Equation 21 expresses that if speech activity was observed in the previous frame, the probability of observing speech activity in the current frame is high and the probability of not observing it is low, and vice versa. Consequently, β' is greater than β''.

Summarizing these results gives Equation 22 below.

Here, γ is given by Equation 23 below.
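A minimal sketch of the resulting two-threshold decision is given below. It assumes that the energy ratio S_r is compared directly with γ, and that γ equals β' when the previous frame was judged non-speech and β'' when it was judged speech, which is consistent with β' being greater than β''. The values 9 and 3.5 are those reported in the experimental section. The function names and the ratio sequence are hypothetical.

```python
BETA_PREV_NO_SPEECH = 9.0   # beta'  (assumed: used when the previous frame had no speech)
BETA_PREV_SPEECH = 3.5      # beta'' (assumed: used when the previous frame had speech)


def select_threshold(prev_speech):
    """Pick gamma from the decision made for the previous frame."""
    return BETA_PREV_SPEECH if prev_speech else BETA_PREV_NO_SPEECH


def detect_speech(ratios, prev_speech=False):
    """Frame-by-frame speech decisions for one bin, with a conditional threshold."""
    decisions = []
    for s_r in ratios:
        prev_speech = s_r > select_threshold(prev_speech)
        decisions.append(prev_speech)
    return decisions


ratios = [2.0, 6.0, 10.0, 6.0, 4.0, 3.0, 6.0]
decisions = detect_speech(ratios)
```

For this sequence the decisions are `[False, False, True, True, True, False, False]`. Once speech is detected, the lower threshold keeps the decision at speech through the weaker frames that follow, which is the hang-over behaviour the two thresholds are meant to provide.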

As described above, the present invention varies the threshold according to the presence or absence of a speech signal in the previous frame and uses it to estimate the speech presence probability adaptively, which enables more accurate noise estimation.

FIG. 2 compares the speech presence probability obtained with the MAP-based threshold test, using F16 noise (SNR = 10 dB) as an example. FIG. 2a shows the clean speech waveform, and FIG. 2b shows the speech presence probability for each short-time frame. FIG. 2 clearly shows that, compared with the conventional MCRA method, the proposed method identifies speech sooner at speech onsets, and that at speech offsets the probability does not fall abruptly, which reduces the loss of speech information caused by judging speech frames to be non-speech.

3. Experimental results

In the conventional MCRA algorithm, reliability suffers because signal presence in each subband is decided by comparing the ratio of the local energy of the noisy signal to its minimum within a given window against a single fixed threshold. The present invention addresses this problem by applying the conditional MAP criterion to update the threshold continually, which makes the speech presence probability more reliable and thereby improves speech enhancement. To evaluate the quality of the proposed algorithm, the widely used ITU-T P.862 PESQ measure and a subjective quality evaluation were carried out. Table 1 and Table 2 show the resulting PESQ scores and the subjective quality evaluation comparison, respectively.

[Table 1]

Noise type     Method     SNR 5 dB   SNR 10 dB   SNR 15 dB
White noise    MCRA       1.936      2.296       2.641
               Proposed   2.050      2.391       2.715
Babble noise   MCRA       2.379      2.697       2.972
               Proposed   2.397      2.724       3.004
F16 noise      MCRA       2.056      2.467       2.776
               Proposed   2.152      2.548       2.846

For the PESQ test of Table 1, the samples consisted of speech in which a male and a female speaker each uttered 100 sentences, sampled at 8 kHz with a frame size of 10 ms, to which four types of noise were added. White noise, babble noise, and F16 noise from the NOISEX-92 database were used, and tests were run at SNRs of 5, 10, and 15 dB. The PESQ values are averages over these samples. For the PESQ of the conventional MCRA, the weighting parameters were set to α_d = 0.95, α_p = 0.2, and α_S = 0.45, and the thresholds β' and β'' were set to β' = 9 and β'' = 3.5, values found experimentally to be optimal across various noise environments.

Table 1 shows that the proposed method gives a marked improvement over the conventional method, especially for white noise and F16 noise at 5 dB and 10 dB. This indicates that the proposed algorithm achieves better enhancement because, as in FIG. 1, it estimates the speech presence probability better even at low SNR. For babble noise, owing to the nature of the noise, the speech presence probability of the proposed algorithm is similar to that of the conventional one, but a slight improvement is still observed.
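The size of the improvement can be checked directly from the PESQ values of Table 1. The following sketch computes the gain of the proposed method over MCRA for each noise type at SNRs of 5, 10, and 15 dB. The dictionary layout and variable names are illustrative, and the numbers are those of the table.

```python
# PESQ scores from Table 1, ordered by SNR = 5, 10, 15 dB.
pesq = {
    "white":  {"mcra": [1.936, 2.296, 2.641], "proposed": [2.050, 2.391, 2.715]},
    "babble": {"mcra": [2.379, 2.697, 2.972], "proposed": [2.397, 2.724, 3.004]},
    "f16":    {"mcra": [2.056, 2.467, 2.776], "proposed": [2.152, 2.548, 2.846]},
}

# PESQ gain of the proposed method over MCRA, rounded to three decimals.
gains = {
    noise: [round(p - m, 3) for m, p in zip(scores["mcra"], scores["proposed"])]
    for noise, scores in pesq.items()
}
```

The gains are about 0.114, 0.095, and 0.074 for white noise, 0.096, 0.081, and 0.070 for F16 noise, and 0.018, 0.027, and 0.032 for babble noise, which matches the observation that the largest improvements occur for white and F16 noise at the lower SNRs.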

[Table 2]

Noise type     Method     SNR 5 dB   SNR 10 dB   SNR 15 dB
White noise    MCRA       2.17       2.56        3.03
               Proposed   2.23       2.65        3.07
Babble noise   MCRA       2.93       3.32        3.69
               Proposed   2.95       3.42        3.76
F16 noise      MCRA       2.23       2.77        3.20
               Proposed   2.33       2.90        3.46

For the subjective quality evaluation of Table 2, 20 listeners rated noisy signals in which white, babble, and F16 noise were added at SNRs of 5, 10, and 15 dB to speech in which a male and a female speaker each uttered 10 sentences. Table 2 shows that the proposed method improves on the conventional method for all three noise environments and every SNR. In conclusion, the proposed MCRA algorithm based on conditional MAP is confirmed to perform well in a variety of noise environments.

The present invention described above may be modified or applied in various ways by those of ordinary skill in the art to which the invention pertains, and the scope of the technical idea of the invention should be defined by the following claims.

FIG. 1 is a diagram showing the configuration of a speech enhancement method according to an embodiment of the present invention.

FIG. 2 compares speech presence probabilities under F16 noise (SNR = 10 dB). FIG. 2a shows the clean speech waveform, and FIG. 2b shows the speech presence probability for each short-time frame.

<Explanation of reference numerals for main parts of the drawings>

S10: threshold updating step

S20: step of determining the presence or absence of a speech signal in the current frame

S30: speech presence probability estimation step

S40: noise estimation step

S50: speech quality improvement step

Claims

A speech enhancement method comprising:

(1) a first step of varying a threshold according to the presence or absence of a speech signal in the previous frame;

(2) a second step of determining the presence or absence of a speech signal in the current frame by comparing the ratio of the local energy of the noisy speech signal to the minimum of the local energy with the threshold obtained in the first step;

(3) a third step of estimating a speech presence probability according to the presence or absence of the speech signal determined in the second step;

(4) a fourth step of estimating the noise using the speech presence probability estimated in the third step; and

(5) a fifth step of improving the quality of the noisy speech using the noise estimated in the fourth step.

The method of claim 1,

wherein, in the first step, the threshold is varied according to the presence or absence of a speech signal in the previous frame using Equation a below.

In Equation a, if there is no speech in the previous frame, Equation b below applies,

and if there is speech in the previous frame, Equation c below applies.

Expressed compactly as a single equation, Equation b and Equation c give Equation d below.

Here, S_r = S_r(k,l) is the ratio of the local energy to its minimum, defined as S(k,l)/S_min(k,l), where S(k,l) and S_min(k,l) are the local energy and the minimum of the local energy, respectively. H_n denotes the correct hypothesis for the n-th frame, and H₀ and H₁ denote the hypotheses of speech absence and speech presence, respectively. α is a coefficient that compensates for the bias, with α ≥ 1.