KR101124712B1

KR101124712B1 - A voice activity detection method based on non-negative matrix factorization

Info

Publication number: KR101124712B1
Application number: KR1020100074108A
Authority: KR
Inventors: 장준혁; 강상익
Original assignee: 인하대학교 산학협력단
Priority date: 2010-07-30
Filing date: 2010-07-30
Publication date: 2012-03-20
Also published as: KR20120021428A

Abstract

The present invention relates to a voice activity detection method based on non-negative matrix factorization. More specifically, the method comprises: (1) deriving basis vectors from a noise signal and an input signal using non-negative matrix factorization; (2) calculating the error between the noise-signal basis vectors and the input-signal basis vectors derived in step (1); and (3) detecting speech by applying a threshold to the error calculated in step (2).
According to the proposed method, the error between the basis vectors extracted from the noise signal and the input signal is calculated and used to identify the active speech intervals. The noise environment is estimated from the distribution of error values in the noise estimation interval, and the optimal threshold for that environment is selected and applied to the error to detect speech. This enables high-performance speech detection even in relatively poor noise environments where the signal-to-noise ratio is low.

Description

Speech detection method based on non-negative matrix factorization {A VOICE ACTIVITY DETECTION METHOD BASED ON NON-NEGATIVE MATRIX FACTORIZATION}

The present invention relates to a voice activity detection method, and more particularly to a voice activity detection method based on non-negative matrix factorization.

A voice activity detector (VAD), which distinguishes speech from non-speech intervals, is widely used in voice communication systems, for example in speech coding, speech recognition, and acoustic echo cancellation. It is essential for using the bandwidth of a voice communication system efficiently. Detectors that apply the statistical model of speech presence and absence used in the minimum mean square error (MMSE) speech enhancement technique, which originated in the work of Ephraim and Malah, are known to perform well.

Because conventional voice activity detection algorithms decide whether speech is active based on statistical models of speech and noise, they detect speech fairly accurately when the signal-to-noise ratio is high, but their performance degrades sharply in relatively poor noise environments.

Non-negative matrix factorization, meanwhile, is an algorithm that performs well in pattern learning and pattern recognition, for example on images. Because it can separate optimal basis patterns from a large set of input data and approximate the whole data set as linear combinations of those patterns, it is also useful for feature extraction. However, no efficient voice activity detection method using non-negative matrix factorization has been proposed so far.

The present invention has been proposed to solve the above problems of previously proposed methods. Its object is to provide a voice activity detection method based on non-negative matrix factorization that performs well even in relatively poor noise environments with a low signal-to-noise ratio. To this end, the error between the basis vectors extracted from the noise signal and the input signal is calculated and used to identify the active speech intervals, while the noise environment is estimated from the distribution of error values in the noise estimation interval so that the optimal threshold can be selected and applied to the error.

To achieve the above object, a voice activity detection method based on non-negative matrix factorization according to a feature of the present invention comprises:

(1) deriving basis vectors from a noise signal and an input signal using non-negative matrix factorization;

(2) calculating the error between the noise-signal basis vectors and the input-signal basis vectors derived in step (1); and

(3) detecting speech by applying a threshold to the error calculated in step (2).

Preferably, an optimized threshold is obtained from the distribution of error values between the noise signal and the input signal, derived using non-negative matrix factorization, in the speech-absent interval estimated by a statistical model-based voice activity detection method, and speech is detected using this optimized threshold as the threshold of step (3).

More preferably, the optimized threshold may take a different value for each noise environment.

Preferably, the basis vectors may be the a posteriori signal-to-noise ratio (SNR) and the a priori SNR estimated using a statistical model.

According to the voice activity detection method based on non-negative matrix factorization proposed in the present invention, the error between the basis vectors extracted from the noise signal and the input signal is calculated and used to identify the active speech intervals. The noise environment is estimated from the distribution of error values in the noise estimation interval, and the optimal threshold is selected and applied to the error to detect speech. This enables high-performance speech detection even in relatively poor noise environments where the signal-to-noise ratio is low.

FIG. 1 is a flowchart of a voice activity detection method based on non-negative matrix factorization according to an embodiment of the present invention.
FIG. 2 shows the distribution of non-negative matrix factorization results in intervals estimated to be noise by a statistical model-based voice activity detection method.
FIG. 3 shows speech detection results based on non-negative matrix factorization with respect to the noise-to-speech signal ratio in a white noise environment.
FIG. 4 shows how the speech detection probability changes with the noise environment.

Hereinafter, preferred embodiments are described in detail with reference to the accompanying drawings so that a person of ordinary skill in the art can readily practice the present invention. In describing the preferred embodiments, detailed descriptions of related known functions or configurations are omitted where they would unnecessarily obscure the subject matter of the invention. Parts with similar functions and operations are given the same reference numerals throughout the drawings.

Throughout the specification, when a part is said to be "connected" to another part, this includes not only the case where it is "directly connected" but also the case where it is "indirectly connected" with another element in between. Likewise, saying that a part "includes" a component means that it may further include other components, not that other components are excluded, unless specifically stated otherwise.

FIG. 1 is a flowchart of a voice activity detection method based on non-negative matrix factorization according to an embodiment of the present invention. As shown in FIG. 1, the method may be implemented to include deriving basis vectors from a noise signal and an input signal (S100), calculating the error between the noise-signal basis vectors and the input-signal basis vectors (S200), and detecting speech (S400), and may further include obtaining an optimized threshold (S300).

That is, the method first derives basis vectors from the noise signal and the input signal using non-negative matrix factorization (S100). The error between the derived noise-signal basis vectors and input-signal basis vectors is calculated (S200), and speech is detected by applying a threshold to that error (S400). An optimized threshold may be used to improve detection performance. For this purpose, the optimized threshold can be obtained from the distribution of error values between the noise signal and the input signal, derived using non-negative matrix factorization, in the speech-absent interval estimated by a statistical model-based voice activity detection method (S300).

The details of the voice activity detection method based on non-negative matrix factorization according to an embodiment of the present invention are described below.

The method according to an embodiment of the present invention first requires an analysis of statistical model-based voice activity detection. In that approach, the input signal y(t), formed by adding a noise signal n(t) to the original speech signal x(t) in the time domain, is transformed into the frequency domain by the discrete Fourier transform (DFT) and expressed as in Equation 1 below.

Here, Y(t)=[Y₁,Y₂,…,Y_M], X(t)=[X₁,X₂,…,X_M], and N(t)=[N₁,N₂,…,N_M] are the DFT coefficient vectors of the noise-corrupted speech signal, the original speech signal, and the noise signal, respectively. If the hypotheses H₀ and H₁ denote the absence and presence of speech, they are written for each frequency channel as in Equation 2 below.
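The images for Equations 1 and 2 are not reproduced in this text. Under the additive-noise model just described, they presumably take the standard form below; this is a reconstruction from the surrounding definitions, with k used as an assumed frequency-channel index (k = 1, …, M).

```latex
\begin{align}
\mathbf{Y}(t) &= \mathbf{X}(t) + \mathbf{N}(t) \tag{1} \\
H_0 &: \; Y_k(t) = N_k(t), \qquad
H_1 : \; Y_k(t) = X_k(t) + N_k(t) \tag{2}
\end{align}
```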

Assuming that the spectra of the speech and noise signals follow a complex Gaussian distribution, the probability density functions conditioned on hypotheses H₀ and H₁ are given by Equation 3 below.

Here, λ_x,k and λ_d,k are the per-channel variances of speech and noise, respectively, and the likelihood ratio for the k-th frequency band is obtained as in Equation 4 below.
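The images for Equations 3 and 4 are likewise missing. For zero-mean complex Gaussian spectra, the conditional densities and their ratio have the well-known form below. It is offered as a likely reconstruction rather than a copy of the patent's equations, and it uses the a priori SNR ξ_k and a posteriori SNR γ_k that the text defines next.

```latex
\begin{align}
p(Y_k \mid H_0) &= \frac{1}{\pi \lambda_{d,k}}
  \exp\!\left(-\frac{|Y_k|^2}{\lambda_{d,k}}\right), \nonumber \\
p(Y_k \mid H_1) &= \frac{1}{\pi (\lambda_{x,k} + \lambda_{d,k})}
  \exp\!\left(-\frac{|Y_k|^2}{\lambda_{x,k} + \lambda_{d,k}}\right) \tag{3} \\
\Lambda_k &= \frac{p(Y_k \mid H_1)}{p(Y_k \mid H_0)}
  = \frac{1}{1 + \xi_k} \exp\!\left(\frac{\gamma_k \xi_k}{1 + \xi_k}\right) \tag{4}
\end{align}
```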

Here, ξ_k = λ_x,k/λ_d,k and γ_k = |Y_k|²/λ_d,k are the a priori signal-to-noise ratio (SNR) and the a posteriori SNR, respectively. The a posteriori SNR γ_k is estimated using the noise variance λ_d,k obtained from the noise signal, which is updated during speech-absent intervals, and the a priori SNR ξ_k is estimated with the decision-directed (DD) method as in Equation 5 below.

Here, the estimate of the magnitude of the k-th spectral component of the speech signal in the previous frame is obtained with the MMSE estimator. α is a weighting factor, and the operator P[·] is defined as in Equation 6 below.
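The decision-directed estimate can be sketched as follows. The sketch assumes the usual Ephraim-Malah form, ξ_k(t) = α·|X̂_k(t−1)|²/λ_d,k(t−1) + (1−α)·P[γ_k(t)−1], with P[x] = x for x ≥ 0 and 0 otherwise. The function names and the default α = 0.98 are illustrative assumptions, not values taken from the patent.

```python
def half_wave(x):
    """Operator P[x]: pass non-negative values, clamp negatives to zero."""
    return x if x >= 0.0 else 0.0


def decision_directed_snr(prev_speech_power, prev_noise_var, gamma, alpha=0.98):
    """Per-bin a priori SNR estimate for the current frame.

    prev_speech_power: |X_hat_k(t-1)|^2, squared MMSE amplitude estimate
    prev_noise_var:    lambda_{d,k}(t-1), noise variance of the previous frame
    gamma:             a posteriori SNR gamma_k(t) of the current frame
    alpha:             weighting factor (0.98 is an assumed typical value)
    """
    return [
        alpha * s / n + (1.0 - alpha) * half_wave(g - 1.0)
        for s, n, g in zip(prev_speech_power, prev_noise_var, gamma)
    ]


# Two-bin example: one bin with speech energy, one without.
xi = decision_directed_snr([4.0, 0.0], [1.0, 1.0], [6.0, 0.5], alpha=0.9)
```

The second bin has γ_k − 1 < 0, so P[·] clamps its contribution to zero and the a priori SNR estimate stays non-negative.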

The decision rule of a conventional statistical model-based voice activity detector takes the geometric mean of the likelihood ratios obtained in each frequency channel and decides whether speech is present as in Equation 7 below.

Here, M is the total number of frequency bands, and η is the speech detection threshold.
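A geometric mean of likelihood ratios is an arithmetic mean of their logarithms, so the decision can be sketched as below. The sketch assumes the Gaussian-model likelihood ratio, for which log Λ_k = γ_k·ξ_k/(1+ξ_k) − log(1+ξ_k). The function names and the threshold value 0.5 are hypothetical.

```python
import math


def log_likelihood_ratio(gamma, xi):
    """Mean over M bins of log Lambda_k = gamma*xi/(1+xi) - log(1+xi)."""
    terms = [g * x / (1.0 + x) - math.log(1.0 + x) for g, x in zip(gamma, xi)]
    return sum(terms) / len(terms)


def lrt_is_speech(gamma, xi, eta):
    """Declare speech (H1) when the mean log-likelihood ratio exceeds eta."""
    return log_likelihood_ratio(gamma, xi) > eta


# Noise-like frame: a priori SNR of zero gives a log ratio of zero.
noise_llr = log_likelihood_ratio([1.0, 1.0], [0.0, 0.0])
# Speech-like frame: high SNR in every bin.
speech_llr = log_likelihood_ratio([10.0, 10.0], [9.0, 9.0])
```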

In the present invention, the a posteriori SNR estimated using the noise variance λ_d,k as described above and the a priori SNR estimated using the DD method may be used as the basis vectors.

In step S100, basis vectors are derived from the noise signal and the input signal using non-negative matrix factorization.

Non-negative matrix factorization (NMF), like PCA and VQ, decomposes a matrix by approximating it as linear combinations of basis vectors. What distinguishes NMF is the constraint that every component is non-negative. An n×m matrix V with non-negative components is factorized in the form V ≈ WH, as in Equation 8 below.

Here, the matrix V can be viewed as m column vectors of size n×1, the matrix W of size n×r is likewise a set of r vectors of size n×1, and H is an r×m coefficient matrix. In other words, the m data vectors of dimension n are expressed as linear combinations of r basis vectors of dimension n.

In general, non-negative matrix factorization updates the matrices W and H iteratively so as to minimize the distance between V and WH. The update rule for the components therefore depends on which distance function, or objective function, is chosen. A widely used objective function is the Euclidean distance between V and WH, and the update rule that minimizes it is given by Equation 9 below.
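The image for Equation 9 is not reproduced in this text. For the Euclidean objective, the usual choice is the multiplicative update of Lee and Seung, H ← H ∘ (WᵀV)/(WᵀWH) and W ← W ∘ (VHᵀ)/(WHHᵀ), which is presumably what the equation shows. The following pure-Python sketch illustrates those updates. The function names, iteration count, random initialization, and the small constant eps are assumptions made for illustration.

```python
import random


def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def frob_sq(v, w, h):
    """Squared Euclidean (Frobenius) distance between V and WH."""
    wh = matmul(w, h)
    return sum((x - y) ** 2 for rv, rw in zip(v, wh) for x, y in zip(rv, rw))


def nmf(v, r, iters=200, eps=1e-9, seed=0):
    """Factorize non-negative V (n x m) into W (n x r) and H (r x m)."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + eps for _ in range(r)] for _ in range(n)]
    h = [[rng.random() + eps for _ in range(m)] for _ in range(r)]
    for _ in range(iters):
        # H <- H * (W^T V) / (W^T W H), element-wise
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[a][j] * num[a][j] / (den[a][j] + eps) for j in range(m)]
             for a in range(r)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][a] * num[i][a] / (den[i][a] + eps) for a in range(r)]
             for i in range(n)]
    return w, h


V = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.5], [0.5, 1.0, 1.0], [3.0, 5.0, 7.0]]
W, H = nmf(V, r=2)
```

Because each update only multiplies by non-negative ratios, W and H stay non-negative without an explicit projection step. In practice a numerical library would replace the list-based helpers.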

In the present invention, the a posteriori SNR estimated using the noise variance λ_d,k as described above and the a priori SNR estimated using the DD method are taken as the basis vectors. The first m+9 frames are assumed to be a speech-absent interval, and superframes of m frames each are factorized by NMF to extract the basis vectors W of the background noise signal, shifting by one frame at a time.
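Shifting an m-frame window by one frame across m+9 frames yields exactly 10 superframes, which matches the 10 basis vectors averaged later in the description. The sketch below shows only the windowing, since each resulting n×m matrix would then be passed to an NMF routine. The helper name `superframes` and the placeholder feature values are hypothetical; n = 32 and m = 5 follow the experimental settings.

```python
def superframes(frames, m):
    """Yield n x m matrices whose columns are m consecutive frames (hop = 1)."""
    for start in range(len(frames) - m + 1):
        window = frames[start:start + m]
        # Transpose so that rows index features and columns index frames.
        yield [list(row) for row in zip(*window)]


m = 5
# Placeholder features: m + 9 = 14 leading frames, each a 32-dimensional
# vector standing in for the a priori and a posteriori SNR values.
leading_frames = [[float(t)] * 32 for t in range(m + 9)]
noise_superframes = list(superframes(leading_frames, m))
```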

In step S200, the error between the noise-signal basis vectors and the input-signal basis vectors derived in step S100 is calculated. Non-negative matrix factorization approximates data vectors as linear combinations of basis vectors. Therefore, if basis vectors are extracted from the noise signal in advance, the distance between them and the basis vectors of the input signal indicates how similar the input signal is to the noise signal. By calculating this error and judging similarity from its magnitude, speech can be detected in the input signal.
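The text does not reproduce the exact distance measure, so the sketch below assumes a plain Euclidean (Frobenius) distance between two n×r basis matrices. The function name is hypothetical.

```python
import math


def basis_distance(w_noise, w_input):
    """Euclidean distance between two equally sized basis matrices."""
    return math.sqrt(sum((a - b) ** 2
                         for row_a, row_b in zip(w_noise, w_input)
                         for a, b in zip(row_a, row_b)))


same = basis_distance([[1.0, 2.0], [3.0, 4.0]], [[1.0, 2.0], [3.0, 4.0]])
far = basis_distance([[0.0, 0.0]], [[3.0, 4.0]])
```

One caveat with this simple measure is that NMF does not fix the order or scale of the columns of W, so an implementation may need consistent initialization or column matching before comparing bases element by element.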

In step S300, an optimized threshold is obtained from the distribution of error values between the noise signal and the input signal, derived using non-negative matrix factorization, in the speech-absent interval estimated by a statistical model-based voice activity detection method.

The present invention introduces a statistical model-based voice activity detector in order to apply thresholds optimized for different noise environments. To recognize the noise environment from the NMF results and apply the corresponding optimal threshold, speech intervals should be excluded as far as possible. To estimate the noise interval effectively, the statistical model-based detector is applied to the first N frames, and the noise environment is classified by the mean of the NMF results over the frames judged to be non-speech.

FIG. 2 shows the distribution of non-negative matrix factorization results in intervals estimated to be noise by the statistical model-based voice activity detection method. As FIG. 2 shows, the distributions of NMF results are clearly separable by noise environment. The threshold optimized for each noise environment is therefore applied as in Equation 10 below.

Here, N is the number of frames used to estimate the noise environment, and b is the number of frames among those N for which log Λ(t) < β. Notably, applying a threshold optimized for the noise environment with the help of the statistical model-based detection algorithm yields better speech detection results. In addition, the approach can be applied not only to detect speech but also to detect a specific noise environment for post-processing.
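Equation 10 itself is not reproduced in this text, so the following is only a sketch of the selection logic as described: average the NMF error over the b frames whose log-likelihood ratio falls below β, then look up the threshold for the matching noise environment. The lookup table, its boundary values, and the fallback threshold are hypothetical placeholders, not values from the patent.

```python
def select_threshold(log_lrs, nmf_errors, beta, env_table, default_eta):
    """Pick a detection threshold from the mean NMF error of noise-only frames.

    log_lrs:     log Lambda(t) for the first N frames (statistical VAD output)
    nmf_errors:  NMF distance for the same N frames
    beta:        frames with log Lambda(t) < beta are treated as non-speech
    env_table:   (upper_bound, eta) pairs sorted by upper_bound; hypothetical
    default_eta: fallback when no frame or no table entry qualifies
    """
    noise_errors = [d for llr, d in zip(log_lrs, nmf_errors) if llr < beta]
    b = len(noise_errors)
    if b == 0:
        return default_eta
    mean_error = sum(noise_errors) / b
    for upper_bound, eta in env_table:
        if mean_error < upper_bound:
            return eta
    return default_eta


# Hypothetical table: two noise environments separated at mean errors 0.5 and 1.5.
table = [(0.5, 0.8), (1.5, 2.0)]
eta = select_threshold([-1.0, 3.0, -0.5], [0.2, 9.0, 0.4], 0.0, table, 5.0)
```

The middle frame is excluded because its log-likelihood ratio exceeds β, which keeps a speech frame with a large error from skewing the noise-environment estimate.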

In step S400, speech is detected by applying a threshold to the error calculated in step S200. Using the optimized threshold η derived in step S300 can improve the performance of the voice activity detection method according to an embodiment of the present invention.

The distance between the average of the 10 noise basis vectors and the basis vectors of the input signal is calculated, and speech activity is decided as follows. In particular, the optimized threshold η derived in step S300 can be applied to Equation 12 below to make the final speech decision.

Because the extracted basis vectors come from the background noise signal, the distance is relatively small when the input resembles them, and relatively very large for a speech signal, whose characteristics differ from those of the noise.
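The decision step can be sketched as below, assuming an element-wise mean of the noise basis matrices and a Euclidean distance compared against η. Equations 11 and 12 are not reproduced in this text, so both choices are assumptions, and the function names are hypothetical.

```python
def mean_basis(bases):
    """Element-wise mean of equally sized n x r basis matrices."""
    count = len(bases)
    rows, cols = len(bases[0]), len(bases[0][0])
    return [[sum(b[i][j] for b in bases) / count for j in range(cols)]
            for i in range(rows)]


def nmf_is_speech(w_noise_mean, w_input, eta):
    """Declare speech when the input basis is farther than eta from the noise basis."""
    dist_sq = sum((a - b) ** 2
                  for row_a, row_b in zip(w_noise_mean, w_input)
                  for a, b in zip(row_a, row_b))
    return dist_sq ** 0.5 > eta


# Two toy 1 x 2 noise bases; the patent averages 10 of them.
w_bar = mean_basis([[[1.0, 1.0]], [[3.0, 3.0]]])
```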

FIG. 3 shows speech detection results based on non-negative matrix factorization with respect to the noise-to-speech signal ratio in a white noise environment. The result shown is for white noise at an SNR of 5 dB. FIG. 3 plots the distance between the basis vectors of the noise signal and the basis vectors W of the input signal in the white noise environment.

FIG. 4 shows how the speech detection probability changes with the noise environment. It plots the speech detection error probability P_e of the NMF-based detector as a function of the threshold η'. As shown in FIG. 2, the NMF results have a different optimal threshold for each noise environment, so for best performance a threshold should be provided per noise environment. If the input signal came from the same noise environment, its difference from the noise basis vectors extracted from the initial frames assumed to be speech-absent should be close to zero, but in practice it is not. This is because the difference between the noise-signal basis vectors and the input-signal basis vectors varies with how stationary or non-stationary the noise is.

Experimental results

To evaluate the proposed voice activity detection algorithm, the speech detection error probability P_e, which includes false alarms and misses, was measured. The test data consisted of 230 seconds of clean speech in which speech and non-speech segments were labeled manually every 10 ms. Speech made up 57.1% of the labeled data: 44.0% voiced and 13.1% unvoiced. To create noisy conditions, white, factory1, and destroyer ops noise from the NOISEX-92 database and street noise from the ETRI database were added to the original speech at SNRs of 0, 5, 10, and 15 dB. The feature vector used to obtain the basis vectors in NMF consisted of the a priori SNR ξ_k and the a posteriori SNR γ_k, 32 dimensions in total, with m=5 and r=3. The noise basis was the average of 10 basis vectors obtained by shifting one frame at a time. To estimate each noise type, the distribution of NMF results over 50 frames was used: the noise environment was classified by the mean of the distribution, and the threshold was set accordingly.
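The two measurement steps described here, mixing noise at a target SNR and scoring frame decisions, can be sketched as follows. The sketch assumes that P_e is the fraction of frames that are either false alarms or misses, and that SNR is defined on average signal power. Both function names are hypothetical.

```python
import math


def mix_at_snr(speech, noise, snr_db):
    """Scale noise so that speech power / noise power equals the target SNR."""
    speech_power = sum(s * s for s in speech) / len(speech)
    noise_power = sum(n * n for n in noise) / len(noise)
    gain = math.sqrt(speech_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return [s + gain * n for s, n in zip(speech, noise)]


def error_probability(reference, decisions):
    """Fraction of frames that are false alarms or misses."""
    false_alarms = sum(1 for r, d in zip(reference, decisions) if not r and d)
    misses = sum(1 for r, d in zip(reference, decisions) if r and not d)
    return (false_alarms + misses) / len(reference)


clean = [1.0, -1.0, 1.0, -1.0]
noisy = mix_at_snr(clean, [0.5, 0.5, -0.5, -0.5], snr_db=10.0)
p_e = error_probability([True, True, False, False], [True, False, True, False])
```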

The results of the speech detection experiments are shown in Table 1 below. They show a marked performance improvement at low SNR when the optimized threshold is applied to the NMF-based detector. This suggests that the NMF-based detector is less affected by the signal-to-noise ratio than the statistical model-based detector, and an average improvement of 6.75% in speech detection performance was confirmed.

The present invention described above may be modified or applied in various ways by a person of ordinary skill in the art, and the scope of its technical idea is to be determined by the claims below.

S100: deriving basis vectors from the noise signal and the input signal
S200: calculating the error between the noise-signal basis vectors and the input-signal basis vectors
S300: obtaining the optimized threshold
S400: detecting speech

Claims

A voice activity detection method comprising:
(1) deriving basis vectors from a noise signal and an input signal using non-negative matrix factorization;
(2) calculating the error between the noise-signal basis vectors and the input-signal basis vectors derived in step (1); and
(3) detecting speech by applying a threshold to the error calculated in step (2),
wherein an optimized threshold is obtained from the distribution of error values between the noise signal and the input signal, derived using non-negative matrix factorization, in the speech-absent interval estimated by a statistical model-based voice activity detection method, and
wherein speech is detected using the optimized threshold as the threshold of step (3).

delete

The method of claim 1, wherein the optimized threshold has a different value for each noise environment.

The method of claim 1, wherein the basis vectors are the a posteriori signal-to-noise ratio (SNR) and the a priori SNR estimated using a statistical model.