KR20150078831A

KR20150078831A - Method and system for speech enhancement using non-negative matrix factorization and basis matrix update

Info

Publication number: KR20150078831A
Application number: KR1020130168578A
Authority: KR
Inventors: 김남수; 권기수
Original assignee: 서울대학교산학협력단
Priority date: 2013-12-31
Filing date: 2013-12-31
Publication date: 2015-07-08
Also published as: KR101535135B1

Abstract

The present invention relates to a method and a system for speech enhancement. More specifically, the method includes the steps of: (1) deriving a complex-valued first signal (pre-enhanced signal) from a sound signal in which noise and speech are mixed; (2) estimating signal-to-noise ratio (SNR) values and obtaining an MMSE-LSA gain function from them to derive a second signal; and (3) updating a basis matrix by using the derived second signal.

Description

TECHNICAL FIELD [0001] The present invention relates to a method and system for speech enhancement using non-negative matrix factorization and basis matrix update.

The present invention relates to a speech enhancement method and system, and more particularly, to a speech enhancement method and system using non-negative matrix factorization and basis matrix update.

In general, a received input signal contains various kinds of noise in addition to the sound a listener wants to hear. When such signals are mixed, the noise interferes with perception of the desired sound and degrades intelligibility and clarity, so methods for improving the quality of noisy acoustic signals have been developed. Statistical-model-based speech enhancement techniques model the target speech and the noise to be removed from the input signal with separate statistical models, and perform enhancement in every time frame by combining methods such as voice activity detection and noise power tracking (see Patent Application No. 10-2006-0095820). These methods operate under the assumption that the noise is stationary, that is, that it changes slowly over time. Because such techniques estimate the statistical characteristics of noise and speech from the noisy input signal itself, they require no previously trained data. However, since they rely on the assumption that the noise characteristics change slowly, their enhancement performance drops sharply in non-stationary environments where the ambient noise changes abruptly.

In contrast, template-based speech enhancement techniques use pre-trained speech and noise patterns or statistical data. They remove noise from the input signal using speech characteristics and noise information known in advance through training. Template-based techniques deliver robust enhancement performance even when the ambient noise changes rapidly, but they require pre-trained speech and noise data, and their performance degrades if the trained noise model differs from the noise in the actual environment.

Non-negative matrix factorization (NMF), meanwhile, separates out the common parts (bases) shared by the individual items in a given data set. If the actual data set is V and the matrices to be separated are W and H, then V = WH is satisfied, where W is the basis matrix and H is the encoding matrix. V can be reconstructed as a weighted sum of the columns of W, that is, of the bases. In other words, NMF extracts optimal basis patterns from a large number of input data and approximates the whole data set by their linear combination, which makes it useful for feature extraction. The present invention proposes a method that combines the statistical-model-based and template-based speech enhancement techniques so as to overcome the drawbacks of each and achieve higher enhancement performance than existing methods.
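As an illustration of the factorization V ≈ WH described above, the following minimal Python sketch applies the widely used multiplicative update rules for the squared Euclidean cost. The choice of cost function, the function names (`matmul`, `transpose`, `frob_error`, `nmf`), the iteration count, and the random initialization are assumptions made for this example; the specification does not state which NMF cost or update rule is used.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*a)]


def frob_error(v, w, h):
    """Squared Frobenius norm of V - WH."""
    wh = matmul(w, h)
    return sum((x - y) ** 2 for rv, rw in zip(v, wh) for x, y in zip(rv, rw))


def nmf(v, r, n_iter=200, eps=1e-9, seed=0):
    """Factor a non-negative n x m matrix V into W (n x r) and H (r x m).

    Assumption: squared Euclidean cost with multiplicative updates.
    """
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + eps for _ in range(r)] for _ in range(n)]
    h = [[rng.random() + eps for _ in range(m)] for _ in range(r)]
    for _ in range(n_iter):
        # H <- H * (W^T V) / (W^T W H), element-wise
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)]
             for i in range(r)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(r)]
             for i in range(n)]
    return w, h
```

Because every update multiplies by a ratio of non-negative quantities, W and H stay non-negative throughout, which is the property that lets the columns of W be read as additive spectral patterns.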

The present invention has been proposed to solve the above problems of previously proposed methods. One object of the invention is to provide a speech enhancement method and system using non-negative matrix factorization and basis matrix update that achieves high enhancement performance by: deriving a first signal (pre-enhanced signal), obtained by converting an acoustic signal in which noise and speech are mixed into complex values using a statistical-model-based speech enhancement technique; obtaining a priori and a posteriori signal-to-noise ratio (SNR) values from the speech and noise estimated from the first signal (values obtained from the first signal) on the basis of non-negative matrix factorization (NMF); and deriving a second signal using an MMSE-LSA gain function.

Another object of the invention is to provide such a method and system in which the basis matrix to be used for the non-negative matrix factorization performed in the next time frame is updated using the second signal, so that a correct noise model can be maintained as the initial value, and in which the update speed is determined through estimation of the speech presence probability (SPP), so that the update rate is automatically computed and applied according to how fast the noise environment changes, thereby preventing adverse effects such as overfitting caused by unnecessarily frequent updates.

Yet another object of the invention is to provide such a method and system which, by using the MMSE-LSA gain function, achieves more stable performance than a conventional Wiener-filter-type gain function, and which, because the speech and noise magnitudes are estimated separately, computes the individual powers with a simple smoothing technique instead of estimating noise and speech power with the conventional decision-directed (DD) approach, thereby further improving the enhancement effect.

To achieve the above objects, a speech enhancement method using non-negative matrix factorization and basis matrix update according to a feature of the present invention comprises:

(1) deriving a first signal (pre-enhanced signal) obtained by converting an acoustic signal in which noise and speech are mixed into complex values using a statistical-model-based speech enhancement technique;

(2) estimating signal-to-noise ratio (SNR) values using values obtained from the first signal on the basis of non-negative matrix factorization (NMF), and deriving a second signal by obtaining an MMSE-LSA gain function from the estimated SNR values; and

(3) updating, using the second signal derived in step (2), the basis matrix to be used for the non-negative matrix factorization of step (2) performed in the next time frame.

Preferably, step (2) may comprise:

(2-1) obtaining the absolute value of the first signal, and estimating an encoding matrix using a pre-trained noise basis matrix and a pre-trained speech basis matrix;

(2-2) estimating a priori and a posteriori signal-to-noise ratio (SNR) values using the encoding matrix estimated in step (2-1) and the basis matrix updated in step (3) of the previous time frame; and

(2-3) deriving the second signal using the SNR values estimated in step (2-2) and the MMSE-LSA gain function.

Preferably, in step (2), a smoothing technique may be applied to estimate the SNR values.

Preferably, in step (3), both the noise model and the speech model may be updated simultaneously in every frame, with the update computed separately for each frequency.

Preferably, step (3) may comprise:

(3-1) estimating a speech presence probability (SPP) value in a predetermined frequency bin using the second signal; and

(3-2) determining the speed at which the basis matrix is updated using the speech presence probability value.

More preferably, step (3-2) may use the reconstruction error as an indicator and perform the computation using a sigmoid function.
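The idea of mapping a smoothed reconstruction error to an update rate through a sigmoid can be sketched as follows. The exact formulas are not reproduced in this text, so everything here is an assumption for illustration: the relative squared error used as the indicator, the parameter names (`te`, `alpha_max`, `slope`, `midpoint`), and the particular sigmoid parameterization are hypothetical. The sketch only shows a non-decreasing mapping that saturates at a maximum rate.

```python
import math


def reconstruction_error(v, v_hat, floor=1e-12):
    """Relative squared error between observed magnitudes v and reconstruction v_hat.

    Assumption: normalizing by the energy of v; the specification's indicator may differ.
    """
    num = sum((a - b) ** 2 for a, b in zip(v, v_hat))
    return num / max(sum(a * a for a in v), floor)


def smooth(prev, current, te):
    """Recursive smoothing with constant te in [0, 1): te * prev + (1 - te) * current."""
    return te * prev + (1.0 - te) * current


def update_rate(smoothed_error, alpha_max, slope, midpoint):
    """Sigmoid mapping: non-decreasing in the error, bounded above by alpha_max."""
    z = slope * (smoothed_error - midpoint)
    z = max(-60.0, min(60.0, z))  # clamp to keep math.exp well within range
    return alpha_max / (1.0 + math.exp(-z))
```

With a mapping of this shape, a basis that already reconstructs the input well (small error) is updated slowly, while a poorly fitting basis is updated at a rate approaching `alpha_max`.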

To achieve the above objects, a speech enhancement system using non-negative matrix factorization and basis matrix update according to a feature of the present invention comprises:

a first signal derivation module that derives a first signal (pre-enhanced signal) obtained by converting an acoustic signal in which noise and speech are mixed into complex values using a statistical-model-based speech enhancement technique;

a second signal derivation module that estimates signal-to-noise ratio (SNR) values using values obtained from the first signal on the basis of non-negative matrix factorization (NMF), and derives a second signal by obtaining an MMSE-LSA gain function from the estimated SNR values; and

a basis matrix update module that updates, using the second signal, the basis matrix to be used for the non-negative matrix factorization performed in the next time frame.

Preferably, the basis matrix update module may comprise:

an SPP estimation module that estimates a speech presence probability (SPP) value in a predetermined frequency bin using the second signal; and

an update speed determination module that determines the speed at which the basis matrix is updated using the speech presence probability value.

According to the speech enhancement method and system using non-negative matrix factorization and basis matrix update proposed in the present invention, a first signal (pre-enhanced signal) is derived by converting an acoustic signal in which noise and speech are mixed into complex values using a statistical-model-based speech enhancement technique; a priori and a posteriori signal-to-noise ratio (SNR) values are then obtained from the speech and noise estimated from the first signal (values obtained from the first signal) on the basis of non-negative matrix factorization (NMF); and a second signal is derived using the MMSE-LSA gain function. This provides high-performance speech enhancement.

In addition, according to the present invention, the basis matrix to be used for the non-negative matrix factorization performed in the next time frame is updated using the second signal, so that a correct noise model can be maintained as the initial value. The update speed is determined through estimation of the speech presence probability (SPP), so that the update rate is automatically computed and applied according to how fast the noise environment changes, which prevents adverse effects such as overfitting caused by unnecessarily frequent updates.

Furthermore, according to the present invention, using the MMSE-LSA gain function yields more stable performance than a conventional Wiener-filter-type gain function. Because the speech and noise magnitudes are estimated separately, the individual powers are obtained with a simple smoothing technique rather than by estimating noise and speech power with the conventional decision-directed (DD) approach, which further improves the enhancement effect.

Brief Description of the Drawings
FIG. 1 is a flowchart of a speech enhancement method using non-negative matrix factorization and basis matrix update according to an embodiment of the present invention.
FIG. 2 is a flowchart of a speech enhancement method using non-negative matrix factorization and basis matrix update according to another embodiment of the present invention.
FIG. 3 is a schematic diagram of the flow of a speech enhancement method using non-negative matrix factorization and basis matrix update according to an embodiment of the present invention.
FIG. 4 is a diagram of a speech enhancement system using non-negative matrix factorization and basis matrix update according to an embodiment of the present invention.
FIG. 5 is a diagram of a speech enhancement system using non-negative matrix factorization and basis matrix update according to another embodiment of the present invention.

Hereinafter, preferred embodiments will be described in detail with reference to the accompanying drawings so that those of ordinary skill in the art to which the present invention pertains can readily practice the invention. In describing the preferred embodiments, detailed descriptions of related well-known functions or configurations are omitted where they would unnecessarily obscure the gist of the invention. Parts having similar functions and operations are denoted by the same or similar reference numerals throughout the drawings.

In addition, throughout the specification, when a part is said to be 'connected' to another part, this includes not only the case where it is 'directly connected' but also the case where it is 'indirectly connected' with another element in between. Also, 'including' a component means that other components may further be included, rather than excluded, unless specifically stated otherwise.

FIG. 1 is a flowchart of a speech enhancement method using non-negative matrix factorization and basis matrix update according to an embodiment of the present invention. As shown in FIG. 1, the method may comprise: deriving a first signal (pre-enhanced signal) obtained by converting an acoustic signal in which noise and speech are mixed into complex values using a statistical-model-based speech enhancement technique (S100); estimating signal-to-noise ratio (SNR) values using values obtained from the first signal on the basis of non-negative matrix factorization (NMF), and deriving a second signal by obtaining an MMSE-LSA gain function from the estimated SNR values (S200); and updating, using the second signal derived in step S200, the basis matrix to be used for the non-negative matrix factorization of step S200 performed in the next time frame (S300).

Step S100 is a pre-enhancement process in which the acoustic signal containing mixed noise and speech is given a first-pass enhancement using an existing statistical-model-based speech enhancement technique. The noisy signal Y(t) in frame t is converted into complex values by the statistical-model-based technique, and the resulting signal is referred to as the first signal Y'(t). The absolute value of the first signal Y'(t) is V, and V can be expressed through the basis matrices Ws(t) and Wn(t) as V = [Ws, Wn]H.

In step S200, signal-to-noise ratio (SNR) values can be estimated using the values (V, H) obtained from the first signal Y'(t) on the basis of non-negative matrix factorization (NMF). An MMSE-LSA gain function can be obtained from the estimated SNR values, and the second signal (enhanced speech) can be obtained using the gain function. NMF separates out the common parts (bases) shared by the individual items in a given data set. If the actual data set is V and the matrices to be separated are W and H, then V = WH is satisfied, where W is the basis matrix and H is the encoding matrix. V can be reconstructed as a weighted sum of the columns of W, that is, of the bases. V is an (n×m) matrix, W is an (n×r) matrix, and H is an (r×m) matrix, where n is the number of frequency bins, m is the number of time frames, and r is the number of bases. For a noisy signal, W consists of a speech basis matrix Ws (M×rs) and a noise basis matrix Wn (M×rn); since W = [Ws, Wn], the row count M here equals the number of frequency bins n. That is, the speech basis matrix obtained through training is denoted Ws and the noise basis matrix Wn. The speech basis Ws(t) and the noise basis Wn(t) in frame t are analyzed from the pre-enhanced signal of the t-th frame. Steps S100 and S200 thus constitute the process of obtaining noise-removed speech from the noisy input acoustic signal.

In step S300, the final enhancement result of step S200 can be used to update the basis matrix to be used for the non-negative matrix factorization performed in the next time frame. Specifically, the probability that speech is present in the frame (SPP) is computed, and the SPP value can be used to determine the basis matrix update speed or to derive the update values. Steps S100 to S200 are then performed using the basis matrices (Wn, Ws) obtained in step S300, so that a correct noise model can be maintained as the initial value.

FIG. 2 is a flowchart of a speech enhancement method using non-negative matrix factorization and basis matrix update according to another embodiment of the present invention. As shown in FIG. 2, in this embodiment step S200 may comprise: obtaining the absolute value of the first signal and estimating the encoding matrix using the pre-trained noise basis matrix and speech basis matrix (S210); estimating a priori and a posteriori signal-to-noise ratio (SNR) values using the encoding matrix estimated in step S210 and the basis matrix updated in step S300 of the previous time frame (S220); and deriving the second signal using the SNR values estimated in step S220 and the MMSE-LSA gain function (S230). That is, H is initially estimated from the pre-trained noise and speech basis matrices (Wn, Ws), and thereafter from the noise and speech basis matrices (Wn, Ws) updated in step S300 (see Equation 1 below).

V(t) is formed from Y'(t) (V(t) is the absolute value of Y'(t)) and is used for the NMF-based enhancement in the t-th frame. In every frame, the encoding matrix H(t) is given a random initial value. However, the initialization differs between speech and noise. In the frequency domain, speech is relatively similar to its value in the previous frame, so initializing Hs(t) with the previous frame's value Hs(t-1) improves performance. For noise, by contrast, initializing with the previous value degrades performance, so only Hn(t) is set to a random initial value in every frame.
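The initialization strategy described above, warm-starting the speech part of the encoding from the previous frame while redrawing the noise part at random, can be sketched as follows. The function name `init_encoding`, its parameters, and the small positive `floor` are hypothetical; the floor is an added assumption motivated by the fact that a multiplicative NMF update can never move an entry away from exactly zero.

```python
import random


def init_encoding(rs, rn, hs_prev=None, rng=None, floor=1e-6):
    """Return initial (Hs, Hn) for one frame.

    rs, rn:   number of speech and noise bases.
    hs_prev:  Hs(t-1) as a list of rs floats, or None for the first frame.
    rng:      optional random.Random instance for reproducibility.
    floor:    assumed small positive lower bound on initial values.
    """
    if rng is None:
        rng = random.Random()
    if hs_prev is None:
        # First frame: nothing to reuse, so draw the speech encoding at random too.
        hs0 = [rng.random() + floor for _ in range(rs)]
    else:
        # Later frames: reuse the previous speech encoding as the starting point.
        hs0 = [max(x, floor) for x in hs_prev]
    # The noise encoding is redrawn at random in every frame.
    hn0 = [rng.random() + floor for _ in range(rn)]
    return hs0, hn0
```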

Estimated S (speech) and N (noise) values can be obtained using the estimated H and the updated noise and speech basis matrices (Wn, Ws) (see Equation 2 below). For real-time operation in every time frame, the reconstruction at time t can be written as V(t) = [Ws, Wn][Hs; Hn]. Here Hs and Hn are each (r×1) matrices, and H(t) = [Hs; Hn]. Once the estimation for a frame is complete, estimates of the speech and noise magnitudes are obtained as follows.
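A minimal sketch of this per-frame estimation is given below: the bases are held fixed, only the encoding H(t) is fitted to V(t), and the speech and noise magnitudes are then reconstructed separately from their own bases. The multiplicative update for the squared Euclidean cost, the function name `estimate_speech_noise`, and the iteration count are assumptions for illustration; Equations 1 and 2 are not reproduced in this text and may differ.

```python
def estimate_speech_noise(v, ws, wn, h0, n_iter=200, eps=1e-12):
    """Fit H(t) with fixed bases, then split the reconstruction.

    v:   magnitude spectrum V(t) = |Y'(t)|, list of n floats.
    ws:  speech basis, n rows of rs floats.
    wn:  noise basis, n rows of rn floats.
    h0:  initial encoding [Hs; Hn], list of rs + rn floats.
    Returns (s_hat, n_hat, h).
    """
    n = len(v)
    rs = len(ws[0])
    # W = [Ws, Wn]: concatenate the two bases column-wise.
    w = [row_s + row_n for row_s, row_n in zip(ws, wn)]
    h = list(h0)
    for _ in range(n_iter):
        wh = [sum(wij * hj for wij, hj in zip(row, h)) for row in w]           # W h
        num = [sum(w[i][j] * v[i] for i in range(n)) for j in range(len(h))]   # W^T v
        den = [sum(w[i][j] * wh[i] for i in range(n)) for j in range(len(h))]  # W^T W h
        # Simultaneous multiplicative update of every entry of h.
        h = [hj * nj / (dj + eps) for hj, nj, dj in zip(h, num, den)]
    hs, hn = h[:rs], h[rs:]
    s_hat = [sum(a * b for a, b in zip(row, hs)) for row in ws]  # S = Ws Hs
    n_hat = [sum(a * b for a, b in zip(row, hn)) for row in wn]  # N = Wn Hn
    return s_hat, n_hat, h
```

For example, with `ws = [[1.0], [1.0], [0.0]]`, `wn = [[0.0], [1.0], [1.0]]` and `v = [2.0, 5.0, 3.0]`, the observation is exactly 2 times the speech basis plus 3 times the noise basis, so the fitted encoding approaches those two weights.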

Signal-to-noise ratio (SNR) values can be obtained from the S (speech) and N (noise) values (see Equation 3 below). After this process, the estimated noise still depends to some extent on S, so using these estimates directly as the result is problematic for performance; a gain function is therefore computed and applied instead. The present invention uses the MMSE-LSA gain function (see Equation 4 below).
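Equation 4 is not reproduced in this text. As a hedged illustration, the sketch below implements the standard MMSE log-spectral amplitude (LSA) gain, G = ξ/(1+ξ) · exp(½ · E1(v)) with v = ξγ/(1+ξ), where ξ is the a priori SNR, γ is the a posteriori SNR, and E1 is the exponential integral. The helper `exp1`, its series and continued-fraction evaluation, and the lower clamp on v are implementation choices made for this example, not details taken from the specification.

```python
import math

EULER_GAMMA = 0.5772156649015329


def exp1(x, max_iter=200, tol=1e-12):
    """Exponential integral E1(x) = integral from x to infinity of exp(-t)/t dt, x > 0."""
    if x <= 0.0:
        raise ValueError("x must be positive")
    if x <= 1.0:
        # Power series: E1(x) = -gamma - ln(x) - sum_{k>=1} (-x)^k / (k * k!)
        total, term = 0.0, 1.0
        for k in range(1, max_iter + 1):
            term *= -x / k          # term now equals (-x)^k / k!
            total -= term / k
            if abs(term / k) < tol:
                break
        return -EULER_GAMMA - math.log(x) + total
    # Continued fraction evaluated with the modified Lentz method for x > 1.
    b = x + 1.0
    c = 1.0 / 1e-300
    d = 1.0 / b
    h = d
    for i in range(1, max_iter + 1):
        a = -float(i * i)
        b += 2.0
        d = 1.0 / (a * d + b)
        c = b + a / c
        delta = c * d
        h *= delta
        if abs(delta - 1.0) < tol:
            break
    return h * math.exp(-x)


def mmse_lsa_gain(xi, gamma):
    """Standard MMSE-LSA gain for a priori SNR xi and a posteriori SNR gamma."""
    v = max(xi * gamma / (1.0 + xi), 1e-12)  # clamp (assumption) so that E1 stays finite
    return xi / (1.0 + xi) * math.exp(0.5 * exp1(v))
```

The gain is applied per frequency bin to the pre-enhanced spectrum; it grows toward 1 as the a priori SNR increases and shrinks toward 0 as it decreases.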

ξ is the a priori signal-to-noise ratio (SNR) and γ is the a posteriori SNR. In the original formulation, the power used in the a priori or a posteriori SNR is an expectation over a particular time interval. Preferably, a smoothing technique may be applied to estimate the SNR values: because the power used in the a priori or a posteriori SNR is an expectation over a time interval, ξ and γ are computed using forgetting factors. Ps(m,t) and Pn(m,t) denote the power spectral densities for the m-th frequency bin in frame t, and Ts and Tn denote the forgetting factors. As a result, the enhanced speech spectrum in the t-th frame is obtained as X = GY'. Y'(m,t) denotes the m-th element of Y'.
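The smoothing step can be sketched as a first-order recursive average per frequency bin. Equation 3 is not reproduced in this text, so the recursion P(m,t) = T·P(m,t-1) + (1-T)·|x(m,t)|², the use of |Y'|² in the a posteriori SNR, and the function names `smooth_power` and `snr_estimates` are assumptions for illustration. The definitions ξ = Ps/Pn and γ = (observed power)/Pn follow the usual meaning of a priori and a posteriori SNR.

```python
def smooth_power(p_prev, magnitude, forgetting):
    """One smoothing step per bin: P(t) = T * P(t-1) + (1 - T) * magnitude^2."""
    return [forgetting * p + (1.0 - forgetting) * a * a
            for p, a in zip(p_prev, magnitude)]


def snr_estimates(ps, pn, y_mag, floor=1e-12):
    """Per-bin a priori SNR (xi) and a posteriori SNR (gamma).

    ps, pn:  smoothed speech and noise power per bin.
    y_mag:   magnitude of the pre-enhanced spectrum |Y'| per bin (assumed observation).
    floor:   assumed lower bound on the noise power to avoid division by zero.
    """
    xi = [s / max(n, floor) for s, n in zip(ps, pn)]
    gamma = [y * y / max(n, floor) for y, n in zip(y_mag, pn)]
    return xi, gamma
```

In use, `smooth_power` would be called once with the estimated speech magnitudes and forgetting factor Ts, and once with the estimated noise magnitudes and Tn, before `snr_estimates` combines the two smoothed powers.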

Conventional NMF-based enhancement techniques have obtained stable performance with a simple Wiener-filter-type gain function, whereas the present invention uses the MMSE-LSA algorithm as the gain function to achieve a more effective enhancement. In addition, it has been common to estimate noise and speech power with the decision-directed (DD) approach, but this approach does not perform well in NMF-based enhancement. Because NMF-based enhancement, unlike statistical enhancement using the DD approach, estimates the speech and noise magnitudes separately, the individual powers can be obtained with simple smoothing rather than DD, and this was confirmed to be considerably more effective for enhancement.

Step S300 may comprise: estimating the speech presence probability (SPP) value in a predetermined frequency bin using the second signal (S310); and determining the speed at which the basis matrix is updated using the SPP value (S320). That is, the NMF bases can be updated in step S300. Specifically, the final result of step S200 is used to compute the probability that speech is present in the frame (SPP), and the SPP value can be used to determine the basis matrix update speed and to update the basis matrix (see Equations 5 to 7 below).

Conventional template-based speech enhancement techniques suffer a marked drop in performance when the trained noise basis differs from the noise in the actual environment. Methods that update only one of the speech and noise models have been developed to address this, but they update the speech model in intervals where speech is present simply with the NMF update rule, so that the update is affected by noise, and update the noise model with the same technique only in intervals without speech; they are therefore very limited, and their effect is not significant. In the present invention, both the noise and speech models are updated simultaneously in every frame, with the update computed separately for each frequency, which yields higher performance. To reduce the influence of the speech basis on the noise basis, and of the noise basis on the speech basis, during the update, the speech presence probability (SPP) is used. The SPP is the probability of speech activity in a particular frequency bin and is used to control the speech and noise bases. This can be derived using adaptive interpolation, as shown in Equations 6 and 7.
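One way to picture an SPP-controlled, per-frequency update of both bases is the interpolation sketched below. Equations 6 and 7 are not reproduced in this text, so this is not the patent's formula: the convex-combination form, the per-bin rates `alpha_s * p` for speech and `alpha_n * (1 - p)` for noise, and all function and parameter names are hypothetical. The candidate bases `ws_new` and `wn_new` stand for whatever basis estimate the current frame produces.

```python
def interpolate_basis(w_old, w_new, rate_per_bin):
    """Row-wise convex combination: (1 - a) * old + a * new, one rate a per frequency bin."""
    return [[(1.0 - a) * o + a * c for o, c in zip(row_old, row_new)]
            for row_old, row_new, a in zip(w_old, w_new, rate_per_bin)]


def update_bases(ws, wn, ws_new, wn_new, spp, alpha_s, alpha_n):
    """Update speech and noise bases in the same frame, bin by bin.

    spp:      speech presence probability per frequency bin, each in [0, 1].
    alpha_s:  assumed maximum speech update rate in [0, 1].
    alpha_n:  assumed maximum noise update rate in [0, 1].
    """
    # Bins that probably contain speech mostly drive the speech basis ...
    speech_rate = [alpha_s * p for p in spp]
    # ... and bins that probably contain only noise mostly drive the noise basis.
    noise_rate = [alpha_n * (1.0 - p) for p in spp]
    return (interpolate_basis(ws, ws_new, speech_rate),
            interpolate_basis(wn, wn_new, noise_rate))
```

Because each rate lies in [0, 1], every updated entry is a convex combination of two non-negative values, so the bases remain valid for NMF.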

The update rate decision is the step of determining the rate at which the basis matrix used for the nonnegative matrix factorization in the next time frame is updated. If the actual noise environment changes quickly, the basis matrix must also be updated quickly; conversely, if the noise environment changes slowly, the basis matrix should be updated slowly. This is because, when the model in use is already appropriate, unnecessarily many updates can have adverse effects such as overfitting. In view of this, the present invention computes and applies the update rate automatically. Preferably, when the rate of updating the basis matrix is determined using the speech presence probability value, a reconstruction error is used as an indicator and the rate is computed using a sigmoid function (see Equations 8 to 10 below). Equation 8 gives the reconstruction error indicator, and Equation 9 smooths the reconstruction error indicator. In Equation 9, Te denotes a smoothing constant and [symbol] denotes the smoothed reconstruction error value. Because it is difficult to compute a reconstruction error for the noise alone, in speech presence intervals [symbol] can be regarded as equal to [symbol]. The noise basis update rate [symbol] shown in Equation 7 can be obtained according to Equations 10 and 11 and is determined as a non-decreasing function of [symbol]. [symbol] is the maximum value of the update rate, and the speech basis update rate [symbol] in Equation 6 can be set to a constant [symbol].
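
The rate decision described above can be sketched as follows. This is a hedged Python illustration, not Equations 8 to 11: the recursive smoothing form, the sigmoid parameters `slope` and `midpoint`, the choice to hold the smoothed error during speech frames, and the example value of `alpha_max` are all assumptions made for the sketch. Only the use of a smoothing constant Te, a sigmoid, a non-decreasing mapping, and an upper bound on the rate come from the text.

```python
import math


def smooth_error(prev_smoothed, current_error, te):
    """First-order recursive smoothing with smoothing constant te (0 < te < 1)."""
    return te * prev_smoothed + (1.0 - te) * current_error


def noise_update_rate(smoothed_error, alpha_max, slope, midpoint):
    """Sigmoid mapping: non-decreasing in the error and bounded above by alpha_max."""
    return alpha_max / (1.0 + math.exp(-slope * (smoothed_error - midpoint)))


def track_rate(errors, speech_present, te, alpha_max, slope, midpoint):
    """Return one update rate per frame (assumption: error is held during speech)."""
    smoothed = errors[0]
    rates = []
    for err, speech in zip(errors, speech_present):
        if not speech:  # refresh the error statistic only in noise-only frames
            smoothed = smooth_error(smoothed, err, te)
        rates.append(noise_update_rate(smoothed, alpha_max, slope, midpoint))
    return rates


# The reconstruction error jumps at frame 2, as if the noise type had changed.
rates = track_rate(
    errors=[0.1, 0.1, 0.9, 0.9, 0.9],
    speech_present=[False, False, False, True, False],
    te=0.98, alpha_max=0.2, slope=10.0, midpoint=0.5)
```

A larger smoothed reconstruction error signals that the current noise bases no longer fit, so the rate rises toward `alpha_max`; when the bases fit well the rate stays small, which limits the overfitting risk mentioned above.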

FIG. 3 is a diagram illustrating the flow of the sound enhancement method using nonnegative matrix factorization and basis matrix update according to an embodiment of the present invention. As shown in FIG. 3, in this method the sound signal in which noise and speech are mixed (Y(t), the noisy spectrum) is pre-enhanced by a statistical model-based sound enhancement technique to yield Y'(t), and the speech (the second signal) is then estimated (speech estimate) through the nonnegative matrix factorization process (NMF process), the signal-to-noise ratio estimation process (SNR estimation), and the computation of the MMSE-LSA gain function. In addition, the final enhancement result is used to estimate the probability that speech is present in the frame (SPP estimation), and this SPP value (p(t)) is used in the decision of the update rate and in the on-line bases update, which obtains the basis matrices (Wn, Ws) for the next frame and supplies them to the nonnegative matrix factorization of the next frame. In this way, the present invention combines a statistical model-based sound enhancement technique with a template-based sound enhancement technique and resolves the shortcomings of each, thereby achieving higher speech enhancement performance than existing methods. Specifically, the a priori SNR and the a posteriori SNR (Equation 3) are obtained from the speech and noise estimated with the NMF technique (Equation 2), the MMSE-LSA gain function (Equation 4) is applied to increase the enhancement effect, and the update technique makes it possible to maintain a correct noise model as the initial value.
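
The SNR estimation and gain stages of this flow can be illustrated for a single frequency bin. The sketch below is in Python and uses the standard Ephraim-Malah log-spectral amplitude gain, G = xi / (1 + xi) * exp(0.5 * E1(v)) with v = xi * gamma / (1 + xi), which is assumed here to correspond to the MMSE-LSA gain function of Equation 4. The SNR definitions in `snrs_from_nmf` and all function names are likewise assumptions for illustration, and the exponential integral E1 is implemented by hand only to keep the example free of external dependencies.

```python
import math

EULER_GAMMA = 0.5772156649015329


def exp1(x):
    """Exponential integral E1(x) for x > 0."""
    if x <= 0.0:
        raise ValueError("E1 is only evaluated here for x > 0")
    if x <= 1.0:
        # Power series: E1(x) = -gamma - ln(x) - sum_{k>=1} (-x)^k / (k * k!)
        total, term = 0.0, 1.0
        for k in range(1, 60):
            term *= -x / k        # term = (-x)^k / k!
            total -= term / k
        return -EULER_GAMMA - math.log(x) + total
    # Continued fraction evaluated with the modified Lentz method.
    b = x + 1.0
    c = 1e300
    d = 1.0 / b
    h = d
    for i in range(1, 200):
        a = -float(i * i)
        b += 2.0
        d = 1.0 / (a * d + b)
        c = b + a / c
        delta = c * d
        h *= delta
        if abs(delta - 1.0) < 1e-12:
            break
    return h * math.exp(-x)


def mmse_lsa_gain(xi, gamma):
    """Log-spectral amplitude gain from a priori SNR xi and a posteriori SNR gamma."""
    v = xi * gamma / (1.0 + xi)
    return xi / (1.0 + xi) * math.exp(0.5 * exp1(v))


def snrs_from_nmf(noisy_mag, speech_mag_est, noise_mag_est):
    """Assumed SNR definitions from NMF magnitude estimates for one bin."""
    noise_power = noise_mag_est ** 2
    xi = speech_mag_est ** 2 / noise_power   # a priori SNR
    gamma = noisy_mag ** 2 / noise_power     # a posteriori SNR
    return xi, gamma


xi, gamma = snrs_from_nmf(noisy_mag=2.0 ** 0.5, speech_mag_est=1.0, noise_mag_est=1.0)
gain = mmse_lsa_gain(xi, gamma)
```

With equal speech and noise power estimates (xi = 1, gamma = 2) the gain is about 0.56, so the bin is attenuated rather than removed; the enhanced magnitude is the gain multiplied by the noisy magnitude.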

FIG. 4 is a diagram illustrating a sound enhancement system using nonnegative matrix factorization and basis matrix update according to an embodiment of the present invention. As shown in FIG. 4, the system may include a first signal derivation module (100) that derives a first signal (pre-enhanced signal) by converting a sound signal in which noise and speech are mixed into complex values using a statistical model-based sound enhancement technique; a second signal derivation module (200) that, based on nonnegative matrix factorization (NMF), estimates signal-to-noise ratio (SNR) values using values obtained from the first signal and derives a second signal by obtaining an MMSE-LSA gain function from the estimated SNR values; and a basis matrix update module (300) that uses the second signal to update the basis matrix to be used for the nonnegative matrix factorization performed in the next time frame.

FIG. 5 is a diagram illustrating a sound enhancement system using nonnegative matrix factorization and basis matrix update according to another embodiment of the present invention. As shown in FIG. 5, the second signal derivation module (200) may include a nonnegative matrix factorization operation module (210), an SNR estimation module (220), and an MMSE-LSA gain function operation module (230), and the basis matrix update module (300) may include an SPP estimation module (310) and an update rate determination module (320). Each component is similar to what has been described above with reference to FIGS. 1 to 3, so a detailed description is omitted.
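
The way the three modules hand data to one another across frames can be sketched as a small Python skeleton. The class name `EnhancementSystem`, the callable signatures, and the stub functions in the usage example are hypothetical; real modules would operate on complex spectra and full basis matrices. The sketch only shows that the bases produced by module 300 in one frame are the ones module 200 consumes in the next.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Vector = List[float]
Matrix = List[List[float]]


@dataclass
class EnhancementSystem:
    pre_enhance: Callable[[Vector], Vector]                    # module 100
    derive_second: Callable[[Vector, Matrix, Matrix], Vector]  # module 200
    update_bases: Callable[[Vector, Matrix, Matrix],
                           Tuple[Matrix, Matrix]]              # module 300

    def run(self, frames, ws, wn):
        """Process frames in order, carrying the updated bases forward."""
        outputs = []
        for frame in frames:
            first = self.pre_enhance(frame)
            second = self.derive_second(first, ws, wn)
            outputs.append(second)
            # Bases updated from this frame's output are used for the next frame.
            ws, wn = self.update_bases(second, ws, wn)
        return outputs, ws, wn


# Stub modules chosen only so that the frame-to-frame dependency is visible.
system = EnhancementSystem(
    pre_enhance=lambda frame: list(frame),
    derive_second=lambda first, ws, wn: [ws[0][0] * x for x in first],
    update_bases=lambda second, ws, wn: ([[2.0 * ws[0][0]]], wn),
)
outputs, ws_final, wn_final = system.run([[1.0], [1.0]], ws=[[1.0]], wn=[[1.0]])
```

Because the stub update doubles the speech basis after every frame, the second frame's output differs from the first even though the inputs are identical, which makes the on-line dependency easy to see.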

Hereinafter, the effects of the present invention are described in more detail through experimental examples, but the scope of the present invention is not limited by the following experimental examples.

Experimental Example 1. Comparison of sound enhancement performance

A comparative experiment was conducted to evaluate the sound enhancement effect of the present invention. As the first-stage SE (statistical model-based sound enhancement) module, the algorithm disclosed in S. Rangachari and P. C. Loizou, "A noise-estimation algorithm for highly non-stationary environments," Speech Commun., vol. 48, pp. 220-231, 2006, was adopted. It provides not only the enhanced spectrum but also SPP and PSD estimates for each frame. The TIMIT and NOISEX-92 databases (DBs) were selected as the speech and noise material, respectively, and the sampling rate was 16 kHz. A 512-point fast Fourier transform with 75% overlap was used. Each noise basis was built from a 16-second noise signal that was not included in the test database. The numbers of speech and noise bases were each 40 (rn = rs = 40). Regarding the parameters for the basis update and smoothing, [symbol] was 0.03, [symbol] was 0.2, Ts was 0.5, Tn was 0.9, and Te was 0.98. The results were compared using ITU-T Recommendation P.862 Perceptual Evaluation of Speech Quality (PESQ) scores (see Perceptual evaluation of speech quality (PESQ): An objective method for end-to-end speech quality assessment of narrow-band telephone networks and speech codecs, Tech. Rep. ITU-T P.862, 2001).
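
The framing implied by these settings can be worked out with a few lines of Python. The sketch assumes that the analysis frame length equals the FFT size (no zero-padding) and that only the non-redundant half of the spectrum of a real signal is kept; the variable names are illustrative.

```python
SAMPLE_RATE_HZ = 16_000
FFT_SIZE = 512
OVERLAP = 0.75
NUM_BASES = 40  # rs = rn = 40

hop = int(FFT_SIZE * (1.0 - OVERLAP))          # samples between frame starts
frame_ms = 1000.0 * FFT_SIZE / SAMPLE_RATE_HZ  # frame duration in milliseconds
hop_ms = 1000.0 * hop / SAMPLE_RATE_HZ         # frame shift in milliseconds
num_bins = FFT_SIZE // 2 + 1                   # non-redundant frequency bins
basis_shape = (num_bins, NUM_BASES)            # shape of each of Ws and Wn


def num_frames(num_samples):
    """Number of full frames that fit in a signal, without padding."""
    if num_samples < FFT_SIZE:
        return 0
    return 1 + (num_samples - FFT_SIZE) // hop


frames_in_training_noise = num_frames(16 * SAMPLE_RATE_HZ)  # 16-second noise clip
```

Under these assumptions each frame spans 32 ms with an 8 ms shift, each basis matrix has 257 rows and 40 columns, and a 16-second noise signal yields 1997 frames for training the noise bases.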

The performance of five methods was compared in this way. The five methods are: the statistical model-based sound enhancement technique alone (SE); the NMF-based enhancement technique alone, without updating the basis matrix (NMF); the NMF-based sound enhancement technique proposed by Cabras et al. (SE & Cabras); the SE and NMF-based enhancement technique proposed in the present invention with only the basis matrix update removed (Proposed w/o update); and the sound enhancement technique according to an embodiment of the present invention (Proposed).

SE & Cabras updates the noise bases only when speech is present and, conversely, updates the speech bases when an acoustic signal is present. It follows the method disclosed in G. Cabras, S. Canazza, P. L. Montessoro and R. Rinaldo, "Restoration of audio documents with low SNR: a NMF parameter estimation and perceptually motivated Bayesian suppression rule," in Proc. Sound and Music Computing Conference, pp. 314-321, 2010.

The experiment was performed under three different conditions: a stationary noise environment with matched noise bases (Table 1), a stationary noise environment with mismatched noise bases (Table 2), and a non-stationary and stationary noise environment with mismatched noise bases (Table 3). Matched noise bases means that the noise signal from which the bases were originally derived is of the same kind as the noise used in the test database. For example, the factory floor 2 (factory2) noise bases were trained only on the factory2 noise database. Conversely, with mismatched noise bases, the noise bases are built from a database that differs from the actual noise in the test database. We used white noise in this experiment.

Table 1 shows the experimental results for the stationary noise environment with matched noise bases, Table 2 shows the results for the stationary noise environment with mismatched noise bases, and Table 3 shows the results for the non-stationary and stationary noise environment with mismatched noise bases.

As shown in Table 1, the sound enhancement method according to the present invention (Proposed) is more effective than the other techniques. As shown in Table 2, the performance of the NMF-based techniques degraded when the noise bases were mismatched, but it rose substantially when the basis matrix was updated. As shown in Table 3, the method according to the present invention is also very effective when slowly varying and rapidly varying noise are mixed. Thus the present invention, which combines a statistical model-based sound enhancement technique with nonnegative matrix factorization and basis matrix update, provides excellent sound enhancement even with training data that does not match the actual noise and in rapidly changing noise environments, and shows a marked improvement over the other methods regardless of the training DB (factory2, F-16, M109, Leopard).

The present invention described above may be variously modified or applied by those of ordinary skill in the art to which the invention pertains, and the scope of the technical idea of the invention should be determined by the claims below.

100: first signal derivation module 200: second signal derivation module
210: nonnegative matrix factorization operation module 220: SNR estimation module
230: MMSE-LSA gain function operation module 300: basis matrix update module
310: SPP estimation module 320: update rate determination module
S100: deriving a first signal (pre-enhanced signal) by converting a sound signal in which noise and speech are mixed into complex values using a statistical model-based sound enhancement technique
S200: estimating signal-to-noise ratio (SNR) values using values obtained from the first signal based on nonnegative matrix factorization (NMF), and deriving a second signal by obtaining an MMSE-LSA gain function using the estimated SNR values
S210: obtaining the absolute value of the first signal, and estimating an encoding matrix using a pre-trained noise basis matrix and a pre-trained speech basis matrix
S220: estimating an a priori SNR value and an a posteriori SNR value using the encoding matrix estimated in step S210 and the basis matrix updated in step S300 of the previous time frame
S230: deriving the second signal using the SNR values estimated in step S220 and the MMSE-LSA gain function
S300: updating, using the second signal derived in step S200, the basis matrix to be used for the nonnegative matrix factorization of step S200 performed in the next time frame
S310: estimating a speech presence probability (SPP) value in a predetermined frequency bin using the second signal
S320: determining the rate at which the basis matrix is updated using the speech presence probability value

Claims

1. A sound enhancement method using nonnegative matrix factorization and basis matrix update, comprising:
(1) deriving a first signal (pre-enhanced signal) by converting a sound signal in which noise and speech are mixed into complex values using a statistical model-based sound enhancement technique;
(2) estimating signal-to-noise ratio (SNR) values using values obtained from the first signal based on nonnegative matrix factorization (NMF), and deriving a second signal by obtaining an MMSE-LSA gain function using the estimated SNR values; and
(3) updating, using the second signal derived in step (2), a basis matrix to be used for the nonnegative matrix factorization of step (2) performed in the next time frame.

2. The method of claim 1, wherein step (2) comprises:
(2-1) obtaining the absolute value of the first signal, and estimating an encoding matrix using a pre-trained noise basis matrix and a pre-trained speech basis matrix;
(2-2) estimating an a priori SNR value and an a posteriori SNR value using the encoding matrix estimated in step (2-1) and the basis matrix updated in step (3) of the previous time frame; and
(2-3) deriving the second signal using the SNR values estimated in step (2-2) and the MMSE-LSA gain function.

3. The method of claim 1, wherein step (2) applies a smoothing technique in estimating the signal-to-noise ratio (SNR) values.

4. The method of claim 1, wherein step (3) updates both the noise model and the speech model simultaneously in every frame, computing and applying the update individually for each frequency.

5. The method of claim 1, wherein step (3) comprises:
(3-1) estimating a speech presence probability (SPP) value in a predetermined frequency bin using the second signal; and
(3-2) determining the rate at which the basis matrix is updated using the speech presence probability value.

6. The method of claim 5, wherein step (3-2) uses a reconstruction error as an indicator and performs the computation using a sigmoid function.

7. A sound enhancement system using nonnegative matrix factorization and basis matrix update, comprising:
a first signal derivation module that derives a first signal (pre-enhanced signal) by converting a sound signal in which noise and speech are mixed into complex values using a statistical model-based sound enhancement technique;
a second signal derivation module that estimates signal-to-noise ratio (SNR) values using values obtained from the first signal based on nonnegative matrix factorization (NMF), and derives a second signal by obtaining an MMSE-LSA gain function using the estimated SNR values; and
a basis matrix update module that uses the second signal to update the basis matrix to be used for the nonnegative matrix factorization performed in the next time frame.

8. The system of claim 7, wherein the basis matrix update module comprises:
an SPP estimation module that estimates a speech presence probability (SPP) value in a predetermined frequency bin using the second signal; and
an update rate determination module that determines the rate at which the basis matrix is updated using the speech presence probability value.