KR101184394B1

KR101184394B1 - method of noise source separation using Window-Disjoint Orthogonal model

Info

Publication number: KR101184394B1
Application number: KR1020060041732A
Authority: KR
Inventors: 이수정; 최갑근; 김순협
Original assignee: 에이펫(주)
Priority date: 2006-05-10
Filing date: 2006-05-10
Publication date: 2012-09-20
Also published as: KR20070109156A

Abstract

A noise signal separation method using a window-disjoint orthogonal (WDO) model according to an embodiment of the present invention comprises: receiving a mixture of a plurality of sound sources at two or more microphones; applying a fast Fourier transform (FFT) to the signal received at each microphone; obtaining mixing parameters for each sound source from the two-dimensional time-frequency spectrogram that the FFT yields for each microphone signal; deriving a single highest peak for each of the mixed sound sources from a histogram of the mixing parameters; and associating a binary mask function with each sound source in the time-frequency domain and separating the mixed sound sources from one another using an inverse FFT and overlap-and-add (OLA).

According to the present invention, because the noise signal is separated from the speech signal in the frequency domain, the computational load and processing speed improve over conventional blind source separation methods, and the signal-to-noise ratio (SNR) also improves.

Description

Method of Noise Source Separation Using Window-Disjoint Orthogonal Model

FIG. 1 is a schematic configuration diagram of a general speech recognizer.

FIG. 2 is a block diagram of a speech recognizer that operates after noise removal by conventional blind source separation.

FIG. 3 is a diagram for explaining a noise signal separation method according to an embodiment of the present invention.

FIG. 4 is a spectrogram corresponding to the embodiment shown in FIG. 3.

FIG. 5 is a histogram of the mixing parameters derived from the spectrogram shown in FIG. 4.

FIG. 6 is a graph showing experimental results obtained with the present invention.

The present invention relates to a speech recognition apparatus, and more particularly to a noise signal separation method that separates noise from a speech signal using a window-disjoint orthogonal model in the preprocessing stage of speech recognition.

In general, as shown in FIG. 1, a speech recognition apparatus consists of a feature extraction unit (11) and a speech recognition unit (12) that performs recognition based on the extracted speech features. That is, features in a form suitable for recognition are first extracted from the input speech signal, and the input speech is then recognized using the result.

Among the various methods by which the feature extraction unit (11) extracts speech features, 'MFCC' (Mel-Frequency Cepstrum Coefficient) and 'PLPCC' (Perceptual Linear Prediction Cepstrum Coefficient) are mainly used.

For the speech recognition unit (12), methods such as 'HMM' (Hidden Markov Model), 'DTW' (Dynamic Time Warping), or neural networks are widely used. In actual use, however, recognition performance degrades, because a certain amount of noise is mixed in whenever speech is input in a real environment.

That is, recognition performance drops markedly because of the various kinds of noise that enter together with the speech signal, such as background noise, channel noise, and reverberation, which has made speech recognition difficult to use outside the laboratory. Efforts to overcome this problem have therefore continued.

To remove such background noise, attempts have been made to improve performance using frequency (spectral) subtraction, which removes specific frequency bands, formant tracking, and the like. However, these methods do little for recognition performance in real situations, where the noise components vary strongly and many kinds of noise are present.

Accordingly, to improve recognition performance in noisy real environments, a process that removes the noise arriving together with the speech signal before the signal is passed to the recognizer has recently become essential.

Blind source separation (BSS) is used as such a noise pre-removal technique. BSS is generally trained using independent component analysis (ICA).

ICA is an algorithm that, assuming the signal sources mixed into the input signals (speech and noise) are mutually independent, uses a microphone array system to separate the speech signal from the input signals without any prior information.

That is, ICA aims to find the inverse of the mixing matrix in order to obtain the separation matrix needed to separate the speech signal. Its drawback is that the inverse can be computed only when the number of input sound sources matches the size of the mixing matrix, that is, the number of mixed signals.

FIG. 2 is a block diagram of a prior-art speech recognition apparatus that operates after noise removal by blind source separation. The signals as they were before mixing are recovered by extracting mutually independent signals from input signals in which speech and noise arrive mixed.

That is, a plurality of mixtures of speech and noise signals are input, the blind signal separator (21) separates the noise and speech signals from the input signals and outputs them, and the speech recognizer (22) performs recognition using only the speech signal from which the noise has been separated.

However, this prior-art speech recognition apparatus has the drawback that the number of signal sources before mixing must equal the number of mixed signals received through the input devices. In addition, the number of separated signals equals the number of signal sources, and there is no way to know which separated signal corresponds to which source.

That is, the apparatus cannot determine which of the separated signals should be fed to the speech recognizer as the recognition target, so every separated signal must be input to the recognizer and pass through the recognition process. If, for example, an audio device is turned on in a moving vehicle, the number of blind-separated signals increases further, and so does the number of separated signals that must be fed to the recognizer.

An object of the present invention is to provide a noise signal separation method using a window-disjoint orthogonal model. In separating noise from a speech signal in the preprocessing stage of speech recognition, the method exploits the fact that the input sound signals are mutually independent and uncorrelated in the frequency domain: instead of the conventional ICA method, a window-disjoint orthogonal (WDO) method partitions the mixture of speech and noise signals, and the original speech signal is extracted from that partition.

To achieve the above object, a noise signal separation method using a window-disjoint orthogonal model according to an embodiment of the present invention comprises: receiving a mixture of a plurality of sound sources at two or more microphones; applying a fast Fourier transform (FFT) to the signal received at each microphone; obtaining mixing parameters for each sound source from the two-dimensional time-frequency spectrogram that the FFT yields for each microphone signal; deriving a single highest peak for each of the mixed sound sources from a histogram of the mixing parameters; and associating a binary mask function with each sound source in the time-frequency domain and separating the mixed sound sources from one another using an inverse FFT and overlap-and-add (OLA).

Here, the number of sound sources is equal to or greater than the number of microphones.

In addition, the sound sources are mutually independent and uncorrelated in the frequency domain, and each sound source follows the window-disjoint orthogonal (WDO) model, defined by two conditions whose equations were images in the original and are not reproduced here.

In addition, the histogram of the mixing parameters is smoothed so that a single peak point is obtained for each sound source.

Hereinafter, embodiments of the present invention are described in more detail with reference to the accompanying drawings.

Before describing the present invention, the conventional blind source separation (BSS) method is described first.

Consider M mutually independent, uncorrelated signals. The original signals are s(t) ∈ R^M, each located at a different position, and the signals input to the microphones are x(t) ∈ R^N.

In Equation 1, S_i is the signal arriving from the i-th sound source and X_j is the signal collected by the j-th microphone. A_ji is the P-point impulse response from sound source i to microphone j and represents the acoustic environment.

By applying the matrix discrete Fourier transform, Equation 1 can be written more simply as Equation 2.
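The images for Equations 1 and 2 did not survive in this text. A form consistent with the definitions above would be the following, offered only as a plausible reconstruction. It writes the source's Si, Xj, and Aji in lower case for time-domain quantities and assumes a sum over M sources and P filter taps. The bold frequency-domain symbols are likewise assumptions.

```latex
% Equation 1 (assumed form): convolutive mixture observed at microphone j
x_j(t) = \sum_{i=1}^{M} \sum_{p=0}^{P-1} a_{ji}(p)\, s_i(t - p), \qquad j = 1, \dots, N

% Equation 2 (assumed form): after the DFT, each convolution becomes
% a product at every frequency, with \mathbf{A}(\omega) an N x M matrix
\mathbf{x}(\omega) = \mathbf{A}(\omega)\, \mathbf{s}(\omega)
```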

Accordingly, convolution-based signal separation, that is, the conventional convolutive BSS method, amounts to finding an inverse filter for the mixing filter. The inverse filter can be obtained from the expression given in Equation 3 below.

Here, P denotes an arbitrary permutation matrix, and S(ω) is a diagonal matrix representing the scaling at each frequency.

Conventional convolutive BSS therefore has to resolve a scaling and a permutation in every frequency band, and as the number of microphones increases, the complexity of the permutation matrix grows and system performance degrades.

The present invention is intended to overcome this permutation problem. In separating noise from a speech signal in the preprocessing stage of speech recognition, it exploits the fact that the input sound signals are mutually independent and uncorrelated in the frequency domain: instead of the conventional ICA method, the window-disjoint orthogonal (WDO) method partitions the mixture of speech and noise signals, and the original speech signal is extracted from that partition.

FIG. 3 is a diagram for explaining a noise signal separation method according to an embodiment of the present invention.

FIG. 3 illustrates a two-channel microphone arrangement that receives multiple sound sources (multiple input signals). This is only one embodiment to which the present invention is applied, and the invention is not limited to this example.

The noise signal separation method using the window-disjoint orthogonal model according to the present invention exploits the fact that the input sound signals are mutually independent and uncorrelated in the frequency domain. It requires far less computation than the convolutive BSS described above and, as shown in FIG. 3, remains usable even when the sound sources outnumber what the mixing matrix accommodates, that is, when there are more sources than microphones. This makes it better suited to real situations.

That is, the present invention overcomes the conventional drawback that the number of sound sources before mixing must equal the number of mixed signals received through the input devices.

As shown in FIG. 3, when two microphones receive a plurality of sound sources, that is, input signals, the first and second microphones are placed at separate positions and each receives the plurality of input signals.

The signal x_1(t) input to the first microphone and the signal x_2(t) input to the second microphone are given by Equation 4 below.

Here, s_j(t) denotes the plurality of sound sources, that is, the input signals, arriving at the microphones. For the j-th signal, a_j is the amplitude at the second microphone relative to the first microphone, which can be thought of as distance-dependent attenuation, and δ_j is the time delay between the first and second microphones. The input signals arriving at the microphones are assumed to be free of any room impulse response.
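The equation image for this mixing model is missing, but the description (a relative amplitude and a time delay per source, with no room impulse response) matches an anechoic model of the form x_1(t) = Σ_j s_j(t), x_2(t) = Σ_j a_j s_j(t - δ_j). The sketch below builds such a mixture for test purposes. It is an illustration under that assumed form, not the patent's procedure. The function name `mix_two_channels` and the restriction to non-negative integer sample delays are assumptions made here.

```python
import numpy as np

def mix_two_channels(sources, attenuations, delays):
    """Anechoic two-microphone mixture with integer-sample delays.

    sources:      array of shape (J, T), one row per source s_j
    attenuations: length-J relative amplitudes a_j at microphone 2
    delays:       length-J non-negative integer delays delta_j (samples)
    """
    sources = np.asarray(sources, dtype=float)
    n_samples = sources.shape[1]
    x1 = sources.sum(axis=0)              # microphone 1 is the reference
    x2 = np.zeros(n_samples)
    for s, a, d in zip(sources, attenuations, delays):
        delayed = np.zeros(n_samples)
        delayed[d:] = s[:n_samples - d]   # shift right by d samples
        x2 += a * delayed                 # attenuate and accumulate
    return x1, x2
```

With two sources, `mix_two_channels([s_a, s_b], [0.5, 2.0], [1, 0])` returns the pair of microphone signals used as input to the later steps.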

The first step of the present invention therefore detects K signals at the first and second microphones and finds the values a_j and δ_j that correspond to the j-th sound source. In the case of two stereo channels in which only amplitude panning is applied to the input signals, with no adjustment of time delay, the time-delay coefficient δ_j can be omitted, because the audio mixing console does not support adjusting both amplitude and delay.

Next, each input signal s_j is assumed to follow the window-disjoint orthogonal (WDO) model, and a fast Fourier transform (FFT) is applied to the signal x_1(t) input to the first microphone and the signal x_2(t) input to the second microphone.

That is, the present invention exploits the fact that the input signals are mutually independent and uncorrelated in the frequency domain. This requires far less computation than conventional convolutive BSS and overcomes the conventional drawback that the number of original input signals must equal the number of microphones.

That is, each input signal follows the window-disjoint orthogonal model, defined by two conditions whose equations were images in the original and are not reproduced here.
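The defining equations are missing from this text. The condition usually meant by window-disjoint orthogonality is that no two sources are active at the same time-frequency point. It is also the condition consistent with the later statement that only one source contributes to each spectrogram point. As a hedged reconstruction, assuming S_i(k,l) denotes the windowed FFT of source i at frequency index k and time index l:

```latex
S_i(k,l)\, S_j(k,l) = 0 \qquad \forall\, i \neq j,\ \ \forall\, (k,l)
```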

Equation 5 shows the result of applying the fast Fourier transform (FFT) to the signal x_1(t) input to the first microphone and the signal x_2(t) input to the second microphone. That is, X_1(k,l) and X_2(k,l) below are obtained by applying the FFT to x_1(t) and x_2(t) after weighting them with a window function.

Here, k is the frequency index, l is the time index, and L is the length of the FFT.
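The windowed FFT described above can be sketched as follows. This is a minimal illustration rather than the patent's implementation: the Hann window, the default frame length of 512, the hop of 128 samples, and the function name `stft` are all assumptions, since the text gives none of them.

```python
import numpy as np

def stft(x, frame_len=512, hop=128):
    """Return X[k, l]: FFT of Hann-windowed frames.

    k indexes frequency (0 .. frame_len // 2) and l indexes time frames.
    The input must contain at least frame_len samples.
    """
    x = np.asarray(x, dtype=float)
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    # Cut the signal into overlapping frames and weight each by the window
    frames = np.stack(
        [x[l * hop : l * hop + frame_len] * window for l in range(n_frames)],
        axis=1,
    )
    # One-sided FFT along the sample axis gives the spectrogram columns
    return np.fft.rfft(frames, axis=0)
```

Calling `stft` once on each microphone signal with the same `frame_len` and `hop` gives two spectrograms of identical shape, which the later steps compare point by point.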

FIG. 4 is a spectrogram corresponding to the embodiment shown in FIG. 3.

That is, it shows the two-dimensional time-frequency spectrograms corresponding to the signal x_1(t) input to the first microphone and the signal x_2(t) input to the second microphone.

Because the present invention assumes that each input signal follows the WDO model, only one input signal contributes to each point of the spectrogram, that is, to each point on the time and frequency axes, as shown in FIG. 4. This is because the input signals are mutually independent and uncorrelated in the frequency domain.
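One consequence of this assumption is worth spelling out, since it is what makes the mixing parameters recoverable. Take an anechoic two-microphone model in which the second microphone receives source j scaled by a_j and delayed by δ_j samples. The symbols are assumed here, as the equation images are missing. A point (k,l) occupied only by source j then satisfies:

```latex
\begin{aligned}
X_1(k,l) &= S_j(k,l), \\
X_2(k,l) &= a_j\, e^{-i 2\pi k \delta_j / L}\, S_j(k,l), \\
\frac{X_2(k,l)}{X_1(k,l)} &= a_j\, e^{-i 2\pi k \delta_j / L}.
\end{aligned}
```

The ratio no longer depends on the source signal itself, so its magnitude yields a_j and its phase yields δ_j.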

Next, a(k,l) and δ(k,l) are computed for every spectrogram index (k,l), as expressed in Equation 6.
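Equation 6 itself is missing from this text. A common way to obtain a relative amplitude and a delay at each spectrogram point is to take the magnitude and the phase of the ratio X_2(k,l)/X_1(k,l). The sketch below does that under the assumption that the second channel is an attenuated, delayed copy of the first. The function name `mixing_parameters`, the `eps` guard against division by zero, and the choice to drop the k = 0 row are assumptions, not details from the patent.

```python
import numpy as np

def mixing_parameters(X1, X2, fft_len, eps=1e-12):
    """Per-point relative amplitude a(k,l) and delay delta(k,l) in samples.

    X1, X2: one-sided spectrograms of shape (K, n_frames); row k holds
    frequency index k. The k = 0 row is dropped because the phase at
    zero frequency carries no delay information.
    """
    ratio = X2[1:] / (X1[1:] + eps)
    k = np.arange(1, X1.shape[0])[:, None]
    omega = 2.0 * np.pi * k / fft_len      # radians per sample
    a = np.abs(ratio)                      # relative amplitude
    delta = -np.angle(ratio) / omega       # relative delay in samples
    return a, delta
```

The delay estimate is unambiguous only while the true phase shift stays within ±π, so it is reliable when the product of frequency and delay is small.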

Once the mixing parameters have been computed for every point in the time-frequency domain, the histogram h(α,δ) of the mixing parameters can be obtained.

In this embodiment, a sound source located midway between the first and second microphones has a relative amplitude of a = 1 and a delay of δ = 0. To obtain an original histogram centered on this point, the amplitude parameter is defined as in Equation 7.

α(k,l) = a(k,l) - 1/a(k,l)

The parameters a(k,l) and δ(k,l) that enter the discrete histogram take quantized values and therefore include quantization error. The weighted histogram h(α,δ) is accordingly computed by adding, for every point (k,l), the signal power at that point to the bin selected by the pair (α(k,l), δ(k,l)) computed there.
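The weighted histogram can be sketched with NumPy's two-dimensional histogram. The text says only that signal power is accumulated, so the weight |X_1(k,l)|·|X_2(k,l)| used here is an assumption, as are the bin counts, the histogram ranges, and the function name `weighted_histogram`.

```python
import numpy as np

def weighted_histogram(a, delta, X1, X2, bins=(35, 50),
                       alpha_range=(-0.7, 0.7), delta_range=(-3.6, 3.6)):
    """Power-weighted 2-D histogram h(alpha, delta).

    a, delta, X1, X2 must share one shape (one entry per (k,l) point),
    and a must be non-zero everywhere.
    """
    alpha = a - 1.0 / a                          # symmetric attenuation
    weights = (np.abs(X1) * np.abs(X2)).ravel()  # assumed power weight
    hist, alpha_edges, delta_edges = np.histogram2d(
        alpha.ravel(), delta.ravel(), bins=bins,
        range=(alpha_range, delta_range), weights=weights)
    return hist, alpha_edges, delta_edges
```

Points whose (α, δ) fall outside the chosen ranges are ignored, which acts as a crude outlier filter. Widening the ranges keeps them at the cost of coarser bins.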

FIG. 5 is a histogram of the mixing parameters derived from the spectrogram shown in FIG. 4.

FIG. 5 shows both the original histogram and a smoothed histogram. The smoothed histogram is derived in order to obtain a single highest peak for each sound source.

That is, peaks are found in the histogram of the mixing parameters, which is the histogram over all sound source signals, and the original histogram is smoothed so that one peak point is obtained for each sound source.

Next, automatic peak detection is performed for real-time processing. The slope test used to locate the maximum peaks follows the condition of Equation 8 below.

That is, the local maxima that satisfy the condition of Equation 8 can be found from the slope values.
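Equation 8 is not reproduced in this text, so the exact peak condition is unknown. The sketch below illustrates the two stated ideas, smoothing followed by local-maximum detection, using a stand-in rule: a bin counts as a peak if it is strictly larger than its eight neighbours and at least `min_ratio` times the global maximum. The binomial smoothing kernel, the threshold, and both function names are assumptions.

```python
import numpy as np

def smooth_histogram(hist, kernel=(0.25, 0.5, 0.25)):
    """Separable low-pass smoothing along both histogram axes."""
    kernel = np.asarray(kernel, dtype=float)
    hist = np.asarray(hist, dtype=float)
    out = np.apply_along_axis(np.convolve, 0, hist, kernel, mode="same")
    return np.apply_along_axis(np.convolve, 1, out, kernel, mode="same")

def find_peaks_2d(hist, min_ratio=0.1):
    """Return (row, col) indices of strict local maxima above a threshold."""
    hist = np.asarray(hist, dtype=float)
    # Pad with -inf so that border bins can also qualify as peaks
    padded = np.pad(hist, 1, mode="constant", constant_values=-np.inf)
    center = padded[1:-1, 1:-1]
    is_peak = center >= min_ratio * center.max()
    for di in (-1, 0, 1):
        for dj in (-1, 0, 1):
            if di == 0 and dj == 0:
                continue
            neighbor = padded[1 + di : padded.shape[0] - 1 + di,
                              1 + dj : padded.shape[1] - 1 + dj]
            is_peak &= center > neighbor
    return [tuple(int(v) for v in idx) for idx in np.argwhere(is_peak)]
```

Each returned index pair can be converted to an (α, δ) estimate by taking the centre of the corresponding histogram bin.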

Finally, the original sound sources, that is, the input signals, are recovered from the mixed signals.

Once the mixing parameters a_j and δ_j are known, the mixed sound sources can be separated. The basic idea is to assign each point (k,l) of the time-frequency domain to the one source j of the mixture to which that point corresponds.

The time-frequency domain can therefore be examined point by point to determine which source j each point belongs to.

For each point, Equation 9 below is assumed as the basic model.

Here, the two additive terms in Equation 9 represent noise due to reflection, diffraction, interference, and the like. A maximum-likelihood estimate of the active sound source is derived for each time-frequency point, based on the mixing parameters a_j and δ_j already estimated.

For each sound source j and each time-frequency point (k,l), the likelihood function can be computed as in Equation 10 below.

Using Equation 10, a binary mask function can be associated with each sound source in the time-frequency domain. Finally, all of the sound sources are separated using the inverse fast Fourier transform (inverse FFT) and overlap-and-add (OLA), which yields all of the input signals.
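The masking and resynthesis step can be sketched as below. Equation 10 is missing from this text, so the cost |a_j e^{-iωδ_j} X_1 - X_2|² / (1 + a_j²) used here is an assumed likelihood-style measure of how well source j explains a point, chosen because it is zero when the point matches that source's mixing parameters exactly. The Hann synthesis window, the window-squared normalisation, and the function names are also assumptions.

```python
import numpy as np

def binary_masks(X1, X2, a_peaks, delta_peaks, fft_len):
    """One boolean mask per source; each (k,l) goes to exactly one source."""
    k = np.arange(X1.shape[0])[:, None]
    omega = 2.0 * np.pi * k / fft_len
    # Cost of explaining each point with source j (assumed form)
    cost = np.stack([
        np.abs(a * np.exp(-1j * omega * d) * X1 - X2) ** 2 / (1.0 + a ** 2)
        for a, d in zip(a_peaks, delta_peaks)])
    winner = np.argmin(cost, axis=0)
    return [winner == j for j in range(len(a_peaks))]

def overlap_add_istft(X, frame_len, hop):
    """Inverse FFT of each frame followed by overlap-and-add (OLA)."""
    window = np.hanning(frame_len)
    n_frames = X.shape[1]
    out = np.zeros(frame_len + hop * (n_frames - 1))
    norm = np.zeros_like(out)
    for l in range(n_frames):
        frame = np.fft.irfft(X[:, l], n=frame_len)
        start = l * hop
        out[start:start + frame_len] += frame * window
        norm[start:start + frame_len] += window ** 2
    # Undo the combined analysis and synthesis window weighting
    return out / np.maximum(norm, 1e-12)
```

Source j would then be estimated as `overlap_add_istft(masks[j] * X1, frame_len, hop)`, assuming `X1` was produced with a Hann analysis window of the same `frame_len` and `hop`.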

FIG. 6 is a graph showing experimental results obtained with the present invention.

The speech and noise used in the experiment were sampled at 16 kHz and quantized to 16 bits, and each signal was 5000 samples long. The speech consisted of the voices of two real male speakers, and randomly generated Gaussian noise and music were used as noise sources.

As shown in FIG. 6, when the original speech was mixed with Gaussian noise, the present invention improved the signal-to-noise ratio by about 6 dB, and the recorded voices of the two speakers were separated almost perfectly. In this experiment, all signals other than speech were treated as a single noise source that reaches the first and second microphones through some channel.
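The text does not say how the signal-to-noise ratio was measured. One conventional definition, shown here only as an illustration, compares the power of the clean reference with the power of the residual error. The function name `snr_db` is an assumption.

```python
import numpy as np

def snr_db(reference, estimate):
    """SNR in dB of an estimate against the clean reference signal."""
    reference = np.asarray(reference, dtype=float)
    noise = np.asarray(estimate, dtype=float) - reference
    return 10.0 * np.log10(np.sum(reference ** 2) / np.sum(noise ** 2))
```

Under this definition, an SNR improvement is the difference `snr_db(clean, separated) - snr_db(clean, mixture)`, with all three signals aligned and of equal length.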

According to the present invention, because the noise signal is separated from the speech signal in the frequency domain, the computational load and processing speed improve over conventional blind source separation methods, and the signal-to-noise ratio (SNR) also improves.

From the above description, those skilled in the art will appreciate that various changes and modifications are possible without departing from the technical spirit of the present invention. Accordingly, the technical scope of the present invention is not limited to the detailed description in the specification but must be defined by the claims.

Claims

1. A noise signal separation method using a window-disjoint orthogonal model, comprising:

receiving a mixture of a plurality of sound sources at two or more microphones;

applying a fast Fourier transform to the signal received at each microphone;

obtaining mixing parameters for each sound source from the two-dimensional time-frequency spectrogram that the fast Fourier transform yields for each microphone signal;

deriving a single highest peak for each of the mixed sound sources from a histogram of the mixing parameters; and

associating a binary mask function with each sound source in the time-frequency domain, and separating the mixed sound sources from one another using an inverse fast Fourier transform and overlap-and-add (OLA).

2. The method of claim 1, wherein the number of sound sources is equal to or greater than the number of microphones.

3. The method of claim 1, wherein the sound sources are mutually independent and uncorrelated in the frequency domain.

4. The method of claim 3, wherein each sound source follows the window-disjoint orthogonal (WDO) model, defined by two conditions whose equations were images in the original and are not reproduced here.

5. The method of claim 1, wherein the histogram of the mixing parameters is smoothed so that a single peak point is obtained for each sound source.