KR101251373B1

KR101251373B1 - Sound classification apparatus and method thereof

Info

Publication number: KR101251373B1
Application number: KR1020110110640A
Authority: KR
Inventors: 최종석; 신광호
Original assignee: 한국과학기술연구원
Priority date: 2011-10-27
Filing date: 2011-10-27
Publication date: 2013-04-05

Abstract

PURPOSE: A sound source classification device and method are provided that increase computing speed by classifying sound sources based on characteristic parameters and information about reference sound sources. CONSTITUTION: A sound source detection unit (110) detects a segment in which a sound source occurs. A sound source characteristic extraction unit (120) extracts characteristic parameters of the sound source. A reference sound source storage (140) stores information about at least one reference sound source. A sound source classification unit (130) classifies the sound signal based on the characteristic parameters and the information about the reference sound source. The sound source characteristic extraction unit extracts the characteristic parameters using the MFCC (Mel-Frequency Cepstral Coefficients) method. [Reference numerals] (110) Sound source detection unit; (120) Sound source characteristic extraction unit; (130) Sound source classification unit; (140) Reference sound source storage

Description

Sound classification apparatus and method thereof

Embodiments of the present invention relate to a sound source classification apparatus and a method thereof, and more particularly to a sound source classification apparatus and method that use MFCC and GMM to improve classification performance and computation speed over the prior art.

Research on the automatic classification of environmental sound sources has been carried out steadily for many years. Its importance continues to grow because automatic sound source classification can be applied, directly or indirectly, to fields such as speech recognition, pattern recognition, situation detection, and situation awareness.

Among the prior art on sound source classification, F. Beritelli, in "A Pattern Recognition System for Environmental Sound Classification based on MFCCs and Neural Networks," classified 10 environmental sound sources with 85% to 92% accuracy using MFCC (Mel Frequency Cepstral Coefficients) features and an ANN (Artificial Neural Network) classifier. L. Ma et al., in "Environmental Noise Classification for Context-Aware Applications," classified 11 environmental sound sources using MFCC features and an HMM (Hidden Markov Model) classifier, and reported that the highest accuracy, 92.27%, was obtained when the number of HMM states was set between 11 and 15. Finally, A. Dufaux et al., in "Automatic Sound Detection and Recognition for Noisy Environment," used uniform-band spectral energy as the feature for six impulsive sounds and evaluated two classifiers, GMM (Gaussian Mixture Models) and HMM (Hidden Markov Models). They obtained accuracies of 86% and 91%, respectively, so the HMM classified somewhat better than the GMM. These results, however, come from simulation-stage studies conducted in laboratory environments. Although the reported classification accuracy is good, the methods still fall well short of what is needed for use in real environments.

A Pattern Recognition System for Environmental Sound Classification based on MFCCs and Neural Networks (F. Beritelli and R. Grasso)
Environmental Noise Classification for Context-Aware Applications (Ling Ma, Dan Smith, and Ben Milner)
Automatic Sound Detection and Recognition for Noisy Environment (Alain Dufaux, Laurent Besacier, Michael Ansorge, and Fausto Pellandini)

According to one aspect of the present invention, a sound source classification system can be implemented that accurately classifies the various kinds of sound sources present in a real environment, that is, in real situations.

According to another aspect of the present invention, a real-time sound source classification system can be implemented by increasing the computation speed.

According to one aspect of the present invention, there is provided a sound source classification apparatus comprising: a sound source detection unit that detects a sound source generation interval when a voice signal is generated; a sound source feature extraction unit that extracts feature parameters of the detected sound source; a reference sound source storage unit in which information on one or more reference sound sources is stored; and a sound source classification unit that classifies the voice signal based on the extracted feature parameters and the information on the reference sound sources.

According to one aspect of the present invention, there is provided a sound source classification apparatus in which the sound source detection unit detects a sound source generation interval when the difference between the power of the voice signal and the background noise power is greater than a sound source detection threshold.

According to one aspect of the present invention, there is provided a sound source classification apparatus in which the background noise power is estimated in consideration of its change over time.

According to one aspect of the present invention, there is provided a sound source classification apparatus in which the sound source feature extraction unit extracts the feature parameters of the detected sound source using the MFCC (Mel-Frequency Cepstral Coefficients) method.

According to one aspect of the present invention, there is provided a sound source classification apparatus in which the information on the reference sound sources is a Gaussian mixture model (GMM) of each reference sound source.

According to one aspect of the present invention, there is provided a sound source classification apparatus in which the sound source classification unit computes a Gaussian mixture model that estimates the distribution of the feature parameters of the detected sound source, and classifies the voice signal as the reference sound source whose Gaussian mixture model is most similar to the computed model.

According to one aspect of the present invention, there is provided a sound source classification apparatus in which the information on the reference sound sources includes information on one or more basic sound sources and further includes information on an external sound source, which treats a plurality of sound sources as a single sound source.

According to one aspect of the present invention, there is provided a sound source classification apparatus in which the information on the external sound source is a plurality of Gaussian mixture models.

According to another aspect of the present invention, there is provided a sound source classification method comprising: detecting a sound source generation interval when a voice signal is generated; extracting feature parameters of the detected sound source; and classifying the voice signal based on the extracted feature parameters and information on reference sound sources.

According to another aspect of the present invention, there is provided a sound source classification method in which detecting the sound source generation interval further includes detecting a sound source generation interval when the difference between the power of the voice signal and the background noise power is greater than a sound source detection threshold.

According to another aspect of the present invention, there is provided a sound source classification method in which detecting the sound source generation interval further includes estimating the background noise power in consideration of its change over time.

According to another aspect of the present invention, there is provided a sound source classification method in which the feature parameters of the detected sound source are extracted using the MFCC (Mel-Frequency Cepstral Coefficients) method.

According to another aspect of the present invention, there is provided a sound source classification method in which the information on the reference sound sources is a Gaussian mixture model of each reference sound source.

According to another aspect of the present invention, there is provided a sound source classification method further comprising: computing a Gaussian mixture model that estimates the distribution of the feature parameters of the detected sound source; and classifying the voice signal as the reference sound source whose Gaussian mixture model is most similar to the computed model.

According to another aspect of the present invention, there is provided a sound source classification method in which the information on the reference sound sources includes information on one or more basic sound sources and further includes information on an external sound source, which treats a plurality of sound sources as a single sound source.

According to another aspect of the present invention, there is provided a sound source classification method in which the information on the external sound source is a plurality of Gaussian mixture models.

According to the present invention, sound source classification that was previously feasible only in laboratory environments can be carried out in real environments where many kinds of sound sources are present.

According to the present invention, a sound source classification system with a higher computation speed than the prior art can be implemented.

FIG. 1 is a diagram showing the internal configuration of a sound source classification apparatus 100 according to an embodiment of the present invention.
FIG. 2 is a diagram illustrating the detection of sound source generation intervals.
FIG. 3 is a diagram showing the internal configuration of a sound source feature extraction unit 120 according to an embodiment of the present invention.
FIG. 4 is a flowchart of a sound source classification method according to an embodiment of the present invention.

The following detailed description of the invention refers to the accompanying drawings, which show, by way of illustration, specific embodiments in which the invention may be practiced. These embodiments are described in sufficient detail to enable those skilled in the art to practice the invention. It should be understood that the various embodiments of the invention differ from one another but need not be mutually exclusive. For example, a particular shape, structure, or characteristic described here in connection with one embodiment may be implemented in another embodiment without departing from the spirit and scope of the invention. It should also be understood that the position or arrangement of individual components within each disclosed embodiment may be changed without departing from the spirit and scope of the invention. The following detailed description is therefore not to be taken in a limiting sense, and the scope of the invention, if properly described, is limited only by the appended claims together with the full range of equivalents to which those claims are entitled. In the drawings, like reference numerals refer to the same or similar functions throughout the several views.

FIG. 1 shows the internal configuration of the sound source classification apparatus 100 according to an embodiment of the present invention. The sound source classification apparatus 100 may include a sound source detection unit 110, a sound source feature extraction unit 120, a sound source classification unit 130, and a reference sound source storage unit 140.

The sound source detection unit 110 detects the sound source generation interval (event interval) when a voice signal is generated. To detect this interval, the input sound x(n,t) is first Fourier transformed, giving X(n,f). Here n is the frame index and t is time. The power in the frequency domain for the n-th frame can be obtained as in Equation 1.

Here f_min is the minimum frequency, 300 Hz in one embodiment, and f_max is the maximum frequency, 3.4 kHz in one embodiment. Equation 2 is then used to detect the occurrence of a sound source.

Here P_noise(n) is the background noise power for the n-th frame, and TH is the sound source detection threshold (3 dB). That is, if the difference between the power of the current frame (the sound source signal) and the background noise power is greater than the predetermined threshold, the frame is recognized as part of a sound source generation interval; otherwise it is recognized as an interval in which no sound source occurs.
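The detection rule described above can be sketched in a few lines. Equations 1 and 2 are not reproduced in this text, so the sketch assumes that the frame power is the sum of squared spectral magnitudes over the 300 Hz to 3.4 kHz band, expressed in dB, and that a frame is an event when it exceeds the noise power by more than TH. The function names `band_power_db` and `is_event_frame` are illustrative, not taken from the patent.

```python
import math


def band_power_db(power_spectrum, freqs_hz, f_min=300.0, f_max=3400.0):
    """Band-limited frame power in dB (assumed form of Equation 1).

    power_spectrum holds |X(n, f)|^2 for one frame; freqs_hz holds the
    frequency of each bin. Only bins inside [f_min, f_max] are summed.
    """
    total = sum(p for p, f in zip(power_spectrum, freqs_hz) if f_min <= f <= f_max)
    # A small floor avoids log10(0) on silent frames.
    return 10.0 * math.log10(total + 1e-12)


def is_event_frame(frame_power_db, noise_power_db, threshold_db=3.0):
    """Assumed form of Equation 2: event if P(n) - P_noise(n) > TH."""
    return (frame_power_db - noise_power_db) > threshold_db
```

For example, a frame at 10 dB against a 5 dB noise floor would be flagged as an event, while a frame at 7 dB would not.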

In general, the performance of sound source generation interval detection depends heavily on how accurately the background noise power is estimated. In one embodiment, an adaptive noise estimation method that takes changes over time into account may be used, as shown in Equation 3.

The two estimation factors in Equation 3 may be set to 0.95 and 0.99, respectively, in one embodiment.
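Equation 3 is not reproduced in this text, so the exact update rule is unknown. As a rough illustration of adaptive noise estimation in general, the sketch below uses first-order recursive smoothing and assumes, purely for illustration, that the factor 0.99 applies when the frame power is above the current estimate (slow rise) and 0.95 when it is below (faster fall). This assignment of the two factors, and the name `update_noise_power`, are assumptions and not the patent's formula.

```python
def update_noise_power(prev_noise, frame_power, alpha_fall=0.95, alpha_rise=0.99):
    """One step of recursive background-noise tracking (illustrative only).

    The estimate moves toward the current frame power; a larger smoothing
    factor means a slower update. Which factor applies in which case is an
    assumption made for this sketch.
    """
    alpha = alpha_rise if frame_power > prev_noise else alpha_fall
    return alpha * prev_noise + (1.0 - alpha) * frame_power
```

With these settings a loud transient raises the estimate only slightly, so a short event does not get absorbed into the noise floor.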

FIG. 2 shows the sound source generation intervals detected when a voice signal is generated. The boxed portions labeled events 1, 2, 3, and 4 are the sound source generation intervals detected by the sound source detection unit 110.
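Boxed events like those in FIG. 2 can be obtained by merging consecutive event frames into intervals. The following is a minimal sketch under that assumption; the function name `event_intervals` and the representation of intervals as inclusive frame-index pairs are illustrative choices, not details from the patent.

```python
def event_intervals(flags):
    """Group consecutive truthy per-frame flags into (start, end) index pairs.

    Both indices are inclusive frame indices.
    """
    intervals = []
    start = None
    for i, flag in enumerate(flags):
        if flag and start is None:
            start = i                      # an event begins
        elif not flag and start is not None:
            intervals.append((start, i - 1))  # the event ended on the previous frame
            start = None
    if start is not None:                  # event still open at the end of the signal
        intervals.append((start, len(flags) - 1))
    return intervals
```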

The sound source feature extraction unit 120 extracts the feature parameters of the detected sound source. In one embodiment, it may do so using the MFCC (Mel-Frequency Cepstral Coefficients) method. MFCCs are the coefficients that make up a mel-frequency cepstrum (MFC), and they are derived from a cepstral representation of the audio signal. The MFC differs from an ordinary cepstrum in that its frequency bands are equally spaced on the mel scale rather than linearly spaced. The mel scale approximates human hearing more closely, and this nonlinear frequency warping allows a better representation of sound, for example in audio compression. MFCCs perform well compared with other feature parameters because they represent the distribution of energy over frequency well.

FIG. 3 shows the internal configuration of the sound source feature extraction unit 120 according to an embodiment of the present invention. In time order, the input data of the sound source arrives first (210). The signal is then divided into frames by applying a window (220). In one embodiment the window may be a Hamming window, but any window suitable for framing may be used. Each frame is then transformed into the frequency domain by a Discrete Fourier Transform (DFT) (230). The frequency-domain values are then passed through a triangular mel-scale filter bank (240). The mel-scale function is given by Equation 4, where f denotes frequency.
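Equation 4 is not reproduced in this text. A commonly used mel-scale formula is Mel(f) = 2595 log10(1 + f/700), and the sketch below assumes that form. It also shows one conventional way to build triangular filters whose edges are equally spaced on the mel scale. The function names, the 16 kHz sample rate, and the 512-point DFT in the usage note are assumptions for illustration.

```python
import math


def hz_to_mel(f_hz):
    """Common mel-scale formula (assumed form of Equation 4)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def triangular_filterbank(num_filters, n_fft, sample_rate, f_low=0.0, f_high=None):
    """Return num_filters rows of weights over the n_fft // 2 + 1 DFT bins."""
    f_high = sample_rate / 2.0 if f_high is None else f_high
    mel_low, mel_high = hz_to_mel(f_low), hz_to_mel(f_high)
    # num_filters + 2 edge frequencies, equally spaced on the mel scale.
    edges_hz = [
        mel_to_hz(mel_low + (mel_high - mel_low) * i / (num_filters + 1))
        for i in range(num_filters + 2)
    ]
    bin_hz = [k * sample_rate / n_fft for k in range(n_fft // 2 + 1)]
    bank = []
    for m in range(1, num_filters + 1):
        left, center, right = edges_hz[m - 1], edges_hz[m], edges_hz[m + 1]
        row = []
        for f in bin_hz:
            if left < f <= center:
                row.append((f - left) / (center - left))    # rising slope
            elif center < f < right:
                row.append((right - f) / (right - center))  # falling slope
            else:
                row.append(0.0)
        bank.append(row)
    return bank
```

Calling `triangular_filterbank(24, 512, 16000)` yields 24 filters over 257 bins; the band energy for each filter is the weighted sum of the frame's power spectrum with the corresponding row.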

A log operation is then applied to the energy of each band output by the mel-scale filter bank (250). Finally, the Discrete Cosine Transform (DCT) is applied to the per-band log energies as in Equation 5 to obtain the final MFCC feature parameters (260).

In Equation 5, the energy term is the energy value of each filter bank band, N is the number of filter banks, and m is the MFCC order. In one embodiment, N and m may be set to 24 and 12, respectively.
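The log and DCT steps can be sketched as follows. Equation 5 is not reproduced in this text, so the sketch assumes an unnormalized DCT-II over the log band energies, which is a common MFCC convention; scaling factors vary between implementations. The function name `mfcc_from_band_energies` is illustrative.

```python
import math


def mfcc_from_band_energies(band_energies, num_coeffs=12):
    """Log band energies followed by a DCT-II (assumed form of Equation 5).

    band_energies: the N filter bank outputs for one frame (N = 24 in the
    embodiment). Returns coefficients of order 1 .. num_coeffs.
    """
    n = len(band_energies)
    # A small floor avoids log(0) for empty bands.
    log_e = [math.log(e + 1e-12) for e in band_energies]
    return [
        sum(log_e[k] * math.cos(m * (k + 0.5) * math.pi / n) for k in range(n))
        for m in range(1, num_coeffs + 1)
    ]
```

Because the order-0 term is skipped, a frame with identical energy in every band yields coefficients that are all close to zero: the coefficients describe the shape of the spectrum, not its overall level.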

The reference sound source storage unit 140 stores information on one or more reference sound sources. In one embodiment, the reference sound source information may be a Gaussian mixture model (GMM). A GMM takes the form of a continuous HMM (CHMM) with a single state whose output probability density function is a mixture of Gaussian densities. One advantage of the GMM is that it models a density of arbitrary shape with a smooth approximation. A unimodal Gaussian sound source model represents the feature distribution of each sound source with a mean vector and a covariance, while a VQ-distortion model represents the distribution with a discrete set of feature vectors. The GMM combines the characteristics of these two models: by using a discrete set of Gaussian functions, each with its own mean and covariance, it produces a model that represents the features of each sound source better. Gaussian mixture densities are known to give very good sound source classification performance at a lower computational cost than other reference sound source models. A Gaussian mixture density is a weighted sum of M component densities and, in one embodiment, has the form of Equation 6.

Here x is a d-dimensional random vector. In one embodiment, referring to Equation 7, N(x; μ_i, Σ_i) denotes a component density and c_i a mixture weight. Each component density is a d-dimensional Gaussian function with mean μ_i and covariance Σ_i.

In one embodiment, the mixture weights in Equation 6 satisfy the constraint in Equation 8.

A Gaussian mixture density is specified by the mixture weight, covariance matrix, and mean vector of each component density, and is expressed as in Equation 9.
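Equations 6 through 9 are not reproduced in this text. Under the definitions given above (M component densities, mixture weights c_i, mean vectors μ_i, covariance matrices Σ_i, and a d-dimensional vector x), the standard Gaussian mixture forms would read as follows. This is a reconstruction from those definitions, and the mapping to equation numbers is inferred from the order in which the text refers to them.

```latex
\begin{align}
p(x \mid \lambda) &= \sum_{i=1}^{M} c_i \, N(x; \mu_i, \Sigma_i) \tag{6} \\
N(x; \mu_i, \Sigma_i) &= \frac{1}{(2\pi)^{d/2} \, |\Sigma_i|^{1/2}}
  \exp\!\left( -\tfrac{1}{2} (x - \mu_i)^{\top} \Sigma_i^{-1} (x - \mu_i) \right) \tag{7} \\
\sum_{i=1}^{M} c_i &= 1 \tag{8} \\
\lambda &= \{ c_i, \mu_i, \Sigma_i \}, \quad i = 1, \dots, M \tag{9}
\end{align}
```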

The sound source classification unit 130 classifies the voice signal based on the extracted feature parameters and the information on the reference sound sources.

In one embodiment, the sound source classification unit 130 computes a Gaussian mixture model that estimates the distribution of the feature parameters of the detected sound source, and classifies the voice signal as the reference sound source whose Gaussian mixture model is most similar to the computed model.

In one embodiment, the similarity of the Gaussian mixture models can be calculated from Equation 10.
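Equation 10 is not reproduced in this text. One common similarity score for GMM classifiers is the log-likelihood of the observed feature vectors under each reference model, and the sketch below assumes that choice. It also assumes diagonal covariance matrices to keep the example short; the patent does not state which covariance structure is used. All function names are illustrative.

```python
import math


def log_gaussian_diag(x, mean, var):
    """Log of a d-dimensional Gaussian density with diagonal covariance."""
    d = len(x)
    log_det = sum(math.log(v) for v in var)
    maha = sum((xi - mi) ** 2 / vi for xi, mi, vi in zip(x, mean, var))
    return -0.5 * (d * math.log(2.0 * math.pi) + log_det + maha)


def gmm_log_density(x, weights, means, variances):
    """log p(x | lambda) for a mixture, computed with the log-sum-exp trick."""
    terms = [
        math.log(c) + log_gaussian_diag(x, m, v)
        for c, m, v in zip(weights, means, variances)
    ]
    peak = max(terms)
    return peak + math.log(sum(math.exp(t - peak) for t in terms))


def sequence_log_likelihood(frames, weights, means, variances):
    """Sum of per-frame log densities, treating frames as independent."""
    return sum(gmm_log_density(x, weights, means, variances) for x in frames)
```

Working in the log domain avoids numerical underflow when many frames, each with a small density value, are combined.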

Here the sequence of T training vectors is written X = {x1, x2, ..., xT}, and the parameters λ of the Gaussian mixture model are estimated from it. Because sound source classification must select one of the registered sound sources as the identified source, the number of candidates equals the number of registered sound sources (an N-class classification problem). Following Bayes' rule, classification can use the ML (Maximum Likelihood) method, which finds the sound source i* whose model λ_i maximizes the posterior probability P(λ_i|X) of Equation 11 among the N sound sources.

Because no prior information is available, in one embodiment the prior probability P(λ_i) can be expressed as in Equation 12.

With this prior, the posterior probability is maximized, and the identified sound source can be finally determined using Equation 13.
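Equations 11 through 13 are not reproduced in this text. Assuming the prior of Equation 12 is uniform over the N models, which is the usual choice when no prior information is available, maximizing the posterior is equivalent to maximizing the likelihood. The sketch below illustrates that equivalence; the function names and the example scores are hypothetical.

```python
import math


def classify_max_likelihood(log_likelihoods):
    """Return the model name with the largest log p(X | lambda_i)."""
    return max(log_likelihoods, key=log_likelihoods.get)


def posteriors_uniform_prior(log_likelihoods):
    """Posterior P(lambda_i | X) under an assumed uniform prior.

    The uniform prior cancels out, so the posterior is the normalized
    likelihood. Subtracting the peak keeps exp() numerically stable.
    """
    peak = max(log_likelihoods.values())
    unnorm = {name: math.exp(v - peak) for name, v in log_likelihoods.items()}
    total = sum(unnorm.values())
    return {name: u / total for name, u in unnorm.items()}


# Hypothetical log-likelihood scores for three reference models.
scores = {"alarm": -120.5, "door_bell": -98.2, "footstep": -140.0}
best = classify_max_likelihood(scores)
```

Both functions pick the same model, because normalizing by a constant does not change which score is largest.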

In one embodiment, the reference sound source information includes information on one or more basic sound sources and may further include information on an external sound source, which treats a plurality of sound sources as a single sound source. In one embodiment, a total of nine sound sources may be configured as classification targets: eight basic sound sources (alarm, door bell, door knock, explosion, footstep, glass smash, door open & close) and one external sound. Here the external sound means any sound other than the basic sound sources that occurs in a typical indoor environment, such as animal sounds (dogs, cats, and so on), natural sounds (rain, wind, and so on), and human sounds (laughter, applause, and so on). In real situations, sounds other than the predefined ones also occur. To classify accurately in such situations, the various sounds other than the basic sound sources are grouped together and treated as a single external sound source. In one embodiment, in view of the diversity of external sounds, three Gaussian mixture models rather than one may be built for the external sound source and used as reference models during classification. The number of models is not limited to three; more than three models are also possible.

For example, when one of the eight basic sound sources occurs, it is most similar to the Gaussian mixture model of the corresponding reference sound source and can be classified as that source. When a sound other than the basic sound sources occurs, that is, an external sound, there are three GMMs to compare against, so the probability that one of them gives the highest similarity increases, and the sound is finally classified as the external sound source.
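The idea described above, several reference models sharing one output label, can be sketched as a lookup from the best-scoring model to its class. The model names, the label mapping, and the scores below are hypothetical values chosen for illustration.

```python
def classify_with_external(log_likelihoods, model_to_label):
    """Pick the best-scoring model, then report the label it belongs to.

    Several models may map to the same label, so a sound is classified as
    'external' if any one of the external models scores highest.
    """
    best_model = max(log_likelihoods, key=log_likelihoods.get)
    return model_to_label[best_model]


# Hypothetical setup: two basic sources plus three external-sound models.
model_to_label = {
    "alarm": "alarm",
    "door_bell": "door_bell",
    "external_1": "external",
    "external_2": "external",
    "external_3": "external",
}
scores = {
    "alarm": -130.0,
    "door_bell": -125.0,
    "external_1": -140.0,
    "external_2": -101.0,
    "external_3": -150.0,
}
label = classify_with_external(scores, model_to_label)
```

Here only the second external model fits the input well, yet the result is still the single label `external`, which is how three models can cover a more varied class than one model could.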

FIG. 4 is a flowchart of a sound source classification method according to an embodiment of the present invention.

First, the sound source classification apparatus monitors (S310) whether a sound source occurs (S320). When a sound source occurs, its generation interval is detected (S330). In one embodiment, a sound source generation interval is detected when the difference between the power of the voice signal and the background noise power is greater than the sound source detection threshold. In another embodiment, the background noise power may be estimated in consideration of its change over time. The feature parameters of the detected sound source, typically MFCC values, are then extracted (S340). A Gaussian mixture model (GMM) that best represents the distribution of the extracted MFCCs is estimated (S350) and compared with the Gaussian mixture models of the reference sound sources (S360). If the sound is similar to a basic sound source (S370), it is classified as that basic sound source (S380); otherwise it is classified as the external sound source (S390).

Although the present invention has been described above with reference to specific details such as particular components, and to limited embodiments and drawings, these are provided only to aid a more general understanding of the invention. The invention is not limited to these embodiments, and those of ordinary skill in the art to which the invention pertains may make various modifications and variations based on this description.

Therefore, the spirit of the present invention should not be limited to the embodiments described above. The claims set out below, and everything that is an equal or equivalent modification of those claims, fall within the scope of the spirit of the invention.

100: sound source classification apparatus
110: sound source detection unit
120: sound source feature extraction unit
130: sound source classification unit
140: reference sound source storage unit
210: sound source input
220: window
230: DFT
240: Mel filter bank
250: Log
260: DCT
270: MFCC

Claims

1. A sound source classification apparatus comprising:
a sound source detection unit that detects a sound source generation interval when a voice signal is generated;
a sound source feature extraction unit that extracts feature parameters of the detected sound source;
a reference sound source storage unit in which information on one or more reference sound sources is stored; and
a sound source classification unit that classifies the voice signal based on the extracted feature parameters and the information on the reference sound sources,
wherein the sound source feature extraction unit extracts the feature parameters of the detected sound source using a Mel-Frequency Cepstral Coefficients (MFCC) method.

2. The apparatus of claim 1,
wherein the sound source detection unit
detects the sound source generation interval when the difference between the power of the voice signal and the background noise power is greater than a sound source detection threshold.

3. The apparatus of claim 2,
wherein the background noise power is estimated in consideration of its change over time.

4. (Deleted)

5. The apparatus of claim 1,
wherein the information on the reference sound sources is a Gaussian mixture model of each reference sound source.

6. The apparatus of claim 5,
wherein the sound source classification unit
computes a Gaussian mixture model that estimates the distribution of the feature parameters of the detected sound source, and classifies the voice signal as the reference sound source whose Gaussian mixture model is most similar to the computed Gaussian mixture model.

7. The apparatus of claim 1,
wherein the information on the reference sound sources
includes information on one or more basic sound sources, and
further includes information on an external sound source that treats a plurality of sound sources as a single sound source.

8. The apparatus of claim 7,
wherein the information on the external sound source is a plurality of Gaussian mixture models.

9. A sound source classification method comprising:
detecting a sound source generation interval when a voice signal is generated;
extracting feature parameters of the detected sound source; and
classifying the voice signal based on the extracted feature parameters and information on reference sound sources,
wherein the extracting of the feature parameters of the detected sound source uses a Mel-Frequency Cepstral Coefficients (MFCC) method.

10. The method of claim 9,
wherein detecting the sound source generation interval
further comprises detecting the sound source generation interval when the difference between the power of the voice signal and the background noise power is greater than a sound source detection threshold.

11. The method of claim 10,
wherein detecting the sound source generation interval
further comprises estimating the background noise power in consideration of its change over time.

12. (Deleted)

13. The method of claim 9,
wherein the information on the reference sound sources is a Gaussian mixture model of each reference sound source.

14. The method of claim 13, further comprising:
computing a Gaussian mixture model that estimates the distribution of the feature parameters of the detected sound source; and
classifying the voice signal as the reference sound source whose Gaussian mixture model is most similar to the computed Gaussian mixture model.

15. The method of claim 9,
wherein the information on the reference sound sources
includes information on one or more basic sound sources, and
further includes information on an external sound source that treats a plurality of sound sources as a single sound source.

16. The method of claim 15,
wherein the information on the external sound source is a plurality of Gaussian mixture models.