KR101736466B1

KR101736466B1 - Apparatus and Method for context recognition based on acoustic information

Info

Publication number: KR101736466B1
Application number: KR1020120019178A
Authority: KR
Inventors: 홍하나; 임정은; 고한석; 김광윤; 구본화
Original assignee: 한화테크윈 주식회사; 고려대학교 산학협력단
Priority date: 2012-02-24
Filing date: 2012-02-24
Publication date: 2017-05-16
Also published as: KR20130097490A

Abstract

The present invention discloses an acoustic information-based situation recognition apparatus and method.
The acoustic information-based situation recognition apparatus of the present invention may include: a feature extraction unit that extracts MFCC features and Timbre features from an input acoustic signal as acoustic features; a feature evaluation unit that evaluates the acoustic features by calculating the similarity between the MFCC and Timbre features of the acoustic signal and reference acoustic features obtained in advance by training for a plurality of acoustic events, including a plurality of vocal events and non-vocal events; and an acoustic classification unit that classifies the acoustic signal as one of the plurality of acoustic events by a hierarchical approach, based on the similarity between the acoustic features of the acoustic signal and the reference acoustic features.

Description

The present invention relates to an acoustic information-based situation recognition apparatus and method.

CCTV has recently been widely adopted in streets and buildings for security and personal safety. A situation monitoring system installed in an enclosed place, such as an elevator in a high-rise building, usually detects abnormal events using video from a surveillance camera, and the recorded video is used as evidence of crime. Because such existing systems are camera-based, they require video-based situation analysis. When video analysis alone is used, an abnormal situation is difficult to recognize unless there is information about a distinctive movement.

Korean Patent Registration No. 1070902 discloses a monitoring method that determines whether an event has occurred in a surveillance area according to whether the similarity between a voice recognized from collected sound and a previously input voice is equal to or greater than a threshold.

The present invention provides an acoustic recognition method and apparatus that can increase the recognition rate of abnormal situations by using acoustic information.

An acoustic information-based situation recognition method according to a preferred embodiment of the present invention may include: extracting MFCC features and Timbre features from an input acoustic signal as acoustic features; evaluating the acoustic features by calculating the similarity between the MFCC and Timbre features of the acoustic signal and reference acoustic features obtained in advance by training for a plurality of acoustic events, including a plurality of vocal events and non-vocal events; and classifying the acoustic signal as one of the plurality of acoustic events by a hierarchical approach, based on the similarity between the acoustic features of the acoustic signal and the reference acoustic features.

An acoustic information-based situation recognition apparatus according to a preferred embodiment of the present invention may include: a feature extraction unit that extracts MFCC features and Timbre features from an input acoustic signal as acoustic features; a feature evaluation unit that evaluates the acoustic features by calculating the similarity between the MFCC and Timbre features of the acoustic signal and reference acoustic features obtained in advance by training for a plurality of acoustic events, including a plurality of vocal events and non-vocal events; and an acoustic classification unit that classifies the acoustic signal as one of the plurality of acoustic events by a hierarchical approach, based on the similarity between the acoustic features of the acoustic signal and the reference acoustic features.

By using Timbre features together with MFCC features, rather than MFCC features alone, and by using a hierarchical classifier, the present invention can improve the recognition and classification of acoustic events in a confined space. In addition, the hierarchical classifier can be reconfigured for use in various environments, and can be extended by collecting acoustic data for additional events.

FIG. 1 is a block diagram schematically illustrating an acoustic recognition apparatus according to an embodiment of the present invention.
FIG. 2 is a diagram illustrating a method of obtaining MFCC features according to an embodiment of the present invention.
FIGS. 3 and 4 are diagrams illustrating Timbre features according to an embodiment of the present invention.
FIG. 5 is a block diagram schematically illustrating the configuration of an acoustic classification unit according to an embodiment of the present invention.
FIG. 6 is a flowchart schematically illustrating an acoustic information-based situation recognition method according to an embodiment of the present invention.
FIG. 7 is a block diagram schematically illustrating the configuration of an acoustic information-based situation recognition apparatus according to an embodiment of the present invention.
FIG. 8 is a diagram illustrating an example of classifying abnormal situations based on a video signal according to an embodiment of the present invention.
FIG. 9 is a graph showing the acoustic recognition rates of an embodiment of the present invention and of a comparative example.

Hereinafter, preferred embodiments of the present invention will be described with reference to the accompanying drawings.

FIG. 1 is a block diagram schematically illustrating an acoustic recognition apparatus according to an embodiment of the present invention.

The acoustic recognition apparatus 30 analyzes the acoustic signal to determine whether an abnormal situation has occurred in the surveillance area, and raises an alarm when it determines that the situation is abnormal. Referring to FIG. 1, the acoustic recognition apparatus 30 may include an acoustic event detection unit 301 and a situation determination unit 307. By hierarchically classifying and recognizing the acoustic events that can occur in a confined space such as an elevator or a prison, the acoustic recognition apparatus 30 enables effective situation recognition.

The acoustic event detection unit 301 analyzes the acoustic features extracted from the input acoustic signal and hierarchically recognizes and classifies the event of the acoustic signal. Here, an event denotes the type of the acoustic signal, such as a scream, crying, footsteps, or a crash. The acoustic event detection unit 301 may include a feature extraction unit 302, a feature evaluation unit 303, an acoustic model 304, and an acoustic classification unit 305.

The feature extraction unit 302 extracts acoustic features from the input acoustic signal. The input acoustic signal is, for example, an acoustic frame of 50 ms, and the feature extraction unit 302 extracts the acoustic features from the acoustic frame. The acoustic features include MFCC (Mel-frequency cepstral coefficient) features and Timbre features. The feature extraction unit 302 may extract the MFCC features and the Timbre features sequentially or in parallel.

FIG. 2 is a diagram illustrating a method of obtaining MFCC features according to an embodiment of the present invention. The MFCC feature is a feature vector that reflects human auditory characteristics, which are relatively sensitive to changes in the low-frequency region, and is therefore extracted in finer detail in the low-frequency region than in the high-frequency region. It is used in speech recognition and in acoustic context awareness, that is, recognition using various kinds of acoustic information.

The feature extraction unit 302 pre-processes the time-domain acoustic signal to boost its high-frequency energy (a), applies a Fourier transform (FFT) to obtain the frequency-domain spectrum (b), maps a bank of triangular filters spaced on the Mel scale onto the spectrum and sums the magnitudes in each band (c), takes the logarithm of the filter bank outputs (d), and applies a discrete cosine transform (e) to obtain the MFCC feature vector (f). A window is applied when extracting the MFCC feature vector; in the embodiment, a 50 ms window is used as an example. Because MFCC extraction is commonly used in the field of speech recognition, a more detailed description is omitted. The MFCC features may be combined with Delta features, which represent the change of the MFCCs over time (the combination is hereinafter referred to collectively as the 'MFCC feature'). In one embodiment of the present invention, a feature vector combining 20th-order MFCC features and 20th-order Delta features may be used.
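
Steps (a) to (f) can be sketched in Python with NumPy as follows. This is a minimal illustration, not the patented implementation: the 16 kHz sampling rate, the 0.97 pre-emphasis coefficient, the Hamming window, the 26 Mel filters, the non-overlapping frames, and the simple first-difference Delta are all assumptions, and the function names are hypothetical. Only the 50 ms window and the 20 + 20 feature layout come from the description.

```python
import numpy as np

def mel_filterbank(n_filters, n_fft, sr):
    """Triangular filters spaced evenly on the Mel scale (step c)."""
    hz_to_mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel_to_hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        lo, mid, hi = bins[i - 1], bins[i], bins[i + 1]
        for b in range(lo, mid):          # rising edge of the triangle
            fb[i - 1, b] = (b - lo) / max(mid - lo, 1)
        for b in range(mid, hi):          # falling edge of the triangle
            fb[i - 1, b] = (hi - b) / max(hi - mid, 1)
    return fb

def dct_matrix(n_out, n_in):
    """Unnormalized DCT-II basis (step e)."""
    n = np.arange(n_out)[:, None]
    m = np.arange(n_in)[None, :]
    return np.cos(np.pi * n * (m + 0.5) / n_in)

def mfcc_with_delta(signal, sr=16000, frame_ms=50, n_mfcc=20, n_filters=26):
    signal = np.asarray(signal, dtype=float)
    # (a) pre-emphasis boosts high-frequency energy (0.97 is an assumed value)
    x = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    # split into non-overlapping 50 ms frames and apply a window
    frame_len = int(sr * frame_ms / 1000)
    n_frames = len(x) // frame_len
    frames = x[: n_frames * frame_len].reshape(n_frames, frame_len)
    frames = frames * np.hamming(frame_len)
    n_fft = 1 << (frame_len - 1).bit_length()   # next power of two
    # (b) magnitude spectrum
    spec = np.abs(np.fft.rfft(frames, n=n_fft, axis=1))
    # (c) sum magnitudes per Mel band, (d) take the logarithm
    log_bands = np.log(spec @ mel_filterbank(n_filters, n_fft, sr).T + 1e-10)
    # (e) discrete cosine transform gives (f) the MFCC vector per frame
    mfcc = log_bands @ dct_matrix(n_mfcc, n_filters).T
    # Delta: frame-to-frame change of the MFCCs (first frame gets zeros)
    delta = np.diff(mfcc, axis=0, prepend=mfcc[:1])
    return np.hstack([mfcc, delta])
```

Calling `mfcc_with_delta` on one second of 16 kHz audio yields 20 frames of 40 values each: 20 MFCCs followed by 20 Deltas.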

Because the MFCC feature reflects human auditory characteristics, it always delivers a certain level of performance in acoustic event recognition, but it may fail to distinguish similar acoustic events, such as a crash and a door closing in an elevator environment. The feature extraction unit 302 therefore additionally extracts Timbre features as acoustic features to improve performance. The Timbre feature carries tone-color information that makes it possible to distinguish acoustic signals from different sound sources (instruments or voices) even when they have the same level and loudness. Using Timbre features in combination with MFCC features yields more robust acoustic recognition. The Timbre features may include brightness, roughness, zero-crossing rate, and roll-off features, but embodiments of the present invention are not limited thereto.

The roughness feature expresses the degree of perceived roughness that arises when the frequencies of a pair of sinusoids in the acoustic signal are close to each other, and is extracted by summing the values over multiple sinusoid pairs. The zero-crossing rate feature expresses how often the acoustic signal crosses zero. The roll-off and brightness features estimate the amount of high-frequency content in the acoustic signal and reflect the difference in the distribution of energy across frequency bands between abnormal sounds, such as screams and crashes, and normal sounds. As shown in FIG. 3, the roll-off feature indicates how strongly the energy of the acoustic signal is concentrated at low frequencies over the whole frequency range, and may be expressed as the frequency below which a predetermined proportion of the total energy lies (about 2200 Hz in the example of FIG. 3). As shown in FIG. 4, the brightness feature indicates the share of energy in the high-frequency band above a specific frequency (1500 Hz in the example of FIG. 4).
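
Three of these Timbre features can be computed per frame along the following lines. The sketch assumes a roll-off ratio of 0.85, which is a common default and not a value stated in the description. The 1500 Hz brightness cutoff is taken from the FIG. 4 example. Roughness is left out because the description does not give its per-pair formula. All function names are hypothetical.

```python
import numpy as np

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    signs = np.signbit(frame)
    return np.mean(signs[1:] != signs[:-1])

def spectral_energy(frame, sr):
    """Return (bin frequencies in Hz, energy per FFT bin) for one frame."""
    energy = np.abs(np.fft.rfft(frame)) ** 2
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
    return freqs, energy

def rolloff(frame, sr, ratio=0.85):
    """Lowest frequency below which `ratio` of the total energy lies."""
    freqs, energy = spectral_energy(frame, sr)
    cumulative = np.cumsum(energy)
    idx = np.searchsorted(cumulative, ratio * cumulative[-1])
    return freqs[min(idx, len(freqs) - 1)]

def brightness(frame, sr, cutoff_hz=1500.0):
    """Share of the total energy at or above `cutoff_hz`."""
    freqs, energy = spectral_energy(frame, sr)
    total = energy.sum()
    return energy[freqs >= cutoff_hz].sum() / total if total > 0 else 0.0
```

A low-pitched tone gives a brightness near 0 and a low roll-off frequency, while a sound dominated by high frequencies, such as a scream, pushes both values up.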

The feature evaluation unit 303 evaluates the acoustic features by calculating the similarity between the acoustic features of the acoustic signal and the reference acoustic features. The reference acoustic features are obtained in advance by training for a plurality of defined acoustic events. The acoustic events include a plurality of vocal events and a plurality of non-vocal events.

The acoustic model 304 is a database of reference acoustic features and includes reference MFCC features and reference Timbre features.

The reference Timbre feature is a Timbre feature extracted from a training sound database. The reference MFCC feature is an acoustic event model (λ_E) obtained by using the MFCC features (X) extracted from the training sound database to update the means (m_i), covariance matrices (R_i), and weights (α_i) of the Gaussian mixture model (GMM) of Equation (1) below until the model is sufficiently trained. Here, k is the number of Gaussian distributions in the mixture, and d is the dimension of each Gaussian distribution, that is, the dimension of the MFCC feature (X).

...(1)
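
Equation (1) itself is not reproduced in this text. Assuming it is the standard Gaussian mixture density, which is what the symbols defined above (α_i, m_i, R_i, k, d) suggest, it would take the following form. This is a reconstruction from those definitions, not a copy of the original equation.

```latex
p(X \mid \lambda_E) = \sum_{i=1}^{k} \alpha_i \,
  \frac{1}{(2\pi)^{d/2}\,\lvert R_i \rvert^{1/2}}
  \exp\!\left( -\tfrac{1}{2}\,(X - m_i)^{\top} R_i^{-1} (X - m_i) \right),
\qquad \sum_{i=1}^{k} \alpha_i = 1
```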

The acoustic classification unit 305 classifies the acoustic signal as one of the plurality of acoustic events by a hierarchical approach, based on the similarity between the acoustic features of the acoustic signal and the reference acoustic features.

Sounds that can commonly occur fall into two broad categories: vocal sounds, which are produced by the human throat, and non-vocal sounds, which are all other sounds. Taking an elevator as an example, vocal sounds can be classified into events such as conversation, scream, crying, and announcement, and non-vocal sounds into events such as door opening/closing, footsteps, empty elevator, alarm bell, and crash. The vocal and non-vocal events are of course not limited to these types and may be defined in various ways according to the surveillance area and the system design.

The acoustic classification unit 305 first classifies the acoustic signal as vocal or non-vocal, and then classifies it as one of the plurality of vocal events if it is vocal, or as one of the non-vocal events if it is non-vocal.

FIG. 5 is a block diagram schematically illustrating the configuration of the acoustic classification unit 305 according to an embodiment of the present invention. The acoustic classification unit 305 may include a first classifier 315, a second classifier 335, and a third classifier 355. Each classifier uses the acoustic features best suited to it and can be reconfigured according to the environment in which the system is installed. Depending on how the classifiers are configured, the system can be applied to various places, and non-vocal sounds can also be subdivided into various events. Each classifier can also be refined individually through event training according to the installation environment.

The first classifier 315 classifies the acoustic signal as vocal or non-vocal on the basis of the MFCC feature (X) of the acoustic signal. As in Equation (2) below, the first classifier 315 selects the acoustic event model for which the likelihood of the MFCC feature (X) is greatest, and decides between vocal and non-vocal according to the selected acoustic event model. The first classifier 315 outputs the acoustic signal to the second classifier 335 if it is judged to be vocal, and to the third classifier 355 if it is judged to be non-vocal.

...(2)
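
Equation (2) is not reproduced in this text, but the surrounding description says it selects the acoustic event model with the greatest likelihood for X. A minimal sketch of that selection is shown below, assuming full-covariance GMMs stored as `(weights, means, covariances)` tuples and assuming the likelihood of a multi-frame input is the product over frames, so that log-likelihoods are summed. The function names and the model layout are hypothetical.

```python
import numpy as np

def gmm_log_likelihood(X, weights, means, covs):
    """Sum over frames of log p(x | lambda) for a full-covariance GMM."""
    X = np.atleast_2d(X)
    d = X.shape[1]
    per_component = []
    for a, m, R in zip(weights, means, covs):
        diff = X - m
        _, logdet = np.linalg.slogdet(R)
        # squared Mahalanobis distance of every frame to this component
        maha = np.einsum("nd,nd->n", diff @ np.linalg.inv(R), diff)
        per_component.append(
            np.log(a) - 0.5 * (d * np.log(2 * np.pi) + logdet + maha)
        )
    # log-sum-exp over components, then sum over frames
    return float(np.logaddexp.reduce(np.array(per_component), axis=0).sum())

def most_likely_event(X, models):
    """Return the event name whose GMM gives X the highest likelihood."""
    return max(models, key=lambda name: gmm_log_likelihood(X, *models[name]))
```

The first classifier would then look up whether the returned event belongs to the vocal or the non-vocal group and route the signal accordingly.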

The second classifier 335 classifies an acoustic signal classified as vocal into one of the vocal events on the basis of the MFCC feature (X) and a Timbre feature of the acoustic signal. The Timbre feature used here may be the brightness feature. The second classifier 335 finds the acoustic event corresponding to the acoustic event model (λ_E) with the maximum likelihood, as in Equation (2), and finds the acoustic event whose reference brightness feature is most similar to the brightness feature of the acoustic signal, thereby classifying the acoustic signal as one of the vocal events. In doing so, the second classifier 335 may exclude the reference acoustic features of the non-vocal events and use only the likelihoods calculated for the acoustic event models (λ_E) of the vocal events and the results of comparison with the reference brightness features of the vocal events.

For example, when the acoustic event models (λ_E) with the maximum likelihood are the scream model and the crying model, the acoustic signal could be classified as both a scream and crying. If, however, the vocal event with the most similar reference brightness feature is the scream, the acoustic signal is finally classified as a scream.

The third classifier 355 classifies an acoustic signal classified as non-vocal into one of the non-vocal events on the basis of the MFCC feature (X) and Timbre features of the acoustic signal. The Timbre features used here may be the roughness, zero-crossing rate, and roll-off features. The third classifier 355 finds the acoustic event model (λ_E) with the maximum likelihood, as in Equation (2), and finds the acoustic event having at least one reference roughness, zero-crossing rate, or roll-off feature most similar to the corresponding feature of the acoustic signal, thereby classifying the acoustic signal as one of the non-vocal events. In doing so, the third classifier 355 may exclude the reference acoustic features of the vocal events and use only the likelihoods calculated for the acoustic event models (λ_E) of the non-vocal events and the results of comparison with the reference roughness, zero-crossing rate, and roll-off features of the non-vocal events.

For example, when the acoustic event models (λ_E) with the maximum likelihood are the crash model and the door opening/closing model, the acoustic signal could be classified as both a crash and a door opening/closing. If, however, the non-vocal event with the most similar reference roughness, zero-crossing rate, or roll-off feature is the crash, the acoustic signal is finally classified as a crash.

For classification, the second classifier 335 and the third classifier 355 may first use the likelihood from the MFCC feature comparison and then use the result of the Timbre feature comparison, or may proceed in the reverse order; the present invention places no particular restriction on the order.
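
The two-stage decision described above might be sketched as follows. The description does not say how the likelihood and the Timbre comparison are combined numerically, so this sketch makes one assumption explicit: events whose log-likelihood lies within a hypothetical `margin` of the best one in the group count as tied, and the tie is broken by the nearest reference Timbre feature. The event names follow the elevator example; the function and parameter names are illustrative.

```python
import numpy as np

VOCAL = {"conversation", "scream", "crying", "announcement"}
NON_VOCAL = {"door", "footsteps", "empty_elevator", "alarm_bell", "crash"}

def classify_hierarchical(loglik, timbre, ref_timbre, margin=1.0):
    """loglik: event -> log-likelihood of the MFCC feature under that model.
    timbre: Timbre feature vector of the input signal.
    ref_timbre: event -> reference Timbre feature vector."""
    # Stage 1 (first classifier): vocal or non-vocal, from the best model.
    best = max(loglik, key=loglik.get)
    group = VOCAL if best in VOCAL else NON_VOCAL
    # Stage 2 (second or third classifier): only that group's events compete.
    in_group = {e: s for e, s in loglik.items() if e in group}
    top = max(in_group.values())
    candidates = [e for e, s in in_group.items() if top - s <= margin]
    # Near-ties in likelihood are resolved by the closest reference Timbre.
    timbre = np.asarray(timbre, dtype=float)
    return min(
        candidates,
        key=lambda e: np.linalg.norm(timbre - np.asarray(ref_timbre[e], dtype=float)),
    )
```

With near-equal likelihoods for scream and crying, an input whose brightness is closest to the scream reference is labeled a scream, which mirrors the example in the description.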

In the above embodiment, the feature evaluation unit 303 and the acoustic classification unit 305 are provided separately. In another embodiment, the first classifier 315, the second classifier 335, and the third classifier 355 of the acoustic classification unit 305 may be implemented to also perform the function of the feature evaluation unit 303. In that case, each of the first classifier 315, the second classifier 335, and the third classifier 355 has a trained Gaussian mixture model, calculates the likelihood of the MFCC feature, and finds the acoustic event model with the maximum likelihood. In addition, the second classifier 335 may additionally consider the brightness feature among the Timbre features, and the third classifier 355 may additionally consider the roughness, zero-crossing rate, and roll-off features among the Timbre features, to classify the event of the acoustic signal.

The situation determination unit 307 accumulates the acoustic event classification results (recognition results) over a certain period of time. If, among the accumulated results, the proportion of acoustic events that can occur in an abnormal situation (hereinafter, 'abnormal events') is equal to or greater than a threshold, as in Equation (3), it determines that the situation in the current surveillance area, for example inside an elevator, is abnormal and outputs an alarm signal. Taking the acoustic events described above as an example, scream, crying, and crash are classified as abnormal events, and the rest are classified as normal events.

...(3)
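
Equation (3) is not reproduced in this text. From the description it compares the proportion of abnormal events among the accumulated results with a threshold. A minimal sketch follows, assuming a fixed-length sliding window of the most recent results and assuming that no alarm is raised until the window is full. The window length, the threshold value, and the class name are hypothetical.

```python
from collections import deque

ABNORMAL = {"scream", "crying", "crash"}

class SituationJudge:
    def __init__(self, window=20, threshold=0.5):
        # keeps only the most recent `window` classification results
        self.results = deque(maxlen=window)
        self.threshold = threshold

    def update(self, event):
        """Record one classified event; return True if an alarm is due."""
        self.results.append(event)
        if len(self.results) < self.results.maxlen:
            return False  # not enough accumulated results yet
        ratio = sum(e in ABNORMAL for e in self.results) / len(self.results)
        return ratio >= self.threshold
```

Feeding the judge one classification result per frame turns isolated misclassifications into non-events, since a single abnormal label rarely reaches the threshold on its own.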

FIG. 6 is a flowchart schematically illustrating an acoustic information-based method of recognizing an abnormal situation according to an embodiment of the present invention.

When an acoustic signal is input, the acoustic recognition apparatus extracts acoustic features from the acoustic signal (S61). The input acoustic signal is an acoustic frame sampled in units of a predetermined time. The acoustic features include MFCC features and Timbre features. The Timbre features may include brightness, roughness, zero-crossing rate, and roll-off features.

The acoustic recognition apparatus evaluates the acoustic features by calculating the similarity between the reference acoustic features and the acoustic features of the acoustic signal (S63). The reference acoustic features are obtained in advance by training for a plurality of acoustic events, and the acoustic recognition apparatus holds them in a database. The acoustic events include a plurality of vocal events and a plurality of non-vocal events. More specifically, the acoustic recognition apparatus calculates the likelihood between the MFCC feature of the acoustic signal and the training model (acoustic event model) obtained in advance for each of the plurality of acoustic events through a Gaussian mixture model. The acoustic recognition apparatus also compares the reference Timbre feature of each of the plurality of acoustic events, obtained in advance by training, with the Timbre feature of the acoustic signal. The acoustic recognition apparatus defines the plurality of acoustic events and builds the reference acoustic features into a database in advance.

The acoustic recognition apparatus classifies the acoustic signal as one of the plurality of acoustic events by a hierarchical approach, based on the similarity between the acoustic features of the acoustic signal and the reference acoustic features. More specifically, the acoustic recognition apparatus selects the acoustic event corresponding to the maximum likelihood and performs a primary classification of the acoustic signal as vocal or non-vocal (S65).

For the primarily classified acoustic signal, the acoustic recognition apparatus selects the acoustic event corresponding to the maximum likelihood and the acoustic event with the most similar Timbre feature, and performs a secondary classification into one of the plurality of vocal and non-vocal events (S67). For an acoustic signal primarily classified as vocal, the acoustic recognition apparatus selects the vocal event corresponding to the maximum likelihood and the vocal event with the most similar Timbre feature, and classifies the signal as one of the plurality of vocal events. The brightness feature may be used as the Timbre feature here. For an acoustic signal primarily classified as non-vocal, the acoustic recognition apparatus selects the non-vocal event corresponding to the maximum likelihood and the non-vocal event with the most similar Timbre feature, and classifies the signal as one of the plurality of non-vocal events. The roughness, zero-crossing rate, and roll-off features may be used as the Timbre features here.

The acoustic recognition apparatus accumulates the classification results of the acoustic signals input over a certain period of time and, if the proportion of abnormal events in the accumulated results is equal to or greater than a threshold, determines that the situation is abnormal and outputs an alarm signal (S69).

FIG. 7 is a block diagram schematically illustrating the configuration of an acoustic information-based situation recognition apparatus according to an embodiment of the present invention.

Referring to FIG. 7, the acoustic information-based situation recognition apparatus 1 includes an acoustic sensor 2, an audio module 3, a camera 4, a video module 5, a central processing unit 6, and an output device 7.

The acoustic sensor 2 collects acoustic signals generated in the surveillance area. The acoustic sensor 2 is installed in enclosed or rarely frequented places in apartment complexes or buildings, such as elevator interiors, stairways, underground parking lots, senior citizens' centers, playgrounds, and walking trails. Depending on the microphone structure, a dynamic microphone, a condenser microphone, a ribbon microphone, or the like may be used as the acoustic sensor 2; depending on the directivity, a directional microphone, an omnidirectional microphone, a super-directional microphone, or the like may be used.

The audio module 3 extracts acoustic features from the acoustic signal to determine whether an abnormal situation exists. The audio module 3 may be implemented as an independent module in software and/or hardware and may perform the functions of the sound recognition apparatus 30 of FIG. 1. The audio module 3 determines, by analyzing the acoustic signal, whether an abnormal situation has occurred in the surveillance area, and generates an alarm signal when an abnormal situation is determined. Since the sound recognition and abnormal-situation recognition methods of the audio module 3 have been described with reference to FIGS. 1 to 6, a detailed description is omitted.

The camera 4 is a digital or analog camera that captures images of the surveillance area. The acoustic sensor 2 and the camera 4 may be installed separately or integrally, for example with the acoustic sensor 2 built into the camera 4, and one or more of each may be distributed over the surveillance area.

The video module 5 monitors the same space as the audio module 3 and detects abnormal situations independently of the audio module 3. The video module 5 processes and analyzes the video signal to determine whether it corresponds to a preset abnormal situation, and generates an alarm signal when an abnormal situation is determined. The abnormal situations may include camera tampering and the detection of an object in a designated region of the image. FIG. 8 shows examples of abnormal situations due to camera tampering as determined by the video module 5. Referring to FIG. 8, the camera view being covered (a), the camera angle being changed (b), the camera being defocused (c), and spray being applied to the camera view (d) may be classified as abnormal situations.

The central processing unit 6 receives the abnormal-situation determination result from the audio module 3 and the abnormal-situation determination result from the video module 5, makes the final determination of whether an abnormal situation exists, and outputs a warning message and/or sounds an alarm through the output device 7. The output device 7 may include a display and a speaker. As an example, when the central processing unit 6 receives an alarm signal from only one of the audio module 3 and the video module 5, it notifies the operator that the situation needs to be checked. When the central processing unit 6 receives alarm signals from both the audio module 3 and the video module 5, it generates an alarm. When the central processing unit 6 receives an alarm signal from at least one of the audio module 3 and the video module 5, it stores the corresponding acoustic signal and video signal.
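The example decision logic of the central processing unit 6 can be sketched as below. The `Action` enum and the `fuse` function are hypothetical names introduced only for illustration; the sketch follows the example in the paragraph above (one alarm signal leads to operator notification, two lead to an alarm, and the signals are stored whenever at least one alarm signal arrives).

```python
from enum import Enum


class Action(Enum):
    NONE = "none"
    NOTIFY_OPERATOR = "notify_operator"  # one module raised an alarm signal
    RAISE_ALARM = "raise_alarm"          # both modules raised an alarm signal


def fuse(audio_alarm, video_alarm):
    """Combine the two module outputs into (action, store_signals)."""
    if audio_alarm and video_alarm:
        action = Action.RAISE_ALARM
    elif audio_alarm or video_alarm:
        action = Action.NOTIFY_OPERATOR
    else:
        action = Action.NONE
    # The acoustic and video signals are stored if at least one alarm arrived.
    store_signals = audio_alarm or video_alarm
    return action, store_signals
```

Requiring agreement between the two modules before raising an alarm trades some sensitivity for fewer false alarms, while the single-module case still reaches the operator for manual confirmation.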

When an abnormal situation is determined, the central processing unit 6 generates an alarm and can provide an immediate response by linking with an integrated management server or a security control system over a network, including a wired or wireless communication network.

FIG. 9 is a graph showing the sound recognition rate of an embodiment of the present invention and that of a comparative example. Referring to FIG. 9, the recognition rate obtained using the MFCC features together with the Timbre features is higher than the recognition rate obtained using the MFCC features alone. FIG. 9 also shows that the recognition rate improves as the order of the MFCC features increases.

An embodiment of the present invention models acoustic events that can occur in a confined space through effective feature extraction, and hierarchically classifies and recognizes acoustic signals using the trained models, thereby achieving robust recognition performance. In addition, an embodiment of the present invention accumulates frame-level recognition results over a predetermined period and, when the proportion of sounds that can occur in abnormal situations is large among the accumulated results, determines that the current situation is abnormal, thereby reducing misjudgments.

The sound recognition apparatus and sound recognition method of the present invention can be used in surveillance systems and in intelligent services that use acoustic information. For example, they can be used to analyze an input acoustic signal and recognize it by comparison with pre-trained data in order to find the cause of a particular sound, to determine the current location through environmental sound recognition, or in a surveillance system that recognizes gunshots, screams, and the like.

In outdoor environments, surveillance based on video information makes it difficult to build a robust system because of factors such as illumination changes. In contrast, an acoustic-based situation recognition system to which the sound recognition apparatus and sound recognition method of the present invention are applied can form a robust surveillance system by hierarchically classifying the sounds of the outdoor environment and recognizing abnormal situations. Moreover, even if the types of acoustic events to be recognized or the environment change, the overall system configuration can be applied unchanged by modifying only part of the acoustic feature extraction and the parameter values of the Gaussian mixture model, so the system can be used in a variety of ways.

Although the present invention has been described with reference to the embodiments shown in the drawings, these are merely exemplary, and those of ordinary skill in the art will understand that various modifications and other equivalent embodiments are possible. Accordingly, the true scope of technical protection of the present invention should be determined by the technical idea of the appended claims.

Claims

An acoustic information-based situation recognition method comprising:
extracting, as acoustic features, an MFCC feature and a Timbre feature including tone color information from an input acoustic signal;
evaluating the acoustic features by calculating, for a plurality of acoustic events including a plurality of speech events and non-speech events, a similarity between reference acoustic features previously obtained by training and the MFCC feature and the Timbre feature of the acoustic signal; and
classifying the acoustic signal into one of the plurality of acoustic events by selecting an acoustic event corresponding to the reference acoustic feature most similar to the MFCC feature of the acoustic signal and an acoustic event corresponding to the reference acoustic feature most similar to the Timbre feature of the acoustic signal.

The method according to claim 1,
wherein the evaluating of the acoustic features comprises:
calculating, for each of the plurality of acoustic events, a likelihood between a training model previously obtained through a Gaussian mixture model and the MFCC feature of the acoustic signal; and
comparing the Timbre feature of the acoustic signal with a reference Timbre feature of each of the plurality of acoustic events previously obtained by training,
and wherein the classifying of the acoustic signal comprises, by selecting the acoustic event corresponding to the maximum likelihood and the acoustic event having the most similar Timbre feature:
a primary step of classifying the acoustic signal as speech or non-speech; and
a secondary step of classifying the primarily classified acoustic signal into one of the plurality of speech events and non-speech events.

delete

The method according to claim 1, further comprising:
accumulating classification results of acoustic signals input over a predetermined period, and outputting an alarm signal when the ratio of abnormal events in the accumulated results is equal to or greater than a threshold;
comparing video information of an input video signal with video information of abnormal situations, and outputting an alarm signal when the video signal is determined to correspond to an abnormal situation; and
generating an alarm when both the alarm signal for the acoustic signal and the alarm signal for the video signal are output.

delete