KR100937101B1

KR100937101B1 - Emotion Recognizing Method and Apparatus Using Spectral Entropy of Speech Signal

Info

Publication number: KR100937101B1
Application number: KR1020080046544A
Authority: KR
Inventors: 홍광석; 이우석
Original assignee: 성균관대학교산학협력단
Priority date: 2008-05-20
Filing date: 2008-05-20
Publication date: 2010-01-15
Also published as: KR20090120640A

Abstract

The present invention relates to a method and apparatus for emotion recognition using the spectral entropy of a speech signal. The method comprises calculating per-frame spectral entropy values of a speech signal used to build an emotion evaluation model and generating the emotion evaluation model from them, and calculating per-frame spectral entropy values of a speech signal to be evaluated and applying them to the emotion evaluation model to recognize the emotion. By providing this method and an apparatus for it, an improvement in emotion recognition performance can be expected, and the method can also be applied to embedded environments such as computers, mobile communication terminals, and PDAs.

Emotion Recognition, Spectral Entropy

Description

Emotion Recognizing Method and Apparatus Using Spectral Entropy of Speech Signal

The present invention relates to a method and apparatus for emotion recognition using the spectral entropy of a speech signal, and more particularly to a method of recognizing a speaker's emotional state, such as joy, sadness, fear, anger, acceptance, disgust, anticipation, or surprise, more accurately by using delta spectral entropy, mel-frequency spectral entropy, or the like, and to an apparatus for performing the method.

Speech is the most natural of the human means of communication and is a means of conveying information by realizing language. Speech-based communication interfaces between humans and machines have been studied steadily for a long time. Recently, speech information processing technology for handling speech information effectively has made remarkable progress and is increasingly being applied in everyday life.

Such speech information processing technology is classified into speech recognition, speech synthesis, speaker identification and verification, and the like.

Of these, speech recognition is the process in which a computer interprets the spoken language of a person and converts its content into text data. Speech synthesis is a technology in which a machine automatically generates the sound waves of speech. Put simply, the speech of a person selected as a model is recorded and divided into fixed speech units, the units are labeled with codes and stored in a synthesizer, and only the units needed are recombined on instruction to create speech artificially. Speaker verification is a technology that confirms a person's identity from that person's voice information.

In particular, emotion recognition technology aims to implement an interface through which a machine can estimate a person's emotional state, just as a person can tell another person's emotional state from the visual and auditory information used in everyday life. Emotion recognition interfaces are broadly divided into auditory emotion recognition, which recognizes emotion from the speaker's voice, and visual emotion recognition, which recognizes emotion from the speaker's facial expression. The present invention relates to auditory emotion recognition.

Existing Patent No. 10-2002-0026056 (Emotion Recognition in Speech Using Wavelet Transform) divides speech into several subbands using a wavelet filter bank with excellent frequency resolution, and recognizes emotion by extracting the short-time average energy and the short-time zero-crossing rate in each band.

In addition, existing Patent No. 10-2004-0038419 (Emotion Recognition System and Emotion Recognition Method Using Speech) extracts linguistic and non-linguistic parameters to compute the speaker's final emotional state. The linguistic parameters include the mean speech rate, and the non-linguistic parameters include the mean and variance of the pitch.

To extract the emotional information contained in human speech, Fukuda of Japan studied emotion recognition using the tempo and energy of the speech signal, and Moriyama conducted emotion recognition experiments by detecting the envelopes of the pitch and power of the speech signal. Silva also experimented with emotion recognition for English and Spanish using the pitch of the speech signal and an HMM (Hidden Markov Model).

As these cases show, most emotion recognition methods consider energy, pitch, tone of voice, formant frequency, speech rate (duration), speech quality, and the like as the emotion feature parameters contained in a speech signal, and use them to evaluate the emotion of an input speech signal.

However, tone, speech quality, and energy are very sensitive to external environmental factors such as the microphone volume, the state of the telephone line, and the surrounding conditions.

Therefore, to improve the performance of emotion recognition systems, there is a need for a new parameter that can reflect the speaker's emotional state, in addition to the emotion recognition parameters used in the existing technology.

Accordingly, the present invention is intended to solve the problems of the prior art described above, and its object is to provide a system and method that recognize emotion from a speech signal more accurately by using spectral entropy, delta spectral entropy, and mel-frequency spectral entropy, which carry emotional information from the speaker's voice.

An emotion recognition method using spectral entropy values according to one aspect of the present invention includes: calculating per-frame spectral entropy values of a speech signal for generating an emotion evaluation model and generating the emotion evaluation model from them; and receiving a speech signal to be evaluated, calculating its per-frame spectral entropy values, and applying those values to the emotion evaluation model to recognize the emotion of the speech signal to be evaluated.

Calculating the per-frame spectral entropy values of the speech signal may include segmenting the speech signal into frames, emphasizing the high band of each frame, performing spectral normalization of the speech signal, and calculating a per-frame entropy value from the normalized spectral distribution.

In this case, performing spectral normalization of the speech signal may include applying a fast Fourier transform (FFT) to the speech signal, obtaining a power spectrum from the FFT result, and performing a normalization operation on the power spectrum.

Alternatively, performing spectral normalization of the speech signal may include applying a fast Fourier transform to the speech signal, obtaining a power spectrum from the FFT result, computing a delta FFT spectrum from the power spectrum and taking its absolute value, and performing a normalization operation on the absolute value of the delta FFT spectrum.

More preferably, performing spectral normalization of the speech signal may further include applying a mel filter operation to the power spectrum after the power spectrum has been obtained from the FFT result.

Meanwhile, a Hamming window or the like may be used in the steps of segmenting the speech signal into frames and emphasizing the high band.

Generating the emotion evaluation model using a GMM (Gaussian mixture model) may be characterized by estimating the GMM parameters that give the maximum Gaussian mixture distribution value, using an MLE (maximum likelihood estimation) or EM (expectation maximization) algorithm. In addition to the GMM algorithm, generating the emotion evaluation model may use an HMM (Hidden Markov Model) algorithm, an SVM (support vector machine) algorithm, or the like.

Here, applying the per-frame spectral entropy values of the evaluation speech signal to the GMM emotion evaluation model to perform emotion recognition may include obtaining Gaussian mixture distributions from the per-frame spectral entropy values of the evaluation speech signal and the GMM parameters, and selecting the emotion corresponding to the GMM parameters with the largest probability value among the Gaussian mixture distributions.

An emotion recognition apparatus using spectral entropy values according to another aspect of the present invention may include a frame generation unit that segments an input speech signal into frames, a spectral normalization unit that performs power spectrum normalization for each frame of the segmented speech signal, an entropy calculation unit that obtains the entropy value of each frame, and an emotion evaluation model generation unit that generates an emotion evaluation model. The apparatus may further include a speech evaluation unit that applies the per-frame spectral entropy values of an input speech signal to be evaluated to the emotion evaluation model to perform emotion recognition.

The spectral normalization unit may apply a fast Fourier transform to the speech signal, obtain a power spectrum from the FFT result, and then normalize the power spectrum. In this case, the spectral normalization unit may, after obtaining the power spectrum from the FFT result, apply a mel filter operation to the power spectrum and then normalize it.

In this case, the spectral normalization unit may apply a fast Fourier transform to the speech signal to obtain a power spectrum and then obtain the absolute value of the delta FFT spectrum from it. Here too, the spectral normalization unit may apply a mel filter operation to the power spectrum after obtaining the power spectrum from the FFT result.

Meanwhile, the emotion recognition apparatus according to the present invention may further include a high-band emphasis unit that emphasizes the high band using a Hamming window or the like.

In addition, generating the emotion evaluation model using a GMM (Gaussian mixture model) may be characterized by estimating the GMM parameters that give the maximum Gaussian mixture distribution value, using an MLE (maximum likelihood estimation) or EM (expectation maximization) algorithm. In addition to the GMM algorithm, generating the emotion evaluation model may use an HMM (Hidden Markov Model) algorithm, an SVM (support vector machine) algorithm, or the like.

The speech evaluation unit is characterized by obtaining Gaussian mixture distributions from the per-frame spectral entropy values of the speech signal to be evaluated and the GMM parameters, and selecting the emotion corresponding to the GMM parameters with the largest probability value.

Finally, the emotion recognition apparatus according to the present invention is characterized by further including a communication interface for receiving an emotion evaluation model from outside.

As described above, the emotion recognition method and system using the spectral entropy of a speech signal according to the present invention can apply the spectral entropy, delta spectral entropy, and mel-frequency filter bank spectral entropy of the speech signal. Delta mel-frequency filter bank spectral entropy can also be applied, and these features improve emotion recognition performance. Applying the same approach also makes it possible to recognize a speaker's gender and age from speech.

In addition, the emotion recognition method using spectral entropy according to the present invention can be applied not only to a PC environment but also to embedded environments such as mobile communication terminals and PDAs, so emotion recognition can be performed more simply and conveniently.

Hereinafter, the emotion recognition method and apparatus using the spectral entropy of a speech signal according to the present invention are described in detail with reference to the accompanying drawings.

FIG. 1 is a diagram illustrating a method of generating an emotion recognition model for a speech signal according to an embodiment of the present invention.

The emotion recognition apparatus 100 receives a speech signal in order to generate an emotion evaluation model that serves as the reference for evaluating input speech signals (S101).

The emotion recognition apparatus 100 segments the signal into frames (S102). It then performs high-band emphasis for each frame using a Hamming window or the like (S103).
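Steps S102 and S103 can be sketched as follows. This is a minimal illustration rather than the patent's implementation: the function name `frame_and_window`, the 400-sample frame length, and the 160-sample hop are assumptions chosen for a 16 kHz signal, and only the Hamming window named in the text is applied.

```python
import numpy as np

def frame_and_window(signal, frame_len=400, hop=160):
    """Split a 1-D signal into overlapping frames and apply a Hamming window.

    frame_len and hop are illustrative values (25 ms / 10 ms at 16 kHz),
    not values taken from the patent.
    """
    signal = np.asarray(signal, dtype=float)
    if signal.size < frame_len:
        raise ValueError("signal is shorter than one frame")
    n_frames = 1 + (signal.size - frame_len) // hop
    # Index matrix: row n holds the sample indices of frame n.
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return signal[idx] * np.hamming(frame_len)

# One second of synthetic noise standing in for a speech recording.
frames = frame_and_window(np.random.default_rng(0).standard_normal(16000))
print(frames.shape)
```

Each row of the returned array is one windowed frame x(m,n), which is the input expected by the spectral normalization step that follows.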

After the per-frame high-band emphasis, the emotion recognition apparatus 100 performs spectral normalization on the speech signal of each frame (S104). There are several ways in which the apparatus can spectrally normalize the per-frame speech signal. In particular, the emotion recognition apparatus 100 according to the present invention may use one of the spectral normalization methods shown in FIGS. 3 to 6.

Of the spectral normalization methods of FIGS. 3 to 6, the method of FIG. 3 is described first.

FIG. 3 is a diagram illustrating a method of performing spectral normalization using the fast Fourier transform.

First, the emotion recognition apparatus 100 performs a fast Fourier transform, frame by frame, on the signal that has undergone frame segmentation and per-frame high-band emphasis (S301).

In the present invention, the result of the fast Fourier transform is denoted X(i,n). Here X(i,n) is the i-th frequency component of the n-th frame signal and can be expressed by Equation 1 below.

In Equation 1, x(m,n) denotes the m-th sample of the n-th frame of the time-domain speech signal, M denotes the number of FFT points, and N denotes the period.

The emotion recognition apparatus 100 then computes the power spectrum from the FFT result (S302). The result is referred to as the FFT power spectrum. It is denoted S(i,n) and can be obtained from Equation 2 below.

Using the power spectrum computed in S302, the emotion recognition apparatus 100 computes the normalized distribution of the power spectrum (S303). The normalized distribution can be obtained using Equation 3 below.

Here, P[S(i,n)] denotes the normalized distribution of the FFT power spectrum, and S(i,n) denotes the FFT power spectrum. The emotion recognition apparatus 100 can perform spectral normalization as shown in FIG. 3.
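The FIG. 3 procedure (S301 to S303) can be sketched in Python with NumPy. The equation images are not reproduced in this text, so the sketch assumes the common forms S(i,n) = |X(i,n)|^2 and P[S(i,n)] = S(i,n) / Σ_i S(i,n). The function name, the 512-point FFT size, and the uniform fallback for silent frames are likewise assumptions.

```python
import numpy as np

def normalized_power_spectrum(frames, n_fft=512):
    """Return P[S(i,n)], shape (n_frames, n_fft // 2 + 1); each row sums to 1."""
    frames = np.asarray(frames, dtype=float)
    X = np.fft.rfft(frames, n=n_fft, axis=1)   # X(i,n): one row per frame n
    S = np.abs(X) ** 2                         # S(i,n), assumed to be |X(i,n)|^2
    total = S.sum(axis=1, keepdims=True)
    # Silent (all-zero) frames have no spectrum to normalize; as an assumption,
    # they are mapped to a uniform distribution instead of dividing by zero.
    safe_total = np.where(total > 0, total, 1.0)
    return np.where(total > 0, S / safe_total, 1.0 / S.shape[1])

# Three synthetic 400-sample frames standing in for windowed speech frames.
P = normalized_power_spectrum(np.random.default_rng(0).standard_normal((3, 400)))
print(P.shape, P.sum(axis=1))
```

Because each row is non-negative and sums to one, it can be treated as a probability distribution over frequency bins, which is what the entropy step requires.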

The emotion recognition apparatus 100 may also perform spectral normalization, as in FIG. 3, using the methods of FIGS. 4 to 6. The apparatus may select one of these spectral normalization methods in consideration of emotion recognition efficiency and the like. The methods of FIGS. 4 to 6 are described later; the description now returns to the generation of the emotion evaluation model in FIG. 1.

The emotion recognition apparatus 100 calculates the entropy value of each frame from the normalized spectral distribution (S105).

When the spectral normalization of FIG. 3 has been performed, the per-frame entropy is denoted H(n) and can be obtained using Equation 4 below.
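Step S105 can be sketched as below. Since Equation 4 is not reproduced here, the sketch assumes the standard Shannon form H(n) = -Σ_i P[S(i,n)] log P[S(i,n)] with the natural logarithm. The function name `frame_entropy` and the small constant `eps` that guards against log(0) are assumptions.

```python
import numpy as np

def frame_entropy(P, eps=1e-12):
    """Per-frame entropy H(n) of a row-normalized spectrum P (one row per frame)."""
    P = np.asarray(P, dtype=float)
    # eps keeps log() finite for empty bins; 0 * log(eps) still contributes 0.
    return -np.sum(P * np.log(P + eps), axis=1)

flat = np.full((1, 8), 1.0 / 8)   # energy spread evenly over 8 bins
peaked = np.eye(8)[:1]            # all energy in a single bin
print(frame_entropy(flat), frame_entropy(peaked))
```

A flat spectrum gives the maximum entropy (log of the number of bins), while a spectrum concentrated in one bin gives an entropy near zero, so the value summarizes how evenly energy is spread across frequency.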

The emotion recognition apparatus 100 generates an emotion evaluation model using the per-frame entropy values (S106).

A probability model can be generated using the entropy values calculated for each frame and the GMM (Gaussian mixture model) algorithm. Other pattern recognition algorithms, such as an HMM (Hidden Markov Model) or an SVM (support vector machine), can also be applied.

The GMM algorithm can approximate a speech signal as a linear combination of M component distributions and can also represent long segments of signal. The GMM probability distribution is given by the following equation.

Here, b_i(x) denotes the Gaussian probability density function for data x, and p_i denotes the mixture weight. To represent a speech signal with a GMM, three parameters are required: i) mean vectors, ii) covariance matrices, and iii) weights. The set of these three parameters can represent the speech signal of a given speaker or emotion. This set is called the GMM and is given by the following equation.

Here, among the components of the GMM set, p_i is the mixture weight, μ_i is the mean vector, and Σ_i is the covariance matrix. The set of these three parameters represents a Gaussian mixture distribution.
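The two GMM equations referred to above are not reproduced in this text. A form consistent with the surrounding definitions is the standard Gaussian mixture density shown below; it should be read as a reconstruction under that assumption, not as the patent's exact notation. The symbol D (the dimension of x) and the name λ for the parameter set are introduced here for illustration, and M denotes the number of mixture components, not the number of FFT points.

```latex
p(x \mid \lambda) = \sum_{i=1}^{M} p_i \, b_i(x), \qquad \sum_{i=1}^{M} p_i = 1

b_i(x) = \frac{1}{(2\pi)^{D/2} \, |\Sigma_i|^{1/2}}
         \exp\!\left( -\tfrac{1}{2} (x - \mu_i)^{\top} \Sigma_i^{-1} (x - \mu_i) \right)

\lambda = \{\, p_i, \ \mu_i, \ \Sigma_i \,\}, \qquad i = 1, \dots, M
```

When the feature is a single entropy value per frame, D = 1 and each covariance matrix reduces to a scalar variance.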

In the training process, a recognition system using GMMs estimates, for the training data of each emotion, the GMM parameters that give the maximum Gaussian mixture distribution value, using an MLE (maximum likelihood estimation) algorithm, an EM (expectation maximization) algorithm, or the like.
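The training step can be illustrated with a small EM loop for a one-dimensional mixture, fitted to the per-frame entropy values of one emotion. This is a sketch under stated assumptions, not the patent's procedure: the function name `fit_gmm_1d`, the quantile-based initialization, the fixed iteration count, and the variance floor are all choices made here for illustration.

```python
import numpy as np

def fit_gmm_1d(x, n_components=2, n_iter=50):
    """Fit a 1-D Gaussian mixture with EM; returns (weights, means, variances)."""
    x = np.asarray(x, dtype=float)
    k = n_components
    # Assumed initialization: spread the means over the data quantiles.
    means = np.quantile(x, (np.arange(k) + 0.5) / k)
    variances = np.full(k, x.var() + 1e-6)
    weights = np.full(k, 1.0 / k)
    for _ in range(n_iter):
        # E-step: responsibility of each component for each sample.
        diff = x[:, None] - means
        dens = np.exp(-0.5 * diff ** 2 / variances) / np.sqrt(2 * np.pi * variances)
        resp = weights * dens
        resp /= resp.sum(axis=1, keepdims=True) + 1e-300
        # M-step: re-estimate weights, means, and variances.
        nk = resp.sum(axis=0) + 1e-12
        weights = nk / x.size
        means = (resp * x[:, None]).sum(axis=0) / nk
        variances = (resp * (x[:, None] - means) ** 2).sum(axis=0) / nk + 1e-6
    return weights, means, variances

# Synthetic stand-in for the entropy values of one emotion's training frames.
rng = np.random.default_rng(0)
train = np.concatenate([rng.normal(3.0, 0.2, 300), rng.normal(4.5, 0.3, 300)])
print(fit_gmm_1d(train))
```

One such parameter set would be estimated per emotion, giving the collection of emotion models that the recognition stage compares against.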

FIG. 2 is a diagram illustrating a method of recognizing emotion from a speech signal according to another embodiment of the present invention.

The speech emotion recognition method of FIG. 2 is quite similar to the method of generating the emotion evaluation model in FIG. 1, and it presupposes the emotion evaluation model of FIG. 1.

The emotion recognition apparatus 100 receives the speech signal whose emotion is to be recognized (S201). Then, as in FIG. 1, the apparatus segments the input speech signal into frames (S202) and emphasizes the high band of each frame using a Hamming window or the like (S203).

The emotion recognition apparatus 100 applies to the speech signal to be recognized the same spectral normalization that was used when generating the emotion evaluation model (S204). Using the result of S204, the apparatus calculates the per-frame entropy values (S205).

The emotion recognition apparatus 100 obtains a Gaussian mixture distribution for each emotion from the entropy values calculated in S205 and the GMM parameters of the multiple emotions (S206).

After obtaining the Gaussian mixture distributions, the emotion recognition apparatus 100 selects the emotion corresponding to the GMM parameters with the highest probability as the emotion of the speech data (S207). The spectral normalization methods of FIGS. 4 to 6 are described below.
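Steps S206 and S207 can be sketched as scoring the per-frame entropy values against each emotion's GMM and taking the best-scoring emotion. The sketch assumes one-dimensional features and sums log-likelihoods over frames, which treats frames as independent; that scoring rule, the function names, and the model values in the usage lines are assumptions for illustration only.

```python
import numpy as np

def gmm_log_likelihood(x, weights, means, variances):
    """Total log-likelihood of 1-D samples x under a Gaussian mixture."""
    x = np.asarray(x, dtype=float)[:, None]
    weights = np.asarray(weights, dtype=float)
    means = np.asarray(means, dtype=float)
    variances = np.asarray(variances, dtype=float)
    dens = np.exp(-0.5 * (x - means) ** 2 / variances) / np.sqrt(2 * np.pi * variances)
    # Mixture density per frame, then sum of logs over frames.
    return float(np.sum(np.log(dens @ weights + 1e-300)))

def recognize_emotion(entropies, models):
    """Return (best_emotion, scores) for a dict emotion -> (weights, means, variances)."""
    scores = {name: gmm_log_likelihood(entropies, *params)
              for name, params in models.items()}
    return max(scores, key=scores.get), scores

# Hypothetical single-component models; real parameters would come from training.
models = {
    "joy": ([1.0], [5.0], [0.04]),
    "sadness": ([1.0], [3.0], [0.04]),
}
best, scores = recognize_emotion([4.9, 5.1, 5.0], models)
print(best)
```

Working in the log domain avoids numerical underflow when many frame probabilities are multiplied together.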

FIG. 4 is a diagram illustrating a method of performing spectral normalization using the delta fast Fourier transform.

As in FIG. 3, the emotion recognition apparatus 100 performs a fast Fourier transform, frame by frame, on the speech signal that has been segmented into frames and high-band emphasized (S401). The fast Fourier transform of S401 can be performed using Equation 1 above.

The emotion recognition apparatus 100 then computes the FFT power spectrum from the FFT result (S402). This can be performed using Equation 2 above.

The emotion recognition apparatus 100 computes the delta FFT spectrum from the power spectrum values obtained in S402 and then computes the absolute value of the delta FFT spectrum (S403).

The delta FFT spectrum is denoted S'(i,n) and can be computed using the following equation.

The absolute value of the delta FFT spectrum can be expressed by the following equation.

After obtaining the result of Equation 8, the emotion recognition apparatus 100 computes the normalized distribution of the power spectrum. The normalized distribution can be obtained using Equation 9 below.

Using these equations, the emotion recognition apparatus 100 can perform spectral normalization.
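The FIG. 4 variant can be sketched as follows, with one important caveat: the equation defining S'(i,n) is not reproduced in this text, so the direction of the difference is unknown. The sketch therefore exposes it as a parameter. `axis=0` takes the difference between consecutive frames and `axis=1` takes it between adjacent frequency bins; both are assumptions, as are the function name and the uniform fallback for rows whose delta is entirely zero.

```python
import numpy as np

def delta_normalized_spectrum(S, axis=0):
    """Normalize |S'(i,n)| over frequency for a power spectrum S (rows are frames).

    axis=0: difference between consecutive frames (assumed definition).
    axis=1: difference between adjacent frequency bins (alternative assumption).
    """
    S = np.asarray(S, dtype=float)
    D = np.abs(np.diff(S, axis=axis))          # |S'(i,n)|
    total = D.sum(axis=1, keepdims=True)
    safe_total = np.where(total > 0, total, 1.0)
    # Rows with no change at all are mapped to a uniform distribution.
    return np.where(total > 0, D / safe_total, 1.0 / D.shape[1])

S = np.array([[1.0, 2.0, 3.0],
              [2.0, 4.0, 3.0],
              [2.0, 4.0, 3.0]])
print(delta_normalized_spectrum(S, axis=0))
print(delta_normalized_spectrum(S, axis=1))
```

Whichever definition applies, the output rows are again probability distributions, so the same entropy computation used for the plain power spectrum can be applied to them.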

After performing spectral normalization, the emotion recognition apparatus 100 calculates the entropy value of each frame. This corresponds to step S105, and the resulting per-frame entropy value is given by the following equation.

Equation 10 differs from Equation 4 in that the input S(i,n) is replaced by S'(i,n).

Using these per-frame entropy values as the GMM feature vector, the emotion evaluation model generation process of FIG. 1 and the process of recognizing emotion from an input speech signal can be performed.

FIG. 5 is a diagram illustrating a method of performing spectral normalization using the fast Fourier transform and a mel filter.

As in FIGS. 3 and 4, the emotion recognition apparatus 100 performs a fast Fourier transform on the speech signal frame by frame (S501) and then computes the FFT power spectrum (S502).

The emotion recognition apparatus 100 applies the mel frequency spectrum filter bank to the power spectrum obtained in S502 and computes the absolute value of the result (S503). The result is defined as the mel frequency spectrum, M(b,n).

The process of obtaining the mel frequency spectrum, that is, applying the mel frequency spectrum filter bank to the FFT power spectrum, can be performed using the following equation.

Here, V_b(i) is the mel scale, that is, the weight, of the i-th frequency component of the b-th mel filter, and L_b and U_b denote the start and end frequencies of the b-th mel filter.
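Step S503 can be sketched with a triangular mel filter bank. Equation 11 is not reproduced in this text; the sketch assumes the weighted sum M(b,n) = |Σ_{i=L_b}^{U_b} V_b(i) S(i,n)|, which matches the symbol definitions above. The triangular filter shape, the Hz-to-mel formula, the function names, and the values of 20 filters, 512 FFT points, and 16 kHz are assumptions for illustration.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)

def mel_filter_bank(n_filters=20, n_fft=512, sample_rate=16000):
    """Triangular weights V_b(i), shape (n_filters, n_fft // 2 + 1)."""
    n_bins = n_fft // 2 + 1
    # n_filters + 2 edge frequencies, equally spaced on the mel scale.
    edges_hz = mel_to_hz(np.linspace(0.0, hz_to_mel(sample_rate / 2), n_filters + 2))
    bin_hz = np.arange(n_bins) * sample_rate / n_fft
    V = np.zeros((n_filters, n_bins))
    for b in range(n_filters):
        lo, mid, hi = edges_hz[b], edges_hz[b + 1], edges_hz[b + 2]
        rising = (bin_hz - lo) / (mid - lo)
        falling = (hi - bin_hz) / (hi - mid)
        # Weights are zero outside [lo, hi], i.e. outside L_b .. U_b.
        V[b] = np.clip(np.minimum(rising, falling), 0.0, None)
    return V

def mel_spectrum(S, V):
    """M(b,n) for a power spectrum S with one row per frame."""
    return np.abs(np.asarray(S, dtype=float) @ V.T)

V = mel_filter_bank()
M = mel_spectrum(np.ones((3, 257)), V)
print(V.shape, M.shape)
```

The matrix product computes the weighted sum for every filter and frame at once; because each row of V is zero outside its own band, this is equivalent to summing only from L_b to U_b.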

그 후 감정 인식 장치는 멜 주파수 스펙트럼의 정규화 연산을 수행하게 된다(S504). 상기 멜 주파수 스펙트럼 정규화 연산은 아래의 수학식에 의하여 구해질 수 있다.After that, the emotion recognition apparatus performs a normalization operation of the mel frequency spectrum (S504). The Mel frequency spectrum normalization operation can be obtained by the following equation.

Here, B denotes the total number of Mel filters.
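The normalization step (S504) can be sketched as below, assuming it divides each M(b,n) by the sum over all B filters so that the values of one frame form a probability-like distribution. The function name and the handling of an all-zero frame are assumptions made for this illustration.

```python
def normalize_mel_spectrum(mel):
    """Scale the Mel spectrum of one frame so that its B values sum to 1."""
    total = sum(mel)
    if total == 0:
        # Assumption: treat a silent frame as a uniform distribution.
        return [1.0 / len(mel)] * len(mel)
    return [m / total for m in mel]


# Illustrative values only, with B = 3 filters.
normalized = normalize_mel_spectrum([1.0, 1.0, 2.0])
```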

Using the Mel frequency spectrum normalization result obtained through Equation 12, the entropy value HMFB(n) of each frame is obtained. The per-frame entropy value is computed from the normalized Mel frequency spectrum according to the following equation.
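As an illustration of the entropy step, the sketch below computes the Shannon entropy of one normalized frame. The base-2 logarithm and the convention that zero-valued terms contribute nothing are assumptions; the function name `spectral_entropy` is hypothetical.

```python
import math


def spectral_entropy(normalized):
    """Shannon entropy of one frame's normalized spectrum.

    Assumptions: base-2 logarithm; terms equal to zero are skipped,
    following the usual convention 0 * log(0) = 0.
    """
    return -sum(p * math.log2(p) for p in normalized if p > 0)


peaked = spectral_entropy([0.25, 0.25, 0.5])
flat = spectral_entropy([0.25, 0.25, 0.25, 0.25])
```

A flat spectrum gives the highest entropy for a given number of filters, while a spectrum concentrated in a few filters gives a lower value.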

The process of generating an emotion recognition model using this result as a GMM feature vector, and the process of recognizing emotion from an input speech signal, are as described with reference to FIG. 1.

FIG. 6 illustrates a method of performing spectral normalization using a delta fast Fourier transform and a Mel filter.

As in the methods already described, the emotion recognition apparatus 100 performs a fast Fourier transform (FFT) on the speech signal in units of subdivided frames (S601), and then computes the FFT power spectrum (S602).

The emotion recognition apparatus 100 applies the power spectrum obtained in S602 to a Mel frequency spectrum filter bank and computes its absolute value (S603). This is referred to as the Mel frequency spectrum, M(b,n).

The process of obtaining the Mel frequency spectrum, that is, applying the Fourier transform power spectrum to the Mel frequency spectrum filter bank, may be performed according to Equation 11 above.

The emotion recognition apparatus 100 performs a delta Mel spectrum operation on the Mel power spectrum values obtained from S602 and computes the absolute value of the result (S604). The delta Mel fast Fourier transform spectrum is defined as M'(b,n) and may be computed according to the following equation.

In addition, the absolute value of the delta Mel spectrum may be obtained according to the following equation.
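The delta step (S604) might look like the sketch below. The exact definition of M'(b,n) is not reproduced in this text, so the sketch assumes one common reading of "delta": the first-order difference between consecutive frames, M'(b,n) = M(b,n) - M(b,n-1), followed by the absolute value. Both the definition and the function name `delta_mel_abs` are assumptions and may differ from the patent's equation.

```python
def delta_mel_abs(mel_frames):
    """Absolute frame-to-frame difference of the Mel spectrum.

    mel_frames[n][b] holds M(b, n). Assumption: the delta is the difference
    between frame n and frame n - 1, so the output has one frame fewer
    than the input.
    """
    deltas = []
    for previous, current in zip(mel_frames, mel_frames[1:]):
        deltas.append([abs(c - p) for p, c in zip(previous, current)])
    return deltas


# Illustrative values only: three frames, two Mel filters each.
deltas = delta_mel_abs([[1.0, 2.0], [3.0, 1.0], [3.0, 4.0]])
```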

After obtaining the absolute value of the delta Mel spectrum, the emotion recognition apparatus 100 normalizes the Mel frequency spectrum (S605). The Mel frequency spectrum normalization may be computed according to the following equation.

Using the obtained delta Mel frequency spectrum normalization result, the entropy value HMFB'(n) of each frame is obtained. The per-frame entropy value is computed from the normalized delta Mel frequency spectrum according to the following equation.

FIG. 7 illustrates the configuration of an emotion recognition apparatus according to yet another embodiment of the present invention.

As shown in FIG. 7, the emotion recognition apparatus according to the present invention may include a microphone 110, a frame generation unit 120, a high-band emphasis unit 130, a spectral normalization operation unit 140, an entropy calculation unit 150, an emotion evaluation model generation unit 170, an emotion evaluation model DB 180, a voice evaluation unit 160, and the like.

The microphone 110 is a component for receiving a speech signal from a user or the like. When a speech signal is input from the microphone 110, the frame generation unit 120 subdivides it into frames. The high-band emphasis unit 130 is a component that emphasizes the high band of each subdivided frame using a Hamming window or the like.
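The framing and windowing described above can be sketched as follows. The frame length, the hop size, and the function names are illustrative assumptions; only the use of a Hamming window comes from the description.

```python
import math


def split_into_frames(signal, frame_len, hop):
    """Cut a sample list into overlapping frames of frame_len samples."""
    last_start = len(signal) - frame_len
    return [signal[s:s + frame_len] for s in range(0, last_start + 1, hop)]


def hamming(n):
    """Hamming window of n points (requires n >= 2)."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * k / (n - 1)) for k in range(n)]


def window_frames(frames):
    """Multiply every frame by a Hamming window of matching length."""
    if not frames:
        return []
    window = hamming(len(frames[0]))
    return [[x * w for x, w in zip(frame, window)] for frame in frames]


# Illustrative values only: 10 samples, 4-sample frames, 2-sample hop.
samples = [float(v) for v in range(10)]
frames = split_into_frames(samples, frame_len=4, hop=2)
windowed = window_frames(frames)
```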

The spectral normalization operation unit 140 of the emotion recognition apparatus 100 performs spectral normalization on the per-frame speech signal. To do so, the spectral normalization operation unit 140 may use any one of the methods of FIGS. 3 to 6. A detailed description is omitted here.

The entropy calculation unit 150 of the emotion recognition apparatus 100 computes the entropy value of each frame using the distribution of the normalized spectrum output by the spectral normalization operation unit 140. To compute the per-frame entropy values, the entropy calculation unit 150 may use Equation 4, 10, 13, or 17.

The emotion evaluation model generation unit 170 of the emotion recognition apparatus 100 generates a model for emotion recognition from the per-frame entropy values according to the method described above. The emotion evaluation model generated by the emotion evaluation model generation unit 170 is stored in the emotion evaluation model DB 180.

Meanwhile, the voice evaluation unit 160 evaluates the emotion conveyed by a speaker's speech signal by applying the speech signal of the speaker whose emotion is to be recognized to the emotion recognition model stored in the emotion evaluation model DB 180. The emotion evaluated in this way may be output by the display unit 181.
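The evaluation performed by the voice evaluation unit 160 can be illustrated with the sketch below, which scores per-frame entropy values against one Gaussian mixture model per emotion and selects the emotion with the largest likelihood. It assumes one-dimensional features, independent frames (so log-likelihoods add), and models stored as lists of (weight, mean, variance) triples. All function names and parameter values are hypothetical and do not come from the patent.

```python
import math


def gmm_log_likelihood(values, gmm):
    """Log-likelihood of 1-D values under a GMM of (weight, mean, variance)."""
    total = 0.0
    for x in values:
        density = sum(
            w * math.exp(-((x - mu) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)
            for w, mu, var in gmm
        )
        # Floor the density so that log() stays defined for extreme outliers.
        total += math.log(max(density, 1e-300))
    return total


def recognize_emotion(entropies, models):
    """Return the emotion whose GMM gives the largest log-likelihood."""
    return max(models, key=lambda name: gmm_log_likelihood(entropies, models[name]))


# Hypothetical models; real parameters would be estimated from training speech.
models = {
    "joy": [(0.6, 3.0, 0.2), (0.4, 3.5, 0.3)],
    "sadness": [(0.5, 1.5, 0.2), (0.5, 2.0, 0.3)],
}
result = recognize_emotion([3.1, 3.3, 2.9], models)
```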

Meanwhile, the communication interface 182 is a component for receiving an emotion evaluation model from an external network. The emotion recognition apparatus 100 according to the present invention may generate a voice evaluation model by itself, or may be provided with one from an external network or the like. A voice evaluation model provided from the external network is stored in the emotion evaluation model DB 180.

The emotion recognition apparatus 100 may also evaluate the speaker's emotion by applying the speech signal of the speaker whose emotion is to be recognized to the voice evaluation model provided from the external network.

Although the present invention has been described in detail through representative embodiments, those of ordinary skill in the art to which the present invention pertains will understand that various modifications may be made to the above-described embodiments without departing from the scope of the present invention. Therefore, the scope of the present invention should not be limited to the described embodiments, but should be determined by the claims set forth below and their equivalents.

FIG. 1 illustrates a method of generating an emotion recognition model for a speech signal according to an embodiment of the present invention.

FIG. 2 illustrates a method of recognizing emotion from a speech signal according to another embodiment of the present invention.

FIG. 3 illustrates a method of performing spectral normalization using a fast Fourier transform.

FIG. 4 illustrates a method of performing spectral normalization using a delta fast Fourier transform.

FIG. 5 illustrates a method of performing spectral normalization using a fast Fourier transform and a Mel filter.

FIG. 6 illustrates a method of performing spectral normalization using a delta fast Fourier transform and a Mel filter.

FIG. 7 illustrates the configuration of an emotion recognition apparatus according to yet another embodiment of the present invention.

<Description of reference numerals for the main parts of the drawings>

100: emotion recognition apparatus

110: microphone

120: frame generation unit

130: high-band emphasis unit

140: spectral normalization operation unit

150: entropy calculation unit

160: voice evaluation unit

170: emotion evaluation model generation unit

180: emotion evaluation model DB

181: display unit

182: communication interface

Claims

In an emotion recognition method using spectral entropy values,

Calculating a spectral entropy value for each frame of a speech signal for generating an emotion evaluation model, and generating the emotion evaluation model using the same; and

Receiving an evaluation speech signal, calculating a spectral entropy value for each frame of the evaluation speech signal, and applying the same to the emotion evaluation model to recognize the emotion conveyed by the evaluation speech signal.

The method of claim 1,

Computing the spectral entropy value for each frame of the speech signal,

Subdividing the voice signal into frames;

Emphasizing a high band per frame of the speech signal;

Performing spectral normalization of the speech signal; and

Calculating an entropy value for each frame from the spectral normalization distribution.

The method of claim 2,

Performing spectral normalization of the speech signal,

Fast Fourier transforming the speech signal;

Obtaining a power spectrum from the fast Fourier transformed result; and

Performing a normalization operation on the power spectrum.

The method of claim 2,

Performing spectral normalization of the speech signal,

Fast Fourier transforming the speech signal;

Obtaining a power spectrum from the fast Fourier transformed result;

Calculating a delta fast Fourier transform spectrum from the power spectrum and calculating an absolute value thereof; and

Performing a normalization operation on the absolute value of the delta fast Fourier transform spectrum.

The method according to claim 2 or 3,

Performing spectral normalization of the speech signal,

Further comprises, after obtaining the power spectrum from the fast Fourier transform result, performing a Mel filter operation on the power spectrum.

The method of claim 2,

Emphasizing the high band for each frame of the speech signal,

An emotion recognition method characterized by emphasizing the high band of each frame using a Hamming window or the like.

The method of claim 1,

Generating the emotion evaluation model,

An emotion recognition method using one of a Gaussian mixture model (GMM) algorithm, a Hidden Markov Model (HMM) algorithm, or a support vector machine (SVM) algorithm.

The method of claim 7, wherein

Generating the emotion evaluation model,

An emotion recognition method comprising estimating a GMM parameter having a maximum Gaussian mixture distribution value using maximum likelihood estimation (MLE) or an expectation maximization (EM) algorithm.

The method of claim 8,

Performing emotion recognition by applying the spectral entropy value of each frame of the evaluation speech signal to the GMM emotion evaluation model comprises:

Obtaining a Gaussian mixture distribution from the per-frame spectral entropy values of the evaluation speech signal and the GMM parameter; and

Selecting an emotion according to the GMM parameter having the largest probability value in the Gaussian mixture distribution.

In an emotion recognition apparatus using spectral entropy values,

A frame generation unit that subdivides an input speech signal into frames;

A spectral normalization operation unit configured to perform power spectrum normalization on each frame of the subdivided speech signal;

An entropy calculation unit that calculates an entropy value for each frame using the spectrum normalization result; and

An emotion evaluation model generation unit configured to generate an emotion evaluation model from the per-frame entropy values.

The apparatus of claim 10,

Further comprising a voice evaluation unit configured to apply the spectral entropy value of an input evaluation speech signal to the emotion evaluation model to perform emotion recognition.

The apparatus according to claim 10 or 11, wherein

The spectral normalization operation unit,

Performs a fast Fourier transform on the speech signal, obtains a power spectrum from the fast Fourier transform result, and normalizes the power spectrum.

The apparatus of claim 12,

The spectral normalization operation unit,

The apparatus according to claim 10 or 11, wherein

The spectral normalization operation unit,

In performing spectral normalization of the speech signal,

Computes a fast Fourier transform of the speech signal, obtains the absolute value of the delta fast Fourier transform spectrum, and then normalizes the absolute value of the delta fast Fourier transform spectrum.

The apparatus of claim 14,

The spectral normalization operation unit,

The apparatus according to claim 10 or 11,

Further comprising a high-band emphasis unit that emphasizes the high band of each frame using a Hamming window or the like.

The apparatus of claim 11,

The emotion evaluation model generation unit,

An emotion recognition apparatus using one of a Gaussian mixture model (GMM) algorithm, a Hidden Markov Model (HMM) algorithm, or a support vector machine (SVM) algorithm.

The apparatus of claim 17,

The emotion evaluation model generation unit,

An emotion recognition apparatus characterized by estimating a GMM parameter having a maximum Gaussian mixture distribution value using maximum likelihood estimation (MLE) or an expectation maximization (EM) algorithm.

The apparatus of claim 18,

The voice evaluation unit,

Obtains a Gaussian mixture distribution from the per-frame spectral entropy values of the evaluation speech signal and the GMM parameter, and selects an emotion according to the GMM parameter having the largest probability value.

The apparatus according to claim 11 or 12,

Further comprising a communication interface for receiving an emotion evaluation model from the outside.