KR20100094182A

KR20100094182A - Apparatus and method for recognizing emotion using a voice signal

Info

Publication number: KR20100094182A
Application number: KR1020090013492A
Authority: KR
Inventors: 강상기; 김용석; 최정윤; 이정원; 최가원; 유성엽
Original assignee: 삼성전자주식회사; 연세대학교 산학협력단
Priority date: 2009-02-18
Filing date: 2009-02-18
Publication date: 2010-08-26
Also published as: KR101560834B1

Abstract

PURPOSE: An apparatus and a method for recognizing emotion using a voice signal are provided that extract non-verbal parameters reflecting an emotional state from a speech sound, thereby determining the emotional state of a speaker more accurately. CONSTITUTION: A non-verbal information detector (210) detects at least one of a pitch trajectory and an energy trajectory from an input speech sound as non-verbal information. A non-verbal parameter extractor (220) detects a specific section from at least one of the pitch trajectory and the energy trajectory, and extracts a non-verbal parameter from the detected section. A determination unit (230) determines the emotional state recognized for the speech sound using the extracted non-verbal parameter.

Description

Apparatus and method for emotion recognition using speech signal {APPARATUS AND METHOD FOR RECOGNIZING EMOTION USING A VOICE SIGNAL}

The present invention relates to an apparatus and method for recognizing emotions, and more particularly, to an apparatus and method for recognizing emotions using a voice signal.

Emotion recognition and emotion understanding are fields that currently attract much attention and are regarded as future technologies. This is related to the overall shift in the direction of IT research from PC-centric, through network-centric, to user-centric.

As interest in interface technology between users and machines grows, technologies for recognizing human emotion from biometric data, including voice and facial expressions, are being actively researched. Human-machine interface technologies currently in use generally fall into the following two methods.

The first method is mechanical signal input, such as a keyboard or a mouse, and most human-machine interfaces belong to this category. It has the advantage that the user's intention is delivered accurately and reliably when a signal is input, but it requires considerable learning to operate the machine effectively, and user convenience is relatively poor. In addition, a large part of what the user intends to communicate is lost.

The second method is natural signal input, such as voice and video. It is gradually being applied to products as technology develops and demand grows, but it is still used far less often than the mechanical signal input method described above. Compared with mechanical signal input, communication is less stable and additional hardware cost is incurred, but it is much more convenient for the user and can convey a wider variety of requests.

As one part of this human-machine interface technology, emotion recognition using voice, the most basic human means of communication and information transfer, is growing in importance.

Emotion recognition technology using voice signals can be applied in various fields. For example, when the speaker is judged to be in an emotional state in which he or she is expected to want rest, an intelligent voice-based interface can suggest quiet music or a service for relaxing and connect the speaker to that service. The technology can also help improve the performance of other interface technologies.

Emotion recognition using such voice signals is generally performed through the verbal or non-verbal information that can be obtained directly from the speaker's speech sound.

The non-verbal information may be obtained by analyzing the pitch and energy of the speech sound. As one example of such research, (Kwak Hyun-suk, Kim Soo-hyun, Kwak Yun-geun, "Emotion Recognition Using Prosodic Feature Vectors and Gaussian Mixture Model", Korean Society for Noise and Vibration Engineering Autumn Conference, vol. 2002, single issue, starting page 375, 2002) performed emotion recognition by taking pitch, energy, and tempo, which are the prosodic features of a speech sound, and constructing two kinds of feature vectors from their first derivatives, second derivatives, ranges of variation, and normalized values. In addition, Schuller (Schuller, B., Rigoll, G. and Lang, M., Hidden Markov Model-Based Speech Emotion Recognition, Proc. ICASSP 2003, IEEE, Hong Kong, China, vol. II, 1-4, 2003) proposed performing emotion recognition using not only the dynamic features of the speech sound but also static statistics of its pitch and energy.

However, the non-verbal information of a speech sound is strongly affected by environmental factors and speaker-specific factors. Pitch varies with the intonation and stress of the language as well as with the vocal cord structure of each speaker. Energy also varies with the phoneme composition of the language, the place of articulation, differences between individual speakers, the microphone volume, and the surrounding conditions. The pitch and energy of the speech sound may also be affected during voice signal input. In other words, environmental factors and speaker-specific factors degrade the performance of emotion recognition.

Therefore, when emotion recognition is performed by extracting non-verbal information from a speech sound as described above, there is a need for a parameter extraction method that reflects the speaker's emotional state more accurately, and for an emotion recognition method that uses the extracted parameters to determine the speaker's emotional state more objectively.

Accordingly, the present invention provides an apparatus and method for recognizing emotions by extracting, from a voice signal, parameters for non-verbal information indicating an emotional state.

The present invention also provides an apparatus and method for recognizing emotions by extracting, from a voice signal, general parameters reflecting an emotional state and additional parameters reflecting the emotional state.

The present invention also provides an apparatus and method for recognizing emotions by extracting additional parameters reflecting an emotional state using the voiced sound and energy of a voice signal.

An emotion recognition apparatus using a voice signal according to the present invention includes: a non-verbal information detector that detects at least one of a pitch trajectory and an energy trajectory from an input speech sound as non-verbal information; a non-verbal parameter extractor that detects a specific section from at least one of the detected pitch trajectory and energy trajectory and extracts a non-verbal parameter from the detected section; and a determination unit that determines the emotional state recognized for the speech sound using the extracted non-verbal parameter.

An emotion recognition method using a voice signal according to the present invention includes: detecting at least one of a pitch trajectory and an energy trajectory from an input speech sound as non-verbal information; detecting a specific section from at least one of the detected pitch trajectory and energy trajectory and extracting a non-verbal parameter from the detected section; and determining the emotional state recognized for the speech sound using the extracted non-verbal parameter.

Therefore, according to the present invention, the emotional state of the speaker can be determined more accurately by extracting non-verbal parameters reflecting the emotional state from the speech sound.

In addition, according to the present invention, the pitch trajectory and energy trajectory of the speech sound are detected as non-verbal information indicating the emotional state of the speech sound, and various statistical parameters extracted from the pitch trajectory and energy trajectory are combined, so that the emotional state of the speaker can be determined more accurately.

Hereinafter, preferred embodiments of the present invention are described in detail with reference to the accompanying drawings. The following description sets out specific details, which are provided only to aid a more general understanding of the present invention; it will be apparent to those of ordinary skill in the art that the present invention may be practiced without these specific details.

First, the present invention is proposed to minimize the influence of external factors, such as environmental factors and speaker-specific factors, in methods that perform emotion recognition by extracting non-verbal information from a speech sound. To this end, rather than performing emotion recognition simply by measuring the absolute values and amounts of change of the pitch and energy of the speech sound, the present invention extracts various non-verbal parameters for emotion recognition from the pitch trajectory and energy trajectory of the speech sound, and compares the extracted parameters with modeling values for each emotional state to determine the speaker's emotional state objectively.

FIG. 1 is a flowchart illustrating a process of recognizing emotions by extracting non-verbal parameters according to the present invention.

When the speaker's speech sound is input through a microphone in step 110, various non-verbal parameters for emotion recognition are extracted in step 120 from the pitch trajectory and energy trajectory of the speech sound according to the present invention. Here, the pitch trajectory of the speech sound is extracted from voiced sound, which is sustained vocalization accompanied by vibration of the speaker's vocal cords. From the extracted pitch trajectory and energy trajectory, the rising sections, falling sections, upward-convex sections, and downward-convex sections of each trajectory are detected, and statistical parameters are calculated from each detected section and extracted as the non-verbal parameters.

Then, in step 130, the extracted non-verbal parameters are compared with predetermined modeling values for each emotional state to determine the speaker's emotional state, whereby the emotion is recognized.

As described above, the present invention extracts the pitch trajectory and the energy trajectory using the voiced sections and the energy measured from the speech sound, respectively, calculates statistical parameters from specific sections of the pitch trajectory and energy trajectory, and extracts them as non-verbal parameters for emotion recognition.

Hereinafter, the manner in which the present invention extracts the non-verbal parameters and determines the emotional state using the extracted non-verbal parameters is described in detail with reference to FIGS. 2 to 4.

FIG. 2 is a block diagram of an emotion recognition apparatus according to an embodiment of the present invention.

The emotion recognition apparatus of FIG. 2 includes a non-verbal information detector 210, a non-verbal parameter extractor 220, a determination unit 230, and a modeling unit 240.

Referring to FIG. 2, the non-verbal information detector 210 includes a voiced sound detector 211 and an energy detector 213 that detect voiced sound and energy, respectively, in order to detect non-verbal information from the input speech sound. The voiced sound and energy may be detected section by section, and means may further be provided for detecting, as non-verbal information, other information that reflects the speaker's emotional state besides the voiced sound and energy. The detectors 211 and 213 may be arranged in parallel as shown in FIG. 2, or may operate sequentially in either order. For example, the apparatus may be configured to detect the voiced sound from the speech sound and then detect the energy, or to detect the energy and then detect the voiced sound. Although the embodiment of FIG. 2 uses both the voiced sound detector 211 and the energy detector 213, the apparatus may also be configured to detect only one of the voiced sound and the energy. The embodiments described below assume that both the voiced sound and the energy are detected, to aid understanding of the invention.

First, in an embodiment of the present invention, to detect the voiced sound, an autocorrelation function is used to detect the pitch trajectory over time in the voiced portions of the speech sound, excluding the unvoiced portions. To detect the energy, the root mean square is used to detect the energy trajectory over time. Besides the autocorrelation function and the root mean square, various other methods currently proposed and in use may be employed to detect the pitch trajectory and the energy trajectory.
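
As an illustration only, the autocorrelation and root-mean-square computations described above might be sketched in Python as follows. The frame length, hop size, 60 to 400 Hz search range, and voicing threshold are assumptions that do not appear in this description, and the function names are hypothetical.

```python
import math

def frame_rms(frame):
    """Root mean square of one frame of samples (one point of the energy trajectory)."""
    return math.sqrt(sum(x * x for x in frame) / len(frame))

def autocorr_pitch(frame, sample_rate, fmin=60.0, fmax=400.0, voicing_threshold=0.3):
    """Estimate the pitch (Hz) of one frame from the peak of its autocorrelation.

    Returns None when the frame looks unvoiced (weak periodicity).
    fmin, fmax and voicing_threshold are illustrative assumptions.
    """
    energy = sum(x * x for x in frame)
    if energy == 0.0:
        return None
    min_lag = int(sample_rate / fmax)
    max_lag = min(int(sample_rate / fmin), len(frame) - 1)
    best_lag, best_r = None, 0.0
    for lag in range(min_lag, max_lag + 1):
        # Autocorrelation at this lag, normalized by the frame energy.
        r = sum(frame[i] * frame[i + lag] for i in range(len(frame) - lag)) / energy
        if r > best_r:
            best_lag, best_r = lag, r
    if best_lag is None or best_r < voicing_threshold:
        return None
    return sample_rate / best_lag

def trajectories(signal, sample_rate, frame_len, hop):
    """Return (pitch trajectory, energy trajectory), one value per frame."""
    pitch, energy = [], []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len]
        pitch.append(autocorr_pitch(frame, sample_rate))
        energy.append(frame_rms(frame))
    return pitch, energy
```

Frames for which `autocorr_pitch` returns `None` would be treated as unvoiced and left out of the pitch trajectory; a production system would more likely use an established pitch tracker.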

The non-verbal parameter extractor 220 calculates various statistical parameters from specific sections of the pitch trajectory and energy trajectory detected by the voiced sound detector 211 and the energy detector 213, respectively, and extracts a plurality of non-verbal parameters from the calculated statistical parameters. The configuration of FIG. 2 may also be modified so that a separate non-verbal parameter extractor 220 is provided for each of the voiced sound detector 211 and the energy detector 213.

The non-verbal parameter extractor 220 compares the slope of the pitch trajectory detected by the voiced sound detector 211, and the rate of change of that slope, with predetermined thresholds to obtain at least one of a section in which the pitch trajectory is downward-convex, a section in which it is upward-convex, a rising section, and a falling section. It then calculates statistical parameters for the obtained section to extract non-verbal parameters that reflect the emotional state.
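
The threshold comparison described above could be sketched as follows. This is a minimal illustration under stated assumptions: the slope is approximated by the first difference of the trajectory, the rate of change of the slope by the second difference, and a negative second difference is taken to mean an upward-convex (peak-like) shape. The threshold defaults and all names are hypothetical.

```python
def label_runs(values, threshold, pos_label, neg_label):
    """Group consecutive indices whose value is above +threshold or below -threshold.

    Returns a list of (label, start_index, end_index_exclusive) tuples.
    """
    runs, current, start = [], None, 0
    for i, v in enumerate(values):
        label = pos_label if v > threshold else neg_label if v < -threshold else None
        if label != current:
            if current is not None:
                runs.append((current, start, i))
            current, start = label, i
    if current is not None:
        runs.append((current, start, len(values)))
    return runs

def segment_trajectory(track, slope_thr=0.0, curve_thr=0.0):
    """Split a trajectory into rising/falling runs and convex-up/convex-down runs.

    Run indices refer to positions in the first-difference (direction) and
    second-difference (curvature) sequences, not to the original frames.
    """
    slope = [b - a for a, b in zip(track, track[1:])]   # first difference
    curve = [b - a for a, b in zip(slope, slope[1:])]   # second difference
    direction = label_runs(slope, slope_thr, "rising", "falling")
    curvature = label_runs(curve, curve_thr, "convex_down", "convex_up")
    return direction, curvature
```

For the short trajectory `[100, 110, 120, 110, 100]`, this yields one rising run, one falling run, and a single upward-convex run around the peak.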

Specifically, the non-verbal parameters extracted using the pitch trajectory include at least one of the mean, variance, maximum, median, and IQR (interquartile range) of the durations of the sections in which the pitch trajectory is downward-convex and of the sections in which it is upward-convex. Here, the IQR denotes, for example, the upper 75% section of the distribution of the pitch trajectory.

The non-verbal parameters extracted using the pitch trajectory also include the ratio of the duration of the downward-convex sections and of the upward-convex sections to the total voiced time, and at least one of the mean, median, and IQR of the pitch in the downward-convex sections and in the upward-convex sections.

The non-verbal parameters extracted using the pitch trajectory also include at least one of the mean, variance, maximum, median, and IQR of the durations of the sections in which the pitch trajectory is rising and/or falling, and at least one of the mean, median, and IQR of the pitch, obtained by calculating statistical parameters in the rising and/or falling sections.
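
The per-section statistics listed above might be computed along the following lines. This sketch assumes that sections are given as hypothetical `(label, start, end)` frame-index runs, that the trajectory contains only voiced frames, and that the IQR is computed in the usual statistical sense as the third quartile minus the first quartile. At least two sections of the requested label are needed for the quartiles to be defined.

```python
import statistics

def summarize(values):
    """Mean, variance, maximum, median and IQR of a sequence of numbers."""
    values = list(values)
    q1, _, q3 = statistics.quantiles(values, n=4)  # needs at least two values
    return {
        "mean": statistics.fmean(values),
        "variance": statistics.pvariance(values),
        "max": max(values),
        "median": statistics.median(values),
        "iqr": q3 - q1,  # assumption: conventional Q3 - Q1 definition
    }

def section_parameters(track, sections, frame_period, label):
    """Statistics for all sections with the given label.

    track        : per-frame values of the voiced trajectory (e.g. pitch in Hz)
    sections     : list of (label, start, end) frame-index runs (hypothetical format)
    frame_period : seconds per frame
    """
    chosen = [(s, e) for name, s, e in sections if name == label]
    durations = [(e - s) * frame_period for s, e in chosen]
    values = [v for s, e in chosen for v in track[s:e]]
    voiced_time = len(track) * frame_period
    return {
        "duration": summarize(durations),
        "value": summarize(values),
        "time_ratio": sum(durations) / voiced_time,
    }
```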

Meanwhile, the slope of the energy trajectory detected by the energy detector 213 is compared with a predetermined threshold to obtain rising and/or falling sections, and statistical parameters are then calculated for those sections to extract non-verbal parameters that reflect the emotional state.

Here, the non-verbal parameters extracted using the energy trajectory include at least one of the mean, median, variance, maximum, minimum, and IQR of the energy trajectory of each of the frequency bands 0 to 250 Hz, 0 to 500 Hz, 500 to 1000 Hz, and 2500 to 3500 Hz, of the slope of that trajectory, and of the rate of change of that slope.

The non-verbal parameters extracted using the energy trajectory also include at least one of the mean, median, variance, maximum, minimum, and IQR of the trajectory obtained by normalizing the energy of each of the frequency bands 0 to 250 Hz, 0 to 500 Hz, 500 to 1000 Hz, and 2500 to 3500 Hz by the total energy of each frame, of the slope of that trajectory, and of the rate of change of that slope.

The frequency bands of 0 to 250 Hz, 0 to 500 Hz, 500 to 1000 Hz, and 2500 to 3500 Hz are given as examples, and the invention is not necessarily limited to these bands.
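
One way to obtain the band energies normalized by the total energy of a frame is sketched below. The description does not say how the band energies are computed, so the use of a discrete Fourier transform power spectrum, the half-open band edges, and the function names are all assumptions made for illustration.

```python
import cmath
import math

# Example bands from the description, in Hz.
BANDS_HZ = [(0, 250), (0, 500), (500, 1000), (2500, 3500)]

def power_spectrum(frame):
    """Naive DFT power spectrum for bins 0..N/2 (adequate for a sketch, slow in practice)."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(frame))
        spectrum.append(abs(acc) ** 2)
    return spectrum

def normalized_band_energies(frame, sample_rate, bands=BANDS_HZ):
    """Energy in each band divided by the total energy of the frame."""
    power = power_spectrum(frame)
    total = sum(power)
    bin_hz = sample_rate / len(frame)
    result = []
    for low, high in bands:
        # Assumption: a bin belongs to a band when low <= frequency < high.
        band = sum(p for k, p in enumerate(power) if low <= k * bin_hz < high)
        result.append(band / total if total > 0 else 0.0)
    return result
```

Applying `normalized_band_energies` to every frame gives one normalized trajectory per band, from which the slope and the rate of change of the slope can be taken by differencing. In practice an FFT routine would replace the naive transform.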

The non-verbal parameters extracted using the energy trajectory also include at least one of the mean, maximum, median, and IQR of the durations of the sections in which the energy trajectory is rising and/or falling, and at least one of the mean, median, and IQR of the energy in the rising and/or falling sections.

When a speech sound is input as described above, various non-verbal parameters are extracted from the pitch trajectory and the energy trajectory. These non-verbal parameters may be used in combination so as to improve emotion recognition performance. The extracted non-verbal parameters are compared with pre-modeled (or learned) modeling values for each emotional state, and the emotional state corresponding to the closest modeling value is determined to be the speaker's emotional state. The modeling values for each emotional state are, for example, learned from training voice signals and extracted as an emotion model.

In FIG. 2, the modeling unit 240 extracts the emotion model from an input training voice signal.

The process of extracting the emotion model is described with reference to FIG. 3. When a training voice signal is input to the modeling unit 240 in step 310 of FIG. 3, the modeling unit 240 extracts non-verbal parameters from the input training voice signal in step 320. Then, in step 330, the modeling unit 240 extracts the following emotion model using the extracted non-verbal parameters.

Table 1 below illustrates, for each emotional state, the mean and/or variance values of each feature vector of the training voice signals in the database used for training. The modeling values for each emotional state, that is, the emotion model, may use the mean and/or variance values of each feature vector. The emotion model of Table 1 illustrates 10 parameters selected from the parameters of each feature vector. In order from the top of the vertical column of Table 1, they are: the IQR of the pitch values in the rising section and in the falling section of the pitch trajectory (P1, P2); the IQR of the pitch values in the downward-convex section (P3); the variance of the slope of the 500 to 1000 Hz band energy trajectory (P4) and the variance of the rate of change of that slope (P5); the maximum (P6), mean (P7), and variance (P8) of the 500 to 1000 Hz band energy trajectory; the median of the trajectory of the 0 to 250 Hz band energy normalized by the total energy of each frame (P9); and the mean of the trajectory of the 0 to 500 Hz band energy normalized by the total energy of each frame (P10).

When each feature vector is applied to a GMM (Gaussian Mixture Model), it goes through a normalization process and a GMM is built as the emotion model. The values in Table 1 come from an embodiment that uses, for example, a Korean speech database, and different values may be stored for other languages such as English. Even if the language differs, however, the operation of the present invention applies in common: detecting the pitch trajectory and energy trajectory from the speech sound (or training voice signal), and extracting the statistical parameters calculated in specific sections of the detected pitch trajectory and energy trajectory as non-verbal parameters.

The determination unit 230 vectorizes the non-verbal parameters extracted by the non-verbal parameter extractor 220 and then, using the emotion model preconfigured in the modeling unit 240 for determining the emotional state, determines the speaker's emotional state by comparing the likelihood of the vectorized non-verbal parameters under each emotional state of the emotion model. That is, the determination unit 230 compares the vectorized non-verbal parameters with the modeling values for each emotional state and determines (recognizes) the emotional state corresponding to the closest modeling value as the speaker's emotional state.

Here, a GMM or an HMM (Hidden Markov Model) may be used as the algorithm for determining the emotional state. Any other pattern classification algorithm usable for emotion recognition may also be used.
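
The likelihood comparison can be illustrated with a small diagonal-covariance GMM scorer. This is a sketch under stated assumptions rather than the embodiment's implementation: the diagonal covariance, the data layout, and every name are hypothetical, and the model parameters in the usage note are made-up numbers, not values from Table 1.

```python
import math

def log_gaussian_diag(x, mean, var):
    """Log density of x under a Gaussian with diagonal covariance."""
    return sum(
        -0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )

def gmm_log_likelihood(x, components):
    """Log-likelihood of x under a GMM.

    components: list of (weight, mean_vector, variance_vector) tuples.
    """
    logs = [math.log(w) + log_gaussian_diag(x, m, v) for w, m, v in components]
    top = max(logs)
    # Log-sum-exp keeps the sum numerically stable.
    return top + math.log(sum(math.exp(value - top) for value in logs))

def classify(x, emotion_models):
    """Return the emotional state whose GMM gives x the highest likelihood."""
    return max(emotion_models, key=lambda name: gmm_log_likelihood(x, emotion_models[name]))
```

With two hypothetical two-component models, `{"neutral": [(0.5, [0, 0], [1, 1]), (0.5, [1, 1], [1, 1])], "anger": [(0.5, [5, 5], [1, 1]), (0.5, [6, 6], [1, 1])]}`, a feature vector near the origin would be assigned to `"neutral"` and one near `[5.5, 5.4]` to `"anger"`.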

FIG. 4 is a flowchart illustrating a method of recognizing an emotional state from an input speech sound according to an embodiment of the present invention.

First, when a speech sound is input, non-verbal information of the input speech sound is detected in step 410. Specifically, the non-verbal information includes at least one of the voiced sound and the energy extracted from the speech sound: in step 411 a pitch trajectory is detected from the voiced sections, and in step 413 an energy trajectory is detected from the energy of the speech sound. The energy of the speech sound may also be detected within the voiced sections.

Steps 411 and 413 may be performed in parallel, or sequentially in either order. It is also possible to selectively perform at least one of the two steps.

The method then proceeds to step 420, in which statistical parameters are calculated from specific sections of the pitch trajectory and energy trajectory, the non-verbal information detected in steps 411 and 413, respectively, and extracted as non-verbal parameters for emotion recognition. As described with reference to FIG. 2, the non-verbal parameters are calculated by obtaining at least one of a downward-convex section, an upward-convex section, a rising section, and a falling section of the pitch trajectory and the energy trajectory, and then calculating statistical parameters for that section to extract non-verbal parameters that reflect the emotional state.

In step 430, the extracted non-verbal parameters are converted into vectorized values. In step 440, the vectorized non-verbal parameters are compared with the modeling values for each emotional state. In step 450, the modeling value closest to the vectorized non-verbal parameters is identified from the comparison result. In step 460, the emotional state corresponding to the identified modeling value is determined (recognized) as the speaker's emotional state.

FIGS. 5A to 5C each show simulation results obtained by applying the emotion recognition method according to an embodiment of the present invention.

Each figure is a graph of the pdf distribution (x-axis normalized) obtained when two Gaussians are used per emotional state, with the emotional states plotted separately as normal (510), joy (520), sadness (530), and anger (540). FIG. 5A is a graph of the pdf of the IQR in sections where the pitch trajectory is falling, according to an embodiment of the present invention, and FIG. 5B is a graph of the pdf of the median of the normalized energy in the 0 to 250 Hz band, according to an embodiment of the present invention. FIG. 5C is a graph of the pdf of the mean of the normalized energy in the 0 to 500 Hz band. Referring to FIGS. 5A to 5C, joy (520) and anger (540), which generally represent an excited state, show distributions similar to each other, as do normal (510) and sadness (530), which represent a calm state. In the simulation, to improve emotion recognition performance, a plurality of non-verbal parameters extracted according to the present invention were combined and vectorized before use.
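The pdf curves described above can be illustrated with a minimal one-dimensional sketch: each emotional state is modeled by a mixture of two Gaussians over a single normalized parameter, and the state whose mixture gives the highest density at the observed value is selected. The weights, means, and standard deviations below are made up and do not reproduce the simulation; the simulation combined several parameters into a vector, which this single-parameter sketch does not attempt.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of a one-dimensional Gaussian at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def mixture_pdf(x, components):
    """Density of a mixture given as (weight, mean, std) triples."""
    return sum(w * gaussian_pdf(x, m, s) for w, m, s in components)


# Hypothetical two-component mixtures for ONE normalized parameter.
MIXTURES = {
    "normal": [(0.5, 0.25, 0.05), (0.5, 0.35, 0.05)],
    "sadness": [(0.5, 0.15, 0.05), (0.5, 0.25, 0.05)],
    "joy": [(0.5, 0.60, 0.05), (0.5, 0.70, 0.05)],
    "anger": [(0.5, 0.70, 0.05), (0.5, 0.80, 0.05)],
}


def most_likely_state(x, mixtures=MIXTURES):
    """Return the state whose mixture assigns the highest density to x."""
    return max(mixtures, key=lambda state: mixture_pdf(x, mixtures[state]))
```

Because the made-up mixtures for joy and anger overlap, as do those for normal and sadness, a single parameter separates the excited and calm groups more reliably than it separates the states within a group, which is consistent with the motivation for combining several parameters.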

FIG. 1 is a flowchart illustrating a process of recognizing emotion by extracting non-verbal parameters according to the present invention;

FIG. 2 is a block diagram of an emotion recognition apparatus according to an embodiment of the present invention;

FIG. 3 is a flowchart illustrating a process of extracting an emotion model according to an embodiment of the present invention;

FIG. 4 is a flowchart illustrating a method of recognizing an emotional state from an input speech sound according to an embodiment of the present invention;

FIG. 5A is a graph of the pdf of the IQR in sections where the pitch trajectory is falling, according to an embodiment of the present invention;

FIG. 5B is a graph of the pdf of the median of the normalized energy in the 0 to 250 Hz band, according to an embodiment of the present invention; and

FIG. 5C is a graph of the pdf of the mean of the normalized energy in the 0 to 500 Hz band.
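The normalized band energies plotted in FIGS. 5B and 5C can be sketched as follows, assuming that "normalized" means the band energy divided by the total energy of the same frame, as the claims word it. The sketch uses a naive DFT from the standard library for self-containment; the frame length, sample rate, and function names are hypothetical.

```python
import cmath
import math
from statistics import mean, median


def band_energy_ratio(frame, sample_rate, f_lo, f_hi):
    """Fraction of a frame's spectral energy lying in [f_lo, f_hi] Hz.

    Uses a naive DFT over the non-negative frequency bins (one-sided
    spectrum), which is adequate for short illustrative frames.
    """
    n = len(frame)
    total = 0.0
    band = 0.0
    for k in range(n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(frame))
        power = abs(coeff) ** 2
        total += power
        if f_lo <= k * sample_rate / n <= f_hi:
            band += power
    return band / total if total > 0.0 else 0.0


def band_statistics(frames, sample_rate):
    """Median of the 0 to 250 Hz ratio and mean of the 0 to 500 Hz ratio."""
    low = [band_energy_ratio(f, sample_rate, 0.0, 250.0) for f in frames]
    wide = [band_energy_ratio(f, sample_rate, 0.0, 500.0) for f in frames]
    return median(low), mean(wide)
```

Each frame contributes one ratio, so the sequence of ratios over time forms the normalized band energy trajectory from which the median or mean is taken.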

Claims

An emotion recognition apparatus using a voice signal, comprising:

a non-verbal information detection unit that detects at least one of a pitch trajectory and an energy trajectory from an input speech sound as non-verbal information;

a non-verbal parameter extraction unit that detects a specific section from at least one of the detected pitch trajectory and energy trajectory, and extracts a non-verbal parameter from the detected specific section; and

a determination unit that determines an emotional state recognized for the speech sound using the extracted non-verbal parameter.

The apparatus of claim 1, wherein the detected specific section is

at least one of a downward-convex section, an upward-convex section, a rising section, and a falling section, detected by comparing a slope and a rate of change of the slope of at least one of the detected pitch trajectory and energy trajectory with a predetermined threshold value.

The apparatus of claim 1, wherein the emotion recognition apparatus further comprises

a modeling unit that generates, as a database, modeling values for each emotional state using a plurality of non-verbal parameters extracted from an input training signal.

The apparatus of claim 1,

wherein the non-verbal information detection unit comprises:

a voiced sound detector that detects a voiced section from the speech sound and detects the pitch trajectory over time within the voiced sound; and

an energy detector that detects energy from the speech sound and detects the energy trajectory over time.

The apparatus of claim 3, wherein the determination unit

compares the non-verbal parameter extracted from the input speech sound with the modeling values for each emotional state provided by the modeling unit, and determines the emotional state corresponding to the closest modeling value as the emotional state recognized for the speech sound.

The apparatus of claim 1, wherein the determination unit

compares the non-verbal parameter extracted from the input speech sound with predetermined modeling values for each emotional state, and determines the emotional state corresponding to the closest modeling value as the emotional state recognized for the speech sound.

The apparatus of claim 1, wherein the determination unit

uses one of a Gaussian Mixture Model (GMM) and a Hidden Markov Model (HMM) as the algorithm for determining the emotional state.

The apparatus of claim 1, wherein the non-verbal parameter is

a vectorized value of at least one of: the IQR (interquartile range) of the pitch values in each of the sections where the pitch trajectory is rising and falling; the IQR of the pitch values in the section where the pitch trajectory is convex downward; the variance of the slope and the variance of the rate of change of the slope of the 500 to 1000 Hz frequency band energy trajectory; the maximum, mean, and variance of the 500 to 1000 Hz frequency band energy trajectory; the median of the trajectory in which the frequency band energy corresponding to 0 to 250 Hz is normalized to the total energy of each frame; and the mean of the trajectory in which the frequency band energy corresponding to 0 to 500 Hz is normalized to the total energy of each frame.

The apparatus of claim 1, wherein the emotional state

comprises at least one of a sad state, a normal state, a joy state, and an angry state.

The apparatus of claim 8, wherein, among the non-verbal parameters,

the magnitudes of the mean values of at least one of: the IQR of the pitch values in each of the sections where the pitch trajectory is rising and falling; the IQR of the pitch values in the section where the pitch trajectory is convex downward; the variance of the slope of the 500 to 1000 Hz frequency band energy trajectory; the mean and variance of the 500 to 1000 Hz frequency band energy trajectory; the median of the trajectory in which the frequency band energy corresponding to 0 to 250 Hz is normalized to the total energy of each frame; and the mean of the trajectory in which the frequency band energy corresponding to 0 to 500 Hz is normalized to the total energy of each frame, correspond, in ascending order, to the sad state, the normal state, the joy state, and the angry state, respectively.

The apparatus of claim 8, wherein, among the non-verbal parameters,

the magnitudes of the mean values of at least one of the variance of the rate of change of the slope of the 500 to 1000 Hz frequency band energy trajectory and the maximum of the 500 to 1000 Hz frequency band energy trajectory correspond, in descending order, to the sad state, the normal state, the joy state, and the angry state, respectively.

An emotion recognition method using a voice signal, comprising:

detecting at least one of a pitch trajectory and an energy trajectory from an input speech sound as non-verbal information;

detecting a specific section from at least one of the detected pitch trajectory and energy trajectory, and extracting a non-verbal parameter from the detected specific section; and

determining an emotional state recognized for the speech sound using the extracted non-verbal parameter.

The method of claim 12, wherein the detected specific section is

at least one of a downward-convex section, an upward-convex section, a rising section, and a falling section, detected by comparing a slope and a rate of change of the slope of at least one of the detected pitch trajectory and energy trajectory with a predetermined threshold value.

The method of claim 12, wherein the emotion recognition method further comprises

generating, as a database, modeling values for each emotional state using a plurality of non-verbal parameters extracted from an input training signal.

The method of claim 12,

wherein the detecting comprises:

detecting a voiced section from the speech sound and detecting the pitch trajectory over time within the voiced sound; and

detecting at least one energy trajectory over time by detecting energy from the speech sound.

The method of claim 12, wherein the determining comprises:

comparing the non-verbal parameter extracted from the input speech sound with modeling values for each emotional state, and determining the emotional state corresponding to the closest modeling value as the emotional state recognized for the speech sound.

The method of claim 12, wherein the determining comprises:

using one of a Gaussian Mixture Model (GMM) and a Hidden Markov Model (HMM) as the algorithm for determining the emotional state.

The method of claim 12, wherein the non-verbal parameter is

a vectorized value of at least one of: the IQR (interquartile range) of the pitch values in each of the sections where the pitch trajectory is rising and falling; the IQR of the pitch values in the section where the pitch trajectory is convex downward; the variance of the slope and the variance of the rate of change of the slope of the 500 to 1000 Hz frequency band energy trajectory; the maximum, mean, and variance of the 500 to 1000 Hz frequency band energy trajectory; the median of the trajectory in which the frequency band energy corresponding to 0 to 250 Hz is normalized to the total energy of each frame; and the mean of the trajectory in which the frequency band energy corresponding to 0 to 500 Hz is normalized to the total energy of each frame.

The method of claim 12, wherein the emotional state

comprises at least one of a sad state, a normal state, a joy state, and an angry state.

The method of claim 18, wherein, among the non-verbal parameters,

the magnitudes of the mean values of at least one of: the IQR of the pitch values in each of the sections where the pitch trajectory is rising and falling; the IQR of the pitch values in the section where the pitch trajectory is convex downward; the variance of the slope of the 500 to 1000 Hz frequency band energy trajectory; the mean and variance of the 500 to 1000 Hz frequency band energy trajectory; the median of the trajectory in which the frequency band energy corresponding to 0 to 250 Hz is normalized to the total energy of each frame; and the mean of the trajectory in which the frequency band energy corresponding to 0 to 500 Hz is normalized to the total energy of each frame, correspond, in ascending order, to the sad state, the normal state, the joy state, and the angry state, respectively.

The method of claim 18, wherein, among the non-verbal parameters,

the magnitudes of the mean values of at least one of the variance of the rate of change of the slope of the 500 to 1000 Hz frequency band energy trajectory and the maximum of the 500 to 1000 Hz frequency band energy trajectory correspond, in descending order, to the sad state, the normal state, the joy state, and the angry state, respectively.