KR20220098991A

KR20220098991A - Method and apparatus for recognizing emotions based on speech signal

Info

Publication number: KR20220098991A
Application number: KR1020210000952A
Authority: KR
Inventors: 권순일; 무스타킴
Original assignee: 세종대학교산학협력단
Priority date: 2021-01-05
Filing date: 2021-01-05
Publication date: 2022-07-12
Also published as: KR102541660B1

Abstract

An emotion recognition device for recognizing the emotions of a speaker based on voice signals according to one aspect of the present invention includes a memory storing a voice-based emotion recognition program, and a processor executing the program stored in the memory. The voice-based emotion recognition program receives the speaker's voice data and classifies the speaker's emotions by inputting the received voice data into an emotion classification model. The emotion classification model includes a local feature extraction unit that extracts local features of voice data through ConvLSTM, and a global feature extraction unit that extracts global features of voice data through GRU (gated recurrent unit), and classifies the speaker's emotions based on the extracted local and global features.

Description

Apparatus and method for emotion recognition based on voice signals

The present invention relates to a voice-based emotion recognition apparatus and method capable of recognizing a speaker's emotion from a voice signal by means of a machine learning model.

Research is under way on technology for recognizing a speaker's emotions from the human voice. In particular, smart speech emotion recognition (SER) technology, which uses artificial intelligence or machine learning, is known as a new field of digital audio signal processing and is expected to play an important role in many applications related to human-computer interaction (HCI).

Existing research has introduced a variety of deep neural networks (DNNs) to model emotion recognition from voice data. For example, DNN models that detect salient signals in raw audio samples have been proposed, as have techniques that feed the model a particular representation of the audio recording.

In particular, researchers extract hidden signals through various types of convolution operations and recognize lines, curves, points, shapes, and colors. For example, mid-level end-to-end models including convolutional neural networks (CNNs), recurrent neural networks (RNNs), long short-term memory (LSTM), and deep belief networks (DBNs) are used. However, because the configurations of these artificial neural network models are still inadequate, accuracy and recognition rates remain low. Models based on CNNs contribute little to improving the accuracy of emotion recognition.

In addition, RNNs and LSTMs are used to learn long-term temporal variations and to recognize emotions, but they increase the computation and training time of the whole model without significantly improving accuracy. There is therefore a need for an efficient framework that recognizes both spatial emotional signals and sequential signals.

Japanese Registered Patent Publication No. 6732703 (Title of invention: Emotion interaction model learning device, emotion recognition device, emotion interaction model learning method, emotion recognition method, and program)

The present invention is intended to solve the above problems of the prior art, and its object is to provide a voice-based emotion recognition apparatus and method capable of classifying a speaker's emotions from voice by using both the spatial and the temporal features of a voice signal.

However, the technical problem to be solved by the present embodiment is not limited to the one described above, and other technical problems may exist.

As a technical means for solving the above technical problem, an emotion recognition apparatus for recognizing a speaker's emotion based on a voice signal according to one aspect of the present invention includes a memory in which a voice-based emotion recognition program is stored, and a processor that executes the program stored in the memory. The voice-based emotion recognition program receives the speaker's voice data and inputs the received voice data into an emotion classification model to classify the speaker's emotions. The emotion classification model includes a local feature extraction unit that extracts local features of the voice data through ConvLSTM and a global feature extraction unit that extracts global features of the voice data through a gated recurrent unit (GRU), and classifies the speaker's emotions based on the local and global features.

In addition, an emotion recognition method using a voice-based emotion recognition apparatus according to another aspect of the present invention includes receiving voice data of a speaker, and inputting the received voice data into an emotion classification model to classify the speaker's emotions. The emotion classification model includes a local feature extraction unit that extracts local features of the voice data through ConvLSTM and a global feature extraction unit that extracts global features of the voice data through a gated recurrent unit (GRU), and classifies the speaker's emotions based on the local and global features.

According to the above means for solving the problem, the temporal and spatial features contained in voice data can be extracted effectively and the speaker's emotions can be classified automatically.

In particular, because the ConvLSTM model is used to extract the local features of the voice data, a continuous sequence of the voice signal can be recognized easily and connected emotional information can be extracted from the recognized sequence. In addition, the GRU allows emotional information that is separated in time to be considered together, which can improve the prediction performance of the SER system.

FIG. 1 is a block diagram illustrating the configuration of a voice-based emotion recognition apparatus according to an embodiment of the present invention.
FIG. 2 is a conceptual diagram illustrating a voice-based emotion recognition apparatus according to an embodiment of the present invention.
FIG. 3 is a flowchart illustrating a voice-based emotion recognition method according to an embodiment of the present invention.
FIG. 4 is a flowchart illustrating the process of constructing the emotion classification model used in a voice-based emotion recognition method according to an embodiment of the present invention.
FIG. 5 is a diagram illustrating the detailed configuration of a ConvLSTM layer according to an embodiment of the present invention.
FIG. 6 is a diagram illustrating the configuration of the GRU used in the global feature extraction unit according to an embodiment of the present invention.

Hereinafter, embodiments of the present application are described in detail with reference to the accompanying drawings so that those of ordinary skill in the art to which the present application pertains can readily practice them. However, the present application may be implemented in many different forms and is not limited to the embodiments described herein. In the drawings, parts irrelevant to the description are omitted for clarity, and similar parts are given similar reference numerals throughout the specification.

Throughout this specification, when a part is said to be "connected" to another part, this includes not only the case where it is "directly connected" but also the case where it is "electrically connected" with another element interposed between them.

Throughout this specification, when a member is said to be located "on" another member, this includes not only the case where the member is in contact with the other member but also the case where a further member is present between the two.

Hereinafter, an embodiment of the present invention is described in detail with reference to the accompanying drawings.

FIG. 1 is a block diagram illustrating the configuration of a voice-based emotion recognition apparatus according to an embodiment of the present invention.

As shown, the voice-based emotion recognition apparatus 100 may include a communication module 110, a memory 120, a processor 130, and a database 140. The voice-based emotion recognition apparatus 100 may also have a built-in microphone or the like, through which it can generate voice data directly.

The communication module 110 receives the speaker's voice data from an external device; it may receive, through the communication network 300, voice data input through a microphone or the like connected to various smart terminals. The communication module 110 may also receive update information for the voice-based emotion recognition program and the like from various external devices (servers or terminals) and transmit it to the processor 130.

The communication module 110 may be a device including the hardware and software necessary for transmitting and receiving signals, such as control signals or data signals, over a wired or wireless connection with other network devices.

The memory 120 stores a voice-based emotion recognition program that classifies the speaker's emotions based on the speaker's voice. The memory 120 also stores the operating system for driving the voice-based emotion recognition apparatus 100 and various kinds of data generated while the voice-based emotion recognition program is executed.

The voice-based emotion recognition program receives the speaker's voice data through the communication module 110 and inputs the received voice data into the emotion classification model to classify the speaker's emotions. The emotion classification model includes a local feature extraction unit that extracts local features of the voice data through ConvLSTM and a global feature extraction unit that extracts global features of the voice data through a gated recurrent unit (GRU), and may additionally include a loss function unit that updates the emotion classification model through a loss function. The details of the voice-based emotion recognition program are described later.

Here, the memory 120 collectively refers to non-volatile storage devices, which retain stored information even when power is not supplied, and volatile storage devices, which require power to retain stored information.

The memory 120 may also temporarily or permanently store data processed by the processor 130. In addition to volatile storage devices that require power to retain stored information, the memory 120 may include magnetic storage media or flash storage media, but the scope of the present invention is not limited thereto.

The processor 130 executes the program stored in the memory 120. By executing the voice-based emotion recognition program, it constructs the emotion classification model and uses the constructed model to classify the speaker's emotions based on voice.

The processor 130 may include any type of device capable of processing data. For example, it may refer to a data processing device embedded in hardware that has physically structured circuitry for performing functions expressed as code or instructions contained in a program. Examples of such a hardware-embedded data processing device include a microprocessor, a central processing unit (CPU), a processor core, a multiprocessor, an application-specific integrated circuit (ASIC), and a field programmable gate array (FPGA), but the scope of the present invention is not limited thereto.

The database 140 stores or provides the data required by the voice-based emotion recognition apparatus 100 under the control of the processor 130. The database 140 may be included as a component separate from the memory 120 or may be built in a partial area of the memory 120.

Meanwhile, the voice-based emotion recognition apparatus 100 may record the speaker's voice signal through a microphone or the like that is built into or connected to the apparatus 100, generate voice data directly, and perform emotion recognition on that data.

FIG. 2 is a conceptual diagram illustrating a voice-based emotion recognition apparatus according to an embodiment of the present invention, FIG. 3 is a flowchart illustrating a voice-based emotion recognition method according to an embodiment of the present invention, and FIG. 4 is a flowchart illustrating the process of constructing the emotion classification model used in the voice-based emotion recognition method according to an embodiment of the present invention.

The voice-based emotion recognition method performed by the voice-based emotion recognition program stored in the memory 120 is described below.

First, the voice-based emotion recognition program installed in the voice-based emotion recognition apparatus 100 receives voice data recorded through a microphone or the like, either from the microphone or through the communication module 110 (S310). The voice data is digital data and may be divided into voice segments of a predetermined time unit before being input to the emotion classification model. Consecutive voice segments are strongly correlated in time, and emotion recognition is performed using this property.
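The segmentation described in step S310 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `split_into_segments`, the 8 kHz sample rate, the 0.25 s segment length, and the choice to drop a short trailing remainder are all assumptions made for the example.

```python
from typing import List, Sequence


def split_into_segments(samples: Sequence[float], sample_rate: int,
                        segment_seconds: float) -> List[List[float]]:
    """Split a digitized signal into consecutive, non-overlapping segments.

    A trailing remainder shorter than one segment is dropped (an assumption;
    the source does not say how the remainder is handled).
    """
    segment_len = int(sample_rate * segment_seconds)
    if segment_len <= 0:
        raise ValueError("segment length must be at least one sample")
    n_segments = len(samples) // segment_len
    return [list(samples[i * segment_len:(i + 1) * segment_len])
            for i in range(n_segments)]


# Hypothetical input: 1 second of silence at 8 kHz, cut into 0.25 s segments.
signal = [0.0] * 8000
segments = split_into_segments(signal, sample_rate=8000, segment_seconds=0.25)
```

Keeping the segments in their original order matters here, because the model relies on the temporal correlation between consecutive segments.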

Next, the voice-based emotion recognition program inputs the received voice data into the emotion classification model to classify the speaker's emotion (S320). For example, when voice data is input to the emotion classification model, the model may output the speaker's emotional state as a class such as 'angry', 'sad', 'happy', or 'neutral'.
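The final classification in step S320 amounts to picking the most probable class. The sketch below assumes the model emits one raw score per class and that a softmax turns the scores into probabilities; the label order in `EMOTIONS` and the function names are illustrative, not taken from the patent.

```python
import math
from typing import List, Sequence, Tuple

# Labels from the example in the text; the index order is an assumption.
EMOTIONS = ["angry", "sad", "happy", "neutral"]


def softmax(scores: Sequence[float]) -> List[float]:
    """Convert raw class scores into probabilities (numerically stable)."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify(scores: Sequence[float]) -> Tuple[str, float]:
    """Return the most probable emotion label and its probability."""
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return EMOTIONS[best], probs[best]


# Hypothetical model output for one utterance.
label, confidence = classify([0.1, 2.0, -1.0, 0.5])
```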

The emotion classification model includes a local feature extraction unit that extracts local features of the voice data through ConvLSTM and a global feature extraction unit that extracts global features of the voice data through a gated recurrent unit (GRU). Its detailed configuration and construction process are described in more detail with reference to FIG. 2 and FIGS. 4 to 6.

The voice data processed in the present invention has temporally continuous features. Features of such data are typically extracted with a model such as LSTM, but that model increases computation and training time.

Accordingly, the emotion classification model of the present invention includes a local feature extraction unit composed of a plurality of local feature learning blocks (LFLBs) that use ConvLSTM, which merges a CNN model, useful for extracting and learning spatial features, with an LSTM model, useful for extracting and learning temporal features. By using spatial features, the present invention extracts not only features separated by temporal intervals between voice segments but also features separated by intervals in the frequency band, and uses them to classify emotions. This makes it possible to exploit emotional characteristics expressed over long as well as short periods, which contributes to improved emotion recognition performance.

The drawing shows a structure in which four local feature learning blocks (LFLBs) are connected in sequence, but this is an exemplary configuration and the present invention is not limited thereto.

As shown in FIG. 2, each local feature learning block has a structure in which a ConvLSTM layer, a batch normalization (BN) layer, and a pooling layer are connected in sequence. Connecting the local feature learning blocks in sequence makes it possible to find input-to-state and state-to-state correlations between voice segments. That is, while the sequentially input voice segments are processed, the correlations among them are captured and used to recognize emotions.

The ConvLSTM layer is used for hidden step-by-step prediction, keeping sequential information in its internal state in order to optimize the sequence and find spatiotemporal correlations between voice segments.

The process of constructing the emotion classification model is described with reference to FIG. 4.

First, the voice-based emotion recognition program installed in the voice-based emotion recognition apparatus 100 receives voice data recorded through a microphone or the like, either from the microphone or through the communication module 110 (S410).

Next, the voice data is input to the ConvLSTM-based local feature extraction unit (S420).

FIG. 5 is a diagram illustrating the detailed configuration of a ConvLSTM layer according to an embodiment of the present invention.

The illustrated ConvLSTM layer calculates its weights using the following equation.

[Equation 1]

Here, σ denotes the sigmoid function, * denotes the convolution operation, ⊙ denotes an element-wise operation, tanh denotes the hyperbolic tangent function, w denotes the weight for each variable, b denotes the bias value, t denotes the iteration index of the operation, x_t denotes the input data, c_t denotes the cell state, and h_t denotes the hidden state. In the ConvLSTM layer, part of the operations that would otherwise be matrix multiplications are replaced by convolution operations.

In FIG. 5, i_t denotes the input gate, f_t the forget gate, o_t the output gate, and g_t the input modulation gate, which is the same as the configuration of a general LSTM.
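Equation 1 itself is not reproduced in this text. For orientation, a commonly used ConvLSTM formulation that matches the symbols defined above (σ, *, ⊙, tanh, w, b, x_t, c_t, h_t and the gates i_t, f_t, o_t, g_t) is given below. This is an assumption based on the standard ConvLSTM literature, written without peephole connections; the subscripts on w and b are illustrative, and the patent's actual Equation 1 may differ.

```latex
\begin{aligned}
i_t &= \sigma\left(w_{xi} * x_t + w_{hi} * h_{t-1} + b_i\right) \\
f_t &= \sigma\left(w_{xf} * x_t + w_{hf} * h_{t-1} + b_f\right) \\
o_t &= \sigma\left(w_{xo} * x_t + w_{ho} * h_{t-1} + b_o\right) \\
g_t &= \tanh\left(w_{xg} * x_t + w_{hg} * h_{t-1} + b_g\right) \\
c_t &= f_t \odot c_{t-1} + i_t \odot g_t \\
h_t &= o_t \odot \tanh\left(c_t\right)
\end{aligned}
```

Compared with a plain LSTM, the only structural change in this form is that the products between weights and x_t or h_{t-1} are convolutions, which is what lets the layer keep the spatial layout of its inputs and states.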

Meanwhile, in ConvLSTM, the data processed at each gate, the input data (x_t), the cell state (c_t), and the hidden state (h_t) are all represented as three-dimensional tensors. In the input tensor, the first dimension represents temporal information, the second represents size information, and the third represents spatial information. ConvLSTM is thus technically characterized by extracting spatiotemporal features during state-to-state transitions.

Referring again to FIG. 4, the voice data is input to the GRU-based global feature extraction unit (S430).

As shown in FIG. 2, the global feature extraction unit includes a global feature learning block (GFLB). The GFLB includes a gated recurrent unit (GRU) in order to learn global feature information from the voice data and to recognize long-term contextual dependencies.

FIG. 6 is a diagram illustrating the configuration of the GRU used in the global feature extraction unit according to an embodiment of the present invention.

The GRU is a kind of LSTM framework to which a gate mechanism is applied and, as shown in (a), includes an update gate and a reset gate. The update gate performs the same operations as the forget gate and the input gate in the LSTM, and the reset gate performs the same operation as the reset gate in the LSTM.

The reset gate resets past information to an appropriate degree. It uses a sigmoid function and produces its output as given by the following equation.

[Equation 2]

The update gate determines the ratio at which past and present information is updated. It uses a sigmoid function and produces its output as given by the following equation.

[Equation 3]

In addition, the update gate computes the candidate information for the current time step through Equation 4, using the result of the reset gate.

[Equation 4]

Finally, the final hidden state is determined by Equation 5, which combines the update-gate outputs determined by Equations 3 and 4.

[Equation 5]

Here, σ denotes the sigmoid function, * denotes element-wise multiplication, tanh denotes the hyperbolic tangent function, W and U denote the weights for each variable, t denotes the iteration index of the operation, x_t denotes the input data, and h_t denotes the hidden state.
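Equations 2 to 5 are not reproduced in this text. A standard GRU formulation consistent with the symbols defined above (σ, element-wise *, tanh, weights W and U, x_t, h_t) would read as follows. The names r_t, z_t, and \tilde{h}_t for the reset gate, update gate, and candidate state are assumptions, as is the omission of bias terms and the choice of which term z_t multiplies in the last line; the patent's own equations may use a different convention.

```latex
\begin{aligned}
r_t &= \sigma\left(W_r x_t + U_r h_{t-1}\right) && \text{(reset gate, cf. Equation 2)} \\
z_t &= \sigma\left(W_z x_t + U_z h_{t-1}\right) && \text{(update gate, cf. Equation 3)} \\
\tilde{h}_t &= \tanh\left(W_h x_t + U_h \left(r_t * h_{t-1}\right)\right) && \text{(candidate, cf. Equation 4)} \\
h_t &= \left(1 - z_t\right) * h_{t-1} + z_t * \tilde{h}_t && \text{(hidden state, cf. Equation 5)}
\end{aligned}
```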

With this configuration, short-term dependencies in the voice segments are activated by the reset gate, and the previous state of a voice segment is controlled by the update gate, which also controls long-term context information.
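The interplay of the two gates can be made concrete with a single GRU step in plain Python. This is a sketch of the standard GRU update under stated assumptions (no bias terms, weights passed as nested lists, the update gate weighting the candidate state); the function and parameter names are hypothetical and are not taken from the patent.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def sigmoid(v: float) -> float:
    return 1.0 / (1.0 + math.exp(-v))


def matvec(m: Matrix, v: Vector) -> Vector:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def gru_step(x: Vector, h_prev: Vector,
             W_r: Matrix, U_r: Matrix,
             W_z: Matrix, U_z: Matrix,
             W_h: Matrix, U_h: Matrix) -> Vector:
    """One GRU time step: returns the new hidden state."""
    # Reset gate: how much of the previous state feeds the candidate.
    r = [sigmoid(a + b) for a, b in zip(matvec(W_r, x), matvec(U_r, h_prev))]
    # Update gate: mixing ratio between previous state and candidate.
    z = [sigmoid(a + b) for a, b in zip(matvec(W_z, x), matvec(U_z, h_prev))]
    # Candidate state, computed from the reset-gated previous state.
    gated = [ri * hi for ri, hi in zip(r, h_prev)]
    cand = [math.tanh(a + b)
            for a, b in zip(matvec(W_h, x), matvec(U_h, gated))]
    # Blend the previous state and the candidate element by element.
    return [(1.0 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h_prev, cand)]


# With all-zero weights both gates equal 0.5 and the candidate is 0,
# so the new state is exactly half of the previous state.
zeros = [[0.0, 0.0], [0.0, 0.0]]
h_new = gru_step([1.0, 2.0], [0.4, -0.2], zeros, zeros, zeros, zeros, zeros, zeros)
```

Stacking two such units, as the text describes for the global feature extraction unit, would mean feeding the hidden state of the first unit as the input `x` of the second at every time step.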

Meanwhile, in the present invention, as shown in (b), a plurality of unit layers, each consisting of two stacked GRUs, may be arranged to adjust the weights for the global features.

A fully connected layer is coupled to the output of the global feature extraction unit and classifies the speaker's emotions; it may be updated based on the result of the fusion loss function that follows it.

Referring again to FIG. 4, the emotion classification model is updated using a loss function (S440).

In the present invention, the loss of the emotion classification model is calculated using a center loss function and a softmax loss function. A model trained with the softmax loss alone shows somewhat lower predictive performance because intra-class distances remain large.

The present invention uses the center loss function to compute the minimum intra-class distance and the softmax loss function to compute the maximum inter-class distance; the specific equations are as follows.

[Equation 6]: Softmax loss function

[Equation 7]: Center loss function

Here, n is the number of classes, m is the mini-batch size, and c_yi is the center of class y_i.

To compute the minimum distance needed to prevent misclassification in real-time scenarios, a fusion loss function that reflects both the softmax loss and the center loss, with the center loss weighted by λ, is used as in Equation 8.

[Equation 8]
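Equations 6 to 8 are not reproduced in this text. The forms below are the widely used definitions of the softmax loss, the center loss, and their weighted sum, written with the symbols defined above (n classes, mini-batch size m, class center c_{y_i}, weight λ). The feature vector x_i, the classifier weights W_j, and the biases b_j are assumed notation, and the patent's exact expressions may differ, for example in normalizing by m.

```latex
\begin{aligned}
L_S &= -\sum_{i=1}^{m} \log \frac{\exp\left(W_{y_i}^{\top} x_i + b_{y_i}\right)}{\sum_{j=1}^{n} \exp\left(W_j^{\top} x_i + b_j\right)} && \text{(softmax loss, cf. Equation 6)} \\
L_C &= \frac{1}{2} \sum_{i=1}^{m} \left\lVert x_i - c_{y_i} \right\rVert_2^2 && \text{(center loss, cf. Equation 7)} \\
L &= L_S + \lambda L_C && \text{(fusion loss, cf. Equation 8)}
\end{aligned}
```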

The emotion classification model calculates the loss on the outputs of the local and global feature extraction units through the fusion loss function, which is based on the center loss and softmax loss functions, and updates its weights in the direction that minimizes the loss.
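A fusion loss of this kind can be sketched in a few lines of plain Python. The sketch assumes the common definitions (cross-entropy over softmax, half the squared distance to the class center, both summed over the mini-batch, combined as softmax loss plus λ times center loss); the function names and the sample values are hypothetical and are not taken from the patent.

```python
import math
from typing import List, Sequence


def softmax_loss(logits_batch: Sequence[Sequence[float]],
                 labels: Sequence[int]) -> float:
    """Sum of cross-entropy losses over the mini-batch."""
    total = 0.0
    for logits, y in zip(logits_batch, labels):
        peak = max(logits)
        # log-sum-exp, shifted by the maximum for numerical stability
        log_sum = peak + math.log(sum(math.exp(v - peak) for v in logits))
        total += log_sum - logits[y]
    return total


def center_loss(features: Sequence[Sequence[float]],
                labels: Sequence[int],
                centers: Sequence[Sequence[float]]) -> float:
    """Half the summed squared distance from each feature to its class center."""
    return 0.5 * sum(
        sum((f - c) ** 2 for f, c in zip(feat, centers[y]))
        for feat, y in zip(features, labels)
    )


def fusion_loss(logits_batch, features, labels, centers, lam: float) -> float:
    """Softmax loss plus lambda-weighted center loss."""
    return (softmax_loss(logits_batch, labels)
            + lam * center_loss(features, labels, centers))


# Hypothetical mini-batch of one sample with two classes.
centers: List[List[float]] = [[0.0, 0.0], [5.0, 5.0]]
loss = fusion_loss([[0.0, 0.0]], [[1.0, 2.0]], [0], centers, lam=0.1)
```

A larger λ pulls features of the same class closer to their center, while the softmax term keeps the classes separable; in practice the class centers would also be updated during training, which this sketch leaves out.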

To evaluate the effect of the emotion classification model of the present invention configured as described above, IEMOCAP and RAVDESS, two open databases widely used in academic experiments in this field, were used. Each is a standard corpus of emotional speech. The present invention achieved a recognition rate of 75% on the IEMOCAP corpus and 80% on the RAVDESS corpus, which were the highest figures as of the end of 2020.

An embodiment of the present invention may also be implemented in the form of a recording medium containing computer-executable instructions, such as a program module executed by a computer. A computer-readable medium may be any available medium that can be accessed by a computer, and includes volatile and nonvolatile media as well as removable and non-removable media. A computer-readable medium may also include a computer storage medium. Computer storage media include volatile and nonvolatile, removable and non-removable media implemented by any method or technology for storing information such as computer-readable instructions, data structures, program modules, or other data.

Although the methods and systems of the present invention have been described with reference to specific embodiments, some or all of their components or operations may be implemented using a computer system having a general-purpose hardware architecture.

The foregoing description of the present application is provided for illustration, and a person of ordinary skill in the art to which the present application pertains will understand that it can readily be modified into other specific forms without changing its technical spirit or essential features. The embodiments described above should therefore be understood as illustrative in all respects and not restrictive. For example, each component described as a single unit may be implemented in a distributed manner, and likewise components described as distributed may be implemented in a combined form.

The scope of the present application is defined by the following claims rather than by the above detailed description, and all changes or modifications derived from the meaning and scope of the claims and their equivalents should be construed as falling within the scope of the present application.

100: voice-based emotion recognition device
110: communication module
120: memory
130: processor
140: database

Claims

An emotion recognition apparatus for recognizing a speaker's emotion based on a voice signal, comprising:
a memory in which a voice-based emotion recognition program is stored; and
a processor for executing the program stored in the memory,
wherein the voice-based emotion recognition program receives the speaker's voice data and classifies the speaker's emotion by inputting the received voice data into an emotion classification model, and
wherein the emotion classification model includes a local feature extraction unit that extracts local features of the voice data through ConvLSTM and a global feature extraction unit that extracts global features of the voice data through a gated recurrent unit (GRU), and classifies the speaker's emotion based on the local features and the global features.

The apparatus of claim 1,
wherein the local feature extraction unit has a structure in which a plurality of local feature learning blocks are sequentially connected, and each local feature learning block has a structure in which a ConvLSTM layer, a batch normalization (BN) layer, and a pooling layer are sequentially connected.

The apparatus of claim 1,
wherein the global feature extraction unit includes a plurality of unit layers, each including two stacked gated recurrent units (GRUs).

The apparatus of claim 1,
wherein the emotion classification model calculates the loss for the outputs of the local feature extraction unit and the global feature extraction unit through a fusion loss function based on a center loss function and a softmax loss function, and updates the weights in the direction that minimizes the loss.

An emotion recognition method using a voice-based emotion recognition apparatus, comprising:
receiving voice data of a speaker; and
inputting the received voice data into an emotion classification model to classify the speaker's emotion,
wherein the emotion classification model includes a local feature extraction unit that extracts local features of the voice data through ConvLSTM and a global feature extraction unit that extracts global features of the voice data through a gated recurrent unit (GRU), and classifies the speaker's emotion based on the local features and the global features.

The method of claim 5,
wherein the local feature extraction unit has a structure in which a plurality of local feature learning blocks are sequentially connected, and each local feature learning block has a structure in which a ConvLSTM layer, a batch normalization (BN) layer, and a pooling layer are sequentially connected.

The method of claim 5,
wherein the global feature extraction unit includes a plurality of unit layers, each including two stacked gated recurrent units (GRUs).

The method of claim 5,
further comprising updating the emotion classification model,
wherein the updating includes calculating the loss for the outputs of the local feature extraction unit and the global feature extraction unit through a fusion loss function based on a center loss function and a softmax loss function, and updating the weights in the direction that minimizes the loss.