KR102031954B1

KR102031954B1 - Apparatus and method for speech emotion recognition using a reasoning process

Info

Publication number: KR102031954B1
Application number: KR1020180024714A
Authority: KR
Inventors: 백성욱; 이미영; 권순일; 전석봉; 자밀 아마드; 칸 무하마드; 울 하크 이자즈; 박준렬
Original assignee: 세종대학교산학협력단
Priority date: 2018-02-28
Filing date: 2018-02-28
Publication date: 2019-10-14
Also published as: KR20190103810A

Abstract

A method of estimating a user's emotion based on a voice signal, according to one aspect of the present invention, comprises: converting the user's voice signal into segment-level voice data; and inputting the segment-level voice data into a 3D convolutional neural network, extracting features of the voice data for each segment, and calculating, based on the extracted feature values, an emotional state estimate indicating the emotion into which each piece of voice data is to be classified. In the calculating step, the emotional state estimate may be calculated based on the sum of three values: a first confidence indicating the probability of classification into each of a plurality of preset emotional states, based on the feature values extracted through the 3D convolutional neural network; a second confidence indicating the past classification history, calculated from the emotion classification of each piece of voice data relative to the total number of voice data whose emotional state was previously classified; and a third confidence indicating the emotional state probability that the voice data of the segment under classification may have, based on a knowledge base indicating the probability that a plurality of emotions can coexist.

Description

Apparatus and method for speech emotion recognition using a reasoning process {APPARATUS AND METHOD FOR SPEECH EMOTION RECOGNITION USING A REASONING PROCESS}

The present invention relates to technology for recognizing the emotion of a speaker engaged in spoken conversation, and more particularly to a method and apparatus for determining the speaker's emotional state using deep learning and reasoning techniques.

Technology for recognizing human speech is advancing rapidly to meet the demands of users who want more convenient services. Recently, the field has moved beyond simple speech recognition toward recognizing emotion in a user's voice using deep learning.

Speech emotion recognition detects and recognizes human emotion from a voice signal. However, approaches that recognize emotion from language are hampered by the variety of emotional expression, and the cues to an emotion may lie not in the words but in the strength of expression or in the voice signal itself, which makes the emotion difficult for a machine to judge.

An embodiment of the present invention addresses the above problems of the prior art. It is a technique for recognizing a user's emotional state from the user's voice, and it aims to achieve higher accuracy than conventional emotion recognition by recognizing the current emotional state from deep learning analysis combined with emotion information that changes over time.

However, the technical problem to be solved by this embodiment is not limited to the one described above, and other technical problems may exist.

As a technical means for achieving the above object, a method of estimating a user's emotion based on a voice signal, according to one aspect of the present invention, comprises: converting the user's voice signal into segment-level voice data; and inputting the segment-level voice data into a 3D convolutional neural network, extracting features of the voice data for each segment, and calculating, based on the extracted feature values, an emotional state estimate indicating the emotion into which each piece of voice data is to be classified. In the calculating step, the emotional state estimate may be calculated based on the sum of three values: a first confidence indicating the probability of classification into each of a plurality of preset emotional states, based on the feature values extracted through the 3D convolutional neural network; a second confidence indicating the past classification history, calculated from the emotion classification of each piece of voice data relative to the total number of voice data whose emotional state was previously classified; and a third confidence indicating the emotional state probability that the voice data of the segment under classification may have, based on a knowledge base indicating the probability that a plurality of emotions can coexist.

An apparatus for estimating a user's emotion based on a voice signal, according to another aspect of the present invention, includes a memory storing a program that extracts voice data from the user's voice signal and calculates an emotional state estimate based on the voice data, and a processor that executes the program stored in the memory to estimate the user's emotion based on the voice signal. The processor converts the user's voice signal into segment-level voice data, inputs the segment-level voice data into a 3D convolutional neural network, extracts features of the voice data for each segment, and calculates, based on the extracted feature values, an emotional state estimate indicating the emotion into which each piece of voice data is to be classified. The apparatus may calculate the emotional state estimate based on the sum of the first confidence, the second confidence, and the third confidence, each defined as in the method above.

According to any one of the above means, the invention recognizes the current emotional state of a user from the user's voice, based on deep learning analysis and on emotion information that changes over time, and can thereby achieve higher accuracy than conventional emotion recognition techniques.

FIG. 1 is a diagram showing the input/output structure of speech emotion recognition using a reasoning process, according to an embodiment of the present invention.
FIG. 2 is a diagram showing the probability that a plurality of emotions coexist in consecutive speech segments, according to an embodiment of the present invention.
FIG. 3 is a conventional transition diagram showing the relationships between conflicting and coexisting emotional states.
FIG. 4 is a diagram showing a method of combining the three confidences, with an example, according to an embodiment of the present invention.
FIG. 5 is a diagram showing a method of resolving conflicts between the emotions of speech segments, with an example, according to an embodiment of the present invention.
FIG. 6 is a flowchart showing a method of speech emotion recognition using a reasoning process, according to an embodiment of the present invention.
FIG. 7 is a diagram showing how the confidence level is rated as high or low by checking the confidence returned by the 3D convolutional neural network, according to an embodiment of the present invention.

Hereinafter, embodiments of the present invention are described in detail with reference to the accompanying drawings so that a person of ordinary skill in the art can readily practice the invention. The present invention may, however, be implemented in many different forms and is not limited to the embodiments described here. To describe the invention clearly, parts unrelated to the description are omitted from the drawings, and similar parts are given similar reference numerals throughout the specification. In addition, a component referred to by the same name may carry different reference numerals in different drawings; reference numerals are given only for convenience of description, and the concept, features, functions, or effects of each component are not to be construed as limited by its reference numeral.

Throughout the specification, when a part is said to be "connected" to another part, this includes not only being "directly connected" but also being "electrically connected" with another element in between. Also, when a part is said to "include" a component, this means, unless specifically stated otherwise, that it may further include other components rather than excluding them, and it does not preclude the presence or addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

In this specification, a "unit" or "module" includes a unit realized by hardware, a unit realized by software, and a unit realized using both. One unit may be realized using two or more pieces of hardware, and two or more units may be realized by a single piece of hardware.

FIG. 1 is a diagram showing the input/output structure for speech emotion recognition using a reasoning process, according to an embodiment of the present invention.

Referring to FIG. 1, the input/output structure for speech emotion recognition using a reasoning process may include the user's voice signal 100, a deep learning model in the form of a 3D convolutional neural network 200, and an emotional state estimation module 300.

When the user's voice is received by the system through any voice recording device, the voice signal 100 is divided into segments, and each input is converted into a spectrogram using the fast Fourier transform (FFT).

A spectrogram is a visual representation of the strength or magnitude, over time, of a signal at the various frequencies present in a waveform, and it can be drawn as a spectral graph. In the two-dimensional representation, the X axis represents time and the Y axis represents frequency.

The amplitude at each frequency is shown as an intensity or color value at the corresponding point in time. To compute it, the voice signal is divided into small pieces, and the FFT algorithm is applied to each piece to calculate the magnitude of its frequency spectrum. For each piece, the FFT returns a column of amplitude values for that time, and the columns are placed side by side to form the spectrogram image.
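The spectrogram construction described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the frame length, hop size, Hann window, and the function name `spectrogram` are all assumptions, since the text does not specify them.

```python
import numpy as np

def spectrogram(signal, frame_len=256, hop=128):
    """Magnitude spectrogram: rows are frequency bins, columns are time frames.

    frame_len and hop are illustrative values, not taken from the patent.
    """
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    columns = []
    for i in range(n_frames):
        # Cut one small piece of the signal and taper its edges.
        piece = signal[i * hop : i * hop + frame_len] * window
        # FFT magnitude of this piece gives one column of the image.
        columns.append(np.abs(np.fft.rfft(piece)))
    # Place the columns side by side along the time axis.
    return np.stack(columns, axis=1)

# One second of a 1 kHz tone sampled at 8 kHz, used only as test input.
sample_rate = 8000
t = np.arange(sample_rate) / sample_rate
tone = np.sin(2 * np.pi * 1000 * t)
spec = spectrogram(tone)
```

With these assumed settings each frequency bin is 8000 / 256 = 31.25 Hz wide, so the 1 kHz tone shows up as a bright horizontal line at bin 32.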

The generated spectrogram is passed to the 3D convolutional neural network 200 for feature extraction and classification. The last layer of the 3D convolutional neural network 200 consists of seven emotion classes: happy, sad, angry, fear, disgust, boredom, and neutral.

Accordingly, once the confidence levels, consisting of seven probabilities, are determined for a speech segment, each probability value is classified as high, medium, or low confidence so that low-confidence values can be handled. If the confidence is high, the next step additionally checks for conflicting emotions. Confidence values that differ only slightly and correspond to conflicting emotions are resolved through the evidence, verification, and reasoning proposed in the present invention, which draw on knowledge of coexisting emotions and on the current speaker's emotion history.

The detailed method of classifying each probability value as high, medium, or low confidence is described with reference to FIG. 7.

In the emotional state estimation module 300, the emotional state estimate expresses how well the current speech matches each of the seven emotion classes described above, as determined by analyzing the speech. Happy, sad, angry, fear, disgust, boredom, and neutral may each be expressed as a probability score, and more or fewer than seven emotions may be used as needed.

The emotional state estimation module 300 calculates the emotional state estimate as follows. Features of the voice data are extracted for each segment through the 3D convolutional neural network 200. The estimate is then calculated by combining three values: a first confidence, which classifies the segment into the plurality of preset emotional states based on the extracted feature values; a second confidence, which represents the past classification history relative to all voice data classified up to the point just before the first confidence is calculated; and a third confidence, which is the emotional state probability that the voice data of the segment under classification may have, based on a knowledge base of the probabilities that the seven emotions described above can coexist.

FIG. 2 is a diagram showing, as probabilities, the relationships between emotions when a plurality of emotions coexist in consecutive speech segments, according to an embodiment of the present invention.

Referring to FIG. 2, the probabilities that a plurality of emotions coexist in consecutive speech segments are given for sets in which a sequence of speech segments expresses two or three emotions together.

The final emotion class is predicted by combining high-confidence, non-conflicting emotions with the aggregated predictions for resolved conflicts and medium-confidence emotions. In one embodiment, the occurrences of each emotion in each sequence are counted, the most frequently repeated emotion is selected, and the count of that emotion in the sequence is divided by the total number of emotions. The emotion with the highest value is then selected for the voice signal under observation, and high-confidence spectrograms and their annotations are accumulated in a database to improve accuracy and to update the speaker's emotion history.

FIG. 3 is a conventional state transition diagram showing the relationships between conflicting and coexisting emotional states.

Referring to FIG. 3, the state transition diagram expresses, as probabilities, how a person may change from the current emotion to another emotion, and these probabilities may be defined as the third confidence.

These probabilities can be assigned based on psychological research and expert opinion. Emotions that conflict with or oppose each other receive lower probability scores than emotions that coexist.

In FIG. 3, the seven emotions are connected by dotted and solid lines: emotions that conflict with each other are joined by dotted lines, and emotions that can coexist are joined by solid lines. For example, the probability that a person's emotional state changes from "happy" to "angry" may be 0.12 (conflicting emotions), and the probability of changing from "happy" to "neutral" may be 0.6 (coexisting emotions).
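One simple way to hold such a knowledge base is a lookup table keyed by (previous emotion, current emotion). The sketch below is hypothetical: the dictionary layout and the names `TRANSITION_PROB` and `transition_probability` are assumptions, and only the two probabilities quoted in the text are filled in.

```python
# Hypothetical excerpt of the transition knowledge base of FIG. 3.
# Only the two values quoted in the text are included.
TRANSITION_PROB = {
    ("happy", "angry"): 0.12,   # conflicting pair (dotted line)
    ("happy", "neutral"): 0.6,  # coexisting pair (solid line)
}

def transition_probability(previous, current):
    """Return the stored probability, or None if the pair is not in this excerpt."""
    return TRANSITION_PROB.get((previous, current))
```

Returning `None` for pairs that are not listed keeps the sketch from inventing values that the text does not give.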

FIG. 4 is a diagram showing how the emotional state estimation module 300 calculates the emotional state estimate by summing the three confidences, with an example, according to an embodiment of the present invention.

Referring to FIG. 4, the computation is performed using formula (a), which determines the emotional state from the three confidences, and the weight values (b) used in the formula.

In formula (a), S₁, S₂, and S₃ denote the scores of the first, second, and third confidences, respectively. The emotional state estimation module 300 derives the emotional state estimate by applying the weights (b), r₁, r₂, and r₃, to the respective confidence values.

The weights (b) are calculated based on psychological research and accumulated data, and they may be variable.

Taking panel (c) as an example, the confidence values are as shown in Table 1 below, and applying the weights (b) gives the calculations that follow.

Table 1
           First confidence   Second confidence   Third confidence
Happy      0.5                0.6                 0.99
Neutral    0.58               0                   0.79

The emotional state estimate for each emotion may be calculated as follows.

Case 1: F(happy) = (0.15 * 0.5 + 0.5 * 0.6 + 0.35 * 0.99) = 0.7215

Case 2: F(neutral) = (0.15 * 0.58 + 0.5 * 0 + 0.35 * 0.79) = 0.3635

According to these calculations, the probability that the user currently feels happy may be 0.7215, and the probability that the user feels neutral may be 0.3635. FIG. 4 illustrates the example with only two emotions, happy and neutral, but the actual invention calculates the probability for all seven emotions.
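The two cases above follow the weighted sum F = r₁·S₁ + r₂·S₂ + r₃·S₃ with weights 0.15, 0.5, and 0.35. A minimal sketch that reproduces the numbers is given below; the function and variable names are illustrative assumptions, and the weights are simply those that appear in the worked example.

```python
# Weights r1, r2, r3 as they appear in the worked example.
WEIGHTS = (0.15, 0.5, 0.35)

def emotion_score(s1, s2, s3, weights=WEIGHTS):
    """Weighted sum of the first, second, and third confidences."""
    r1, r2, r3 = weights
    return r1 * s1 + r2 * s2 + r3 * s3

# Confidence values from Table 1: (first, second, third).
table_1 = {
    "happy": (0.5, 0.6, 0.99),
    "neutral": (0.58, 0.0, 0.79),
}

scores = {emotion: emotion_score(*conf) for emotion, conf in table_1.items()}
best = max(scores, key=scores.get)  # emotion with the highest estimate
```

Here `scores` comes out at about 0.7215 for happy and 0.3635 for neutral, so `best` is "happy", matching Case 1 and Case 2.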

FIG. 5 is an exemplary diagram showing the process of extracting emotion from speech segments, according to an embodiment of the present invention.

First, referring to panel (a) of FIG. 5, the number of occurrences of each distinct emotion in the previous N samples can be counted. In the example, the previous emotions are happy / happy / neutral / happy / happy, which gives E1: happy = 4 and E2: neutral = 1, with N = 5.

The currently estimated emotion in the example (defined in the present invention as the first confidence) is sad = 0.34 and happy = 0.31. In practice, the invention calculates probabilities for all seven emotions (happy, sad, angry, fear, disgust, boredom, and neutral).

Panel (b) is an example showing how the parameter values given in panel (a) are used to resolve the user's emotion conflict and how the emotional state estimation module 300 calculates the emotional state estimate.

First, the probability value for each of the previous emotions can be calculated with the following formula:

emotion_x score = (number of occurrences of emotion_x in N) / N

Because the values given in panel (a) are E1: happy = 4, E2: neutral = 1, and N = 5, the scores are

Case 1: happy = 4/5 = 0.8

Case 2: neutral = 1/5 = 0.2

Happy has the higher value, 0.8, so the emotion the user previously had is determined to be happy.
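The history score above is a relative frequency over the previous N samples. A short sketch, assuming the history is available as a list of labels (the function name `history_scores` is illustrative):

```python
from collections import Counter

def history_scores(previous_emotions):
    """Score each emotion as (occurrences in the last N samples) / N."""
    n = len(previous_emotions)
    if n == 0:
        return {}  # no history yet, so no scores
    counts = Counter(previous_emotions)
    return {emotion: count / n for emotion, count in counts.items()}

# The five previous samples from panel (a) of FIG. 5.
previous = ["happy", "happy", "neutral", "happy", "happy"]
second_confidence = history_scores(previous)
```

For this history the result is 0.8 for happy and 0.2 for neutral. Emotions that never occurred are absent from the result, which a caller could treat as a score of 0.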

Next, the third confidence value is derived from the probabilities, stored in advance in a database, that a plurality of emotions coexist in consecutive speech segments.

Referring to the table in panel (b), let case 1 be the case where the current emotion is sad and case 2 the case where it is happy, and assume that the previous emotions are, in order, happy and then neutral.

Case 1 then has the sequence happy - neutral - sad, and the third confidence value mapped to it is 0.2. Case 2 has the sequence happy - neutral - happy, and the third confidence mapped to it is 0.7.

Applying the method of FIG. 4 to these values gives emotional state estimates of 0.121 for sad and 0.691 for happy, and the system concludes that the emotion the user feels is happy.
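The conflict resolution in this example can be reproduced with a short sketch. Two assumptions are made here because they yield the stated results: the weights 0.15, 0.5, and 0.35 from the FIG. 4 example are reused, and an emotion that does not appear in the history (sad) gets a second confidence of 0. The names `resolve`, `first`, `second`, and `third` are illustrative.

```python
# Weights r1, r2, r3, assumed to be the same as in the FIG. 4 example.
WEIGHTS = (0.15, 0.5, 0.35)

first = {"sad": 0.34, "happy": 0.31}     # current estimate from the network
second = {"happy": 0.8, "neutral": 0.2}  # history scores over N = 5 samples
third = {"sad": 0.2, "happy": 0.7}       # sequence probabilities from panel (b)

def resolve(first, second, third, weights=WEIGHTS):
    """Score each candidate emotion and return the winner with all scores."""
    r1, r2, r3 = weights
    scores = {
        emotion: r1 * s1 + r2 * second.get(emotion, 0.0) + r3 * third.get(emotion, 0.0)
        for emotion, s1 in first.items()
    }
    return max(scores, key=scores.get), scores

winner, scores = resolve(first, second, third)
```

Under these assumptions sad scores 0.121 and happy scores 0.6915, which the text reports as 0.691. Although the network alone slightly favors sad (0.34 against 0.31), the history and the sequence probability tip the result to happy.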

FIG. 6 is a flowchart showing a method of speech emotion recognition using a reasoning process, according to an embodiment of the present invention.

Referring to FIG. 6, the method of speech emotion recognition using a reasoning process begins with receiving a voice signal 100 from the user (S610).

The user's voice signal 100 may be input in real time through various voice recording devices.

The voice signal 100 received in step S610 is converted into segment-level voice data (S620).

The voice signal 100 is converted through the FFT into a spectrogram, a two-dimensional graph, for each segment. The spectrogram is then divided along the time axis and rearranged into a three-dimensional structure.
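The rearrangement into a three-dimensional structure is not specified in detail. One plausible reading, sketched below as an assumption, is to cut the 2D spectrogram into equal-width chunks along the time axis and stack them, giving an array of shape (chunks, frequency bins, frames per chunk). The function name `to_3d`, the chunk width, and the axis order are all hypothetical.

```python
import numpy as np

def to_3d(spec, chunk_width):
    """Split a (freq, time) spectrogram along time and stack the chunks.

    Returns an array of shape (n_chunks, n_freq, chunk_width). Trailing
    frames that do not fill a whole chunk are dropped.
    """
    n_freq, n_time = spec.shape
    n_chunks = n_time // chunk_width
    trimmed = spec[:, : n_chunks * chunk_width]
    # (n_freq, n_chunks, chunk_width) -> (n_chunks, n_freq, chunk_width)
    return trimmed.reshape(n_freq, n_chunks, chunk_width).transpose(1, 0, 2)

# Toy spectrogram with 4 frequency bins and 10 time frames.
spec = np.arange(4 * 10, dtype=float).reshape(4, 10)
volume = to_3d(spec, chunk_width=3)
```

In this toy case the result has three chunks of three frames each, and the tenth frame is dropped. Padding the last chunk instead of dropping it would be an equally reasonable choice.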

Next, feature values are extracted from the converted data through the 3D convolutional neural network 200 (S630).

From these feature values, the first confidence, which can be extracted from the current speech, is calculated, together with the second confidence, which is based on emotion values that were extracted and determined previously.

Finally, the emotional state estimate mapped to each interval of the voice data is calculated from the feature values, which completes the procedure (S640).

Using the method presented in FIG. 5, the emotional state estimate for each emotion is calculated from the first, second, and third confidences, and the emotion with the highest value is selected and provided to the user.

FIG. 7 is a diagram showing how the confidence level is rated as high or low by checking the confidence returned by the 3D convolutional neural network 200, according to an embodiment of the present invention.

Taking FIG. 7 as an example, assume preset thresholds for a high level and a medium level. If the confidence obtained at the first condition is higher than the high-level threshold, the confidence level is designated "high confidence", and the result is then checked and handled for conflicting emotions.

If the confidence is lower than the high-level threshold, it is compared with the medium-level threshold. If it is higher than that threshold, it is defined as "medium confidence"; otherwise it is classified as "low confidence" and the value is discarded.

If the confidence score is "medium confidence", it goes through the process presented in FIG. 2.
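The two-threshold check of FIG. 7 can be sketched as below. The threshold values 0.8 and 0.5 are placeholders chosen for illustration, since the text gives no numbers, and the function name `confidence_level` is likewise an assumption.

```python
HIGH_THRESHOLD = 0.8    # hypothetical value, not given in the text
MEDIUM_THRESHOLD = 0.5  # hypothetical value, not given in the text

def confidence_level(score, high=HIGH_THRESHOLD, medium=MEDIUM_THRESHOLD):
    """Rate a confidence returned by the network as high, medium, or low."""
    if score > high:
        return "high"    # passed on to the conflicting-emotion check
    if score > medium:
        return "medium"  # passed on to the process of FIG. 2
    return "low"         # discarded
```

For example, with these placeholder thresholds a score of 0.9 is rated high, 0.6 medium, and 0.3 low.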

이상에서 설명한 본 발명의 실시예에 따른 최적의 학습 모델 선택 방법은, 컴퓨터에 의해 실행되는 프로그램 모듈과 같은 컴퓨터에 의해 실행 가능한 명령어를 포함하는 기록 매체의 형태로도 구현될 수 있다. 이러한 기록 매체는 컴퓨터 판독 가능 매체를 포함하며, 컴퓨터 판독 가능 매체는 컴퓨터에 의해 액세스될 수 있는 임의의 가용 매체일 수 있고, 휘발성 및 비휘발성 매체, 분리형 및 비분리형 매체를 모두 포함한다. 또한, 컴퓨터 판독가능 매체는 컴퓨터 저장 매체를 포함하며, 컴퓨터 저장 매체는 컴퓨터 판독가능 명령어, 데이터 구조, 프로그램 모듈 또는 기타 데이터와 같은 정보의 저장을 위한 임의의 방법 또는 기술로 구현된 휘발성 및 비휘발성, 분리형 및 비분리형 매체를 모두 포함한다. The optimal learning model selection method according to the embodiment of the present invention described above may be implemented in the form of a recording medium including instructions executable by a computer, such as a program module executed by a computer. Such recording media includes computer readable media, and computer readable media can be any available media that can be accessed by a computer and includes both volatile and nonvolatile media, removable and non-removable media. In addition, computer readable media includes computer storage media, which are volatile and nonvolatile implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Both removable and non-removable media.

The foregoing description of the present invention is for illustration, and a person of ordinary skill in the art to which the present invention pertains will understand that it can be easily modified into other specific forms without changing the technical spirit or essential features of the present invention. The embodiments described above should therefore be understood as illustrative in all respects and not restrictive. For example, each component described as a single unit may be implemented in a distributed manner, and components described as distributed may likewise be implemented in a combined form.

In addition, although the methods and systems of the present invention have been described in connection with specific embodiments, some or all of their components or operations may be implemented using a computer system having a general-purpose hardware architecture.

The scope of the present invention is defined by the claims below rather than by the detailed description, and all changes or modifications derived from the meaning and scope of the claims and their equivalents should be construed as falling within the scope of the present invention.

100: voice signal
200: 3D convolutional neural network
300: emotional state estimation module

Claims

A method of estimating a user's emotion based on a voice signal, the method comprising:
converting a user's voice signal into voice data in units of segments; and
inputting the segment-unit voice data into a 3D convolutional neural network to extract features of the voice data for each segment, and calculating, based on the extracted feature values, an emotional state estimate into which each voice data is to be classified,
wherein the calculating of the emotional state estimate calculates the emotional state estimate based on a sum of:
a first reliability indicating a probability of being classified into each of a plurality of preset emotional states based on the feature values extracted through the 3D convolutional neural network,
a second reliability indicating a past classification history, calculated based on the value of the emotion classification state of each voice data relative to the total number of voice data whose emotional state was previously classified, and
a third reliability indicating an emotional state probability that the voice data of the segment to be classified can have, based on a knowledge base indicating a probability that a plurality of emotions can coexist,
wherein the third reliability is a value preset in the knowledge base and indicates a probability that a previously existing first emotional state and a second emotional state that the voice data of the segment to be classified can have may coexist.

The method of claim 1,
wherein the converting into voice data comprises: generating a spectrogram in the form of a two-dimensional graph by applying a fast Fourier transform (FFT) to the voice signal; and
dividing the spectrogram into unit-time segments and arranging the segments into a three-dimensional structure.
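As an illustration of the conversion recited above, the following sketch computes a magnitude spectrogram with a radix-2 FFT and then stacks fixed-length time chunks into a three-dimensional structure indexed as [segment][frame][frequency bin]. The frame length, hop size, Hann window, number of frames per segment, and all function names are assumptions made for the example; the claim does not specify them.

```python
import cmath
import math


def fft(x):
    """Recursive radix-2 Cooley-Tukey FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    even, odd = fft(x[0::2]), fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def spectrogram(signal, frame_len=64, hop=32):
    """2D magnitude spectrogram: rows are time frames, columns are frequency bins."""
    window = [0.5 - 0.5 * math.cos(2 * math.pi * i / (frame_len - 1))
              for i in range(frame_len)]  # Hann window (assumed choice)
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = [s * w for s, w in zip(signal[start:start + frame_len], window)]
        spectrum = fft(frame)
        frames.append([abs(c) for c in spectrum[:frame_len // 2 + 1]])
    return frames


def stack_segments(spec, frames_per_segment):
    """Cut the 2D spectrogram into unit-time chunks and stack them into 3D."""
    n_segments = len(spec) // frames_per_segment  # trailing frames are dropped
    return [spec[i * frames_per_segment:(i + 1) * frames_per_segment]
            for i in range(n_segments)]


# Usage: a synthetic 128 Hz tone sampled at 1024 Hz stands in for a voice signal.
sample_rate = 1024
tone = [math.sin(2 * math.pi * 128 * t / sample_rate) for t in range(sample_rate)]
spec = spectrogram(tone)
volume = stack_segments(spec, frames_per_segment=5)
peak_bin = max(range(len(spec[0])), key=lambda b: spec[0][b])
```

Each element of `volume` is one unit-time slice of the spectrogram, so the stacked result has the shape a 3D convolutional network would consume.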

The method of claim 1,
further comprising discarding the corresponding voice data when the first reliability is less than or equal to a threshold.

The method of claim 1,
wherein the emotional state estimate is the sum of the first reliability multiplied by a first weight, the second reliability multiplied by a second weight, and the third reliability multiplied by a third weight.
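The weighted sum of the three reliabilities can be sketched as below. This is a hedged example under stated assumptions: the weight values, the dictionary-based knowledge base keyed by (previous emotion, candidate emotion), the helper names, and the final step of picking the highest-scoring emotion are all hypothetical and are not taken from the claims.

```python
from collections import Counter

EMOTIONS = ["happiness", "sad", "anger", "fear", "disgust", "boredom", "neutral"]


def history_reliability(past_labels):
    """Second reliability: share of previously classified segments per emotion."""
    counts = Counter(past_labels)
    total = len(past_labels)
    return {e: (counts[e] / total if total else 0.0) for e in EMOTIONS}


def coexistence_reliability(previous_emotion, knowledge_base):
    """Third reliability: preset probability of coexisting with the previous emotion."""
    return {e: knowledge_base.get((previous_emotion, e), 0.0) for e in EMOTIONS}


def estimate_emotion(first, second, third, weights=(0.6, 0.2, 0.2)):
    """Weighted sum of the three reliabilities; weights are assumed values."""
    w1, w2, w3 = weights
    scores = {e: w1 * first.get(e, 0.0) + w2 * second.get(e, 0.0) + w3 * third.get(e, 0.0)
              for e in EMOTIONS}
    # Choosing the arg-max is an assumption about how the estimate is used.
    return max(scores, key=scores.get), scores


# Usage with made-up numbers.
first = {"sad": 0.45, "happiness": 0.40, "neutral": 0.15}   # network output
second = history_reliability(["happiness", "happiness", "neutral", "happiness"])
kb = {("happiness", "happiness"): 0.9,                      # hypothetical knowledge base
      ("happiness", "sad"): 0.1,
      ("happiness", "neutral"): 0.5}
third = coexistence_reliability("happiness", kb)
best, scores = estimate_emotion(first, second, third)
```

In this example the network alone slightly favors "sad", but the classification history and the knowledge base shift the combined score toward "happiness".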

delete

The method of claim 1,
wherein the third reliability is a value preset in the knowledge base and indicates a probability that a previously existing first emotional state, a previously existing second emotional state, and a third emotional state that the voice data of the segment to be classified can have may coexist.

The method of claim 1,
wherein each voice data is classified into one of seven emotional states: happiness, sadness, anger, fear, disgust, boredom, and neutral.

An apparatus for estimating a user's emotion based on a voice signal, the apparatus comprising:
a memory storing a program for extracting voice data from a user's voice signal and calculating an emotional state estimate based on the voice data; and
a processor configured to estimate the user's emotion based on the voice signal by executing the program stored in the memory,
wherein the processor converts the user's voice signal into voice data in units of segments, inputs the segment-unit voice data into a 3D convolutional neural network to extract features of the voice data for each segment, and calculates, based on the extracted feature values, an emotional state estimate into which each voice data is to be classified,
wherein the emotional state estimate is calculated based on a sum of a first reliability indicating a probability of being classified into each of a plurality of preset emotional states based on the feature values extracted through the 3D convolutional neural network, a second reliability indicating a past classification history, calculated based on the value of the emotion classification state of each voice data relative to the total number of voice data whose emotional state was previously classified, and a third reliability indicating an emotional state probability that the voice data of the segment to be classified can have, based on a knowledge base indicating a probability that a plurality of emotions can coexist,
wherein the third reliability is a value preset in the knowledge base and indicates a probability that a previously existing first emotional state and a second emotional state that the voice data of the segment to be classified can have may coexist.

The apparatus of claim 8,
wherein the converting into voice data comprises: generating a spectrogram in the form of a two-dimensional graph by applying a fast Fourier transform (FFT) to the voice signal; and
dividing the spectrogram into unit-time segments and arranging the segments into a three-dimensional structure.

The apparatus of claim 8,
wherein the corresponding voice data is discarded when the first reliability is less than or equal to a threshold.

The apparatus of claim 8,
wherein the emotional state estimate is the sum of the first reliability multiplied by a first weight, the second reliability multiplied by a second weight, and the third reliability multiplied by a third weight.

The apparatus of claim 8,
wherein the third reliability is a value preset in the knowledge base and indicates a probability that a previously existing first emotional state and a second emotional state that the voice data of the segment to be classified can have may coexist.

delete

The apparatus of claim 8,
wherein each voice data is classified into one of seven emotional states: happiness, sadness, anger, fear, disgust, boredom, and neutral.

A computer-readable recording medium having recorded thereon a program for performing a method of estimating a user's emotion based on a voice signal according to any one of claims 1 to 4, 6 and 7.