KR101006049B1

KR101006049B1 - Apparatus and method for recognizing emotion

Info

Publication number: KR101006049B1
Application number: KR1020080101643A
Authority: KR
Inventors: 강정환
Original assignee: 강정환
Priority date: 2008-10-16
Filing date: 2008-10-16
Publication date: 2011-01-06
Also published as: KR20100042482A

Abstract

Disclosed are an apparatus and a method for analyzing an input voice signal and recognizing the emotion contained in that signal. An emotion recognition apparatus according to an embodiment may include: an input unit that receives a target voice signal; a conversion unit that converts the received target voice signal into an input matrix using a spectrogram; an NMF performing unit that calculates a feature vector by performing non-negative matrix factorization (NMF) on the input matrix; an emotion model database that stores a model vector for each emotion model; a comparison unit that compares the feature vector with the model vectors stored in the emotion model database; and a determination unit that determines the emotional state of the target voice signal to be the emotion model corresponding to the model vector most similar to the feature vector. The apparatus and method determine an emotion model vector from the spectrogram of each emotion model using non-negative matrix factorization, and use it to recognize the emotion contained in a speaker's voice.

Emotion, Recognition, Speech, Analysis, Non-negative Matrix Factorization

Description

Apparatus and method for recognizing emotion

The present invention relates to an emotion recognition apparatus, and more particularly to an apparatus and a method for analyzing an input voice signal and recognizing the emotion contained in that signal.

Voice is the most natural human means of communication and of conveying information. Speech information technology (SIT), which processes and quantifies human speech so that it can be used effectively, has made remarkable progress and is steadily being applied in everyday life.

Speech information technology is classified into speech recognition, speech synthesis, speaker identification and verification, speech coding, and the like.

In connection with speech information technology, one may consider emotion recognition technology, which estimates and recognizes a speaker's emotional state by analyzing the voice signal. Emotion recognition technology aims to have a machine recognize, in numerical terms, the emotions contained in the language and speech that people use in daily life.

A representative example of emotion recognition based on voice signal analysis is the lie detector. A lie detector is a type of polygraph, that is, a system that detects a person's state of excitement, tension, or emotional conflict according to predefined criteria. Typically, when a person lies, mental tension reduces the blood volume in the vocal cords, and involuntary nervous activity causes the vocal cords to produce distorted sound waves; the lie detector senses this to determine whether the speaker is lying. However, such a lie detector cannot analyze and distinguish, from the voice signal, the various emotions contained in a person's voice in everyday life.

There is also a need for a system that can provide various services to a user by performing predetermined signal processing based on emotion recognition from voice signal analysis.

Accordingly, the present invention provides an emotion recognition apparatus and method that determine an emotion model vector from the spectrogram of each emotion model using non-negative matrix factorization, and use it to recognize the emotion contained in a speaker's voice.

The present invention also provides an emotion recognition apparatus and method that can offer various customer service features, such as switching the response method during a telephone call or connecting the caller to a top-performing agent, based on the emotion recognized from the analyzed voice.

According to one aspect of the present invention, there is provided an emotion recognition apparatus that determines an emotional state by analyzing a voice signal.

An emotion recognition apparatus according to an embodiment may include: an input unit that receives a target voice signal; a conversion unit that converts the received target voice signal into an input matrix using a spectrogram; an NMF performing unit that calculates a feature vector by performing non-negative matrix factorization (NMF) on the input matrix; an emotion model database that stores a model vector for each emotion model; a comparison unit that compares the feature vector with the model vectors stored in the emotion model database; and a determination unit that determines the emotional state of the target voice signal to be the emotion model corresponding to the model vector most similar to the feature vector.

The comparison unit may perform the comparison using the square of the Euclidean distance between the feature vector and the model vector.

The conversion unit may set the frequency axis and the time axis of the spectrogram of the target voice signal as the rows and columns of the input matrix, and may use the amplitude value of the target voice signal at each frequency and time as the corresponding element. Here, each element of the input matrix may be non-negative.

The NMF performing unit may select, from the results of the non-negative matrix factorization of the input matrix, the basis matrix representing the frequency characteristics of the target voice signal as the feature vector.

The input unit may receive one or more model voice signals; the conversion unit may generate a basic input matrix from the spectrogram of each model voice signal and, for each emotion model, generate a composite input matrix by concatenating the basic input matrices horizontally; and the NMF performing unit may calculate the model vector by performing non-negative matrix factorization on the composite input matrix and store it in the emotion model database. Here, the model voice signals may include a target voice signal whose emotional state has been determined by the determination unit.
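The horizontal concatenation described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the helper name `hstack` and the toy matrices are assumptions, and matrices are plain lists of rows with frequency along the rows and time along the columns.

```python
def hstack(matrices):
    """Concatenate matrices (lists of rows) side by side; all must share the row count."""
    n_rows = len(matrices[0])
    if any(len(m) != n_rows for m in matrices):
        raise ValueError("all basic input matrices need the same number of frequency rows")
    # Row i of the result is row i of every matrix, joined left to right.
    return [[value for m in matrices for value in m[i]] for i in range(n_rows)]

# Hypothetical toy data: 3 frequency rows, two "angry" recordings with 2 and 3 time frames.
angry_a = [[1.0, 2.0], [0.0, 1.0], [3.0, 0.5]]
angry_b = [[0.5, 0.5, 1.0], [2.0, 0.0, 1.0], [1.0, 1.0, 1.0]]

composite_angry = hstack([angry_a, angry_b])
print(len(composite_angry), "x", len(composite_angry[0]))  # 3 x 5
```

Because only the number of columns grows, recordings of different lengths can be combined as long as they use the same frequency resolution.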

According to another aspect of the present invention, there are provided an emotion recognition method that determines an emotional state by analyzing a voice signal, and a recording medium on which a program for performing the method is recorded.

An emotion recognition method according to an embodiment may include: (a) receiving a target voice signal; (b) converting the received target voice signal into an input matrix using a spectrogram; (c) calculating a feature vector by performing non-negative matrix factorization (NMF) on the input matrix; (d) comparing the feature vector with the model vectors stored in an emotion model database; and (e) determining the emotional state of the target voice signal to be the emotion model corresponding to the model vector most similar to the feature vector.

In step (d), the comparison may be performed using the square of the Euclidean distance between the feature vector and the model vector.

In step (b), the frequency axis and the time axis of the spectrogram of the target voice signal may be set as the rows and columns of the input matrix, and the amplitude value of the target voice signal at each frequency and time may be used as the corresponding element. Here, each element of the input matrix may be non-negative.

In step (c), the basis matrix representing the frequency characteristics of the target voice signal may be selected as the feature vector from the results of the non-negative matrix factorization of the input matrix.

The following steps may be performed before step (a): receiving one or more model voice signals; generating a basic input matrix from the spectrogram of each model voice signal; generating, for each emotion model, a composite input matrix by concatenating the basic input matrices horizontally; calculating the model vector by performing non-negative matrix factorization on the composite input matrix; and storing the model vector in the emotion model database.

Aspects, features, and advantages other than those described above will become apparent from the following drawings, claims, and detailed description of the invention.

The emotion recognition apparatus and method according to the present invention determine an emotion model vector from the spectrogram of each emotion model using non-negative matrix factorization, and use it to recognize the emotion contained in a speaker's voice.

In addition, based on the emotion recognized from the analyzed voice, various customer service features can be provided, such as switching the response method during a telephone call or connecting the caller to a top-performing agent.

The present invention may be modified in various ways and may have various embodiments; specific embodiments are illustrated in the drawings and described in detail in the detailed description. This is not intended to limit the present invention to specific embodiments, and the invention should be understood to include all modifications, equivalents, and substitutes falling within its spirit and technical scope. In describing the present invention, detailed descriptions of related known techniques are omitted where they would obscure the gist of the invention.

Terms such as first and second may be used to describe various components, but the components should not be limited by these terms. The terms are used only to distinguish one component from another.

The terms used in this application are used only to describe specific embodiments and are not intended to limit the present invention. A singular expression includes the plural unless the context clearly indicates otherwise. In this application, terms such as "comprise" or "have" are intended to indicate the presence of the features, numbers, steps, operations, components, parts, or combinations thereof described in the specification, and should be understood not to exclude in advance the presence or possible addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

Hereinafter, embodiments of the present invention will be described in detail with reference to the accompanying drawings.

In the present invention, a target voice signal is a voice signal subject to emotion determination; for example, it may be the voice of a customer during a call to a call center. A model voice signal is a voice signal that serves as the reference for emotion determination; it carries the characteristics of a specific emotion model fully enough to be distinguished from other emotion models. A model voice signal may be a voice signal performed by a trained voice actor or the like.

FIG. 1 is a block diagram of an emotion recognition apparatus according to an embodiment of the present invention.

The emotion recognition apparatus 100 analyzes an input target voice signal, recognizes which emotion the signal contains, and displays or reports it. The present invention uses non-negative matrix factorization for emotion recognition, as described in detail later.

In recognizing the emotion contained in the target voice signal, the emotion recognition apparatus 100 assigns it to one of the following emotion models. An emotion model denotes an emotional state that can be determined from a voice signal, and may be classified as neutral, happy, angry, sad, and so on. Depending on the embodiment, further emotion models such as bored may be added.

The emotion recognition apparatus 100 is connected to a wired or wireless communication network (hereinafter, the "communication network") and receives a voice signal from a user terminal connected to the communication network. Alternatively, the emotion recognition apparatus 100 may receive a voice signal through a separately provided input interface.

That is, the voice signal may be transmitted from the user terminal to the emotion recognition apparatus 100 over the communication network, either in real time or as a previously stored file; it may be input directly through a voice input device, such as a microphone, provided in the emotion recognition apparatus 100; or a voice signal previously stored as a file may be input through a data input unit provided in the emotion recognition apparatus 100. The ways in which a voice signal is input to the emotion recognition apparatus 100 may otherwise vary within the scope of the present invention.

The emotion recognition apparatus 100 includes an input unit 110, a conversion unit 120, an NMF performing unit 130, a comparison unit 140, a determination unit 150, and an emotion model database 160.

The input unit 110 receives a voice signal over the communication network or through an input interface separately provided in the emotion recognition apparatus 100. The voice signal may be a target voice signal and/or a model voice signal, and may be input in real time or as a previously stored file.

Because the voice signal input to the input unit 110 may contain background noise and the like, the emotion recognition apparatus 100 may further include a background noise filtering unit (not shown) that filters out the background noise, enabling more accurate and reliable emotion recognition.

The conversion unit 120 converts the voice signal input to the input unit 110 into a matrix based on its spectrogram. A spectrogram visualizes a sound or wave and combines the characteristics of a waveform and a spectrum. A waveform shows only how amplitude changes along the time axis, and a spectrum shows only how amplitude changes along the frequency axis, whereas a spectrogram shows how amplitude changes along both the time axis and the frequency axis.

Here, rows are divided according to the frequency units of the frequency axis, and columns according to the time units of the time axis. The amplitude at a given frequency and time then becomes the element in the row corresponding to that frequency and the column corresponding to that time, which converts the voice signal into a matrix. For example, when the voice signal is divided into 512 frequencies and 200 time units, the conversion unit 120 may convert it into a 512 x 200 matrix.
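As a rough illustration of this conversion, the sketch below builds a magnitude spectrogram matrix with one row per frequency bin and one column per time frame. It is only a sketch under stated assumptions: the function name `magnitude_spectrogram`, the frame length, the hop size, and the test tone are all hypothetical, the DFT is computed naively with the standard library, and no window function is applied, which a practical system would normally add.

```python
import cmath
import math

def magnitude_spectrogram(samples, frame_len, hop):
    """Return a matrix V with one row per frequency bin and one column per frame."""
    n_bins = frame_len // 2 + 1
    columns = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = samples[start:start + frame_len]
        column = []
        for k in range(n_bins):
            # Naive DFT of one frame at bin k; abs() yields a non-negative amplitude.
            coeff = sum(x * cmath.exp(-2j * math.pi * k * n / frame_len)
                        for n, x in enumerate(frame))
            column.append(abs(coeff))
        columns.append(column)
    # Transpose so that rows index frequency and columns index time.
    return [[col[k] for col in columns] for k in range(n_bins)]

# Hypothetical test tone: a 1 kHz sine sampled at 8 kHz.
rate = 8000
signal = [math.sin(2 * math.pi * 1000 * t / rate) for t in range(256)]
V = magnitude_spectrogram(signal, frame_len=64, hop=32)
print(len(V), "x", len(V[0]))  # rows = frequency bins, columns = time frames
```

Every element is an absolute value, so the matrix is non-negative by construction, which is the property the later factorization relies on.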

Each element of the matrix is the amplitude of the voice signal at a given frequency and time, and is non-negative.

The NMF performing unit 130 calculates a basis vector by performing non-negative matrix factorization (NMF) on the matrix produced by the conversion unit 120. Because each element of that matrix is non-negative, as described above, non-negative matrix factorization can be performed.

When the voice signal is a model voice signal, the calculated basis vector is a model vector; when the voice signal is a target voice signal, the calculated basis vector is a feature vector. Model vectors are distinguished from one another according to the emotion model to which the model voice signal belongs.

The model vector, that is, the basis vector calculated when the NMF performing unit 130 performs non-negative matrix factorization on a model voice signal, is stored in the emotion model database 160.

The comparison unit 140 compares the feature vector, that is, the basis vector calculated when the NMF performing unit 130 performs non-negative matrix factorization on the target voice signal, with the model vector of each emotion model stored in the emotion model database 160.

Because the emotion model database 160 stores model vectors whose characteristics differ by emotion model, the comparison unit 140 compares the similarity between the feature vector and one or more model vectors. According to an embodiment, the Euclidean distance may be used to compare similarity.

The determination unit 150 determines which emotion the target voice signal contains, using the result of comparing the one or more model vectors stored in the emotion model database 160 with the feature vector of the target voice signal calculated by the NMF performing unit 130. That is, it selects the model vector most similar to the feature vector of the target voice signal and determines that the emotion model to which that model vector belongs is the emotion contained in the target voice signal.
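The comparison and determination steps amount to a nearest-neighbor lookup. A minimal sketch follows, assuming the emotion model database can be represented as a dictionary from emotion name to model vector; the function names and the four-element vectors are made up for illustration and are far shorter than a real feature vector would be.

```python
def squared_distance(a, b):
    """Square of the Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def classify(feature, model_vectors):
    """Return the emotion whose model vector is closest to the feature vector."""
    scores = {emotion: squared_distance(feature, vec)
              for emotion, vec in model_vectors.items()}
    best = min(scores, key=scores.get)  # smallest distance = most similar
    return best, scores

# Hypothetical model vectors standing in for the emotion model database.
model_vectors = {
    "neutral": [0.9, 0.4, 0.1, 0.0],
    "angry": [0.5, 0.5, 0.5, 0.4],
    "sad": [0.8, 0.2, 0.0, 0.0],
}
feature = [0.6, 0.5, 0.4, 0.3]

emotion, scores = classify(feature, model_vectors)
print(emotion)  # angry
```

Using the squared distance instead of the distance itself avoids a square root and does not change which model vector is closest.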

The emotion selected by the determination unit 150 may be displayed in various ways. The similarity of the target voice signal to each emotion model may be shown as a bar or a number, the selected emotion model may be shown as a color (for example, blue for neutral, red for angry, green for happy), or it may be shown as a picture, such as a character whose facial expression matches the emotion model.

Alternatively, various services may be provided based on the emotion selected by the determination unit 150. As an example, assume that the emotion recognition apparatus 100 is applied to a call center system.

A call center system connects one of many external customers with one of its internal agents. The customer's voice can be analyzed during or before the consultation to recognize the customer's current emotional state and provide a suitable service. For example, when the customer is angry, enraged, or agitated, the automatic response process may be skipped so that the customer is connected directly to an agent, or the customer may be connected to a senior agent with extensive customer service experience.
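One way such a routing policy could look is sketched below. The emotion labels, the target names, and the `senior_available` flag are all hypothetical; the text only says that the automatic response may be skipped or that a senior agent may be chosen for an upset caller.

```python
# Emotions that trigger escalation in this illustrative policy (an assumption).
ESCALATE_EMOTIONS = {"angry"}

def route_call(emotion, senior_available=True):
    """Return a hypothetical routing target for a caller with the recognized emotion."""
    if emotion in ESCALATE_EMOTIONS:
        # Skip the automated response; prefer a senior agent when one is free.
        return "senior_agent" if senior_available else "agent"
    return "automated_response"

print(route_call("angry"))
print(route_call("neutral"))
```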

The emotion recognition apparatus 100 according to another embodiment may further include an update unit. After the emotional state contained in a target voice signal has been determined, the update unit treats that target voice signal as an additional model voice signal belonging to the determined emotional state and updates the model vector stored in the emotion model database 160.

The emotion recognition method performed by the emotion recognition apparatus 100 is described in detail below with reference to FIG. 2 and the subsequent drawings.

FIG. 2 is a flowchart of an emotion recognition method according to an embodiment of the present invention, and FIG. 3 illustrates an example of a matrix into which a voice signal has been converted according to an embodiment of the present invention.

The input unit 110 receives a target voice signal (step S210). The signal may be received over the communication network or through an input interface separately provided in the emotion recognition apparatus 100. The target voice signal may be transmitted or input in real time, or as a previously stored file.

The conversion unit 120 converts the received target voice signal into an input matrix using its spectrogram (step S220).

Human emotions produce different frequency spectra depending on how the vocal cords vocalize. For example, a voice with neutral emotion has strong frequency components in the 0 to 4000 Hz band, whereas a voice with angry or happy emotion has components distributed evenly over the entire 0 to 8000 Hz band. The shape of the spectrum may also show clearly distinguishable peaks and valleys depending on the emotion model. Using a spectrogram of the voice signal therefore makes it possible to exploit characteristics that differ by emotion model.
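To make the band observation concrete, the sketch below measures what fraction of a spectrum's energy lies below 4000 Hz. This is not a step of the described method, only an illustration; the function name, the bin spacing, and both spectra are invented for the example.

```python
def low_band_fraction(amplitudes, bin_hz, cutoff_hz=4000.0):
    """Fraction of spectral energy in bins whose frequency is below cutoff_hz."""
    total = sum(a * a for a in amplitudes)
    if total == 0:
        return 0.0
    low = sum(a * a for k, a in enumerate(amplitudes) if k * bin_hz < cutoff_hz)
    return low / total

# Hypothetical 8-bin spectra covering 0 to 8000 Hz in 1000 Hz steps.
neutral_like = [4.0, 3.0, 2.0, 1.0, 0.2, 0.1, 0.1, 0.1]  # energy concentrated low
angry_like = [2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0]    # energy spread evenly

print(round(low_band_fraction(neutral_like, 1000.0), 3))
print(round(low_band_fraction(angry_like, 1000.0), 3))
```

A spectrum concentrated below the cutoff gives a value near 1, while an evenly spread spectrum gives a value near the share of bins below the cutoff.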

A spectrogram shows how amplitude changes along the time axis and the frequency axis. FIG. 3 shows an example of a target voice signal converted into an input matrix. Using the spectrogram of the target voice signal, the amplitude value at each time and frequency is assigned to the corresponding element of the input matrix. Each element of the input matrix is an amplitude value of the target voice signal at a given frequency and time, and is therefore non-negative.

The rows of the input matrix are divided according to the frequency units of the frequency axis, and the columns according to the time units of the time axis. For example, if the frequency axis is divided into 512 frequencies (f1 to f512) and the time axis into 200 times (t1 to t200), the signal is converted into a 512 x 200 input matrix. The amplitude value at frequency f3 and time t2 becomes V(3, 2), the element in row 3, column 2 of the input matrix V.

The NMF performing unit 130 calculates a feature vector by performing non-negative matrix factorization on the input matrix (step S230).

Non-negative matrix factorization (NMF) has properties that make it useful for decomposing multivariate data. It decomposes the input matrix into two matrices by applying an update rule.

In Equation 1, which factorizes the input matrix as V ≈ WH, V is the input matrix, W is the basis matrix, and H is the encoding matrix. Each element of the input matrix V must be non-negative.

The basis matrix W holds the feature information of the input matrix, and the encoding matrix H holds the encoding information. The basis matrix W and the encoding matrix H are updated by applying the multiplicative update rule of Equation 2 below.
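The update equation itself is not reproduced in this text. The sketch below therefore assumes the widely used multiplicative updates of Lee and Seung for the squared Euclidean cost, which fit the description of a rule under which the distance between V and WH does not increase. The function names, the small constant `eps` that guards against division by zero, the random initialization, and the toy matrix are all assumptions made for illustration.

```python
import random

def matmul(a, b):
    """Product of two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def transpose(a):
    return [list(row) for row in zip(*a)]

def squared_error(v, w, h):
    """Square of the Euclidean distance between V and WH."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0])))

def nmf(v, r, iterations=200, eps=1e-9, seed=0):
    """Factorize non-negative V (n x m) into W (n x r) and H (r x m)."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(r)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(r)]
    errors = [squared_error(v, w, h)]
    for _ in range(iterations):
        # Assumed rule: H <- H * (W^T V) / (W^T W H), element-wise.
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[a][j] * num[a][j] / (den[a][j] + eps) for j in range(m)]
             for a in range(r)]
        # Assumed rule: W <- W * (V H^T) / (W H H^T), element-wise.
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][a] * num[i][a] / (den[i][a] + eps) for a in range(r)]
             for i in range(n)]
        errors.append(squared_error(v, w, h))
    return w, h, errors

# Hypothetical rank-1 input: 3 frequency rows, 4 time columns.
a, b = [1.0, 2.0, 3.0], [1.0, 0.5, 2.0, 1.0]
V = [[x * y for y in b] for x in a]
W, H, errors = nmf(V, r=1)
print(errors[0] > errors[-1])
```

Because the updates only multiply and divide non-negative quantities, W and H stay non-negative throughout, and the recorded `errors` list makes it easy to check that the cost does not grow from one iteration to the next.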

Under the update rule of Equation 2, the Euclidean distance ‖V − WH‖ between V and WH does not increase. The updates are applied repeatedly, and the error is calculated using the cost function of Equation 3 below.
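The repeated update and error calculation can be illustrated with a minimal sketch. It assumes the standard multiplicative updates for the Euclidean cost, uses plain Python lists to stay dependency-free, and introduces hypothetical names (`nmf`, `squared_error`, `matmul`, `transpose`) and a small hypothetical matrix; a practical implementation would more likely use an array library.

```python
import random

def matmul(A, B):
    """Matrix product of two list-of-rows matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def squared_error(V, W, H):
    """Squared Euclidean distance between V and the product WH."""
    WH = matmul(W, H)
    return sum((v - p) ** 2 for rv, rp in zip(V, WH) for v, p in zip(rv, rp))

def nmf(V, r=1, iterations=100, eps=1e-12, seed=0):
    """Factor V (n x m) into W (n x r) and H (r x m) with multiplicative updates."""
    rng = random.Random(seed)
    n, m = len(V), len(V[0])
    W = [[rng.random() + 0.1 for _ in range(r)] for _ in range(n)]
    H = [[rng.random() + 0.1 for _ in range(m)] for _ in range(r)]
    errors = [squared_error(V, W, H)]
    for _ in range(iterations):
        Wt = transpose(W)
        num, den = matmul(Wt, V), matmul(matmul(Wt, W), H)
        H = [[H[a][j] * num[a][j] / (den[a][j] + eps) for j in range(m)]
             for a in range(r)]
        Ht = transpose(H)
        num, den = matmul(V, Ht), matmul(W, matmul(H, Ht))
        W = [[W[i][a] * num[i][a] / (den[i][a] + eps) for a in range(r)]
             for i in range(n)]
        errors.append(squared_error(V, W, H))  # error after this update
    return W, H, errors

# Hypothetical non-negative input with 4 "frequency" rows and 3 "time" columns.
V = [[1.0, 2.0, 3.0],
     [2.0, 4.0, 6.0],
     [0.5, 1.0, 1.5],
     [3.0, 6.0, 9.0]]
W, H, errors = nmf(V, r=1)
```

The `errors` list records the cost after each update, which makes it easy to check that the distance does not grow from one iteration to the next.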

Here, A is the input matrix V, and B is the product of the basis matrix W and the encoding matrix H.

The cost function of Equation 3 uses the square of the Euclidean distance between A and B, ‖A − B‖². The basis matrix W and the encoding matrix H for the input matrix V are determined when this value is at its minimum.
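The image for Equation 3 is not reproduced here. Since the text describes it as the squared Euclidean distance between A and B, it can be written out as follows, with the element-wise expansion being the standard definition:

$$\lVert A - B \rVert^{2} = \sum_{i,j} \left(A_{ij} - B_{ij}\right)^{2} \qquad (3)$$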

Alternatively, a cost function such as Equation 4 below may be used.

The basis matrix W carries the basis information of the input matrix V. If the nature of the input matrix V differs, the nature of the basis matrix W differs accordingly. Because the input matrix V in the present invention is derived from a spectrogram, the basis matrix W carries information related to the frequency content of the voice signal, and the encoding matrix H carries the encoding information of the voice signal over time. The basis matrix W, which represents the frequency features of the input matrix V, therefore serves as the feature vector.

Assume that the input matrix V is an n x m matrix. V can then be regarded as a set of m data vectors, each of dimension n. For example, the input matrix V shown in FIG. 3 can be regarded as 200 data vectors over time, each 512-dimensional along the frequency axis.

By the rules of matrix multiplication, the basis matrix W has dimension n x r and the encoding matrix H has dimension r x m. Because the features of the input matrix V lie along n rather than m, the basis matrix W represents the features of V. Here, r is a natural number that is not fixed in advance and can be set by the user. The description below assumes r = 1, but the present invention is obviously not limited to this.
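As a quick check of the dimensions for the 512 x 200 example with r = 1, the following sketch builds placeholder factors and confirms the shape of their product. The all-ones values are hypothetical placeholders; only the shapes matter here.

```python
n, m, r = 512, 200, 1

# Placeholder factors: only their shapes are of interest.
W = [[1.0] * r for _ in range(n)]   # basis matrix, n x r
H = [[1.0] * m for _ in range(r)]   # encoding matrix, r x m

# Product WH has the same shape as the input matrix V (n x m).
WH = [[sum(W[i][a] * H[a][j] for a in range(r)) for j in range(m)]
      for i in range(n)]

shape_W = (len(W), len(W[0]))
shape_H = (len(H), len(H[0]))
shape_WH = (len(WH), len(WH[0]))

entries_V = n * m                 # 102400 values in the input matrix
entries_factors = n * r + r * m   # 712 values in W and H together
```

With r = 1 the basis matrix is a single 512-element column, which is why the text can treat it as one feature vector.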

In this way, the basis vector W representing the features of the input matrix V is calculated and taken as the feature vector.

The comparison unit 140 compares the feature vector obtained from the non-negative matrix factorization with the model vectors W_E stored in the emotion model database 160 (step S240). A model vector W_E is the result of performing the same non-negative matrix factorization on model voice signals, and the model vectors are stored in advance in the emotion model database 160, organized by emotion model. This is described in detail later with reference to FIG. 4 and the subsequent figures.

The feature vector and a model vector are compared using the square of the Euclidean distance. The distance measures the difference between a stored model vector and the new feature vector so that the most similar model vector can be selected.

When only the Euclidean distance is used, only the difference itself is obtained. When the square is additionally used, large differences become larger and small differences become smaller, so distinctive portions are emphasized while non-distinctive portions are not, which can yield a better recognition rate.

In this embodiment, the Euclidean distance is used to find the similarity between the feature vector and the model vectors. This is only one embodiment, however, and it should be understood that various other methods of finding the similarity between vectors can be applied to the present invention.

The determination unit 150 selects the model vector most similar to the feature vector based on the comparison between the model vectors and the feature vector (step S250). It then determines that the emotion model represented by the selected model vector is the emotional state contained in the target voice signal from which the feature vector was derived (step S260).
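Steps S240 to S260 amount to a nearest-vector search. The sketch below shows one way this could look; the function names, the emotion labels used as dictionary keys, and all vector values are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def squared_euclidean(u, v):
    """Square of the Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def recognize(feature, model_vectors):
    """Return the label whose model vector is closest to the feature vector."""
    return min(model_vectors,
               key=lambda label: squared_euclidean(feature, model_vectors[label]))

# Hypothetical 4-element model vectors, one per emotion model.
model_vectors = {
    "neutral": [0.6, 0.3, 0.1, 0.0],
    "angry":   [0.25, 0.25, 0.25, 0.25],
    "sad":     [0.4, 0.1, 0.4, 0.1],
}

# Hypothetical feature vector of a target voice signal.
feature = [0.3, 0.2, 0.3, 0.2]
emotion = recognize(feature, model_vectors)
```

Here the squared distances are about 0.18 (neutral), 0.01 (angry), and 0.04 (sad), so the sketch selects the "angry" model.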

The emotion recognition apparatus 100 thus recognizes which emotion is contained in the target voice signal it receives, and a predetermined task can be performed according to the result. In one embodiment, the determined emotion may be displayed as a bar, a numerical value, a color, a character, or the like so that the user can check it. In another embodiment, when the emotion recognition apparatus 100 is applied to a call center system, agent connection can be handled flexibly according to the customer's emotional state as determined from the customer's voice.

The emotion recognition method using a target voice signal has been described above. The method of generating model vectors using model voice signals is described below.

FIG. 4 is a flowchart of a model vector generation method according to an embodiment of the present invention, FIG. 5 is a diagram showing examples of model voice signals for various emotion models, and FIG. 6 is a diagram showing an example of an input matrix generated from model voice signals.

When model vectors are generated, the comparison unit 140 and the determination unit 150 among the components of the emotion recognition apparatus 100 may remain inactive.

The input unit 110 receives one or more model voice signals (step S410). The signals may be received over a communication network or through an input interface separately provided in the emotion recognition apparatus 100. A model voice signal may be transmitted or input in real time, or in the form of a previously stored file.

A model voice signal is a reference signal for determining the emotional state contained in a target voice signal that is input later. The more model voice signals there are, the more accurate the determination can be. It is therefore preferable to receive a large number of model voice signals when generating model vectors.

The input model voice signals are classified by emotion model (step S420). Because the emotional state contained in each model voice signal has already been determined, the model voice signals are grouped according to the emotion model to which they belong.
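The grouping in step S420 can be pictured as collecting labeled signals into per-emotion lists. In this sketch the `(label, signal)` pair format, the function name, and the sample values are assumptions made for illustration.

```python
from collections import defaultdict

def group_by_emotion(labeled_signals):
    """Group (label, signal) pairs into a dict of label -> list of signals."""
    groups = defaultdict(list)
    for label, signal in labeled_signals:
        groups[label].append(signal)
    return dict(groups)

# Hypothetical model voice signals, each already labeled with its emotion.
labeled = [
    ("neutral", [0.0, 0.1, 0.0]),
    ("angry",   [0.9, -0.8, 0.7]),
    ("neutral", [0.05, 0.0, -0.05]),
]
groups = group_by_emotion(labeled)
```

Each resulting group is then processed on its own in the later steps.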

A target voice signal whose emotion has already been recognized by the method shown in FIG. 2 may also serve as a model voice signal representing a specific emotional state.

This embodiment has been described on the assumption that a number of model voice signals belonging to several emotion models are input. Alternatively, only model voice signals belonging to one specific emotion model may be input, in which case the classification of step S420 can be omitted.

After the model voice signals have been classified by emotion model, steps S430 to S450 are performed separately for each emotion model.

Each model voice signal is converted into a basic input matrix using a spectrogram (step S430). The conversion is the same as step S220 of the emotion recognition method described above, so a detailed description is omitted.

One basic input matrix is generated per model voice signal, so the number of basic input matrices equals the number of input model voice signals.

If only one model voice signal is input, there is only one basic input matrix, so step S440 can be omitted and the basic input matrix itself becomes the composite input matrix.

If two or more model voice signals are input, there are two or more basic input matrices, so step S440 is performed: a composite input matrix is generated by joining the basic input matrices in the horizontal direction (step S440). Joining in the horizontal direction means that the number of rows is kept unchanged while the columns of the basic input matrices are appended one after another.

Referring to FIG. 5, the composite input matrices for the individual emotion models are shown: a first composite input matrix 500N representing neutral, a second composite input matrix 500A representing anger, a third composite input matrix 500H representing happiness, and a fourth composite input matrix 500S representing sadness.

Taking the first composite input matrix 500N as an example, it consists of n basic input matrices 510-1, 510-2, 510-3, ..., 510-n. The first composite input matrix 500N is generated by keeping the rows of these n basic input matrices and appending their columns. This is because, in the present invention, frequency is the basis for identifying the important features of a voice signal, so the rows, which represent frequency, carry the important features of the input matrix.

Assuming that each basic input matrix is a 512 x 200 matrix, the (1, 1) element of the first basic input matrix 510-1 becomes the (1, 1) element of the first composite input matrix 500N, that is, N1(1, 1) = N(1, 1), and the (1, 1) element of the second basic input matrix 510-2 becomes the (1, 201) element of the first composite input matrix 500N, that is, N2(1, 1) = N(1, 201).
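The index mapping above can be verified with a small sketch of horizontal concatenation. The helper name `hstack` and the constant fill values are hypothetical; note that the code uses 0-based indices, so N(1, 201) in the text is `N[0][200]` here.

```python
def hstack(matrices):
    """Join matrices side by side: rows are kept, columns are appended."""
    n_rows = len(matrices[0])
    if any(len(M) != n_rows for M in matrices):
        raise ValueError("all basic input matrices need the same number of rows")
    return [[x for M in matrices for x in M[i]] for i in range(n_rows)]

# Two hypothetical 512 x 200 basic input matrices with distinct fill values.
N1 = [[1.0] * 200 for _ in range(512)]
N2 = [[2.0] * 200 for _ in range(512)]

N = hstack([N1, N2])           # composite input matrix, 512 x 400
shape_N = (len(N), len(N[0]))
```

The row count stays at 512 while the column count grows by 200 for each basic input matrix that is appended.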

The NMF performing unit 130 calculates a model vector by performing non-negative matrix factorization on the composite input matrix (step S450). The factorization is performed in the same manner as step S230 of the emotion recognition method described above, so a detailed description is omitted.

In this factorization, the input matrix is the composite input matrix and the resulting basis vector becomes the model vector. The more columns the composite input matrix has, the better the basis vector represents the many frequency features of the corresponding emotion model.

A model vector for each emotion model is obtained by performing steps S430 to S450 for each emotion model. The obtained model vectors are stored in the emotion model database 160 (step S460) and are used for emotion recognition when a target voice signal is subsequently input.
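The flow from composite input matrices to a stored model database, followed by recognition of a new signal, might be sketched as below for r = 1. Everything here is illustrative: the function names, the tiny hypothetical matrices, and in particular the normalization of each vector to unit sum. That normalization is an assumption not stated in the source; it is added because NMF leaves the scale of W and H ambiguous (WH is unchanged when W is multiplied by c and H divided by c), which would otherwise distort a distance comparison.

```python
def rank1_nmf(V, iterations=50, eps=1e-12):
    """Rank-1 NMF (r = 1) with multiplicative updates; returns (w, h)."""
    n, m = len(V), len(V[0])
    w, h = [1.0] * n, [1.0] * m
    for _ in range(iterations):
        ww = sum(x * x for x in w)
        h = [h[j] * sum(w[i] * V[i][j] for i in range(n)) / (ww * h[j] + eps)
             for j in range(m)]
        hh = sum(x * x for x in h)
        w = [w[i] * sum(V[i][j] * h[j] for j in range(m)) / (w[i] * hh + eps)
             for i in range(n)]
    return w, h

def normalize(w):
    """Scale a vector to unit sum (assumption: not specified in the source)."""
    s = sum(w)
    return [x / s for x in w] if s > 0 else list(w)

def squared_euclidean(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def train_models(composites):
    """Map each emotion label to the normalized basis vector of its composite matrix."""
    return {label: normalize(rank1_nmf(V)[0]) for label, V in composites.items()}

def recognize(V_target, database):
    """Factor the target matrix and return the label of the closest model vector."""
    feature = normalize(rank1_nmf(V_target)[0])
    return min(database, key=lambda k: squared_euclidean(feature, database[k]))

# Hypothetical composite input matrices (3 frequency rows each).
composites = {
    "neutral": [[4.0, 5.0, 4.0, 5.0], [1.0, 1.0, 1.0, 1.0], [0.0, 0.5, 0.0, 0.5]],
    "angry":   [[2.0, 2.0, 2.0, 2.0], [2.0, 2.0, 2.0, 2.0], [2.0, 2.0, 2.0, 2.0]],
}
database = train_models(composites)

# Hypothetical target input matrix with energy concentrated in the first row.
V_target = [[8.0, 9.0], [2.0, 2.0], [0.5, 0.5]]
result = recognize(V_target, database)
```

In this toy setup the target's energy sits mostly in the first row, as it does in the "neutral" composite, so that model is selected.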

FIGS. 7A to 7C are examples of model vectors obtained according to an embodiment of the present invention. The horizontal axis corresponds to the rows of the model vector and represents 512 frequencies; specifically, it divides the 0 to 8000 Hz band into 512 parts. The vertical axis represents the amplitude feature at each frequency.
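To relate a row of the model vector to a frequency range, the following sketch assumes the 0 to 8000 Hz band is split into 512 equal-width parts (the equal widths are an assumption; the source states only the band and the count). The function name and the 1-based row index `k` are illustrative.

```python
def bin_range_hz(k, n_bins=512, max_hz=8000.0):
    """Return the (low, high) frequency range in Hz of 1-based row k."""
    width = max_hz / n_bins          # 15.625 Hz per row for the defaults
    return (k - 1) * width, k * width

first_row = bin_range_hz(1)
middle_row = bin_range_hz(256)
last_row = bin_range_hz(512)
```

Under this assumption each row spans 15.625 Hz, and row 256 ends exactly at 4000 Hz.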

The figures show the neutral (see FIG. 7A), anger (see FIG. 7B), and sadness (see FIG. 7C) emotion models, each with its own distinguishing features. These model vectors may take slightly different values depending on the type and number of model voice signals, but they share the main features.

For example, FIG. 7A shows that when the emotion model is neutral, the voice signal has strong values in the 0 to 4000 Hz band. FIG. 7B shows that when the emotion model is anger, the voice signal is evenly distributed over the entire 0 to 8000 Hz band. FIG. 7C shows that when the emotion model is sadness, peaks and valleys appear repeatedly in a specific band.

Meanwhile, the emotion recognition method and/or the model vector generation method described above can be written as a computer program. The codes and code segments constituting the program can be easily inferred by a computer programmer skilled in the art. The program is stored on computer-readable media and is read and executed by a computer to implement the emotion recognition method and/or the model vector generation method. The information storage media include magnetic recording media, optical recording media, and carrier wave media.

It will be understood that, in the embodiments of the present invention, one or more components may be implemented in integrated form, or some components may be implemented in functionally subdivided form, and that such implementations fall within the scope of the present invention.

Although the present invention has been described above with reference to preferred embodiments, those of ordinary skill in the art will understand that it can be modified and changed in various ways without departing from the spirit and scope of the invention as set forth in the claims below.

FIG. 1 is a block diagram of an emotion recognition apparatus according to an embodiment of the present invention.

FIG. 2 is a flowchart of an emotion recognition method according to an embodiment of the present invention.

FIG. 3 is an exemplary diagram of a matrix into which a voice signal is converted according to an embodiment of the present invention.

FIG. 4 is a flowchart of a model vector generation method according to an embodiment of the present invention.

FIG. 5 is a diagram showing examples of model voice signals for various emotion models.

FIG. 6 is a diagram showing an example of an input matrix generated from model voice signals.

FIGS. 7A to 7C are examples of model vectors obtained according to an embodiment of the present invention.

<Description of reference numerals>

100: emotion recognition apparatus

110: input unit 120: conversion unit

130: NMF performing unit 140: comparison unit

150: determination unit 160: emotion model database

500N, 500A, 500H, 500S: composite input matrices

510-1, 510-2, 510-3, 510-n: basic input matrices

Claims

An emotion recognition apparatus comprising:

an input unit configured to receive a target voice signal;

a conversion unit that converts the received target voice signal into an input matrix using a spectrogram;

an NMF performing unit that calculates a feature vector by performing non-negative matrix factorization (NMF) on the input matrix;

an emotion model database that stores a model vector for each emotion model;

a comparison unit that compares the feature vector with the model vectors stored in the emotion model database; and

a determination unit that determines the emotional state of the target voice signal to be the emotion model corresponding to the model vector most similar to the feature vector according to the comparison by the comparison unit,

wherein the input unit receives one or more model voice signals,

the conversion unit generates a basic input matrix from the spectrogram of each model voice signal and generates a composite input matrix by horizontally concatenating the basic input matrices for each emotion model, and

the NMF performing unit calculates the model vector by performing non-negative matrix factorization on the composite input matrix and stores the model vector in the emotion model database.

The apparatus of claim 1,

wherein the comparison unit performs the comparison using the square of the Euclidean distance between the feature vector and the model vector.

The apparatus of claim 1,

wherein the conversion unit sets the frequency axis and the time axis of the spectrogram of the target voice signal as the rows and columns of the input matrix, and sets the amplitude of the target voice signal at each frequency and time as the corresponding element.

The apparatus of claim 3,

wherein each element of the input matrix is non-negative.

The apparatus of claim 1,

wherein the NMF performing unit selects, from the results of the non-negative matrix factorization of the input matrix, the basis matrix representing the frequency features of the target voice signal as the feature vector.

delete

The apparatus of claim 1,

wherein the model voice signals include a target voice signal whose emotional state has been determined by the determination unit.

An emotion recognition method comprising:

(a) receiving a target voice signal;

(b) converting the received target voice signal into an input matrix using a spectrogram;

(c) calculating a feature vector by performing non-negative matrix factorization (NMF) on the input matrix;

(d) comparing the feature vector with the model vectors stored in an emotion model database; and

(e) determining the emotional state of the target voice signal to be the emotion model corresponding to the model vector most similar to the feature vector according to the comparison,

wherein the following steps are performed prior to step (a):

receiving one or more model voice signals;

generating a basic input matrix using the spectrogram of each model voice signal;

generating a composite input matrix by horizontally concatenating the basic input matrices for each emotion model;

calculating a model vector by performing non-negative matrix factorization on the composite input matrix; and

storing the model vector in the emotion model database.

The method of claim 8,

wherein step (d) performs the comparison using the square of the Euclidean distance between the feature vector and the model vector.

The method of claim 8,

wherein, in step (b), the frequency axis and the time axis of the spectrogram of the target voice signal are set as the rows and columns of the input matrix, and the amplitude of the target voice signal at each frequency and time is set as the corresponding element.

The method of claim 10,

wherein each element of the input matrix is non-negative.

The method of claim 8,

wherein step (c) selects, from the results of the non-negative matrix factorization of the input matrix, the basis matrix representing the frequency features of the target voice signal as the feature vector.

delete

A recording medium readable by a computer device, on which is recorded a program of instructions executable by the computer device to carry out the emotion recognition method according to any one of claims 8 to 12.