KR102106700B1

KR102106700B1 - A method of simultaneous recognition for emotion, age, and gender based on users' voice signal

Info

Publication number: KR102106700B1
Application number: KR1020180071462A
Authority: KR
Inventors: 이수영; 채명수; 신영훈; 김태호; 김준우
Original assignee: 한국과학기술원
Priority date: 2018-05-23
Filing date: 2018-06-21
Publication date: 2020-05-28
Also published as: KR20190133580A

Abstract

Disclosed are a method and a system for simultaneously recognizing emotion, age, and gender based on a user's voice signal. A computer-implemented method for recognizing user characteristic information may include: dividing an input data set corresponding to voice signals into frames for each of a plurality of predetermined characteristics; performing learning on the frames divided per characteristic, using a convolutional neural network and based on a loss function for each of the plurality of characteristics; and recognizing a plurality of different items of user characteristic information corresponding to an input voice signal, based on the learning model generated through the learning.

Description

A method of simultaneous recognition for emotion, age, and gender based on users' voice signal

Embodiments of the present invention relate to a technique for recognizing a user's emotion, age, and gender, and more particularly to a technique for recognizing a user's emotion, age, and gender based on deep learning.

As interest grows in interface technologies between users and machines, such as artificial intelligence assistants and biometrics-based security, techniques for recognizing human characteristics such as emotion from biometric data, including voice and facial expressions, are being actively studied.

Emotion recognition technology using voice signals can be applied in many fields. For example, when the user's emotional state is predicted to be anger, an intelligent response is possible in which the service is offered in a tone that calms heightened emotions, in a composed manner of speaking, and the like. Likewise, when the user's emotional state is predicted to be sadness, an intelligent response is possible through services such as suggesting music, for example a sad ballad.

When a user uses smartphone-based services such as customer centers, telephone consultation, and telephone education, knowing the user's emotion, age, and gender makes it possible to provide a service suited to the user's state.

Korean Patent Application Publication No. 10-2011-0011969 relates to a method for building a speech emotion recognition model using a loss function and a maximum-margin technique based on the WTM (Watson-Tellegen Emotional Model). It discloses a technique that quantifies the difference between emotions using the geometric distances between the emotion groups of the WTM, computes the value of a loss function based on the values thus set, and builds an emotion recognition model by obtaining the parameters of each speech emotion model from the computed loss function.

The present invention relates to a technique for recognizing a user's emotion, age, and gender from the user's voice signal based on deep learning.

It also relates to a technique for providing a service suited to the recognized emotion, age, and gender of the user.

A computer-implemented method for recognizing user characteristic information may include: dividing an input data set corresponding to voice signals into frames for each of a plurality of predetermined characteristics; performing learning on the frames divided per characteristic, using a convolutional neural network and based on a loss function for each of the plurality of characteristics; and recognizing a plurality of different items of user characteristic information corresponding to an input voice signal, based on the learning model generated through the learning.

According to one aspect, performing the learning may include calculating a loss for each of the plurality of characteristics based on a softmax function, and setting the criterion for final learning based on the sum of the calculated per-characteristic losses.

According to another aspect, dividing the frames for each of the plurality of characteristics may divide the input data set into frames for the user's emotion, age, and gender.

According to yet another aspect, performing the learning may include performing max pooling, which performs sampling, on each convolutional layer of the convolutional neural network.

According to yet another aspect, recognizing the user characteristic information may simultaneously recognize the user's emotion, age, and gender from the input voice signal.

According to yet another aspect, when the user characteristic information is emotion, recognizing the user characteristic information may recognize from the input voice signal whether the user's emotional state corresponds to at least one of neutral, joy, sadness, anger, disgust, surprise, and fear.

According to yet another aspect, the method may further include providing a service corresponding to the user's current state based on the recognized user characteristic information.

A system for recognizing user characteristic information may include: a frame division unit that divides an input data set corresponding to voice signals into frames for each of a plurality of predetermined characteristics; a learning control unit that performs learning on the frames divided per characteristic, using a convolutional neural network and based on a loss function for each of the plurality of characteristics; and a user characteristic recognition unit that recognizes a plurality of different items of user characteristic information corresponding to an input voice signal, based on the learning model generated through the learning.

According to one aspect, the learning control unit may calculate a loss for each of the plurality of characteristics based on a softmax function, and set the criterion for final learning based on the sum of the calculated per-characteristic losses.

According to another aspect, the frame division unit may divide the input data set into frames for the user's emotion, age, and gender.

According to yet another aspect, the learning control unit may perform max pooling, which performs sampling, on each convolutional layer of the convolutional neural network.

According to yet another aspect, the user characteristic recognition unit may simultaneously recognize the user's emotion, age, and gender from the input voice signal.

According to yet another aspect, when the user characteristic information is emotion, the user characteristic recognition unit may recognize from the input voice signal whether the user's emotional state corresponds to at least one of neutral, joy, sadness, anger, disgust, surprise, and fear.

According to yet another aspect, the system may further include a service providing unit that provides a service corresponding to the user's current state based on the recognized user characteristic information.

The present invention can simultaneously recognize a user's emotion, age, and gender from the user's voice signal by using learning based on a convolutional neural network, one of the deep learning-based learning algorithms.

In addition, by recognizing the user's emotion together with the user's age and gender, the user's emotional state can be recognized more accurately, reflecting the differences in vocal range, tone, speed, and the like that vary with gender and age.

FIG. 1 is a block diagram illustrating the internal configuration of a user characteristic information recognition system according to an embodiment of the present invention.
FIG. 2 is a flowchart illustrating a method for recognizing user characteristic information according to an embodiment of the present invention.
FIG. 3 is a diagram illustrating a convolutional neural network structure according to an embodiment of the present invention.

Hereinafter, embodiments of the present invention will be described in detail with reference to the accompanying drawings.

The present embodiments relate to a technique for simultaneously recognizing a user's emotion, age, and gender based on the user's voice signal. In particular, they relate to a technique that uses a learning model, generated by training based on a convolutional neural network (one of the deep learning techniques), to simultaneously recognize the user's current emotional state, age, and gender from an input voice signal of the user.

In the present embodiments, "user characteristic information" may indicate the user's state, such as the user's emotion, age, and gender.

In the present embodiments, "convolutional neural network-based learning" may indicate that learning proceeds such that the loss function (joint loss) converges to a minimum value.

FIG. 1 is a block diagram illustrating the internal configuration of a user characteristic information recognition system according to an embodiment of the present invention, and FIG. 2 is a flowchart illustrating a method for recognizing user characteristic information according to an embodiment of the present invention.

The user characteristic information recognition system 100 according to the present embodiment may include a processor 110, a bus 120, a network interface 130, and a memory 140. The memory 140 may include an operating system 141 and a service providing routine 142. The processor 110 may include a frame division unit 111, a learning control unit 112, a user characteristic recognition unit 113, and a service providing unit 114. In other embodiments, the user characteristic information recognition system 100 may include more components than those shown in FIG. 1. However, most conventional components need not be explicitly illustrated. For example, the user characteristic information recognition system 100 may include other components such as a display or a transceiver.

The memory 140 is a computer-readable recording medium and may include random access memory (RAM), read-only memory (ROM), and a permanent mass storage device such as a disk drive. Program code for the operating system 141 and the service providing routine 142 may be stored in the memory 140. These software components may be loaded from a computer-readable recording medium separate from the memory 140 using a drive mechanism (not shown). Such a separate computer-readable recording medium (not shown) may include a floppy drive, a disk, a tape, a DVD/CD-ROM drive, a memory card, and the like. In other embodiments, the software components may be loaded into the memory 140 through the network interface 130 rather than from a computer-readable recording medium.

The bus 120 may enable communication and data transmission between the components of the user characteristic information recognition system 100. The bus 120 may be configured using a high-speed serial bus, a parallel bus, a storage area network (SAN), and/or other suitable communication technology.

The network interface 130 may be a computer hardware component for connecting the user characteristic information recognition system 100 to a computer network. The network interface 130 may connect the user characteristic information recognition system 100 to the computer network through a wireless or wired connection.

The user characteristic information recognition system 100 may be implemented as a platform that recognizes user characteristic information, such as the user's emotion, age, and gender, based on the voice signal of a user whose terminal is connected to a server, and that provides a service corresponding to the recognized characteristics. Alternatively, it may be implemented as an application (that is, a service app) that recognizes user characteristic information such as the user's emotion, age, and gender based on a voice signal input through a speaker or the like provided in the user terminal. When the system is implemented as an application on the user terminal, the user characteristic information recognized at the user terminal may be transmitted through the application to a service provider terminal acting as the server, and the service provider terminal may provide the service corresponding to the received user characteristic information to the user terminal through the application.

The processor 110 may be configured to process the instructions of a computer program by performing basic arithmetic, logic, and input/output operations of the user characteristic information recognition system 100. The instructions may be provided to the processor 110 by the memory 140 or the network interface 130, via the bus 120. The processor 110 may be configured to execute program code for the frame division unit 111, the learning control unit 112, the user characteristic recognition unit 113, and the service providing unit 114. Such program code may be stored in a recording device such as the memory 140.

The frame division unit 111, the learning control unit 112, the user characteristic recognition unit 113, and the service providing unit 114 may be configured to perform steps 210 to 240 of FIG. 2.

In step 210, the frame division unit 111 may divide an input data set corresponding to voice signals into frames for each of a plurality of predetermined characteristics. For example, to generate a learning model for simultaneously recognizing a user's emotion, age, and gender, the frame division unit 111 may divide an input data set, consisting of previously collected voice signals from many different users, into frames for emotion, frames for age, and frames for gender.

As an example, the frame division unit 111 may extract features corresponding to predetermined valid sounds from the input data set of voice signals based on Mel Frequency Cepstral Coefficients (MFCC). Rather than extracting features from the input data set as a whole, the frame division unit 111 may divide it into fixed intervals, that is, into frames, and extract features through spectrum analysis of each frame. Because a voice signal changes continuously in the time domain, it can be assumed, in order to extract features from the changing sound, that the voice signal does not change much within a predetermined short time. In other words, within the margin of error the voice signal can be treated as practically unchanged over that interval. The frame division unit 111 may then calculate the power spectrum (that is, the frequency content) of each frame. The power spectrum calculated for each frame may be extracted as the feature, that is, as a feature vector. Once the frequency content has been calculated in this way, it is possible to know how much energy is present in each section.
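The framing and per-frame power spectrum described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the 25 ms frame length, 10 ms hop, Hamming window, and 512-point FFT are assumptions chosen for a 16 kHz signal, and the function name `frame_power_spectrum` is hypothetical.

```python
import numpy as np

def frame_power_spectrum(signal, frame_len=400, hop=160, n_fft=512):
    """Split a 1-D signal into short overlapping frames and return each frame's power spectrum.

    frame_len, hop and n_fft are assumed values (25 ms / 10 ms at 16 kHz), not from the source.
    """
    signal = np.asarray(signal, dtype=np.float64)
    if len(signal) < frame_len:
        raise ValueError("signal is shorter than one frame")
    n_frames = 1 + (len(signal) - frame_len) // hop
    # Index matrix: row i selects samples [i * hop, i * hop + frame_len)
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    # Within one short frame the signal is treated as roughly stationary
    frames = signal[idx] * np.hamming(frame_len)
    spectrum = np.fft.rfft(frames, n=n_fft, axis=1)
    return (np.abs(spectrum) ** 2) / n_fft  # shape: (n_frames, n_fft // 2 + 1)

# Synthetic example: one second of a 440 Hz tone sampled at 16 kHz
sr = 16000
t = np.arange(sr) / sr
power = frame_power_spectrum(np.sin(2 * np.pi * 440 * t))
# Frequency of the bin that carries the most energy on average
peak_hz = power.mean(axis=0).argmax() * sr / 512
```

Each row of `power` is one frame's feature vector; the energy in any band can be read off by summing the corresponding columns. A mel filter bank and cepstral transform would be applied on top of this to obtain MFCCs.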

In step 220, the learning control unit 112 may perform learning on the frames divided per characteristic, using a convolutional neural network and based on a loss function for each of the plurality of characteristics. That is, the frames carrying the power spectra calculated by the frame division unit 111 may be set as the input values of the input layer of the convolutional neural network. For example, 25 frames per characteristic (gender, age, emotion, and so on) may be set at the input layer for learning. Learning may thus be performed by taking as input the per-frame power spectra, that is, the feature vectors corresponding to each characteristic (gender, age, emotion). In a convolutional neural network structure composed of an input layer, hidden layers, and an output layer, the learning control unit 112 may control learning so that the value produced at the output layer is trained toward the actually desired value. The layers may be connected to one another by weight values, and the learning control unit 112 may control learning by adjusting the individual weights of each layer so that the desired value is output for the same input.

Specifically, the learning control unit 112 may perform max pooling, which performs sampling, on each per-characteristic convolutional layer (CNN_ReLU) belonging to at least one hidden layer. For example, when at least 25 frames are set per characteristic, the learning control unit 112 may perform max pooling that collects, for each characteristic (for example, gender, age, and emotion), the largest of the values obtained by multiplying the feature vectors of the 25 frames by the weights.
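A minimal sketch of the pooling step described above, under stated assumptions: each per-characteristic branch is modeled as a single per-frame linear filter bank followed by ReLU, and max pooling keeps the largest response across the 25 frames. Only the figure of 25 frames comes from the text; the feature size, filter count, random weights, and the name `conv_relu_maxpool` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
N_FRAMES, N_BINS, N_FILTERS = 25, 257, 8  # 25 frames is from the text; other sizes are assumed

def conv_relu_maxpool(frames, weights, bias):
    """Per-frame linear filters + ReLU, then keep the maximum response over all frames."""
    activations = np.maximum(frames @ weights + bias, 0.0)  # (n_frames, n_filters)
    return activations.max(axis=0)                          # (n_filters,)

# Stand-in feature vectors for 25 frames (random data, for illustration only)
frames = rng.random((N_FRAMES, N_BINS))

# One branch per characteristic, each with its own (here randomly initialized) weights
branches = {}
for task in ("gender", "age", "emotion"):
    w = rng.normal(scale=0.1, size=(N_BINS, N_FILTERS))
    b = np.zeros(N_FILTERS)
    branches[task] = conv_relu_maxpool(frames, w, b)
```

Pooling over the frame axis yields a fixed-size vector per characteristic regardless of which frame produced the strongest response.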

Once max pooling has been performed, the learning control unit 112 may calculate a loss for each of the plurality of characteristics based on a softmax function, and may set the criterion for final learning based on the sum of the calculated per-characteristic losses. That is, the learning control unit 112 may calculate the losses for emotion, age, and gender, and combine the calculated losses according to Equation 1 below to obtain a joint loss. The joint loss may then be set as the criterion for final learning. In other words, learning may be performed such that the losses for emotion, age, and gender are each calculated and the joint loss is minimized.

[Equation 1]

In Equation 1, L_gender may denote the loss for gender calculated based on the softmax function, L_age the loss for age, and L_emotion the loss for emotion.

By calculating the losses for emotion, age, and gender using the softmax function, and calculating the joint loss as the sum of those losses and setting it as the criterion for final learning, multi-task learning may be performed to simultaneously recognize the user's emotion, age, and gender, which correspond to several tasks, from a single voice signal.
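The joint loss described above can be sketched with plain NumPy. This is an illustration under assumptions: the per-task loss is taken to be softmax cross-entropy, the class counts for gender (2) and age (3) are invented for the example, and the helper names are hypothetical. The seven emotion classes follow the text.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def softmax_loss(logits, label):
    """Cross-entropy of the softmax output against one integer class label."""
    return float(-np.log(softmax(np.asarray(logits, dtype=np.float64))[label]))

def joint_loss(logits_by_task, labels_by_task):
    """Sum the per-task softmax losses into a single training criterion."""
    losses = {task: softmax_loss(logits_by_task[task], labels_by_task[task])
              for task in logits_by_task}
    return sum(losses.values()), losses

# Example scores for one utterance (made-up numbers)
logits = {
    "gender": np.array([2.0, 0.0]),   # 2 classes (assumed)
    "age": np.zeros(3),               # 3 age groups (assumed)
    "emotion": np.zeros(7),           # 7 emotions, as listed in the text
}
labels = {"gender": 0, "age": 1, "emotion": 3}
total, per_task = joint_loss(logits, labels)
```

Because a single scalar `total` drives the weight updates, all three tasks are trained together from the same input.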

According to Equation 1, rather than simply adding the per-task terms (gender, emotion, age) together, the learning control unit 112 may treat the weight of each task as a variable, learn it, and set the resulting weighted sum as the criterion for final learning.
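Equation 1 itself is not reproduced in this text. One form consistent with the surrounding description would be the following, where the per-task weights w_gender, w_age, and w_emotion are symbols introduced here for illustration and do not appear in the source:

```latex
L_{\text{joint}} = w_{\text{gender}}\,L_{\text{gender}} + w_{\text{age}}\,L_{\text{age}} + w_{\text{emotion}}\,L_{\text{emotion}}
```

With all three weights fixed at 1 this reduces to the plain sum of the per-task losses; treating the weights as learnable variables gives the weighted sum mentioned in the description. How the weights are constrained during learning is not stated.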

In step 230, the user characteristic recognition unit 113 may simultaneously recognize a plurality of different items of user characteristic information corresponding to an input voice signal, based on the learning model generated through the learning. For example, a voice signal input through a speaker or the like of the user terminal may be set as the input to the learning model generated with the joint loss calculated from Equation 1 as the criterion for final learning. As the output of the learning model, the age and gender of the user recognized from the input voice signal may be output. The user's current emotional state may be recognized and output simultaneously with the age and gender.
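As a rough sketch of the recognition step, assuming the trained model exposes one score vector per characteristic from a single forward pass, the three labels can be read off together. The emotion labels follow the text; the gender and age label sets, the score values, and the function name `recognize` are assumptions.

```python
import numpy as np

EMOTIONS = ("neutral", "joy", "sadness", "anger", "disgust", "surprise", "fear")  # from the text
GENDERS = ("female", "male")                # assumed label set
AGE_GROUPS = ("child", "adult", "senior")   # assumed label set

def recognize(outputs):
    """Pick the highest-scoring class of each output head from one forward pass."""
    return {
        "emotion": EMOTIONS[int(np.argmax(outputs["emotion"]))],
        "age": AGE_GROUPS[int(np.argmax(outputs["age"]))],
        "gender": GENDERS[int(np.argmax(outputs["gender"]))],
    }

# Made-up scores standing in for the model output on one utterance
result = recognize({
    "emotion": np.array([0.1, 0.2, 0.1, 2.5, 0.0, 0.3, 0.1]),
    "age": np.array([0.2, 1.4, 0.1]),
    "gender": np.array([0.3, 1.1]),
})
```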

As an example, the vocal range and intensity of the voice may differ depending on the user's age and gender. For example, women's vocal range lies in a higher frequency band than men's, and the younger a child is, the higher the frequency band tends to be relative to adults. Accordingly, at the same time as the user's gender and age are recognized by the learning model, the user's current emotional state may be recognized in detail and output. The tone, speed, tremor, and the like of the voice vary with the emotional state. Based on feature vectors representing tone, speed, tremor, and the like, together with feature vectors corresponding to the vocal range and intensity of the voice signal, it may be recognized which of a plurality of predetermined emotional states applies to a user of that gender and age. For example, it may be recognized whether the user's emotional state corresponds to at least one of seven emotions: neutral, joy, sadness, anger, disgust, surprise, and fear.

For example, a frequency corresponding to an extracted feature vector may indicate joy in the voice signal of a woman in her twenties or a child aged 7 or under, but anger or surprise in that of a man in his forties. Accordingly, by simultaneously recognizing the three characteristics of age, gender, and emotion from a single voice signal input to the learning model, the user's actual emotional state can be recognized better, reflecting the physiological voice characteristics that vary with age and gender. Although the case of joy has been described here as an example, a voice may also be recognized as corresponding to two or more emotional states. For example, it may be recognized as a state of surprise and fear, or of surprise, fear, and anger.

In step 240, once the user's age, gender, and emotion have been recognized, the service providing unit 114 may provide a service suited to the user's state based on the recognized age, gender, and emotion.

For example, when a music service is provided and the user's emotion is joy and the user is a girl aged 7 or under, a birthday children's song may be provided. When the user's emotion is joy and the user is a boy aged 7 or under, a predetermined cartoon ending or opening theme popular in that age group may be provided. When the user's emotion is joy and the user is a man in his twenties, predetermined K-pop popular in that age group may be provided. Likewise, when a movie service is provided, the service providing unit 114 may provide the service in consideration of the user's emotion, age, and gender.
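The music examples above amount to a lookup from the recognized characteristics to a service. A small sketch follows, assuming a numeric age estimate and string labels; the `Profile` type, the function name, and the fallback value are hypothetical, and only the three rules themselves come from the text.

```python
from typing import NamedTuple

class Profile(NamedTuple):
    emotion: str
    age: int
    gender: str

def recommend_music(p: Profile) -> str:
    """Map recognized emotion, age and gender to a music suggestion."""
    if p.emotion == "joy" and p.age <= 7:
        if p.gender == "female":
            return "birthday children's song"
        return "popular cartoon opening or ending theme"
    if p.emotion == "joy" and 20 <= p.age < 30 and p.gender == "male":
        return "K-pop popular in this age group"
    # Fallback for combinations the text does not cover (assumed)
    return "general recommendations"

suggestion = recommend_music(Profile(emotion="joy", age=6, gender="female"))
```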

In addition, during a customer center call or telephone consultation, when the user's emotion, age, and gender are recognized from the voice signal input through the connected counterpart terminal, the service providing unit 114 may look up, in a predetermined customer response manual, the response method that corresponds to the recognized emotion, age, and gender, and provide it to the terminal of the customer center or telephone counselor. The recognized emotion, age, and gender information may be provided together with it. The customer center representative or telephone counselor can then refer to this information and to the provided response method in order to respond effectively to the user's request in a manner suited to the user's current state. For example, among users who call the customer center because of a product failure or a misdelivery, some may be in an angry emotional state while others are neutral. In this case, the service providing unit 114 may provide the corresponding response method together with the recognized age, gender, and emotional state, so that the customer center representative can respond appropriately to that user. For example, if the user is an elderly man whose emotion is anger, the response method may direct the representative to meet the request quickly, without taking much time or further raising the user's anger, and to handle the call in a manner of speech and at a speed that an elderly man can easily follow.

FIG. 3 is a diagram illustrating a convolutional neural network structure according to an embodiment of the present invention.

Referring to FIG. 3, the convolutional neural network 300 may include an input layer 310, a hidden layer 320, and an output layer 330. The hidden layer 320 may be composed of a plurality of layers.

In FIG. 3, 25 frames of a 40-dimensional mel spectrogram may be set as the input of the input layer. Here, 25 frames (i.e., mel spectrogram × 25) may be set as the input for each characteristic (emotion, age, gender). A mel spectrogram is a 2D representation suited to human auditory perception: it is a speech-signal representation that preserves the most important information by compressing the Short-Time Fourier Transform (STFT) along the frequency axis.
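The input shape described here (25 frames of 40 mel bands each) can be illustrated by grouping a sequence of mel frames into fixed-size blocks. This is a sketch under assumptions rather than the patented procedure: the function name `make_inputs`, the `hop` parameter, and the choice to drop a trailing partial block are hypothetical, and the mel frames themselves are assumed to have been computed elsewhere.

```python
from typing import List

N_MELS = 40    # mel bands per frame, as stated in the text
N_FRAMES = 25  # frames per network input, as stated in the text


def make_inputs(mel_frames: List[List[float]],
                hop: int = N_FRAMES) -> List[List[List[float]]]:
    """Group consecutive 40-dimensional mel frames into 25-frame blocks.

    `hop` is the stride between block starts (an assumption); the default
    gives non-overlapping blocks. A trailing partial block is dropped.
    """
    for frame in mel_frames:
        if len(frame) != N_MELS:
            raise ValueError(f"expected {N_MELS} mel bands, got {len(frame)}")
    blocks = []
    for start in range(0, len(mel_frames) - N_FRAMES + 1, hop):
        blocks.append(mel_frames[start:start + N_FRAMES])
    return blocks


# Dummy utterance: 60 all-zero frames stand in for a real mel spectrogram.
utterance = [[0.0] * N_MELS for _ in range(60)]
blocks = make_inputs(utterance)
overlapping = make_inputs(utterance, hop=10)
```

Each element of `blocks` has shape 25 × 40 and would correspond to one input to the network for a given characteristic.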

When a plurality of frames is set in the input layer for each characteristic in this way, the feature vector of each frame is multiplied by weights, and max pooling is performed, which collects, for each characteristic, the maximum of the weighted values.
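The weighting and max-pooling step can be sketched as follows. This is a deliberately simplified illustration, assuming each filter spans a single frame (so the weighted value is a dot product between the filter and the frame's feature vector) and that pooling runs over all frames in the block; the actual kernel sizes and pooling windows are not given in this passage, and `weighted_max_pool` is a hypothetical name.

```python
from typing import List


def weighted_max_pool(frames: List[List[float]],
                      weights: List[List[float]]) -> List[float]:
    """Return, for each filter, the maximum weighted response over all frames.

    frames:  T feature vectors of dimension D.
    weights: K filters of dimension D (single-frame filters are an assumption).
    """
    if not frames:
        raise ValueError("at least one frame is required")
    pooled = []
    for w in weights:
        # Weighted value of every frame under this filter.
        responses = [sum(wi * xi for wi, xi in zip(w, frame)) for frame in frames]
        # Max pooling keeps only the strongest response.
        pooled.append(max(responses))
    return pooled


frames = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
weights = [[1.0, 0.0], [0.0, 1.0]]
pooled = weighted_max_pool(frames, weights)
```

With these toy values the first filter responds 1.0, 0.0, 1.0 and the second 0.0, 2.0, 1.0 across the three frames, so the pooled vector keeps 1.0 and 2.0.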

A loss is calculated for each characteristic by applying a softmax function to the max-pooled values, and the calculated losses are summed according to Equation 1 above to obtain a joint loss, which is used to generate a learning model that recognizes emotion, gender, and age simultaneously from a voice signal. Once the joint loss has been calculated, it is set as the final learning criterion, and learning may be performed on the input data set so that the joint loss converges to a minimum. When a specific voice signal is subsequently input, the feature vector extracted from that signal is set as the input of the learning model generated by the learning, and the age, gender, and emotion of the user corresponding to the voice signal may be recognized and output as output values.
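The joint loss can be illustrated with a small calculation. The sketch below assumes that Equation 1 is an unweighted sum of the three per-characteristic softmax cross-entropy losses; the passage says only that the losses are summed, so any weighting is unknown here. The function names, the dictionary layout, and the number of age classes (four) are hypothetical; the seven emotion classes follow the list given in the claims.

```python
import math
from typing import Dict, List

TASKS = ("gender", "age", "emotion")


def softmax_cross_entropy(logits: List[float], target: int) -> float:
    """Negative log-probability of `target` under softmax(logits)."""
    m = max(logits)  # subtract the maximum for numerical stability
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[target]


def joint_loss(logits: Dict[str, List[float]], targets: Dict[str, int]) -> float:
    """Unweighted sum of the per-task losses (assumed form of Equation 1)."""
    return sum(softmax_cross_entropy(logits[t], targets[t]) for t in TASKS)


# Toy outputs: uniform logits for 2 genders, 4 age bands (assumed), 7 emotions.
logits = {"gender": [0.0] * 2, "age": [0.0] * 4, "emotion": [0.0] * 7}
targets = {"gender": 1, "age": 2, "emotion": 0}
loss = joint_loss(logits, targets)
```

With uniform logits each task contributes the logarithm of its class count, so the toy `loss` equals ln 2 + ln 4 + ln 7 = ln 56. Minimizing the sum pushes all three output heads to improve together on the shared features.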

The method according to the embodiments may be implemented in the form of program instructions that can be executed by various computer means and recorded on a computer-readable medium. The computer-readable medium may include program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the medium may be specially designed and configured for the embodiments, or may be known and available to those skilled in computer software. Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include not only machine code such as that produced by a compiler, but also high-level language code that can be executed by a computer using an interpreter or the like.

Although the embodiments have been described above with reference to limited embodiments and drawings, those of ordinary skill in the art can make various modifications and variations from the above description. For example, appropriate results may be achieved even if the described techniques are performed in an order different from the described method, and/or components of the described systems, structures, devices, circuits, and the like are coupled or combined in a form different from the described method, or are replaced or substituted by other components or equivalents.

Therefore, other implementations, other embodiments, and equivalents of the claims also fall within the scope of the claims that follow.

Claims

A method for recognizing user characteristic information, executed by a computer, the method comprising:
dividing an input data set corresponding to a voice signal into frames for each of a plurality of predetermined characteristics;
performing learning on the divided frames for each characteristic, based on a convolutional neural network and on a loss function for each of the plurality of characteristics; and
recognizing a plurality of different pieces of user characteristic information corresponding to an input voice signal, based on a learning model generated through the learning,
wherein the dividing into frames for each of the plurality of characteristics comprises
extracting, as a feature vector, a power spectrum calculated through spectrum analysis of each frame divided at a predetermined interval from the input data set corresponding to the voice signal, based on Mel Frequency Cepstral Coefficients (MFCC),
wherein the performing of the learning comprises
setting the frames to which the calculated per-frame power spectrum belongs as input values of an input layer of the convolutional neural network, and performing the learning by receiving as input the feature vectors corresponding to the per-frame power spectrum,
wherein the recognizing of the user characteristic information comprises
performing max pooling for sampling at each convolution layer based on the convolutional neural network; once the max pooling has been performed, calculating a loss for each of the plurality of characteristics based on a softmax function; calculating a joint loss based on Equation 1; setting the calculated joint loss as the criterion for final learning; and inputting a voice signal into the generated learning model and simultaneously recognizing the user's emotion, age, and gender as the output of the learning model,
and wherein, in Equation 1, L_gender is the loss corresponding to gender calculated based on the softmax function, L_age is the loss corresponding to age, and L_emotion is the loss corresponding to emotion.

delete

The method of claim 1,
wherein the dividing into frames for each of the plurality of characteristics comprises
dividing the input data set into frames for each of the user's emotion, age, and gender.

delete

The method of claim 1,
wherein the recognizing of the user characteristic information comprises,
when the user characteristic information is emotion, recognizing from the input voice signal which of neutral, joy, sadness, anger, disgust, surprise, and fear the user's emotional state corresponds to.

The method of claim 1, further comprising
providing a service corresponding to the user's current state based on the recognized user characteristic information.

A system for recognizing user characteristic information, the system comprising:
a frame dividing unit that divides an input data set corresponding to a voice signal into frames for each of a plurality of predetermined characteristics;
a learning control unit that performs learning on the divided frames for each characteristic, based on a convolutional neural network and on a loss function for each of the plurality of characteristics; and
a user characteristic recognition unit that recognizes a plurality of different pieces of user characteristic information corresponding to an input voice signal, based on a learning model generated through the learning,
wherein the frame dividing unit
extracts, as a feature vector, a power spectrum calculated through spectrum analysis of each frame divided at a predetermined interval from the input data set corresponding to the voice signal, based on Mel Frequency Cepstral Coefficients (MFCC),
wherein the learning control unit
sets the frames to which the calculated per-frame power spectrum belongs as input values of an input layer of the convolutional neural network, and performs the learning by receiving as input the feature vectors corresponding to the per-frame power spectrum,
wherein the user characteristic recognition unit
performs max pooling for sampling at each convolution layer based on the convolutional neural network; once the max pooling has been performed, calculates a loss for each of the plurality of characteristics based on a softmax function; calculates a joint loss based on Equation 1; sets the calculated joint loss as the criterion for final learning; and inputs a voice signal into the generated learning model and simultaneously recognizes the user's emotion, age, and gender as the output of the learning model,
and wherein, in Equation 1, L_gender is the loss corresponding to gender calculated based on the softmax function, L_age is the loss corresponding to age, and L_emotion is the loss corresponding to emotion.

delete

The system of claim 8,
wherein the frame dividing unit
divides the input data set into frames for each of the user's emotion, age, and gender.

delete

The system of claim 8,
wherein the user characteristic recognition unit,
when the user characteristic information is emotion, recognizes from the input voice signal which of neutral, joy, sadness, anger, disgust, surprise, and fear the user's emotional state corresponds to.

The system of claim 8, further comprising
a service providing unit that provides a service corresponding to the user's current state based on the recognized user characteristic information.