KR102381925B1

KR102381925B1 - Robust speaker adaptation modeling method for personalized emotion recognition and apparatus thereof

Info

Publication number: KR102381925B1
Application number: KR1020200096085A
Authority: KR
Inventors: 이승룡; 방재훈
Original assignee: 경희대학교 산학협력단
Priority date: 2020-07-31
Filing date: 2020-07-31
Publication date: 2022-03-31
Also published as: KR20220015714A

Abstract

Disclosed are a robust speaker adaptation modeling method for personalized emotion recognition and an apparatus therefor.
In this method, when the number of emotional speech data items initially obtained from a target user (hereinafter, "target user data") is less than a preset threshold number and no emotion is absent from the target user data, data similar to the per-emotion data of the target user is selected, for reinforcement, from data obtained by unlabeling the labeled data of a previously built initial model for multiple users (hereinafter, "initial model"). Alternatively, when the target user data is less than the preset threshold number and data for some emotion is absent from the target user data, data for the absent emotion is selected, for reinforcement, from the data of the user in the initial model whose emotion data is most similar to that of the target user. Alternatively, when the target user data is equal to or greater than the preset threshold number and the per-emotion data counts of the target user are imbalanced, the per-emotion data of the target user is reinforced by virtual data augmentation using an oversampling algorithm.

Description

ROBUST SPEAKER ADAPTATION MODELING METHOD FOR PERSONALIZED EMOTION RECOGNITION AND APPARATUS THEREOF

The present invention relates to a robust speaker adaptation modeling method for personalized emotion recognition and an apparatus therefor.

Conventional emotion recognition technology collects emotional speech data from many users, trains on it with a machine learning algorithm, and applies the resulting model identically to all users.

However, such technology has difficulty delivering uniform accuracy for all users, and its accuracy varies widely from user to user. Accordingly, personalized emotion recognition technology, which generates a personalized model for each user and applies it to that user, has recently emerged.

Existing personalized emotion recognition technology starts from an emotion recognition model built from many users and adapts that model to the labeled emotional speech data collected from the target user by changing linear parameters and performing label refinement.

However, existing personalized emotion recognition technology requires sufficient data to arrive at a highly accurate personalized model, and it achieves its best performance only when the user's emotional speech data is supplied evenly across emotions. In the initial stage of personalized emotion recognition, it is difficult to secure enough labeled data from the target user and the data easily becomes imbalanced, so the system is prone to the cold-start problem, in which a personalized model is difficult to generate.

Therefore, a technique for generating a robust personalized emotion recognition model is required to solve the cold-start problem in existing personalized emotion recognition technology.

The problem to be solved by the present invention is to provide a robust speaker adaptation modeling method, and an apparatus therefor, capable of deriving a personalized training dataset that is sufficient and balanced for generating a personalized training model even in small-data, absent-data, and imbalanced-data environments, in which not enough target user data has been collected during personalized speech-based emotion recognition.

The characteristic configuration of the present invention for achieving the above object and realizing the characteristic effects described later is as follows.

According to one aspect of the present invention, a speaker adaptation modeling method is provided.

The method is a robust speaker adaptation modeling method for personalized emotion recognition and includes: when the number of emotional speech data items initially obtained from a target user (hereinafter, "target user data") is less than a preset threshold number and no emotion is absent from the target user data, selecting, for reinforcement, data similar to the per-emotion data of the target user from data obtained by unlabeling the labeled data of a previously built initial model for multiple users (hereinafter, "initial model"); or, when the target user data is less than the preset threshold number and data for some emotion is absent from the target user data, selecting, for reinforcement, data for the absent emotion from the data of the user in the initial model whose emotion data is most similar to that of the target user; or, when the target user data is equal to or greater than the preset threshold number and the per-emotion data counts of the target user are imbalanced, reinforcing the per-emotion data of the target user by virtual data augmentation using an oversampling algorithm.

According to another aspect of the present invention, a speaker adaptation modeling apparatus is provided.

The apparatus is a robust speaker adaptation modeling apparatus for personalized emotion recognition and includes: a first modeling unit that selects, for reinforcement, data similar to the per-emotion data of the emotional speech data initially obtained from a target user (hereinafter, "target user data") from data obtained by unlabeling the labeled data of a previously built initial model for multiple users (hereinafter, "initial model"); a second modeling unit that selects, for reinforcement, data for the emotion absent from the target user data from the data of the user in the initial model whose emotion data is most similar to that of the target user; a third modeling unit that reinforces the per-emotion data of the target user by virtual data augmentation using an oversampling algorithm; and a processing control unit that causes the first modeling unit to perform data reinforcement when the target user data is less than a preset threshold number and no emotion is absent from the target user data, causes the second modeling unit to perform data reinforcement when the target user data is less than the preset threshold number and data for some emotion is absent from the target user data, or causes the third modeling unit to perform data reinforcement when the target user data is equal to or greater than the preset threshold number and the per-emotion data counts of the target user are imbalanced.

According to the present invention, a personalized training dataset that is sufficient and balanced for generating a personalized training model can be derived even in small-data, absent-data, and imbalanced-data environments, in which not enough target user data has been collected during personalized speech-based emotion recognition.

In addition, because personalized modeling enables more accurate emotion recognition, the invention can be applied to various intelligent IoT service fields.

In addition, when applied to robots and call centers, a personalized emotion recognition service can be provided by accumulating and updating the emotional speech that users input.

In addition, it can be used as a measurement technique in markets that need one for psychological treatment, for example of bipolar disorder or depression.

FIG. 1 is a schematic flowchart of a speaker adaptation modeling method for a small amount of emotional speech data of a target user according to an embodiment of the present invention.
FIG. 2 is a diagram explaining the flow of FIG. 1 using a distribution plot of speech data.
FIG. 3 is a diagram showing an example of unlabeled data obtained from the labeled per-emotion data of the IM in the speaker adaptation modeling method shown in FIG. 1.
FIG. 4 is a diagram showing an example in which the mean feature vector values for each of the anger, sadness, and happiness emotions obtained for a target user are calculated by the MLE computation according to the speaker adaptation modeling method shown in FIG. 1.
FIG. 5 is a diagram showing an example of selecting data by setting a threshold according to the speaker adaptation modeling method shown in FIG. 1.
FIG. 6 is a diagram showing an example of target user data secured according to the existing method (left) and target user data reinforced according to the speaker adaptation modeling method shown in FIG. 1 (right).
FIG. 7 is a schematic flowchart of a speaker adaptation modeling method for absent emotional speech data of a target user according to an embodiment of the present invention.
FIG. 8 is a diagram explaining the flow of FIG. 7 using a distribution plot of speech data.
FIG. 9 is a diagram showing an example of calculating data distribution elements for each of the target user data and the IM data according to the speaker adaptation modeling method shown in FIG. 7.
FIG. 10 is a diagram showing an example of calculating similarity based on the data distribution elements calculated for each of the target user data and the IM data according to the speaker adaptation modeling method shown in FIG. 7.
FIG. 11 is a schematic flowchart of a speaker adaptation modeling method for imbalanced emotional speech data of a target user according to an embodiment of the present invention.
FIG. 12 is a schematic flowchart of a robust speaker adaptation modeling method for personalized emotion recognition according to an embodiment of the present invention.
FIG. 13 is a schematic block diagram of a robust speaker adaptation modeling apparatus for personalized emotion recognition according to an embodiment of the present invention.

Hereinafter, embodiments of the present invention are described in detail with reference to the accompanying drawings so that those of ordinary skill in the art to which the present invention pertains can easily implement them. However, the present invention may be embodied in many different forms and is not limited to the embodiments described herein. In the drawings, parts irrelevant to the description are omitted in order to explain the present invention clearly, and similar reference numerals are attached to similar parts throughout the specification.

Throughout the specification, when a part is said to "include" an element, this means that it may further include other elements rather than excluding them, unless specifically stated otherwise. In addition, terms such as "...unit", "...er", and "module" used in the specification mean a unit that processes at least one function or operation, which may be implemented in hardware, in software, or in a combination of hardware and software.

The apparatuses described in the present invention consist of hardware including at least one processor, a memory device, a communication device, and the like, and a program that is executed in combination with the hardware is stored in a designated location. The hardware has the configuration and performance needed to carry out the method of the present invention. The program includes instructions that implement the operating method of the present invention described with reference to the drawings, and carries out the present invention in combination with hardware such as the processor and the memory device.

Hereinafter, a robust speaker adaptation modeling method for personalized emotion recognition according to an embodiment of the present invention is described.

An embodiment of the present invention provides three approaches to robust speaker adaptation modeling for personalized emotion recognition. Specifically, the first approach handles the case where the target user has only a small amount of emotional speech data, the second handles the case where the target user's speech data for some emotions is absent, and the third handles the case where the target user's per-emotion speech data is imbalanced.
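
The choice among the three approaches can be sketched as a small dispatch function. This is only an illustration under stated assumptions: the function name, the string return values, and the `max_ratio` imbalance test are hypothetical, since this section does not say how imbalance is measured, and the threshold is assumed to apply to the total number of utterances.

```python
from typing import Dict


def choose_strategy(counts: Dict[str, int], threshold: int,
                    max_ratio: float = 1.5) -> str:
    """Pick which of the three approaches applies to a target user.

    counts    -- number of collected utterances per emotion (0 = absent)
    threshold -- preset threshold on the total number of utterances
    max_ratio -- hypothetical imbalance test: largest / smallest class size
    """
    total = sum(counts.values())
    has_absent = any(n == 0 for n in counts.values())

    if total < threshold:
        # Small-data branch: split on whether any emotion is missing.
        return "absent_data" if has_absent else "small_data"

    smallest = min(counts.values())
    largest = max(counts.values())
    # Assumption: an empty class with enough total data counts as imbalance.
    if smallest == 0 or largest / smallest > max_ratio:
        return "imbalanced"
    return "no_reinforcement"


print(choose_strategy({"anger": 3, "sadness": 2, "happiness": 4}, threshold=30))
```

With nine utterances in total and no empty class, the call above falls into the small-data branch.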

First, the first approach is described. This is the robust speaker adaptation modeling approach performed when only a small amount of emotional speech data has been collected for the target user. In a small-data environment, this approach selects speech similar to the target user's from the previously built initial model for multiple users, then refines the labels so that the selected data serves as the training dataset, that is, as real-case data centered on the target user.

FIG. 1 is a schematic flowchart of a speaker adaptation modeling method for a small amount of emotional speech data of a target user according to an embodiment of the present invention, and FIG. 2 is a diagram explaining the flow of FIG. 1 using a distribution plot of speech data. In the description that follows, it is assumed that an initial model (IM) (11) for multiple users has already been built and stored in the IM database (10), and that the emotional speech data (21) initially obtained for the target user has already been acquired and stored in the target user DB (20).

Referring to FIGS. 1 and 2, first, the labeled data of the IM (11) is unlabeled (S100). That is, the emotion information of the other users belonging to the IM (11) is removed. The reason is that an emotion expressed through speech may be expressed differently by each user. For example, one user's neutral speech may resemble another user's angry speech. Therefore, when reinforcing the target user's training dataset with similar speech from other users, it is preferable to remove the labels from the IM (11) before using it. Referring to FIG. 3, the labeled per-emotion data of the IM (11) is shown on the left, and an example of the corresponding unlabeled data is shown on the right.

Next, the feature vector values (TfeatureVector) of the per-emotion data are calculated using the acquired target user data (21) stored in the target user database (20) (S110). Methods of calculating feature vector values from per-emotion data are already well known and are not described in detail here.

Thereafter, the mean of the feature vector values is calculated for each emotion of the target user (S120). This mean may be calculated using the Maximum Likelihood Estimation (MLE) technique. For example, the MLE of the target user's per-emotion feature vector values (TMLE) may be calculated as in [Equation 1].

[Equation 1]

Here, e is the index of the corresponding emotion, i is the index of the feature vector, j is the index of the data item, N is the number of data items, and TfeatureVector is the feature vector value of the target user's per-emotion data.
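
Step S120 can be sketched as follows, assuming [Equation 1] is the ordinary per-feature arithmetic mean over the N data items of each emotion (the sample mean is the maximum likelihood estimate of the mean under a Gaussian assumption). The function name and the dictionary layout are illustrative assumptions, not part of the specification.

```python
from typing import Dict, List

Vector = List[float]


def emotion_means(data: Dict[str, List[Vector]]) -> Dict[str, Vector]:
    """Return the per-feature mean vector (TMLE) for each emotion."""
    means: Dict[str, Vector] = {}
    for emotion, vectors in data.items():
        if not vectors:
            continue  # an emotion with no samples has no mean
        n = len(vectors)
        # zip(*vectors) groups the i-th feature of every sample together.
        means[emotion] = [sum(column) / n for column in zip(*vectors)]
    return means


# Toy target-user data: two 2-dimensional feature vectors per emotion.
target = {
    "anger": [[1.0, 2.0], [3.0, 4.0]],
    "sadness": [[0.0, 0.0], [0.0, 2.0]],
}
print(emotion_means(target))
# → {'anger': [2.0, 3.0], 'sadness': [0.0, 1.0]}
```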

Referring to FIG. 4, the mean feature vector values for each of the anger, sadness, and happiness emotions obtained for the target user are calculated by the MLE computation.

Next, the distances between the target user's per-emotion mean feature vector values are calculated, and half of the largest of these distances is set as the threshold (S130). Here, the distance between per-emotion means may be calculated as a Euclidean distance, as in [Equation 2] below.

[Equation 2]

Here, TMLE is the target user's per-emotion mean feature vector value calculated in [Equation 1], e_i and e_j are two different emotion indices, k is the index of the feature vector, and FN is the number of features.

Once the distances between the target user's per-emotion mean feature vector values have been calculated, half of the maximum of those distances is calculated as the threshold (Maximum Threshold Distance, MTD) using [Equation 3] below.

[Equation 3]
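
Step S130 can be sketched as follows, assuming [Equation 2] is the standard Euclidean distance over the FN features and [Equation 3] halves the largest pairwise distance. The helper names are hypothetical.

```python
import math
from itertools import combinations
from typing import Dict, List, Sequence


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def max_threshold_distance(means: Dict[str, List[float]]) -> float:
    """Half of the largest distance between any two per-emotion means (MTD)."""
    if len(means) < 2:
        raise ValueError("at least two emotions are needed to form a distance")
    distances = [euclidean(means[p], means[q])
                 for p, q in combinations(means, 2)]
    return max(distances) / 2


means = {"anger": [0.0, 0.0], "sadness": [3.0, 4.0], "happiness": [6.0, 8.0]}
print(max_threshold_distance(means))
# → 5.0
```

In the toy data the pairwise distances are 5, 10, and 5, so the threshold is 5. Note that a target user with data for only one emotion yields no pairwise distance, which the sketch reports as an error.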

Thereafter, taking the per-emotion mean feature vector values of the target user calculated in step S120 as reference points, the IM (11) data lying within the threshold range calculated in step S130 is selected (S140).

Here, to select the IM (11) data within the threshold range, the distance between each per-emotion mean feature vector value and the IM (11) data may be calculated as a Euclidean distance, as in [Equation 4] below.

[Equation 4]

Here, TMLE is the target user's per-emotion mean feature vector value calculated in [Equation 1], e_i is the emotion index, k is the index of the feature vector, IDS_m is the unlabeled IM (11) data (Initial Data Set, IDS), and FN is the number of features.

Therefore, when the distance calculated by [Equation 4] falls within the threshold (MTD) calculated in step S130, the corresponding IM (11) data is selected as speech data for the target user. Specifically, the distance is compared with the threshold: when the distance calculated by [Equation 4] is greater than the threshold, it lies outside the threshold range and the corresponding IM (11) data is discarded; when the distance calculated by [Equation 4] is less than or equal to the threshold, it lies within the threshold range and the corresponding IM (11) data is selected.
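
The selection in step S140 can be sketched as a filter over the unlabeled IM data. One reading is assumed here: an IM sample is kept when its distance to at least one per-emotion mean is less than or equal to the MTD. The function names and toy values are illustrative only.

```python
import math
from typing import Dict, List, Sequence


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def select_within_threshold(pool: List[List[float]],
                            means: Dict[str, List[float]],
                            mtd: float) -> List[List[float]]:
    """Keep unlabeled IM samples that lie within mtd of some emotion mean."""
    selected = []
    for sample in pool:
        nearest = min(euclidean(sample, mean) for mean in means.values())
        if nearest <= mtd:      # inside the threshold range: select
            selected.append(sample)
        # otherwise the sample is outside the range and is discarded
    return selected


means = {"anger": [0.0, 0.0], "sadness": [10.0, 0.0]}
mtd = 5.0  # half of the distance between the two means
pool = [[1.0, 1.0], [5.0, 0.0], [20.0, 0.0], [9.0, -3.0]]
print(select_within_threshold(pool, means, mtd))
# → [[1.0, 1.0], [5.0, 0.0], [9.0, -3.0]]
```

The sample at [20.0, 0.0] is 10 away from the nearest mean and is discarded; the sample at [5.0, 0.0] sits exactly on the threshold and is kept.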

Thereafter, for each selected IM (11) data item, the nearest per-emotion mean feature vector value of the target user is found and the emotion it corresponds to is identified (S150). This step may be performed simultaneously with step S140 described above. For example, as shown in FIG. 5, the distance between the IM (11) data and each per-emotion mean feature vector value of the target user is calculated by [Equation 4], and at the moment IM (11) data is selected by comparing the calculated distance with the threshold, the emotion for which the difference between the calculated distance and the threshold is smallest can be identified as well.

Finally, label refinement is performed on the IM (11) data using the emotion identified in step S150 (S160).
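
Steps S150 and S160 can be sketched as a nearest-mean relabeling of the selected samples. The function name and the (sample, label) return format are assumptions made for illustration; ties between equally distant means are broken by dictionary order in this sketch, which the specification does not address.

```python
import math
from typing import Dict, List, Sequence, Tuple


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def refine_labels(selected: List[List[float]],
                  means: Dict[str, List[float]]) -> List[Tuple[List[float], str]]:
    """Attach to each selected IM sample the emotion of its nearest mean."""
    refined = []
    for sample in selected:
        # The emotion whose mean is closest to the sample becomes its label.
        label = min(means, key=lambda emotion: euclidean(sample, means[emotion]))
        refined.append((sample, label))
    return refined


means = {"anger": [0.0, 0.0], "sadness": [10.0, 0.0]}
selected = [[1.0, 1.0], [9.0, -3.0]]
print(refine_labels(selected, means))
# → [([1.0, 1.0], 'anger'), ([9.0, -3.0], 'sadness')]
```

The relabeled pairs can then be appended to the target user's own data to form the reinforced training set.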

As described above, according to an embodiment of the present invention, when only a small amount of initial emotional speech data has been obtained for the target user, the data can be reinforced by selecting, from the unlabeled data of the IM (11), the initial model for multiple users, data that is similar to each emotion of the target user.

Referring to FIG. 6, examples are shown of target user data secured for the target user data (21) according to the existing method (left) and target user data reinforced according to the present approach (right). As can be seen from FIG. 6, the present approach secures more data than the existing method, and as a result the speaker adaptation modeling for personalized emotion recognition is strengthened.

Next, the second approach is described. This is the robust speaker adaptation modeling approach performed when data for a particular emotion is absent from the emotional speech data collected for the target user. In general, people do not express all emotions at the same rate in daily life. If the target user does not express a particular emotion for a long time, a training model must be generated without any sample of that emotion. In that case the system cannot recognize that emotion, and its accuracy for that emotion is 0%. This approach therefore assumes that, among the voices of the various users in the IM (11), the speech of a user whose voice pattern is similar to the target user's is similar to the absent emotion data, and reinforces the target user's data by substituting that user's emotional speech data.

FIG. 7 is a schematic flowchart of a speaker adaptation modeling method for absent emotional speech data of a target user according to an embodiment of the present invention, and FIG. 8 is a diagram explaining the flow of FIG. 7 using a distribution plot of speech data. As before, it is assumed that an initial model (IM) (11) for multiple users has already been built and stored in the IM database (10), and that the emotional speech data (21) initially obtained for the target user has already been acquired and stored in the target user DB (20).

Referring to FIGS. 7 and 8, first, data distribution elements are calculated for each emotion in the target user's emotional speech data (21) (S200). For example, the median, variance, skewness, and kurtosis are used as data distribution elements, although the present invention is not limited to these.

Specifically, the median, variance, skewness, and kurtosis are calculated for each of the target user's emotional speech data using [Equation 5].

[Equation 5]

Here, fi is the index of the feature vector, FeatureVector_fi is a specific feature value, mean is the average of FeatureVector_fi, that is, the same as the TMLE described above, and N is the number of data items.
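
The four data distribution elements of step S200 can be sketched for one feature as follows. The normalization is an assumption: population (divide-by-N) variance and standardized third and fourth moments are used, and kurtosis is not reduced by 3; [Equation 5] may use a different convention. The function name is hypothetical.

```python
import math
import statistics
from typing import Dict, List


def distribution_elements(values: List[float]) -> Dict[str, float]:
    """Median, variance, skewness, and kurtosis of one feature's values."""
    n = len(values)
    mean = sum(values) / n
    # Central moments with divide-by-N normalization (an assumption).
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    if m2 == 0.0:
        skewness = kurtosis = 0.0  # constant feature: shape is undefined
    else:
        skewness = m3 / m2 ** 1.5
        kurtosis = m4 / m2 ** 2
    return {
        "median": statistics.median(values),
        "variance": m2,
        "skewness": skewness,
        "kurtosis": kurtosis,
    }


elements = distribution_elements([1.0, 2.0, 3.0, 4.0, 5.0])
print(elements)
```

Applying the function to every feature fi of every emotion present in the target user data gives the per-emotion element vectors that the following steps compare against the IM data.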

다음, IM(11)의 데이터 중에서 타겟 사용자의 데이터(21)에 존재하는 감정에 대응되는 데이터 각각에 대해 데이터 분포 요소들을 산출한다(S210). 즉, IM(11)의 감정 음성 데이터 중에서 타겟 사용자의 데이터(21)에 존재하는 감정 데이터의 데이터 분포 요소들과 IM(11)에 있는 다른 사용자들의 감정 음성 데이터의 데이터 분포 요소들을 비교하여 유사성(Similiarity)을 찾는 것이다.Next, among the data of the IM 11 , data distribution elements are calculated for each data corresponding to the emotion existing in the data 21 of the target user ( S210 ). That is, the similarity ( to find similiarity).

여기서의 데이터 분포 요소들은 상기 단계(S200)에서의 데이터 분포 요소와 동일해야 한다. 즉, 여기서의 데이터 분포 요소는 중앙값, 분산, 왜도, 첨도이다. 마찬가지로, 전술한 [수학식 5]가 사용되어 IM(11)의 데이터 각각에 대한 데이터 분포 요소들이 산출될 수 있다.The data distribution elements here should be the same as the data distribution elements in step S200. That is, the data distribution elements here are median, variance, skewness, and kurtosis. Similarly, the above-mentioned [Equation 5] may be used to calculate data distribution elements for each data of the IM 11 .

Referring to FIG. 9, when the target user's emotional voice data contains data only for anger and sadness, and data for another emotion, for example happiness, is absent, the data distribution elements are calculated for each of the target user's anger and sadness data, and likewise the data distribution elements are calculated only for the anger and sadness data among the data of the IM 11.

Thereafter, the similarity between the data distribution elements of the target user data and the data distribution elements of the IM 11 data is calculated for each emotion (S220). Here, the similarity between the per-emotion data distribution elements may be calculated based on the Euclidean distance.

For example, the similarity between the target user's data and another user's data may be calculated using [Equation 6] and [Equation 7] below.

[Equation 6]

[Equation 7]

Here, TDDF_ei is a data distribution element of the target user data, its counterpart term is the corresponding data distribution element of the IM 11 data, and DFN is the number of data distribution elements.

That is, [Equation 6] is used to calculate the Euclidean distance between each per-emotion data distribution element of the target user data 21 and the corresponding per-emotion data distribution element of the IM 11 data, and [Equation 7] is then used to calculate the similarity between them.
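The distance step can be sketched as below. Since [Equation 6] is not reproduced in this text, the sketch assumes a plain Euclidean distance over the DFN distribution elements of one emotion; the function names and the dictionary layout are hypothetical. The conversion of these distances into the similarity score of [Equation 7] is not shown in the text and is deliberately left out.

```python
import math


def euclidean_distance(target_elements, im_elements):
    """Euclidean distance between two equally long tuples of distribution elements."""
    if len(target_elements) != len(im_elements):
        raise ValueError("both sides must use the same distribution elements")
    return math.sqrt(sum((t - u) ** 2 for t, u in zip(target_elements, im_elements)))


def per_emotion_distances(target, other):
    """Distances for the emotions present on both sides.

    target, other: dict mapping an emotion name to its tuple of
    distribution elements (hypothetical layout).
    """
    return {e: euclidean_distance(target[e], other[e]) for e in target if e in other}
```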

For the example of FIG. 9, FIG. 10 shows an example of the similarity calculated between the data distribution elements computed for each of the target user's anger and sadness and those computed for each of the anger and sadness of User 1 and User 2. Referring to FIG. 10, (a) shows that the similarity between the target user's data distribution elements for anger and sadness and those of User 1 is 10.48, and (b) shows that the corresponding similarity for User 2 is 12.14. Accordingly, the user whose emotional voice data is most similar to the target user's is User 2.

Thereafter, using the calculated similarity, the user whose emotional voice is most similar to the target user's is identified, and from that user's emotional voice data the data for the emotion that is absent for the target user is selected as the target user's data for that emotion (S230).

Referring to the example of FIG. 10, since the user most similar to the target user's emotional voice is User 2, the emotion data of User 2 corresponding to the target user's absent emotion is selected as the target user's data for that emotion. In the example of FIG. 9, the target user has no emotion data for happiness, so the happiness data of User 2, the user in the IM 11 data most similar to the target user's emotional voice, is secured as the target user's data for the absent emotion, happiness. As a result, the target user's data for happiness is no longer absent.
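Step S230 can be sketched with the similarity values from FIG. 10. The sketch assumes that a larger similarity value means a closer match, which is consistent with User 2 (12.14) being chosen over User 1 (10.48); the function name, the data layout, and the file names are hypothetical.

```python
def select_absent_emotion_data(similarity_by_user, im_data, absent_emotion):
    """Pick the most similar IM user and borrow their data for the absent emotion.

    similarity_by_user: dict mapping user name to similarity score
                        (assumption: larger means more similar).
    im_data: dict mapping user name to {emotion: list of samples}.
    """
    best_user = max(similarity_by_user, key=similarity_by_user.get)
    return best_user, list(im_data[best_user].get(absent_emotion, []))


similarity = {"User1": 10.48, "User2": 12.14}  # values from FIG. 10
im_data = {  # hypothetical sample identifiers
    "User1": {"happiness": ["u1_happy_01.wav"]},
    "User2": {"happiness": ["u2_happy_01.wav", "u2_happy_02.wav"]},
}
donor, borrowed = select_absent_emotion_data(similarity, im_data, "happiness")
```

Here `donor` is User 2, and `borrowed` holds User 2's happiness samples, which stand in for the target user's missing happiness data.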

As described above, according to an embodiment of the present invention, when the initial emotional voice data obtained for the target user lacks data for some emotion, the user most similar to the target user's emotional voice is identified among the users of the IM 11, the initial model for a plurality of users, based on the data distribution elements, and from that user's emotion data the data for the emotion corresponding to the target user's absent emotion is selected and used to reinforce the target user's data for that emotion.

Next, the third method is described: a robust speaker adaptation modeling method performed when the per-emotion voice data collected for the target user is imbalanced. When the quantity of the target user's per-emotion voice data is too imbalanced to be suitable as a training set for machine learning for speech recognition, this method generates virtual data for the emotions with few data items in order to reach a balanced state.

FIG. 11 is a schematic flowchart of a speaker adaptation modeling method for a target user's imbalanced emotional voice data according to an embodiment of the present invention. As before, it is assumed that an initial model (IM) 11 for a plurality of users has already been built and stored in the IM database 10, and that the emotional voice data 21 initially acquired for the target user has been obtained in advance and stored in the target user DB 20.

Referring to FIG. 11, an imbalance ratio (Imbalanced Ratio, IR) based on the quantity of the target user's per-emotion data is first calculated (S300). Here, as in [Equation 8] below, the imbalance ratio is the ratio of the number of data items in the largest emotion class (Major Class) to the number in the smallest emotion class (Minor Class) among the target user's emotion data.

[Equation 8]

Imbalanced Ratio = Major Class / Minor Class

Thereafter, it is determined whether the calculated imbalance ratio is smaller than a threshold ratio that serves as the criterion for imbalance, for example 2.0 (S310). The threshold ratio of 2.0 is only an example and may be set differently depending on the emotion recognition method or the usage environment.

If the calculated imbalance ratio is equal to or greater than the threshold ratio of 2.0, the target user's per-emotion data is determined to be imbalanced, and additional data for the emotions with few data items is generated by virtual data augmentation (S320). Here, SMOTE (Synthetic Minority Oversampling Technique), a well-known oversampling algorithm that synthesizes and augments data of the minority class, may be used as the virtual data augmentation method.

If the calculated imbalance ratio is smaller than the threshold ratio of 2.0, the target user's per-emotion data is determined to be balanced, and virtual data augmentation is not performed (S330).
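Steps S300 to S330 reduce to a small calculation, sketched below from [Equation 8] and the example threshold of 2.0. The function names and the sample counts are illustrative assumptions.

```python
def imbalance_ratio(counts):
    """[Equation 8]: size of the largest emotion class over the smallest.

    counts: dict mapping an emotion name to its number of samples.
    """
    sizes = list(counts.values())
    if not sizes or min(sizes) <= 0:
        raise ValueError("every emotion needs at least one sample")
    return max(sizes) / min(sizes)


def is_imbalanced(counts, threshold_ratio=2.0):
    """S310: imbalanced when the ratio reaches the threshold (2.0 is the example value)."""
    return imbalance_ratio(counts) >= threshold_ratio
```

For hypothetical counts of 40 anger, 25 sadness, and 10 happiness samples, the ratio is 40 / 10 = 4.0, which is at least 2.0, so the data would be treated as imbalanced and passed to augmentation (S320).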

As described above, according to an embodiment of the present invention, when the target user's per-emotion voice data is imbalanced, the amount of data for the emotions with few data items can be increased through virtual data augmentation.
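The idea behind SMOTE-style augmentation can be illustrated with a simplified sketch: each synthetic sample is placed at a random point on the segment between a minority sample and one of its nearest minority neighbours. This is a minimal illustration under stated assumptions, not the reference SMOTE implementation and not necessarily the variant used in the patent; the function name, the default `k`, and the fixed seed are all hypothetical.

```python
import math
import random


def smote_like(minority, n_new, k=3, seed=0):
    """Generate n_new synthetic feature vectors by interpolating minority samples.

    minority: list of equally long tuples of floats (feature vectors).
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k nearest minority neighbours of the base sample (excluding itself).
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment from base to neighbour
        synthetic.append(tuple(b + gap * (q - b) for b, q in zip(base, neighbour)))
    return synthetic
```

Because every synthetic point lies between two real minority samples, the generated data stays inside the region already occupied by that emotion.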

Hereinafter, a method of performing robust speaker adaptation modeling for personalized emotion recognition according to an embodiment of the present invention, using the three methods described above, is described.

FIG. 12 is a schematic flowchart of a robust speaker adaptation modeling method for personalized emotion recognition according to an embodiment of the present invention. As before, it is assumed that an initial model (IM) 11 for a plurality of users has already been built and stored in the IM database 10, and that the emotional voice data 21 initially acquired for the target user has been obtained in advance and stored in the target user DB 20.

Referring to FIG. 12, the number of data items for each emotion in the target user's data 21 is first compared with a threshold number (S400). Here, the threshold number is the reference value used to decide whether there is enough data to perform the third method described above, that is, the robust speaker adaptation modeling method performed when the per-emotion voice data collected for the target user is imbalanced. If the number of emotion data items is smaller than the threshold number, there is not enough data to perform the third method, so the data must first be reinforced by the first or second method described above. If the number of emotion data items is equal to or greater than the threshold number, there is enough data to perform the third method without the first or second method, so data reinforcement proceeds directly with the third method. In general, very little data is initially collected for a target user, so the number of emotion data items is below the threshold number, but this is not always the case, which is why this determination is required.

Thereafter, it is determined whether the number of data items for each emotion is equal to or greater than the threshold number (S410). If the number for an emotion is determined to be smaller than the threshold number, it is then determined whether that number is 0, that is, whether the data for that emotion is absent (S420).

If the data is determined not to be absent, that is, a small amount of data exists, the first method described above with reference to FIGS. 1 to 6, namely the robust speaker adaptation modeling method performed when the emotional voice data collected for the target user is small in quantity, is performed to reinforce the data for that emotion (S430).

However, if the emotion is determined in step S420 to be absent data, the second method described above with reference to FIGS. 7 to 10, namely the robust speaker adaptation modeling method performed when data for a specific emotion is absent from the emotional voice data collected for the target user, is performed to reinforce the data for the absent emotion (S440).

Meanwhile, if the number of data items for each emotion is determined in step S410 to be equal to or greater than the threshold number, it is determined whether the target user's emotion data is imbalanced (S450). This determination is made using the imbalance ratio, as described above.

If the target user's emotion data is imbalanced, the third method, namely the robust speaker adaptation modeling method performed when the per-emotion voice data collected for the target user is imbalanced, is performed to reinforce the imbalanced emotion data (S460).

However, if the target user's emotion data is not imbalanced, the speaker adaptation modeling ends (S470).

Meanwhile, steps S410 to S470 are repeated on the target user's emotional voice data reinforced through steps S430 and S440, so that the target user's emotional voice data is finally reinforced into a training dataset that is not small in quantity, has no absent emotion data, and is not imbalanced across emotions.
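The branching in steps S410 to S470 can be sketched as a single decision function that is called again after each round of reinforcement. The function name, the returned labels, and the idea of returning a per-emotion plan are illustrative assumptions; the patent does not give a concrete value for the threshold number.

```python
def plan_reinforcement(counts, threshold_count, threshold_ratio=2.0):
    """Return {emotion: method} for the next round, or {} when nothing is left to do.

    counts: dict mapping an emotion name to its current number of samples.
    threshold_count: hypothetical minimum sample count per emotion (at least 1).
    """
    if threshold_count < 1:
        raise ValueError("threshold_count must be at least 1")
    if not counts:
        return {}
    plan = {}
    for emotion, n in counts.items():
        if n >= threshold_count:  # S410: enough data for this emotion
            continue
        # S420: absent (zero samples) versus merely small
        plan[emotion] = "second method (S440)" if n == 0 else "first method (S430)"
    if plan:
        return plan
    # S450: every emotion has enough data, so check the imbalance ratio.
    if max(counts.values()) / min(counts.values()) >= threshold_ratio:
        minority = min(counts, key=counts.get)
        return {minority: "third method (S460)"}
    return {}  # S470: balanced, modeling ends
```

With hypothetical counts of 5 anger, 0 sadness, and 50 happiness samples and a threshold of 20, the first round assigns the first method to anger and the second method to sadness; the imbalance check only runs once both have reached the threshold.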

As described above, in an embodiment of the present invention, speaker adaptation modeling can be performed adaptively according to the quantity and the imbalance of the emotional voice data initially collected for the target user.

Next, a robust speaker adaptation modeling apparatus for personalized emotion recognition according to an embodiment of the present invention is described. This apparatus performs the robust speaker adaptation modeling method for personalized emotion recognition described above.

FIG. 13 is a schematic block diagram of a robust speaker adaptation modeling apparatus for personalized emotion recognition according to an embodiment of the present invention.

As shown in FIG. 13, the speaker adaptation modeling apparatus 100 according to an embodiment of the present invention includes a first modeling unit 110, a second modeling unit 120, a third modeling unit 130, and a processing control unit 140.

The first modeling unit 110 performs speaker adaptation modeling according to the first method described above with reference to FIGS. 1 to 6. That is, when the target user's emotional voice data 21 is small in quantity, the first modeling unit 110 selects voices similar to the target user's from the initial model (IM) 11 for a plurality of users and then refines their labels, thereby reinforcing the target user's training dataset, that is, the target user data.

To this end, the first modeling unit 110 includes a labeler 111, a feature vector value calculator 112, an average value calculator 113, a distance calculator 114, a threshold value calculator 115, a data selector 116, and an emotion determiner 117.

The labeler 111 performs unlabeling, which removes the emotion labels from the emotion data of the IM 11, and emotion labeling on the target user's data 21.

The feature vector value calculator 112 uses the target user's data 21 to calculate the feature vector values (TfeatureVector) of the per-emotion data.

The average value calculator 113 calculates the average of the feature vector values for each emotion of the target user. Here, the average of the feature vector values may be calculated using the MLE technique, as in [Equation 1] described above.

The distance calculator 114 calculates the distances between the per-emotion averages of the target user's feature vector values as Euclidean distances, as in [Equation 2] described above. The distance calculator 114 also uses [Equation 4] described above to calculate the distance between each per-emotion average of the feature vector values and the data of the IM 11.

The threshold value calculator 115 uses [Equation 3] described above to set the threshold value to half (1/2) of the maximum among the distances, calculated by the distance calculator 114, between the per-emotion averages of the target user's feature vector values.

The data selector 116 selects, for each emotion of the target user, the data of the IM 11 that lies within the threshold value of the average of the feature vector values for that emotion.

The emotion determiner 117 finds, for each item of IM 11 data selected by the data selector 116, the nearest per-emotion average of the target user's feature vector values and assigns the corresponding emotion to that item.
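The chain from the average value calculator 113 to the emotion determiner 117 can be sketched as below. [Equation 1] to [Equation 4] are not reproduced in this part of the text, so the sketch assumes the arithmetic mean as the MLE average and an inclusive threshold comparison; the function names and data layout are hypothetical, and at least two emotions are needed to define the threshold.

```python
import math
from itertools import combinations


def mean_vector(vectors):
    """Component-wise arithmetic mean of equally long feature vectors."""
    n = len(vectors)
    return tuple(sum(col) / n for col in zip(*vectors))


def select_and_label(target_by_emotion, im_vectors):
    """Select unlabeled IM vectors near an emotion mean and label them.

    target_by_emotion: dict mapping an emotion to the target user's feature vectors.
    im_vectors: unlabeled feature vectors from the initial model.
    Returns (threshold, list of (vector, emotion)).
    """
    means = {e: mean_vector(v) for e, v in target_by_emotion.items()}
    # Threshold: half of the largest distance between any two emotion means.
    threshold = max(math.dist(a, b) for a, b in combinations(means.values(), 2)) / 2
    labelled = []
    for x in im_vectors:
        # Nearest emotion mean decides the label.
        emotion, d = min(((e, math.dist(x, m)) for e, m in means.items()),
                         key=lambda pair: pair[1])
        if d <= threshold:  # assumption: the boundary counts as inside
            labelled.append((x, emotion))
    return threshold, labelled
```

Checking only the nearest mean against the threshold is equivalent to selecting data within the threshold of any emotion mean and then labeling it with the closest one.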

The second modeling unit 120 performs speaker adaptation modeling according to the second method described above with reference to FIGS. 7 to 10. That is, when the target user has an emotion for which data is absent, the second modeling unit 120 reinforces the data by substituting, from the emotional voice data of a user in the IM 11 whose voice pattern is similar to the target user's, the data for the emotion corresponding to the target user's absent emotion.

To this end, the second modeling unit 120 includes a distribution element calculator 121, a distance calculator 122, a similarity calculator 123, and a data selector 124.

The distribution element calculator 121 uses [Equation 5] described above to calculate the data distribution elements, namely the median, variance, skewness, and kurtosis, for each of the target user's emotional voice data. The distribution element calculator 121 also uses [Equation 5] to calculate the median, variance, skewness, and kurtosis for each user's emotional voice data in the IM 11.

The distance calculator 122 uses [Equation 6] described above to calculate the Euclidean distance between the target user's per-emotion distribution elements and the per-emotion distribution elements of each user in the IM 11.

The similarity calculator 123 uses [Equation 7] described above to calculate the similarity between the target user's data 21 and the data of the other users in the IM 11.

The data selector 124 uses the calculated similarity to identify the user whose emotional voice is most similar to the target user's, and selects from that user's emotional voice data the data for the emotion that is absent for the target user as the target user's data for that emotion.

The third modeling unit 130 performs speaker adaptation modeling according to the third method described above with reference to FIG. 11. That is, when the per-emotion voice data collected for the target user is imbalanced, the third modeling unit 130 reinforces the data by generating virtual data for the emotions with few data items in order to reach a balanced state.

To this end, the third modeling unit 130 includes an imbalance ratio calculator 131 and a SMOTE performer 132.

The imbalance ratio calculator 131 uses [Equation 8] described above to calculate the target user's imbalance ratio, that is, the ratio of the number of data items in the largest emotion class (Major Class) to the number in the smallest emotion class (Minor Class) among the target user's emotion data.

When the imbalance ratio calculated by the imbalance ratio calculator 131 indicates that the target user's emotion data is imbalanced, for example when the imbalance ratio is 2.0 or more, the SMOTE performer 132 generates additional data for the emotions with few data items using a virtual data augmentation method, for example the SMOTE oversampling algorithm.

Next, when the target user's data is smaller than the threshold number, the processing control unit 140 causes data reinforcement to be performed by the first modeling unit 110 if the target user's emotional voice data is small in quantity, and by the second modeling unit 120 if the target user's emotional voice data is absent.

In addition, when the target user's data is equal to or greater than the threshold number and the imbalance ratio is equal to or greater than the threshold ratio, the processing control unit 140 causes data reinforcement to be performed by the third modeling unit 130.

The embodiments of the present invention described above are not implemented only through the apparatus and method; they may also be implemented through a program that realizes the functions corresponding to the configuration of the embodiments, or through a recording medium on which that program is recorded.

Although the embodiments of the present invention have been described in detail above, the scope of the present invention is not limited thereto. Various modifications and improvements made by those skilled in the art using the basic concept of the present invention as defined in the following claims also fall within the scope of the present invention.

Claims

A robust speaker adaptation modeling method for personalized emotion recognition, comprising:
when the emotional voice data initially obtained from a target user (hereinafter "target user data") is less than a preset threshold number and the emotion data of the target user data is not absent data, selecting and reinforcing, from among the individual data of an unlabeled dataset obtained by removing labels from the labeled training dataset of a pre-built initial model for a plurality of users (hereinafter "initial model"), individual unlabeled data having a similar pattern by comparing feature vector values with the target user's emotion-specific dataset;
when the target user data is less than the preset threshold number and the emotion data of the target user data is absent data, selecting and reinforcing, from the data of the user in the initial model's dataset who is most similar to the target user's emotion data, the data corresponding to the emotion of the target user's absent data; or
when the target user data is equal to or greater than the preset threshold number and the number of data items per emotion of the target user is in an imbalanced state, reinforcing the target user's emotion-specific data by a virtual data augmentation method using an oversampling algorithm.

The method of claim 1,
wherein selecting and reinforcing the individual unlabeled data comprises:
calculating feature vector values of the target user's emotion-specific data;
calculating an average of the feature vector values for each emotion of the target user;
calculating the distances between the per-emotion averages of the target user's feature vector values;
setting a threshold value based on the calculated distances;
selecting unlabeled data of the initial model that lies within the threshold value of the average of the feature vector values for each emotion of the target user;
determining the emotion corresponding to the selected unlabeled data of the initial model based on the distance between that data and the per-emotion averages of the target user's feature vector values; and
performing label refinement on the unlabeled data of the initial model with the determined emotion.

The method of claim 2,
wherein the average of the feature vector values for each emotion of the target user is calculated by MLE (Maximum Likelihood Estimation).

The method of claim 2,
wherein the distances between the per-emotion averages of the target user's feature vector values are calculated as Euclidean distances.

The method of claim 2,
wherein the threshold value is set to half of the maximum distance among the calculated distances.

The method of claim 1,
wherein selecting and reinforcing the data corresponding to the emotion of the target user's absent data comprises:
calculating data distribution elements for each emotion-specific data of the target user;
calculating data distribution elements for each emotion-specific data of each user of the initial model;
calculating a distance between the data distribution elements calculated for each emotion-specific data of the target user and the data distribution elements calculated for each emotion-specific data of each user of the initial model;
calculating a per-user similarity between the target user and the users of the initial model based on the calculated distance; and
selecting, from the data of the initial-model user having the highest calculated similarity, the data corresponding to the emotion of the target user's absent data, and using it to reinforce the target user's data for that emotion.

The method of claim 6,
wherein the data distribution elements include the median, variance, skewness, and kurtosis.

The method of claim 1,
wherein reinforcing the target user's emotion-specific data by the virtual data augmentation method using the oversampling algorithm comprises:
calculating an imbalance ratio for the target user's emotion-specific data, the imbalance ratio being the ratio of the largest number to the smallest number of data items among the target user's emotion-specific data; and
when the imbalance ratio is equal to or greater than a preset threshold ratio, indicating that the target user's emotion-specific data is imbalanced, generating virtual data by the virtual data augmentation method for the emotion having the smallest number of data items, thereby reinforcing that data.

9. The method of claim 8,
wherein the virtual data augmentation method is the Synthetic Minority Oversampling Technique (SMOTE), which augments data by synthesizing new samples of a minority class.
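
The core idea of SMOTE is to create each synthetic sample by interpolating between a minority-class sample and one of its nearest minority-class neighbours. The sketch below is a simplified, standard-library illustration of that idea, not the patent's implementation; the function name `smote_like`, the default `k`, and the fixed seed are assumptions for the example.

```python
import math
import random

def smote_like(minority, n_new, k=5, seed=0):
    """Generate n_new synthetic feature vectors from minority-class samples."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        # Pick a base sample, then its k nearest neighbours among the others.
        i = rng.randrange(len(minority))
        base = minority[i]
        others = minority[:i] + minority[i + 1:]
        neighbours = sorted(others, key=lambda s: math.dist(base, s))[:k]
        neighbour = rng.choice(neighbours)
        # Interpolate at a random point on the segment base -> neighbour.
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic
```

Because every synthetic point lies on a segment between two real minority samples, the generated data stays inside the region the minority class already occupies.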

10. A robust speaker adaptive modeling apparatus for personalized emotion recognition, comprising:
a first modeling unit that compares feature vector values of the emotion-specific datasets of emotional speech data initially acquired from a target user (hereinafter referred to as “target user data”) with individual data of an unlabeled dataset, obtained by removing labels from a labeled training dataset of an initial model previously formed for a plurality of users (hereinafter referred to as “initial model”), and selects individual unlabeled data having a similar pattern to reinforce the target user data;
a second modeling unit that selects, from the data of the user of the initial model most similar to the emotion data of the target user, data corresponding to the emotion for which the target user's data is absent, and reinforces the target user data with the selected data;
a third modeling unit that reinforces the emotion-specific data of the target user by a virtual data augmentation method using an oversampling algorithm; and
a processing control unit that performs data reinforcement through the first modeling unit when the target user data is fewer than a preset threshold number and none of the target user's emotion data is absent, through the second modeling unit when the target user data is fewer than the preset threshold number and some of the target user's emotion data is absent, and through the third modeling unit when the target user data is equal to or greater than the preset threshold number and the per-emotion data counts of the target user are imbalanced.
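
The three-way dispatch performed by the processing control unit can be sketched as a single decision function. This is an illustrative reading of the claim, with several assumptions: the "threshold number" is compared against the total sample count, absent data means an emotion with zero samples, and the function name, parameter names, and return labels are hypothetical.

```python
def choose_modeling_unit(counts, threshold_number, threshold_ratio):
    """Return "first", "second", "third", or None (no reinforcement needed).

    counts: {emotion: number of target-user samples}
    """
    total = sum(counts.values())
    smallest = min(counts.values())
    if total < threshold_number:
        # Too little data: borrow from the initial model.
        if smallest == 0:
            return "second"  # some emotion is absent: use the most similar user
        return "first"       # all emotions present: select similar unlabeled data
    # Enough data: oversample only if the per-emotion counts are imbalanced.
    # The claim does not cover an absent emotion at or above the threshold,
    # so that case is deliberately left unhandled here.
    if smallest > 0 and max(counts.values()) / smallest >= threshold_ratio:
        return "third"
    return None
```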

11. The apparatus of claim 10,
wherein the first modeling unit comprises:
a labeler that removes the labels from the labeled data of the initial model, or assigns an emotion label to unlabeled data of the initial model when that data is selected as data for the target user;
a feature vector value calculator for calculating feature vector values of the emotion-specific data of the target user;
an average value calculator for calculating an average value of the feature vector values for each emotion of the target user;
a distance calculator for calculating the distances between the average values of the feature vector values for each emotion of the target user, or the distance between unlabeled data of the initial model and the average value of the feature vector values for each emotion of the target user;
a threshold value calculator for calculating a threshold value based on the distances calculated by the distance calculator;
a data selector for selecting unlabeled data of the initial model that lies within the threshold value of the average value calculated by the average value calculator; and
an emotion determiner for determining the emotion corresponding to the selected unlabeled data of the initial model based on the distance between the unlabeled data selected by the data selector and the average value calculated by the average value calculator.

12. The apparatus of claim 11,
wherein the threshold value calculator calculates, as the threshold value, half of the maximum distance among the distances calculated by the distance calculator.
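
The components of the first modeling unit and the half-of-maximum threshold can be sketched together as follows. This is a minimal illustration under stated assumptions: distances are Euclidean, the maximum is taken over the distances between the per-emotion average vectors (the claim could also be read to include distances to unlabeled data), and the names `centroid` and `select_and_label` are hypothetical.

```python
import math
from itertools import combinations

def centroid(vectors):
    """Average feature vector of one emotion's samples."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def select_and_label(target_data, unlabeled):
    """Return (threshold, [(vector, emotion), ...]) for the selected unlabeled data.

    target_data: {emotion: [feature vectors]}, with at least two emotions
    unlabeled:   [feature vectors] from the initial model, labels removed
    """
    centroids = {e: centroid(v) for e, v in target_data.items()}
    # Assumption: the maximum is taken over inter-centroid distances.
    max_dist = max(math.dist(a, b) for a, b in combinations(centroids.values(), 2))
    threshold = max_dist / 2
    selected = []
    for x in unlabeled:
        # The nearest centroid decides the emotion label (emotion determiner).
        emotion, dist = min(((e, math.dist(x, c)) for e, c in centroids.items()),
                            key=lambda pair: pair[1])
        # Keep the sample only if it lies within the threshold (data selector).
        if dist <= threshold:
            selected.append((x, emotion))
    return threshold, selected
```

Under this reading, a sample that is closer than half the largest inter-centroid distance to some emotion's average is unlikely to sit far outside every emotion cluster, which is one plausible motivation for the factor of one half.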

13. The apparatus of claim 10,
wherein the second modeling unit comprises:
a distribution element calculator for calculating a data distribution element for each emotion-specific data of the target user and a data distribution element for each emotion-specific data of each user of the initial model;
a distance calculator for calculating a distance between the data distribution element calculated for each emotion-specific data of the target user and the data distribution element calculated for each emotion-specific data of each user of the initial model;
a similarity calculator for calculating a per-user similarity between the target user and each user of the initial model based on the distance calculated by the distance calculator; and
a data selector for selecting, from the data of the user of the initial model having the highest calculated similarity, the data corresponding to the emotion for which the target user's data is absent, and reinforcing the target user's data with the selected data as the data for that emotion.

14. The apparatus of claim 13,
wherein the data distribution elements include a median, a variance, a skewness, and a kurtosis.

15. The apparatus of claim 10,
wherein the third modeling unit comprises:
an imbalance ratio calculator for calculating an imbalance ratio for the emotion-specific data of the target user, wherein the imbalance ratio is the ratio of the largest count to the smallest count among the per-emotion data counts of the target user; and
a data augmentation performer that, when the imbalance ratio is greater than or equal to a preset threshold ratio, indicating that the emotion-specific data of the target user is imbalanced, generates virtual data by the virtual data augmentation method for the emotion having the fewest data among the emotion-specific data of the target user, thereby reinforcing that data.