JP7271827B2 - Voice emotion prediction method and system

Info

Publication number: JP7271827B2
Application number: JP2021152163A
Authority: JP
Inventors: チャン、キャン; チャオ、ラシェン; チュウ、ドンシェン; ホウ、ヤキン
Original assignee: Dalian University of Technology
Current assignee: Dalian University of Technology
Priority date: 2021-09-17
Filing date: 2021-09-17
Publication date: 2023-05-12
Anticipated expiration: 2041-09-17
Also published as: JP2023044240A

Description

The present invention relates to the technical field of signal processing, and in particular to a speech emotion prediction method and system.

Computers have become indispensable to modern work and life and play an increasingly important role, so people generally hope that human-computer interaction can become as friendly, natural and emotionally aware as communication between people. To this end, speech emotion recognition has attracted the attention of researchers. Current speech emotion recognition methods fall mainly into two categories. The first is based on conventional machine learning: effective features that can represent speech emotion are extracted and combined with a classifier. The second is based on deep learning; it is an end-to-end approach that performs better than the first. However, whichever deep learning model is used for speech emotion recognition, each model has its own shortcomings, so it is difficult for a single model to learn effective emotional feature information comprehensively.

An object of the present invention is to provide a speech emotion prediction method and system that improve the accuracy of speech emotion recognition.

To achieve the above object, the present invention provides the following solution.

A speech emotion prediction method, comprising:
collecting an emotional speech data set, wherein each sample of the data set includes an emotional speech signal and the emotion type corresponding to that signal;
splitting the data set into a training set and a validation set;
training M different types of classifier models on the training set to obtain a prediction model corresponding to each classifier model;
obtaining, on the validation set, the confusion matrix of each prediction model, and determining from the confusion matrix of the m-th prediction model the F1 value vector corresponding to the m-th prediction model, recorded as the m-th F1 value vector, where m ∈ [1, M];
inputting the set of emotional speech signals awaiting prediction into each prediction model, wherein the emotion prediction types output by the m-th prediction model form an emotion prediction vector, recorded as the m-th emotion prediction vector;
multiplying the n-th F1 value in the m-th F1 value vector by the n-th predicted value in the m-th emotion prediction vector, the multiplication results forming the m-th product vector, wherein the emotion type corresponding to the n-th F1 value is the same as the emotion type corresponding to the n-th predicted value, n ∈ [1, N], and N is the number of emotion types;
adding the n-th multiplication results of all product vectors to obtain the n-th addition result, the addition results forming a sum vector; and
determining the emotion type corresponding to the largest element of the sum vector as the predicted emotion type.
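The multiply, add and select steps above can be sketched in Python with NumPy. This is a minimal illustration rather than the patented implementation: the function name `fuse_predictions`, the array layout (one row per model, one column per emotion type) and all numeric values are assumptions made for the example, as is the treatment of each predicted value as a per-emotion score.

```python
import numpy as np

def fuse_predictions(f1_vectors, prediction_vectors, emotion_types):
    """F1-weighted fusion of M classifiers over N emotion types (illustrative sketch)."""
    f1 = np.asarray(f1_vectors, dtype=float)             # shape (M, N), from the validation set
    pred = np.asarray(prediction_vectors, dtype=float)   # shape (M, N), for one input signal
    if f1.ndim != 2 or f1.shape != pred.shape or f1.shape[1] != len(emotion_types):
        raise ValueError("expected two (M, N) arrays and N emotion labels")
    product_vectors = f1 * pred                 # row m is the m-th product vector
    sum_vector = product_vectors.sum(axis=0)    # entry n adds the n-th results of all models
    best = int(np.argmax(sum_vector))           # index of the largest element
    return emotion_types[best], sum_vector

# Hypothetical values for M = 3 models and N = 3 emotion types.
emotions = ["neutral", "joy", "anger"]
f1_vectors = [[0.9, 0.5, 0.6], [0.4, 0.8, 0.7], [0.5, 0.6, 0.7]]
prediction_vectors = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]  # each model favours a different emotion
label, sum_vector = fuse_predictions(f1_vectors, prediction_vectors, emotions)
print(label, sum_vector)
```

In this made-up case the three models disagree, and the per-class F1 values act as weights that decide which model's vote prevails.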

According to the specific embodiments provided herein, the present invention achieves the following technical effects.

The present invention trains different classifier models on the training set, then obtains the F1 value vector of each prediction model from the validation set, and multiplies each F1 value in the F1 value vector by the corresponding predicted value in the emotion prediction vector. Finally, the corresponding multiplication results of the product vectors are added to achieve information fusion. Fusing the recognition results of different classifiers in this way improves the accuracy of speech emotion recognition.

FIG. 1 is a schematic flow diagram of the speech emotion prediction method of the present invention.
FIG. 2 is a schematic flow diagram of the speech emotion prediction method according to an embodiment of the present invention.
FIG. 3 is a schematic diagram of the structure of the VGG model of the present invention.
FIG. 4 is a schematic diagram of the structure of the ResNet model of the present invention.
FIG. 5 is a schematic diagram of the structure of the Xception model of the present invention.
FIG. 6 is a schematic diagram of the structure of the speech emotion prediction system of the present invention.

An object of the present invention is to provide a speech emotion prediction method and system that improve the accuracy of speech emotion recognition.

To make the above objects, features and advantages of the present invention clearer and easier to understand, the invention is described in further detail below with reference to the drawings and specific embodiments.

As shown in FIG. 1, the speech emotion prediction method includes:
Step 101: collecting an emotional speech data set, wherein each sample of the data set includes an emotional speech signal and the emotion type corresponding to that signal. The emotion types include neutral, joy, anger, sadness, surprise and fear, where neutral means that no emotion is expressed.
Step 102: splitting the data set into a training set and a validation set.
Step 103: training M different types of classifier models on the training set to obtain a prediction model corresponding to each classifier model.
Step 104: obtaining, on the validation set, the confusion matrix of each prediction model, and determining from the confusion matrix of the m-th prediction model the F1 value vector corresponding to the m-th prediction model, recorded as the m-th F1 value vector, where m ∈ [1, M].
Step 105: inputting the set of emotional speech signals awaiting prediction into each prediction model, wherein the emotion prediction types output by the m-th prediction model form an emotion prediction vector, recorded as the m-th emotion prediction vector.
Step 106: multiplying the n-th F1 value in the m-th F1 value vector by the n-th predicted value in the m-th emotion prediction vector, the multiplication results forming the m-th product vector, wherein the emotion type corresponding to the n-th F1 value is the same as the emotion type corresponding to the n-th predicted value, n ∈ [1, N], and N is the number of emotion types.
Step 107: adding the n-th multiplication results of all product vectors to obtain the n-th addition result, the addition results forming a sum vector.
Step 108: determining the emotion type corresponding to the largest element of the sum vector as the predicted emotion type.
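Step 104 derives one F1 value per emotion type from a confusion matrix. The text does not spell out the formula, so the sketch below assumes the standard per-class F1 score, that is, the harmonic mean of precision and recall, which equals 2·TP / (2·TP + FP + FN). The function name `f1_vector_from_confusion` and the example matrix are hypothetical.

```python
import numpy as np

def f1_vector_from_confusion(confusion):
    """Per-class F1 values from a square confusion matrix (standard definition assumed)."""
    cm = np.asarray(confusion, dtype=float)
    if cm.ndim != 2 or cm.shape[0] != cm.shape[1]:
        raise ValueError("confusion matrix must be square")
    true_pos = np.diag(cm)         # correctly classified samples of each class
    row_sums = cm.sum(axis=1)      # TP plus the off-diagonal counts in each row
    col_sums = cm.sum(axis=0)      # TP plus the off-diagonal counts in each column
    denom = row_sums + col_sums    # equals 2*TP + FP + FN for each class
    # Classes with no samples and no predictions get an F1 value of 0.
    return np.divide(2.0 * true_pos, denom, out=np.zeros_like(denom), where=denom > 0)

# Hypothetical 2-class confusion matrix.
print(f1_vector_from_confusion([[8, 2], [1, 9]]))
```

Because the denominator adds the row sum and the column sum, the result is the same whether rows hold the true labels and columns the predictions, or the other way round.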

M is 3, and the three different types of classifier models are the VGG model, the ResNet model and the Xception model.

When M is 3, the speech emotion prediction method specifically includes:
collecting an emotional speech data set, wherein each sample of the data set includes an emotional speech signal and the emotion type corresponding to that signal;
splitting the data set into a training set and a validation set;
training a first classifier model, a second classifier model and a third classifier model on the training set to obtain a first prediction model, a second prediction model and a third prediction model, wherein the first, second and third classifier models are classifiers of different types;
obtaining, on the validation set, the confusion matrices of the first, second and third prediction models; determining an F1 value vector from the confusion matrix of the first prediction model and recording it as the first F1 value vector; determining an F1 value vector from the confusion matrix of the second prediction model and recording it as the second F1 value vector; and determining an F1 value vector from the confusion matrix of the third prediction model and recording it as the third F1 value vector;
inputting the set of emotional speech signals awaiting prediction into the first, second and third prediction models, wherein the emotion prediction types output by the first prediction model form the first emotion prediction vector, those output by the second prediction model form the second emotion prediction vector, and those output by the third prediction model form the third emotion prediction vector;
multiplying the n-th F1 value in the first F1 value vector by the n-th predicted value in the first emotion prediction vector, the multiplication results forming the first product vector; multiplying the n-th F1 value in the second F1 value vector by the n-th predicted value in the second emotion prediction vector, the multiplication results forming the second product vector; and multiplying the n-th F1 value in the third F1 value vector by the n-th predicted value in the third emotion prediction vector, the multiplication results forming the third product vector, wherein the emotion type corresponding to the n-th F1 value is the same as the emotion type corresponding to the n-th predicted value;
adding the n-th multiplication result in the first product vector, the n-th multiplication result in the second product vector and the n-th multiplication result in the third product vector, the addition results forming a sum vector; and
determining the emotion type corresponding to the largest element of the sum vector as the predicted emotion type.

The emotional speech signals in the training set and in the validation set are all enhanced mel-spectrograms, as are the emotional speech signals in the set awaiting prediction. An enhanced mel-spectrogram is a mel-spectrogram enhanced with an enhancement function obtained by a transformation of the natural logarithm.
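The exact enhancement function is not given in the text beyond its being derived from the natural logarithm. The sketch below therefore uses ln(1 + x) purely as an assumed stand-in to show where such a step would sit in a preprocessing pipeline; the function name `enhance_mel` and the input layout (a non-negative mel power array of shape mel bands × frames) are hypothetical.

```python
import numpy as np

def enhance_mel(mel_power):
    """Apply an assumed log-based enhancement, ln(1 + x), to a mel power spectrogram."""
    mel = np.asarray(mel_power, dtype=float)
    if np.any(mel < 0):
        raise ValueError("mel power values must be non-negative")
    # ASSUMPTION: ln(1 + x) stands in for the unspecified enhancement function.
    return np.log1p(mel)

# Hypothetical 2-band, 3-frame mel power spectrogram.
example = np.array([[0.0, 1.0, 10.0], [0.5, 100.0, 1000.0]])
print(enhance_mel(example))
```

A logarithmic mapping of this kind compresses the large dynamic range of spectral power, which is the usual motivation for log-scaled spectrogram inputs.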

The present invention first trains different base classifier network models on the enhanced mel-spectrograms of the training-set speech, then obtains the F1 value vector of each base classifier network model from the enhanced mel-spectrograms of the validation-set speech, performs a dot product operation (element-wise multiplication) between that vector and the corresponding base classifier's emotion prediction vector for the test speech, and finally adds the dot product vectors of the base models to achieve information fusion. The method uses the classification information of different classifiers and improves emotion recognition accuracy through the complementary strengths of the prediction models.

The speech emotion prediction method of the present invention is described below using a specific embodiment.

In this embodiment, 7200 CASIA Chinese emotional speech samples were selected and divided into a training set, a validation set and a test set in the ratio 8:1:1, and six types of emotion were recognized. As shown in FIG. 2, the speech emotion prediction method includes:
Step 1: Enhanced mel-spectrograms are extracted from the emotional speech of the training set and used to train three base classification network models, VGG, ResNet and Xception, yielding the prediction models corresponding to the VGG model (VGG classification network model), the ResNet model (ResNet classification network model) and the Xception model (Xception classification network model). The structure of the VGG model is shown in FIG. 3, that of the ResNet model in FIG. 4, and that of the Xception model in FIG. 5.
Step 2: Enhanced mel-spectrograms are extracted from the emotional speech of the validation set and used as input to the three base classifier network models (VGG, ResNet and Xception) trained in Step 1. From the validation-set speech emotion confusion matrix output by each prediction model, the F1 value vector over the emotion types is obtained for each base classifier network model.
Step 3: Enhanced mel-spectrograms are extracted from the emotional speech of the test set and used as input to the three base classification network models (VGG, ResNet and Xception) trained in Step 1, giving the emotion prediction vector of each test sample under each model. A dot product operation is then performed between the emotion prediction vector of each prediction model and that model's validation-set F1 value vector obtained in Step 2, and the dot product vectors of the prediction models are added to obtain a sum vector. The emotion corresponding to the largest element of the sum vector is the recognized emotion of the test speech. Here, the dot product operation means multiplying the n-th predicted value of the emotion prediction vector by the n-th F1 value of the F1 value vector, where the emotion type corresponding to the n-th F1 value is the same as the emotion type corresponding to the n-th predicted value.
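The 8:1:1 division of the 7200 samples can be sketched as follows. The text states only the ratio, so the random shuffling, the fixed seed and the function name `split_indices` are assumptions for illustration; a stratified split per emotion type would be an equally plausible reading.

```python
import numpy as np

def split_indices(n_samples, ratios=(8, 1, 1), seed=0):
    """Shuffle sample indices and cut them into train/validation/test parts."""
    rng = np.random.default_rng(seed)      # fixed seed so the split is repeatable
    order = rng.permutation(n_samples)     # random order of all sample indices
    total = sum(ratios)
    n_train = n_samples * ratios[0] // total
    n_val = n_samples * ratios[1] // total
    train = order[:n_train]
    val = order[n_train:n_train + n_val]
    test = order[n_train + n_val:]         # the remainder goes to the test set
    return train, val, test

train, val, test = split_indices(7200)
print(len(train), len(val), len(test))  # 5760 720 720
```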

The emotion prediction vectors of the test speech (test set) under the three trained base classifiers VGG, ResNet and Xception are, respectively,

$$E_{Vgg} = \left[e^{Vgg}_{1},\ e^{Vgg}_{2},\ \ldots,\ e^{Vgg}_{N}\right]$$

$$E_{Res} = \left[e^{Res}_{1},\ e^{Res}_{2},\ \ldots,\ e^{Res}_{N}\right]$$

$$E_{Xce} = \left[e^{Xce}_{1},\ e^{Xce}_{2},\ \ldots,\ e^{Xce}_{N}\right]$$

where N is the number of emotion types. $E_{Vgg}$ is the emotion prediction vector output by the first prediction model (the VGG prediction model), and $e^{Vgg}_{1}$, $e^{Vgg}_{2}$ and $e^{Vgg}_{N}$ are the first prediction model's predicted values for the first, second and N-th emotion types. $E_{Res}$ is the emotion prediction vector output by the second prediction model (the ResNet prediction model), and $e^{Res}_{1}$, $e^{Res}_{2}$ and $e^{Res}_{N}$ are the second prediction model's predicted values for the first, second and N-th emotion types. $E_{Xce}$ is the emotion prediction vector output by the third prediction model (the Xception prediction model), and $e^{Xce}_{1}$, $e^{Xce}_{2}$ and $e^{Xce}_{N}$ are the third prediction model's predicted values for the first, second and N-th emotion types.

The F1 value vectors of the emotional speech signals in the validation set under the three trained base classifiers VGG, ResNet and Xception are, respectively,

$$F1_{Vgg} = \left[f^{Vgg}_{1},\ f^{Vgg}_{2},\ \ldots,\ f^{Vgg}_{N}\right]$$

$$F1_{Res} = \left[f^{Res}_{1},\ f^{Res}_{2},\ \ldots,\ f^{Res}_{N}\right]$$

$$F1_{Xce} = \left[f^{Xce}_{1},\ f^{Xce}_{2},\ \ldots,\ f^{Xce}_{N}\right]$$

$F1_{Vgg}$ is the F1 value vector obtained for the first prediction model on the validation set, and $f^{Vgg}_{1}$, $f^{Vgg}_{2}$ and $f^{Vgg}_{N}$ are the first prediction model's F1 values for the first, second and N-th emotion types. $F1_{Res}$ is the F1 value vector obtained for the second prediction model on the validation set, and $f^{Res}_{1}$, $f^{Res}_{2}$ and $f^{Res}_{N}$ are the second prediction model's F1 values for the first, second and N-th emotion types. $F1_{Xce}$ is the F1 value vector obtained for the third prediction model on the validation set, and $f^{Xce}_{1}$, $f^{Xce}_{2}$ and $f^{Xce}_{N}$ are the third prediction model's F1 values for the first, second and N-th emotion types.

A dot product operation was performed on the F1 value vector and the emotion prediction vector of each prediction model, giving the dot product vectors under the three base classifiers VGG, ResNet and Xception:

$$V_{Vgg} = F1_{Vgg} \odot E_{Vgg}$$

$$V_{Res} = F1_{Res} \odot E_{Res}$$

$$V_{Xce} = F1_{Xce} \odot E_{Xce}$$

where $\odot$ denotes element-wise multiplication, so that the n-th element of each dot product vector is the n-th F1 value multiplied by the n-th predicted value. $V_{Vgg}$ is the dot product vector corresponding to the first prediction model, $V_{Res}$ is the dot product vector corresponding to the second prediction model, and $V_{Xce}$ is the dot product vector corresponding to the third prediction model.

Next, the dot product vectors under the three base classifiers VGG, ResNet and Xception were added to obtain the sum vector:

$$S = V_{Vgg} + V_{Res} + V_{Xce} \qquad (10)$$

The vector S has N elements, and the emotion corresponding to the largest of the N elements is the recognized emotion of the test speech.
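A small worked example may make the fusion concrete. All numbers below are hypothetical and N is reduced to 3 for brevity; they are not taken from the experiments.

```latex
\begin{align*}
F1_{Vgg} &= [0.9,\ 0.8,\ 0.7], & E_{Vgg} &= [0.6,\ 0.3,\ 0.1], & V_{Vgg} &= [0.54,\ 0.24,\ 0.07] \\
F1_{Res} &= [0.8,\ 0.9,\ 0.6], & E_{Res} &= [0.3,\ 0.5,\ 0.2], & V_{Res} &= [0.24,\ 0.45,\ 0.12] \\
F1_{Xce} &= [0.7,\ 0.9,\ 0.8], & E_{Xce} &= [0.2,\ 0.6,\ 0.2], & V_{Xce} &= [0.14,\ 0.54,\ 0.16] \\
S &= V_{Vgg} + V_{Res} + V_{Xce} = [0.92,\ 1.23,\ 0.35]
\end{align*}
```

The largest element of S is the second one, so the second emotion type is selected, even though the VGG model on its own would have favoured the first.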

Table 1 shows the experimental results on the test set. The table shows that the speech emotion recognition method of the present invention achieves higher recognition accuracy than each single classifier model, which indicates that the recognition method according to the invention performs better.

Table 1: Comparison of the experimental results of the methods

As shown in FIG. 6, the speech emotion prediction system includes:
a data collection module 201, used to collect an emotional speech data set, wherein each sample of the data set includes an emotional speech signal and the emotion type corresponding to that signal;
a data set splitting module 202, used to split the data set into a training set and a validation set;
a model training module 203, used to train M different types of classifier models on the training set to obtain a prediction model corresponding to each classifier model;
an F1 value vector determination module 204, used to obtain, on the validation set, the confusion matrix of each prediction model and to determine from the confusion matrix of the m-th prediction model the F1 value vector corresponding to the m-th prediction model, recorded as the m-th F1 value vector;
an emotion prediction vector output module 205, used to input the set of emotional speech signals awaiting prediction into each prediction model, wherein the emotion prediction types output by the m-th prediction model form an emotion prediction vector, recorded as the m-th emotion prediction vector;
an F1 value vector and emotion prediction vector multiplication module 206, used to multiply the n-th F1 value in the m-th F1 value vector by the n-th predicted value in the m-th emotion prediction vector, the multiplication results forming the m-th product vector, wherein the emotion type corresponding to the n-th F1 value is the same as the emotion type corresponding to the n-th predicted value;
a sum vector determination module 207, used to add the n-th multiplication results of all product vectors to obtain the n-th addition result, the addition results forming a sum vector; and
an emotion type determination module, used to determine the emotion type corresponding to the largest element of the sum vector as the predicted emotion type.

Specific examples are used herein to explain the principles and embodiments of the present invention. The description of the above embodiments is intended only to help in understanding the method and core idea of the invention; a person skilled in the art may, following the idea of the invention, make changes to the specific embodiments and the scope of application. In summary, the content of this specification should not be construed as limiting the present invention.

Claims

1. A speech emotion prediction system, comprising:
a data collection module used to collect an emotional speech data set, wherein each sample of the data set includes an emotional speech signal and the emotion type corresponding to that signal;
a data set splitting module used to split the data set into a training set and a validation set;
a model training module used to train M different types of classifier models on the training set to obtain a prediction model corresponding to each classifier model;
an F1 value vector determination module used to obtain, on the validation set, the confusion matrix of each prediction model and to determine from the confusion matrix of the m-th prediction model the F1 value vector corresponding to the m-th prediction model, recorded as the m-th F1 value vector, where m ∈ [1, M];
an emotion prediction vector output module used to input the set of emotional speech signals awaiting prediction into each prediction model, wherein the emotion prediction types output by the m-th prediction model form an emotion prediction vector, recorded as the m-th emotion prediction vector;
an F1 value vector and emotion prediction vector multiplication module used to multiply the n-th F1 value in the m-th F1 value vector by the n-th predicted value in the m-th emotion prediction vector, the multiplication results forming the m-th product vector, wherein the emotion type corresponding to the n-th F1 value is the same as the emotion type corresponding to the n-th predicted value, n ∈ [1, N], and N is the number of emotion types;
a sum vector determination module used to add the n-th multiplication results of all product vectors to obtain the n-th addition result, the addition results forming a sum vector; and
an emotion type determination module used to determine the emotion type corresponding to the largest element of the sum vector as the predicted emotion type.

2. The speech emotion prediction system according to claim 1, wherein M is 3 and the three different types of classifier models are a VGG model, a ResNet model and an Xception model.

3. The speech emotion prediction system according to claim 1, wherein the emotional speech signals in the training set and the emotional speech signals in the validation set are all enhanced mel-spectrograms.

4. The speech emotion prediction system according to claim 1, wherein the emotional speech signals in the set of emotional speech signals awaiting prediction are enhanced mel-spectrograms.

5. The speech emotion prediction system according to claim 4, wherein the enhanced mel-spectrogram is a mel-spectrogram enhanced with an enhancement function obtained by a transformation of the natural logarithm.