CN111145787B - Voice emotion feature fusion method and system based on main and auxiliary networks - Google Patents

Voice emotion feature fusion method and system based on main and auxiliary networks

Info

Publication number
CN111145787B
CN111145787B (application CN201911368375.2A)
Authority
CN
China
Prior art keywords
voice
features
emotion data
parameters
auxiliary
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911368375.2A
Other languages
Chinese (zh)
Other versions
CN111145787A (en)
Inventor
张雪英
胡德生
张静
黄丽霞
牛溥华
李凤莲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Taiyuan University of Technology
Original Assignee
Taiyuan University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Taiyuan University of Technology filed Critical Taiyuan University of Technology
Priority to CN201911368375.2A
Publication of CN111145787A
Application granted
Publication of CN111145787B
Legal status: Active
Anticipated expiration

Classifications

    • G10L25/63 Speech or voice analysis techniques specially adapted for estimating an emotional state
    • G10L25/03 Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L25/24 Speech or voice analysis techniques characterised by the extracted parameters being the cepstrum
    • G10L25/27 Speech or voice analysis techniques characterised by the analysis technique


Abstract

The invention provides a speech emotion feature fusion method and system based on a main network and an auxiliary network. The method comprises: respectively inputting a plurality of first features and the second features corresponding to each piece of speech emotion data in the test set into the lower half of a main network model with parameters and into an auxiliary network model with parameters, to obtain the main network high-level features and the auxiliary network high-level features corresponding to each piece of speech emotion data; performing feature fusion on the main network high-level features, the auxiliary parameters and the auxiliary network high-level features, and determining the main and auxiliary network fusion features corresponding to each piece of speech emotion data; and inputting the main and auxiliary network fusion features corresponding to each piece of speech emotion data into the upper half of the main network model with parameters to obtain the fusion features. The invention fuses multiple types of features effectively and improves the accuracy of speech emotion recognition.

Description

Voice emotion feature fusion method and system based on main and auxiliary networks
Technical Field
The invention relates to the technical field of emotion feature fusion, and in particular to a voice emotion feature fusion method and system based on a main network and an auxiliary network.
Background
Emotional state is an important factor in communication between people and in effective human-machine communication. Enabling natural communication between people and machines, so that machines simultaneously possess the human capabilities of speaking, thinking and feeling, has always been a goal pursued in the field of artificial intelligence. Research on speech emotion recognition (SER) will promote the realization of this goal, and its results can be widely applied in fields such as human-computer interaction, telemedicine, e-learning, criminal investigation and emotional counseling, so research on speech emotion recognition has important significance and practical value.
Existing speech emotion recognition systems mainly fuse different types of features and adopt various hybrid network structures in order to further improve recognition accuracy. However, these methods have two major problems:
First, there is no effective mechanism for fusing different types of speech emotion features. The current mainstream practice is simply to concatenate different types of features as the input of the recognition network. However, because different features differ in dimensionality, scale and actual physical meaning, the feature types interfere with each other and the resulting accuracy falls short of the ideal.
Second, the correspondence between the network output and the actual annotation is unreasonable. More specifically, if the temporal unrolling width of a recognizer built around an LSTM recurrent neural network is T, the LSTM produces one output at each time step; associating the output at every time step with a single emotion type is obviously unreasonable.
Disclosure of Invention
Based on this, the invention aims to provide a voice emotion feature fusion method and system based on a main network and an auxiliary network, so as to improve the accuracy of speech emotion recognition.
In order to achieve the above object, the present invention provides a method for fusing speech emotion characteristics based on a primary and secondary network, the method comprising:
step S1: determining a training set and a test set;
step S2: determining a main network model with parameters, an auxiliary network model with parameters and auxiliary parameters by using the training set;
step S3: inputting a plurality of first features corresponding to the voice emotion data in the test set into the lower half part of the main network model with parameters to obtain main network high-level features corresponding to the voice emotion data;
step S4: inputting the second characteristics corresponding to the voice emotion data in the test set into an auxiliary network model with parameters to obtain auxiliary network high-level characteristics corresponding to the voice emotion data;
step S5: performing feature fusion on the main network high-level features, the auxiliary parameters and the auxiliary network high-level features in a main and auxiliary network mode, and determining main and auxiliary network fusion features corresponding to the voice emotion data;
step S6: inputting the main and auxiliary network fusion characteristics corresponding to the voice emotion data into the upper half part of the main network model with parameters to obtain fusion characteristics.
Optionally, determining the training set and the test set specifically includes:
step S11: determining a speech emotion database; the voice emotion database comprises 363 pieces of voice emotion data;
step S12: determining a standard database according to the voice emotion database;
step S13: performing feature extraction on each piece of speech emotion data in the standard database to obtain a plurality of speech MFCC segment features and a plurality of speech global features;
step S14: respectively carrying out standardization processing on a plurality of voice MFCC segment characteristics and the voice global characteristics corresponding to the voice emotion data to respectively obtain a first characteristic and a second characteristic;
step S15: selecting the first features and second features corresponding to 243 pieces of the voice emotion data as a training set; and using the first features and second features corresponding to the remaining 120 pieces of speech emotion data as a test set.
Optionally, performing feature extraction on each piece of speech emotion data in the standard database to obtain a plurality of speech MFCC segment features and a plurality of speech global features, specifically including:
step S131: performing MFCC frame feature extraction on each piece of speech emotion data in the standard database to obtain a plurality of speech MFCC frame features;
step S132: after averaging the plurality of voice MFCC frame characteristics, obtaining a plurality of voice MFCC segment characteristics;
step S133: and carrying out global feature extraction on each piece of speech emotion data in the standard database to obtain a plurality of speech global features.
Optionally, the determining the standard database according to the speech emotion database specifically includes:
step S121: judging whether each piece of voice emotion data in the voice emotion database is larger than a set voice frame length; if the speech emotion data are larger than the set speech frame length, adopting truncation operation to enable the speech emotion data to be equal to the set speech frame length, and placing the processed speech emotion data into the standard database; if the voice emotion data are smaller than the set voice frame length, zero filling operation is adopted to enable the voice emotion data to be equal to the set voice frame length, and the processed voice emotion data are placed in the standard database; and if each voice emotion data is equal to the length of a set voice frame, directly putting each voice emotion data into the standard database.
Optionally, determining a main network model with parameters, an auxiliary network model with parameters, and auxiliary parameters by using the training set specifically includes:
step S21: determining main network model parameters by using a plurality of first characteristics corresponding to each piece of speech emotion data in the training set;
step S22: and determining auxiliary network model parameters and auxiliary parameters by using a plurality of second characteristics corresponding to each piece of speech emotion data in the training set.
Optionally, determining a main network model parameter by using a plurality of first features corresponding to each piece of speech emotion data in the training set specifically includes:
step S211: keeping the auxiliary parameters and the auxiliary network model parameters unchanged, inputting a plurality of first characteristics corresponding to each voice emotion data in the training set into a main network, training by adopting a gradient descent algorithm to obtain main network model parameters, and determining a first cost function value by utilizing a cost function formula;
step S212: judging whether the first cost function value is larger than or equal to a first set value or not; if the first cost function value is larger than or equal to the first set value, returning to the step S211; and if the first cost function value is smaller than a first set value, outputting the main network model parameters.
Optionally, the determining an auxiliary network model parameter and an auxiliary parameter by using a plurality of second features corresponding to each piece of speech emotion data in the training set specifically includes:
step S221: keeping the parameters of the main network model unchanged, inputting a plurality of second characteristics corresponding to the speech emotion data in the training set into an auxiliary network, training by adopting a gradient descent algorithm to obtain auxiliary network model parameters and auxiliary parameters, and determining a second cost function value by utilizing a cost function formula;
step S222: judging whether the second cost function value is larger than or equal to a second set value; if the second cost function value is greater than or equal to the second set value, returning to step S221; and if the second cost function value is smaller than the second set value, outputting the auxiliary network model parameters and the auxiliary parameter.
The invention also provides a voice emotion feature fusion system based on the main and auxiliary networks, which comprises the following components:
the set determining module is used for determining a training set and a testing set;
the model determining module is used for determining a main network model with parameters, an auxiliary network model with parameters and auxiliary parameters by using the training set;
a main network high-level feature determination module, configured to input the multiple first features corresponding to the voice emotion data in the test set into a lower half of the main network model with parameters, and obtain a main network high-level feature corresponding to the voice emotion data;
the auxiliary network high-level feature determining module is used for inputting the second features corresponding to the voice emotion data in the test set into an auxiliary network model with parameters to obtain the auxiliary network high-level features corresponding to the voice emotion data;
the main network fusion characteristic determining module is used for performing characteristic fusion on the main network high-level characteristics, the auxiliary parameters and the auxiliary network high-level characteristics in a main network mode and an auxiliary network mode, and determining main network fusion characteristics corresponding to the voice emotion data;
and the fusion characteristic determining module is used for inputting the main and auxiliary network fusion characteristics corresponding to the voice emotion data into the upper half part of the main network model with parameters to obtain fusion characteristics.
Optionally, the set determining module specifically includes:
the voice emotion database determining unit is used for determining a voice emotion database; the voice emotion database comprises 363 pieces of voice emotion data;
the standard database determining unit is used for determining a standard database according to the voice emotion database;
the feature extraction unit is used for performing feature extraction on each piece of speech emotion data in the standard database to obtain a plurality of speech MFCC segment features and a plurality of speech global features;
the normalization processing unit is used for respectively normalizing the plurality of voice MFCC segment characteristics and the voice global characteristics corresponding to the voice emotion data to respectively obtain a first characteristic and a second characteristic;
the set determining unit is used for selecting 243 first features and second features corresponding to the voice emotion data as a training set; and using the first characteristics and the second characteristics corresponding to the remaining 120 pieces of speech emotion data as a test set.
Optionally, the feature extraction unit specifically includes:
the first extraction subunit is used for performing MFCC frame feature extraction on each piece of speech emotion data in the standard database to obtain a plurality of speech MFCC frame features;
the average processing subunit is configured to perform average processing on the multiple voice MFCC frame features to obtain multiple voice MFCC segment features;
and the second extraction subunit is used for carrying out global feature extraction on each piece of speech emotion data in the standard database to obtain a plurality of speech global features.
Optionally, the standard database determining unit specifically includes:
the judging subunit is used for judging whether each piece of speech emotion data in the speech emotion database is larger than a set speech frame length; if the speech emotion data are larger than the set speech frame length, adopting truncation operation to enable the speech emotion data to be equal to the set speech frame length, and placing the processed speech emotion data into the standard database; if the voice emotion data are smaller than the set voice frame length, zero filling operation is adopted to enable the voice emotion data to be equal to the set voice frame length, and the processed voice emotion data are placed in the standard database; and if each voice emotion data is equal to the length of a set voice frame, directly putting each voice emotion data into the standard database.
According to the specific embodiment provided by the invention, the invention discloses the following technical effects:
the invention provides a voice emotion feature fusion method and system based on a main network and an auxiliary network, wherein the method comprises the following steps: respectively inputting a plurality of first features and second features corresponding to the voice emotion data in the test set into a lower half part of the main network model and an auxiliary network model with parameters to obtain a main network high-level feature and an auxiliary network high-level feature corresponding to the voice emotion data; performing feature fusion on the main network high-level features, the auxiliary parameters and the auxiliary network high-level features in a main and auxiliary network mode, and determining main and auxiliary network fusion features corresponding to the voice emotion data; and inputting the main and auxiliary network fusion characteristics corresponding to the voice emotion data into the upper half part of the main network model with parameters to obtain fusion characteristics. The invention effectively fuses various features and improves the accuracy of speech emotion fusion.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed in the embodiments are briefly described below. It is obvious that the drawings in the following description are only some embodiments of the present invention, and that those skilled in the art can obtain other drawings from them without inventive effort.
FIG. 1 is a flow chart of a speech emotion feature fusion method based on a main network and an auxiliary network in an embodiment of the present invention;
FIG. 2 is a structural diagram of a voice emotion feature fusion system based on a main network and an auxiliary network in the embodiment of the present invention;
FIG. 3 is a diagram of a speech emotion feature fusion network structure according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without making any creative effort based on the embodiments in the present invention, belong to the protection scope of the present invention.
The invention aims to provide a voice emotion feature fusion method and system based on a main network and an auxiliary network, so as to improve the accuracy of speech emotion recognition.
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.
Long short-term memory (LSTM): a recurrent neural network designed to overcome the long-term dependency problem of ordinary recurrent neural networks (RNNs).
Bidirectional long short-term memory (BLSTM): a bidirectional LSTM that can exploit past and future context at the same time.
Mel-frequency cepstral coefficients (MFCCs): a linear transform of the log energy spectrum based on the nonlinear mel scale of sound frequency.
Support vector machine (SVM): a generalized linear classifier that performs binary classification of data in a supervised learning manner; its decision boundary is the maximum-margin hyperplane of the learning samples.
Deep neural network (DNN): a perceptron has an input layer, an output layer and a hidden layer; input feature vectors are transformed by the hidden layer and reach the output layer, where the classification result is obtained. A deep neural network is a perceptron-style network with multiple hidden layers.
Hidden Markov model (HMM): a statistical model that describes a Markov process with hidden, unknown parameters.
Convolutional neural network (CNN): a class of feed-forward neural networks with a deep structure that include convolution computations; they have representation-learning ability and can perform translation-invariant classification of input information according to their hierarchical structure.
Attention mechanism: in cognitive science, because of bottlenecks in information processing, a human selectively attends to part of the available information while ignoring the rest.
Short-time Fourier transform (STFT): a Fourier-related transform used to determine the frequency and phase of local sections of a time-varying signal.
Spectrogram: a speech spectrogram whose abscissa is time, whose ordinate is frequency, and whose point values are the energy of the speech data.
Softmax: in probability theory and related fields, the normalized exponential function, a generalization of the logistic function; in effect it is a gradient-log normalization of a finite discrete probability distribution.
Principal component analysis (PCA): a statistical method that converts a group of possibly correlated variables into a group of linearly uncorrelated variables through an orthogonal transformation; the converted variables are called principal components.
Dropout: a method for optimizing artificial neural networks with a deep structure; during learning, part of the weights or outputs of the hidden layers are randomly zeroed, which reduces the interdependence among nodes, regularizes the neural network and lowers its structural risk.
Fig. 1 is a flowchart of a speech emotion feature fusion method based on a primary network and a secondary network in an embodiment of the present invention, and as shown in fig. 1, the present invention provides a speech emotion feature fusion method based on a primary network and a secondary network, where the method includes:
step S1: determining a training set and a test set;
step S2: determining a main network model with parameters, an auxiliary network model with parameters and auxiliary parameters by using the training set;
step S3: inputting a plurality of first features corresponding to the voice emotion data in the test set into the lower half part of the main network model with parameters to obtain main network high-level features corresponding to the voice emotion data;
step S4: inputting the second characteristics corresponding to the voice emotion data in the test set into an auxiliary network model with parameters to obtain auxiliary network high-level characteristics corresponding to the voice emotion data;
step S5: performing feature fusion on the main network high-level features, the auxiliary parameters and the auxiliary network high-level features in a main and auxiliary network mode, and determining main and auxiliary network fusion features corresponding to the voice emotion data;
step S6: inputting the main and auxiliary network fusion characteristics corresponding to the voice emotion data into the upper half part of the main network model with parameters to obtain fusion characteristics.
The following is a detailed discussion of the various steps:
step S1: determining a training set and a test set, specifically comprising:
step S11: determining a speech emotion database, which specifically comprises:
selecting five emotions from the Emo-DB data set to obtain a voice emotion database; the voice emotion database comprises 363 pieces of voice emotion data; the five emotions include happy, sad, angry, surprised, and neutral.
The Emo-DB data set consists of 535 utterances performed by 10 professional actors, covering seven emotions of neutrality, fear, joy, anger, sadness, disgust and boredom, and the sampling frequency of the voice emotion database is reduced from 44.1kHz to 16kHz.
Step S12: determining a standard database according to the voice emotion database, which specifically comprises the following steps: judging whether each piece of voice emotion data in the voice emotion database is larger than a set voice frame length; if the speech emotion data are larger than the set speech frame length, adopting truncation operation to enable the speech emotion data to be equal to the set speech frame length, and placing the processed speech emotion data into the standard database; if the voice emotion data are smaller than the set voice frame length, zero filling operation is adopted, the voice emotion data are enabled to be equal to the set voice frame length, and the processed voice emotion data are placed in the standard database; and if each voice emotion data is equal to the length of a set voice frame, directly putting each voice emotion data into the standard database.
Because different utterances contain different numbers of frames (the longest is 2441 frames and the shortest is 121 frames), truncation and zero-filling operations are adopted here to align the voice frame lengths. By observation, most utterances are below 1600 frames, so the set voice frame length is 1600: voices that are too short are zero-filled and voices that are too long are truncated.
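To make the alignment concrete, the sketch below pads or truncates a per-utterance frame matrix to the set length of 1600 frames. It is a minimal illustration only: the function name, the use of NumPy and the (frames, dimensions) layout are assumptions, not part of the original disclosure.

```python
import numpy as np

def align_frame_length(frames: np.ndarray, target_len: int = 1600) -> np.ndarray:
    """Truncate or zero-pad a (num_frames, feat_dim) matrix to exactly target_len frames."""
    num_frames, feat_dim = frames.shape
    if num_frames > target_len:
        # Truncation: keep only the first target_len frames.
        return frames[:target_len]
    if num_frames < target_len:
        # Zero filling: append all-zero frames up to the set voice frame length.
        padding = np.zeros((target_len - num_frames, feat_dim), dtype=frames.dtype)
        return np.vstack([frames, padding])
    return frames
```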
Step S13: performing feature extraction on each voice emotion data in the standard database to obtain a plurality of voice MFCC segment features and a plurality of voice global features, and specifically comprising:
step S131: performing MFCC frame feature extraction on each voice emotion data in the standard database to obtain a plurality of voice MFCC frame features;
step S132: after averaging the plurality of voice MFCC frame characteristics, obtaining a plurality of voice MFCC segment characteristics;
step S133: performing global feature extraction on each piece of voice emotion data in the standard database to obtain a plurality of voice global features. The global features comprise prosodic features (energy, speaking rate and zero-crossing rate), voice quality features represented by formants, and spectral features represented by MFCC; there are 98 global features in total, and they are reduced to 70 dimensions by PCA.
Step S14: respectively carrying out standardization processing on a plurality of voice MFCC segment characteristics and the voice global characteristics corresponding to the voice emotion data to respectively obtain a first characteristic and a second characteristic; the first feature is 40 x 60 dimensions and the second feature is 70 dimensions.
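The two feature streams described in steps S13 and S14 can be sketched roughly as follows. The use of librosa and scikit-learn, the grouping of frames into 40 equal segments, and the 60 MFCC coefficients per segment are illustrative assumptions chosen to match the stated 40 x 60 and 70 dimensions; the patent does not specify the toolchain.

```python
import numpy as np
import librosa
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def mfcc_segment_features(y, sr=16000, n_segments=40, n_mfcc=60):
    """Frame-level MFCCs averaged within equal segments -> (n_segments, n_mfcc) matrix."""
    frames = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).T     # (num_frames, n_mfcc)
    segments = np.array_split(frames, n_segments, axis=0)          # group frames into segments
    return np.stack([seg.mean(axis=0) for seg in segments])        # average the frames per segment

def standardize_features(segment_feats, global_feats_98d):
    """Produce the first (40 x 60) and second (70-d) features for each utterance.

    segment_feats: (N, 40, 60) MFCC segment features; global_feats_98d: (N, 98) global features.
    """
    flat = segment_feats.reshape(len(segment_feats), -1)            # (N, 2400)
    first = StandardScaler().fit_transform(flat).reshape(segment_feats.shape)
    reduced = PCA(n_components=70).fit_transform(global_feats_98d)  # PCA: 98 -> 70 dimensions
    second = StandardScaler().fit_transform(reduced)
    return first, second
```

Averaging the frames inside each segment (step S132) keeps a 40-step temporal structure for the BLSTM while smoothing frame-level variation.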
Step S15: selecting the first features and second features corresponding to 243 pieces of the speech emotion data as the training set, and using the first features and second features corresponding to the remaining 120 pieces of speech emotion data as the test set.
Step S2: determining a main network model with parameters, an auxiliary network model with parameters and auxiliary parameters by using the training set, which specifically comprises the following steps:
step S21: determining a main network model parameter by using a plurality of first features corresponding to each piece of speech emotion data in the training set, specifically including:
step S211: keeping the auxiliary parameters and the auxiliary network model parameters unchanged, inputting a plurality of first features corresponding to each piece of speech emotion data in the training set into a main network, training by adopting a gradient descent algorithm to obtain main network model parameters, and determining a first cost function value by using a cost function formula;
step S212: judging whether the first cost function value is larger than or equal to a first set value or not; if the first cost function value is greater than or equal to the first set value, returning to step S211; and if the first cost function value is smaller than a first set value, outputting the main network model parameters.
Step S22: determining an auxiliary network model parameter and an auxiliary parameter by using a plurality of second features corresponding to each piece of speech emotion data in the training set, specifically comprising:
step S221: keeping the parameters of the main network model unchanged, inputting a plurality of second characteristics corresponding to the speech emotion data in the training set into an auxiliary network, training by adopting a gradient descent algorithm to obtain auxiliary network model parameters and auxiliary parameters, and determining a second cost function value by utilizing a cost function formula;
step S222: judging whether the second cost function value is larger than or equal to a second set value; if the second cost function value is greater than or equal to the second set value, returning to step S221; and if the second cost function value is smaller than the second set value, outputting the auxiliary network model parameters and the auxiliary parameters.
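Steps S21 and S22 thus alternate between the two sub-networks: first the main network parameters are trained with the auxiliary parameters frozen, then the auxiliary network parameters and the auxiliary parameter W are trained with the main network frozen, each phase repeating until its cost function falls below the set value. A minimal PyTorch-style sketch of this loop is given below; the model objects, the cost thresholds, plain SGD and the data loader are assumptions made for illustration.

```python
import torch

def train_alternating(main_net, aux_net, W, loader, criterion,
                      first_set_value=0.1, second_set_value=0.1, lr=1e-3):
    """Alternating training of step S21 (main network) and step S22 (auxiliary network and W)."""

    def run_epoch(trainable_params):
        # Plain gradient descent over one pass through the training set.
        opt = torch.optim.SGD(trainable_params, lr=lr)
        total = 0.0
        for mfcc_segments, global_feats, labels in loader:
            opt.zero_grad()
            out = main_net(mfcc_segments, aux_net(global_feats), W)
            loss = criterion(out, labels)
            loss.backward()
            opt.step()
            total += loss.item()
        return total / len(loader)                     # cost function value for this pass

    # Step S21: keep the auxiliary parameters and auxiliary network parameters unchanged.
    for p in aux_net.parameters():
        p.requires_grad_(False)
    W.requires_grad_(False)
    while run_epoch(list(main_net.parameters())) >= first_set_value:
        pass                                           # repeat until the first cost < set value

    # Step S22: keep the main network model parameters unchanged.
    for p in main_net.parameters():
        p.requires_grad_(False)
    for p in aux_net.parameters():
        p.requires_grad_(True)
    W.requires_grad_(True)
    while run_epoch(list(aux_net.parameters()) + [W]) >= second_set_value:
        pass                                           # repeat until the second cost < set value
```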
Step S3: inputting a plurality of first features corresponding to the voice emotion data in the test set into the lower half part of the main network model with parameters to obtain main network high-level features corresponding to the voice emotion data; the main network high-level feature is 160-dimensional.
As shown in fig. 3, the network structure is divided into two parts: a main network model and an auxiliary network model. The main network model is itself divided into two parts: M_D denotes the lower half of the main network model, which consists of two bidirectional long short-term memory (BLSTM) layers with an attention mechanism, each layer having 160 hidden neurons; M_U denotes the upper half of the main network model, which consists of two fully connected (DNN) layers, each with 100 neurons. e_0 denotes the normalized speech MFCC segment features input to the main network model, and h_l denotes the output of the last hidden layer of the lower half of the main network model after the adaptive time pooling algorithm. The auxiliary network model consists of two fully connected (DNN) layers, with 200 neurons in the first layer and 100 neurons in the second layer. v_0 denotes the normalized speech global features input to the auxiliary network model, and v_m denotes the output of the last hidden layer of the auxiliary network model. h_l and W·v_m are spliced together and input into the upper half M_U of the main network model. The auxiliary parameter W makes it possible to train the main and auxiliary networks separately, and optimizing W by the gradient descent algorithm adjusts the weighting assigned to the main network features and the auxiliary network features.
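A compact PyTorch sketch of this layout is shown below. PyTorch itself, the ReLU activations, treating W as a single scalar weight, and a per-direction hidden size of 80 (so that the pooled main network feature is 160-dimensional) are assumptions layered on the description above; the pooling argument is meant to be the adaptive time pooling module sketched after step S32.

```python
import torch
import torch.nn as nn

class AuxiliaryNetwork(nn.Module):
    """Auxiliary network: two fully connected layers with 200 and 100 neurons."""
    def __init__(self, in_dim=70):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 200), nn.ReLU(),
                                 nn.Linear(200, 100), nn.ReLU())

    def forward(self, v0):                  # v0: (batch, 70) normalized global features
        return self.net(v0)                 # v_m: (batch, 100) auxiliary high-level feature

class MainNetwork(nn.Module):
    """Main network: M_D = two BLSTM layers plus attention pooling, M_U = two DNN layers."""
    def __init__(self, pooling: nn.Module, in_dim=60, hidden=80, aux_dim=100, n_classes=5):
        super().__init__()
        # Two bidirectional LSTM layers; each time step outputs 2 * hidden values (160 here).
        self.blstm = nn.LSTM(in_dim, hidden, num_layers=2,
                             batch_first=True, bidirectional=True)
        self.pool = pooling                 # adaptive time pooling, sketched after step S32
        self.upper = nn.Sequential(nn.Linear(2 * hidden + aux_dim, 100), nn.ReLU(),
                                   nn.Linear(100, n_classes))

    def forward(self, e0, v_m, W):          # e0: (batch, 40, 60) MFCC segment features
        h, _ = self.blstm(e0)               # (batch, T, 2 * hidden)
        h_l = self.pool(h)                  # main network high-level feature
        fused = torch.cat([h_l, W * v_m], dim=-1)   # main-auxiliary network fusion feature
        return self.upper(fused)            # output of the upper half M_U
```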
Step S31: inputting the 40 × 60-dimensional first feature into the lower half M_D of the main network model to obtain an output result;
Step S32: applying the adaptive time pooling algorithm to the output result for training, and finally outputting the 160-dimensional main network high-level feature. The adaptive time pooling algorithm is calculated as:
h_l = Σ_{t=1}^{T} α_t · h_t
wherein h_t is the output at time t after the second BLSTM layer, h_t ∈ R^D, D denotes the dimension of h_t and is equal to 80, T is the BLSTM network time step and is equal to 40, and α_t is a weighting coefficient obtained through network learning, specifically calculated by the following formula:
α_t = softmax(β_t)
wherein β_t is the attention score computed from h_t, σ is a nonlinear mapping function such as the Sigmoid function, W_β and U_β are respectively coefficient matrices, γ_β is a coefficient vector, γ_β^T denotes the transpose of γ_β, and W_β, U_β and γ_β are network learning parameters, obtained by random initialization from a truncated normal distribution and then trained and optimized together with the recognition network by the gradient descent algorithm.
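The pooling layer can be sketched as follows. Because the score β_t appears only as an image in the original document, the exact way W_β, U_β, γ_β and σ combine is an assumption here; the module does implement h_l = Σ_t α_t h_t with α_t = softmax(β_t) and the truncated-normal initialization described above.

```python
import torch
import torch.nn as nn

class AdaptiveTimePooling(nn.Module):
    """Attention-weighted pooling over the T BLSTM time steps: h_l = sum_t alpha_t * h_t."""
    def __init__(self, dim):
        super().__init__()
        self.W_beta = nn.Linear(dim, dim, bias=False)       # coefficient matrix W_beta
        self.U_beta = nn.Linear(dim, dim, bias=False)       # coefficient matrix U_beta
        self.gamma_beta = nn.Parameter(torch.empty(dim))    # coefficient vector gamma_beta
        # Random initialization from a truncated normal distribution, as described above.
        nn.init.trunc_normal_(self.gamma_beta, std=0.1)

    def forward(self, h):                                   # h: (batch, T, dim)
        # Per-step score; the original beta_t formula is given only as an image, so this
        # particular combination of W_beta, U_beta, gamma_beta and the Sigmoid is assumed.
        scores = torch.sigmoid(self.W_beta(h) + self.U_beta(h))     # (batch, T, dim)
        beta = scores @ self.gamma_beta                              # (batch, T)
        alpha = torch.softmax(beta, dim=1)                           # alpha_t = softmax(beta_t)
        return (alpha.unsqueeze(-1) * h).sum(dim=1)                  # h_l: (batch, dim)
```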
Step S4: inputting the second characteristics corresponding to the voice emotion data in the test set into an auxiliary network model with parameters to obtain auxiliary network high-level characteristics corresponding to the voice emotion data; the second characteristic is 70-dimensional, and the auxiliary network high-level characteristic is 100-dimensional.
Step S5: performing feature fusion on the main network high-level features, the auxiliary parameters and the auxiliary network high-level features in a main and auxiliary network mode, and determining the main and auxiliary network fusion features corresponding to the voice emotion data, with the specific formula as follows:
main-auxiliary network fusion feature = [h_l, W·v_M]
wherein [·, ·] denotes splicing (concatenation), h_l is the main network high-level feature, W is the auxiliary parameter, and v_M is the auxiliary network high-level feature.
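Putting the sketches together, a single hedged forward pass over a dummy batch illustrates how the fusion feature [h_l, W·v_M] is formed inside the main network and then classified by its upper half M_U; all class names come from the illustrative sketches above, not from the patent.

```python
import torch

# Dummy batch: 8 utterances, 40 x 60 MFCC segment features and 70-d global features.
e0 = torch.randn(8, 40, 60)
v0 = torch.randn(8, 70)

aux_net = AuxiliaryNetwork()
main_net = MainNetwork(pooling=AdaptiveTimePooling(160))
W = torch.tensor(0.5, requires_grad=True)   # auxiliary parameter, assumed scalar here

v_m = aux_net(v0)                  # auxiliary network high-level feature, 100-d
scores = main_net(e0, v_m, W)      # concatenates [h_l, W * v_m] and applies the upper half
print(scores.shape)                # torch.Size([8, 5]) -> scores over the five emotions
```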
Step S6: and inputting the main and auxiliary network fusion characteristics corresponding to the voice emotion data into the upper half part of the main network model with parameters to obtain fusion characteristics.
Fig. 2 is a structural diagram of a voice emotion feature fusion system based on a primary network and a secondary network in an embodiment of the present invention, and as shown in fig. 2, the present invention further provides a voice emotion feature fusion system based on a primary network and a secondary network, where the system includes:
the set determining module 1 is used for determining a training set and a testing set;
the model determining module 2 is used for determining a main network model with parameters, an auxiliary network model with parameters and auxiliary parameters by using the training set;
a main network high-level feature determining module 3, configured to input the multiple first features corresponding to the voice emotion data in the test set into a lower half of the main network model with parameters, and obtain a main network high-level feature corresponding to the voice emotion data;
an auxiliary network high-level feature determining module 4, configured to input the second feature corresponding to each piece of speech emotion data in the test set into an auxiliary network model with parameters, so as to obtain an auxiliary network high-level feature corresponding to each piece of speech emotion data;
a main and auxiliary network fusion characteristic determining module 5, configured to perform characteristic fusion on the main network high-level characteristic, the auxiliary parameter, and the auxiliary network high-level characteristic in a main and auxiliary network manner, and determine a main and auxiliary network fusion characteristic corresponding to each piece of speech emotion data;
and the fusion characteristic determining module 6 is used for inputting the main and auxiliary network fusion characteristics corresponding to the voice emotion data into the upper half part of the main network model with parameters to obtain fusion characteristics.
The various modules are discussed in detail below:
as an embodiment, the set determining module of the present invention specifically includes:
the voice emotion database determining unit is used for determining a voice emotion database; the voice emotion database comprises 363 pieces of voice emotion data;
the standard database determining unit is used for determining a standard database according to the voice emotion database;
the feature extraction unit is used for performing feature extraction on each piece of speech emotion data in the standard database to obtain a plurality of speech MFCC segment features and a plurality of speech global features;
the normalization processing unit is used for respectively normalizing the plurality of voice MFCC segment characteristics and the voice global characteristics corresponding to the voice emotion data to respectively obtain a first characteristic and a second characteristic;
the set determining unit is used for selecting 243 corresponding first features and second features of the voice emotion data as training sets; and taking the first characteristic and the second characteristic corresponding to the remaining 120 pieces of the speech emotion data as a test set.
As an embodiment, the feature extraction unit of the present invention specifically includes:
the first extraction subunit is used for performing MFCC frame feature extraction on each piece of speech emotion data in the standard database to obtain a plurality of speech MFCC frame features;
the average processing subunit is configured to perform average processing on the multiple voice MFCC frame features to obtain multiple voice MFCC segment features;
and the second extraction subunit is used for carrying out global feature extraction on each piece of speech emotion data in the standard database to obtain a plurality of speech global features.
As an embodiment, the standard database determining unit of the present invention specifically includes:
the judging subunit is used for judging whether each piece of voice emotion data in the voice emotion database is larger than a set voice frame length; if the voice emotion data are larger than the set voice frame length, adopting truncation operation to enable the voice emotion data to be equal to the set voice frame length, and putting the processed voice emotion data into the standard database; if the voice emotion data are smaller than the set voice frame length, zero filling operation is adopted, the voice emotion data are enabled to be equal to the set voice frame length, and the processed voice emotion data are placed in the standard database; and if each piece of voice emotion data is equal to the length of a set voice frame, directly putting each piece of voice emotion data into the standard database.
As an embodiment, the model determining module 2 of the present invention specifically includes:
a main network model parameter determining unit, configured to determine a main network model parameter by using a plurality of first features corresponding to each piece of speech emotion data in the training set;
and the auxiliary network model parameter determining unit is used for determining auxiliary network model parameters and auxiliary parameters by utilizing the plurality of second characteristics corresponding to the voice emotion data in the training set.
As an implementation manner, the master network model parameter determining unit of the present invention specifically includes:
a main network model parameter determining subunit, configured to keep the auxiliary parameters and the auxiliary network model parameters unchanged, input the plurality of first features corresponding to each piece of speech emotion data in the training set into a main network, and perform training by using a gradient descent algorithm to obtain a main network model parameter;
a first cost function value stator unit for determining a first cost function value using a cost function formula;
the first judgment subunit is used for judging whether the first cost function value is greater than or equal to a first set value or not; if the first cost function value is larger than or equal to a first set value, returning to a main network model parameter determination subunit; and if the first cost function value is smaller than a first set value, outputting the main network model parameter.
As an embodiment, the auxiliary network model parameter determining unit of the present invention specifically includes:
an auxiliary network model parameter determining subunit, configured to keep the main network model parameter unchanged, input the plurality of second features corresponding to each piece of speech emotion data in the training set into an auxiliary network, and perform training by using a gradient descent algorithm to obtain an auxiliary network model parameter and an auxiliary parameter;
a second cost function value stator unit for determining a second cost function value using the cost function formula;
a second judging subunit, configured to judge whether the second cost function value is greater than or equal to a second set value; if the second cost function value is larger than or equal to a second set value, returning to an auxiliary network model parameter determination subunit; and if the second cost function value is smaller than a second set value, outputting the auxiliary network model parameter and the auxiliary parameter.
Simulation verification
In order to verify the effectiveness of the speech emotion feature fusion method provided by the invention, comparative experiments were performed on the Emo_DB data set, with the results shown in Table 1.
Table 1: recognition accuracy of different network structures on the Emo_DB data set
(Table 1 compares the recognition accuracy of BLSTM+ATP with MFCC speech segment features, DNN with global features, the direct concatenation of BLSTM+ATP and DNN, and the proposed main-auxiliary network structure.)
The BLSTM + ATP (MFCC speech segment feature) network structure is a BLSTM network added with an adaptive time pooling algorithm, and takes the speech MFCC segment features after standardized processing as input features.
The DNN (global feature) network structure is a multi-layer fully-connected network, and takes the global features after the normalization process as input features.
The BLSTM+ATP and DNN concatenation network structure directly splices together the high-level features obtained by feeding the normalized speech MFCC segment features into the BLSTM+ATP network and the high-level features obtained by feeding the normalized global features into the DNN network, and inputs the result into the classifier; the two networks are trained simultaneously, with no distinction between primary and secondary.
The last entry is the network structure of the method proposed by the present invention, whose accuracy is 89.84%. As can be seen from Table 1, the accuracy of the direct-splicing feature fusion method is higher than that of recognition with a single type of feature, but not significantly so; the recognition accuracy of feature fusion through the main and auxiliary networks is markedly higher than that of direct splicing, which verifies that the feature fusion method provided by the invention can effectively improve the accuracy of the speech emotion recognition system compared with direct-splicing feature fusion.
The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other.
The principles and embodiments of the present invention have been described herein using specific examples, which are provided only to help understand the method and the core concept of the present invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, the specific embodiments and the application range may be changed. In view of the above, the present disclosure should not be construed as limiting the invention.

Claims (8)

1. A voice emotion feature fusion method based on a main network and an auxiliary network is characterized by comprising the following steps:
step S1: determining a training set and a test set;
step S2: determining a main network model with parameters, an auxiliary network model with parameters and auxiliary parameters by using the training set;
the main network model comprises an upper half part of the main network model and a lower half part of the main network model, and the lower half part of the main network model comprises two layers of bidirectional long-time and short-time memory unit BLSTM networks added with an attention mechanism and an adaptive time pooling algorithm; the upper half part of the main network model comprises two layers of multi-layer fully connected layers DNN;
the auxiliary network model comprises two layers of fully connected layers DNN;
the calculation formula of the self-adaptive time pooling algorithm is as follows:
h_l = Σ_{t=1}^{T} α_t · h_t
wherein h_t is the output at time t after the second BLSTM layer, h_t ∈ R^D, D denotes the dimension of h_t and is equal to 80, T is the BLSTM network time step and is equal to 40, and α_t is a weighting coefficient obtained through network learning, specifically calculated by the following formula:
α_t = softmax(β_t)
wherein β_t is the attention score computed from h_t, σ is the Sigmoid function, W_β and U_β are respectively coefficient matrices, γ_β is a coefficient vector, γ_β^T denotes the transpose of γ_β, and W_β, U_β and γ_β are all network learning parameters, which are obtained by random initialization through a truncated normal distribution and then trained and optimized together with the recognition network through a gradient descent algorithm;
and step S3: inputting a plurality of first characteristics corresponding to the voice emotion data in the test set into the lower half part of the main network model with parameters to obtain main network high-level characteristics corresponding to the voice emotion data;
and step S4: inputting the second characteristics corresponding to the voice emotion data in the test set into an auxiliary network model with parameters to obtain auxiliary network high-level characteristics corresponding to the voice emotion data;
step S5: performing feature fusion on the main network high-level features, the auxiliary parameters and the auxiliary network high-level features, and determining main and auxiliary network fusion features corresponding to the voice emotion data;
step S6: inputting the main network and auxiliary network fusion characteristics corresponding to each piece of voice emotion data into the upper half part of a main network model with parameters to obtain fusion characteristics;
determining a training set and a test set, specifically comprising:
step S11: determining a speech emotion database; the voice emotion database comprises 363 pieces of voice emotion data;
step S12: determining a standard database according to the voice emotion database;
step S13: performing feature extraction on each voice emotion data in the standard database to obtain a plurality of voice MFCC segment features and a plurality of voice global features;
step S14: respectively carrying out standardization processing on a plurality of voice MFCC segment characteristics and the voice global characteristics corresponding to the voice emotion data to respectively obtain a first characteristic and a second characteristic;
step S15: selecting 243 corresponding first features and second features of the voice emotion data as a training set; using the first characteristics and the second characteristics corresponding to the remaining 120 pieces of speech emotion data as a test set;
the global features of the speech comprise prosodic features including energy, speech speed and zero-crossing rate, voice quality features represented by formants and spectral features represented by MFCC, the total number of the global features of the speech is 98, and the global features of the speech are converted into 70 dimensions after PCA processing.
2. The method as claimed in claim 1, wherein the method for fusing speech emotion features based on primary and secondary networks, which is used for performing feature extraction on each speech emotion data in the standard database to obtain a plurality of speech MFCC segment features and a plurality of speech global features, specifically comprises:
step S131: performing MFCC frame feature extraction on each voice emotion data in the standard database to obtain a plurality of voice MFCC frame features;
step S132: after averaging the plurality of voice MFCC frame characteristics, obtaining a plurality of voice MFCC segment characteristics;
step S133: and carrying out global feature extraction on each voice emotion data in the standard database to obtain a plurality of voice global features.
3. The main-auxiliary network-based speech emotion feature fusion method according to claim 1, wherein the determining a standard database according to the speech emotion database specifically comprises:
step S121: judging whether each piece of voice emotion data in the voice emotion database is larger than a set voice frame length; if the voice emotion data are larger than the set voice frame length, adopting truncation operation to enable the voice emotion data to be equal to the set voice frame length, and putting the processed voice emotion data into the standard database; if the voice emotion data are smaller than the set voice frame length, zero filling operation is adopted to enable the voice emotion data to be equal to the set voice frame length, and the processed voice emotion data are placed in the standard database; and if each voice emotion data is equal to the length of a set voice frame, directly putting each voice emotion data into the standard database.
4. The method for fusing speech emotion characteristics based on primary and secondary networks as claimed in claim 1, wherein the determining of the primary network model with parameters, the secondary network model with parameters and the secondary parameters by using the training set specifically comprises:
step S21: determining a main network model parameter by using a plurality of first characteristics corresponding to each piece of speech emotion data in the training set;
step S22: and determining auxiliary network model parameters and auxiliary parameters by using a plurality of second characteristics corresponding to each piece of speech emotion data in the training set.
5. The main-auxiliary network-based voice emotion feature fusion method of claim 4, wherein the determining of main network model parameters by using the plurality of first features corresponding to each piece of voice emotion data in the training set specifically comprises:
step S211: keeping the auxiliary parameters and the auxiliary network model parameters unchanged, inputting a plurality of first characteristics corresponding to each voice emotion data in the training set into a main network, training by adopting a gradient descent algorithm to obtain main network model parameters, and determining a first cost function value by utilizing a cost function formula;
step S212: judging whether the first cost function value is larger than or equal to a first set value or not; if the first cost function value is greater than or equal to the first set value, returning to step S211; and if the first cost function value is smaller than a first set value, outputting the main network model parameter.
6. The method as claimed in claim 4, wherein the determining of the auxiliary network model parameters and the auxiliary parameters by using the plurality of second features corresponding to each piece of speech emotion data in the training set specifically comprises:
step S221: keeping the main network model parameters unchanged, inputting a plurality of second characteristics corresponding to each piece of speech emotion data in the training set into an auxiliary network, training by adopting a gradient descent algorithm to obtain auxiliary network model parameters and auxiliary parameters, and determining a second cost function value by using a cost function formula;
step S222: judging whether the second cost function value is larger than or equal to a second set value; if the second cost function value is greater than or equal to the second set value, returning to step S221; and if the second cost function value is smaller than the second set value, outputting the auxiliary network model parameters and the auxiliary parameters.
7. A voice emotion feature fusion system based on a main network and an auxiliary network is characterized by comprising:
the set determining module is used for determining a training set and a testing set;
the model determining module is used for determining a main network model with parameters, an auxiliary network model with parameters and auxiliary parameters by utilizing the training set; the main network high-level feature determination module is used for inputting a plurality of first features corresponding to the voice emotion data in the test set into the lower half part of the main network model with parameters to obtain main network high-level features corresponding to the voice emotion data;
the main network model comprises an upper half part and a lower half part; the lower half part of the main network model comprises two bidirectional long short-term memory (BLSTM) layers with an attention mechanism and an adaptive time pooling algorithm, and the upper half part of the main network model comprises two fully connected (DNN) layers;
the auxiliary network model comprises two fully connected (DNN) layers;
the adaptive time pooling algorithm pools the outputs of the second BLSTM layer over time: the output of the second BLSTM layer at each time step, of dimension D equal to 80, is weighted by a learned coefficient, and the weighted outputs are combined over the BLSTM time steps t, the number of time steps being equal to 40, to give the pooled feature; each weighting coefficient is obtained through network learning from the corresponding BLSTM output by means of a Sigmoid function, two coefficient matrices and a coefficient vector applied with a transposition operation; the coefficient matrices and the coefficient vector are all network learning parameters, randomly initialized from a truncated normal distribution and then trained and optimized together with the recognition network through a gradient descent algorithm (a sketch of this pooling is given after this claim);
the auxiliary network high-level feature determining module is used for inputting second features corresponding to the voice emotion data in the test set into an auxiliary network model with parameters to obtain auxiliary network high-level features corresponding to the voice emotion data;
the main and auxiliary network fusion feature determining module is used for performing feature fusion on the main network high-level features, the auxiliary parameters and the auxiliary network high-level features, and determining a main and auxiliary network fusion feature corresponding to each piece of voice emotion data (a fusion sketch is given after this claim);
the fusion feature determining module is used for inputting the main and auxiliary network fusion feature corresponding to each piece of voice emotion data into the upper half part of the main network model with parameters to obtain fusion features;
the set determining module specifically includes:
the voice emotion database determining unit is used for determining a voice emotion database; the voice emotion database comprises 363 pieces of voice emotion data;
the standard database determining unit is used for determining a standard database according to the voice emotion database;
the feature extraction unit is used for performing feature extraction on each piece of speech emotion data in the standard database to obtain a plurality of speech MFCC segment features and a plurality of speech global features;
the normalization processing unit is used for respectively normalizing the plurality of voice MFCC segment features and the voice global features corresponding to each piece of voice emotion data to obtain the first features and the second features;
the set determining unit is used for selecting the first features and the second features corresponding to 243 pieces of voice emotion data as the training set, and taking the first features and the second features corresponding to the remaining 120 pieces of voice emotion data as the test set;
the voice global features comprise prosodic features (energy, speech rate and zero-crossing rate), voice quality features represented by formants, and spectral features represented by MFCC; the voice global features total 98 dimensions and are reduced to 70 dimensions after PCA processing.
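The claim describes the adaptive time pooling only in terms of its ingredients (a Sigmoid function, two coefficient matrices, a coefficient vector and a transposition), so the sketch below is one plausible reading of it: the T = 40 outputs of the second BLSTM layer (dimension D = 80) are combined with learned Sigmoid weighting coefficients. The mean-pooled context query, the tanh nonlinearity, the 39-dimensional MFCC input and the hidden size are assumptions, not the patent's literal equations.

import torch
import torch.nn as nn

class AdaptiveTimePooling(nn.Module):
    """Weighted pooling over BLSTM time steps with learned Sigmoid coefficients
    (one plausible reading of the claim, not its literal formula)."""
    def __init__(self, dim: int = 80):
        super().__init__()
        # Two coefficient matrices and one coefficient vector, randomly initialized
        # from a truncated normal distribution as stated in the claim.
        self.W1 = nn.Parameter(nn.init.trunc_normal_(torch.empty(dim, dim), std=0.02))
        self.W2 = nn.Parameter(nn.init.trunc_normal_(torch.empty(dim, dim), std=0.02))
        self.v = nn.Parameter(nn.init.trunc_normal_(torch.empty(dim), std=0.02))

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, T, D) outputs of the second BLSTM layer, T = 40, D = 80
        context = h.mean(dim=1, keepdim=True)                           # assumed query term
        scores = torch.tanh(h @ self.W1 + context @ self.W2) @ self.v   # (batch, T)
        alpha = torch.sigmoid(scores).unsqueeze(-1)                     # Sigmoid weighting coefficients
        return (alpha * h).sum(dim=1)                                   # weighted sum over time

class MainNetworkLowerHalf(nn.Module):
    """Two BLSTM layers followed by the pooling above; outputs the main network
    high-level feature used by the fusion module."""
    def __init__(self, in_dim: int = 39, hidden: int = 40):
        super().__init__()
        self.blstm = nn.LSTM(in_dim, hidden, num_layers=2,
                             bidirectional=True, batch_first=True)
        self.pool = AdaptiveTimePooling(dim=2 * hidden)                 # 2 * 40 = 80, matching D

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h, _ = self.blstm(x)                                            # (batch, T, 80)
        return self.pool(h)

main_high = MainNetworkLowerHalf()(torch.randn(8, 40, 39))              # (batch 8, feature dim 80)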
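The fusion and upper-half modules of the same claim combine the main network high-level features, the auxiliary parameters and the auxiliary network high-level features, but the fusion operator itself is not spelled out here. A minimal sketch under the assumption that the auxiliary parameter scales the auxiliary high-level feature before concatenation with the main one; the layer widths, the scalar auxiliary parameter and the four emotion classes are likewise assumptions.

import torch
import torch.nn as nn

class AuxiliaryNetwork(nn.Module):
    """Two fully connected (DNN) layers over the 70-dimensional global (second) features."""
    def __init__(self, in_dim: int = 70, hidden: int = 64, out_dim: int = 80):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, out_dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

class MainNetworkUpperHalf(nn.Module):
    """Two fully connected (DNN) layers mapping the fused feature to emotion classes."""
    def __init__(self, in_dim: int = 160, hidden: int = 64, n_classes: int = 4):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, n_classes))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

def fuse(main_high, aux_high, aux_param):
    """Assumed fusion: weight the auxiliary high-level feature by the auxiliary
    parameter, then concatenate it with the main network high-level feature."""
    return torch.cat([main_high, aux_param * aux_high], dim=-1)

aux_param = nn.Parameter(torch.tensor(0.5))        # trainable auxiliary parameter (scalar by assumption)
main_high = torch.randn(8, 80)                     # e.g. output of MainNetworkLowerHalf above
aux_high = AuxiliaryNetwork()(torch.randn(8, 70))  # auxiliary network high-level features
fused = fuse(main_high, aux_high, aux_param)       # (8, 160) main and auxiliary network fusion feature
print(MainNetworkUpperHalf()(fused).shape)         # torch.Size([8, 4]) fusion features / logits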
8. The voice emotion feature fusion system based on main and auxiliary networks as claimed in claim 7, wherein the feature extraction unit specifically comprises:
the first extraction subunit is used for performing MFCC frame feature extraction on each piece of speech emotion data in the standard database to obtain a plurality of speech MFCC frame features;
the average processing subunit is used for averaging the plurality of voice MFCC frame features to obtain a plurality of voice MFCC segment features;
and the second extraction subunit is used for carrying out global feature extraction on each piece of speech emotion data in the standard database to obtain a plurality of speech global features.
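A sketch of the frame-to-segment extraction described by the subunits above, assuming librosa for the MFCC frame features and simple block averaging of consecutive frames into segment features; the 16 kHz sampling rate, the 13 coefficients and the 10-frame segment length are assumptions, as the claim fixes none of them.

import numpy as np
import librosa

def mfcc_segment_features(signal: np.ndarray, sr: int = 16000,
                          n_mfcc: int = 13, frames_per_segment: int = 10) -> np.ndarray:
    """Voice MFCC frame features -> averaged voice MFCC segment features."""
    # Frame-level MFCC features, shape (n_mfcc, n_frames)
    frames = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc)
    # Keep only whole segments so the frame count divides evenly
    n_segments = frames.shape[1] // frames_per_segment
    frames = frames[:, :n_segments * frames_per_segment]
    # Average each run of consecutive frames into one segment feature: (n_segments, n_mfcc)
    return frames.reshape(n_mfcc, n_segments, frames_per_segment).mean(axis=2).T

segments = mfcc_segment_features(np.random.randn(64000).astype(np.float32))
print(segments.shape)   # (number of segments, 13)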

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant