CN111899766B - Speech emotion recognition method based on optimization fusion of depth features and acoustic features - Google Patents

Speech emotion recognition method based on optimization fusion of depth features and acoustic features

Info

Publication number
CN111899766B
Authority
CN
China
Prior art keywords
features
acoustic
bottleneck
characteristic
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010855013.2A
Other languages
Chinese (zh)
Other versions
CN111899766A (en)
Inventor
孙林慧
黄译庆
傅升
李平安
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Posts and Telecommunications
Original Assignee
Nanjing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Posts and Telecommunications filed Critical Nanjing University of Posts and Telecommunications
Priority to CN202010855013.2A priority Critical patent/CN111899766B/en
Publication of CN111899766A publication Critical patent/CN111899766A/en
Application granted granted Critical
Publication of CN111899766B publication Critical patent/CN111899766B/en

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/63 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/39 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using genetic algorithms
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T - CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 - Road transport of goods or passengers
    • Y02T10/10 - Internal combustion engine [ICE] based vehicles
    • Y02T10/40 - Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Child & Adolescent Psychology (AREA)
  • General Health & Medical Sciences (AREA)
  • Hospice & Palliative Care (AREA)
  • Psychiatry (AREA)
  • Image Analysis (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a speech emotion recognition method based on the optimization fusion of depth features and acoustic features. A genetic algorithm is used to optimally fuse deep bottleneck features with acoustic features, realizing highly robust speech emotion recognition and overcoming the shortcomings of conventional speech emotion recognition methods. Compared with traditional methods based on a single type of depth feature or acoustic feature, the method mines rich speech emotion information from different levels and describes it more comprehensively, so the recognition rate is higher, the robustness of the system is further improved, and the method can be readily applied to intelligent human-computer interaction.

Description

Speech emotion recognition method based on optimization fusion of depth features and acoustic features
Technical Field
The invention belongs to the technical field of voice recognition, and particularly relates to a voice emotion recognition method based on optimal fusion of depth features and acoustic features.
Background
With the rapid development of artificial intelligence technology, enabling machines to think and feel as people do has become a trend and a demand of the network era. Realizing intelligent interaction between machines and humans requires, indispensably, that machines possess affective computing capability. Speech, the most basic and convenient way for humans to communicate, carries complex information: a speech signal not only conveys semantic content but also reflects the speaker's inner emotion. Because speech is natural, convenient and effective in human-computer interaction, it has become a key research topic for many scholars, giving rise to speech emotion recognition technology. Speech emotion recognition enables a computer to acquire the emotion information in a speech signal by extracting acoustic features that contain emotion information and finding the mapping between these features and emotional states, thereby analyzing the speaker's emotional state. Speech emotion recognition by computers is an important component of machine emotional intelligence and a key to intelligent human-computer interaction, and it has great research and application value for emotion cognition, signal processing, information acquisition and related research.
To build a highly robust speech emotion recognition model, three problems must be considered: feature extraction, model training and emotion recognition. Extracting features that contain rich speech emotion information is of primary importance and directly affects recognition performance; therefore, the extraction, selection and fusion of features are the focus of the present invention. Currently, the features used for speech emotion recognition can be broadly divided into acoustic features and deep bottleneck features. Acoustic features mainly include MFCC, fundamental frequency, zero-crossing rate, energy amplitude and the like. They are widely used in existing research and achieve good recognition results in certain scenarios, but acoustic features for speech emotion recognition generally consider only the physical-level information of the speech signal, so rich emotion information is not fully exploited. In recent years, deep neural networks (DNNs) have become a popular topic in industry and academia; owing to their strong feature extraction and modeling capabilities, DNNs have raised recognition rates well above conventional levels. The networks commonly used in the speech recognition field include deep belief networks (DBNs), convolutional neural networks (CNNs) and recurrent neural networks (RNNs). In speech emotion recognition, the use of deep belief networks can be divided into two cases. In the first, the DBN is used for feature extraction: representations at different levels are obtained that are closely tied to the classification labels, and the emotion information of the speech signal is mined at a deep level to obtain emotion features with higher discrimination. In the second, the deep belief network is used for classification: the final output layer of the trained network is replaced by a classifier for direct classification, and a classifier that achieves good results is the support vector machine (SVM). The present invention draws on both of these applications of DBNs.
Although the extracted acoustic features and depth bottleneck features can each achieve a certain recognition effect in particular scenarios, a single type of feature can hardly represent the rich emotion information in speech completely, and the recognition rate still needs improvement in other scenarios. In view of this, it is necessary to provide a speech emotion recognition method based on the optimal fusion of depth features and acoustic features, so as to achieve a high recognition rate across multiple language scenarios.
Disclosure of Invention
The invention provides a speech emotion recognition method based on the optimization fusion of depth features and acoustic features, aiming at achieving a high recognition rate in multiple language scenarios. Compared with using a single type of traditional acoustic feature or deep bottleneck feature, the method extracts deep bottleneck features and traditional acoustic features simultaneously and fuses the two kinds of features with a genetic algorithm, thereby obtaining higher emotion recognition performance in different language scenarios.
In order to achieve the above purpose, the invention provides a speech emotion recognition method based on optimization fusion of depth features and acoustic features, which comprises the following steps:
step 1, inputting a voice signal in a corpus, preprocessing the voice signal and extracting acoustic features of the voice signal;
step 2, extracting Fourier coefficient characteristics of the voice signal, taking the Fourier coefficient characteristics as DNN input, and training a DNN for extracting deep bottleneck characteristics of the voice signal;
step 3, performing feature selection on the extracted acoustic features and the extracted depth bottleneck features by adopting a Fisher criterion, reducing feature redundancy, and obtaining high-quality features with high emotion discrimination;
step 4, optimizing and fusing the acoustic features and the deep bottleneck features by adopting a genetic algorithm, wherein the acoustic features represent the physical-level information of the emotion and the deep bottleneck features represent information highly related to the emotion classification labels; the two kinds of features are fused to improve the speech emotion recognition effect;
and 5, combining the test data according to the optimization results to obtain a fused test feature set, training the SVM by taking the fused test feature set as the input of a Support Vector Machine (SVM), using the trained SVM for realizing speech emotion recognition, and evaluating the performance of the provided speech emotion recognition method based on optimization fusion.
A further development of the invention is that step 1 comprises:
step 1-1: sampling a time domain continuous voice signal input by each sentence, and then preprocessing the voice signal by adopting pre-emphasis, framing and windowing and end point detection technologies to obtain a preprocessed signal;
step 1-2: calculating acoustic characteristics of the preprocessed voice signal, wherein the acoustic characteristics comprise MFCC, fundamental tone frequency, zero crossing rate and short-time energy;
step 1-3: calculating the statistical features of each utterance, namely computing statistics over the frame-level features of each utterance, wherein the statistics comprise the maximum value, minimum value, median, variance and mean; the statistical features finally obtained are the acoustic features of each utterance.
A further development of the invention is that said step 2 comprises:
step 2-1: calculating Fourier coefficient characteristics of the preprocessed voice signals, and taking the obtained Fourier coefficient characteristics as input of DNN;
step 2-2: firstly, carrying out unsupervised pre-training on DNN, and then introducing supervised error back propagation to carry out parameter fine adjustment to obtain a trained DNN model;
step 2-3: after training is finished, feeding all training speech signals into the trained DNN again and obtaining the output of the third layer of the DNN, namely the output of the bottleneck layer; this output is the deep bottleneck feature of each frame of the speech signal;
step 2-4: calculating the statistical features of the frame-level deep bottleneck features of each training utterance; the resulting features are the deep bottleneck features of each utterance, and the statistics comprise the maximum value, minimum value, mean, variance and median.
A further development of the invention is that said step 3 comprises:
step 3-1: respectively calculating Fisher values of each dimension characteristic in the acoustic characteristic and the depth bottleneck characteristic by adopting a Fisher criterion according to the acoustic characteristic and the depth bottleneck characteristic obtained in the step 1 and the step 2;
step 3-2: and (4) respectively sequencing Fisher values obtained from the deep bottleneck characteristics and the acoustic characteristics in the step (3-1), and deleting the deep bottleneck characteristics and the acoustic characteristics of which the Fisher values are lower than a threshold value P to finish the characteristic selection process.
A further development of the invention is that step 4 comprises:
step 4-1: optimizing and fusing the depth bottleneck characteristic and the acoustic characteristic after characteristic selection by adopting a genetic algorithm, respectively marking MFCC, short-time energy, zero crossing rate, pitch frequency and depth bottleneck characteristic in the acoustic characteristic as { x1, x2, x3, x4, x5}, and endowing an initial weight value for each type of characteristic as { w1, w2, w3, w4, w5};
step 4-2: taking the weighted fusion of the initial weights and the features, namely {w1*x1, w2*x2, w3*x3, w4*x4, w5*x5}, as the genetic algorithm input, initializing the genetic algorithm, setting the objective function of the genetic algorithm to the recognition rate, and starting the genetic algorithm to optimize the fusion weights;
step 4-3: and outputting and storing a weight value optimizing result by the genetic algorithm, and performing weighted fusion on the acoustic characteristic and the deep bottleneck characteristic by taking the weight value optimizing result as a fusion weight value of the test and training SVM data.
In a further development of the invention, the step 4-2 comprises: optimizing the weight combination by adopting a genetic algorithm, and specifically comprising the following steps of:
A. initializing weights, carrying out binary coding on weight combinations, and generating an initial population;
B. decoding yields a weight combination and the features are combined in a weighted manner; the combined features are fed into a support vector machine for training, and the speech emotion recognition result obtained by the support vector machine is taken as the fitness function, so that individuals with higher fitness are more likely to be retained;
C. performing selection: following survival of the fittest according to the fitness function, excellent individuals are selected from the population as parents to generate a new population;
D. performing crossover: a pair of individuals is randomly selected from the population, and some of their genes are exchanged to form new individuals;
E. for each individual in the population, changing the gene of the individual according to a certain mutation probability to form a new individual to be added into the population;
F. the weights are decoded and fitness values are calculated. Meanwhile, comparing the speech emotion recognition rates of the offspring and the parent to update the best individual;
G. checking whether the iteration number or the fitness value meets a termination condition: if not, repeating the steps C to F; if the condition is met, go to step H;
H. and outputting the optimal weight combination.
A further development of the invention is that said step 5 comprises:
step 5-1: extracting acoustic features and deep bottleneck features of the test data according to the weight combination obtained by genetic algorithm optimization in the step 4, and performing weighted fusion according to the weight combination;
step 5-2: and applying the feature set obtained by fusion to SVM training, and realizing speech emotion recognition by the SVM obtained by training.
In a further development of the invention, according to the optimization result of step 4, the test data are fused according to the weight combination to obtain a feature set as given by the following formula, and the feature set is input into the SVM for training; the formula is:
T={w1'*x1,w2'*x2,w3'*x3,w4'*x4,w5'*x5}
the objective function of the SVM optimal hyperplane is obtained through training:
\min_{w,b,\xi}\ \frac{1}{2}\|w\|^{2}+C\sum_{i=1}^{N}\xi_{i}
\text{s.t.}\ \ y_{i}\left(w^{T}x_{i}+b\right)\ge 1-\xi_{i},\ \ \xi_{i}\ge 0,\ \ i=1,2,\ldots,N
wherein C represents a penalty coefficient that controls the penalty for misclassified samples and balances the complexity of the model against the loss error, ξ_i represents the relaxation (slack) factor, N represents the number of training samples, w represents the weight vector of the hyperplane, and b is a constant.
A further development of the invention is that the speech signal is in wav format.
The invention has the further improvement that the Fisher coefficient of each dimension of the acoustic features and of the depth bottleneck features is calculated according to the formula, the Fisher coefficients are ranked from high to low, and only the top 105 dimensions with the largest Fisher coefficients are retained for the acoustic features and the top 100 dimensions for the depth bottleneck features; after feature screening, the acoustic features comprise 105 dimensions and the deep bottleneck features comprise 100 dimensions; the formula is:
F_{d}=\frac{\sum_{k=1}^{K}\left(\mu_{d}^{(k)}-\mu_{d}\right)^{2}}{\sum_{k=1}^{K}\left(\sigma_{d}^{(k)}\right)^{2}}
where μ_d^{(k)} and σ_d^{(k)} represent the mean and standard deviation of the d-th dimension feature within emotion class k, μ_d represents the overall mean of the d-th dimension feature, and K is the number of emotion classes.
The invention has the beneficial effects that:
1. The speech emotion recognition method based on the optimization fusion of depth features and acoustic features has both theoretical research value and practical application value. By extracting the deep bottleneck features and the acoustic features of the speech signal and fusing them, the method overcomes the limitation that a single type of feature cannot comprehensively characterize speech emotion information, so the system performance is greatly improved and a higher recognition rate can be achieved in different language scenarios.
2. The method selects the characteristics with high contribution degree to emotion recognition by adopting the Fisher criterion to screen the characteristics, reduces the redundancy of the characteristics and reduces the overall calculation complexity of the system.
3. The method adopts a genetic algorithm to optimize the fusion of the features; compared with fusing the acoustic features and the deep bottleneck features in a fixed 1:1 proportion, the optimized fusion achieves a higher recognition rate.
Drawings
FIG. 1 is a system block diagram of the present invention as a whole.
Fig. 2 is a DNN network model for extracting deep bottleneck features.
FIG. 3 is a diagram of genetic algorithm optimization process.
FIG. 4 is a diagram illustrating average speech emotion recognition rates based on acoustic features in different feature dimensions when EMO-DB is used.
Fig. 5 is a comparison graph of the recognition rates of the deep bottleneck feature and the fourier coefficient feature.
FIG. 6 is a graph comparing performance of different feature fusion modes.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in detail with reference to the accompanying drawings and specific embodiments.
The invention mainly relates to a speech emotion recognition method based on optimization fusion of depth features and acoustic features. In recent years, deep learning has achieved a lot of achievements in data mining, pattern recognition, natural language processing, multimedia learning, voice, recommendation and personalization technologies and other related fields, and the strong feature extraction and modeling capabilities of the deep learning enable the performance of the pattern recognition to be greatly improved.
As shown in fig. 1, the present invention exploits the powerful feature extraction capability of the DNN: the Fourier coefficient features are taken as the DNN input for training, and the trained DNN can then extract deep bottleneck features. The method first preprocesses each training utterance, then extracts the acoustic features and deep bottleneck features of each frame of the speech signal, calculates their statistical features, and selects both kinds of features with the Fisher criterion. Each class of selected features is given an initial weight representing its contribution to classification recognition; the weighted feature set is fed into a genetic algorithm to optimize the weight combination, the optimized weights are obtained, the final weighted feature set is input into an SVM for training, and the correspondingly processed test data are input to obtain the final recognition result. Cross-validation and comparison experiments show that, compared with using a single type of feature or directly fusing the features, the proposed optimization-fusion-based speech emotion recognition method achieves better system recognition performance.
With the speech emotion recognition method based on optimization fusion, higher speech emotion recognition performance is achieved after the acoustic features and the depth features are fused through genetic-algorithm optimization; the method is suitable for speech emotion recognition in different speech scenarios and further improves the robustness of the speech emotion recognition system. Specific embodiments of the invention are discussed in detail below:
step 1: the input speech signal is preprocessed and the acoustic features of the speech signal are extracted.
1. The speech signal is preprocessed.
Because the voice signal has the short-time stationary characteristic, the voice signal needs to be preprocessed before feature extraction, so that the feature information of the voice signal can be extracted. The pretreatment operation mainly comprises the following steps: pre-emphasis, framing and windowing, and end point detection.
2. Extracting acoustic features from the preprocessed speech signal and then calculating the statistical features of each utterance.
The extracted acoustic features comprise four major categories: Mel-frequency cepstral coefficients (MFCC), fundamental frequency, zero-crossing rate and energy amplitude. For the MFCC, a 24-dimensional coefficient vector and its first-order difference are extracted; the statistical features of each utterance are then calculated, comprising five statistics: maximum, minimum, mean, variance and median. The resulting dimension is 24 × 5 × 2, i.e., the MFCC feature of each utterance has 240 dimensions. The fundamental frequency, zero-crossing rate and energy amplitude of each utterance are each 5-dimensional, so the acoustic features of each utterance contain 255 dimensions in total.
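The sketch below illustrates one way the 255-dimensional acoustic feature vector described above could be computed (24 MFCCs plus first-order differences, pitch, zero-crossing rate and short-time energy, each summarized by five utterance-level statistics). The use of librosa, the pre-emphasis coefficient of 0.97 and the pitch-tracker settings are assumptions for illustration, not details taken from the patent.

```python
# Illustrative sketch, not the patent's implementation.
import numpy as np
import librosa

def utterance_stats(frames):
    """frames: (n_dims, n_frames) -> five statistics per dimension: max, min, median, variance, mean."""
    return np.concatenate([frames.max(axis=1), frames.min(axis=1),
                           np.median(frames, axis=1),
                           frames.var(axis=1), frames.mean(axis=1)])

def acoustic_features(wav_path, sr=16000, frame=256, hop=128):
    y, _ = librosa.load(wav_path, sr=sr)
    y = np.append(y[0], y[1:] - 0.97 * y[:-1])                    # pre-emphasis (assumed 0.97)

    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=24, n_fft=frame, hop_length=hop)
    d_mfcc = librosa.feature.delta(mfcc)                          # first-order difference
    zcr = librosa.feature.zero_crossing_rate(y, frame_length=frame, hop_length=hop)
    energy = np.array([np.sum(f ** 2) for f in
                       librosa.util.frame(y, frame_length=frame, hop_length=hop).T])[None, :]
    f0 = librosa.yin(y, fmin=75, fmax=400, sr=sr,
                     frame_length=4 * frame, hop_length=hop)[None, :]   # pitch contour (assumed tracker)

    # 24*5*2 = 240 MFCC statistics plus 5 each for pitch, ZCR and energy -> 255 dimensions
    return np.concatenate([utterance_stats(mfcc), utterance_stats(d_mfcc),
                           utterance_stats(f0), utterance_stats(zcr),
                           utterance_stats(energy)])
```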
Step 2: Extracting the Fourier coefficient features of the speech signal and using them as the DNN input for training; the trained DNN can then be used to extract the deep bottleneck features of the speech signal.
1. Calculating the Fourier coefficients of the preprocessed speech signals.
A fast Fourier transform is applied to the preprocessed speech signal to obtain the harmonic coefficients, and the modulus of each harmonic coefficient is calculated to obtain the Fourier coefficients.
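As a concrete illustration of this step, the short sketch below (a minimal example under assumed settings, not the patent's code) frames the pre-processed signal, applies a window (a Hamming window is assumed) and keeps the magnitude of every FFT coefficient as the frame-level Fourier coefficient feature.

```python
import numpy as np

def fourier_coefficients(y, frame=256, hop=128):
    frames = np.lib.stride_tricks.sliding_window_view(y, frame)[::hop]  # (n_frames, 256)
    frames = frames * np.hamming(frame)                                 # windowing (assumed Hamming)
    return np.abs(np.fft.rfft(frames, axis=1))                          # magnitude of each coefficient
```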
2. Taking Fourier coefficients as DNN input, carrying out unsupervised pre-training on the DNN, and then carrying out supervised fine tuning.
Each layer of the DNN is implemented by a restricted Boltzmann machine (RBM). In the invention, the momentum parameter in the DNN training stage is set to 0.9, the learning rate to 0.005, the batch size to 2, and the number of iterations to 50. In the training process, the training data is first used as the input of the first RBM; after the first-layer network is trained, the output of its hidden layer is used as the visible-layer input of the next RBM, which is then trained as the second-layer RBM. By analogy, the hidden-layer output of each layer is used as the visible-layer input of the next layer, so that all RBMs are pre-trained. After all RBMs are trained, they are stacked according to their hierarchical relationship to obtain a trained multilayer network structure.
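The following is a generic, textbook sketch of the greedy layer-wise RBM pre-training just described, using CD-1 contrastive divergence with the stated momentum (0.9), learning rate (0.005), batch size (2) and 50 iterations. It only illustrates the procedure and is not the patent's implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class RBM:
    def __init__(self, n_vis, n_hid, lr=0.005, momentum=0.9, seed=0):
        self.rng = np.random.default_rng(seed)
        self.W = 0.01 * self.rng.standard_normal((n_vis, n_hid))
        self.bv, self.bh = np.zeros(n_vis), np.zeros(n_hid)
        self.lr, self.m = lr, momentum
        self.vel = [np.zeros_like(p) for p in (self.W, self.bv, self.bh)]

    def hidden(self, v):
        return sigmoid(v @ self.W + self.bh)

    def cd1(self, v0):
        """One CD-1 update on a mini-batch v0 of shape (batch, n_vis)."""
        h0 = self.hidden(v0)
        h_sample = (h0 > self.rng.random(h0.shape)).astype(float)
        v1 = sigmoid(h_sample @ self.W.T + self.bv)               # reconstruction
        h1 = self.hidden(v1)
        grads = [v0.T @ h0 - v1.T @ h1, (v0 - v1).sum(0), (h0 - h1).sum(0)]
        for i, (p, g) in enumerate(zip((self.W, self.bv, self.bh), grads)):
            self.vel[i] = self.m * self.vel[i] + self.lr * g / len(v0)
            p += self.vel[i]                                      # in-place parameter update

def pretrain_stack(data, hidden_sizes, epochs=50, batch=2):
    """Greedy layer-wise pre-training: each RBM's hidden activations feed the next RBM."""
    rbms, x = [], data
    for n_hid in hidden_sizes:
        rbm = RBM(x.shape[1], n_hid)
        for _ in range(epochs):
            for i in range(0, len(x), batch):
                rbm.cd1(x[i:i + batch])
        rbms.append(rbm)
        x = rbm.hidden(x)
    return rbms
```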
After pre-training, fine-tuning is carried out with a back-propagation (BP) based algorithm: the parameters of each pre-trained layer are used as the initialization parameters of the DNN, and a softmax output layer is added to form the complete DNN. Each output node of the DNN corresponds to one class, which serves the purpose of supervised training. When adjusting the parameters with the BP algorithm, the cross-entropy (CE) function is adopted, and the model parameters are estimated by minimizing the cost function, whose objective is:
J(\theta)=-\sum_{j=1}^{N}y(j)\log y'(j) \qquad (1)
where θ represents the network parameters, N represents the number of emotion categories, y(j) represents the output of node j, and y'(j) represents the corresponding output probability. Thus, the training of the DNN model is completed.
3. Acquiring the deep bottleneck features.
The Fourier coefficients of all training speech signals are fed into the trained DNN again, and the output of the third layer of the DNN, namely the output of the bottleneck layer, is obtained; this output is the deep bottleneck feature of each frame of the speech signal. The DNN model for extracting the deep bottleneck features is shown in fig. 2, and the network structure is 1280-1280-100-1280-7.
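A hedged PyTorch sketch of the fine-tuning and bottleneck-extraction stage for the 1280-1280-100-1280-7 structure is given below: the network is trained with a cross-entropy loss and the 100-unit bottleneck activations are then read out as the frame-level deep feature. The sigmoid activations, the SGD settings and the (omitted) copying of RBM-pretrained weights into the network are assumptions, not details specified by the patent.

```python
import torch
import torch.nn as nn

class BottleneckDNN(nn.Module):
    def __init__(self, n_in=1280, n_bottleneck=100, n_classes=7):
        super().__init__()
        self.front = nn.Sequential(nn.Linear(n_in, 1280), nn.Sigmoid(),
                                   nn.Linear(1280, n_bottleneck), nn.Sigmoid())
        self.back = nn.Sequential(nn.Linear(n_bottleneck, 1280), nn.Sigmoid(),
                                  nn.Linear(1280, n_classes))

    def forward(self, x):
        return self.back(self.front(x))       # logits over the 7 emotion classes

    def bottleneck(self, x):
        return self.front(x)                  # 100-dimensional deep bottleneck feature

def finetune(net, loader, epochs=50, lr=0.005, momentum=0.9):
    opt = torch.optim.SGD(net.parameters(), lr=lr, momentum=momentum)
    loss_fn = nn.CrossEntropyLoss()           # cross-entropy objective, cf. formula (1)
    for _ in range(epochs):
        for x, y in loader:
            opt.zero_grad()
            loss_fn(net(x), y).backward()
            opt.step()
    return net

# After fine-tuning, frame-level bottleneck features could be read out as, e.g.:
#   feats = net.bottleneck(torch.as_tensor(fourier_frames, dtype=torch.float32))
```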
4. Calculating the statistical features of the frame-level deep bottleneck features of each training utterance.
Calculating a statistical value of the deep bottleneck characteristic of each voice signal, wherein statistical variables comprise 5 types: maximum, minimum, mean, variance, and median.
Step 3: Screening the extracted acoustic features and depth features using the Fisher criterion.
The Fisher coefficient of each dimension of the acoustic features and of the depth bottleneck features is calculated according to formula (2), and the Fisher coefficients are ranked from high to low. Only the top 105 dimensions with the largest Fisher coefficients are retained for the acoustic features, and the top 100 dimensions for the depth bottleneck features. Therefore, after feature screening, the acoustic features comprise 105 dimensions and the deep bottleneck features comprise 100 dimensions.
F_{d}=\frac{\sum_{k=1}^{K}\left(\mu_{d}^{(k)}-\mu_{d}\right)^{2}}{\sum_{k=1}^{K}\left(\sigma_{d}^{(k)}\right)^{2}} \qquad (2)
where μ_d^{(k)} and σ_d^{(k)} represent the mean and standard deviation of the d-th dimension feature within emotion class k, μ_d represents the overall mean of the d-th dimension feature, and K is the number of emotion classes.
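A small sketch of this screening step could look as follows: each dimension is scored with a class-wise between/within Fisher ratio (consistent with formula (2) as reconstructed above, which is an assumption since the original equation is published only as an image), and the k highest-scoring dimensions are kept (k = 105 for the acoustic features and 100 for the bottleneck features in this embodiment).

```python
import numpy as np

def fisher_scores(X, y):
    """X: (n_samples, n_dims) feature matrix, y: integer emotion labels."""
    overall_mean = X.mean(axis=0)
    between = np.zeros(X.shape[1])
    within = np.zeros(X.shape[1])
    for c in np.unique(y):
        Xc = X[y == c]
        between += (Xc.mean(axis=0) - overall_mean) ** 2
        within += Xc.std(axis=0) ** 2
    return between / (within + 1e-12)          # epsilon guards against zero variance

def select_top(X, y, k):
    idx = np.argsort(fisher_scores(X, y))[::-1][:k]   # indices of the k largest scores
    return X[:, idx], idx
```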
Step 4: Optimizing and fusing the acoustic features and the deep bottleneck features with a genetic algorithm; the overall flow is shown in fig. 3.
1. Notation
The acoustic features comprise four kinds of features, namely MFCC, zero-crossing rate, energy amplitude and pitch frequency; after feature selection their dimensionality is 105. The MFCC, short-time energy, zero-crossing rate, pitch frequency and deep bottleneck features are denoted {x1, x2, x3, x4, x5}, respectively, and each kind of feature is assigned an initial weight, denoted {w1, w2, w3, w4, w5}.
2. Starting the genetic algorithm for weight optimization
The weighted fusion of the initial weights and the features, namely {w1*x1, w2*x2, w3*x3, w4*x4, w5*x5}, is taken as the genetic algorithm input. The genetic algorithm is initialized with the recognition rate as its objective function, the number of iterations set to 500, the population size to 50, and the mutation rate and crossover probability set to 20% and 80%, respectively. The genetic algorithm is then started to optimize the fusion weights. The specific steps are as follows:
A. initializing the weight, carrying out binary coding on the weight combination, and generating an initial population.
B. Decoding yields a weight combination, and the features are combined in a weighted manner. The combined features are fed into a support vector machine for training, and the speech emotion recognition result obtained by the support vector machine is taken as the fitness function; individuals with higher fitness are more likely to be retained.
C. The selection operation is carried out: following survival of the fittest according to the fitness function, excellent individuals (one group of weights represents one individual) are selected from the population as parents to generate a new population.
D. The crossover operation is carried out: a pair of individuals is randomly selected from the population, and some of their genes are exchanged to form new individuals.
E. For each individual in the population, the gene of the individual is changed with a certain mutation probability to form a new individual to be added into the population.
F. The weights are decoded and fitness values are calculated. Meanwhile, comparing the speech emotion recognition rates of the offspring and the parent to update the best individual.
G. It is checked whether the number of iterations or the fitness value fulfils a termination condition. If not, repeating steps C to F. If the condition is satisfied, go to step H.
H. And outputting the optimal weight combination.
3. The weight optimization result output by the genetic algorithm is obtained and stored as {w1', w2', w3', w4', w5'}.
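The following simplified sketch mirrors steps A to H above: binary-coded weight chromosomes, fitness given by a caller-supplied `fitness` function (for example the recognition rate of an SVM trained on the weighted features and scored on held-out data), fitness-proportional selection, single-point crossover with probability 0.8 and bit-flip mutation with probability 0.2. The 8-bit encoding, the roulette selection scheme and the per-individual mutation granularity are assumptions; the iteration count and population size follow the embodiment.

```python
import numpy as np

BITS = 8                                        # bits per weight (assumed)

def decode(chrom, n_weights=5):
    """Binary chromosome -> weight vector with values in [0, 1)."""
    genes = chrom.reshape(n_weights, BITS)
    return genes @ (2.0 ** -np.arange(1, BITS + 1))

def ga_optimize(fitness, n_weights=5, pop_size=50, iters=500,
                p_cross=0.8, p_mut=0.2, seed=0):
    rng = np.random.default_rng(seed)
    pop = rng.integers(0, 2, (pop_size, n_weights * BITS))
    best_w, best_f = None, -np.inf
    for _ in range(iters):
        fits = np.array([fitness(decode(c)) for c in pop])        # recognition rate per individual
        if fits.max() > best_f:
            best_f, best_w = fits.max(), decode(pop[fits.argmax()])
        probs = (fits + 1e-9) / (fits + 1e-9).sum()
        parents = pop[rng.choice(pop_size, pop_size, p=probs)]    # fitness-proportional selection
        for i in range(0, pop_size - 1, 2):                       # single-point crossover
            if rng.random() < p_cross:
                cut = rng.integers(1, n_weights * BITS)
                parents[i, cut:], parents[i + 1, cut:] = (parents[i + 1, cut:].copy(),
                                                          parents[i, cut:].copy())
        for i in range(pop_size):                                 # bit-flip mutation
            if rng.random() < p_mut:
                j = rng.integers(n_weights * BITS)
                parents[i, j] ^= 1
        pop = parents
    return best_w, best_f
```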
Step 5: Combining the test data according to the optimization result to obtain the fused test feature set, taking it as the input of a support vector machine (SVM), training the SVM, and completing the final speech emotion recognition with the trained SVM.
1. Inputting the fused feature set to train the SVM.
According to the optimization result of step 4, the test data are fused according to the weight combination to obtain the feature set given by formula (3), which is input into the SVM for training.
T={w1'*x1,w2'*x2,w3'*x3,w4'*x4,w5'*x5} (3)
The training solves the objective function of the SVM optimal hyperplane as follows:
\min_{w,b,\xi}\ \frac{1}{2}\|w\|^{2}+C\sum_{i=1}^{N}\xi_{i}
\text{s.t.}\ \ y_{i}\left(w^{T}x_{i}+b\right)\ge 1-\xi_{i},\ \ \xi_{i}\ge 0,\ \ i=1,2,\ldots,N \qquad (4)
wherein C represents a penalty coefficient that controls the penalty for misclassified samples and balances the complexity of the model against the loss error, ξ_i represents the relaxation (slack) factor, N represents the number of training samples, w represents the weight vector of the hyperplane, and b is a constant.
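A minimal sketch of this final stage, assuming scikit-learn and an RBF kernel (the kernel and the value of C are not fixed by the patent), weights each feature group by the GA result, concatenates the groups and trains and evaluates the SVM:

```python
import numpy as np
from sklearn.svm import SVC

def fuse(groups, weights):
    """groups: [MFCC, short-time energy, ZCR, pitch, bottleneck] arrays of shape (n, d_i)."""
    return np.hstack([w * g for w, g in zip(weights, groups)])

def train_and_score(train_groups, y_train, test_groups, y_test, weights, C=1.0):
    clf = SVC(C=C, kernel="rbf")                                  # kernel/C are assumptions
    clf.fit(fuse(train_groups, weights), y_train)
    return clf.score(fuse(test_groups, weights), y_test)          # recognition rate
```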
2. Performing performance evaluation with the trained SVM.
The corpus used in the experiments is the German Berlin emotional speech database (EMO-DB), which contains 535 utterances in total; 5 actors and 5 actresses read 10 different German texts with different emotions, covering 7 emotion classes: anger, happiness, sadness, fear, disgust, neutral and boredom. The speech sampling frequency is 16 kHz with 16-bit quantization; during framing, the frame length is set to 256 samples and the frame shift to 128 samples. The experimental environment is a 64-bit Windows 7 operating system with 4 GB of memory. In the experiments, all utterances in the corpus are randomly and evenly divided into 10 parts, of which 8 are used for training and 2 for testing; the experiment is repeated 5 times and the average of the 5 runs is taken as the final recognition result.
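This evaluation protocol can be sketched as follows, where `evaluate_once` is a placeholder for the complete pipeline described above (feature extraction, Fisher selection, GA fusion and SVM training); using scikit-learn's ShuffleSplit is an assumption that reproduces the random 80/20 split repeated five times.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

def average_recognition_rate(X, y, evaluate_once, repeats=5, seed=0):
    splitter = ShuffleSplit(n_splits=repeats, test_size=0.2, random_state=seed)
    rates = [evaluate_once(X[tr], y[tr], X[te], y[te]) for tr, te in splitter.split(X)]
    return float(np.mean(rates))                                  # average over the 5 runs
```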
First, to determine how many feature dimensions should be retained when screening the acoustic features, performance is evaluated for different feature dimensions after screening, with the average recognition rate as the evaluation index and the SVM as the classifier. As shown in fig. 4, the acoustic features achieve optimal performance when 105 dimensions are retained. To verify the effectiveness of the feature selection method, the recognition results of the acoustic features with and without Fisher-criterion feature selection are also compared; as shown in Table 1, feature selection improves the emotion recognition rate by 3.88%, which demonstrates that Fisher-criterion feature selection can improve speech emotion recognition performance.
Table 1. Speech emotion recognition rate (%) of the acoustic features
To obtain the best-performing deep bottleneck feature, the method compares the outputs of different network layers, screens each output with the Fisher criterion, and finally feeds the selected features into the SVM for training to obtain recognition results. The network structures compared are 1280-100-1280-1280-7, 1280-1280-100-1280-7 and 1280-1280-1280-100-7. From Table 2 it can be seen that when the bottleneck layer is the third layer, the speech emotion recognition performance reaches its highest level, 72.22%, so this network structure is adopted in the subsequent experiments. Fig. 5 compares the performance of three features, namely the Fourier coefficient feature fed into the DNN, the DNN bottleneck-layer feature, and the bottleneck-layer feature after feature selection; the experimental results verify that the bottleneck-layer feature after feature selection achieves the best performance.
Table 2. Speech emotion recognition rate (%) when the bottleneck layer is at different positions in the network
Finally, to verify the effectiveness of the proposed feature fusion algorithm, the performance of the genetic-algorithm-based fusion is compared with that of ordinary direct fusion. As shown in fig. 6, the average speech emotion recognition rate is 80.87% with ordinary fusion, whereas the genetic-algorithm-based feature fusion method reaches an average recognition rate of 84.22% over the 7 emotion classes, showing that the proposed method further improves the speech emotion recognition rate.
The above results show that, compared with common 1:1 feature fusion, the speech emotion recognition method based on the optimization fusion of depth features and acoustic features can further improve speech emotion recognition performance.
Although the present invention has been described in detail with reference to the preferred embodiments, it will be understood by those skilled in the art that various changes may be made and equivalents may be substituted for elements thereof without departing from the spirit and scope of the present invention.

Claims (5)

1. A speech emotion recognition method based on depth feature and acoustic feature optimization fusion is characterized by comprising the following steps:
step 1, inputting a voice signal in a corpus, preprocessing the voice signal and extracting acoustic features of the voice signal; the method specifically comprises the following steps:
step 1-1: sampling a time domain continuous voice signal input by each sentence, and then preprocessing the voice signal by adopting pre-emphasis, framing and windowing and end point detection technologies to obtain a preprocessed signal;
step 1-2: calculating the acoustic characteristics of the preprocessed voice signal, wherein the acoustic characteristics comprise MFCC, fundamental tone frequency, zero crossing rate and short-time energy;
after extracting the frame-level MFCC, calculating the statistical characteristics of each voice, wherein the statistical characteristics comprise a maximum value, a minimum value, a median value, a variance and a mean value, and the finally obtained statistical characteristics are the MFCC characteristics of each voice;
step 2, extracting Fourier coefficient features of the voice signals, inputting the Fourier coefficient features as DNN, and training the DNN for extracting deep bottleneck features of the voice signals;
step 2-1: calculating Fourier coefficient characteristics of the preprocessed voice signals, and taking the obtained Fourier coefficient characteristics as the input of DNN;
step 2-2: firstly, carrying out unsupervised pre-training on DNN, and then introducing supervised error back propagation to carry out parameter fine tuning to obtain a trained DNN model;
step 2-3: re-inputting all training voice signals into the DNN after training, and obtaining the output of the DNN on the third layer, namely the output of the bottleneck layer, wherein the output is the deep bottleneck layer characteristic of each frame of voice signals;
step 2-4: calculating the statistical characteristics of the frame-level deep bottleneck layer characteristics of each training voice, wherein the obtained characteristics are the deep bottleneck characteristics of each voice, and the statistical characteristics comprise a maximum value, a minimum value, a mean value, a variance and a median value;
step 3, performing feature selection on the extracted acoustic features and the extracted depth bottleneck features by adopting a Fisher criterion, reducing feature redundancy, and obtaining high-quality features with high emotion discrimination; the method comprises the following specific steps:
calculating the Fisher coefficient of each dimension of the acoustic features and of the depth bottleneck features according to the formula, ranking the Fisher coefficients from high to low, retaining only the top 105 dimensions with the largest Fisher coefficients for the acoustic features and the top 100 dimensions for the depth bottleneck features; after feature screening, the acoustic features comprise 105 dimensions and the deep bottleneck features comprise 100 dimensions; the formula is:
F_{d}=\frac{\sum_{k=1}^{K}\left(\mu_{d}^{(k)}-\mu_{d}\right)^{2}}{\sum_{k=1}^{K}\left(\sigma_{d}^{(k)}\right)^{2}}
wherein μ_d^{(k)} and σ_d^{(k)} represent the mean value and the standard deviation of the d-th dimension feature within emotion class k, μ_d represents the overall mean of the d-th dimension feature, and K represents the number of emotion classes;
step 4, optimizing and fusing acoustic features and deep bottleneck features by adopting a genetic algorithm, wherein the acoustic features represent physical layer information of emotion information, the deep bottleneck features represent information highly related to emotion classification label information, and the acoustic features and the deep bottleneck features are fused to improve the speech emotion recognition effect; the method specifically comprises the following steps:
step 4-1: optimizing and fusing the depth bottleneck characteristic and the acoustic characteristic after characteristic selection by adopting a genetic algorithm, respectively marking MFCC, short-time energy, zero crossing rate, pitch frequency and depth bottleneck characteristic in the acoustic characteristic as { x1, x2, x3, x4, x5}, and endowing an initial weight value for each type of characteristic as { w1, w2, w3, w4, w5};
step 4-2: taking the weighted fusion of the initial weights and the features, namely {w1*x1, w2*x2, w3*x3, w4*x4, w5*x5}, as the genetic algorithm input, initializing the genetic algorithm, setting the objective function of the genetic algorithm to the recognition rate, and starting the genetic algorithm to optimize the fusion weights; specifically, the method comprises the following steps:
A. initializing weights, carrying out binary coding on weight combinations, and generating an initial population;
B. decoding to obtain a weight combination, combining the characteristics in a weighting mode, importing the combined characteristics into a support vector machine for training, and taking a speech emotion recognition result obtained by the support vector machine as a fitness function;
C. performing selection: following survival of the fittest according to the fitness function, selecting excellent individuals from the population as parents, and generating a new population;
D. performing crossover: randomly selecting a pair of individuals from the population, and exchanging some of their genes to form new individuals;
E. for each individual in the population, changing the gene of the individual with a certain mutation probability to form a new individual to be added into the population;
F. decoding the weights and calculating a fitness value, and meanwhile, comparing the speech emotion recognition rates of the offspring and the parent to update the optimal individual;
G. checking whether the iteration number or the fitness value meets a termination condition: if not, repeating the steps C to F; if the condition is met, go to step H;
H. outputting the optimal weight combination;
step 4-3: outputting and storing a weight value optimizing result by a genetic algorithm, and performing weighted fusion on the acoustic characteristic and the depth bottleneck characteristic by using the weight value as a fusion weight value of test and training SVM data;
and 5, combining the test data according to the optimization results to obtain a fused test feature set, using the fused test feature set as the input of a Support Vector Machine (SVM), training the SVM, using the trained SVM for realizing speech emotion recognition, and evaluating the performance of the provided speech emotion recognition method based on optimization fusion.
2. The emotion speech recognition method of claim 1, wherein: the step 3 comprises the following steps:
step 3-1: respectively calculating Fisher values of each dimension characteristic in the acoustic characteristic and the depth bottleneck characteristic by adopting a Fisher criterion according to the acoustic characteristic and the depth bottleneck characteristic obtained in the step 1 and the step 2;
step 3-2: and (4) respectively sequencing Fisher values obtained from the deep bottleneck characteristics and the acoustic characteristics in the step (3-1), and deleting the deep bottleneck characteristics and the acoustic characteristics of which the Fisher values are lower than a threshold value P to finish the characteristic selection process.
3. The emotion speech recognition method of claim 1, wherein: the step 5 comprises the following steps:
step 5-1: according to the weight combination obtained by genetic algorithm optimization in the step 4, extracting the acoustic features and the depth bottleneck features of the test data, and performing weighted fusion on the weight combination;
step 5-2: and applying the feature set obtained by fusion to SVM training, and realizing speech emotion recognition by the SVM obtained by training.
4. The speech emotion recognition method of claim 1, wherein: according to the optimization result of step 4, the test data are fused according to the weight combination to obtain a feature set as given by the following formula, and the feature set is input into the SVM for training; the formula is:
T = {w1'*x1, w2'*x2, w3'*x3, w4'*x4, w5'*x5}
the training solves the objective function of the SVM optimal hyperplane as follows:
\min_{w,b,\xi}\ \frac{1}{2}\|w\|^{2}+C\sum_{i=1}^{N}\xi_{i}
\text{s.t.}\ \ y_{i}\left(w^{T}x_{i}+b\right)\ge 1-\xi_{i},\ \ \xi_{i}\ge 0,\ \ i=1,2,\ldots,N
wherein C represents a penalty coefficient that controls the penalty for misclassified samples and balances the complexity of the model against the loss error, ξ_i represents the relaxation factor, N represents the number of training samples, w represents the weight vector of the hyperplane, and b is a constant.
5. The speech emotion recognition method of claim 1, wherein: the voice signal is in wav format.
CN202010855013.2A 2020-08-24 2020-08-24 Speech emotion recognition method based on optimization fusion of depth features and acoustic features Active CN111899766B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010855013.2A CN111899766B (en) 2020-08-24 2020-08-24 Speech emotion recognition method based on optimization fusion of depth features and acoustic features

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010855013.2A CN111899766B (en) 2020-08-24 2020-08-24 Speech emotion recognition method based on optimization fusion of depth features and acoustic features

Publications (2)

Publication Number Publication Date
CN111899766A CN111899766A (en) 2020-11-06
CN111899766B true CN111899766B (en) 2023-04-14

Family

ID=73224301

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010855013.2A Active CN111899766B (en) 2020-08-24 2020-08-24 Speech emotion recognition method based on optimization fusion of depth features and acoustic features

Country Status (1)

Country Link
CN (1) CN111899766B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112466336B (en) * 2020-11-19 2023-05-05 平安科技(深圳)有限公司 Emotion recognition method, device, equipment and storage medium based on voice
CN113593526A (en) * 2021-07-27 2021-11-02 哈尔滨理工大学 Speech emotion recognition method based on deep learning
CN116072154B (en) * 2023-03-07 2023-07-18 华南师范大学 Speech emotion recognition method, device and equipment based on data enhancement

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110047516A (en) * 2019-03-12 2019-07-23 天津大学 A kind of speech-emotion recognition method based on gender perception

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110047516A (en) * 2019-03-12 2019-07-23 天津大学 A kind of speech-emotion recognition method based on gender perception

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
A feature selection and feature fusion combination method for speaker-independent speech emotion recognition; Yun Jin et al.; 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP); 2014; full text *
Decision tree SVM model with Fisher feature selection for speech emotion recognition; Linhui Sun et al.; EURASIP Journal on Audio, Speech, and Music Processing; 2019; full text *
Feature weighting and SVM parameters optimization based on genetic algorithms for classification problem; Anh Viet Phan et al.; Springer Science+Business Media New York; 2016; full text *

Also Published As

Publication number Publication date
CN111899766A (en) 2020-11-06

Similar Documents

Publication Publication Date Title
Chatziagapi et al. Data Augmentation Using GANs for Speech Emotion Recognition.
CN108597541B (en) Speech emotion recognition method and system for enhancing anger and happiness recognition
CN111899766B (en) Speech emotion recognition method based on optimization fusion of depth features and acoustic features
CN110853680B (en) double-BiLSTM speech emotion recognition method with multi-input multi-fusion strategy
CN110853656B (en) Audio tampering identification method based on improved neural network
CN112861984B (en) Speech emotion classification method based on feature fusion and ensemble learning
Deng et al. Foundations and trends in signal processing: Deep learning–methods and applications
Li et al. Learning fine-grained cross modality excitement for speech emotion recognition
CN112417894A (en) Conversation intention identification method and system based on multi-task learning
CN110349597A (en) A kind of speech detection method and device
CN114911932A (en) Heterogeneous graph structure multi-conversation person emotion analysis method based on theme semantic enhancement
CN109308316B (en) Adaptive dialog generation system based on topic clustering
Sun et al. Speech emotion recognition based on genetic algorithm–decision tree fusion of deep and acoustic features
CN111916066A (en) Random forest based voice tone recognition method and system
CN115394287A (en) Mixed language voice recognition method, device, system and storage medium
CN116524960A (en) Speech emotion recognition system based on mixed entropy downsampling and integrated classifier
CN114898779A (en) Multi-mode fused speech emotion recognition method and system
Zhao et al. Knowledge-aware bayesian co-attention for multimodal emotion recognition
Li et al. Emotion recognition from speech with StarGAN and Dense‐DCNN
CN112434512A (en) New word determining method and device in combination with context
CN111368524A (en) Microblog viewpoint sentence recognition method based on self-attention bidirectional GRU and SVM
Liu et al. Hierarchical component-attention based speaker turn embedding for emotion recognition
CN112951270B (en) Voice fluency detection method and device and electronic equipment
CN117291193A (en) Machine translation method, apparatus and storage medium
CN114898776A (en) Voice emotion recognition method of multi-scale feature combined multi-task CNN decision tree

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant