CN110956981B - Speech emotion recognition method, device, equipment and storage medium - Google Patents

Speech emotion recognition method, device, equipment and storage medium

Info

Publication number
CN110956981B
Authority
CN
China
Prior art keywords
feature
preset
data
voice
emotion
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201911246544.5A
Other languages
Chinese (zh)
Other versions
CN110956981A
Inventor
孙亚新
叶青
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hubei University of Arts and Science
Original Assignee
Hubei University of Arts and Science
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hubei University of Arts and Science filed Critical Hubei University of Arts and Science
Priority to CN201911246544.5A
Publication of CN110956981A
Application granted
Publication of CN110956981B


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/63 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Computational Linguistics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • General Health & Medical Sciences (AREA)
  • Child & Adolescent Psychology (AREA)
  • Psychiatry (AREA)
  • Hospice & Palliative Care (AREA)
  • Signal Processing (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Machine Translation (AREA)

Abstract

The invention belongs to the technical field of speech signal processing and pattern recognition, and discloses a speech emotion recognition method, apparatus, device and storage medium. The method comprises the following steps: acquiring a test voice sample with a preset dimension, and segmenting the test voice sample according to a preset rule to obtain a plurality of initial voice samples; extracting signal feature data from the initial voice samples to obtain voice signal feature data to be processed; performing feature statistics on the voice signal feature data to be processed through a preset statistical function to obtain a feature statistical result to be confirmed; obtaining feature target data through a preset multi-objective optimization algorithm according to the feature statistical result to be confirmed; and inputting the feature target data into a preset Softmax classification model to obtain a speech emotion recognition result. In this way, the speech emotion segments are turned into speech emotion data and input into the preset Softmax classification model, so that speech emotion can be recognized more accurately.

Description

Speech emotion recognition method, device, equipment and storage medium
Technical Field
The present invention relates to the field of speech signal processing and pattern recognition technologies, and in particular, to a speech emotion recognition method, apparatus, device, and storage medium.
Background
There are many speech emotion recognition methods, but they overlook the fact that human emotional expression in speech is short-term and local. For example, the first half of a sentence may be calm while one word is angry, and the whole sentence is then labelled as angry. This leads to several problems. First, recognizing emotion from the whole sentence often dilutes the feature changes that carry the emotion. In a sentence such as "We are going to Beijing tomorrow, do you think that is feasible?", it is usually the second half that shows the large emotional difference, so mean pooling over time and the convolutional and fully connected layers applied to all features in deep learning dilute the emotional feature changes. Second, when local parts are combined into a sentence, the emotional feature changes are often neutralized. As is well known, Chinese has four tones, and the second and fourth tones have completely opposite characteristics in their variation over time; mean pooling over time, time-series attention layers and similar operations in deep learning therefore neutralize the emotional feature changes over time. Third, the positions of the emotion-carrying words in a sentence are not fixed, which causes large differences in the features of the same emotion. For example, "Is this work feasible? It is!" and "Feasible? It is!" express the same meaning, but the output features of existing convolutional neural networks for them are completely different.
The above is only for the purpose of assisting understanding of the technical aspects of the present invention, and does not represent an admission that the above is prior art.
Disclosure of Invention
The invention mainly aims to provide a speech emotion recognition method, a speech emotion recognition device, speech emotion recognition equipment and a storage medium, and aims to solve the technical problem of how to accurately recognize speech emotion.
In order to achieve the above object, the present invention provides a speech emotion recognition method, which comprises the following steps:
acquiring a test voice sample with a preset dimension, and performing segmentation processing on the test voice sample through a preset rule to obtain a plurality of initial voice samples;
extracting signal characteristic data of the initial voice sample to obtain voice signal characteristic data to be processed;
performing feature statistics on the voice signal feature data to be processed through a preset statistical function to obtain a feature statistical result to be confirmed;
obtaining feature target data through a preset multi-target optimization algorithm according to the feature statistical result to be confirmed;
and inputting the characteristic target data into a preset Softmax classification model to obtain a speech emotion recognition result.
Preferably, before the step of obtaining a test speech sample with a preset dimension and performing segmentation processing on the test speech sample through a preset rule to obtain a plurality of initial speech samples, the method further includes:
acquiring training voice samples with a preset dimension, and performing segmentation processing on the training voice samples through a preset rule to obtain a plurality of initial training voice samples;
performing feature extraction on the initial training voice sample to obtain the features of a training voice signal to be processed;
performing feature statistics on the training voice signal features to be processed through a preset statistical function to obtain a training feature statistical result to be confirmed;
obtaining target training characteristic data through a preset multi-objective optimization algorithm according to the statistical result of the training characteristics to be confirmed;
acquiring emotion types corresponding to the target training characteristic data according to the target training characteristic data;
and establishing a preset Softmax classification model according to the emotion classes and target training characteristic data corresponding to the emotion classes.
Preferably, the step of obtaining target training feature data through a preset multi-objective optimization algorithm according to the statistical result of the training features to be confirmed includes:
carrying out emotion category division on the statistical result of the training features to be confirmed to obtain training feature data to be optimized corresponding to different emotion categories;
and obtaining target training characteristic data through a preset multi-target optimization algorithm according to the training characteristic data to be optimized.
Preferably, the step of inputting the feature target data into a preset Softmax classification model to obtain a speech emotion recognition result includes:
inputting the characteristic target data into the preset Softmax classification model to obtain speech emotion category data;
performing data statistics on the voice emotion type data to obtain a voice emotion type data value;
and obtaining a voice emotion recognition result according to the voice emotion category data value.
Preferably, the step of obtaining the speech emotion recognition result according to the speech emotion category data value includes:
judging whether the voice emotion type data value belongs to a preset voice emotion type threshold range or not;
and if the voice emotion type data value belongs to the preset voice emotion type threshold range, acquiring a voice emotion recognition result according to the voice emotion type data value.
Preferably, after the step of determining whether the speech emotion category data value belongs to a preset speech emotion category threshold range, the method further includes:
and if the voice emotion category data value does not belong to the preset voice emotion category threshold range, returning to the step of inputting the feature target data into the preset Softmax classification model to obtain voice emotion category data.
Preferably, the step of performing feature statistics on the to-be-processed speech signal feature data through a preset statistical function to obtain a to-be-confirmed feature statistical result includes:
screening the voice signal characteristic data to be processed to obtain label sample characteristic data;
and carrying out feature statistics on the tag sample feature data through a preset statistical function to obtain a feature statistical result to be confirmed.
In addition, to achieve the above object, the present invention further provides a speech emotion recognition apparatus, including: the system comprises an acquisition module, a processing module and a control module, wherein the acquisition module is used for acquiring a test voice sample with a preset dimensionality and performing segmentation processing on the test voice sample through a preset rule to obtain a plurality of initial voice samples;
the extraction module is used for extracting signal characteristic data of the initial voice sample to obtain voice signal characteristic data to be processed;
the statistical module is used for carrying out feature statistics on the voice signal feature data to be processed through a preset statistical function to obtain a feature statistical result to be confirmed;
the calculation module is used for obtaining feature target data through a preset multi-target optimization algorithm according to the feature statistical result to be confirmed;
and the determining module is used for inputting the characteristic target data into a preset Softmax classification model to obtain a speech emotion recognition result.
In addition, to achieve the above object, the present invention also provides an electronic device, including: a memory, a processor and a speech emotion recognition program stored on the memory and executable on the processor, the speech emotion recognition program being configured to implement the steps of the speech emotion recognition method as described in any of the above.
In addition, to achieve the above object, the present invention further provides a storage medium, on which a speech emotion recognition program is stored, and the speech emotion recognition program implements the steps of the speech emotion recognition method as described in any one of the above when executed by a processor.
The method comprises the steps of firstly obtaining a test voice sample with a preset dimension and segmenting it according to a preset rule to obtain a plurality of initial voice samples; then extracting signal feature data from the initial voice samples to obtain voice signal feature data to be processed, screening the voice signal feature data to be processed to obtain tag sample feature data, and performing feature statistics on the tag sample feature data through a preset statistical function to obtain a feature statistical result to be confirmed; then obtaining feature target data through a preset multi-objective optimization algorithm according to the feature statistical result to be confirmed; and finally inputting the feature target data into a preset Softmax classification model to obtain a speech emotion recognition result. In this way, the speech emotion segments and the emotional relationship between sentences and segments can be fully utilized and converted into speech emotion data, so that the speech emotion recognition effect is improved.
Drawings
Fig. 1 is a schematic structural diagram of an electronic device in a hardware operating environment according to an embodiment of the present invention;
FIG. 2 is a flowchart illustrating a first embodiment of a speech emotion recognition method according to the present invention;
FIG. 3 is a flowchart illustrating a speech emotion recognition method according to a second embodiment of the present invention;
FIG. 4 is a block diagram of a speech emotion recognition apparatus according to a first embodiment of the present invention.
The implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
Referring to fig. 1, fig. 1 is a schematic structural diagram of an electronic device in a hardware operating environment according to an embodiment of the present invention.
As shown in fig. 1, the electronic device may include: a processor 1001, such as a Central Processing Unit (CPU), a communication bus 1002, a user interface 1003, a network interface 1004, and a memory 1005. The communication bus 1002 is used to enable connection and communication between these components. The user interface 1003 may include a display screen (Display) and an input unit such as a keyboard (Keyboard), and optionally may also include a standard wired interface and a wireless interface. The network interface 1004 may optionally include a standard wired interface and a wireless interface (e.g., a Wireless Fidelity (Wi-Fi) interface). The memory 1005 may be a Random Access Memory (RAM) or a Non-Volatile Memory (NVM), such as a disk memory. The memory 1005 may alternatively be a storage device separate from the processor 1001.
Those skilled in the art will appreciate that the configuration shown in fig. 1 does not constitute a limitation of the electronic device and may include more or fewer components than those shown, or some components may be combined, or a different arrangement of components.
As shown in fig. 1, a memory 1005, which is a storage medium, may include an operating system, a network communication module, a user interface module, and a speech emotion recognition program.
In the electronic apparatus shown in fig. 1, the network interface 1004 is mainly used for data communication with a network server; the user interface 1003 is mainly used for data interaction with a user; the processor 1001 and the memory 1005 of the electronic device according to the present invention may be disposed in the electronic device, and the electronic device calls the speech emotion recognition program stored in the memory 1005 through the processor 1001 and executes the speech emotion recognition method according to the embodiment of the present invention.
An embodiment of the present invention provides a speech emotion recognition method, and referring to fig. 2, fig. 2 is a flowchart illustrating a first embodiment of the speech emotion recognition method according to the present invention.
In this embodiment, the speech emotion recognition method includes the following steps:
step S10: the method comprises the steps of obtaining a test voice sample with preset dimensionality, and conducting segmentation processing on the test voice sample through a preset rule to obtain a plurality of initial voice samples.
Before the step of obtaining a test voice sample with a preset dimension and segmenting the test voice sample according to a preset rule to obtain a plurality of initial voice samples, a training voice sample with the preset dimension is obtained and segmented according to the preset rule to obtain a plurality of initial training voice samples; feature extraction is performed on the initial training voice samples to obtain the training voice signal features to be processed; feature statistics are performed on the training voice signal features to be processed through a preset statistical function to obtain a training feature statistical result to be confirmed; target training feature data are obtained through a preset multi-objective optimization algorithm according to the training feature statistical result to be confirmed; the emotion category corresponding to the target training feature data is obtained according to the target training feature data; and a preset Softmax classification model is established according to the emotion categories and the target training feature data corresponding to the emotion categories.
In addition, it should be understood that the preset rule is a user-defined sample division rule; that is, if the duration of the obtained test voice sample of the preset dimension is 5 s and the preset rule is set to 0.2 s, 25 initial voice samples of 0.2 s each are obtained after division according to the preset rule.
In addition, it should be noted that the preset dimension may be a time dimension, or may be a non-time dimension, and the present embodiment is not limited thereto.
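For illustration, the following is a minimal Python sketch of the segmentation rule described above, assuming the test voice sample is a mono waveform array sampled at 16 kHz; the helper name is hypothetical and any trailing partial segment is simply dropped.

```python
import numpy as np

def split_into_segments(waveform, sample_rate, segment_seconds=0.2):
    """Cut a mono waveform into fixed-length initial voice samples."""
    seg_len = int(round(segment_seconds * sample_rate))
    n_full = len(waveform) // seg_len  # trailing partial segment is dropped
    return [waveform[i * seg_len:(i + 1) * seg_len] for i in range(n_full)]

# A 5 s sample at 16 kHz with a 0.2 s preset rule yields 25 initial voice samples.
audio = np.random.randn(5 * 16000)
segments = split_into_segments(audio, 16000)
assert len(segments) == 25
```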
Step S20: and extracting signal characteristic data of the initial voice sample to obtain voice signal characteristic data to be processed.
Furthermore, it should be understood that signal feature data extraction on the initial voice sample extracts Mel Frequency Cepstral Coefficients (MFCC), Log Frequency Power Coefficients (LFPC), Linear Predictive Cepstral Coefficients (LPCC), Zero Crossings with Peak Amplitude (ZCPA), Perceptual Linear Prediction (PLP) and Rasta-filtered Perceptual Linear Prediction (R-PLP) features.
It should be understood that the above feature extraction result of each class of features is a two-dimensional matrix in which one dimension is the time dimension. The first derivative ΔF_i and the second derivative ΔΔF_i of each class of features F_i in the time dimension are then calculated, and the original features, the first-derivative result and the second-derivative result are concatenated in the non-time dimension to form the final feature extraction result of that class; the final feature extraction results of all classes are concatenated in the non-time dimension to obtain the feature extraction result of the sample.
Further, for ease of understanding, the following example is given: suppose the MFCC features give F_MFCC ∈ R^(39×z), ΔF_MFCC ∈ R^(39×z) and ΔΔF_MFCC ∈ R^(39×z), where z is the number of frames, i.e. the size of the time dimension; the concatenation result in the non-time dimension is then [F_MFCC; ΔF_MFCC; ΔΔF_MFCC] ∈ R^(117×z). When the MFCC and LPCC features are connected, suppose the corresponding LPCC result after concatenation with its derivatives is F'_LPCC ∈ R^(36×z); connecting the two in series in the non-time dimension then gives a matrix in R^((117+36)×z) = R^(153×z).
Furthermore, it should be understood that at each speech signal feature extraction, the MFCC, LFPC, LPCC, ZCPA, PLP and R-PLP features are extracted, where the number of Mel filters for MFCC and LFPC is 40; the linear prediction orders of LPCC, PLP and R-PLP are 12, 16 and 16 respectively; and the frequency boundaries of ZCPA are: 0, 106, 223, 352, 495, 655, 829, 1022, 1236, 1473, 1734, 2024, 2344, 2689, 3089, 3522, 4000. The dimensions of the feature classes for each sentence are therefore ti×39, ti×40, ti×12, ti×16, ti×16 and ti×16 respectively, where ti is the number of frames of the i-th sentence and the number after the multiplication sign is the dimension of each frame's features. To capture the change of the speech signal in the time dimension, the first and second derivatives of the above features in the time dimension are also calculated. Finally, the dimensions of the feature classes are ti×117, ti×140, ti×36, ti×48, ti×48 and ti×48 respectively. The extracted speech signal features of the i-th sample are the combination of all the above features and have dimension ti×(117+140+36+48+48+48).
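As a hedged illustration of the derivative-and-concatenation scheme, the sketch below handles only the MFCC block with librosa; the other feature families (LFPC, LPCC, ZCPA, PLP, R-PLP) are assumed to be produced by separate user code and would be stacked in the same non-time dimension. The parameter n_mfcc=39 mirrors the per-frame MFCC dimension stated above, so the block grows to 117 rows after the derivatives are appended.

```python
import numpy as np
import librosa

def mfcc_with_derivatives(segment, sample_rate, n_mfcc=39):
    """MFCC block of one segment plus its first and second time derivatives,
    concatenated along the non-time (feature) dimension."""
    f = librosa.feature.mfcc(y=segment, sr=sample_rate, n_mfcc=n_mfcc)  # shape (39, z)
    # small delta width because a 0.2 s segment contains only a few frames
    d1 = librosa.feature.delta(f, width=3, order=1)  # first derivative over time
    d2 = librosa.feature.delta(f, width=3, order=2)  # second derivative over time
    return np.concatenate([f, d1, d2], axis=0)       # shape (117, z)

# The full per-segment matrix would stack every feature family the same way:
# np.concatenate([mfcc_block, lfpc_block, lpcc_block, ...], axis=0)
```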
Step S30: and carrying out feature statistics on the voice signal feature data to be processed through a preset statistical function to obtain a feature statistical result to be confirmed.
The statistical results of the above features in the time dimension are obtained using the following statistical functions: mean, standard deviation, minimum, maximum, kurtosis and skewness.
In addition, it should be understood that the tag sample feature data are obtained by screening the statistical results obtained above, and feature statistics are performed on the tag sample feature data through the preset statistical function to obtain the feature statistical result to be confirmed. The feature statistics of the labeled samples are recorded as {x_1, x_2, ..., x_n}, where n is the number of labeled samples.
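A minimal sketch of these six time-dimension statistics, assuming the per-segment feature matrix has shape (feature_dim, z) with z frames:

```python
import numpy as np
from scipy import stats

def time_statistics(features):
    """Collapse the time dimension with mean, standard deviation, minimum,
    maximum, kurtosis and skewness, giving one fixed-length vector per segment."""
    return np.concatenate([
        features.mean(axis=1),
        features.std(axis=1),
        features.min(axis=1),
        features.max(axis=1),
        stats.kurtosis(features, axis=1),
        stats.skew(features, axis=1),
    ])
```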
Step S40: and obtaining characteristic target data through a preset multi-target optimization algorithm according to the statistical result of the characteristics to be confirmed.
In addition, it should be noted that {x_1, x_2, ..., x_n} from the above step is divided by sentence label into X_A = [x_1, x_2, ..., x_m] and X_B = [x_{m+1}, x_{m+2}, ..., x_n], where X_A contains the segments of class-A emotion and X_B contains the segments of class-B emotion. Training of the sentence-segment emotion classification method based on tendency cognitive learning comprises the following steps:
(1) For each x ∈ X_A, the angles between the X_A samples inside the Parzen window centered at x and the center sample x are divided into bins, and the distribution characteristics of the X_A data around x are calculated as
b_x = [b_1, b_2, ..., b_k], b_j = Σ_{x_i ∈ X_A} 1(x_i ∈ X_j),
where b_j denotes the j-th bin, 1(x_i ∈ X_j) takes the value 1 when x_i belongs to X_j and 0 otherwise, and X_j is the subset of X_A whose samples have angles to x falling in the j-th bin.
(2) For each x ∈ X_A, the angles between the X_B samples inside the Parzen window centered at x and the center sample x are divided into bins, and the distribution characteristics of the X_B data around x are calculated as
b̃_x = [b̃_1, b̃_2, ..., b̃_k], b̃_j = Σ_{x_i ∈ X_B} 1(x_i ∈ X̃_j),
where b̃_j denotes the j-th bin, 1(x_i ∈ X̃_j) takes the value 1 when x_i belongs to X̃_j and 0 otherwise, and X̃_j is the subset of X_B whose samples have angles to x falling in the j-th bin.
(3) The difference between the data distributions of the two data sets around the point x is calculated as
d_x = dist(b_x, b̃_x),
where dist(·, ·) denotes the distance between two vectors; the Euclidean distance is used here.
(4) According to the result of the previous step, the segment set inclined to the A emotion, X̂_A, the segment set inclined to the B emotion, X̂_B, and the segment set inclined to neutral emotion, X̂_C, can be obtained, where X̂_A consists of the x with d_x > T, X̂_B consists of the x with d_x < -T, and X̂_C consists of the x with -T < d_x < T; T is a threshold set by the user. Each set is clustered into a number of regions by spectral clustering, giving a region label l̂_i for each segment x_i.
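Before turning to step (5), the sketch below illustrates steps (1) to (4) under several explicit assumptions: the Parzen window is approximated by a fixed-radius neighbourhood, the angle of each neighbour is measured against a reference direction chosen here as the mean offset of the window (the text only states that angles to the center sample are binned), and the thresholding into the three tendency sets is applied to the resulting d_x values exactly as written above.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def angle_histogram(center, neighbours, reference, n_bins=8):
    """Histogram b_x of angles between (neighbour - center) and a reference
    direction; the choice of reference direction is an assumption."""
    diffs = neighbours - center
    denom = np.linalg.norm(diffs, axis=1) * np.linalg.norm(reference)
    cos = diffs @ reference / np.clip(denom, 1e-12, None)
    angles = np.arccos(np.clip(cos, -1.0, 1.0))            # angles in [0, pi]
    hist, _ = np.histogram(angles, bins=n_bins, range=(0.0, np.pi))
    return hist / max(len(neighbours), 1)

def distribution_difference(x, X_A, X_B, radius=1.0, n_bins=8):
    """d_x: difference between the X_A and X_B angle distributions around x."""
    near_A = X_A[np.linalg.norm(X_A - x, axis=1) < radius]
    near_B = X_B[np.linalg.norm(X_B - x, axis=1) < radius]
    ref = np.concatenate([near_A, near_B]).mean(axis=0) - x if len(near_A) + len(near_B) else x
    b_A = angle_histogram(x, near_A, ref, n_bins)
    b_B = angle_histogram(x, near_B, ref, n_bins)
    return np.linalg.norm(b_A - b_B)                        # Euclidean distance, as in step (3)

def cluster_regions(tendency_set, n_regions=3):
    """Spectral clustering of one tendency set into regions (step (4))."""
    return SpectralClustering(n_clusters=n_regions).fit_predict(np.asarray(tendency_set))
```

Note that with a plain Euclidean distance d_x is non-negative, so a signed variant of dist(·, ·) would be needed for the d_x < -T branch of step (4) to be populated; that variant is not specified here.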
(5) Define X̂ = [X̂_A, X̂_B, X̂_C] and L = [L_A, L_B, L_C], where L_A ∈ R^p, L_B ∈ R^q and L_C ∈ R^u, and p, q and u are respectively the numbers of samples in X̂_A, X̂_B and X̂_C; the element values in L_A, L_B and L_C are 1, 2 and 3 respectively. The feature subspace of the segments is learned using the following objective function:
J = J1(o_i, o_j) + β·J2(o_i, o_j)
where β is a balance parameter. J1(o_i, o_j) makes the intra-class distances among the three classes X̂_A, X̂_B and X̂_C small and the inter-class distances large. In its definition, o_i and o_j are the results of mapping the samples of X̂_A, X̂_B and X̂_C into the subspace, l_i and l_j are the values in L corresponding to o_i and o_j, m is a threshold that adjusts the range of the inter-class distance and is set to 1 in the invention, and G_ij is the Gaussian distance between x_i and x_j, calculated as:
G_ij = exp(−‖x_i − x_j‖² / (2σ²)), where σ is a kernel width parameter.
J2(o_i, o_j) tries to keep the relative relationships within each region unchanged while keeping regions that belong to the same class close to each other without overlapping. In its definition, l̂_i and l̂_j are the region labels of x_i and x_j, and G_{l̂_i} is the maximum of all G_ij within region l̂_i. In this way the relationship between two segments is preserved when they belong to the same region, and when two segments belong to the same class but different regions their distance is minimized with only a small weight, so that overlapping of the two regions is avoided as far as possible.
To optimize the objective function J, we define o_i = φ(W_q φ( ⋯ φ(W_2 φ(W_1 x_i + b_1) + b_2) ⋯ ) + b_q), where φ(·) is the sigmoid function, W_1, W_2, ..., W_q are the mapping matrices and b_1, b_2, ..., b_q are the offsets. By calculating ∂J/∂W and ∂J/∂b, the values of W_1, W_2, ..., W_q and b_1, b_2, ..., b_q can be obtained, where ∂J/∂W is the derivative of J with respect to W and ∂J/∂b is the derivative of J with respect to b.
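As a sketch of the optimization just described, the forward mapping o_i and a plain gradient-descent update are shown below; the concrete gradients ∂J/∂W and ∂J/∂b of J = J1 + β·J2 are assumed to be supplied by the caller, since the closed forms of J1 and J2 follow the definitions above and are not reproduced here.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward_mapping(x, weights, biases):
    """o = phi(W_q phi(... phi(W_1 x + b_1) ...) + b_q): stacked sigmoid
    mapping of a segment feature vector into the learned subspace."""
    o = x
    for W, b in zip(weights, biases):
        o = sigmoid(W @ o + b)
    return o

def gradient_step(weights, biases, grads_W, grads_b, lr=1e-2):
    """One gradient-descent update of W_1..W_q and b_1..b_q using the
    caller-supplied derivatives of J with respect to W and b."""
    weights = [W - lr * gW for W, gW in zip(weights, grads_W)]
    biases = [b - lr * gb for b, gb in zip(biases, grads_b)]
    return weights, biases
```

In practice the same gradients could also be obtained with an automatic-differentiation framework rather than derived by hand.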
Step S50: and inputting the characteristic target data into a preset Softmax classification model to obtain a speech emotion recognition result.
Further, it is understood that, with the W_1, W_2, ..., W_q and b_1, b_2, ..., b_q obtained in the above steps, the feature selection result z of {x_1, x_2, ..., x_m} is calculated.
In addition, it should be noted that W_1, W_2, ..., W_q and b_1, b_2, ..., b_q are the feature target data referred to in the present application.
Further, it should be understood that the speech emotion categories {l_1, l_2, ..., l_m} of {x_1, x_2, ..., x_m} are obtained separately using the preset Softmax classifier obtained during training, and the emotion of the sentence is then obtained by voting over {l_1, l_2, ..., l_m}.
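A minimal sketch of this segment-level classification and sentence-level vote, assuming an sklearn-style classifier object with a predict method stands in for the preset Softmax classification model:

```python
import numpy as np
from collections import Counter

def predict_sentence_emotion(segment_vectors, softmax_model):
    """Classify every segment of one sentence and vote over the segment
    labels {l_1, ..., l_m} to obtain the sentence-level emotion."""
    labels = softmax_model.predict(np.vstack(segment_vectors))
    return Counter(labels).most_common(1)[0][0]
```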
In addition, it should be noted that the feature target data is input into the preset Softmax classification model to obtain speech emotion category data, data statistics is performed on the speech emotion category data to obtain a speech emotion category data value, and a speech emotion recognition result is obtained according to the speech emotion category data value.
In addition, the step of obtaining the speech emotion recognition result according to the speech emotion category data value is to determine whether the speech emotion category data value belongs to a preset speech emotion category threshold range, and if the speech emotion category data value belongs to the preset speech emotion category threshold range, obtain the speech emotion recognition result according to the speech emotion category data value; and if the voice emotion category data value does not belong to the preset voice emotion category threshold range, returning to the step of inputting the feature target data into the preset Softmax classification model to obtain voice emotion category data.
In addition, the corpus used for evaluating the emotion recognition effect of the present invention is a standard database in the speech emotion recognition field. The training process is completed first, and the recognition test is then performed using 5-fold cross-validation. Seven emotions can be recognized: anger, fear, irritability, disgust, happiness, neutrality and sadness. In the speaker-dependent case the average classification accuracy is 94.65%, and apart from irritability, which is relatively easy to confuse with anger, the emotions are well separated from one another. In the speaker-independent case the average classification accuracy is 89.30%.
In this embodiment, a test voice sample with a preset dimension is obtained and segmented according to a preset rule to obtain a plurality of initial voice samples; signal feature data are then extracted from the initial voice samples to obtain voice signal feature data to be processed, which are screened to obtain tag sample feature data; feature statistics are performed on the tag sample feature data through a preset statistical function to obtain a feature statistical result to be confirmed; the statistical result is divided by emotion category to obtain training feature data to be optimized for the different emotion categories, and target training feature data are obtained from them through a preset multi-objective optimization algorithm; finally, the feature target data are input into the preset Softmax classification model to obtain a speech emotion recognition result. In this way, the speech emotion segments and the emotional relationship between sentences and segments can be fully exploited to form tendency-bearing data, so that the human process of handling tendencies can be simulated; the unbalanced information in the data is used, the data are compared with each other, and segments with different emotions are separated under constraint conditions, which increases the sample scale and improves the sample diversity.
Referring to fig. 3, fig. 3 is a flowchart illustrating a speech emotion recognition method according to a second embodiment of the present invention.
Based on the first embodiment, before the step S10, the speech emotion recognition method in this embodiment further includes:
step S000: and acquiring training voice samples with preset dimensions, and performing segmentation processing on the test voice samples through preset rules to obtain a plurality of initial training voice samples.
Step S001: and extracting the characteristics of the initial training voice sample to obtain the characteristics of the training voice signal to be processed.
Step S002: and carrying out feature statistics on the training voice signal features to be processed through a preset statistical function to obtain a training feature statistical result to be confirmed.
Step S003: and obtaining target training characteristic data through a preset multi-objective optimization algorithm according to the statistical result of the training characteristics to be confirmed.
Step S004: and acquiring the emotion types corresponding to the target training characteristic data according to the target training characteristic data.
Step S005: and establishing a preset Softmax classification model according to the emotion classes and target training characteristic data corresponding to the emotion classes.
In addition, it should be noted that, the step of obtaining the target training feature data through the preset multi-objective optimization algorithm according to the to-be-confirmed training feature statistical result includes performing emotion class division on the to-be-confirmed training feature statistical result to obtain the to-be-optimized training feature data corresponding to different emotion classes, and obtaining the target training feature data through the preset multi-objective optimization algorithm according to the to-be-optimized training feature data.
In addition, it should be noted that, in the above step, a preset Softmax classification model is established, and in this stage, training is performed on all speakers respectively to obtain a classifier corresponding to each speaker, and the specific process is as follows:
step (1-1) segmenting each statement;
extracting the characteristics of each segment;
step (1-3) performing feature statistics on all features;
step (1-4) training a sentence fragment emotion classification method based on tendency cognitive learning;
step (1-5) training a support vector machine for each feature subspace;
the classification result of the step (1-6) is obtained by voting the results of all the support vector machines;
in addition, in the step (1-1), the speech signal is segmented at intervals of 0.2 seconds.
In the step (1-2), extracting the speech signal features for each segment includes: MFCC (Mel Frequency Cepstral Coefficients), LFPC (Log Frequency Power Coefficients), LPCC (Linear Predictive Cepstral Coefficients), ZCPA (Zero Crossings with Peak Amplitude), PLP (Perceptual Linear Prediction) and R-PLP (Rasta Perceptual Linear Prediction). The feature extraction result of each class of features is a two-dimensional matrix in which one dimension is the time dimension. The first derivative ΔF_i and the second derivative ΔΔF_i of each class of features F_i in the time dimension are then calculated, and the original features, the first-derivative result and the second-derivative result are concatenated in the non-time dimension to form the final feature extraction result of that class; the final feature extraction results of all classes are concatenated in the non-time dimension to obtain the feature extraction result of the sample.
The feature statistics in the step (1-3) are: the statistical results of the mean, standard deviation, minimum, maximum, kurtosis and skewness of the features in the time dimension. The feature statistics are recorded as {x_1, x_2, ..., x_n}, and the corresponding labels are denoted as Y = [y_1, y_2, ..., y_n] ∈ R^n.
In the step (1-4), given the data sets X_A = [x_1, x_2, ..., x_m] and X_B = [x_{m+1}, x_{m+2}, ..., x_n], where X_A contains the segments of class-A emotion and X_B contains the segments of class-B emotion, training of the sentence-segment emotion classification method based on tendency cognitive learning comprises the following steps:
Step (1-4-1): for each x ∈ X_A, the angles between the X_A samples inside the Parzen window centered at x and the center sample x are divided into bins, and the distribution characteristics of the X_A data around x are calculated as
b_x = [b_1, b_2, ..., b_k], b_j = Σ_{x_i ∈ X_A} 1(x_i ∈ X_j),
where b_j denotes the j-th bin, 1(x_i ∈ X_j) takes the value 1 when x_i belongs to X_j and 0 otherwise, and X_j is the subset of X_A whose samples have angles to x falling in the j-th bin.
Step (1-4-2): for each x ∈ X_A, the angles between the X_B samples inside the Parzen window centered at x and the center sample x are divided into bins, and the distribution characteristics of the X_B data around x are calculated as
b̃_x = [b̃_1, b̃_2, ..., b̃_k], b̃_j = Σ_{x_i ∈ X_B} 1(x_i ∈ X̃_j),
where b̃_j denotes the j-th bin, 1(x_i ∈ X̃_j) takes the value 1 when x_i belongs to X̃_j and 0 otherwise, and X̃_j is the subset of X_B whose samples have angles to x falling in the j-th bin.
Step (1-4-3): the difference between the data distributions of the two data sets around the point x is calculated as
d_x = dist(b_x, b̃_x),
where dist(·, ·) denotes the distance between two vectors; a variety of distance calculation methods may be used.
Step (1-4-4): according to the calculation result of step (1-4-3), the segment set inclined to the A emotion, X̂_A, the segment set inclined to the B emotion, X̂_B, and the segment set inclined to neutral emotion, X̂_C, can be obtained, where X̂_A consists of the x with d_x > T, X̂_B consists of the x with d_x < -T, and X̂_C consists of the x with -T < d_x < T; T is a threshold set by the user. Each set is clustered into a number of regions by spectral clustering, giving a region label l̂_i for each segment x_i.
Step (1-4-5): define X̂ = [X̂_A, X̂_B, X̂_C] and L = [L_A, L_B, L_C], where L_A ∈ R^p, L_B ∈ R^q and L_C ∈ R^u, and p, q and u are respectively the numbers of samples in X̂_A, X̂_B and X̂_C; the element values in L_A, L_B and L_C are 1, 2 and 3 respectively. The feature subspace of the segments is learned using the following objective function:
J = J1(o_i, o_j) + β·J2(o_i, o_j)
where β is a balance parameter. J1(o_i, o_j) makes the intra-class distances among the three classes X̂_A, X̂_B and X̂_C small and the inter-class distances large. In its definition, o_i and o_j are the results of mapping the samples of X̂_A, X̂_B and X̂_C into the subspace, l_i and l_j are the values in L corresponding to o_i and o_j, m is a threshold that adjusts the range of the inter-class distance, and G_ij is the Gaussian distance between x_i and x_j, calculated as:
G_ij = exp(−‖x_i − x_j‖² / (2σ²)), where σ is a kernel width parameter.
J2(o_i, o_j) tries to keep the relative relationships within each region unchanged while keeping regions that belong to the same class close to each other without overlapping. In its definition, l̂_i and l̂_j are the region labels of x_i and x_j, and G_{l̂_i} is the maximum of all G_ij within region l̂_i. In this way the relationship between two segments is preserved when they belong to the same region, and when two segments belong to the same class but different regions their distance is minimized with only a small weight, so that overlapping of the two regions is avoided as far as possible.
To optimize the objective function J, define o_i = φ(W_q φ( ⋯ φ(W_2 φ(W_1 x_i + b_1) + b_2) ⋯ ) + b_q), where φ(·) is the sigmoid function, W_1, W_2, ..., W_q are the mapping matrices and b_1, b_2, ..., b_q are the offsets. By calculating ∂J/∂W and ∂J/∂b, the values of W_1, W_2, ..., W_q and b_1, b_2, ..., b_q can be obtained, where ∂J/∂W is the derivative of J with respect to W and ∂J/∂b is the derivative of J with respect to b.
Step (1-4-6): for the subspace mapping results of X̂_A, X̂_B and X̂_C obtained in step (1-4-5), train a Softmax classifier to separate emotion A, emotion B and emotion C.
Step (1-4-7): following the operation of steps (1-4-5) and (1-4-6), train a Softmax classifier capable of recognizing every emotion pair.
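A sketch of this pairwise training, using multinomial logistic regression from scikit-learn as a stand-in for the Softmax classifier; for each emotion pair the classifier separates the A-prone, B-prone and neutral-prone subspace representations produced by steps (1-4-5) and (1-4-6).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_pair_classifier(o_A, o_B, o_C):
    """Three-class softmax-style classifier separating the A-prone, B-prone
    and neutral-prone segments of one emotion pair in the learned subspace."""
    X = np.vstack([o_A, o_B, o_C])
    y = np.array([0] * len(o_A) + [1] * len(o_B) + [2] * len(o_C))
    return LogisticRegression(max_iter=1000).fit(X, y)

def train_all_pairs(subspace_data):
    """subspace_data maps an (emotion_a, emotion_b) pair to its (o_A, o_B, o_C)
    arrays; one classifier is trained per pair, as in step (1-4-7)."""
    return {pair: train_pair_classifier(*arrays) for pair, arrays in subspace_data.items()}
```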
Further, it should be understood that the following is a summary of the above:
the first step is as follows: all training sample voices were segmented at 0.2 second intervals.
The second step is as follows: the MFCC, LFPC, LPCC, ZCPA, PLP and R-PLP features are extracted from all speech segment training signals, where the number of Mel filters for MFCC and LFPC is 40; the linear prediction orders of LPCC, PLP and R-PLP are 12, 16 and 16 respectively; and the frequency boundaries of ZCPA are: 0, 106, 223, 352, 495, 655, 829, 1022, 1236, 1473, 1734, 2024, 2344, 2689, 3089, 3522, 4000. The dimensions of the feature classes for each sentence are therefore ti×39, ti×40, ti×12, ti×16, ti×16 and ti×16 respectively, where ti is the number of frames of the i-th sentence and the number after the multiplication sign is the dimension of each frame's features. To capture the change of the speech signal in the time dimension, the first and second derivatives of the above features in the time dimension are also calculated. Finally, the dimensions of the feature classes are ti×117, ti×140, ti×36, ti×48, ti×48 and ti×48 respectively. The extracted speech signal features of the i-th sample are the combination of all the above features and have dimension ti×(117+140+36+48+48+48).
The third step is as follows: the following statistical functions are used to obtain the statistics of the above features in the time dimension: mean, standard deviation, minimum, maximum, kurtosis and skewness. The feature statistics of the labeled samples are recorded as {x_1, x_2, ..., x_n}, where n is the number of labeled samples.
The fourth step is as follows: {x_1, x_2, ..., x_n} from the previous step is divided by sentence label into X_A = [x_1, x_2, ..., x_m] and X_B = [x_{m+1}, x_{m+2}, ..., x_n], where X_A contains the segments of class-A emotion and X_B contains the segments of class-B emotion. Training of the sentence-segment emotion classification method based on tendency cognitive learning comprises the following steps:
(1) For each x ∈ X_A, the angles between the X_A samples inside the Parzen window centered at x and the center sample x are divided into bins, and the distribution characteristics of the X_A data around x are calculated as
b_x = [b_1, b_2, ..., b_k], b_j = Σ_{x_i ∈ X_A} 1(x_i ∈ X_j),
where b_j denotes the j-th bin, 1(x_i ∈ X_j) takes the value 1 when x_i belongs to X_j and 0 otherwise, and X_j is the subset of X_A whose samples have angles to x falling in the j-th bin.
(2) For each x ∈ X_A, the angles between the X_B samples inside the Parzen window centered at x and the center sample x are divided into bins, and the distribution characteristics of the X_B data around x are calculated as
b̃_x = [b̃_1, b̃_2, ..., b̃_k], b̃_j = Σ_{x_i ∈ X_B} 1(x_i ∈ X̃_j),
where b̃_j denotes the j-th bin, 1(x_i ∈ X̃_j) takes the value 1 when x_i belongs to X̃_j and 0 otherwise, and X̃_j is the subset of X_B whose samples have angles to x falling in the j-th bin.
(3) The difference between the data distributions of the two data sets around the point x is calculated as
d_x = dist(b_x, b̃_x),
where dist(·, ·) denotes the distance between two vectors; the Euclidean distance is used here.
(4) According to the calculation result of step (3), the segment set inclined to the A emotion, X̂_A, the segment set inclined to the B emotion, X̂_B, and the segment set inclined to neutral emotion, X̂_C, can be obtained, where X̂_A consists of the x with d_x > T, X̂_B consists of the x with d_x < -T, and X̂_C consists of the x with -T < d_x < T; T is a threshold set by the user. Each set is clustered into a number of regions by spectral clustering, giving a region label l̂_i for each segment x_i.
(5) Define X̂ = [X̂_A, X̂_B, X̂_C] and L = [L_A, L_B, L_C], where L_A ∈ R^p, L_B ∈ R^q and L_C ∈ R^u, and p, q and u are respectively the numbers of samples in X̂_A, X̂_B and X̂_C; the element values in L_A, L_B and L_C are 1, 2 and 3 respectively. The feature subspace of the segments is learned using the following objective function:
J = J1(o_i, o_j) + β·J2(o_i, o_j)
where β is a balance parameter. J1(o_i, o_j) makes the intra-class distances among the three classes X̂_A, X̂_B and X̂_C small and the inter-class distances large. In its definition, o_i and o_j are the results of mapping the samples of X̂_A, X̂_B and X̂_C into the subspace, l_i and l_j are the values in L corresponding to o_i and o_j, m is a threshold that adjusts the range of the inter-class distance and is set to 1 in the invention, and G_ij is the Gaussian distance between x_i and x_j, calculated as:
G_ij = exp(−‖x_i − x_j‖² / (2σ²)), where σ is a kernel width parameter.
J2(o_i, o_j) tries to keep the relative relationships within each region unchanged while keeping regions that belong to the same class close to each other without overlapping. In its definition, l̂_i and l̂_j are the region labels of x_i and x_j, and G_{l̂_i} is the maximum of all G_ij within region l̂_i. In this way the relationship between two segments is preserved when they belong to the same region, and when two segments belong to the same class but different regions their distance is minimized with only a small weight, so that overlapping of the two regions is avoided as far as possible.
To optimize the objective function J, define o_i = φ(W_q φ( ⋯ φ(W_2 φ(W_1 x_i + b_1) + b_2) ⋯ ) + b_q), where φ(·) is the sigmoid function, W_1, W_2, ..., W_q are the mapping matrices and b_1, b_2, ..., b_q are the offsets. By calculating ∂J/∂W and ∂J/∂b, the values of W_1, W_2, ..., W_q and b_1, b_2, ..., b_q can be obtained, where ∂J/∂W is the derivative of J with respect to W and ∂J/∂b is the derivative of J with respect to b.
(6) For the subspace mapping results of X̂_A, X̂_B and X̂_C obtained in step (1-4-5), train the Softmax classifier to separate emotion A, emotion B and emotion C.
(7) Following the operation of steps (1-4-5) and (1-4-6), train a Softmax classifier capable of recognizing every emotion pair.
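The steps summarized above can be strung together as in the hedged driver sketch below; every helper name is hypothetical, split_into_segments and time_statistics reuse the earlier sketches, and the per-subspace classifier of step (1-5) is a support vector machine in the text.

```python
def train_speaker_model(sentences, sentence_labels, sample_rate=16000):
    """Per-speaker training pipeline, steps (1-1) to (1-5); the vote of
    step (1-6) is applied at recognition time."""
    seg_vectors, seg_labels = [], []
    for waveform, label in zip(sentences, sentence_labels):
        for seg in split_into_segments(waveform, sample_rate, 0.2):   # step (1-1)
            feats = extract_all_features(seg, sample_rate)            # step (1-2), hypothetical helper
            seg_vectors.append(time_statistics(feats))                # step (1-3)
            seg_labels.append(label)
    subspaces = learn_tendency_subspaces(seg_vectors, seg_labels)     # step (1-4), hypothetical helper
    return [train_subspace_svm(sub) for sub in subspaces]             # step (1-5), hypothetical helper
```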
In this embodiment, training voice samples with a preset dimension are obtained and segmented according to a preset rule to obtain a plurality of initial training voice samples; feature extraction is performed on the initial training voice samples to obtain the training voice signal features to be processed; feature statistics are performed on the training voice signal features through a preset statistical function to obtain a training feature statistical result to be confirmed; target training feature data are obtained through a preset multi-objective optimization algorithm according to the training feature statistical result; the emotion categories corresponding to the target training feature data are then obtained, and the preset Softmax classification model is established from the emotion categories and the corresponding target training feature data. In this way, the model can be trained on the local segments of a sentence, which avoids the problems that different local segments of a sentence contain different emotions or that different local segments of the same emotion conflict with each other, thereby reducing the gap between the physical meaning of deep learning and the characteristics of speech emotion recognition.
In addition, an embodiment of the present invention further provides a storage medium, where a speech emotion recognition program is stored on the storage medium, and the speech emotion recognition program, when executed by a processor, implements the steps of the speech emotion recognition method described above.
Referring to FIG. 4, FIG. 4 is a block diagram illustrating a first embodiment of a speech emotion recognition apparatus according to the present invention.
As shown in fig. 4, the speech emotion recognition apparatus according to the embodiment of the present invention includes: the acquisition module 4001 is configured to acquire a test voice sample with a preset dimension, and perform segmentation processing on the test voice sample according to a preset rule to obtain a plurality of initial voice samples; an extraction module 4002, configured to perform signal feature data extraction on the initial voice sample to obtain to-be-processed voice signal feature data; the statistic module 4003 is configured to perform feature statistics on the to-be-processed speech signal feature data through a preset statistic function to obtain a feature statistical result to be confirmed; the calculating module 4004 is configured to obtain feature target data through a preset multi-objective optimization algorithm according to the feature statistical result to be confirmed; the determining module 4005 is configured to input the feature target data into a preset Softmax classification model, and obtain a speech emotion recognition result.
The obtaining module 4001 obtains a test voice sample with a preset dimension, and performs a segmentation process on the test voice sample according to a preset rule to obtain a plurality of initial voice samples.
Before the step of obtaining a test voice sample with a preset dimension and segmenting the test voice sample according to a preset rule to obtain a plurality of initial voice samples, a training voice sample with the preset dimension is obtained and segmented according to the preset rule to obtain a plurality of initial training voice samples; feature extraction is performed on the initial training voice samples to obtain the training voice signal features to be processed; feature statistics are performed on the training voice signal features to be processed through a preset statistical function to obtain a training feature statistical result to be confirmed; target training feature data are obtained through a preset multi-objective optimization algorithm according to the training feature statistical result to be confirmed; the emotion category corresponding to the target training feature data is obtained according to the target training feature data; and a preset Softmax classification model is established according to the emotion categories and the target training feature data corresponding to the emotion categories.
In addition, it should be understood that the preset rule is a user-defined sample division rule, that is, if the duration of the obtained test voice sample of the preset dimension is 5s, and the preset rule is set to 0.2s, 25 segments of 0.2s initial voice samples are obtained after division according to the preset rule.
In addition, it should be noted that the preset dimension may be a time dimension, or may be a non-time dimension, and the present embodiment is not limited thereto.
The extraction module 4002 performs the operation of extracting signal feature data from the initial voice samples to obtain the voice signal feature data to be processed.
Furthermore, it should be understood that signal feature data extraction on the initial voice sample extracts Mel Frequency Cepstral Coefficients (MFCC), Log Frequency Power Coefficients (LFPC), Linear Predictive Cepstral Coefficients (LPCC), Zero Crossings with Peak Amplitude (ZCPA), Perceptual Linear Prediction (PLP) and Rasta-filtered Perceptual Linear Prediction (R-PLP) features.
It should be understood that the above feature extraction result of each class of features is a two-dimensional matrix in which one dimension is the time dimension. The first derivative ΔF_i and the second derivative ΔΔF_i of each class of features F_i in the time dimension are then calculated, and the original features, the first-derivative result and the second-derivative result are concatenated in the non-time dimension to form the final feature extraction result of that class; the final feature extraction results of all classes are concatenated in the non-time dimension to obtain the feature extraction result of the sample.
Further, for ease of understanding, the following example is given: suppose the MFCC features give F_MFCC ∈ R^(39×z), ΔF_MFCC ∈ R^(39×z) and ΔΔF_MFCC ∈ R^(39×z), where z is the number of frames, i.e. the size of the time dimension; the concatenation result in the non-time dimension is then [F_MFCC; ΔF_MFCC; ΔΔF_MFCC] ∈ R^(117×z). When the MFCC and LPCC features are connected, suppose the corresponding LPCC result after concatenation with its derivatives is F'_LPCC ∈ R^(36×z); connecting the two in series in the non-time dimension then gives a matrix in R^((117+36)×z) = R^(153×z).
Furthermore, it should be understood that at each speech signal feature extraction, the MFCC, LFPC, LPCC, ZCPA, PLP and R-PLP features are extracted, where the number of Mel filters for MFCC and LFPC is 40; the linear prediction orders of LPCC, PLP and R-PLP are 12, 16 and 16 respectively; and the frequency boundaries of ZCPA are: 0, 106, 223, 352, 495, 655, 829, 1022, 1236, 1473, 1734, 2024, 2344, 2689, 3089, 3522, 4000. The dimensions of the feature classes for each sentence are therefore ti×39, ti×40, ti×12, ti×16, ti×16 and ti×16 respectively, where ti is the number of frames of the i-th sentence and the number after the multiplication sign is the dimension of each frame's features. To capture the change of the speech signal in the time dimension, the first and second derivatives of the above features in the time dimension are also calculated. Finally, the dimensions of the feature classes are ti×117, ti×140, ti×36, ti×48, ti×48 and ti×48 respectively. The extracted speech signal features of the i-th sample are the combination of all the above features and have dimension ti×(117+140+36+48+48+48).
And the statistic module 4003 performs feature statistics on the feature data of the voice signal to be processed through a preset statistic function to obtain a feature statistic result to be confirmed.
The statistical results of the above features in the time dimension are obtained using the following statistical functions: mean, standard deviation, minimum, maximum, kurtosis and skewness.
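A minimal sketch of these six statistics collapsing the time dimension, using numpy and scipy; the 417-dimensional frame feature size is taken from the dimensions above and is only illustrative:

```python
import numpy as np
from scipy.stats import kurtosis, skew

def time_statistics(features: np.ndarray) -> np.ndarray:
    """Collapse the time dimension of a (feature_dim, z) matrix into six statistics
    per feature: mean, standard deviation, minimum, maximum, kurtosis, skewness."""
    stats = [
        features.mean(axis=1),
        features.std(axis=1),
        features.min(axis=1),
        features.max(axis=1),
        kurtosis(features, axis=1),
        skew(features, axis=1),
    ]
    return np.concatenate(stats)      # shape: (6 * feature_dim,)

segment_features = np.random.randn(417, 50)   # e.g. 417-dim frame features over 50 frames
x = time_statistics(segment_features)
print(x.shape)                                # (2502,)
```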
In addition, it should be understood that the label sample feature data are obtained by screening the statistical results obtained above; feature statistics are performed on the label sample feature data through a preset statistical function to obtain the feature statistical result to be confirmed, and the feature statistical results of the label samples are recorded as {x_1, x_2, ..., x_n}, where n is the number of labeled samples.
The calculation module 4004 obtains the feature target data through a preset multi-objective optimization algorithm according to the feature statistical result to be confirmed.
In addition, it should be noted that {x_1, x_2, ..., x_n} from the above step is divided by sentence label into X_A = [x_1, x_2, ..., x_m] and X_B = [x_(m+1), x_(m+2), ..., x_n], where X_A contains the segments of emotion class A and X_B contains the segments of emotion class B. The training of the sentence-segment emotion classification method based on tendency cognitive learning comprises the following steps:
(1) For x ∈ X_A, the angles between the samples of X_A that fall inside the Parzen window centered at x and the center sample x are divided into k bins, and the distribution characteristics of the X_A data around x are then calculated as follows:

b_x = [b_1, b_2, ..., b_k]

b_j = Σ_i 1(x_i ∈ X_j)

where b_j denotes the j-th bin, 1(x_i ∈ X_j) equals 1 when x_i belongs to X_j and 0 otherwise, and X_j is the subset of X_A whose samples' angles with x fall in the j-th bin.
(2) For x ∈ X_A, the angles between the samples of X_B that fall inside the Parzen window centered at x and the center sample x are divided into k bins, and the distribution characteristics of the X_B data around x are then calculated as follows:

b'_x = [b'_1, b'_2, ..., b'_k]

b'_j = Σ_i 1(x_i ∈ X'_j)

where b'_j denotes the j-th bin, 1(x_i ∈ X'_j) equals 1 when x_i belongs to X'_j and 0 otherwise, and X'_j is the subset of X_B whose samples' angles with x fall in the j-th bin.
(3) The difference between the data distributions of the two data sets around the point x is calculated as follows:

d_x = dist(b_x, b'_x)

where dist(·,·) denotes the distance between two vectors; the Euclidean distance is used here.
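The following sketch gives one possible numerical reading of steps (1) to (3). Several details are assumptions: the Parzen window is approximated by a fixed-radius ball, the angle is measured between the offset vector (x_i − x) and x itself, the bin count k is arbitrary, and the d_x returned here is unsigned, whereas the thresholding in step (4) treats d_x as signed.

```python
import numpy as np

def angle_bins(center: np.ndarray, neighbors: np.ndarray, k: int = 8) -> np.ndarray:
    """Histogram of angles between the center sample and its window neighbors, split into k bins."""
    diffs = neighbors - center
    cos = (diffs @ center) / (np.linalg.norm(diffs, axis=1) * np.linalg.norm(center) + 1e-12)
    angles = np.arccos(np.clip(cos, -1.0, 1.0))
    hist, _ = np.histogram(angles, bins=k, range=(0.0, np.pi))
    return hist.astype(float)

def distribution_difference(x, X_A, X_B, window_radius=3.0, k=8):
    """Distance between the angle histograms of the A-class and B-class samples
    lying inside the window around x (magnitude of d_x only)."""
    in_A = X_A[np.linalg.norm(X_A - x, axis=1) < window_radius]
    in_B = X_B[np.linalg.norm(X_B - x, axis=1) < window_radius]
    return np.linalg.norm(angle_bins(x, in_A, k) - angle_bins(x, in_B, k))
```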
(4) According to the result of the previous step, the segment set inclined to emotion A, X̃_A = {x : d_x > T}, the segment set inclined to emotion B, X̃_B = {x : d_x < −T}, and the segment set inclined to neutral emotion, X̃_C = {x : −T < d_x < T}, can be obtained, where T is a user-set threshold. For each set, a spectral clustering method is used to cluster the segments into a plurality of regions, which yields a region label r_i for each segment x_i.
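A short sketch of the region-labelling step; scikit-learn's SpectralClustering is only an assumed stand-in for "a spectral clustering method", and the number of regions per set is an arbitrary choice:

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def region_labels(segments: np.ndarray, n_regions: int = 4) -> np.ndarray:
    """Cluster the segments of one tendency set into regions; returns a region label per segment."""
    clustering = SpectralClustering(n_clusters=n_regions, affinity="rbf", random_state=0)
    return clustering.fit_predict(segments)

# d_x computed per segment as above; T is the user-set threshold:
# X_tilde_A = X[d > T]; X_tilde_B = X[d < -T]; X_tilde_C = X[(d > -T) & (d < T)]
X_tilde_A = np.random.randn(60, 2502)    # placeholder tendency-A segment statistics
r_A = region_labels(X_tilde_A, n_regions=4)
```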
(5) Define X̃ = [X̃_A, X̃_B, X̃_C] and L = [L_A, L_B, L_C], where L_A ∈ R^p, L_B ∈ R^q and L_C ∈ R^u, and p, q and u are the numbers of samples in X̃_A, X̃_B and X̃_C respectively; the element values in L_A, L_B and L_C are 1, 2 and 3 respectively. The feature subspace of the segments is learned using the following objective equation:

J = J_1(o_i, o_j) + β·J_2(o_i, o_j)
β is a balance parameter. J_1(o_i, o_j) makes the intra-class distances of the three classes X̃_A, X̃_B and X̃_C small and their inter-class distances large, and is defined as follows:

J_1 = Σ_{l_i = l_j} G_ij·‖o_i − o_j‖² + Σ_{l_i ≠ l_j} G_ij·max(0, m − ‖o_i − o_j‖²)
where o_i and o_j are the results of mapping x_i and x_j into the subspace, l_i and l_j are the values in L corresponding to o_i and o_j, m is a threshold that adjusts the range of the inter-class distance and is set to 1 in the invention, and G_ij is the Gaussian distance between x_i and x_j, calculated as follows:
G_ij = exp(−‖x_i − x_j‖² / (2σ²))
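A small numerical sketch of the Gaussian affinity G_ij; the bandwidth heuristic (median pairwise distance) is an assumption, since the text does not state how σ is chosen:

```python
import numpy as np
from typing import Optional
from scipy.spatial.distance import cdist

def gaussian_affinity(X: np.ndarray, sigma: Optional[float] = None) -> np.ndarray:
    """G_ij = exp(-||x_i - x_j||^2 / (2 * sigma^2)) over all pairs of segment features."""
    sq_dists = cdist(X, X, metric="sqeuclidean")
    if sigma is None:
        sigma = np.sqrt(np.median(sq_dists[sq_dists > 0]))   # heuristic bandwidth (assumption)
    return np.exp(-sq_dists / (2 * sigma ** 2))

G = gaussian_affinity(np.random.randn(10, 2502))
print(G.shape)   # (10, 10); diagonal entries are 1.0
```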
J_2(o_i, o_j) tries to keep the relative relationships within each region unchanged and to keep regions belonging to the same class relatively close without overlapping. It is defined as follows:

J_2 = Σ_{r_i = r_j} (‖o_i − o_j‖² − G_ij)² + Σ_{r_i ≠ r_j, l_i = l_j} (G_ij / G_{l_i})·‖o_i − o_j‖²
where r_i and r_j are the region labels of x_i and x_j, and G_{l_i} is the maximum of all G_ij within class l_i. In this way the relationship between two segments is preserved when they belong to the same region, and the distance between two segments that belong to the same class but to different regions is minimized with a small weight, so that overlapping between regions is avoided as far as possible.
To optimize the objective equation J, we define o_i = φ(W_q·φ(⋯φ(W_2·φ(W_1·x_i + b_1) + b_2)⋯) + b_q), where φ(·) is the sigmoid function, W_1, W_2, …, W_q are the mapping matrices and b_1, b_2, …, b_q are the offsets. The values of W_1, W_2, …, W_q and b_1, b_2, …, b_q can be obtained by calculating ∂J/∂W and ∂J/∂b, where ∂J/∂W is the derivative of J with respect to W and ∂J/∂b is the derivative of J with respect to b.
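A hedged PyTorch sketch of the subspace mapping and its gradient-based optimization. Autograd stands in for the hand-derived ∂J/∂W and ∂J/∂b, the layer sizes and learning rate are arbitrary, and only a contrastive-style reading of J_1 is implemented, consistent with the description above but not necessarily the patent's exact formula:

```python
import torch
import torch.nn as nn

def make_mapper(dims):
    """Stack of sigmoid layers realizing o_i = phi(W_q phi(... phi(W_1 x_i + b_1) ...) + b_q)."""
    layers = []
    for d_in, d_out in zip(dims[:-1], dims[1:]):
        layers += [nn.Linear(d_in, d_out), nn.Sigmoid()]
    return nn.Sequential(*layers)

def contrastive_j1(o, labels, G, m=1.0):
    """Illustrative J1: pull same-label pairs together, push different-label pairs
    at least m apart, weighted by the Gaussian affinities G (assumed form)."""
    dist2 = torch.cdist(o, o) ** 2
    same = (labels[:, None] == labels[None, :]).float()
    return (G * (same * dist2 + (1 - same) * torch.clamp(m - dist2, min=0))).sum()

x = torch.randn(64, 2502)                  # segment statistics (placeholder)
labels = torch.randint(1, 4, (64,))        # values 1, 2, 3 for the three tendency classes
G = torch.exp(-torch.cdist(x, x) ** 2 / (2 * x.var()))   # assumed Gaussian affinity
mapper = make_mapper([2502, 512, 128, 32])
opt = torch.optim.SGD(mapper.parameters(), lr=1e-3)
for _ in range(100):                       # gradient descent on J (here the J1 part only)
    opt.zero_grad()
    loss = contrastive_j1(mapper(x), labels, G)
    loss.backward()
    opt.step()
```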
The determining module 4005 inputs the feature target data into a preset Softmax classification model to obtain a speech emotion recognition result.
Further, it should be understood that, using the W_1, W_2, …, W_q and b_1, b_2, …, b_q obtained in the above steps, the feature selection result z of {x_1, x_2, ..., x_m} is calculated.
In addition, the W_1, W_2, …, W_q and b_1, b_2, …, b_q defined above constitute the feature target data in the present application.
Further, it should be understood that the preset Softmax classifier obtained during training is used to obtain the speech emotion classifications {l_1, l_2, ..., l_m} of {x_1, x_2, ..., x_m} separately; the emotion of the sentence is then obtained by voting over {l_1, l_2, ..., l_m}.
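A minimal sketch of the segment-level classification followed by sentence-level voting; the per-segment probabilities are placeholders, and the seven emotion names are taken from the evaluation paragraph below:

```python
from collections import Counter
import numpy as np

def sentence_emotion(segment_probs: np.ndarray, emotion_names: list) -> str:
    """Majority vote over per-segment Softmax decisions to get the sentence-level emotion."""
    segment_labels = segment_probs.argmax(axis=1)              # per-segment predicted class
    winner, _ = Counter(segment_labels.tolist()).most_common(1)[0]
    return emotion_names[winner]

# Example: 25 segments, 7 emotion classes (probabilities are placeholders).
probs = np.random.dirichlet(np.ones(7), size=25)
emotions = ["anger", "fear", "irritability", "disgust", "happiness", "neutrality", "sadness"]
print(sentence_emotion(probs, emotions))
```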
In addition, it should be noted that the feature target data is input into the preset Softmax classification model to obtain speech emotion category data, data statistics is performed on the speech emotion category data to obtain a speech emotion category data value, and a speech emotion recognition result is obtained according to the speech emotion category data value.
In addition, the step of obtaining the speech emotion recognition result according to the speech emotion category data value is to judge whether the speech emotion category data value belongs to a preset speech emotion category threshold range; if it does, the speech emotion recognition result is obtained according to the speech emotion category data value; if it does not, the process returns to the step of inputting the feature target data into the preset Softmax classification model to obtain speech emotion category data.
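A trivial sketch of this threshold check; the numeric range and the return format are hypothetical:

```python
def recognize_with_threshold(category_value: float, low: float, high: float) -> dict:
    """Accept the recognition result only when the category value lies in the preset range;
    otherwise signal the caller to re-run the Softmax classification step."""
    if low <= category_value <= high:
        return {"accepted": True, "value": category_value}
    return {"accepted": False, "value": None}   # caller loops back to classification

result = recognize_with_threshold(0.87, low=0.5, high=1.0)
print(result)
```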
In addition, the corpus used for the emotion recognition evaluation of the present invention is a standard database in the speech emotion recognition field. The training process is completed first and the recognition test is then performed, using 5-fold cross validation. Seven emotions (anger, fear, irritability, disgust, happiness, neutrality and sadness) can be identified. In the speaker-dependent case the average classification accuracy is 94.65%; apart from irritability and anger, which are relatively easy to confuse with each other, the other emotions are well separated. In the speaker-independent case the average classification accuracy is 89.30%.
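A minimal sketch of the 5-fold evaluation protocol only, with random placeholder data and scikit-learn's multinomial logistic regression standing in for the Softmax classifier; it does not reproduce the reported accuracies:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import LogisticRegression

# Placeholder sentence-level features and 7-class emotion labels.
X = np.random.randn(500, 417)
y = np.random.randint(0, 7, size=500)

accuracies = []
for train_idx, test_idx in StratifiedKFold(n_splits=5, shuffle=True, random_state=0).split(X, y):
    clf = LogisticRegression(max_iter=1000)   # stand-in Softmax (multinomial logistic) classifier
    clf.fit(X[train_idx], y[train_idx])
    accuracies.append(clf.score(X[test_idx], y[test_idx]))
print(np.mean(accuracies))                    # mean accuracy over the 5 folds
```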
It should be understood that the above is only an example, and the technical solution of the present invention is not limited in any way, and in a specific application, a person skilled in the art may set the technical solution as needed, and the present invention is not limited thereto.
In this embodiment, a test voice sample with a preset dimension is obtained and segmented through a preset rule to obtain a plurality of initial voice samples; signal feature data extraction is performed on the initial voice samples to obtain voice signal feature data to be processed; the voice signal feature data to be processed are screened to obtain label sample feature data, and feature statistics are performed on the label sample feature data through a preset statistical function to obtain a feature statistical result to be confirmed; the training feature statistical result to be confirmed is divided by emotion category to obtain training feature data to be optimized corresponding to different emotion categories; target training feature data are obtained through a preset multi-objective optimization algorithm according to the training feature data to be optimized; and finally the feature target data are input into a preset Softmax classification model to obtain a speech emotion recognition result. In this way, the emotion relations between speech segments and between sentences and segments can be fully utilized to form data with tendencies, so that the human process of handling tendencies is simulated, the unbalanced information of the data is exploited, the data are compared with each other, segments with different emotions are separated under constraint conditions, the sample scale is increased and the sample diversity is improved.
It should be noted that the above-described work flows are only exemplary, and do not limit the scope of the present invention, and in practical applications, a person skilled in the art may select some or all of them to achieve the purpose of the solution of the embodiment according to actual needs, and the present invention is not limited herein.
In addition, the technical details that are not described in detail in this embodiment may refer to the speech emotion recognition method provided in any embodiment of the present invention, and are not described herein again.
Further, it is to be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or system that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or system. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or system that comprises the element.
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solution of the present invention or portions thereof that contribute to the prior art may be embodied in the form of a software product, where the computer software product is stored in a storage medium (e.g. Read Only Memory (ROM)/RAM, magnetic disk, optical disk), and includes several instructions for enabling a terminal device (e.g. a mobile phone, a computer, a server, or a network device) to execute the method according to the embodiments of the present invention.
The above description is only a preferred embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims (8)

1. A speech emotion recognition method, characterized in that the method comprises:
acquiring a test voice sample with a preset dimension, and performing segmentation processing on the test voice sample through a preset rule to obtain a plurality of initial voice samples;
extracting signal characteristic data of the initial voice sample to obtain voice signal characteristic data to be processed;
performing feature statistics on the voice signal feature data to be processed through a preset statistical function to obtain a feature statistical result to be confirmed;
obtaining feature target data through a preset multi-target optimization algorithm according to the feature statistical result to be confirmed;
inputting the characteristic target data into a preset Softmax classification model to obtain a speech emotion recognition result;
the method for obtaining the initial voice samples comprises the following steps of obtaining a test voice sample with a preset dimensionality, carrying out segmentation processing on the test voice sample through a preset rule, and obtaining a plurality of initial voice samples, wherein before the step of obtaining the initial voice samples, the method further comprises the following steps:
acquiring training voice samples with preset dimensions, and performing segmentation processing on the training voice samples through preset rules to obtain a plurality of initial training voice samples;
performing feature extraction on the initial training voice sample to obtain the features of a training voice signal to be processed;
performing feature statistics on the training voice signal features to be processed through a preset statistical function to obtain a training feature statistical result to be confirmed;
carrying out emotion category division on the statistical result of the training features to be confirmed to obtain training feature data to be optimized corresponding to different emotion categories;
determining a peripheral data distribution characteristic set corresponding to each training characteristic data to be optimized;
determining data distribution differences corresponding to different emotion classes according to the surrounding data distribution feature set;
obtaining emotion fragment sets corresponding to different emotion types according to the data distribution difference;
determining a feature subspace corresponding to each emotion fragment set;
and establishing a preset Softmax classification model based on the plurality of feature subspaces.
2. The method of claim 1, wherein the step of inputting the feature target data into a preset Softmax classification model to obtain the speech emotion recognition result comprises:
inputting the characteristic target data into the preset Softmax classification model to obtain speech emotion category data;
performing data statistics on the voice emotion type data to obtain a voice emotion type data value;
and obtaining a voice emotion recognition result according to the voice emotion category data value.
3. The method of claim 2, wherein the step of obtaining speech emotion recognition results according to the speech emotion classification data values comprises:
judging whether the voice emotion type data value belongs to a preset voice emotion type threshold range or not;
and if the voice emotion type data value belongs to the preset voice emotion type threshold range, acquiring a voice emotion recognition result according to the voice emotion type data value.
4. The method of claim 3, wherein the step of determining whether the speech emotion classification data value falls within a preset speech emotion classification threshold range further comprises:
and if the voice emotion category data value does not belong to the preset voice emotion category threshold range, returning to the step of inputting the feature target data into the preset Softmax classification model to obtain voice emotion category data.
5. The method according to claim 1, wherein the step of performing feature statistics on the feature data of the speech signal to be processed by using a preset statistical function to obtain a feature statistical result to be confirmed comprises:
screening the voice signal characteristic data to be processed to obtain label sample characteristic data;
and carrying out feature statistics on the tag sample feature data through a preset statistical function to obtain a feature statistical result to be confirmed.
6. An apparatus for speech emotion recognition, the apparatus comprising:
the system comprises an acquisition module, a processing module and a control module, wherein the acquisition module is used for acquiring a test voice sample with a preset dimensionality and performing segmentation processing on the test voice sample through a preset rule to obtain a plurality of initial voice samples;
the extraction module is used for extracting signal characteristic data of the initial voice sample to obtain voice signal characteristic data to be processed;
the statistical module is used for carrying out feature statistics on the voice signal feature data to be processed through a preset statistical function to obtain a feature statistical result to be confirmed;
the calculation module is used for obtaining feature target data through a preset multi-target optimization algorithm according to the feature statistical result to be confirmed;
the determining module is used for inputting the characteristic target data into a preset Softmax classification model to obtain a speech emotion recognition result;
the speech emotion recognition apparatus further includes: acquiring training voice samples with preset dimensions, and performing segmentation processing on the test voice samples through preset rules to obtain a plurality of initial training voice samples; performing feature extraction on the initial training voice sample to obtain the features of a training voice signal to be processed; performing feature statistics on the training voice signal features to be processed through a preset statistical function to obtain a training feature statistical result to be confirmed; carrying out emotion category division on the statistical result of the training features to be confirmed to obtain training feature data to be optimized corresponding to different emotion categories; determining a peripheral data distribution characteristic set corresponding to each training characteristic data to be optimized; determining data distribution differences corresponding to different emotion classes according to the surrounding data distribution feature set; obtaining emotion fragment sets corresponding to different emotion types according to the data distribution difference; determining a feature subspace corresponding to each emotion fragment set; and establishing a preset Softmax classification model based on the plurality of feature subspaces.
7. An electronic device, characterized in that the device comprises: a memory, a processor and a speech emotion recognition program stored on the memory and executable on the processor, the speech emotion recognition program being configured to implement the steps of the speech emotion recognition method as claimed in any of claims 1 to 5.
8. A storage medium having stored thereon a speech emotion recognition program, which when executed by a processor implements the steps of the speech emotion recognition method as claimed in any one of claims 1 to 5.
CN201911246544.5A 2019-12-06 2019-12-06 Speech emotion recognition method, device, equipment and storage medium Expired - Fee Related CN110956981B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911246544.5A CN110956981B (en) 2019-12-06 2019-12-06 Speech emotion recognition method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911246544.5A CN110956981B (en) 2019-12-06 2019-12-06 Speech emotion recognition method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN110956981A CN110956981A (en) 2020-04-03
CN110956981B true CN110956981B (en) 2022-04-26

Family

ID=69980269

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911246544.5A Expired - Fee Related CN110956981B (en) 2019-12-06 2019-12-06 Speech emotion recognition method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN110956981B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112466324A (en) * 2020-11-13 2021-03-09 上海听见信息科技有限公司 Emotion analysis method, system, equipment and readable storage medium
CN113326678B (en) * 2021-06-24 2024-08-06 深圳前海微众银行股份有限公司 Conference summary generation method and device, terminal equipment and computer storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102663413A (en) * 2012-03-09 2012-09-12 中盾信安科技(江苏)有限公司 Multi-gesture and cross-age oriented face image authentication method
CN105488456A (en) * 2015-11-23 2016-04-13 中国科学院自动化研究所 Adaptive rejection threshold adjustment subspace learning based human face detection method
CN105913025A (en) * 2016-04-12 2016-08-31 湖北工业大学 Deep learning face identification method based on multiple-characteristic fusion

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100250477A1 (en) * 2009-03-31 2010-09-30 Shekhar Yadav Systems and methods for optimizing a campaign
CA2786380C (en) * 2010-01-18 2020-07-14 Elminda Ltd. Method and system for weighted analysis of neurophysiological data
CN104008754B (en) * 2014-05-21 2017-01-18 华南理工大学 Speech emotion recognition method based on semi-supervised feature selection
CN107977651B (en) * 2017-12-21 2019-12-24 西安交通大学 Common spatial mode spatial domain feature extraction method based on quantization minimum error entropy

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102663413A (en) * 2012-03-09 2012-09-12 中盾信安科技(江苏)有限公司 Multi-gesture and cross-age oriented face image authentication method
CN105488456A (en) * 2015-11-23 2016-04-13 中国科学院自动化研究所 Adaptive rejection threshold adjustment subspace learning based human face detection method
CN105913025A (en) * 2016-04-12 2016-08-31 湖北工业大学 Deep learning face identification method based on multiple-characteristic fusion

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
The Role of Principal Angles in Subspace Classification; Jiaji Huang; IEEE Transactions on Signal Processing; 20151117; full text *
Transferable Representation Learning with Deep Adaptation Networks; Mingsheng Long; IEEE Transactions on Pattern Analysis and Machine Intelligence; 20180905; pages 3071-3085 *
Research on Methods and Technologies of Human-Machine Emotional Interaction; Wang Guojiang; China Doctoral Dissertations Full-text Database; 20080415 (No. 4); full text *
Research on Image Classification Algorithms Oriented to Local Features and Feature Representation; Zhang Xu; China Doctoral Dissertations Full-text Database; 20170215 (No. 2); full text *

Also Published As

Publication number Publication date
CN110956981A (en) 2020-04-03

Similar Documents

Publication Publication Date Title
Dai et al. Learning discriminative features from spectrograms using center loss for speech emotion recognition
Bahari et al. Speaker age estimation and gender detection based on supervised non-negative matrix factorization
Lan et al. An extreme learning machine approach for speaker recognition
CN111696557A (en) Method, device and equipment for calibrating voice recognition result and storage medium
CN108346436A (en) Speech emotional detection method, device, computer equipment and storage medium
Parthasarathy et al. Convolutional neural network techniques for speech emotion recognition
Wang et al. Discriminative neural embedding learning for short-duration text-independent speaker verification
CN111916111A (en) Intelligent voice outbound method and device with emotion, server and storage medium
JP6787770B2 (en) Language mnemonic and language dialogue system
CN107767881B (en) Method and device for acquiring satisfaction degree of voice information
CN110956981B (en) Speech emotion recognition method, device, equipment and storage medium
CN104538036A (en) Speaker recognition method based on semantic cell mixing model
Bahari Speaker age estimation using Hidden Markov Model weight supervectors
CN112233651A (en) Dialect type determining method, dialect type determining device, dialect type determining equipment and storage medium
CN112632248A (en) Question answering method, device, computer equipment and storage medium
CN109377984B (en) ArcFace-based voice recognition method and device
Mande et al. EMOTION DETECTION USING AUDIO DATA SAMPLES.
Akbal et al. Development of novel automated language classification model using pyramid pattern technique with speech signals
Birla A robust unsupervised pattern discovery and clustering of speech signals
JP6996627B2 (en) Information processing equipment, control methods, and programs
Elbarougy Speech emotion recognition based on voiced emotion unit
CN111145787B (en) Voice emotion feature fusion method and system based on main and auxiliary networks
Hanilçi et al. Investigation of the effect of data duration and speaker gender on text-independent speaker recognition
CN115315746A (en) Speaker recognition method, recognition device, recognition program, gender recognition model generation method, and speaker recognition model generation method
Michalevsky et al. Speaker identification using diffusion maps

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20220426