CN109003625B - Speech emotion recognition method and system based on ternary loss - Google Patents

Speech emotion recognition method and system based on ternary loss

Info

Publication number
CN109003625B
CN109003625B
Authority
CN
China
Prior art keywords
voice
emotion
speech
preset
ternary
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810839374.0A
Other languages
Chinese (zh)
Other versions
CN109003625A (en)
Inventor
陶建华
黄健
李雅
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Automation of Chinese Academy of Science
Original Assignee
Institute of Automation of Chinese Academy of Science
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Automation of Chinese Academy of Science filed Critical Institute of Automation of Chinese Academy of Science
Priority to CN201810839374.0A priority Critical patent/CN109003625B/en
Publication of CN109003625A publication Critical patent/CN109003625A/en
Application granted granted Critical
Publication of CN109003625B publication Critical patent/CN109003625B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/63 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Hospice & Palliative Care (AREA)
  • General Health & Medical Sciences (AREA)
  • Psychiatry (AREA)
  • Child & Adolescent Psychology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The invention belongs to the technical field of emotion recognition, and particularly relates to a speech emotion recognition method and system based on ternary loss, aiming at solving the technical problem of accurately recognizing easily confusable emotion categories. To this end, the speech emotion recognition method of the present invention includes: performing framing processing on voice data to be detected to obtain a voice sequence of a specific length; performing time sequence coding of the voice sequence based on a preset emotion time sequence coding network to obtain the emotion feature vector corresponding to the voice sequence; and predicting the emotion category corresponding to the emotion feature vector based on a preset speech emotion classifier and according to a plurality of preset real emotion categories. The speech emotion recognition method can better distinguish easily confusable speech emotion categories, and the speech emotion recognition system can execute and implement the method.

Description

Speech emotion recognition method and system based on ternary loss
Technical Field
The invention belongs to the technical field of emotion recognition, and particularly relates to a voice emotion recognition method and system based on ternary loss.
Background
Speech emotion recognition has wide application in human-computer interaction and artificial intelligence, and is a key research direction in these fields. It mainly comprises two parts: speech emotion feature extraction and speech emotion recognition model training. Most speech emotion recognition methods focus on extracting robust and effective speech emotion features and on finding effective emotion recognition models. However, emotions are inherently ambiguous, and some emotion categories are particularly easy to confuse with each other, such as "angry" and "disgust", or "surprise" and "sad".
In addition, speech emotion recognition must handle inputs of variable length. Traditional machine learning methods require fixed-length inputs, and the usual approach is to truncate long samples and zero-pad short ones; the experimental results of such methods are not ideal.
Accordingly, there is a need in the art for a new speech emotion recognition method and system to solve the above problems.
Disclosure of Invention
The invention aims to solve the above technical problem in the prior art, namely how to accurately recognize easily confusable emotion categories. To this end, in one aspect of the present invention, a speech emotion recognition method based on ternary loss is provided, which includes:
performing framing processing on voice data to be detected to obtain a voice sequence with a specific length;
carrying out time sequence coding based on a preset emotion time sequence coding network and according to the voice sequence to obtain an emotion feature vector corresponding to the voice sequence;
predicting emotion categories corresponding to the emotion feature vectors based on a preset voice emotion classifier and according to a plurality of preset real emotion categories;
the emotion time sequence coding network is a long short-term memory (LSTM) neural network model constructed based on preset voice data samples by using a machine learning algorithm; the speech emotion classifier is a support vector machine model constructed based on the voice data samples by using a machine learning algorithm.
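For illustration, a minimal sketch of this two-stage pipeline is given below, assuming PyTorch for the LSTM-based emotion time sequence coding network and scikit-learn for the support vector machine classifier; class and variable names such as EmotionEncoder and embed, as well as the feature dimensions, are hypothetical placeholders and not values from the patent.

```python
# Minimal sketch of the pipeline described above: an LSTM-based emotion time
# sequence coding network produces one emotion feature vector per utterance,
# and an SVM classifier predicts the emotion category. Names are illustrative.
import torch
import torch.nn as nn
from sklearn.svm import SVC

class EmotionEncoder(nn.Module):
    """Emotion time sequence coding network (LSTM encoder)."""
    def __init__(self, feat_dim=40, hidden_dim=128, emb_dim=64):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.proj = nn.Linear(hidden_dim, emb_dim)

    def forward(self, x):                  # x: (batch, F frames, feat_dim)
        _, (h_n, _) = self.lstm(x)         # h_n: (1, batch, hidden_dim)
        return self.proj(h_n[-1])          # one emotion feature vector per utterance

encoder = EmotionEncoder()

# After the encoder has been trained with the ternary (triplet + cross-entropy)
# loss, embed the training utterances and fit the SVM emotion classifier.
def embed(batch):                          # batch: (n, F, feat_dim) tensor
    with torch.no_grad():
        return encoder(batch).numpy()

train_x = torch.randn(32, 300, 40)              # placeholder frame-level features
train_y = torch.randint(0, 4, (32,)).numpy()    # placeholder emotion labels
svm = SVC(kernel="rbf").fit(embed(train_x), train_y)
pred = svm.predict(embed(torch.randn(1, 300, 40)))   # predicted emotion category
```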
Further, a preferred technical solution provided by the present invention is:
before the step of "obtaining an emotion feature vector corresponding to the voice sequence by performing time sequence coding according to the voice sequence based on a preset emotion time sequence coding network", the method further includes:
obtaining a plurality of ternary voice sample groups according to the voice data samples; the ternary voice sample group comprises a first voice data sample, a second voice data sample and a third voice data sample, and the emotion classes of the first voice data sample and the second voice data sample are the same and the emotion classes of the first voice data sample and the third voice data sample are different;
and carrying out network training on the emotion time sequence coding network according to the ternary voice sample group and a loss function shown as the following formula:
L = L_1 + L_2

wherein L_1 denotes a preset triplet loss function and L_2 denotes a preset cross-entropy loss function;
said L_1 is shown in the following formula:

L_1 = \sum_{i=1}^{N} \left[ \| f(x_i^a) - f(x_i^p) \|_2^2 - \| f(x_i^a) - f(x_i^n) \|_2^2 + \alpha \right]_+

wherein the subscript "+" indicates that the value in brackets is taken as the loss when it is greater than zero, and the loss is zero when the value in brackets is less than or equal to zero; x_i^a, x_i^p and x_i^n denote the first, second and third voice data samples in the i-th ternary voice sample group; N denotes the number of ternary voice sample groups; f(x) denotes the emotion feature vector corresponding to the voice data sample x, with f(x) \in \mathbb{R}^d, where \mathbb{R} denotes the set of real numbers; \alpha denotes a preset distance parameter;
said L_2 is shown in the following formula:

L_2 = -\sum_i y_i \log \hat{y}_i

wherein y_i denotes a preset i-th real emotion category label, and \hat{y}_i denotes the value of y_i after logistic-regression processing.
Further, a preferred technical solution provided by the present invention is:
the step of obtaining a plurality of ternary sets of speech samples from the speech data samples comprises:
obtaining a ternary voice sample group according to the voice data sample and according to the method shown in the following formula:
\| f(x_i^a) - f(x_i^p) \|_2^2 > \| f(x_i^a) - f(x_i^n) \|_2^2

wherein \| f(x_i^a) - f(x_i^p) \|_2^2 denotes the square of the 2-norm of f(x_i^a) - f(x_i^p), and \| f(x_i^a) - f(x_i^n) \|_2^2 denotes the square of the 2-norm of f(x_i^a) - f(x_i^n).
Further, a preferred technical solution provided by the present invention is:
the step of framing the voice data to be tested to obtain the voice sequence with specific length comprises the following steps:
framing the voice data to be detected according to a preset time threshold value to obtain a plurality of voice frames;
comparing the number of the voice frames with a preset frame number threshold value F, and acquiring the voice sequence according to the comparison result and the plurality of voice frames, specifically:
if the number of the voice frames is equal to the frame number threshold value F, taking the plurality of voice frames as voice sequences;
if the number of the voice frames is larger than the frame number threshold value F, randomly selecting continuous F voice frames in the middle part from the plurality of voice frames as voice sequences;
if the number of the voice frames is less than the frame number threshold value F, the plurality of voice frames are taken as a data whole, the data whole is copied and spliced for a plurality of times until the total frame number is more than the frame number threshold value F, and continuous F voice frames are randomly selected from the data whole to be taken as voice sequences, or
Repeatedly copying and splicing each voice frame until the total frame number is greater than the frame number threshold value F, and randomly selecting continuous F voice frames from the voice frames as voice sequences, or
And copying and splicing the last voice frame of the voice data to be detected for multiple times until the total frame number is equal to the frame number threshold value F.
Further, a preferred technical solution provided by the present invention is:
the step of obtaining the emotion feature vector corresponding to the voice sequence by performing time sequence coding according to the voice sequence based on a preset emotion time sequence coding network comprises the following steps:
obtaining the emotion feature vector corresponding to the voice sequence according to the method shown in the following formula:
i_t = \sigma(W_{xi} x_t + W_{hi} h_{t-1} + W_{ci} \cdot c_{t-1} + b_i)

f_t = \sigma(W_{xf} x_t + W_{hf} h_{t-1} + W_{cf} \cdot c_{t-1} + b_f)

c_t = f_t \cdot c_{t-1} + i_t \cdot \tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c)

o_t = \sigma(W_{xo} x_t + W_{ho} h_{t-1} + W_{co} \cdot c_t + b_o)

h_t = o_t \cdot \tanh(c_t)

wherein i_t, f_t and o_t denote the input gate, forget gate and output gate of the emotion time sequence coding network, respectively; x_t, h_t and c_t denote the input matrix, hidden layer matrix and cell state of the emotion time sequence coding network at the current time t, respectively; c_{t-1} denotes the cell state of the emotion time sequence coding network at the previous time t-1; h_{t-1} denotes the hidden layer matrix of the emotion time sequence coding network at the previous time t-1; W_{xi} denotes the transition matrix from the input matrix to the input gate; W_{hi} denotes the transition matrix from the hidden layer matrix to the input gate; W_{ci} denotes the transition matrix from the cell state to the input gate; W_{xf} denotes the transition matrix from the input matrix to the forget gate; W_{hf} denotes the transition matrix from the hidden layer matrix to the forget gate; W_{cf} denotes the transition matrix from the cell state to the forget gate; W_{xc} denotes the transition matrix from the input matrix to the cell state; W_{hc} denotes the transition matrix from the hidden layer matrix to the cell state; W_{xo} denotes the transition matrix from the input matrix to the output gate; W_{ho} denotes the transition matrix from the hidden layer matrix to the output gate; W_{co} denotes the transition matrix from the cell state to the output gate; b_i, b_f, b_o and b_c denote the bias terms corresponding to the input gate, forget gate, output gate and cell state, respectively; "\cdot" denotes the Hadamard product, \sigma denotes a preset activation function, and \tanh denotes the hyperbolic tangent function.
In another aspect of the present invention, there is also provided a speech emotion recognition system based on ternary loss, including:
the variable-length input processing module is configured to perform framing processing on the voice data to be detected to obtain a voice sequence with a specific length;
the emotion time sequence coding module is configured to perform time sequence coding according to the voice sequence based on a preset emotion time sequence coding network to obtain an emotion feature vector corresponding to the voice sequence;
the voice emotion recognition module is configured to predict emotion categories corresponding to the emotion feature vectors based on a preset voice emotion classifier and according to a plurality of preset real emotion categories;
the emotion time sequence coding network is a long short-term memory (LSTM) neural network model constructed based on preset voice data samples by using a machine learning algorithm; the speech emotion classifier is a support vector machine model constructed based on the voice data samples by using a machine learning algorithm.
Further, a preferred technical solution provided by the present invention is:
the voice emotion recognition system also comprises a ternary voice sample group acquisition module and a voice emotion loss function module; the voice emotion loss function module comprises a ternary loss function submodule and a cross entropy loss function submodule;
the ternary voice sample group acquisition module is configured to acquire a plurality of ternary voice sample groups according to the voice data samples; the ternary voice sample group comprises a first voice data sample, a second voice data sample and a third voice data sample, and the emotion classes of the first voice data sample and the second voice data sample are the same and the emotion classes of the first voice data sample and the third voice data sample are different;
the ternary loss function submodule is configured to perform network training on the emotion time sequence coding network according to a loss function shown in the following formula:
L_1 = \sum_{i=1}^{N} \left[ \| f(x_i^a) - f(x_i^p) \|_2^2 - \| f(x_i^a) - f(x_i^n) \|_2^2 + \alpha \right]_+

wherein the subscript "+" indicates that the value in brackets is taken as the loss when it is greater than zero, and the loss is zero when the value in brackets is less than or equal to zero; x_i^a, x_i^p and x_i^n denote the first, second and third voice data samples in the i-th ternary voice sample group; N denotes the number of ternary voice sample groups; f(x) denotes the emotion feature vector corresponding to the voice data sample x, with f(x) \in \mathbb{R}^d, where \mathbb{R} denotes the set of real numbers; \alpha denotes a preset distance parameter;
the cross entropy loss function submodule is configured to perform network training on the emotion time sequence coding network according to a loss function shown in the following formula:
L_2 = -\sum_i y_i \log \hat{y}_i

wherein y_i denotes a preset i-th real emotion category label, and \hat{y}_i denotes the value of y_i after logistic-regression processing.
Further, a preferred technical solution provided by the present invention is:
the ternary voice sample group obtaining module is further configured to obtain a ternary voice sample group according to the voice data sample and according to the following method:
\| f(x_i^a) - f(x_i^p) \|_2^2 > \| f(x_i^a) - f(x_i^n) \|_2^2

wherein \| f(x_i^a) - f(x_i^p) \|_2^2 denotes the square of the 2-norm of f(x_i^a) - f(x_i^p), and \| f(x_i^a) - f(x_i^n) \|_2^2 denotes the square of the 2-norm of f(x_i^a) - f(x_i^n).
Further, a preferred technical solution provided by the present invention is:
the variable length input processing module is further configured to perform the following operations:
framing the voice data to be detected according to a preset time threshold value to obtain a plurality of voice frames;
comparing the number of the voice frames with a preset frame number threshold value F, and acquiring the voice sequence according to the comparison result and the plurality of voice frames, specifically:
if the number of the voice frames is equal to the frame number threshold value F, taking the plurality of voice frames as voice sequences;
if the number of the voice frames is larger than the frame number threshold value F, randomly selecting continuous F voice frames in the middle part from the plurality of voice frames as voice sequences;
if the number of the voice frames is less than the frame number threshold value F, the plurality of voice frames are taken as a data whole, the data whole is copied and spliced for a plurality of times until the total frame number is more than the frame number threshold value F, and continuous F voice frames are randomly selected from the data whole to be taken as voice sequences, or
Repeatedly copying and splicing each voice frame until the total frame number is greater than the frame number threshold value F, and randomly selecting continuous F voice frames from the voice frames as voice sequences, or
And copying and splicing the last voice frame of the voice data to be detected for multiple times until the total frame number is equal to the frame number threshold value F.
Further, a preferred technical solution provided by the present invention is:
the emotion time sequence coding module is further configured to obtain an emotion feature vector according to a method shown in the following formula:
i_t = \sigma(W_{xi} x_t + W_{hi} h_{t-1} + W_{ci} \cdot c_{t-1} + b_i)

f_t = \sigma(W_{xf} x_t + W_{hf} h_{t-1} + W_{cf} \cdot c_{t-1} + b_f)

c_t = f_t \cdot c_{t-1} + i_t \cdot \tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c)

o_t = \sigma(W_{xo} x_t + W_{ho} h_{t-1} + W_{co} \cdot c_t + b_o)

h_t = o_t \cdot \tanh(c_t)

wherein i_t, f_t and o_t denote the input gate, forget gate and output gate of the emotion time sequence coding network, respectively; x_t, h_t and c_t denote the input matrix, hidden layer matrix and cell state of the emotion time sequence coding network at the current time t, respectively; c_{t-1} denotes the cell state of the emotion time sequence coding network at the previous time t-1; h_{t-1} denotes the hidden layer matrix of the emotion time sequence coding network at the previous time t-1; W_{xi} denotes the transition matrix from the input matrix to the input gate; W_{hi} denotes the transition matrix from the hidden layer matrix to the input gate; W_{ci} denotes the transition matrix from the cell state to the input gate; W_{xf} denotes the transition matrix from the input matrix to the forget gate; W_{hf} denotes the transition matrix from the hidden layer matrix to the forget gate; W_{cf} denotes the transition matrix from the cell state to the forget gate; W_{xc} denotes the transition matrix from the input matrix to the cell state; W_{hc} denotes the transition matrix from the hidden layer matrix to the cell state; W_{xo} denotes the transition matrix from the input matrix to the output gate; W_{ho} denotes the transition matrix from the hidden layer matrix to the output gate; W_{co} denotes the transition matrix from the cell state to the output gate; b_i, b_f, b_o and b_c denote the bias terms corresponding to the input gate, forget gate, output gate and cell state, respectively; "\cdot" denotes the Hadamard product, \sigma denotes a preset activation function, and \tanh denotes the hyperbolic tangent function.
Compared with the closest prior art, the technical scheme at least has the following beneficial effects:
1. the speech emotion recognition method based on ternary loss mainly comprises the following steps: framing the voice data to be detected to obtain a voice sequence with a specific length; carrying out time sequence coding on the voice sequence by utilizing an emotion time sequence coding network to obtain a robust emotion characteristic vector; and predicting the emotion type of the voice data to be detected based on the voice emotion classifier and by using the emotion feature vector. The method can effectively improve the speech emotion recognition precision.
2. The method trains the emotion time sequence coding network with a triplet loss function over ternary voice sample groups; the triplet loss reduces the distance between positive sample pairs and increases the distance between negative sample pairs, i.e., it reduces the intra-class distance and increases the inter-class distance, so that the speech emotions of different classes are easier to distinguish.
3. In the invention, the criterion for selecting the ternary voice sample groups used to train the emotion time sequence coding network is that the distance from the anchor sample's emotion feature to the positive sample's emotion feature is greater than the distance to the negative sample's emotion feature, i.e., "difficult" samples are selected. As a result, the trained emotion time sequence coding network produces more robust emotion feature vectors, useless training on "easy" samples is avoided, and network convergence is accelerated.
4. According to the method, the voice data to be detected is framed according to a time threshold to obtain a plurality of voice frames; the number of voice frames is compared with the frame number threshold F, and a voice sequence is acquired based on the comparison result and the plurality of voice frames. This handles the variable-length input problem well and improves the accuracy of speech emotion recognition.
5. The invention provides a speech emotion recognition system based on ternary loss, which can realize the speech emotion recognition method based on ternary loss.
Drawings
FIG. 1 is a schematic diagram of the main steps of a speech emotion recognition method based on ternary loss in an embodiment of the present invention;
FIG. 2 is a schematic diagram of a main structure of a speech emotion recognition system based on ternary loss according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a main structure of a variable-length input processing module according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a main structure of a speech emotion recognition system based on ternary loss according to an embodiment of the present invention.
Detailed Description
Preferred embodiments of the present invention are described below with reference to the accompanying drawings. It should be understood by those skilled in the art that these embodiments are only for explaining the technical principle of the present invention, and are not intended to limit the scope of the present invention.
The emotion recognition method based on ternary loss provided by the invention is explained below with reference to the accompanying drawings.
Fig. 1 exemplarily shows main steps of an emotion recognition method based on ternary loss in this embodiment, and as shown in fig. 1, an emotion recognition method based on ternary loss in this embodiment may include the following steps:
step S101: and performing framing processing on the voice data to be detected to obtain a voice sequence with a specific length.
Specifically, framing processing is carried out on voice data to be detected according to a preset time threshold value, and a plurality of voice frames are obtained;
comparing the number of the voice frames with a preset frame number threshold value F, and acquiring the voice sequence according to the comparison result and a plurality of voice frames, specifically:
if the number of the voice frames is equal to the frame number threshold value F, taking a plurality of voice frames as voice sequences;
if the number of the voice frames is larger than the frame number threshold value F, randomly selecting continuous F voice frames in the middle part from the plurality of voice frames as voice sequences, namely deleting the voice frames at the head end and the tail end in the voice data to be detected so that the number of the voice frames is equal to the frame number threshold value F;
if the number of the voice frames is less than the frame number threshold value F, the invention provides three processing modes: a loop mode, a copy mode and a fill mode. In the loop mode, the plurality of voice frames are treated as a whole, and this whole is copied and spliced repeatedly until the total frame number is greater than the frame number threshold F; F consecutive voice frames are then randomly selected from it as the voice sequence. In the copy mode, each voice frame is copied and spliced repeatedly until the total frame number is greater than the frame number threshold F, and F consecutive voice frames are randomly selected as the voice sequence. In the fill mode, the last voice frame of the voice data to be detected is copied and appended repeatedly until the total frame number equals the frame number threshold F.
In this embodiment, speech data from the speech emotion database IEMOCAP is used. For the case where the number of voice frames is less than the frame number threshold F, the speech emotion recognition accuracy of the three modes is compared. Table 1 shows the comparison of speech emotion recognition accuracy for the three modes:
TABLE 1

Mode        Accuracy
Loop mode   59.6%
Copy mode   58.1%
Fill mode   56.3%
As can be seen from Table 1, the loop mode works best, the copy mode second, and the fill mode worst. The loop mode repeats the whole dynamic process of emotion change, making it longer and periodic, which helps the long short-term memory model capture emotion dynamics. The copy mode repeats individual frames, which effectively slows down the dynamic process of emotion change and hinders its modeling, and the frames appended by the fill mode contribute nothing to the dynamics of emotion change, so these two modes perform worse. Therefore, in this embodiment, when the number of speech frames is smaller than the frame number threshold F, the loop mode is used to obtain the voice sequence.
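By way of illustration, the three modes compared in Table 1 could be implemented roughly as in the sketch below; the function name to_fixed_length, the use of NumPy arrays, and the example dimensions are assumptions made for this example only.

```python
# Sketch of the variable-length handling described above. `frames` is an
# (n_frames, feat_dim) array of frame-level features; F is the frame threshold.
import numpy as np

def to_fixed_length(frames, F, mode="loop"):
    n = len(frames)
    if n == F:
        return frames
    if n > F:
        # keep F consecutive frames starting from a random position
        start = np.random.randint(0, n - F + 1)
        return frames[start:start + F]
    if mode == "loop":        # repeat the whole sequence, then crop F frames
        reps = int(np.ceil(F / n)) + 1
        tiled = np.tile(frames, (reps, 1))
        start = np.random.randint(0, len(tiled) - F + 1)
        return tiled[start:start + F]
    if mode == "copy":        # repeat each frame in place, then crop F frames
        reps = int(np.ceil(F / n)) + 1
        repeated = np.repeat(frames, reps, axis=0)
        start = np.random.randint(0, len(repeated) - F + 1)
        return repeated[start:start + F]
    # "fill": pad by repeating the last frame until exactly F frames
    pad = np.tile(frames[-1:], (F - n, 1))
    return np.concatenate([frames, pad], axis=0)

frames = np.random.randn(120, 40)                    # e.g. 120 frames of 40-dim features
seq = to_fixed_length(frames, F=300, mode="loop")    # -> shape (300, 40)
```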
Step S102: and carrying out time sequence coding based on a preset emotion time sequence coding network and according to the voice sequence to obtain an emotion feature vector corresponding to the voice sequence.
Specifically, the emotion time sequence coding network is a long short-term memory (LSTM) neural network model constructed based on preset voice data samples by using a machine learning algorithm. The emotion time sequence coding network obtains the emotion feature vector corresponding to the voice sequence according to formulas (1) to (5):
i_t = \sigma(W_{xi} x_t + W_{hi} h_{t-1} + W_{ci} \cdot c_{t-1} + b_i)    (1)

f_t = \sigma(W_{xf} x_t + W_{hf} h_{t-1} + W_{cf} \cdot c_{t-1} + b_f)    (2)

c_t = f_t \cdot c_{t-1} + i_t \cdot \tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c)    (3)

o_t = \sigma(W_{xo} x_t + W_{ho} h_{t-1} + W_{co} \cdot c_t + b_o)    (4)

h_t = o_t \cdot \tanh(c_t)    (5)

where i_t, f_t and o_t denote the input gate, forget gate and output gate of the emotion time sequence coding network, respectively; x_t, h_t and c_t denote the input matrix, hidden layer matrix and cell state of the emotion time sequence coding network at the current time t, respectively; c_{t-1} denotes the cell state of the emotion time sequence coding network at the previous time t-1; h_{t-1} denotes the hidden layer matrix of the emotion time sequence coding network at the previous time t-1; W_{xi} denotes the transition matrix from the input matrix to the input gate; W_{hi} denotes the transition matrix from the hidden layer matrix to the input gate; W_{ci} denotes the transition matrix from the cell state to the input gate; W_{xf} denotes the transition matrix from the input matrix to the forget gate; W_{hf} denotes the transition matrix from the hidden layer matrix to the forget gate; W_{cf} denotes the transition matrix from the cell state to the forget gate; W_{xc} denotes the transition matrix from the input matrix to the cell state; W_{hc} denotes the transition matrix from the hidden layer matrix to the cell state; W_{xo} denotes the transition matrix from the input matrix to the output gate; W_{ho} denotes the transition matrix from the hidden layer matrix to the output gate; W_{co} denotes the transition matrix from the cell state to the output gate; b_i, b_f, b_o and b_c denote the bias terms corresponding to the input gate, forget gate, output gate and cell state, respectively; "\cdot" denotes the Hadamard product, \sigma denotes a preset activation function, and \tanh denotes the hyperbolic tangent function. In addition, h_t and c_t are intermediate data within the emotion time sequence coding network and are expressed in matrix form.
The embodiment further comprises a step of network training the emotion time sequence coding network, which specifically comprises the following steps:
acquiring a plurality of ternary voice sample groups according to the voice data samples; the ternary voice sample group comprises a first voice data sample, a second voice data sample and a third voice data sample, the emotion classes of the first voice data sample and the second voice data sample are the same, and the emotion classes of the first voice data sample and the third voice data sample are different;
and (3) carrying out network training on the emotion time sequence coding network according to the obtained ternary voice sample group and the loss function shown in the formula (6):
L = L_1 + L_2    (6)

where L_1 denotes a preset triplet loss function and L_2 denotes a preset cross-entropy loss function. The triplet loss function decreases the distance between positive sample pairs and increases the distance between negative sample pairs. The cross-entropy loss function is a supervised loss used to supervise the learning of the network and to guide the clustering of samples of the same emotion class by using the preset sample class information.
L_1 is shown in equation (7):

L_1 = \sum_{i=1}^{N} \left[ \| f(x_i^a) - f(x_i^p) \|_2^2 - \| f(x_i^a) - f(x_i^n) \|_2^2 + \alpha \right]_+    (7)

where the subscript "+" indicates that the value in brackets is taken as the loss when it is greater than zero, and the loss is zero when the value in brackets is less than or equal to zero; x_i^a, x_i^p and x_i^n denote the first, second and third voice data samples of the i-th ternary voice sample group; N denotes the number of ternary voice sample groups; f(x) denotes the emotion feature vector corresponding to the voice data sample x, with f(x) \in \mathbb{R}^d, where \mathbb{R} denotes the set of real numbers; and \alpha denotes a preset distance parameter.

L_2 is shown in equation (8):

L_2 = -\sum_i y_i \log \hat{y}_i    (8)

where y_i denotes the preset i-th real emotion category label and \hat{y}_i denotes the value of y_i after logistic-regression processing.
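To make the combined objective of equations (6) to (8) concrete, the following sketch computes L = L_1 + L_2 for a batch of ternary voice sample groups, assuming PyTorch; the linear classification head used to obtain \hat{y}, the margin value, and the tensor shapes are hypothetical.

```python
# Sketch of the combined training loss L = L1 + L2 from equations (6)-(8).
# `anchor`, `positive`, `negative` are encoder outputs f(x) for the first,
# second and third samples of each ternary voice sample group.
import torch
import torch.nn.functional as F

def ternary_loss(anchor, positive, negative, logits, labels, alpha=0.2):
    # L1: triplet loss with squared 2-norm distances and margin alpha, eq. (7)
    d_pos = (anchor - positive).pow(2).sum(dim=1)
    d_neg = (anchor - negative).pow(2).sum(dim=1)
    l1 = torch.clamp(d_pos - d_neg + alpha, min=0.0).sum()
    # L2: cross-entropy between true labels y and predicted outputs y_hat, eq. (8)
    l2 = F.cross_entropy(logits, labels)
    return l1 + l2                                      # eq. (6)

emb_dim, n_classes, batch = 64, 4, 8
anchor = torch.randn(batch, emb_dim, requires_grad=True)
positive = torch.randn(batch, emb_dim)
negative = torch.randn(batch, emb_dim)
head = torch.nn.Linear(emb_dim, n_classes)              # classification head for y_hat
labels = torch.randint(0, n_classes, (batch,))
loss = ternary_loss(anchor, positive, negative, head(anchor), labels)
loss.backward()                                         # gradients flow back to the encoder
```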
Further, the selection of the ternary voice sample groups used to train the emotion time sequence coding network is important: a very large number of ternary voice sample groups can be formed from the whole training set, but most of them do not benefit the training of the emotion time sequence coding network and would slow down its convergence. Therefore, "difficult" training samples are selected as far as possible, that is, ternary voice sample groups are selected from the training set to train the emotion time sequence coding network according to formula (9):
\| f(x_i^a) - f(x_i^p) \|_2^2 > \| f(x_i^a) - f(x_i^n) \|_2^2    (9)

where \| f(x_i^a) - f(x_i^p) \|_2^2 denotes the square of the 2-norm of f(x_i^a) - f(x_i^p), and \| f(x_i^a) - f(x_i^n) \|_2^2 denotes the square of the 2-norm of f(x_i^a) - f(x_i^n). In this embodiment, when a ternary voice sample group is selected, an anchor sample x_i^a is given, a positive sample x_i^p is selected such that \| f(x_i^a) - f(x_i^p) \|_2^2 is as large as possible, and a negative sample x_i^n is selected such that \| f(x_i^a) - f(x_i^n) \|_2^2 is as small as possible; the emotion time sequence coding network is then trained on the resulting ternary voice sample groups. When training the emotion time sequence coding network, these selection operations are carried out only over the batch of samples input at each step.
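The within-batch selection of "difficult" ternary voice sample groups could be sketched as follows, assuming PyTorch; the hardest-positive/hardest-negative strategy shown here is one plausible reading of the selection rule above, and the function name and batch layout are illustrative.

```python
# Sketch of hard ternary sample selection within one training batch: for each
# anchor, pick the farthest sample of the same emotion class as the positive
# and the closest sample of a different class as the negative.
# Assumes every emotion class occurs at least twice in the batch.
import torch

def mine_hard_triplets(embeddings, labels):
    # squared 2-norm distances between all pairs of embeddings in the batch
    dist = torch.cdist(embeddings, embeddings, p=2).pow(2)     # (B, B)
    same = labels.unsqueeze(0) == labels.unsqueeze(1)          # (B, B) bool
    eye = torch.eye(len(labels), dtype=torch.bool)

    pos_dist = dist.masked_fill(~same | eye, float("-inf"))
    neg_dist = dist.masked_fill(same, float("inf"))
    hardest_pos = pos_dist.argmax(dim=1)    # farthest same-class sample
    hardest_neg = neg_dist.argmin(dim=1)    # closest different-class sample
    return hardest_pos, hardest_neg

emb = torch.randn(16, 64)                   # batch of 16 emotion feature vectors
lab = torch.randint(0, 4, (16,))            # 4 emotion classes
p_idx, n_idx = mine_hard_triplets(emb, lab)
anchor, positive, negative = emb, emb[p_idx], emb[n_idx]   # ternary sample groups
```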
The embodiment of the present invention further provides a speech emotion recognition system based on ternary loss, referring to fig. 2, fig. 2 exemplarily shows a main structure of a speech emotion recognition system based on ternary loss in this embodiment, and the speech emotion recognition system shown in fig. 2 may include:
a variable-length input processing module 21 configured to perform framing processing on the voice data to be detected to obtain a voice sequence with a specific length;
the emotion time sequence coding module 22 is configured to perform time sequence coding according to the voice sequence based on a preset emotion time sequence coding network to obtain an emotion feature vector corresponding to the voice sequence;
the speech emotion recognition module 23 is configured to predict emotion classes corresponding to the emotion feature vectors based on a preset speech emotion classifier and according to a plurality of preset real emotion classes;
the emotion time sequence coding network is a long short-term memory (LSTM) neural network model constructed based on preset voice data samples by using a machine learning algorithm; the speech emotion classifier is a support vector machine model constructed based on the voice data samples by using a machine learning algorithm.
Further, the variable-length input processing module is further configured to perform the following operations:
framing the voice data to be detected according to a preset time threshold value to obtain a plurality of voice frames;
comparing the number of the voice frames with a preset frame number threshold value F, and acquiring the voice sequence according to the comparison result and a plurality of voice frames, specifically:
if the number of the voice frames is equal to the frame number threshold value F, taking a plurality of voice frames as voice sequences;
if the number of the voice frames is larger than the frame number threshold value F, randomly selecting continuous F voice frames in the middle part from the plurality of voice frames as voice sequences, namely deleting the voice frames at the head end and the tail end in the voice data to be detected so that the number of the voice frames is equal to the frame number threshold value F;
and if the number of the voice frames is less than the frame number threshold value F, taking a plurality of voice frames as a data whole, copying and splicing the data whole for multiple times until the total frame number is greater than the frame number threshold value F, randomly selecting continuous F voice frames from the data whole as a voice sequence, or copying and splicing each voice frame for multiple times until the total frame number is greater than the frame number threshold value F, randomly selecting continuous F voice frames from the data whole as the voice sequence, or copying and splicing the last voice frame of the voice data to be tested for multiple times until the total frame number is equal to the frame number threshold value F.
Referring to fig. 3, fig. 3 illustrates the main structure of a variable-length input processing module according to an embodiment, and the variable-length input processing module shown in fig. 3 may further include a speech framing sub-module 211 and a variable-length input processing sub-module 212.
The voice framing submodule 211 is configured to perform framing processing on the voice data to be detected according to a preset time threshold, and obtain a plurality of voice frames.
And the variable length input processing sub-module 212 is configured to compare the number of the voice frames with a preset frame number threshold value F and acquire a voice sequence according to the comparison result and the plurality of voice frames.
Further, the emotion time sequence coding module is configured to obtain the emotion feature vector according to the methods shown in formulas (1) to (5).
Further, referring to fig. 4, fig. 4 illustrates the main structure of a speech emotion recognition system based on ternary loss, and as shown in fig. 4, the speech emotion recognition system further includes a ternary speech sample group obtaining module 41 and a speech emotion loss function module 42; the speech emotion loss function module includes a ternary loss function sub-module 421 and a cross entropy loss function sub-module 422.
The ternary speech sample set obtaining module 41 is configured to obtain a plurality of ternary speech sample sets from the speech data samples. The ternary set of speech samples includes a first speech data sample, a second speech data sample, and a third speech data sample, and the emotion classifications of the first speech data sample and the second speech data sample are the same and the emotion classifications of the first speech data sample and the third speech data sample are different.
Ternary loss function submodule 421 is configured to perform network training on the emotion time sequence coding network according to the loss function shown in equation (7).
The cross entropy loss function sub-module 422 is configured to perform network training on the emotion time series coding network according to the loss function shown in formula (8).
Further, the ternary speech sample set obtaining module 41 may obtain the ternary speech sample set according to the method shown in formula (9), that is, selecting a "difficult" training sample to train the emotion time sequence coding network.
It will be understood by those skilled in the art that the variable-length input processing module, the emotion time sequence coding module and the speech emotion recognition module may be physically independent of each other, or may be functional units integrated into one physical module. The emotion time sequence coding module described above may include a memory, a processor, and a computing program stored in the memory and executable on the processor, and the computing program may perform the functions of the variable-length input processing module, the emotion time sequence coding module and the speech emotion recognition module.
Those of skill in the art will appreciate that the various illustrative method steps and systems described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative components and steps have been described above generally in terms of their functionality in order to clearly illustrate the interchangeability of electronic hardware and software. Whether such functionality is implemented as electronic hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
The terms "first," "second," and the like are used for distinguishing between similar elements and not necessarily for describing or implying a particular order or sequence.
The terms "comprises," "comprising," or any other similar term are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.
So far, the technical solutions of the present invention have been described in connection with the preferred embodiments shown in the drawings, but it is easily understood by those skilled in the art that the scope of the present invention is obviously not limited to these specific embodiments. Equivalent changes or substitutions of related technical features can be made by those skilled in the art without departing from the principle of the invention, and the technical scheme after the changes or substitutions can fall into the protection scope of the invention.

Claims (8)

1. A speech emotion recognition method based on ternary loss is characterized by comprising the following steps:
performing framing processing on voice data to be detected to obtain a voice sequence with a specific length;
carrying out time sequence coding based on a preset emotion time sequence coding network and according to the voice sequence to obtain an emotion feature vector corresponding to the voice sequence;
predicting emotion categories corresponding to the emotion feature vectors based on a preset voice emotion classifier and according to a plurality of preset real emotion categories;
the emotion time sequence coding network is a long short-term memory (LSTM) neural network model constructed based on preset voice data samples by using a machine learning algorithm; the speech emotion classifier is a support vector machine model constructed based on the voice data samples by using a machine learning algorithm;
before the step of obtaining the emotion feature vector corresponding to the voice sequence by performing time sequence coding according to the voice sequence based on a preset emotion time sequence coding network, the method further includes: obtaining a plurality of ternary voice sample groups according to the voice data samples;
and carrying out network training on the emotion time sequence coding network according to the ternary voice sample group and a loss function shown as the following formula:
L = L_1 + L_2

wherein L_1 denotes a preset triplet loss function and L_2 denotes a preset cross-entropy loss function;
the step of "framing the voice data to be tested and acquiring the voice sequence with a specific length" includes:
framing the voice data to be detected according to a preset time threshold value to obtain a plurality of voice frames;
comparing the number of the voice frames with a preset frame number threshold value F, and acquiring the voice sequence according to the comparison result and the plurality of voice frames, specifically:
if the number of the voice frames is equal to the frame number threshold value F, taking the plurality of voice frames as voice sequences;
if the number of the voice frames is larger than the frame number threshold value F, randomly selecting continuous F voice frames in the middle part from the plurality of voice frames as voice sequences;
if the number of the voice frames is less than the frame number threshold value F, the plurality of voice frames are taken as a data whole, the data whole is copied and spliced for a plurality of times until the total frame number is more than the frame number threshold value F, and continuous F voice frames are randomly selected from the data whole to be taken as voice sequences, or
Repeatedly copying and splicing each voice frame until the total frame number is greater than the frame number threshold value F, and randomly selecting continuous F voice frames from the voice frames as voice sequences, or
And copying and splicing the last voice frame of the voice data to be detected for multiple times until the total frame number is equal to the frame number threshold value F.
2. The method for speech emotion recognition based on ternary loss according to claim 1, wherein before the step of obtaining the emotion feature vector corresponding to the speech sequence based on the preset emotion time sequence coding network and performing time sequence coding according to the speech sequence, the method further comprises:
the ternary voice sample group comprises a first voice data sample, a second voice data sample and a third voice data sample, and the emotion classes of the first voice data sample and the second voice data sample are the same and the emotion classes of the first voice data sample and the third voice data sample are different;
said L_1 is shown in the following formula:

L_1 = \sum_{i=1}^{N} \left[ \| f(x_i^a) - f(x_i^p) \|_2^2 - \| f(x_i^a) - f(x_i^n) \|_2^2 + \alpha \right]_+

wherein the subscript "+" indicates that the value in brackets is taken as the loss when it is greater than zero, and the loss is zero when the value in brackets is less than or equal to zero; x_i^a, x_i^p and x_i^n denote the first, second and third voice data samples in the i-th ternary voice sample group; N denotes the number of ternary voice sample groups; f(x) denotes the emotion feature vector corresponding to the voice data sample x, with f(x) \in \mathbb{R}^d, where \mathbb{R} denotes the set of real numbers; \alpha denotes a preset distance parameter;

said L_2 is shown in the following formula:

L_2 = -\sum_i y_i \log \hat{y}_i

wherein y_i denotes a preset i-th real emotion category label, and \hat{y}_i denotes the value of y_i after logistic-regression processing.
3. The method of claim 2, wherein the step of obtaining a plurality of ternary speech sample groups from the speech data samples comprises:
obtaining a ternary voice sample group according to the voice data sample and according to the method shown in the following formula:
\| f(x_i^a) - f(x_i^p) \|_2^2 > \| f(x_i^a) - f(x_i^n) \|_2^2

wherein \| f(x_i^a) - f(x_i^p) \|_2^2 denotes the square of the 2-norm of f(x_i^a) - f(x_i^p), and \| f(x_i^a) - f(x_i^n) \|_2^2 denotes the square of the 2-norm of f(x_i^a) - f(x_i^n).
4. The method for speech emotion recognition based on ternary loss according to any one of claims 1-3, wherein the step of obtaining the emotion feature vector corresponding to the speech sequence based on a preset emotion time sequence coding network and performing time sequence coding according to the speech sequence comprises:
obtaining the emotion feature vector corresponding to the voice sequence according to the method shown in the following formula:
i_t = \sigma(W_{xi} x_t + W_{hi} h_{t-1} + W_{ci} \cdot c_{t-1} + b_i)

f_t = \sigma(W_{xf} x_t + W_{hf} h_{t-1} + W_{cf} \cdot c_{t-1} + b_f)

c_t = f_t \cdot c_{t-1} + i_t \cdot \tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c)

o_t = \sigma(W_{xo} x_t + W_{ho} h_{t-1} + W_{co} \cdot c_t + b_o)

h_t = o_t \cdot \tanh(c_t)

wherein i_t, f_t and o_t denote the input gate, forget gate and output gate of the emotion time sequence coding network, respectively; x_t, h_t and c_t denote the input matrix, hidden layer matrix and cell state of the emotion time sequence coding network at the current time t, respectively; c_{t-1} denotes the cell state of the emotion time sequence coding network at the previous time t-1; h_{t-1} denotes the hidden layer matrix of the emotion time sequence coding network at the previous time t-1; W_{xi} denotes the transition matrix from the input matrix to the input gate; W_{hi} denotes the transition matrix from the hidden layer matrix to the input gate; W_{ci} denotes the transition matrix from the cell state to the input gate; W_{xf} denotes the transition matrix from the input matrix to the forget gate; W_{hf} denotes the transition matrix from the hidden layer matrix to the forget gate; W_{cf} denotes the transition matrix from the cell state to the forget gate; W_{xc} denotes the transition matrix from the input matrix to the cell state; W_{hc} denotes the transition matrix from the hidden layer matrix to the cell state; W_{xo} denotes the transition matrix from the input matrix to the output gate; W_{ho} denotes the transition matrix from the hidden layer matrix to the output gate; W_{co} denotes the transition matrix from the cell state to the output gate; b_i, b_f, b_o and b_c denote the bias terms corresponding to the input gate, forget gate, output gate and cell state, respectively; "\cdot" denotes the Hadamard product, \sigma denotes a preset activation function, and \tanh denotes the hyperbolic tangent function.
5. A speech emotion recognition system based on ternary loss is characterized by comprising:
the variable-length input processing module is configured to perform framing processing on the voice data to be detected to obtain a voice sequence with a specific length;
the emotion time sequence coding module is configured to perform time sequence coding according to the voice sequence based on a preset emotion time sequence coding network to obtain an emotion feature vector corresponding to the voice sequence;
the voice emotion recognition module is configured to predict emotion categories corresponding to the emotion feature vectors based on a preset voice emotion classifier and according to a plurality of preset real emotion categories;
the emotion time sequence coding network is a long short-term memory (LSTM) neural network model constructed based on preset voice data samples by using a machine learning algorithm; the speech emotion classifier is a support vector machine model constructed based on the voice data samples by using a machine learning algorithm;
the voice emotion recognition system comprises a voice emotion loss function module; the voice emotion loss function module comprises a ternary loss function submodule and a cross entropy loss function submodule;
wherein the variable length input processing module is further configured to perform the following operations:
framing the voice data to be detected according to a preset time threshold value to obtain a plurality of voice frames;
comparing the number of the voice frames with a preset frame number threshold value F, and acquiring the voice sequence according to the comparison result and the plurality of voice frames, specifically:
if the number of the voice frames is equal to the frame number threshold value F, taking the plurality of voice frames as voice sequences;
if the number of the voice frames is larger than the frame number threshold value F, randomly selecting continuous F voice frames in the middle part from the plurality of voice frames as voice sequences;
if the number of the voice frames is less than the frame number threshold value F, the plurality of voice frames are taken as a data whole, the data whole is copied and spliced for a plurality of times until the total frame number is more than the frame number threshold value F, and continuous F voice frames are randomly selected from the data whole to be taken as voice sequences, or
Repeatedly copying and splicing each voice frame until the total frame number is greater than the frame number threshold value F, and randomly selecting continuous F voice frames from the voice frames as voice sequences, or
And copying and splicing the last voice frame of the voice data to be detected for multiple times until the total frame number is equal to the frame number threshold value F.
6. The ternary loss based speech emotion recognition system of claim 5, further comprising a ternary speech sample set acquisition module;
the ternary voice sample group acquisition module is configured to acquire a plurality of ternary voice sample groups according to the voice data samples; the ternary voice sample group comprises a first voice data sample, a second voice data sample and a third voice data sample, and the emotion classes of the first voice data sample and the second voice data sample are the same and the emotion classes of the first voice data sample and the third voice data sample are different;
the ternary loss function submodule is configured to perform network training on the emotion time sequence coding network according to a loss function shown in the following formula:
L_1 = \sum_{i=1}^{N} \left[ \| f(x_i^a) - f(x_i^p) \|_2^2 - \| f(x_i^a) - f(x_i^n) \|_2^2 + \alpha \right]_+

wherein the subscript "+" indicates that the value in brackets is taken as the loss when it is greater than zero, and the loss is zero when the value in brackets is less than or equal to zero; x_i^a, x_i^p and x_i^n denote the first, second and third voice data samples in the i-th ternary voice sample group; N denotes the number of ternary voice sample groups; f(x) denotes the emotion feature vector corresponding to the voice data sample x, with f(x) \in \mathbb{R}^d, where \mathbb{R} denotes the set of real numbers; \alpha denotes a preset distance parameter;
the cross entropy loss function submodule is configured to perform network training on the emotion time sequence coding network according to a loss function shown in the following formula:
L_2 = -\sum_i y_i \log \hat{y}_i

wherein y_i denotes a preset i-th real emotion category label, and \hat{y}_i denotes the value of y_i after logistic-regression processing.
7. The system according to claim 6, wherein the ternary voice sample group acquisition module is further configured to obtain the ternary voice sample groups from the voice data samples according to the method shown in the following formula:
\| f(x_i^a) - f(x_i^p) \|_2^2 > \| f(x_i^a) - f(x_i^n) \|_2^2

wherein \| f(x_i^a) - f(x_i^p) \|_2^2 denotes the square of the 2-norm of f(x_i^a) - f(x_i^p), and \| f(x_i^a) - f(x_i^n) \|_2^2 denotes the square of the 2-norm of f(x_i^a) - f(x_i^n).
8. The ternary loss based speech emotion recognition system of any of claims 5-7, wherein the emotion time sequence coding module is further configured to obtain the emotion feature vector according to the following method:
i_t = \sigma(W_{xi} x_t + W_{hi} h_{t-1} + W_{ci} \cdot c_{t-1} + b_i)

f_t = \sigma(W_{xf} x_t + W_{hf} h_{t-1} + W_{cf} \cdot c_{t-1} + b_f)

c_t = f_t \cdot c_{t-1} + i_t \cdot \tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c)

o_t = \sigma(W_{xo} x_t + W_{ho} h_{t-1} + W_{co} \cdot c_t + b_o)

h_t = o_t \cdot \tanh(c_t)

wherein i_t, f_t and o_t denote the input gate, forget gate and output gate of the emotion time sequence coding network, respectively; x_t, h_t and c_t denote the input matrix, hidden layer matrix and cell state of the emotion time sequence coding network at the current time t, respectively; c_{t-1} denotes the cell state of the emotion time sequence coding network at the previous time t-1; h_{t-1} denotes the hidden layer matrix of the emotion time sequence coding network at the previous time t-1; W_{xi} denotes the transition matrix from the input matrix to the input gate; W_{hi} denotes the transition matrix from the hidden layer matrix to the input gate; W_{ci} denotes the transition matrix from the cell state to the input gate; W_{xf} denotes the transition matrix from the input matrix to the forget gate; W_{hf} denotes the transition matrix from the hidden layer matrix to the forget gate; W_{cf} denotes the transition matrix from the cell state to the forget gate; W_{xc} denotes the transition matrix from the input matrix to the cell state; W_{hc} denotes the transition matrix from the hidden layer matrix to the cell state; W_{xo} denotes the transition matrix from the input matrix to the output gate; W_{ho} denotes the transition matrix from the hidden layer matrix to the output gate; W_{co} denotes the transition matrix from the cell state to the output gate; b_i, b_f, b_o and b_c denote the bias terms corresponding to the input gate, forget gate, output gate and cell state, respectively; "\cdot" denotes the Hadamard product, \sigma denotes a preset activation function, and \tanh denotes the hyperbolic tangent function.
CN201810839374.0A 2018-07-27 2018-07-27 Speech emotion recognition method and system based on ternary loss Active CN109003625B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810839374.0A CN109003625B (en) 2018-07-27 2018-07-27 Speech emotion recognition method and system based on ternary loss

Publications (2)

Publication Number Publication Date
CN109003625A CN109003625A (en) 2018-12-14
CN109003625B true CN109003625B (en) 2021-01-12

Family

ID=64597222

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810839374.0A Active CN109003625B (en) 2018-07-27 2018-07-27 Speech emotion recognition method and system based on ternary loss

Country Status (1)

Country Link
CN (1) CN109003625B (en)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11947593B2 (en) * 2018-09-28 2024-04-02 Sony Interactive Entertainment Inc. Sound categorization system
CN109599128B (en) * 2018-12-24 2022-03-01 北京达佳互联信息技术有限公司 Speech emotion recognition method and device, electronic equipment and readable medium
CN110059616A (en) * 2019-04-17 2019-07-26 南京邮电大学 Pedestrian's weight identification model optimization method based on fusion loss function
CN110223714B (en) * 2019-06-03 2021-08-03 杭州哲信信息技术有限公司 Emotion recognition method based on voice
CN110556130A (en) * 2019-09-17 2019-12-10 平安科技(深圳)有限公司 Voice emotion recognition method and device and storage medium
CN111445899B (en) * 2020-03-09 2023-08-01 咪咕文化科技有限公司 Speech emotion recognition method, device and storage medium
CN111768764B (en) * 2020-06-23 2024-01-19 北京猎户星空科技有限公司 Voice data processing method and device, electronic equipment and medium
CN112084338B (en) * 2020-09-18 2024-02-06 达而观数据(成都)有限公司 Automatic document classification method, system, computer equipment and storage medium
CN114757310B (en) * 2022-06-16 2022-11-11 山东海量信息技术研究院 Emotion recognition model and training method, device, equipment and readable storage medium thereof
CN116528438B (en) * 2023-04-28 2023-10-10 广州力铭光电科技有限公司 Intelligent dimming method and device for lamp

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102436806A (en) * 2011-09-29 2012-05-02 复旦大学 Audio frequency copy detection method based on similarity
CN107221320A (en) * 2017-05-19 2017-09-29 百度在线网络技术(北京)有限公司 Train method, device, equipment and the computer-readable storage medium of acoustic feature extraction model

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120137367A1 (en) * 2009-11-06 2012-05-31 Cataphora, Inc. Continuous anomaly detection based on behavior modeling and heterogeneous information analysis
CN105469065B (en) * 2015-12-07 2019-04-23 中国科学院自动化研究所 A kind of discrete emotion identification method based on recurrent neural network
CN106782602B (en) * 2016-12-01 2020-03-17 南京邮电大学 Speech emotion recognition method based on deep neural network

Also Published As

Publication number Publication date
CN109003625A (en) 2018-12-14

Similar Documents

Publication Publication Date Title
CN109003625B (en) Speech emotion recognition method and system based on ternary loss
CN110188343B (en) Multi-mode emotion recognition method based on fusion attention network
Dong et al. Automatic age estimation based on deep learning algorithm
CN109063666A (en) The lightweight face identification method and system of convolution are separated based on depth
CN110556130A (en) Voice emotion recognition method and device and storage medium
CN113095357A (en) Multi-mode emotion recognition method and system based on attention mechanism and GMN
CN111598979B (en) Method, device and equipment for generating facial animation of virtual character and storage medium
CN108804453A (en) A kind of video and audio recognition methods and device
CN110047517A (en) Speech-emotion recognition method, answering method and computer equipment
CN112560495A (en) Microblog rumor detection method based on emotion analysis
Noroozi et al. Supervised vocal-based emotion recognition using multiclass support vector machine, random forests, and adaboost
Bahari Speaker age estimation using Hidden Markov Model weight supervectors
KR20220098991A (en) Method and apparatus for recognizing emtions based on speech signal
CN111460923A (en) Micro-expression recognition method, device, equipment and storage medium
Shivakumar et al. Simplified and supervised i-vector modeling for speaker age regression
CN110120231B (en) Cross-corpus emotion recognition method based on self-adaptive semi-supervised non-negative matrix factorization
Shih et al. Speech emotion recognition with ensemble learning methods
Pham et al. Vietnamese scene text detection and recognition using deep learning: An empirical study
CN115712739A (en) Dance action generation method, computer device and storage medium
CN106971731B (en) Correction method for voiceprint recognition
CN112738724B (en) Method, device, equipment and medium for accurately identifying regional target crowd
Chandrakala et al. Combination of generative models and SVM based classifier for speech emotion recognition
Elbarougy et al. Continuous audiovisual emotion recognition using feature selection and lstm
Tripathi et al. Facial expression recognition using data mining algorithm
Saeed et al. Robust Visual Lips Feature Extraction Method for Improved Visual Speech Recognition System

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant