CN118038901A - Bimodal voice emotion recognition method and system - Google Patents

Bimodal voice emotion recognition method and system

Info

Publication number
CN118038901A
CN118038901A (application CN202410172882.3A)
Authority
CN
China
Prior art keywords
voice
text
emotion
sequence
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410172882.3A
Other languages
Chinese (zh)
Inventor
张�杰
曹晖
申美伦
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Air Force Medical University of PLA
Original Assignee
Air Force Medical University of PLA
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Air Force Medical University of PLA filed Critical Air Force Medical University of PLA
Priority to CN202410172882.3A priority Critical patent/CN118038901A/en
Publication of CN118038901A publication Critical patent/CN118038901A/en
Pending legal-status Critical Current


Classifications

    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Machine Translation (AREA)

Abstract

The invention discloses a bimodal voice emotion recognition method and system, which relate to the technical field of emotion recognition and comprise the following steps: acquiring a voice signal of the voice data to be recognized and extracting the text information in the voice signal; framing the voice signal and inputting it into a voice pre-training model for encoding to obtain advanced features of the voice signal; extracting acoustic features from the voice information and splicing the advanced features with the acoustic features frame by frame to obtain a voice feature sequence; extracting a text feature sequence with a text pre-training model; extracting the key emotion features in the voice feature sequence and the text feature sequence and adding time sequence information to obtain voice deep emotion features and text deep emotion features; and fusing the voice deep emotion features and the text deep emotion features with a modal fusion algorithm to obtain voice emotion features for emotion recognition of the voice data to be recognized. By organically fusing the voice and text information, the accuracy and robustness of emotion recognition are improved.

Description

Bimodal voice emotion recognition method and system
Technical Field
The invention relates to the technical field of emotion recognition, in particular to a bimodal voice emotion recognition method and system.
Background
Emotion recognition is a very important branch of artificial intelligence research and has good application prospects in education, medical care, business marketing analysis, intelligent robots and other fields. In recent years, major companies have launched many intelligent voice assistants, such as Baidu's Xiaodu, Xiaomi's XiaoAI and Microsoft's XiaoIce, allowing users to interact with machines by voice or text. However, to achieve truly intelligent interaction, a voice assistant must be able to understand and analyze the emotional state of the user more accurately and respond reasonably.
At present, speech emotion recognition is mainly realized by extracting different acoustic and spectral features of speech and then analyzing and learning the emotion information contained in these features with a deep learning network. This approach still has some problems: on the one hand, the extracted acoustic and spectral features have limited characterization ability and cannot represent the voice information efficiently; on the other hand, emotion recognition that relies on data of a single modality is often limited in information and cannot fully mine the emotion information of the speaker.
Disclosure of Invention
Aiming at the defect that the prior art, when using single-modality data, is often limited in information and cannot fully mine the emotion information of the speaker, the invention provides a bimodal voice emotion recognition method and system that use both the audio and the text in the voice to analyze and recognize the emotional state of an individual more fully, thereby solving this problem.
A bimodal speech emotion recognition method comprising the steps of:
Acquiring a voice signal of voice data to be recognized, and extracting text information in the voice signal;
Framing the voice signals, inputting each frame of the framed voice signals into a voice pre-training model for coding, and obtaining advanced features of the voice signals;
extracting the MFCC acoustic features from the voice information, and splicing the advanced features of the voice signal with the acoustic features according to frames to obtain a voice feature sequence;
Extracting high-level features of text information by using a text pre-training model, and constructing a text feature sequence;
The method comprises the steps of respectively extracting key emotion features in a voice feature sequence and a text feature sequence by using a self-attention mechanism, respectively adding time sequence information to each key emotion feature through a long-short-term memory neural network, and obtaining voice depth emotion features and text depth emotion features;
adopting a modal fusion algorithm to fuse the voice depth emotion characteristics and the text depth emotion characteristics to obtain voice emotion characteristics;
and carrying out emotion recognition on the voice data to be recognized according to the voice emotion characteristics.
Further, the voice signal is transcribed through a voice transcription API or a local voice transcription model, and text information in the voice signal is extracted.
Further, the framing processing is performed on the voice signal, each frame of the voice signal after framing is input into a voice pre-training model for coding, and advanced features of the voice signal are obtained, which specifically comprises the following steps:
The voice signal is segmented into frames of 20 ms length to obtain a voice sequence A = {a_1, a_2, a_3, …, a_N};
The voice sequence is input into a convolutional neural network CNN for encoding to obtain an intermediate feature sequence M = {m_1, m_2, m_3, …, m_N}.
Further, the method also comprises improving the coding form of the Transformer when the voice sequence is input into the convolutional neural network CNN for coding, adding a relative position code to each element in the intermediate feature sequence M; it is expressed as:
A_{i,j} = m_i^T W_q^T W_{k,E} m_j + m_i^T W_q^T W_{k,R} r_{i-j} + u^T W_{k,E} m_j + v^T W_{k,R} r_{i-j}
where r_{i-j} is the position code of i relative to j, u and v are parameters to be learned, and W_k is decomposed into W_{k,E} and W_{k,R}, acting on the input and the position code respectively;
The intermediate feature sequence M with relative position codes added is input into the improved Transformer Encoder, which predicts the information of each intermediate feature from the context feature information of the sequence and preliminarily fuses the context information to obtain the pre-training feature of the voice signal F_HuBERT = {h_1, h_2, h_3, …, h_N};
F_HuBERT = Transformer(M)
where F_HuBERT ∈ R^{N×768} and N is the number of voice frames.
Further, the extracting the MFCC acoustic features in the voice information specifically includes the following steps:
The speech signal sequence A = {a_1, a_2, a_3, …, a_N} is multiplied by the Hamming window W(i, k) to obtain A' = {a'_1, a'_2, a'_3, …, a'_N}, expressed as:
A'=A*W(i,k)
Performing Fourier transform on the characteristic A' to obtain energy distribution of each frame on a frequency spectrum, and performing modulo squaring on the frequency spectrum of the voice signal to obtain a power spectrum of the voice signal;
Let the DFT of the speech signal be:
X(k) = Σ_{i=0}^{N-1} a'_i e^(-j2πki/N), 0 ≤ k ≤ N-1
where a'_i is the windowed input speech signal and N is the number of DFT points;
the obtained power spectrum of each frame passes through a group of triangular filter banks with Mel scale, and the logarithmic energy output by each filter bank is calculated;
carrying the logarithmic energy into discrete cosine transform to obtain L-order Mel parameters; wherein, the L-order is the MFCC coefficient order, and further obtains the MFCC acoustic feature of each frame signal, which comprises the following steps:
A filter bank with M filters is defined, where the filters used are triangular filters with center frequency f(m) and the value of M is between 22 and 26; the interval between the f(m) narrows as m decreases and widens as m increases;
The frequency response of the triangular filter is defined as:
H_m(k) = 0, for k < f(m-1);
H_m(k) = (k - f(m-1)) / (f(m) - f(m-1)), for f(m-1) ≤ k ≤ f(m);
H_m(k) = (f(m+1) - k) / (f(m+1) - f(m)), for f(m) < k ≤ f(m+1);
H_m(k) = 0, for k > f(m+1);
the logarithmic energy output by each filter bank is calculated as:
s(m) = ln( Σ_{k=0}^{N-1} |X(k)|^2 H_m(k) ), 0 ≤ m < M;
The MFCC coefficients are obtained through the discrete cosine transform (DCT):
C(n) = Σ_{m=0}^{M-1} s(m) cos( πn(m + 0.5)/M ), n = 1, 2, …, L
that is, the logarithmic energy is brought into the discrete cosine transform to obtain the L-order Mel parameters, where L is the MFCC coefficient order, generally taken as 13, and M is the number of triangular filters;
The MFCC features of each frame of the signal are obtained: F_mfcc = {mfcc_1, mfcc_2, mfcc_3, …, mfcc_N}.
Further, the advanced features of the voice signal are spliced with the acoustic features according to frames to obtain a voice feature sequence; the method specifically comprises the following steps:
The advanced voice feature F_HuBERT = {h_1, h_2, h_3, …, h_N} and the MFCC acoustic feature F_mfcc are spliced along the frame dimension to obtain the voice modal feature sequence F_a.
Further, extracting advanced features of the text information by using the text pre-training model, and constructing a text feature sequence; which comprises the following steps:
Assume that the initial sequence of the text is T = {w_1, w_2, w_3, …, w_N}, where w_i represents the i-th word in the text sequence;
The [CLS] and [SEP] tokens are added to the head and tail of the initial text sequence respectively to obtain:
T = [w_CLS, w_1, w_2, …, w_N, w_SEP]
The input text sequence T is processed with the library function BERT Tokenizer to obtain the three lists input_ids, token_type_ids and attention_mask;
The three lists are fed into the text pre-training model to obtain the high-level feature sequence of the text F_t = BERT(input_ids, token_type_ids, attention_mask), where F_t ∈ R^{N×768}, i.e. each word gets a 768-dimensional word vector.
Further, the method uses a self-attention mechanism to extract key emotion features in a voice feature sequence and a text feature sequence respectively, and adds time sequence information to each key emotion feature through a long-short-term memory neural network to obtain a voice depth emotion feature and a text depth emotion feature respectively, and comprises the following steps:
For the voice feature sequence F_a, three learnable matrices W_Q^A, W_K^A, W_V^A are created; for the text feature sequence F_t, three learnable matrices W_Q^T, W_K^T, W_V^T are created. The self-attention inputs are obtained in the following manner:
Query_A = F_a·W_Q^A, Key_A = F_a·W_K^A, Value_A = F_a·W_V^A
Query_T = F_t·W_Q^T, Key_T = F_t·W_K^T, Value_T = F_t·W_V^T
where Query_A/Key_A/Value_A ∈ R^{N×M} and Query_T/Key_T/Value_T ∈ R^{N×M}.
The similarity between each Query and Key is calculated:
s = Query·Key^T / sqrt(d_k)
The similarity scores obtained are normalized by Softmax into a probability distribution whose weights sum to 1:
w = Softmax(s)
The Values are weighted and summed with the obtained weight coefficients, giving the Attention score of the voice feature sequence:
Y_a = Softmax( Query_A·Key_A^T / sqrt(d_k) )·Value_A
and the Attention score of the text feature sequence:
Y_t = Softmax( Query_T·Key_T^T / sqrt(d_k) )·Value_T
The calculated attention features thus give the speech self-attention feature sequence Y_a and the text self-attention feature sequence Y_t;
Front-and-rear time sequence information is added to the voice self-attention feature sequence Y_a and the text self-attention feature sequence Y_t with an LSTM to obtain a new voice feature sequence F_audio and a new text feature sequence F_text:
F_audio = LSTM(Y_a), F_text = LSTM(Y_t).
Further, fusing the voice deep emotion features and the text deep emotion features with a modal fusion algorithm to obtain the voice emotion features comprises the following steps:
Assuming the mask matrix is mask_{T→A}, for the m-th feature in the text sequence, the attention weights of all features in the speech sequence with respect to it are calculated;
The attention weights are sorted by magnitude, the speech sequence nodes M = {m_1, m_2, …, m_K} corresponding to the top K largest values are selected, the positions (m, t) in the mask matrix mask_{T→A} are set to 1 and the remaining positions are set to 0, with m ∈ M;
The similarity score of each feature in the speech feature sequence with respect to the text feature sequence F_text is denoted s_{A→T};
The attention weight w_{A→T} is calculated with a softmax function, while the other, unimportant features are masked with the mask matrix;
w_{A→T} = w_{A→T} * mask_{T→A}
The shared emotion semantic feature vector C_T from the text side is obtained;
C_T = w_{A→T} * F_text
Time sequence information is added to the shared semantic features with the LSTM to obtain the shared emotion semantics F_share;
F_share = LSTM(C_T)
The voice emotion semantic features and the shared emotion semantic features are spliced to obtain the enhanced voice emotion features F_en_audio;
F_en_audio = concat(F_audio, F_share)
The enhanced voice emotion features and the text features are combined into a new feature sequence F_temp, and a multi-head self-attention mechanism is used to learn the association between the voice and text features in different subspaces to obtain the multi-modal emotion features;
F_temp1 = multihead_self_attention(F_temp)
F_multi = concat(F_temp1[0], F_temp1[1]).
further, a bimodal speech emotion recognition system includes:
The acquisition module is used for acquiring a voice signal of voice data to be recognized and extracting text information in the voice signal;
the processing module is used for carrying out framing processing on the voice signals, inputting each frame of voice signals after framing into the voice pre-training model for coding, and obtaining advanced characteristics of the voice signals;
The splicing module is used for extracting the MFCC acoustic characteristics in the voice information, splicing the advanced characteristics of the voice signals with the acoustic characteristics according to frames, and obtaining a voice characteristic sequence;
The text feature sequence construction module is used for extracting advanced features of text information by using the text pre-training model to construct a text feature sequence;
The emotion feature extraction module is used for respectively extracting key emotion features in the voice feature sequence and the text feature sequence by using a self-attention mechanism, and respectively adding time sequence information to each key emotion feature through the long-short-period memory neural network to obtain a voice depth emotion feature and a text depth emotion feature;
The fusion module is used for fusing the voice deep emotion characteristics and the text deep emotion characteristics by adopting a modal fusion algorithm to obtain voice emotion characteristics;
And the recognition module is used for carrying out emotion recognition on the voice data to be recognized according to the voice emotion characteristics.
The invention provides a bimodal voice emotion recognition method and a bimodal voice emotion recognition system, which have the following beneficial effects:
Compared with the traditional method, which generally relies on voice data alone, the invention organically fuses the information of the voice and text modalities, extracts advanced features with pre-training models, and fuses the features more comprehensively and accurately through adaptive feature learning and multi-modal emotion representation. The method not only improves the accuracy and robustness of emotion recognition, but also reduces the dependence on large amounts of labeled data, making it more scalable. By jointly considering voice and text information, it has great potential in the field of emotion recognition and can be applied in many areas such as emotion-aware intelligent assistants, intelligent customer service and emotion analysis, providing users with a better experience and more accurate decision support.
Drawings
FIG. 1 is a schematic diagram of the operation of a triangular filter according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of an LSTM structure according to an embodiment of the invention;
FIG. 3 is a schematic diagram illustrating a mask matrix calculation process according to an embodiment of the present invention;
FIG. 4 is a flowchart of a bimodal speech emotion recognition method in an embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments.
The invention provides a bimodal speech emotion recognition method based on pre-training models. First, the voice signal is transcribed to obtain its text. Next, the initial features of the voice and the text are extracted with pre-training models and other methods, the key feature nodes in both are mined with deep learning techniques, and timing information is added. Finally, a modal fusion algorithm fuses the high-level features of the voice and the text, and high-level emotion features are obtained for emotion classification. Under the feedback of the data feature labels, the network structure of the whole model actively learns the feature expression of various emotion states, giving it excellent speech emotion recognition capability. The method comprises the following steps:
Step 1: Extracting the text information in the voice signal. The input voice signal is transcribed and its text content is extracted, which may be accomplished through a speech transcription API or a local speech transcription model. Speech is a way for people to express content, and it carries the text information of that content. For a piece of speech data A, the text information T in it can be extracted by calling a speech transcription API provided by a large vendor or by using a local speech transcription model.
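A minimal sketch of this transcription step is given below, assuming the Hugging Face transformers ASR pipeline with the openai/whisper-small checkpoint as a stand-in local transcription model; the patent does not prescribe a particular API or model, and the file name is hypothetical.

```python
# Step 1 sketch: obtain the text T contained in a piece of speech data A.
# Assumption: a local transcription model via the `transformers` ASR pipeline.
from transformers import pipeline

def transcribe(wav_path: str) -> str:
    """Return the text content T of the speech file."""
    asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")
    return asr(wav_path)["text"]          # pipeline returns {"text": "..."}

if __name__ == "__main__":
    print(transcribe("speech_A.wav"))     # hypothetical file name
```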
Step 2: voice data features are extracted.
Step 2.1: Extracting advanced features with the voice pre-training model.
For a segment of speech signal A, the speech pre-training model HuBERT is used to obtain high-dimensional features of the speech. First, the voice signal is framed by segmenting it into frames of 20 ms length, giving the voice sequence A = {a_1, a_2, a_3, …, a_N}. The framed voice sequence is then sent into a convolutional neural network (Convolutional Neural Network, CNN) for encoding to obtain the intermediate feature sequence M = {m_1, m_2, m_3, …, m_N}.
m_i = CNN_encoder(a_i), i = 1, 2, 3, …, N
The invention improves the coding form of the Transformer by adding a relative position code to each element in the intermediate feature sequence M:
A_{i,j} = m_i^T W_q^T W_{k,E} m_j + m_i^T W_q^T W_{k,R} r_{i-j} + u^T W_{k,E} m_j + v^T W_{k,R} r_{i-j}
where r_{i-j} is the position code of i relative to j, and u and v are parameters that need to be learned. W_k is decomposed into W_{k,E} and W_{k,R}, representing the parameters for the input and the parameters for the position, respectively.
Then, the intermediate feature sequence M with the relative position codes added is sent into the modified Transformer Encoder, which performs a cloze-style prediction: the information of each intermediate feature is predicted from the context feature information of the sequence, and the context information is preliminarily integrated to obtain the pre-training feature of the voice signal
F_HuBERT = {h_1, h_2, h_3, …, h_N}
F_HuBERT = Transformer(M)
where F_HuBERT ∈ R^{N×768} and N is the number of voice frames.
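As an illustration of step 2.1, the sketch below extracts frame-level 768-dimensional features with the published facebook/hubert-base-ls960 checkpoint from Hugging Face transformers; this is an assumed stand-in for the patent's own CNN plus modified Transformer Encoder, the relative-position-code modification is not re-implemented, and input preprocessing details are omitted.

```python
# Step 2.1 sketch: F_HuBERT of shape (N, 768), one vector per ~20 ms frame.
# Assumption: the public facebook/hubert-base-ls960 checkpoint stands in for
# the patent's pre-training model.
import torch
from transformers import HubertModel

hubert = HubertModel.from_pretrained("facebook/hubert-base-ls960")

def hubert_features(waveform: torch.Tensor) -> torch.Tensor:
    """waveform: 1-D tensor of 16 kHz samples -> F_HuBERT with shape (N, 768)."""
    with torch.no_grad():
        out = hubert(waveform.unsqueeze(0))   # input_values: (batch, samples)
    return out.last_hidden_state.squeeze(0)   # frame-level 768-dim features
```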
Step 2.2: Extracting the acoustic features of the voice.
The key acoustic features in the speech signal, the Mel-frequency cepstral coefficients (MFCC), are extracted.
The framed speech signal sequence A = {a_1, a_2, a_3, …, a_N} from step 2.1 is multiplied by a Hamming window W(i, k) to obtain A' = {a'_1, a'_2, a'_3, …, a'_N}:
A'=A*W(i,k)
A fast Fourier transform is performed on the feature A' to obtain the energy distribution of each frame on the frequency spectrum, and the spectrum of the voice signal is modulus-squared to obtain the power spectrum of the voice signal. Let the DFT of the speech signal be:
X(k) = Σ_{i=0}^{N-1} a'_i e^(-j2πki/N), 0 ≤ k ≤ N-1
where a'_i is the windowed input speech signal and N is the number of DFT points.
Then the obtained power spectrum of each frame is passed through a set of Mel-scale triangular filters. A filter bank with M filters is defined (the number of filters is close to the number of critical bands); the filters used are triangular filters with center frequency f(m), and M is generally 22-26. The interval between the f(m) decreases as the value of m decreases and increases as the value of m increases, as shown in fig. 1:
The frequency response of the triangular filter is defined as:
H_m(k) = 0, for k < f(m-1);
H_m(k) = (k - f(m-1)) / (f(m) - f(m-1)), for f(m-1) ≤ k ≤ f(m);
H_m(k) = (f(m+1) - k) / (f(m+1) - f(m)), for f(m) < k ≤ f(m+1);
H_m(k) = 0, for k > f(m+1).
The logarithmic energy output by each filter bank is calculated as:
s(m) = ln( Σ_{k=0}^{N-1} |X(k)|^2 H_m(k) ), 0 ≤ m < M.
The MFCC coefficients are obtained through the discrete cosine transform (DCT):
C(n) = Σ_{m=0}^{M-1} s(m) cos( πn(m + 0.5)/M ), n = 1, 2, …, L
The logarithmic energy is brought into the discrete cosine transform to obtain the L-order Mel parameters. The L-th order refers to the MFCC coefficient order, which is typically 13, and M is the number of triangular filters.
The 13-dimensional MFCC features of each frame of the signal are obtained: F_mfcc = {mfcc_1, mfcc_2, mfcc_3, …, mfcc_N}.
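A sketch of this MFCC extraction using librosa follows; the 20 ms non-overlapping framing and the 26-filter Mel bank are assumptions chosen to line up with the figures above, and librosa applies the windowing, power spectrum, Mel filter bank, logarithm and DCT internally.

```python
# Step 2.2 sketch: 13-dimensional MFCCs per 20 ms frame via librosa.
import librosa
import numpy as np

def mfcc_features(wav_path: str, sr: int = 16000) -> np.ndarray:
    """Return F_mfcc with shape (N, 13), one row per frame."""
    y, sr = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(
        y=y, sr=sr, n_mfcc=13,
        n_fft=int(0.020 * sr),        # 20 ms analysis window
        hop_length=int(0.020 * sr),   # non-overlapping frames (assumption)
        n_mels=26,                    # 22-26 triangular filters
        window="hamming",
    )
    return mfcc.T
```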
Step 2.3: Fusing the high-dimensional features with the MFCC acoustic features.
The advanced speech features F_HuBERT = {h_1, h_2, h_3, …, h_N} extracted by the HuBERT model are spliced with the MFCC features F_mfcc along the frame dimension to obtain the voice modal feature sequence F_a.
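The frame-wise splicing can be sketched as below; truncating both feature sequences to the shorter length is an assumption for the case where the two front-ends emit slightly different frame counts.

```python
# Step 2.3 sketch: frame-wise splicing of F_HuBERT (N, 768) and F_mfcc (N, 13)
# into the voice modal feature sequence F_a of shape (N, 781).
import numpy as np

def splice_frames(f_hubert: np.ndarray, f_mfcc: np.ndarray) -> np.ndarray:
    n = min(len(f_hubert), len(f_mfcc))   # align frame counts (assumption)
    return np.concatenate([f_hubert[:n], f_mfcc[:n]], axis=1)
```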
Step 3: Extracting the text data features. High-level features of the text are extracted from the text information obtained from the voice signal with a text pre-training model, and a text feature sequence is constructed. The text data in the speech signal is obtained from step 1; assume that the initial sequence of the text is T = {w_1, w_2, w_3, …, w_N}, where w_i represents the i-th word in the text sequence. Advanced features are extracted from the text data by means of the text pre-training model BERT.
First, "[CLS]" and "[SEP]" are added to the beginning and end of the initial text sequence. "[CLS]", i.e. "Classification", can be used as the sentence vector representation for classification, and "[SEP]", i.e. "Segmentation", marks the end of a sentence; the sequence can then be represented as T = [w_CLS, w_1, w_2, …, w_N, w_SEP].
The input text sequence T is then processed with the library function BERT Tokenizer to obtain the three lists input_ids, token_type_ids and attention_mask. Here input_ids is the list of IDs of each word of the text in the BERT vocabulary; token_type_ids is the list that distinguishes whether each word of the input text belongs to the first sentence (0) or the second sentence (1); attention_mask is the list marking the words of the input text on which the self-attention operation is performed. The three lists are then sent into the BERT pre-training model to obtain the advanced feature sequence of the text
F_t = BERT(input_ids, token_type_ids, attention_mask)
where F_t ∈ R^{N×768}, i.e. each word gets a 768-dimensional word vector.
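A sketch of step 3 follows, assuming the bert-base-chinese checkpoint from Hugging Face transformers (the patent does not name a specific checkpoint); the tokenizer adds [CLS]/[SEP] and returns the three lists described above.

```python
# Step 3 sketch: F_t of shape (N, 768), one 768-dim vector per token.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
bert = BertModel.from_pretrained("bert-base-chinese")

def text_features(text: str) -> torch.Tensor:
    enc = tokenizer(text, return_tensors="pt")  # input_ids, token_type_ids, attention_mask
    with torch.no_grad():
        out = bert(**enc)
    return out.last_hidden_state.squeeze(0)
```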
Step 4: Acquiring the intra-modal deep emotion features.
Step 4.1: Acquiring the deep emotion features of the voice.
Key emotion features in the speech feature sequence are first mined by a self-attention mechanism. Although every feature in the speech feature sequence may contain emotion information, most features carry only a small amount and do not determine the emotion decision; only a small portion of the features are rich in emotion information, and these are not necessarily adjacent in the sequence but may lie at arbitrary positions. The self-attention mechanism, on the one hand, makes the distance along the sequence irrelevant; on the other hand, by computing the weight of each feature with respect to the other features of the sequence, it can effectively give higher probability weights to the features containing rich emotion information and reduce the probability weights of the features that play no key role in the emotion decision, thereby highlighting the emotion information carried by the key features.
For the voice feature sequence F_a, three learnable matrices W_Q^A, W_K^A, W_V^A are created, and the self-attention inputs are obtained in the following manner:
Query_A = F_a·W_Q^A, Key_A = F_a·W_K^A, Value_A = F_a·W_V^A
where Query_A/Key_A/Value_A ∈ R^{N×M}.
The similarity between each Query and Key is then calculated according to the following formula:
s_A = Query_A·Key_A^T / sqrt(d_k)
The similarity score obtained in the previous step is normalized, e.g. by Softmax, into a probability distribution whose weights sum to 1.
Finally, the Values are weighted and summed with the weight coefficients obtained in the previous step to give the Attention scores:
Y_a = Softmax( Query_A·Key_A^T / sqrt(d_k) )·Value_A
The calculated attention feature is the voice self-attention feature sequence Y_a, in which the voice feature sequence F_a is fused with the important emotion information of the whole utterance.
Timing information is then added to the voice data through a long short-term memory neural network; the LSTM structure is shown in fig. 2. The output h_t and the cell state C_t of each hidden-layer unit in the LSTM are related to the output h_{t-1} and the cell state C_{t-1} of the previous unit as well as to the current input information x_t, i.e. the calculation of each cell state requires the result of the previous unit. This matches the strict temporal ordering of text, speech and facial image data in video, so timing information can be added to the data features through the LSTM.
Front-and-rear time sequence information is added to the voice data feature sequence Y_a with the LSTM to obtain a new voice feature sequence F_audio:
F_audio = LSTM(Y_a)
Step 4.2: Acquiring the text deep emotion features.
Similar to the speech data processing of step 4.1, the key emotion features of the text feature sequence F_t are first mined with a self-attention mechanism, giving more weight to the key emotion features and less weight to the others. By constructing three learnable parameter matrices W_Q^T, W_K^T, W_V^T, the self-attention inputs of the text sequence are obtained.
The attention score calculation for the entire text feature sequence is abstracted into the following formula:
Y_t = Softmax( Query_T·Key_T^T / sqrt(d_k) )·Value_T
This gives the text self-attention feature sequence Y_t, in which the text feature sequence F_t is fused with the important emotion information of the whole sentence. The text self-attention feature sequence is then modeled in temporal order with the LSTM, and timing information is added to obtain a new text feature sequence F_text:
F_text = LSTM(Y_t).
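Step 4 can be sketched for one modality as below: learnable Q/K/V projections, scaled dot-product self-attention, then an LSTM that adds timing information. The hidden sizes are illustrative assumptions, and the same module is assumed to be instantiated once for speech (781-dim F_a) and once for text (768-dim F_t).

```python
# Step 4 sketch: intra-modal deep emotion features for one modality.
import torch
import torch.nn as nn
import torch.nn.functional as F

class IntraModalEmotion(nn.Module):
    def __init__(self, d_in: int, d_attn: int = 128, d_lstm: int = 128):
        super().__init__()
        self.w_q = nn.Linear(d_in, d_attn, bias=False)   # W_Q
        self.w_k = nn.Linear(d_in, d_attn, bias=False)   # W_K
        self.w_v = nn.Linear(d_in, d_attn, bias=False)   # W_V
        self.lstm = nn.LSTM(d_attn, d_lstm, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, N, d_in)
        q, k, v = self.w_q(x), self.w_k(x), self.w_v(x)
        scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)
        y = F.softmax(scores, dim=-1) @ v    # Y_a or Y_t
        out, _ = self.lstm(y)                # adds timing information
        return out                           # F_audio or F_text

speech_branch = IntraModalEmotion(d_in=781)  # 768 HuBERT dims + 13 MFCC dims
text_branch = IntraModalEmotion(d_in=768)
```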
Step 5: Fusing the voice and text feature information and performing emotion recognition.
Although the difference between the voice and text data is large, the two modalities share the expression intention of the speaker, and the text data is from the voice signal, so that the two modalities have partial common emotion information. The emotion expression of the voice can be enhanced by calculating the shared emotion semantic information of the voice and the text.
In order to enable the model to find the shared emotion semantics in the voice and the text, the algorithm designs a mask acquisition method. Let mask_{T→A} be the mask matrix; for the m-th feature in the text sequence, the attention weights of all features in the speech sequence with respect to it are calculated.
The calculated attention weights are ranked by magnitude, the speech sequence nodes M = {m_1, m_2, …, m_K} corresponding to the top K largest values are selected, the positions (m, t) in the mask matrix mask_{T→A} are set to 1 and the remaining positions are set to 0, with m ∈ M.
The mask calculation process can be abstracted into the process shown in fig. 3, wherein the calculation is divided into three steps, namely, the attention weight of the voice sequence to each text feature sequence is calculated first, and the numerical values on the edges of the graph represent the weight. Then, for each text feature sequence node, only larger weight edges are reserved. Finally, a mask matrix is constructed according to the reserved edges.
After the mask matrix is calculated, the similarity score of each feature in the speech feature sequence with respect to the text feature sequence F_text is denoted s_{A→T}.
In order to let the model focus on the shared emotion semantics, the attention weight w_{A→T} is calculated with the softmax function, while the other, less important features are masked with the mask matrix:
w_{A→T} = w_{A→T} * mask_{T→A}
Finally, the shared emotion semantic feature vector C_T from the text side is obtained:
C_T = w_{A→T} * F_text
Timing information is added to the shared semantic features with the LSTM to obtain the shared emotion semantics F_share:
F_share = LSTM(C_T)
The voice emotion semantic features and the shared emotion semantic features are then spliced to obtain the enhanced voice emotion features F_en_audio:
F_en_audio = concat(F_audio, F_share)
The enhanced voice emotion features and the text features are combined into a new feature sequence F_temp, and a multi-head self-attention mechanism is used to learn the association between the voice and text features in different subspaces to obtain the multi-modal emotion features:
F_temp1 = multihead_self_attention(F_temp)
F_multi = concat(F_temp1[0], F_temp1[1])
Finally, a fully connected layer learns and outputs the emotion recognition result:
result = Linear(F_multi)
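A sketch of step 5 under assumptions is given below: the top-K mask, the shared semantics C_T, the LSTM, the splicing and the multi-head self-attention follow the formulas above, while K, the feature width d (chosen to match the 128-dim outputs of the step-4 sketch), the projection of the text features to the enhanced-audio width, and the mean pooling used to form F_multi from the two halves of F_temp1 are illustrative interpretations, not the patent's exact construction.

```python
# Step 5 sketch: masked cross-modal fusion and emotion classification.
import torch
import torch.nn as nn
import torch.nn.functional as F

class BimodalFusion(nn.Module):
    def __init__(self, d: int = 128, n_heads: int = 4, n_classes: int = 4, top_k: int = 8):
        super().__init__()
        self.top_k = top_k
        self.share_lstm = nn.LSTM(d, d, batch_first=True)
        self.text_proj = nn.Linear(d, 2 * d)              # match F_en_audio width (assumption)
        self.mha = nn.MultiheadAttention(2 * d, n_heads, batch_first=True)
        self.classifier = nn.Linear(4 * d, n_classes)

    def forward(self, f_audio: torch.Tensor, f_text: torch.Tensor) -> torch.Tensor:
        # f_audio: (B, Na, d), f_text: (B, Nt, d)
        sim = f_audio @ f_text.transpose(-2, -1)           # s_{A->T}: (B, Na, Nt)
        topk = sim.topk(self.top_k, dim=1).indices         # top-K speech nodes per text feature
        mask = torch.zeros_like(sim).scatter_(1, topk, 1.0)   # mask_{T->A}
        w = F.softmax(sim, dim=-1) * mask                  # w_{A->T}
        c_t = w @ f_text                                   # C_T
        f_share, _ = self.share_lstm(c_t)                  # F_share
        f_en_audio = torch.cat([f_audio, f_share], dim=-1)    # F_en_audio
        f_temp = torch.cat([f_en_audio, self.text_proj(f_text)], dim=1)  # F_temp
        f_temp1, _ = self.mha(f_temp, f_temp, f_temp)      # multi-head self-attention
        na = f_audio.size(1)                               # split back into the two halves
        f_multi = torch.cat([f_temp1[:, :na].mean(1), f_temp1[:, na:].mean(1)], dim=-1)
        return self.classifier(f_multi)                    # result = Linear(F_multi)
```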
The overall network model structure is shown in fig. 4. The training data are fed into the network to train and optimize the model parameters, yielding an emotion analysis model with good prediction performance. The voice data are then fed into the model to obtain the emotion prediction results.
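A minimal training sketch under assumptions: the intra-modal branches of step 4 and the fusion module of step 5 are optimized end-to-end with cross-entropy on labeled emotion data; the dataset layout, optimizer and hyper-parameters are not specified in the patent and are placeholders here.

```python
# Training sketch: optimize the step-4 branches and the step-5 fusion module.
import torch
import torch.nn as nn

def train(speech_branch, text_branch, fusion, loader, epochs: int = 10, lr: float = 1e-4):
    params = (list(speech_branch.parameters()) + list(text_branch.parameters())
              + list(fusion.parameters()))
    optimizer = torch.optim.Adam(params, lr=lr)
    criterion = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for f_a, f_t, labels in loader:   # pre-extracted F_a, F_t and emotion labels
            logits = fusion(speech_branch(f_a), text_branch(f_t))
            loss = criterion(logits, labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```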
Based on the same inventive concept, the invention provides a bimodal voice emotion recognition system, comprising:
The acquisition module is used for acquiring the voice signal of the voice data to be recognized and extracting text information in the voice signal.
The processing module is used for carrying out framing processing on the voice signals, inputting each frame of voice signals after framing into the voice pre-training model for coding, and obtaining the advanced characteristics of the voice signals.
And the splicing module is used for extracting the MFCC acoustic characteristics in the voice information, and splicing the advanced characteristics of the voice signals with the acoustic characteristics according to frames to obtain a voice characteristic sequence.
And the text feature sequence construction module is used for extracting advanced features of the text information by using the text pre-training model to construct a text feature sequence.
The emotion feature extraction module is used for respectively extracting key emotion features in the voice feature sequence and the text feature sequence by using a self-attention mechanism, and respectively adding time sequence information to each key emotion feature through the long-short-period memory neural network to obtain voice depth emotion features and text depth emotion features.
And the fusion module is used for fusing the voice deep emotion characteristics and the text deep emotion characteristics by adopting a modal fusion algorithm to obtain the voice emotion characteristics.
And the recognition module is used for carrying out emotion recognition on the voice data to be recognized according to the voice emotion characteristics.
The foregoing is only a preferred embodiment of the present invention, but the scope of the present invention is not limited thereto, and any person skilled in the art, who is within the scope of the present invention, should make equivalent substitutions or modifications according to the technical scheme of the present invention and the inventive concept thereof, and should be covered by the scope of the present invention.

Claims (10)

1. A method for bimodal speech emotion recognition, comprising the steps of:
Acquiring a voice signal of voice data to be recognized, and extracting text information in the voice signal;
Framing the voice signals, inputting each frame of the framed voice signals into a voice pre-training model for coding, and obtaining advanced features of the voice signals;
extracting the MFCC acoustic features from the voice information, and splicing the advanced features of the voice signal with the acoustic features according to frames to obtain a voice feature sequence;
Extracting high-level features of text information by using a text pre-training model, and constructing a text feature sequence;
The method comprises the steps of respectively extracting key emotion features in a voice feature sequence and a text feature sequence by using a self-attention mechanism, respectively adding time sequence information to each key emotion feature through a long-short-term memory neural network, and obtaining voice depth emotion features and text depth emotion features;
adopting a modal fusion algorithm to fuse the voice depth emotion characteristics and the text depth emotion characteristics to obtain voice emotion characteristics;
and carrying out emotion recognition on the voice data to be recognized according to the voice emotion characteristics.
2. The method of claim 1, wherein the text information is extracted by transcribing the speech signal through a speech transcription API or a local speech transcription model.
3. The method for recognizing bi-modal speech emotion according to claim 1, wherein the framing process is performed on the speech signal, and each frame of speech signal after framing is input into a speech pre-training model for encoding, so as to obtain advanced features of the speech signal, and specifically comprising the following steps:
The voice signal is segmented into frames of 20 ms length to obtain a voice sequence A = {a_1, a_2, a_3, …, a_N};
The voice sequence is input into a convolutional neural network CNN for encoding to obtain an intermediate feature sequence M = {m_1, m_2, m_3, …, m_N}.
4. The bimodal speech emotion recognition method according to claim 3, further comprising improving the coding form of the Transformer when the voice sequence is input into the convolutional neural network CNN for encoding, adding a relative position code to each element in the intermediate feature sequence M; it is expressed as:
A_{i,j} = m_i^T W_q^T W_{k,E} m_j + m_i^T W_q^T W_{k,R} r_{i-j} + u^T W_{k,E} m_j + v^T W_{k,R} r_{i-j}
where r_{i-j} is the position code of i relative to j, u and v are parameters to be learned, and W_k is decomposed into W_{k,E} and W_{k,R}, acting on the input and the position code respectively;
The intermediate feature sequence M with relative position codes added is input into the improved Transformer Encoder, which predicts the information of each intermediate feature from the context feature information of the sequence and preliminarily fuses the context information to obtain the pre-training feature of the voice signal F_HuBERT = {h_1, h_2, h_3, …, h_N};
F_HuBERT = Transformer(M)
where F_HuBERT ∈ R^{N×768} and N is the number of voice frames.
5. The method for recognizing bimodal speech emotion according to claim 1, wherein said extracting MFCC acoustic features in speech information comprises the steps of:
The speech signal sequence A = {a_1, a_2, a_3, …, a_N} is multiplied by the Hamming window W(i, k) to obtain A' = {a'_1, a'_2, a'_3, …, a'_N}, expressed as:
A′=A*W(i,k)
Performing Fourier transform on the characteristic A' to obtain energy distribution of each frame on a frequency spectrum, and performing modulo squaring on the frequency spectrum of the voice signal to obtain a power spectrum of the voice signal;
Let the DFT of the speech signal be:
X(k) = Σ_{i=0}^{N-1} a'_i e^(-j2πki/N), 0 ≤ k ≤ N-1
where a'_i is the windowed input speech signal and N is the number of DFT points;
the obtained power spectrum of each frame passes through a group of triangular filter banks with Mel scale, and the logarithmic energy output by each filter bank is calculated;
carrying the logarithmic energy into discrete cosine transform to obtain L-order Mel parameters; wherein, the L-order is the MFCC coefficient order, and further obtains the MFCC acoustic feature of each frame signal, which comprises the following steps:
A filter bank with M filters is defined, where the filters used are triangular filters with center frequency f(m) and the value of M is between 22 and 26; the interval between the f(m) narrows as m decreases and widens as m increases;
The frequency response of the triangular filter is defined as:
H_m(k) = 0, for k < f(m-1);
H_m(k) = (k - f(m-1)) / (f(m) - f(m-1)), for f(m-1) ≤ k ≤ f(m);
H_m(k) = (f(m+1) - k) / (f(m+1) - f(m)), for f(m) < k ≤ f(m+1);
H_m(k) = 0, for k > f(m+1);
the logarithmic energy output by each filter bank is calculated as:
s(m) = ln( Σ_{k=0}^{N-1} |X(k)|^2 H_m(k) ), 0 ≤ m < M;
The MFCC coefficients are obtained through the discrete cosine transform (DCT):
C(n) = Σ_{m=0}^{M-1} s(m) cos( πn(m + 0.5)/M ), n = 1, 2, …, L
that is, the logarithmic energy is brought into the discrete cosine transform to obtain the L-order Mel parameters, where L is the MFCC coefficient order, usually 13, and M is the number of triangular filters;
The MFCC features of each frame of the signal are obtained: F_mfcc = {mfcc_1, mfcc_2, mfcc_3, …, mfcc_N}.
6. The method for recognizing bi-modal speech emotion according to claim 1, wherein said concatenating advanced features of speech signal with acoustic features in frames to obtain a sequence of speech features; the method specifically comprises the following steps:
The advanced voice feature F_HuBERT = {h_1, h_2, h_3, …, h_N} and the MFCC acoustic feature F_mfcc are spliced along the frame dimension to obtain the voice modal feature sequence F_a.
7. The method for recognizing bi-modal speech emotion according to claim 6, wherein said text pre-training model is used to extract advanced features of text information and construct a text feature sequence; which comprises the following steps:
Assume that the initial sequence of the text is T = {w_1, w_2, w_3, …, w_N}, where w_i represents the i-th word in the text sequence;
The [CLS] and [SEP] tokens are added to the head and tail of the initial text sequence respectively to obtain:
T = [w_CLS, w_1, w_2, …, w_N, w_SEP]
The input text sequence T is processed with the library function BERT Tokenizer to obtain the three lists input_ids, token_type_ids and attention_mask;
The three lists are fed into the text pre-training model to obtain the high-level feature sequence of the text F_t = BERT(input_ids, token_type_ids, attention_mask), where F_t ∈ R^{N×768}, i.e. each word gets a 768-dimensional word vector.
8. The method for recognizing bi-modal speech emotion according to claim 7, wherein the steps of extracting key emotion features in a speech feature sequence and a text feature sequence respectively using a self-attention mechanism, and adding timing information to each key emotion feature through a long-short-term memory neural network respectively to obtain a speech deep emotion feature and a text deep emotion feature, include the steps of:
For the voice feature sequence F_a, three learnable matrices W_Q^A, W_K^A, W_V^A are created; for the text feature sequence F_t, three learnable matrices W_Q^T, W_K^T, W_V^T are created. The self-attention inputs are obtained by:
Query_A = F_a·W_Q^A, Key_A = F_a·W_K^A, Value_A = F_a·W_V^A
Query_T = F_t·W_Q^T, Key_T = F_t·W_K^T, Value_T = F_t·W_V^T
where Query_A/Key_A/Value_A ∈ R^{N×M} and Query_T/Key_T/Value_T ∈ R^{N×M};
The similarity between each Query and Key is calculated:
s = Query·Key^T / sqrt(d_k)
The similarity scores obtained are normalized by Softmax into a probability distribution whose weights sum to 1:
w = Softmax(s)
The Values are weighted and summed with the obtained weight coefficients, giving the Attention score of the voice feature sequence:
Y_a = Softmax( Query_A·Key_A^T / sqrt(d_k) )·Value_A
and the Attention score of the text feature sequence:
Y_t = Softmax( Query_T·Key_T^T / sqrt(d_k) )·Value_T
The calculated attention features thus give the speech self-attention feature sequence Y_a and the text self-attention feature sequence Y_t;
Front-and-rear time sequence information is added to the voice self-attention feature sequence Y_a and the text self-attention feature sequence Y_t with an LSTM to obtain a new voice feature sequence F_audio and a new text feature sequence F_text:
F_audio = LSTM(Y_a), F_text = LSTM(Y_t).
9. The method for recognizing dual-mode voice emotion according to claim 1, wherein the method for fusing the deep emotion feature of voice and the deep emotion feature of text by using a modal fusion algorithm to obtain the voice emotion feature comprises the following steps:
Assuming the mask matrix is mask_{T→A}, for the m-th feature in the text sequence, the attention weights of all features in the speech sequence with respect to it are calculated;
The attention weights are sorted by magnitude, the speech sequence nodes M = {m_1, m_2, …, m_K} corresponding to the top K largest values are selected, the positions (m, t) in the mask matrix mask_{T→A} are set to 1 and the remaining positions are set to 0, with m ∈ M;
The similarity score of each feature in the speech feature sequence with respect to the text feature sequence F_text is denoted s_{A→T};
The attention weight w_{A→T} is calculated with a softmax function, while the other, unimportant features are masked with the mask matrix;
w_{A→T} = w_{A→T} * mask_{T→A}
The shared emotion semantic feature vector C_T from the text side is obtained;
C_T = w_{A→T} * F_text
Time sequence information is added to the shared semantic features with the LSTM to obtain the shared emotion semantics F_share;
F_share = LSTM(C_T)
The voice emotion semantic features and the shared emotion semantic features are spliced to obtain the enhanced voice emotion features F_en_audio;
F_en_audio = concat(F_audio, F_share)
The enhanced voice emotion features and the text features are combined into a new feature sequence F_temp, and a multi-head self-attention mechanism is used to learn the association between the voice and text features in different subspaces to obtain the multi-modal emotion features;
F_temp1 = multihead_self_attention(F_temp)
F_multi = concat(F_temp1[0], F_temp1[1]).
10. a bimodal speech emotion recognition system, comprising:
The acquisition module is used for acquiring a voice signal of voice data to be recognized and extracting text information in the voice signal;
the processing module is used for carrying out framing processing on the voice signals, inputting each frame of voice signals after framing into the voice pre-training model for coding, and obtaining advanced characteristics of the voice signals;
The splicing module is used for extracting the MFCC acoustic characteristics in the voice information, splicing the advanced characteristics of the voice signals with the acoustic characteristics according to frames, and obtaining a voice characteristic sequence;
The text feature sequence construction module is used for extracting advanced features of text information by using the text pre-training model to construct a text feature sequence;
The emotion feature extraction module is used for respectively extracting key emotion features in the voice feature sequence and the text feature sequence by using a self-attention mechanism, and respectively adding time sequence information to each key emotion feature through the long-short-period memory neural network to obtain a voice depth emotion feature and a text depth emotion feature;
The fusion module is used for fusing the voice deep emotion characteristics and the text deep emotion characteristics by adopting a modal fusion algorithm to obtain voice emotion characteristics;
And the recognition module is used for carrying out emotion recognition on the voice data to be recognized according to the voice emotion characteristics.
CN202410172882.3A 2024-02-07 2024-02-07 Bimodal voice emotion recognition method and system Pending CN118038901A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410172882.3A CN118038901A (en) 2024-02-07 2024-02-07 Bimodal voice emotion recognition method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410172882.3A CN118038901A (en) 2024-02-07 2024-02-07 Bimodal voice emotion recognition method and system

Publications (1)

Publication Number Publication Date
CN118038901A true CN118038901A (en) 2024-05-14

Family

ID=90985338

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410172882.3A Pending CN118038901A (en) 2024-02-07 2024-02-07 Bimodal voice emotion recognition method and system

Country Status (1)

Country Link
CN (1) CN118038901A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination