WO2024008215A2

WO2024008215A2 - Speech emotion recognition method and apparatus

Info

Publication number: WO2024008215A2
Application number: PCT/CN2023/117475
Authority: WO
Inventors: 刘汝洲
Original assignee: 顺丰科技有限公司
Priority date: 2022-07-08
Filing date: 2023-09-07
Publication date: 2024-01-11
Also published as: CN117409818A; WO2024008215A3

Abstract

The present application provides a speech emotion recognition method and apparatus. The method comprises: obtaining a first audio feature encoding of a current audio frame and text feature information of a historical audio frame, wherein the historical audio frame precedes the current audio frame; predicting a text feature encoding of the current audio frame on the basis of the text feature information of the historical audio frame; performing fusion on the first audio feature encoding and the text feature encoding of the current audio frame to obtain a fused feature vector; and performing speech emotion recognition on the basis of the fused feature vector to obtain a speech emotion recognition result of the current audio frame. The present application utilizes the text feature information of the historical audio frame to predict the text feature encoding of the current audio frame, and after performing fusion on the first audio feature encoding and the text feature encoding of the current audio frame, same performs speech emotion recognition; deep fusion is carried out on audio information and text information, and the accuracy of speech emotion recognition is able to be improved.

Description

Voice emotion recognition method and device

Technical field

This application mainly relates to the field of artificial intelligence technology, and specifically relates to a speech emotion recognition method and device.

Background technique

In smart phone customer service scenarios, emotional analysis of calls can be performed to provide business decision support. There are two mainstream solutions. One is to model the customer service voice acoustics to capture the speaking speed, intonation, auxiliary sounds and spectrum domain. Change, by defining emotion category labels, inputting features into statistical models or deep models, and classifying emotion labels. The second is to conduct information mining on the text after ASR transcribing to determine the speaker's emotion and provide a reference for customer service quality inspection. The above technical route: all are based on the model framework of "mutually independent generation" of acoustic features and text features. This approach has the following shortcomings: it does not take advantage of the coupling effect of acoustic features and text features in the feature space, resulting in accurate speech emotion recognition. degree is lower.

That is to say, the accuracy of voice emotion recognition in the existing technology is low.

Contents of the invention

This application provides a voice emotion recognition method and device, aiming to solve the problem of low accuracy of voice emotion recognition in the prior art.

In a first aspect, this application provides a voice emotion recognition method. The voice emotion recognition method includes:

Obtain the first audio feature code of the current audio frame and the text feature information of the historical audio frame, where the historical audio frame precedes the current audio frame;

Predict the text feature encoding of the current audio frame based on the text feature information of the historical audio frame;

Fusion of the first audio feature code of the current audio frame and the text feature code to obtain a fusion feature vector;

Perform speech emotion recognition based on the fused feature vector to obtain the speech emotion recognition result of the current audio frame.

Optionally, performing speech emotion recognition based on the fused feature vector to obtain the speech emotion recognition result of the current audio frame includes:

Perform feature extraction on the current audio frame based on the Teager energy operator to obtain the second audio feature;

Encode the second audio feature to obtain the second audio feature code of the current audio frame;

Speech emotion recognition is performed based on the second audio feature encoding and the fusion feature vector to obtain a speech emotion recognition result of the current audio frame.

Optionally, performing speech emotion recognition based on the second audio feature encoding and the fused feature vector to obtain the speech emotion recognition result of the current audio frame includes:

Adjust the weight of the fusion feature vector based on the preset attention layer to obtain the first target feature vector;

Fusion of the first target feature vector and the second audio feature encoding to obtain a second target feature vector;

The second target feature vector is input into the target emotion recognition model to obtain the speech emotion recognition result of the current audio frame.

Optionally, the weight adjustment of the fused feature vector based on the preset attention layer to obtain the first target feature vector includes:

Input the fused feature vector into the Softmax layer to obtain the predicted text probability distribution;

The predicted text probability distribution and the fusion feature vector are input into the preset attention layer to adjust the weight of the fusion feature vector to obtain the first target feature vector.

Optionally, the obtaining the first audio feature encoding of the current audio frame and the text feature information of the historical audio frame includes:

Perform Fbank feature extraction on the current audio frame to obtain the first audio feature of the current audio frame;

The first audio feature of the current audio frame is encoded to obtain the first audio feature encoding of the current audio frame.

Determine whether there is an audio frame before the historical audio frame;

If there is no audio frame before the historical audio frame, obtain the first audio feature code and the preset text feature code of the historical audio frame;

The text feature information of the historical audio frame is determined based on the first audio feature code of the historical audio frame and the preset text feature code.

Optionally, determining the text feature information of the historical audio frame based on the first audio feature coding of the historical audio frame and the preset text feature coding includes:

Fusion of the first audio feature coding and the preset text feature coding of the historical audio frame to obtain the historical fusion feature direction quantity;

Input the historical fusion feature vector into the Softmax layer to obtain the historical predicted text probability distribution of the historical audio frame;

Text feature information of the historical audio frame is determined based on the historical predicted text probability distribution.

In a second aspect, this application provides a voice emotion recognition device. The voice emotion recognition device includes:

An acquisition unit, configured to acquire the first audio feature encoding of the current audio frame and the text feature information of the historical audio frame, where the historical audio frame precedes the current audio frame;

A prediction unit, configured to predict the text feature encoding of the current audio frame based on the text feature information of the historical audio frame;

a fusion unit, configured to fuse the first audio feature coding of the current audio frame and the text feature coding to obtain a fusion feature vector;

A recognition unit, configured to perform speech emotion recognition based on the fusion feature vector, and obtain a speech emotion recognition result of the current audio frame.

Optionally, the identification unit is used for:

Optionally, the acquisition unit is used for:

Determine whether there is an audio frame before the historical audio frame;

Optionally, the acquisition unit is used for:

Fusion of the first audio feature code and the preset text feature code of the historical audio frame to obtain a historical fusion feature vector;

In a third aspect, this application provides a computer device, which includes:

one or more processors;

memory; and

One or more application programs, wherein the one or more application programs are stored in the memory and configured to be executed by the processor to implement the voice emotion recognition method according to any one of the first aspects.

In a fourth aspect, the present application provides a computer-readable storage medium that stores a plurality of instructions, and the instructions are suitable for loading by a processor to execute any one of the steps described in the first aspect. Steps in speech emotion recognition methods.

The present application provides a speech emotion recognition method and device. The speech emotion recognition method includes: obtaining the first audio feature code of the current audio frame and the text feature information of the historical audio frame, wherein the historical audio frame is before the current audio frame; based on The text feature information of the historical audio frame predicts the text feature coding of the current audio frame; the first audio feature coding and the text feature coding of the current audio frame are fused to obtain a fused feature vector; based on the fused feature vector, speech emotion recognition is performed to obtain the current audio frame's Speech emotion recognition results. This application first uses the text feature information of historical audio frames to predict Measure the text feature code of the current audio frame, and then fuse the text feature code of the current audio frame with the first audio feature code to perform speech emotion recognition. Deeply integrating audio information and text information can improve the accuracy of speech emotion recognition. .

Description of the drawings

In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed to be used in the description of the embodiments will be briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present application. For those skilled in the art, other drawings can also be obtained based on these drawings without exerting creative efforts.

Figure 1 is a schematic scene diagram of the speech emotion recognition system provided by the embodiment of the present application;

Figure 2 is a schematic flow diagram of an embodiment of the speech emotion recognition method provided in the embodiment of the present application;

Figure 3 is a schematic module diagram of an embodiment of the speech emotion recognition method provided in the embodiment of the present application;

Figure 4 is a schematic flow chart of performing speech emotion recognition based on the second audio feature encoding and fusion feature vector to obtain the speech emotion recognition result of the current audio frame in one embodiment of the speech emotion recognition method provided in the embodiment of the present application;

Figure 5 is a schematic structural diagram of an embodiment of the speech emotion recognition device provided in the embodiment of the present application;

FIG. 6 is a schematic structural diagram of an embodiment of a computer device provided in an embodiment of the present application.

Detailed ways

The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present application. Obviously, the described embodiments are only some of the embodiments of the present application, rather than all of the embodiments. Based on the embodiments in this application, all other embodiments obtained by those skilled in the art without making creative efforts fall within the scope of protection of this application.

In the description of this application, it needs to be understood that the terms "center", "longitudinal", "transverse", "length", "width", "thickness", "upper", "lower", "front", " The directions or positional relationships indicated by "back", "left", "right", "vertical", "horizontal", "top", "bottom", "inside", "outside", etc. are based on the directions shown in the accompanying drawings or positional relationship is only for the convenience of describing the present application and simplifying the description, and does not indicate or imply that the device or element referred to must have a specific orientation, be constructed and operated in a specific orientation, and therefore cannot be understood as a limitation of the present application. In addition, the terms “first” and “second” are used for descriptive purposes only and cannot be understood as indicating or implying relative importance or implicitly indicating the quantity of indicated technical features. Therefore, the features defined as “first” and “second” may explicitly or implicitly include one or more features. In the description of this application, the meaning of "plurality" is two or two Above, unless otherwise expressly and specifically limited.

In this application, the word "exemplary" is used to mean "serving as an example, illustration, or illustration." Any embodiment described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments. The following description is presented to enable any person skilled in the art to make and use the present application. In the following description, details are set forth for the purpose of explanation. It will be understood that one of ordinary skill in the art will recognize that the present application may be practiced without these specific details. In other instances, well-known structures and processes have not been described in detail to avoid obscuring the description of the application with unnecessary detail. Thus, this application is not intended to be limited to the embodiments shown but is to be accorded the widest scope consistent with the principles and features disclosed herein.

The embodiment of the present application provides a speech emotion recognition method and device, which will be described in detail below.

Please refer to Figure 1. Figure 1 is a schematic diagram of a scene of a voice emotion recognition system provided by an embodiment of the present application. The voice emotion recognition system may include a computer device 100, and a voice emotion recognition device is integrated in the computer device 100.

In the embodiment of the present application, the computer device 100 may be an independent server, or a server network or server cluster composed of servers. For example, the computer device 100 described in the embodiment of the present application includes, but is not limited to, a computer, a network A host, a single network server, a set of multiple network servers, or a cloud server composed of multiple servers. Among them, the cloud server consists of a large number of computers or network servers based on cloud computing (Cloud Computing).

In this embodiment of the present application, the above-mentioned computer device 100 may be a general-purpose computer device or a special-purpose computer device. In specific implementation, the computer device 100 may be a desktop computer, a portable computer, a network server, a personal digital assistant (PDA), a mobile phone, a tablet computer, a wireless terminal device, a communication device, an embedded device, etc. This embodiment does not The type of computer device 100 is defined.

Those skilled in the art can understand that the application environment shown in Figure 1 is only one application scenario of the solution of the present application and does not constitute a limitation on the application scenarios of the solution of the present application. Other application environments may also include those shown in Figure 1 More or less computer devices are shown. For example, only one computer device is shown in Figure 1. It can be understood that the speech emotion recognition system can also include one or more other computer devices that can process data, and the details are not discussed here. limited.

In addition, as shown in Figure 1, the voice emotion recognition system may also include a memory 200 for storing data.

It should be noted that the scene diagram of the speech emotion recognition system shown in Figure 1 is only an example. The speech emotion recognition system and the scene described in the embodiment of the present application are for the purpose of more clearly illustrating the technical solution of the embodiment of the present application, and are not This constitutes a limitation on the technical solutions provided by the embodiments of the present application. Persons of ordinary skill in the art will know that with the evolution of speech emotion recognition systems and the emergence of new business scenarios, the technical solutions provided by the embodiments of the present application are not suitable for similar technologies. question, the same applies.

First, an embodiment of the present application provides a speech emotion recognition method. The speech emotion recognition method includes: obtaining the first audio feature code of the current audio frame and the text feature information of the historical audio frame, where the historical audio frame is in the current audio frame. Before; predict the text feature coding of the current audio frame based on the text feature information of the historical audio frame; fuse the first audio feature coding and text feature coding of the current audio frame to obtain a fused feature vector; perform speech emotion recognition based on the fused feature vector to obtain the current Speech emotion recognition results for audio frames.

As shown in Figures 2 and 3, Figure 2 is a schematic flow diagram of an embodiment of the speech emotion recognition method provided in the embodiment of the present application, and Figure 3 is a module schematic diagram of an embodiment of the speech emotion recognition method provided in the embodiment of the present application. , the speech emotion recognition method includes the following steps S201 to S204:

S201. Obtain the first audio feature code of the current audio frame and the text feature information of the historical audio frame.

Among them, the historical audio frame is before the current audio frame. Specifically, the historical audio frame is the audio frame preceding the current audio frame. Among them, the current audio frame and the historical audio frame have the same length, for example, both are 10ms-30ms, which can be set according to specific settings.

In a specific embodiment, the audio to be recognized is obtained; the audio to be recognized is divided into frames to obtain multiple audio frames, and the current audio frame is obtained from the multiple audio frames. We need to cut the audio of variable length into small segments of fixed length. This step is called framing. Generally, 10-30ms is taken as one frame. In order to avoid the omission of signals at the window boundary, when offsetting the frame, there must be frame overlap (a portion of overlap is required between frames). Generally, half of the frame length is used as the frame shift, that is, each time the frame is shifted by one-half of a frame and then the next frame is taken, this can avoid the characteristics from frame to frame changing too much. The usual choice is 25ms per frame and 10ms for frame iteration. Framing is necessary because the speech signal changes rapidly, and the Fourier transform is suitable for analyzing stationary signals. In speech recognition, the frame length is generally set to 10 to 30ms, so that there are enough cycles in one frame without changing too drastically. Each frame signal is usually multiplied by a smooth window function to allow both ends of the frame to smoothly attenuate to zero. This can reduce the intensity of the side lobes after Fourier transform and obtain a higher quality spectrum. The time difference between frames is often taken as 10ms, so that there will be overlap between frames. Otherwise, because the signal at the connection between frames will be weakened due to windowing, this part of the information will be lost. The Fourier transform is performed frame by frame in order to obtain the spectrum of each frame. Generally, only the amplitude spectrum is retained and the phase spectrum is discarded.

In this embodiment of the present application, obtaining the first audio feature code of the current audio frame and the text feature information of the historical audio frame may include:

(1) Perform Fbank feature extraction on the current audio frame to obtain the first audio feature of the current audio frame.

Fbank is FilterBank. The response of the human ear to the sound spectrum is nonlinear. Fbank is a front-end processing algorithm that processes audio in a manner similar to the human ear, which can improve the performance of speech recognition. The general steps to obtain the fbank characteristics of the speech signal are: pre-emphasis, framing, windowing, short-time Fourier transform (STFT), mel filtering, demeaning, etc.

In another specific embodiment, Fbank feature extraction and fundamental frequency feature extraction are performed on the current audio frame respectively to obtain Fbank features and Pitch features, and the Fbank features and Pitch features are fused to obtain the first audio feature x1 _ti of the current audio frame. The pitch period (Pitch) is the reciprocal of the vibration frequency of the vocal cords. It refers to the period in which airflow passes through the vocal tract to cause the vocal cords to vibrate when a person makes a voiced sound. The period in which the vocal cords vibrate is the pitch period. The estimation of the pitch period is called pitch detection (PitchDetection). Fundamental frequency contains a large number of features that characterize speech emotion and is crucial in speech emotion recognition. Commonly used fundamental frequency feature extraction methods include: autocorrelation function method (ACF), time domain average amplitude difference method (AMFD) and wavelet method-frequency domain.

(2) Encode the first audio feature x1 _ti of the current audio frame to obtain the first audio feature encoding of the current audio frame

In a specific embodiment, the first audio feature x1 _ti of the current audio frame is input to the first acoustic model Encoder-1 for encoding to obtain the first audio feature encoding of the current audio frame. Among them, the first acoustic model Encoder-1 can be a hidden Markov model (HMM), a deep neural network (DNN), a convolutional neural network (CNN), a recurrent neural network (RNN), etc. Preferably, the first acoustic model Encoder-1 can be a BiLSTM model. The output layer of the first acoustic model Encoder-1 adopts a ctc network construction method. In the training phase, the ctc network is used to align the speech features and labels of each frame. The input of the first acoustic model Encoder-1 is the first audio feature x1 _ti obtained by fusing the Fbank feature and the Pitch feature. That is frank+pitch.

In a specific embodiment, the text feature information of historical audio frames may be manually annotated.

In another specific embodiment, obtaining the first audio feature code of the current audio frame and the text feature information of the historical audio frame may include:

(1) Determine whether there is an audio frame before the historical audio frame.

(2) If there is no audio frame before the historical audio frame, obtain the first audio feature code and the preset text feature code of the historical audio frame.

If there is no audio frame before the historical audio frame, it means that the historical audio frame is the first frame and there is no text feature information available, and the preset text feature encoding is obtained. The default text feature encoding is a manually preset default text feature encoding.

(3) Determine the text of the historical audio frame based on the first audio feature coding and the preset text feature coding of the historical audio frame Feature information.

In a specific embodiment, determining the text feature information of the historical audio frame based on the first audio feature coding and the preset text feature coding of the historical audio frame includes: fusing the first audio feature coding and the preset text feature of the historical audio frame Encoding to obtain the historical fusion feature vector; input the historical fusion feature vector into the Softmax layer to obtain the historical prediction text probability distribution of the historical audio frame; determine the text feature information of the historical audio frame based on the historical prediction text probability distribution. Specifically, the text with the highest probability of the historical predicted text probability distribution is determined as the text feature information y _ui-1 of the historical audio frame.

In another specific embodiment, speech-to-text software is used to encode the first audio feature of the historical audio frame and convert it into text to obtain the text feature information y _ui-1 of the historical audio frame.

S202. Predict the text feature coding of the current audio frame based on the text feature information of the historical audio frame.

In a specific embodiment, the text feature information y _ui-1 of the historical audio frame is input into the preset language model to obtain the text feature code p _ui of the current audio frame. Among them, the preset language model can be BERT model, LSTM model, xlnet, GPT, etc. The full name of LSTM is Long Short-Term Memory, which is a type of RNN (Recurrent Neural Network). Due to its design characteristics, LSTM is very suitable for modeling time series data, such as text data. BiLSTM is the abbreviation of Bi-directional Long Short-Term Memory, which is a combination of forward LSTM and backward LSTM. Both are often used to model contextual information in natural language processing tasks.

S203. Fusion of the first audio feature coding and text feature coding of the current audio frame to obtain a fusion feature vector.

In a specific embodiment, the first audio feature of the current audio frame is encoded and text feature coding p _ui input the preset shared network model Joint net, and obtain the fusion feature vector h _ti . Among them, the default shared network model Joint net is a tranfomermer network. The default shared network model Joint net can also be a BiLSTM model.

S204. Perform speech emotion recognition based on the fusion feature vector to obtain the speech emotion recognition result of the current audio frame.

In a specific embodiment, the fusion feature vector h _ti is input into the target emotion recognition model Mood classfier for classification, and the classification result is obtained, and the classification result is determined as the speech emotion recognition result of the current audio frame. Among them, the target emotion classification model is obtained by training a preset classification neural network model through an emotion classification training set. The emotion classification training set includes multiple emotion classification training samples. The emotion classification training samples include emotion sample characteristics and corresponding sample labels. The default classification neural network model can be DNN. Sample labels can include multiple categories such as happiness, sadness, anger, disgust, fear, surprise, etc.

In order to improve the accuracy of emotion classification, in another specific embodiment, speech emotion recognition is performed based on the second audio feature encoding and the fusion feature vector to obtain the speech emotion recognition result of the current audio frame, which may include:

(1) Extract features from the current audio frame based on the Teager energy operator to obtain the second audio feature.

The Teager energy operator is a nonlinear operator that can track the instantaneous energy of a signal. The Teager energy operator satisfies formula (1),
ψ[x(n)]＝x ² (n)-x(n+1)x(n-1) (1)

Among them, x(n) is the signal of the current audio frame, and Ψ is the Teager energy operator.

Feature extraction is performed on the current audio frame based on the Teager energy operator to obtain the second audio feature x2 _ti . The second audio feature x2 _ti is higher-order and has richer features than the first audio feature x1 _ti . The introduction of high-order features can improve the ability to represent speech emotion feature vectors and improve the accuracy of emotion classification.

(2) Encode the second audio feature to obtain the second audio feature code of the current audio frame.

Specifically, the second audio feature x2 _ti is input to the second acoustic model Encoder-2 for encoding to obtain the second audio feature encoding of the current audio frame. Among them, the second acoustic model Encoder-2 can be a hidden Markov model (HMM), a deep neural network (DNN), a convolutional neural network (CNN), a recurrent neural network (RNN), etc. Preferably, the second acoustic model Encoder-2 can be a BiLSTM model.

(3) Perform speech emotion recognition based on the second audio feature encoding and fusion feature vector to obtain the speech emotion recognition result of the current audio frame.

Referring to Figure 4, Figure 4 is a schematic flow chart of performing voice emotion recognition based on the second audio feature encoding and fusion feature vector to obtain the voice emotion recognition result of the current audio frame in one embodiment of the voice emotion recognition method provided in the embodiment of the present application. In a specific embodiment, performing speech emotion recognition based on the second audio feature encoding and fusion feature vector to obtain the speech emotion recognition result of the current audio frame may include S301-S303:

S301. Adjust the weight of the fusion feature vector based on the preset attention layer to obtain the first target feature vector.

The default attention layer is a model based on the Attention mechanism. The general definition of Attention is as follows: Given a set of vector sets Value and a vector Query, the Attention mechanism is a mechanism that calculates the weighted sum of Value based on the Query. The specific calculation process of the Attention mechanism can be summarized into two processes: the first process is to calculate the weight coefficient based on Query and Key, and the second process is to perform a weighted sum of Value based on the weight coefficient. The first process can be subdivided into two stages: the first stage calculates the similarity or correlation between the two based on Query and Key; the second stage normalizes the original scores of the first stage.

The preset attention layer may be a self-attention layer.

In a specific embodiment, adjusting the weight of the fused feature vector based on the preset attention layer to obtain the first target feature vector may include: obtaining three vectors: vector Query, vector Key, and vector Value based on the fused feature vector; Input the vector Query, vector Key, and vector Value into the preset attention layer Attention, and adjust the weight of the fused feature vector to obtain the first target feature vector.

In another specific embodiment, adjusting the weight of the fusion feature vector based on the preset attention layer to obtain the first target feature vector may include:

(1) Input the fused feature vector into the Softmax layer to obtain the predicted text probability distribution.

Specifically, the fusion feature vector h _ti is input into the Softmax layer to obtain the predicted text probability distribution P(y _ui |y _u-1 ...y ₀ , X).

Further, the text feature information of the current audio frame is determined according to the predicted text probability distribution P(y _ui |y _u-1 ...y ₀ ,X). Store the text feature information of the current audio frame. When predicting the next frame of the current audio frame, predict the next frame of the current audio frame based on the first audio feature code of the next frame of the current audio frame and the text feature information of the current audio frame. Frame speech emotion recognition results.

(2) Input the predicted text probability distribution and fusion feature vector into the preset attention layer to adjust the weight of the fusion feature vector to obtain the first target feature vector.

Specifically, the predicted text probability distribution P(y _ui | _y _u-1 ... _y ₀ , The weight is adjusted to obtain the first target feature vector c _ti .

S302. Fusion of the first target feature vector and the second audio feature encoding to obtain a second target feature vector.

Specifically, the first target feature vector c _ti and the second audio feature code are input into the preset shared network model Joint net to obtain the second target feature vector. Among them, the default shared network model is a tranfomer network. The default shared network model can also be a BiLSTM model.

S303. Input the second target feature vector into the target emotion recognition model to obtain the speech emotion recognition result of the current audio frame.

Furthermore, the model training of this application is divided into 4 stages:

a. The first stage: pre-training and transfer learning for the first acoustic model Encoder-1, the second acoustic model Encoder-2, and the preset language model. Based on the general BERT pre-trained language model and massive text data in the field of telephone customer service Training, obtain the trained language model, fintune the trained language model, and pre-train the language model.

b. The second stage: train the first acoustic model Encoder-1, the second acoustic model Encoder-2, the pre-trained language model and the preset shared network model respectively. Among them, the first acoustic model Encoder-1 and the second acoustic model Encoder-2 are first aligned end-to-end through the ctc network. The entire training includes two stages: the pre-training stage and the fintune stage. The pre-training only involves the first acoustic model. Training of model Encoder-1, second acoustic model Encoder-2 and language model training.

c. The third stage of learning: pre-train the target emotion recognition model, extract features through the second acoustic model Encoder-2, pre-train the target emotion recognition model, and obtain a pre-trained speech emotion classification model.

d. The fourth stage of learning: On the basis of pre-training, based on ftune corresponding to speech text (with emotion labels), joint training of speech and language data is achieved to obtain the final target emotion classification model. After the training of the emotion recognition task, the language model and the acoustic model are jointly trained, and then trained in parallel with the ASR task. The reason for this is to make full use of the public data in the ASR industry, and the purpose is to improve the representation ability of the acoustic model.

The present application provides a speech emotion recognition method and device. The speech emotion recognition method includes: obtaining the first audio feature code of the current audio frame and the text feature information of the historical audio frame, wherein the historical audio frame is before the current audio frame; based on The text feature information of the historical audio frame predicts the text feature coding of the current audio frame; the first audio feature coding and the text feature coding of the current audio frame are fused to obtain a fused feature vector; based on the fused feature vector, speech emotion recognition is performed to obtain the current audio frame's Speech emotion recognition results. This application first uses the text feature information of the historical audio frame to predict the text feature code of the current audio frame, and then fuses the text feature code of the current audio frame with the first audio feature code to perform speech emotion recognition, and combines the audio information with the text information. Deep fusion can improve the accuracy of speech emotion recognition

This application implements deep fusion representation of information based on joint task unified modeling. Through joint task learning, the information of emotional acoustic features and language features is integrated, effectively improving the accuracy of emotion recognition.

In order to better implement the voice emotion recognition method in the embodiment of the present application, based on the voice emotion recognition method, the embodiment of the present application also provides a voice emotion recognition device. As shown in Figure 5, the voice emotion recognition device 500 includes:

The acquisition unit 501 is used to acquire the first audio feature code of the current audio frame and the text feature information of the historical audio frame, where the historical audio frame is before the current audio frame;

Prediction unit 502, used to predict the text feature coding of the current audio frame based on the text feature information of historical audio frames;

Fusion unit 503 is used to fuse the first audio feature coding and text feature coding of the current audio frame to obtain the fusion feature eigenvector;

The recognition unit 504 is used to perform speech emotion recognition based on the fusion feature vector to obtain the speech emotion recognition result of the current audio frame.

Optionally, the identification unit 504 is used for:

Feature extraction is performed on the current audio frame based on the Teager energy operator to obtain the second audio feature;

Speech emotion recognition is performed based on the second audio feature encoding and fusion feature vector, and the speech emotion recognition result of the current audio frame is obtained.

Optionally, the identification unit 504 is used for:

Fusion of the first target feature vector and the second audio feature encoding to obtain the second target feature vector;

Optionally, the identification unit 504 is used for:

Input the predicted text probability distribution and fusion feature vector into the preset attention layer to adjust the weight of the fusion feature vector to obtain the first target feature vector.

Optionally, the acquisition unit 501 is used for:

Encode the first audio feature of the current audio frame to obtain the first audio feature code of the current audio frame.

Optionally, the acquisition unit 501 is used for:

Determine whether there is an audio frame before the historical audio frame;

Text feature information of the historical audio frame is determined based on the first audio feature coding and the preset text feature coding of the historical audio frame.

Optionally, the acquisition unit 501 is used for:

Input the historical fusion feature vector into the Softmax layer to obtain the historical predicted text probability distribution of historical audio frames;

Determine text feature information of historical audio frames based on historical predicted text probability distribution.

An embodiment of the present application also provides a computer device that integrates any voice emotion recognition device provided by the embodiment of the present application. The computer device includes:

one or more processors;

memory; and

One or more application programs, wherein one or more application programs are stored in the memory and configured to execute by the processor the steps of the voice emotion recognition method in any of the above voice emotion recognition method embodiments.

As shown in Figure 6, it shows a schematic structural diagram of the computer equipment involved in the embodiment of the present application. Specifically:

The computer device may include components such as a processor 601 of one or more processing cores, a memory 602 of one or more computer-readable storage media, a power supply 603, and an input unit 604. Those skilled in the art can understand that the structure of the computer equipment shown in the figures does not constitute a limitation on the computer equipment, and may include more or fewer components than shown in the figures, or combine certain components, or arrange different components. in:

The processor 601 is the control center of the computer equipment, using various interfaces and lines to connect various parts of the entire computer equipment, by running or executing software programs and/or modules stored in the memory 602, and calling software programs stored in the memory 602. Data, perform various functions of the computer equipment and process the data to conduct overall monitoring of the computer equipment. Optionally, the processor 601 may include one or more processing cores; the processor 601 may be a central processing unit (Central Processing Unit, CPU), or other general-purpose processor, digital signal processor (Digital Signal Processor, DSP). ), Application Specific Integrated Circuit (ASIC), off-the-shelf programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc. The general-purpose processor can be a microprocessor or the processor can be any conventional processor, etc. Preferably, the processor 601 can integrate an application processor and a modem processor, where the application processor mainly processes the operating system, User interfaces and applications, etc. The modem processor mainly handles wireless communications. It can be understood that the above modem processor may not be integrated into the processor 601.

The memory 602 can be used to store software programs and modules. The processor 601 executes various functional applications and data processing by running the software programs and modules stored in the memory 602 . The memory 602 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function (such as a sound playback function, an image playback function, etc.), etc.; the storage data area may store a program based on Data created by the use of computer equipment, etc. In addition, memory 602 may include high-speed random access memory and may also include non-volatile memory, such as At least one disk storage device, flash memory device, or other volatile solid-state storage device. Accordingly, the memory 602 may also include a memory controller to provide the processor 601 with access to the memory 602 .

The computer equipment also includes a power supply 603 that supplies power to various components. Preferably, the power supply 603 can be logically connected to the processor 601 through a power management system, so that functions such as charging, discharging, and power consumption management can be implemented through the power management system. The power supply 603 may also include one or more DC or AC power supplies, recharging systems, power failure detection circuits, power converters or inverters, power status indicators, and other arbitrary components.

The computer device may also include an input unit 604 operable to receive input numeric or character information and generate keyboard, mouse, joystick, optical, or trackball signal inputs related to user settings and functional controls.

Although not shown, the computer device may also include a display unit and the like, which will not be described again here. Specifically, in this embodiment, the processor 601 in the computer device will load the executable files corresponding to the processes of one or more application programs into the memory 602 according to the following instructions, and the processor 601 will run the executable files stored in The application program in the memory 602 implements various functions, as follows:

Obtain the first audio feature code of the current audio frame and the text feature information of the historical audio frame, where the historical audio frame is before the current audio frame; predict the text feature code of the current audio frame based on the text feature information of the historical audio frame; fuse the current audio The first audio feature encoding and text feature encoding of the frame are used to obtain a fusion feature vector; speech emotion recognition is performed based on the fusion feature vector to obtain the speech emotion recognition result of the current audio frame.

Those of ordinary skill in the art can understand that all or part of the steps in the various methods of the above embodiments can be completed by instructions, or by controlling relevant hardware through instructions. The instructions can be stored in a computer-readable storage medium, and loaded and executed by the processor.

To this end, embodiments of the present application provide a computer-readable storage medium, which may include: read-only memory (ROM, Read Only Memory), random access memory (RAM, Random Access Memory), magnetic disk or optical disk, etc. . A computer program is stored thereon, and the computer program is loaded by the processor to execute the steps in any of the speech emotion recognition methods provided by the embodiments of the present application. For example, a computer program loaded by a processor may perform the following steps:

Obtain the first audio feature code of the current audio frame and the text feature information of the historical audio frame, where the historical audio frame is before the current audio frame; predict the text feature code of the current audio frame based on the text feature information of the historical audio frame; fuse the current audio The first audio feature encoding and text feature encoding of the frame are used to obtain the fusion feature vector; based on the fusion feature The vector performs speech emotion recognition and obtains the speech emotion recognition result of the current audio frame.

In the above embodiments, each embodiment is described with its own emphasis. For parts that are not described in detail in a certain embodiment, please refer to the above detailed descriptions of other embodiments and will not be described again here.

During specific implementation, each of the above units or structures can be implemented as an independent entity, or can be combined in any way and implemented as the same or several entities. For the specific implementation of each of the above units or structures, please refer to the previous method embodiments. Here No longer.

For the specific implementation of each of the above operations, please refer to the previous embodiments and will not be described again here.

The speech emotion recognition method and device provided by the embodiments of the present application have been introduced in detail. This article uses specific examples to illustrate the principles and implementation methods of the present application. The description of the above embodiments is only used to help understand the present application. The application method and its core idea; at the same time, for those skilled in the art, there will be changes in the specific implementation and application scope based on the ideas of this application. In summary, the content of this specification should not be understood as an infringement of this application. limits.

Claims

A voice emotion recognition method, characterized in that the voice emotion recognition method includes:

Obtain the first audio feature code of the current audio frame and the text feature information of the historical audio frame, where the historical audio frame precedes the current audio frame;

Predict the text feature encoding of the current audio frame based on the text feature information of the historical audio frame;

Fusion of the first audio feature code of the current audio frame and the text feature code to obtain a fusion feature vector;

Perform speech emotion recognition based on the fused feature vector to obtain the speech emotion recognition result of the current audio frame.
The speech emotion recognition method according to claim 1, wherein the speech emotion recognition based on the fusion feature vector to obtain the speech emotion recognition result of the current audio frame includes:

Perform feature extraction on the current audio frame based on the Teager energy operator to obtain the second audio feature;

Encode the second audio feature to obtain the second audio feature code of the current audio frame;

Speech emotion recognition is performed based on the second audio feature encoding and the fusion feature vector to obtain a speech emotion recognition result of the current audio frame.
The speech emotion recognition method according to claim 2, characterized in that the speech emotion recognition is performed based on the second audio feature encoding and the fusion feature vector to obtain the speech emotion recognition result of the current audio frame, including :

Adjust the weight of the fusion feature vector based on the preset attention layer to obtain the first target feature vector;

Fusion of the first target feature vector and the second audio feature encoding to obtain a second target feature vector;

The second target feature vector is input into the target emotion recognition model to obtain the speech emotion recognition result of the current audio frame.
The speech emotion recognition method according to claim 3, characterized in that the weight adjustment of the fusion feature vector based on the preset attention layer to obtain the first target feature vector includes:

Input the fused feature vector into the Softmax layer to obtain the predicted text probability distribution;

The predicted text probability distribution and the fusion feature vector are input into the preset attention layer to adjust the weight of the fusion feature vector to obtain the first target feature vector.
The speech emotion recognition method according to claim 1, wherein said obtaining the first audio feature encoding of the current audio frame and the text feature information of historical audio frames includes:

Perform Fbank feature extraction on the current audio frame to obtain the first audio feature of the current audio frame;

The first audio feature of the current audio frame is encoded to obtain the first audio feature encoding of the current audio frame.
The speech emotion recognition method according to claim 1, wherein said obtaining the first audio feature encoding of the current audio frame and the text feature information of historical audio frames includes:

Determine whether there is an audio frame before the historical audio frame;

If there is no audio frame before the historical audio frame, obtain the first audio feature code and the preset text feature code of the historical audio frame;

The text feature information of the historical audio frame is determined based on the first audio feature code of the historical audio frame and the preset text feature code.
The speech emotion recognition method according to claim 6, wherein the text feature information of the historical audio frame is determined based on the first audio feature coding and the preset text feature coding of the historical audio frame, including :

Fusion of the first audio feature code and the preset text feature code of the historical audio frame to obtain a historical fusion feature vector;

Input the historical fusion feature vector into the Softmax layer to obtain the historical predicted text probability distribution of the historical audio frame;

Text feature information of the historical audio frame is determined based on the historical predicted text probability distribution.
A voice emotion recognition device, characterized in that the voice emotion recognition device includes:

An acquisition unit, configured to acquire the first audio feature encoding of the current audio frame and the text feature information of the historical audio frame, where the historical audio frame precedes the current audio frame;

A prediction unit, configured to predict the text feature encoding of the current audio frame based on the text feature information of the historical audio frame;

a fusion unit, configured to fuse the first audio feature coding of the current audio frame and the text feature coding to obtain a fusion feature vector;

A recognition unit, configured to perform speech emotion recognition based on the fusion feature vector, and obtain a speech emotion recognition result of the current audio frame.
A computer device, characterized in that the computer device includes:

one or more processors;

memory; and

One or more application programs, wherein the one or more application programs are stored in the memory and configured to be executed by the processor to implement the speech emotion recognition of any one of claims 1 to 7 method.
A computer-readable storage medium, characterized in that a computer program is stored thereon, and the computer program is loaded by a processor to execute the steps of the speech emotion recognition method described in any one of claims 1 to 7.