WO2024008215A2 - Speech emotion recognition method and apparatus - Google Patents

Speech emotion recognition method and apparatus Download PDF

Info

Publication number
WO2024008215A2
WO2024008215A2 (PCT/CN2023/117475, CN2023117475W)
Authority
WO
WIPO (PCT)
Prior art keywords
audio frame
feature
emotion recognition
historical
text
Prior art date
Application number
PCT/CN2023/117475
Other languages
French (fr)
Chinese (zh)
Other versions
WO2024008215A3 (en)
Inventor
刘汝洲
Original Assignee
顺丰科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 顺丰科技有限公司 filed Critical 顺丰科技有限公司
Publication of WO2024008215A2 publication Critical patent/WO2024008215A2/en
Publication of WO2024008215A3 publication Critical patent/WO2024008215A3/en

Links

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/26 Speech to text systems
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/63 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04M TELEPHONIC COMMUNICATION
    • H04M3/00 Automatic or semi-automatic exchanges
    • H04M3/42 Systems providing special services or facilities to subscribers
    • H04M3/50 Centralised arrangements for answering calls; Centralised arrangements for recording messages for absent or busy subscribers; Centralised arrangements for recording messages
    • H04M3/527 Centralised call answering arrangements not requiring operator intervention

Definitions

  • This application mainly relates to the field of artificial intelligence technology, and specifically relates to a speech emotion recognition method and device.
  • This application provides a voice emotion recognition method and device, aiming to solve the problem of low accuracy of voice emotion recognition in the prior art.
  • this application provides a voice emotion recognition method.
  • the voice emotion recognition method includes:
  • performing speech emotion recognition based on the fused feature vector to obtain the speech emotion recognition result of the current audio frame includes:
  • Speech emotion recognition is performed based on the second audio feature encoding and the fusion feature vector to obtain a speech emotion recognition result of the current audio frame.
  • performing speech emotion recognition based on the second audio feature encoding and the fused feature vector to obtain the speech emotion recognition result of the current audio frame includes:
  • the second target feature vector is input into the target emotion recognition model to obtain the speech emotion recognition result of the current audio frame.
  • the weight adjustment of the fused feature vector based on the preset attention layer to obtain the first target feature vector includes:
  • the predicted text probability distribution and the fusion feature vector are input into the preset attention layer to adjust the weight of the fusion feature vector to obtain the first target feature vector.
  • the obtaining the first audio feature encoding of the current audio frame and the text feature information of the historical audio frame includes:
  • the first audio feature of the current audio frame is encoded to obtain the first audio feature encoding of the current audio frame.
  • the obtaining the first audio feature encoding of the current audio frame and the text feature information of the historical audio frame includes:
  • the text feature information of the historical audio frame is determined based on the first audio feature code of the historical audio frame and the preset text feature code.
  • determining the text feature information of the historical audio frame based on the first audio feature coding of the historical audio frame and the preset text feature coding includes:
  • Text feature information of the historical audio frame is determined based on the historical predicted text probability distribution.
  • this application provides a voice emotion recognition device.
  • the voice emotion recognition device includes:
  • An acquisition unit configured to acquire the first audio feature encoding of the current audio frame and the text feature information of the historical audio frame, where the historical audio frame precedes the current audio frame;
  • a prediction unit configured to predict the text feature encoding of the current audio frame based on the text feature information of the historical audio frame
  • a fusion unit configured to fuse the first audio feature coding of the current audio frame and the text feature coding to obtain a fusion feature vector
  • a recognition unit configured to perform speech emotion recognition based on the fusion feature vector, and obtain a speech emotion recognition result of the current audio frame.
  • the identification unit is used for:
  • Speech emotion recognition is performed based on the second audio feature encoding and the fusion feature vector to obtain a speech emotion recognition result of the current audio frame.
  • the identification unit is used for:
  • the second target feature vector is input into the target emotion recognition model to obtain the speech emotion recognition result of the current audio frame.
  • the identification unit is used for:
  • the predicted text probability distribution and the fusion feature vector are input into the preset attention layer to adjust the weight of the fusion feature vector to obtain the first target feature vector.
  • the acquisition unit is used for:
  • the first audio feature of the current audio frame is encoded to obtain the first audio feature encoding of the current audio frame.
  • the acquisition unit is used for:
  • the text feature information of the historical audio frame is determined based on the first audio feature code of the historical audio frame and the preset text feature code.
  • the acquisition unit is used for:
  • Text feature information of the historical audio frame is determined based on the historical predicted text probability distribution.
  • this application provides a computer device, which includes:
  • one or more processors; a memory; and
  • one or more application programs, wherein the one or more application programs are stored in the memory and configured to be executed by the processor to implement the voice emotion recognition method according to any one of the first aspects.
  • the present application provides a computer-readable storage medium storing a plurality of instructions, where the instructions are suitable for being loaded by a processor to execute the steps in the speech emotion recognition method according to any one of the first aspects.
  • the present application provides a speech emotion recognition method and device.
  • the speech emotion recognition method includes: obtaining the first audio feature encoding of the current audio frame and the text feature information of the historical audio frame, where the historical audio frame precedes the current audio frame; predicting the text feature encoding of the current audio frame based on the text feature information of the historical audio frame; fusing the first audio feature encoding and the text feature encoding of the current audio frame to obtain a fused feature vector; and performing speech emotion recognition based on the fused feature vector to obtain the speech emotion recognition result of the current audio frame.
  • This application first uses the text feature information of the historical audio frame to predict the text feature encoding of the current audio frame, then fuses the text feature encoding of the current audio frame with the first audio feature encoding before performing speech emotion recognition. Deeply fusing audio information and text information in this way can improve the accuracy of speech emotion recognition.
  • Figure 1 is a schematic scene diagram of the speech emotion recognition system provided by the embodiment of the present application.
  • Figure 2 is a schematic flow diagram of an embodiment of the speech emotion recognition method provided in the embodiment of the present application.
  • Figure 3 is a schematic module diagram of an embodiment of the speech emotion recognition method provided in the embodiment of the present application.
  • Figure 4 is a schematic flow chart of performing speech emotion recognition based on the second audio feature encoding and fusion feature vector to obtain the speech emotion recognition result of the current audio frame in one embodiment of the speech emotion recognition method provided in the embodiment of the present application;
  • Figure 5 is a schematic structural diagram of an embodiment of the speech emotion recognition device provided in the embodiment of the present application.
  • FIG. 6 is a schematic structural diagram of an embodiment of a computer device provided in an embodiment of the present application.
  • The terms "first" and "second" are used for descriptive purposes only and cannot be understood as indicating or implying relative importance or implicitly indicating the quantity of the indicated technical features. Therefore, features defined as "first" and "second" may explicitly or implicitly include one or more such features. In the description of this application, "plurality" means two or more, unless otherwise expressly and specifically limited.
  • the embodiment of the present application provides a speech emotion recognition method and device, which will be described in detail below.
  • Figure 1 is a schematic diagram of a scene of a voice emotion recognition system provided by an embodiment of the present application.
  • the voice emotion recognition system may include a computer device 100, and a voice emotion recognition device is integrated in the computer device 100.
  • the computer device 100 may be an independent server, or a server network or server cluster composed of servers.
  • the computer device 100 described in the embodiment of the present application includes, but is not limited to, a computer, a network host, a single network server, a set of multiple network servers, or a cloud server composed of multiple servers.
  • the cloud server consists of a large number of computers or network servers based on cloud computing (Cloud Computing).
  • the above-mentioned computer device 100 may be a general-purpose computer device or a special-purpose computer device.
  • the computer device 100 may be a desktop computer, a portable computer, a network server, a personal digital assistant (PDA), a mobile phone, a tablet computer, a wireless terminal device, a communication device, an embedded device, etc.
  • PDA personal digital assistant
  • This embodiment does not limit the type of the computer device 100.
  • Figure 1 is only one application scenario of the solution of the present application and does not constitute a limitation on the application scenarios of the solution of the present application.
  • Other application environments may include more or fewer computer devices than shown in Figure 1.
  • It can be understood that the speech emotion recognition system may also include one or more other computer devices capable of processing data; this is not limited here.
  • the voice emotion recognition system may also include a memory 200 for storing data.
  • an embodiment of the present application provides a speech emotion recognition method.
  • the speech emotion recognition method includes: obtaining the first audio feature encoding of the current audio frame and the text feature information of the historical audio frame, where the historical audio frame precedes the current audio frame; predicting the text feature encoding of the current audio frame based on the text feature information of the historical audio frame; fusing the first audio feature encoding and the text feature encoding of the current audio frame to obtain a fused feature vector; and performing speech emotion recognition based on the fused feature vector to obtain the speech emotion recognition result of the current audio frame.
  • Figure 2 is a schematic flow diagram of an embodiment of the speech emotion recognition method provided in the embodiment of the present application
  • Figure 3 is a module schematic diagram of an embodiment of the speech emotion recognition method provided in the embodiment of the present application.
  • the speech emotion recognition method includes the following steps S201 to S204:
  • the historical audio frame is before the current audio frame.
  • the historical audio frame is the audio frame preceding the current audio frame.
  • the current audio frame and the historical audio frame have the same length, for example both 10 ms to 30 ms, which can be set according to specific requirements.
  • the audio to be recognized is obtained; the audio to be recognized is divided into frames to obtain multiple audio frames, and the current audio frame is obtained from the multiple audio frames.
  • For framing, 10-30 ms is generally taken as one frame.
  • For frame overlap, a portion of overlap is required between adjacent frames.
  • Generally, half of the frame length is used as the frame shift; that is, each frame is shifted by half a frame before the next frame is taken, which prevents the characteristics from changing too much between adjacent frames.
  • A common choice is 25 ms per frame with a 10 ms frame shift.
  • Framing is necessary because the speech signal changes rapidly, and the Fourier transform is suitable for analyzing stationary signals.
  • the frame length is generally set to 10 to 30ms, so that there are enough cycles in one frame without changing too drastically.
  • Each frame signal is usually multiplied by a smooth window function to allow both ends of the frame to smoothly attenuate to zero. This can reduce the intensity of the side lobes after Fourier transform and obtain a higher quality spectrum.
  • The time difference between frame starts is often taken as 10 ms, so that adjacent frames overlap; otherwise, because the signal at the junction between frames is weakened by windowing, that part of the information would be lost.
  • the Fourier transform is performed frame by frame in order to obtain the spectrum of each frame. Generally, only the amplitude spectrum is retained and the phase spectrum is discarded.
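  • As an illustration of the framing, windowing, and frame-by-frame Fourier transform described above, the following is a minimal numpy sketch; the 16 kHz sample rate, 25 ms frame length, 10 ms frame shift, and Hamming window are illustrative assumptions rather than values mandated by this application.

```python
import numpy as np

def frame_signal(signal, sample_rate=16000, frame_len_ms=25, frame_shift_ms=10):
    """Split a 1-D speech signal into overlapping, windowed frames and
    return the per-frame magnitude spectrum (the phase spectrum is discarded)."""
    frame_len = int(sample_rate * frame_len_ms / 1000)      # e.g. 400 samples
    frame_shift = int(sample_rate * frame_shift_ms / 1000)  # e.g. 160 samples
    num_frames = 1 + max(0, (len(signal) - frame_len) // frame_shift)
    window = np.hamming(frame_len)  # smooth window so frame edges decay to zero

    frames = np.stack([
        signal[i * frame_shift: i * frame_shift + frame_len] * window
        for i in range(num_frames)
    ])
    mag_spectrum = np.abs(np.fft.rfft(frames, axis=1))       # frame-by-frame FFT
    return frames, mag_spectrum

# Usage: frames, spec = frame_signal(np.random.randn(16000))  # 1 s of audio
```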
  • obtaining the first audio feature code of the current audio frame and the text feature information of the historical audio frame may include:
  • Fbank is FilterBank.
  • the response of the human ear to the sound spectrum is nonlinear.
  • Fbank is a front-end processing algorithm that processes audio in a manner similar to the human ear, which can improve the performance of speech recognition.
  • the general steps to obtain the fbank characteristics of the speech signal are: pre-emphasis, framing, windowing, short-time Fourier transform (STFT), mel filtering, demeaning, etc.
  • Fbank feature extraction and fundamental frequency feature extraction are performed on the current audio frame respectively to obtain Fbank features and Pitch features, and the Fbank features and Pitch features are fused to obtain the first audio feature x1 ti of the current audio frame.
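  • A minimal sketch of extracting the first audio feature (Fbank fused with Pitch) is given below. It uses librosa purely for illustration; the library choice, mel-band count, and pitch search range are assumptions, not part of this application.

```python
import numpy as np
import librosa

def first_audio_features(wav, sr=16000, n_mels=40):
    """Frame-level first audio features: Fbank (log mel filter bank) features
    concatenated with a pitch track, one row per audio frame."""
    hop, win = 160, 400                                   # 10 ms shift, 25 ms frame
    mel = librosa.feature.melspectrogram(y=wav, sr=sr, n_fft=512,
                                         win_length=win, hop_length=hop,
                                         n_mels=n_mels)
    fbank = librosa.power_to_db(mel)                      # (n_mels, T)

    f0 = librosa.yin(wav, fmin=60, fmax=400, sr=sr, hop_length=hop)  # pitch track
    T = min(fbank.shape[1], len(f0))
    return np.vstack([fbank[:, :T], f0[None, :T]]).T      # (T, n_mels + 1) rows of x1_ti
```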
  • The pitch period (Pitch) is the reciprocal of the vibration frequency of the vocal cords: when a person produces a voiced sound, airflow passing through the vocal tract causes the vocal cords to vibrate, and the period of that vibration is the pitch period.
  • The estimation of the pitch period is called pitch detection (Pitch Detection).
  • The fundamental frequency contains a large number of features that characterize speech emotion and is crucial in speech emotion recognition. Commonly used fundamental frequency extraction methods include the autocorrelation function (ACF) method, the average magnitude difference function (AMDF) method in the time domain, and wavelet methods in the frequency domain.
  • ACF: autocorrelation function method
  • AMDF: average magnitude difference function method (time domain)
  • wavelet method (frequency domain)
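  • For reference, a minimal autocorrelation-function (ACF) pitch detector over a single voiced frame might look like the sketch below; the 60-400 Hz search range is an illustrative assumption.

```python
import numpy as np

def pitch_by_autocorrelation(frame, sr=16000, fmin=60, fmax=400):
    """Estimate the fundamental frequency of one voiced frame by locating the
    strongest autocorrelation peak inside the plausible pitch-period range."""
    frame = frame - frame.mean()
    acf = np.correlate(frame, frame, mode='full')[len(frame) - 1:]
    lag_min = int(sr / fmax)                       # shortest plausible pitch period
    lag_max = min(int(sr / fmin), len(acf) - 1)    # longest plausible pitch period
    peak_lag = lag_min + np.argmax(acf[lag_min:lag_max + 1])
    return sr / peak_lag                           # fundamental frequency in Hz
```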
  • the first audio feature x1 ti of the current audio frame is input to the first acoustic model Encoder-1 for encoding to obtain the first audio feature encoding of the current audio frame.
  • the first acoustic model Encoder-1 can be a hidden Markov model (HMM), a deep neural network (DNN), a convolutional neural network (CNN), a recurrent neural network (RNN), etc.
  • the first acoustic model Encoder-1 can be a BiLSTM model.
  • The output layer of the first acoustic model Encoder-1 is constructed as a CTC (Connectionist Temporal Classification) network.
  • The CTC network is used to align the per-frame speech features with the labels.
  • The input of the first acoustic model Encoder-1 is the first audio feature x1 ti obtained by fusing the Fbank feature and the Pitch feature, that is, fbank + pitch.
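  • The following PyTorch sketch shows one way such a BiLSTM encoder with a CTC-style output layer could be structured; all layer sizes and the vocabulary size are illustrative assumptions.

```python
import torch
import torch.nn as nn

class AcousticEncoder(nn.Module):
    """Sketch of Encoder-1: a BiLSTM over frame features (fbank + pitch) whose
    CTC output layer is used to align frame-level speech features with labels."""
    def __init__(self, feat_dim=41, hidden=256, vocab_size=5000):
        super().__init__()
        self.bilstm = nn.LSTM(feat_dim, hidden, num_layers=2,
                              bidirectional=True, batch_first=True)
        self.ctc_out = nn.Linear(2 * hidden, vocab_size + 1)   # +1 for the CTC blank

    def forward(self, x):                      # x: (batch, frames, feat_dim)
        enc, _ = self.bilstm(x)                # first audio feature encoding
        log_probs = self.ctc_out(enc).log_softmax(dim=-1)  # suitable for nn.CTCLoss
        return enc, log_probs
```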
  • the text feature information of historical audio frames may be manually annotated.
  • obtaining the first audio feature code of the current audio frame and the text feature information of the historical audio frame may include:
  • The preset text feature encoding is a manually specified default text feature encoding, used when the historical audio frame has no preceding audio frame.
  • Determining the text feature information of the historical audio frame based on the first audio feature encoding of the historical audio frame and the preset text feature encoding includes: fusing the first audio feature encoding of the historical audio frame and the preset text feature encoding to obtain a historical fusion feature vector; inputting the historical fusion feature vector into the Softmax layer to obtain the historical predicted text probability distribution of the historical audio frame; and determining the text feature information of the historical audio frame based on the historical predicted text probability distribution. Specifically, the text with the highest probability in the historical predicted text probability distribution is determined as the text feature information y ui-1 of the historical audio frame.
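  • A minimal sketch of this step, assuming the historical fusion feature vector has already been computed by the shared network, is shown below; the vocabulary projection layer is an assumption introduced for illustration.

```python
import torch
import torch.nn as nn

def historical_text_feature(hist_fusion_vector, vocab_proj: nn.Linear):
    """Pass the historical fusion feature vector through a Softmax over the
    vocabulary and keep the most probable text as the text feature
    information y_ui-1 of the historical audio frame."""
    probs = torch.softmax(vocab_proj(hist_fusion_vector), dim=-1)  # historical predicted
    return probs.argmax(dim=-1)                                    # text distribution -> best text
```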
  • Alternatively, speech-to-text software may be used to convert the first audio feature encoding of the historical audio frame into text, obtaining the text feature information y ui-1 of the historical audio frame.
  • the text feature information y ui-1 of the historical audio frame is input into the preset language model to obtain the text feature code p ui of the current audio frame.
  • the preset language model can be BERT model, LSTM model, xlnet, GPT, etc.
  • the full name of LSTM is Long Short-Term Memory, which is a type of RNN (Recurrent Neural Network). Due to its design characteristics, LSTM is very suitable for modeling time series data, such as text data.
  • BiLSTM is the abbreviation of Bi-directional Long Short-Term Memory, which is a combination of forward LSTM and backward LSTM. Both are often used to model contextual information in natural language processing tasks.
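  • As a sketch, a simple LSTM prediction network standing in for the preset language model (any of the BERT, LSTM, XLNet, or GPT options above) could look like this; the embedding and hidden sizes are assumptions.

```python
import torch
import torch.nn as nn

class TextPredictor(nn.Module):
    """Predict the text feature encoding p_ui of the current frame from the
    text feature information y_ui-1 of the historical audio frame."""
    def __init__(self, vocab_size=5000, embed=256, hidden=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed)
        self.lstm = nn.LSTM(embed, hidden, batch_first=True)

    def forward(self, prev_tokens, state=None):   # prev_tokens: (batch, u)
        p_ui, state = self.lstm(self.embed(prev_tokens), state)
        return p_ui, state                         # text feature encoding p_ui
```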
  • The first audio feature encoding of the current audio frame and the text feature encoding p ui are input into the preset shared network model Joint net to obtain the fusion feature vector h ti.
  • In some embodiments, the preset shared network model Joint net is a Transformer network.
  • In other embodiments, the preset shared network model Joint net can also be a BiLSTM model.
  • The fusion feature vector h ti is input into the target emotion recognition model (Mood classifier) for classification, and the resulting classification is determined as the speech emotion recognition result of the current audio frame.
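  • The sketch below illustrates one possible form of the shared Joint net followed by the emotion classifier; a simple feed-forward fusion is used instead of a full Transformer or BiLSTM, and the layer sizes and six emotion classes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class JointNetAndClassifier(nn.Module):
    """Fuse the first audio feature encoding with the text feature encoding
    p_ui into the fusion feature vector h_ti, then classify its emotion."""
    def __init__(self, audio_dim=512, text_dim=256, joint_dim=512, n_emotions=6):
        super().__init__()
        self.joint = nn.Sequential(nn.Linear(audio_dim + text_dim, joint_dim),
                                   nn.Tanh())
        self.mood_classifier = nn.Linear(joint_dim, n_emotions)

    def forward(self, audio_enc, text_enc):
        h_ti = self.joint(torch.cat([audio_enc, text_enc], dim=-1))  # fusion feature vector
        return h_ti, self.mood_classifier(h_ti).softmax(dim=-1)      # emotion probabilities
```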
  • the target emotion classification model is obtained by training a preset classification neural network model through an emotion classification training set.
  • the emotion classification training set includes multiple emotion classification training samples.
  • the emotion classification training samples include emotion sample characteristics and corresponding sample labels.
  • The preset classification neural network model can be a DNN. Sample labels can include multiple categories such as happiness, sadness, anger, disgust, fear, and surprise.
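  • A minimal training-step sketch for the preset classification neural network on such an emotion classification training set is shown below; the feature dimension, optimiser, and learning rate are assumptions.

```python
import torch
import torch.nn as nn

EMOTIONS = ["happiness", "sadness", "anger", "disgust", "fear", "surprise"]

model = nn.Sequential(nn.Linear(512, 256), nn.ReLU(),
                      nn.Linear(256, len(EMOTIONS)))       # a simple DNN classifier
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

def train_step(sample_features, sample_labels):
    """sample_features: (batch, 512) emotion sample features;
    sample_labels: (batch,) integer indices into EMOTIONS."""
    optimizer.zero_grad()
    loss = loss_fn(model(sample_features), sample_labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```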
  • speech emotion recognition is performed based on the second audio feature encoding and the fusion feature vector to obtain the speech emotion recognition result of the current audio frame, which may include:
  • the Teager energy operator is a nonlinear operator that can track the instantaneous energy of a signal.
  • The Teager energy operator of the current audio frame is defined as Ψ[x(n)] = x(n)² - x(n-1)·x(n+1), where x(n) is the signal of the current audio frame and Ψ[·] denotes the Teager energy operator.
  • Feature extraction is performed on the current audio frame based on the Teager energy operator to obtain the second audio feature x2 ti .
  • the second audio feature x2 ti is higher-order and has richer features than the first audio feature x1 ti .
  • the introduction of high-order features can improve the ability to represent speech emotion feature vectors and improve the accuracy of emotion classification.
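  • The Teager energy computation and a simple higher-order second audio feature built from it might look like the following sketch; the particular statistics chosen are an illustrative assumption.

```python
import numpy as np

def teager_energy(frame):
    """Teager energy operator applied sample by sample to one audio frame:
    psi[x(n)] = x(n)^2 - x(n-1) * x(n+1)."""
    x = np.asarray(frame, dtype=float)
    return x[1:-1] ** 2 - x[:-2] * x[2:]

def second_audio_feature(frame):
    """Summary statistics of the frame's Teager energy, used here as the
    higher-order second audio feature x2_ti."""
    teo = teager_energy(frame)
    return np.array([teo.mean(), teo.std(), teo.max(), teo.min()])
```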
  • the second audio feature x2 ti is input to the second acoustic model Encoder-2 for encoding to obtain the second audio feature encoding of the current audio frame.
  • the second acoustic model Encoder-2 can be a hidden Markov model (HMM), a deep neural network (DNN), a convolutional neural network (CNN), a recurrent neural network (RNN), etc.
  • the second acoustic model Encoder-2 can be a BiLSTM model.
  • Figure 4 is a schematic flow chart of performing voice emotion recognition based on the second audio feature encoding and fusion feature vector to obtain the voice emotion recognition result of the current audio frame in one embodiment of the voice emotion recognition method provided in the embodiment of the present application.
  • performing speech emotion recognition based on the second audio feature encoding and fusion feature vector to obtain the speech emotion recognition result of the current audio frame may include S301-S303:
  • the default attention layer is a model based on the Attention mechanism.
  • The general definition of Attention is as follows: given a set of Value vectors and a Query vector, the Attention mechanism computes a weighted sum of the Values based on the Query.
  • the specific calculation process of the Attention mechanism can be summarized into two processes: the first process is to calculate the weight coefficient based on Query and Key, and the second process is to perform a weighted sum of Value based on the weight coefficient.
  • the first process can be subdivided into two stages: the first stage calculates the similarity or correlation between the two based on Query and Key; the second stage normalizes the original scores of the first stage.
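  • The two-stage computation described above corresponds to standard scaled dot-product attention; a minimal numpy sketch follows (the scaling by the square root of the dimension is a common convention, not something specified by this application).

```python
import numpy as np

def attention(query, keys, values):
    """Stage 1: score the Query against each Key and normalise the scores;
    Stage 2: return the weighted sum of the Values."""
    scores = keys @ query / np.sqrt(query.shape[-1])   # similarity of Query and Keys
    weights = np.exp(scores - scores.max())
    weights = weights / weights.sum()                   # softmax normalisation
    return weights @ values                             # weighted sum of Values
```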
  • the preset attention layer may be a self-attention layer.
  • In some embodiments, adjusting the weight of the fused feature vector based on the preset attention layer to obtain the first target feature vector may include: obtaining the vector Query, the vector Key, and the vector Value from the fused feature vector; and inputting the vector Query, the vector Key, and the vector Value into the preset attention layer (Attention) to adjust the weight of the fused feature vector and obtain the first target feature vector.
  • adjusting the weight of the fusion feature vector based on the preset attention layer to obtain the first target feature vector may include:
  • The fusion feature vector h ti is input into the Softmax layer to obtain the predicted text probability distribution P(y ui ).
  • The text feature information of the current audio frame is determined according to the predicted text probability distribution P(y ui ), so that the text feature encoding of the next frame can be predicted based on the first audio feature encoding of the next frame and the text feature information of the current audio frame.
  • The predicted text probability distribution and the fusion feature vector are then input into the preset attention layer, and the weight of the fusion feature vector is adjusted to obtain the first target feature vector c ti.
  • the first target feature vector c ti and the second audio feature code are input into the preset shared network model Joint net to obtain the second target feature vector.
  • In some embodiments, the preset shared network model is a Transformer network.
  • the default shared network model can also be a BiLSTM model.
  • The model training of this application is divided into four stages.
  • The first stage: pre-training and transfer learning for the first acoustic model Encoder-1, the second acoustic model Encoder-2, and the preset language model.
  • Starting from a general BERT pre-trained language model, training on massive text data from the telephone customer service domain yields a trained language model; the trained language model is then fine-tuned to obtain the pre-trained language model.
  • The second stage: train the first acoustic model Encoder-1, the second acoustic model Encoder-2, the pre-trained language model, and the preset shared network model.
  • The first acoustic model Encoder-1 and the second acoustic model Encoder-2 are first aligned end-to-end through the CTC network.
  • The entire training includes two stages: the pre-training stage and the fine-tuning stage.
  • The pre-training only involves training of the first acoustic model Encoder-1, the second acoustic model Encoder-2, and the language model.
  • The third stage: pre-train the target emotion recognition model; features are extracted through the second acoustic model Encoder-2 and used to pre-train the target emotion recognition model, obtaining a pre-trained speech emotion classification model.
  • The fourth stage: on the basis of the pre-training, fine-tuning on speech with corresponding text (carrying emotion labels) achieves joint training on speech and language data to obtain the final target emotion classification model. For the emotion recognition task, the language model and the acoustic model are trained jointly and in parallel with the ASR task; the reason is to make full use of public data in the ASR field, with the aim of improving the representation ability of the acoustic model.
  • the present application provides a speech emotion recognition method and device.
  • the speech emotion recognition method includes: obtaining the first audio feature encoding of the current audio frame and the text feature information of the historical audio frame, where the historical audio frame precedes the current audio frame; predicting the text feature encoding of the current audio frame based on the text feature information of the historical audio frame; fusing the first audio feature encoding and the text feature encoding of the current audio frame to obtain a fused feature vector; and performing speech emotion recognition based on the fused feature vector to obtain the speech emotion recognition result of the current audio frame.
  • This application first uses the text feature information of the historical audio frame to predict the text feature encoding of the current audio frame, then fuses the text feature encoding of the current audio frame with the first audio feature encoding before performing speech emotion recognition. Deeply fusing audio information and text information in this way can improve the accuracy of speech emotion recognition.
  • This application implements deep fusion representation of information based on joint task unified modeling. Through joint task learning, the information of emotional acoustic features and language features is integrated, effectively improving the accuracy of emotion recognition.
  • the embodiment of the present application also provides a voice emotion recognition device.
  • the voice emotion recognition device 500 includes:
  • the acquisition unit 501 is used to acquire the first audio feature code of the current audio frame and the text feature information of the historical audio frame, where the historical audio frame is before the current audio frame;
  • The prediction unit 502 is used to predict the text feature encoding of the current audio frame based on the text feature information of the historical audio frame;
  • The fusion unit 503 is used to fuse the first audio feature encoding and the text feature encoding of the current audio frame to obtain the fusion feature vector;
  • the recognition unit 504 is used to perform speech emotion recognition based on the fusion feature vector to obtain the speech emotion recognition result of the current audio frame.
  • the identification unit 504 is used for:
  • Feature extraction is performed on the current audio frame based on the Teager energy operator to obtain the second audio feature
  • Speech emotion recognition is performed based on the second audio feature encoding and fusion feature vector, and the speech emotion recognition result of the current audio frame is obtained.
  • the identification unit 504 is used for:
  • the second target feature vector is input into the target emotion recognition model to obtain the speech emotion recognition result of the current audio frame.
  • the identification unit 504 is used for:
  • the acquisition unit 501 is used for:
  • the acquisition unit 501 is used for:
  • Text feature information of the historical audio frame is determined based on the first audio feature coding and the preset text feature coding of the historical audio frame.
  • the acquisition unit 501 is used for:
  • An embodiment of the present application also provides a computer device that integrates any voice emotion recognition device provided by the embodiment of the present application.
  • the computer device includes:
  • one or more processors; a memory; and
  • one or more application programs, where the one or more application programs are stored in the memory and configured to be executed by the processor to implement the steps of the voice emotion recognition method in any of the above voice emotion recognition method embodiments.
  • FIG. 6 shows a schematic structural diagram of the computer equipment involved in the embodiment of the present application. Specifically:
  • the computer device may include components such as a processor 601 of one or more processing cores, a memory 602 of one or more computer-readable storage media, a power supply 603, and an input unit 604.
  • The processor 601 is the control center of the computer device; it uses various interfaces and lines to connect the various parts of the entire computer device and, by running or executing the software programs and/or modules stored in the memory 602 and calling the data stored in the memory 602, performs the various functions of the computer device and processes data, thereby monitoring the computer device as a whole.
  • The processor 601 may include one or more processing cores; the processor 601 may be a central processing unit (CPU), another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc.
  • CPU Central Processing Unit
  • DSP Digital Signal Processor
  • ASIC Application Specific Integrated Circuit
  • FPGA Field-Programmable Gate Array
  • the general-purpose processor can be a microprocessor or the processor can be any conventional processor, etc.
  • the processor 601 can integrate an application processor and a modem processor, where the application processor mainly handles the operating system, user interfaces, applications, etc.
  • the modem processor mainly handles wireless communications. It can be understood that the above modem processor may not be integrated into the processor 601.
  • the memory 602 can be used to store software programs and modules.
  • the processor 601 executes various functional applications and data processing by running the software programs and modules stored in the memory 602 .
  • the memory 602 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function (such as a sound playback function, an image playback function, etc.), and the like; the storage data area may store data created according to the use of the computer device, and the like.
  • The memory 602 may include high-speed random access memory and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. Accordingly, the memory 602 may also include a memory controller to provide the processor 601 with access to the memory 602.
  • the computer equipment also includes a power supply 603 that supplies power to various components.
  • the power supply 603 can be logically connected to the processor 601 through a power management system, so that functions such as charging, discharging, and power consumption management can be implemented through the power management system.
  • the power supply 603 may also include one or more DC or AC power supplies, recharging systems, power failure detection circuits, power converters or inverters, power status indicators, and other arbitrary components.
  • the computer device may also include an input unit 604 operable to receive input numeric or character information and generate keyboard, mouse, joystick, optical, or trackball signal inputs related to user settings and functional controls.
  • the computer device may also include a display unit and the like, which will not be described again here.
  • The processor 601 in the computer device loads the executable files corresponding to the processes of one or more application programs into the memory 602 according to the following instructions, and the processor 601 runs the application programs stored in the memory 602, thereby implementing various functions as follows:
  • Obtain the first audio feature encoding of the current audio frame and the text feature information of the historical audio frame, where the historical audio frame precedes the current audio frame; predict the text feature encoding of the current audio frame based on the text feature information of the historical audio frame; fuse the first audio feature encoding and the text feature encoding of the current audio frame to obtain a fusion feature vector; and perform speech emotion recognition based on the fusion feature vector to obtain the speech emotion recognition result of the current audio frame.
  • Embodiments of the present application provide a computer-readable storage medium, which may include a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, etc.
  • a computer program is stored thereon, and the computer program is loaded by the processor to execute the steps in any of the speech emotion recognition methods provided by the embodiments of the present application.
  • a computer program loaded by a processor may perform the following steps:
  • Obtain the first audio feature encoding of the current audio frame and the text feature information of the historical audio frame, where the historical audio frame precedes the current audio frame; predict the text feature encoding of the current audio frame based on the text feature information of the historical audio frame; fuse the first audio feature encoding and the text feature encoding of the current audio frame to obtain a fusion feature vector; and perform speech emotion recognition based on the fusion feature vector to obtain the speech emotion recognition result of the current audio frame.
  • each of the above units or structures can be implemented as an independent entity, or can be combined in any way and implemented as the same or several entities.
  • For details of each of the above units or structures, please refer to the previous method embodiments; they are not repeated here.

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • General Health & Medical Sciences (AREA)
  • Hospice & Palliative Care (AREA)
  • Psychiatry (AREA)
  • Child & Adolescent Psychology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present application provides a speech emotion recognition method and apparatus. The method comprises: obtaining a first audio feature encoding of a current audio frame and text feature information of a historical audio frame, wherein the historical audio frame precedes the current audio frame; predicting a text feature encoding of the current audio frame on the basis of the text feature information of the historical audio frame; fusing the first audio feature encoding and the text feature encoding of the current audio frame to obtain a fused feature vector; and performing speech emotion recognition on the basis of the fused feature vector to obtain a speech emotion recognition result of the current audio frame. The present application uses the text feature information of the historical audio frame to predict the text feature encoding of the current audio frame and, after fusing the first audio feature encoding and the text feature encoding of the current audio frame, performs speech emotion recognition; audio information and text information are thus deeply fused, which can improve the accuracy of speech emotion recognition.

Description

Speech emotion recognition method and apparatus

Technical field

This application mainly relates to the field of artificial intelligence technology, and specifically relates to a speech emotion recognition method and apparatus.

Background

In intelligent telephone customer service scenarios, emotion analysis of calls can provide business decision support. There are two mainstream solutions. The first models the acoustics of the customer service speech to capture speaking rate, intonation, auxiliary sounds, and changes in the spectral domain; emotion category labels are defined and the features are input into statistical or deep models for emotion label classification. The second mines information from the text transcribed by ASR to judge the speaker's emotion and provide a reference for customer service quality inspection. Both technical routes are based on a model framework in which acoustic features and text features are generated independently of each other. This approach has the following shortcoming: it does not exploit the coupling of acoustic features and text features in the feature space, resulting in low accuracy of speech emotion recognition.

That is to say, the accuracy of speech emotion recognition in the prior art is low.

Summary of the invention

This application provides a speech emotion recognition method and apparatus, aiming to solve the problem of low accuracy of speech emotion recognition in the prior art.
In a first aspect, this application provides a speech emotion recognition method. The speech emotion recognition method includes:

obtaining the first audio feature encoding of the current audio frame and the text feature information of the historical audio frame, where the historical audio frame precedes the current audio frame;

predicting the text feature encoding of the current audio frame based on the text feature information of the historical audio frame;

fusing the first audio feature encoding of the current audio frame and the text feature encoding to obtain a fused feature vector;

performing speech emotion recognition based on the fused feature vector to obtain the speech emotion recognition result of the current audio frame.

Optionally, performing speech emotion recognition based on the fused feature vector to obtain the speech emotion recognition result of the current audio frame includes:

performing feature extraction on the current audio frame based on the Teager energy operator to obtain a second audio feature;

encoding the second audio feature to obtain the second audio feature encoding of the current audio frame;

performing speech emotion recognition based on the second audio feature encoding and the fused feature vector to obtain the speech emotion recognition result of the current audio frame.

Optionally, performing speech emotion recognition based on the second audio feature encoding and the fused feature vector to obtain the speech emotion recognition result of the current audio frame includes:

adjusting the weight of the fused feature vector based on a preset attention layer to obtain a first target feature vector;

fusing the first target feature vector and the second audio feature encoding to obtain a second target feature vector;

inputting the second target feature vector into a target emotion recognition model to obtain the speech emotion recognition result of the current audio frame.

Optionally, adjusting the weight of the fused feature vector based on the preset attention layer to obtain the first target feature vector includes:

inputting the fused feature vector into a Softmax layer to obtain a predicted text probability distribution;

inputting the predicted text probability distribution and the fused feature vector into the preset attention layer to adjust the weight of the fused feature vector and obtain the first target feature vector.

Optionally, obtaining the first audio feature encoding of the current audio frame and the text feature information of the historical audio frame includes:

performing Fbank feature extraction on the current audio frame to obtain the first audio feature of the current audio frame;

encoding the first audio feature of the current audio frame to obtain the first audio feature encoding of the current audio frame.

Optionally, obtaining the first audio feature encoding of the current audio frame and the text feature information of the historical audio frame includes:

determining whether an audio frame exists before the historical audio frame;

if no audio frame exists before the historical audio frame, obtaining the first audio feature encoding of the historical audio frame and a preset text feature encoding;

determining the text feature information of the historical audio frame based on the first audio feature encoding of the historical audio frame and the preset text feature encoding.

Optionally, determining the text feature information of the historical audio frame based on the first audio feature encoding of the historical audio frame and the preset text feature encoding includes:

fusing the first audio feature encoding of the historical audio frame and the preset text feature encoding to obtain a historical fused feature vector;

inputting the historical fused feature vector into the Softmax layer to obtain the historical predicted text probability distribution of the historical audio frame;

determining the text feature information of the historical audio frame based on the historical predicted text probability distribution.
In a second aspect, this application provides a speech emotion recognition apparatus. The speech emotion recognition apparatus includes:

an acquisition unit, configured to acquire the first audio feature encoding of the current audio frame and the text feature information of the historical audio frame, where the historical audio frame precedes the current audio frame;

a prediction unit, configured to predict the text feature encoding of the current audio frame based on the text feature information of the historical audio frame;

a fusion unit, configured to fuse the first audio feature encoding of the current audio frame and the text feature encoding to obtain a fused feature vector;

a recognition unit, configured to perform speech emotion recognition based on the fused feature vector to obtain the speech emotion recognition result of the current audio frame.

Optionally, the recognition unit is configured to:

perform feature extraction on the current audio frame based on the Teager energy operator to obtain a second audio feature;

encode the second audio feature to obtain the second audio feature encoding of the current audio frame;

perform speech emotion recognition based on the second audio feature encoding and the fused feature vector to obtain the speech emotion recognition result of the current audio frame.

Optionally, the recognition unit is configured to:

adjust the weight of the fused feature vector based on the preset attention layer to obtain the first target feature vector;

fuse the first target feature vector and the second audio feature encoding to obtain the second target feature vector;

input the second target feature vector into the target emotion recognition model to obtain the speech emotion recognition result of the current audio frame.

Optionally, the recognition unit is configured to:

input the fused feature vector into the Softmax layer to obtain the predicted text probability distribution;

input the predicted text probability distribution and the fused feature vector into the preset attention layer to adjust the weight of the fused feature vector and obtain the first target feature vector.

Optionally, the acquisition unit is configured to:

perform Fbank feature extraction on the current audio frame to obtain the first audio feature of the current audio frame;

encode the first audio feature of the current audio frame to obtain the first audio feature encoding of the current audio frame.

Optionally, the acquisition unit is configured to:

determine whether an audio frame exists before the historical audio frame;

if no audio frame exists before the historical audio frame, obtain the first audio feature encoding of the historical audio frame and the preset text feature encoding;

determine the text feature information of the historical audio frame based on the first audio feature encoding of the historical audio frame and the preset text feature encoding.

Optionally, the acquisition unit is configured to:

fuse the first audio feature encoding of the historical audio frame and the preset text feature encoding to obtain a historical fused feature vector;

input the historical fused feature vector into the Softmax layer to obtain the historical predicted text probability distribution of the historical audio frame;

determine the text feature information of the historical audio frame based on the historical predicted text probability distribution.
In a third aspect, this application provides a computer device. The computer device includes:

one or more processors;

a memory; and

one or more application programs, where the one or more application programs are stored in the memory and configured to be executed by the processor to implement the speech emotion recognition method according to any one of the first aspects.

In a fourth aspect, this application provides a computer-readable storage medium storing a plurality of instructions, where the instructions are suitable for being loaded by a processor to execute the steps in the speech emotion recognition method according to any one of the first aspects.

This application provides a speech emotion recognition method and apparatus. The speech emotion recognition method includes: obtaining the first audio feature encoding of the current audio frame and the text feature information of the historical audio frame, where the historical audio frame precedes the current audio frame; predicting the text feature encoding of the current audio frame based on the text feature information of the historical audio frame; fusing the first audio feature encoding and the text feature encoding of the current audio frame to obtain a fused feature vector; and performing speech emotion recognition based on the fused feature vector to obtain the speech emotion recognition result of the current audio frame. This application first uses the text feature information of the historical audio frame to predict the text feature encoding of the current audio frame, then fuses the text feature encoding of the current audio frame with the first audio feature encoding before performing speech emotion recognition. Deeply fusing audio information and text information in this way can improve the accuracy of speech emotion recognition.
Brief description of the drawings

In order to more clearly illustrate the technical solutions in the embodiments of this application, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of this application; for those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

Figure 1 is a schematic scene diagram of the speech emotion recognition system provided by an embodiment of this application;

Figure 2 is a schematic flow diagram of an embodiment of the speech emotion recognition method provided in an embodiment of this application;

Figure 3 is a schematic module diagram of an embodiment of the speech emotion recognition method provided in an embodiment of this application;

Figure 4 is a schematic flow chart, in one embodiment of the speech emotion recognition method provided in an embodiment of this application, of performing speech emotion recognition based on the second audio feature encoding and the fused feature vector to obtain the speech emotion recognition result of the current audio frame;

Figure 5 is a schematic structural diagram of an embodiment of the speech emotion recognition apparatus provided in an embodiment of this application;

Figure 6 is a schematic structural diagram of an embodiment of the computer device provided in an embodiment of this application.
具体实施方式Detailed ways
下面将结合本申请实施例中的附图,对本申请实施例中的技术方案进行清楚、完整地描述,显然,所描述的实施例仅仅是本申请一部分实施例,而不是全部的实施例。基于本申请中的实施例,本领域技术人员在没有作出创造性劳动前提下所获得的所有其他实施例,都属于本申请保护的范围。The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present application. Obviously, the described embodiments are only some of the embodiments of the present application, rather than all of the embodiments. Based on the embodiments in this application, all other embodiments obtained by those skilled in the art without making creative efforts fall within the scope of protection of this application.
在本申请的描述中,需要理解的是,术语“中心”、“纵向”、“横向”、“长度”、“宽度”、“厚度”、“上”、“下”、“前”、“后”、“左”、“右”、“竖直”、“水平”、“顶”、“底”、“内”、“外”等指示的方位或位置关系为基于附图所示的方位或位置关系,仅是为了便于描述本申请和简化描述,而不是指示或暗示所指的装置或元件必须具有特定的方位、以特定的方位构造和操作,因此不能理解为对本申请的限制。此外,术语“第一”、“第二”仅用于描述目的,而不能理解为指示或暗示相对重要性或者隐含指明所指示的技术特征的数量。由此,限定有“第一”、“第二”的特征可以明示或者隐含地包括一个或者更多个特征。在本申请的描述中,“多个”的含义是两个或两个 以上,除非另有明确具体的限定。In the description of this application, it needs to be understood that the terms "center", "longitudinal", "transverse", "length", "width", "thickness", "upper", "lower", "front", " The directions or positional relationships indicated by "back", "left", "right", "vertical", "horizontal", "top", "bottom", "inside", "outside", etc. are based on the directions shown in the accompanying drawings or positional relationship is only for the convenience of describing the present application and simplifying the description, and does not indicate or imply that the device or element referred to must have a specific orientation, be constructed and operated in a specific orientation, and therefore cannot be understood as a limitation of the present application. In addition, the terms “first” and “second” are used for descriptive purposes only and cannot be understood as indicating or implying relative importance or implicitly indicating the quantity of indicated technical features. Therefore, the features defined as “first” and “second” may explicitly or implicitly include one or more features. In the description of this application, the meaning of "plurality" is two or two Above, unless otherwise expressly and specifically limited.
在本申请中,“示例性”一词用来表示“用作例子、例证或说明”。本申请中被描述为“示例性”的任何实施例不一定被解释为比其它实施例更优选或更具优势。为了使本领域任何技术人员能够实现和使用本申请,给出了以下描述。在以下描述中,为了解释的目的而列出了细节。应当明白的是,本领域普通技术人员可以认识到,在不使用这些特定细节的情况下也可以实现本申请。在其它实例中,不会对公知的结构和过程进行详细阐述,以避免不必要的细节使本申请的描述变得晦涩。因此,本申请并非旨在限于所示的实施例,而是与符合本申请所公开的原理和特征的最广范围相一致。In this application, the word "exemplary" is used to mean "serving as an example, illustration, or illustration." Any embodiment described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments. The following description is presented to enable any person skilled in the art to make and use the present application. In the following description, details are set forth for the purpose of explanation. It will be understood that one of ordinary skill in the art will recognize that the present application may be practiced without these specific details. In other instances, well-known structures and processes have not been described in detail to avoid obscuring the description of the application with unnecessary detail. Thus, this application is not intended to be limited to the embodiments shown but is to be accorded the widest scope consistent with the principles and features disclosed herein.
本申请实施例提供一种语音情绪识别方法及装置,以下分别进行详细说明。The embodiment of the present application provides a speech emotion recognition method and device, which will be described in detail below.
请参阅图1,图1为本申请实施例所提供的语音情绪识别系统的场景示意图,该语音情绪识别系统可以包括计算机设备100,计算机设备100中集成有语音情绪识别装置。Please refer to Figure 1. Figure 1 is a schematic diagram of a scene of a voice emotion recognition system provided by an embodiment of the present application. The voice emotion recognition system may include a computer device 100, and a voice emotion recognition device is integrated in the computer device 100.
本申请实施例中，该计算机设备100可以是独立的服务器，也可以是服务器组成的服务器网络或服务器集群，例如，本申请实施例中所描述的计算机设备100，其包括但不限于计算机、网络主机、单个网络服务器、多个网络服务器集或多个服务器构成的云服务器。其中，云服务器由基于云计算(Cloud Computing)的大量计算机或网络服务器构成。In the embodiments of this application, the computer device 100 may be an independent server, or a server network or server cluster composed of servers. For example, the computer device 100 described in the embodiments of this application includes, but is not limited to, a computer, a network host, a single network server, a set of multiple network servers, or a cloud server composed of multiple servers, where the cloud server consists of a large number of computers or network servers based on cloud computing (Cloud Computing).
本申请实施例中，上述的计算机设备100可以是一个通用计算机设备或者是一个专用计算机设备。在具体实现中计算机设备100可以是台式机、便携式电脑、网络服务器、掌上电脑(Personal Digital Assistant,PDA)、移动手机、平板电脑、无线终端设备、通信设备、嵌入式设备等，本实施例不限定计算机设备100的类型。In the embodiments of this application, the above-mentioned computer device 100 may be a general-purpose computer device or a special-purpose computer device. In a specific implementation, the computer device 100 may be a desktop computer, a portable computer, a network server, a personal digital assistant (PDA), a mobile phone, a tablet computer, a wireless terminal device, a communication device, an embedded device, etc.; this embodiment does not limit the type of the computer device 100.
本领域技术人员可以理解，图1中示出的应用环境，仅仅是本申请方案的一种应用场景，并不构成对本申请方案应用场景的限定，其他的应用环境还可以包括比图1中所示更多或更少的计算机设备，例如图1中仅示出1个计算机设备，可以理解的，该语音情绪识别系统还可以包括一个或多个可处理数据的其他计算机设备，具体此处不作限定。Those skilled in the art can understand that the application environment shown in Figure 1 is only one application scenario of the solution of this application and does not constitute a limitation on its application scenarios. Other application environments may also include more or fewer computer devices than shown in Figure 1. For example, only one computer device is shown in Figure 1; it can be understood that the speech emotion recognition system may also include one or more other computer devices capable of processing data, which is not specifically limited here.
另外,如图1所示,该语音情绪识别系统还可以包括存储器200,用于存储数据。In addition, as shown in Figure 1, the voice emotion recognition system may also include a memory 200 for storing data.
需要说明的是，图1所示的语音情绪识别系统的场景示意图仅仅是一个示例，本申请实施例描述的语音情绪识别系统以及场景是为了更加清楚的说明本申请实施例的技术方案，并不构成对于本申请实施例提供的技术方案的限定，本领域普通技术人员可知，随着语音情绪识别系统的演变和新业务场景的出现，本申请实施例提供的技术方案对于类似的技术问题，同样适用。It should be noted that the scene diagram of the speech emotion recognition system shown in Figure 1 is only an example. The speech emotion recognition system and the scene described in the embodiments of this application are intended to explain the technical solutions of the embodiments more clearly and do not constitute a limitation on them. Those of ordinary skill in the art will appreciate that, with the evolution of speech emotion recognition systems and the emergence of new business scenarios, the technical solutions provided by the embodiments of this application are equally applicable to similar technical problems.
首先，本申请实施例中提供一种语音情绪识别方法，该语音情绪识别方法包括：获取当前音频帧的第一音频特征编码和历史音频帧的文本特征信息，其中，历史音频帧在当前音频帧之前；基于历史音频帧的文本特征信息预测当前音频帧的文本特征编码；融合当前音频帧的第一音频特征编码和文本特征编码，得到融合特征向量；基于融合特征向量进行语音情绪识别，得到当前音频帧的语音情绪识别结果。First, an embodiment of this application provides a speech emotion recognition method. The speech emotion recognition method includes: obtaining the first audio feature encoding of the current audio frame and the text feature information of a historical audio frame, where the historical audio frame precedes the current audio frame; predicting the text feature encoding of the current audio frame based on the text feature information of the historical audio frame; fusing the first audio feature encoding and the text feature encoding of the current audio frame to obtain a fusion feature vector; and performing speech emotion recognition based on the fusion feature vector to obtain the speech emotion recognition result of the current audio frame.
如图2和图3所示,图2是本申请实施例中提供的语音情绪识别方法的一个实施例流程示意图,图3是本申请实施例中提供的语音情绪识别方法的一个实施例模块示意图,该语音情绪识别方法包括如下步骤S201~S204:As shown in Figures 2 and 3, Figure 2 is a schematic flow diagram of an embodiment of the speech emotion recognition method provided in the embodiment of the present application, and Figure 3 is a module schematic diagram of an embodiment of the speech emotion recognition method provided in the embodiment of the present application. , the speech emotion recognition method includes the following steps S201 to S204:
S201、获取当前音频帧的第一音频特征编码和历史音频帧的文本特征信息。S201. Obtain the first audio feature code of the current audio frame and the text feature information of the historical audio frame.
其中，历史音频帧在当前音频帧之前。具体的，历史音频帧为当前音频帧的前一个音频帧。其中，当前音频帧和历史音频帧的长度相同，例如都为10ms-30ms，根据具体设定即可。The historical audio frame precedes the current audio frame; specifically, the historical audio frame is the audio frame immediately preceding the current audio frame. The current audio frame and the historical audio frame have the same length, for example both 10ms-30ms, which can be set as required.
在一个具体的实施例中,获取待识别音频;对待识别音频分帧,得到多个音频帧,从多个音频帧中获取当前音频帧。我们需要将不定长的音频切分成固定长度的小段,这一步称为分帧。一般取10-30ms为一帧,为了避免窗边界对信号的遗漏,因此对帧做偏移时候,要有帧迭(帧与帧之间需要重叠一部分)。一般取帧长的一半作为帧移,也就是每次位移一帧的二分之一后再取下一帧,这样可以避免帧与帧之间的特性变化太大。通常的选择是25ms每帧,帧迭为10ms。要分帧是因为语音信号是快速变化的,而傅里叶变换适用于分析平稳的信号。在语音识别中,一般把帧长取为10~30ms,这样一帧内既有足够多的周期,又不会变化太剧烈。每帧信号通常要与一个平滑的窗函数相乘,让帧两端平滑地衰减到零,这样可以降低傅里叶变换后旁瓣的强度,取得更高质量的频谱。帧和帧之间的时间差常常取为10ms,这样帧与帧之间会有重叠,否则,由于帧与帧连接处的信号会因为加窗而被弱化,这部分的信息就丢失了。傅里叶变换是逐帧进行的,为的是取得每一帧的频谱。一般只保留幅度谱,丢弃相位谱。In a specific embodiment, the audio to be recognized is obtained; the audio to be recognized is divided into frames to obtain multiple audio frames, and the current audio frame is obtained from the multiple audio frames. We need to cut the audio of variable length into small segments of fixed length. This step is called framing. Generally, 10-30ms is taken as one frame. In order to avoid the omission of signals at the window boundary, when offsetting the frame, there must be frame overlap (a portion of overlap is required between frames). Generally, half of the frame length is used as the frame shift, that is, each time the frame is shifted by one-half of a frame and then the next frame is taken, this can avoid the characteristics from frame to frame changing too much. The usual choice is 25ms per frame and 10ms for frame iteration. Framing is necessary because the speech signal changes rapidly, and the Fourier transform is suitable for analyzing stationary signals. In speech recognition, the frame length is generally set to 10 to 30ms, so that there are enough cycles in one frame without changing too drastically. Each frame signal is usually multiplied by a smooth window function to allow both ends of the frame to smoothly attenuate to zero. This can reduce the intensity of the side lobes after Fourier transform and obtain a higher quality spectrum. The time difference between frames is often taken as 10ms, so that there will be overlap between frames. Otherwise, because the signal at the connection between frames will be weakened due to windowing, this part of the information will be lost. The Fourier transform is performed frame by frame in order to obtain the spectrum of each frame. Generally, only the amplitude spectrum is retained and the phase spectrum is discarded.
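The framing and windowing procedure described above can be illustrated with a minimal Python sketch. It assumes the common choices mentioned in the text (25 ms frames, 10 ms shift, a smooth window); the function name, the Hamming window and the 16 kHz sampling rate are illustrative assumptions rather than requirements of this application.

import numpy as np

def frame_signal(signal, sample_rate=16000, frame_ms=25, shift_ms=10):
    """Split a waveform into overlapping frames and apply a smooth window
    (assumes len(signal) >= one frame)."""
    frame_len = int(sample_rate * frame_ms / 1000)    # 25 ms -> 400 samples at 16 kHz
    frame_shift = int(sample_rate * shift_ms / 1000)  # 10 ms -> 160 samples at 16 kHz
    num_frames = 1 + (len(signal) - frame_len) // frame_shift
    window = np.hamming(frame_len)                    # smooth taper towards zero at both ends
    frames = np.stack([
        signal[i * frame_shift: i * frame_shift + frame_len] * window
        for i in range(num_frames)
    ])
    return frames  # shape: (num_frames, frame_len)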
本申请实施例中,获取当前音频帧的第一音频特征编码和历史音频帧的文本特征信息,可以包括:In this embodiment of the present application, obtaining the first audio feature code of the current audio frame and the text feature information of the historical audio frame may include:
(1)对当前音频帧进行Fbank特征提取,得到当前音频帧的第一音频特征。 (1) Perform Fbank feature extraction on the current audio frame to obtain the first audio feature of the current audio frame.
Fbank即FilterBank,人耳对声音频谱的响应是非线性的,Fbank就是一种前端处理算法,以类似于人耳的方式对音频进行处理,可以提高语音识别的性能。获得语音信号的fbank特征的一般步骤是:预加重、分帧、加窗、短时傅里叶变换(STFT)、mel滤波、去均值等。Fbank is FilterBank. The response of the human ear to the sound spectrum is nonlinear. Fbank is a front-end processing algorithm that processes audio in a manner similar to the human ear, which can improve the performance of speech recognition. The general steps to obtain the fbank characteristics of the speech signal are: pre-emphasis, framing, windowing, short-time Fourier transform (STFT), mel filtering, demeaning, etc.
在另一个具体的实施例中,对当前音频帧分别进行Fbank特征提取和基频特征提取,得到Fbank特征和Pitch特征,融合Fbank特征和Pitch特征,得到当前音频帧的第一音频特征x1ti。基音周期(Pitch)是声带振动频率的倒数。它指的是人发出浊音时,气流通过声道促使声带振动的周期。声带震动的周期即为基音周期。基音周期的估计称为基音检测(PitchDetection)。基频包含了大量表征语音情感的特征,在语音情感识别中至关重要。常用的基频特征提取方法有:自相关函数法(ACF)、时域平均幅度差法(AMFD)以及小波法-频域。In another specific embodiment, Fbank feature extraction and fundamental frequency feature extraction are performed on the current audio frame respectively to obtain Fbank features and Pitch features, and the Fbank features and Pitch features are fused to obtain the first audio feature x1 ti of the current audio frame. The pitch period (Pitch) is the reciprocal of the vibration frequency of the vocal cords. It refers to the period in which airflow passes through the vocal tract to cause the vocal cords to vibrate when a person makes a voiced sound. The period in which the vocal cords vibrate is the pitch period. The estimation of the pitch period is called pitch detection (PitchDetection). Fundamental frequency contains a large number of features that characterize speech emotion and is crucial in speech emotion recognition. Commonly used fundamental frequency feature extraction methods include: autocorrelation function method (ACF), time domain average amplitude difference method (AMFD) and wavelet method-frequency domain.
(2)对当前音频帧的第一音频特征x1ti进行编码,得到当前音频帧的第一音频特征编码 (2) Encode the first audio feature x1 ti of the current audio frame to obtain the first audio feature encoding of the current audio frame
在一个具体的实施例中，将当前音频帧的第一音频特征x1ti输入第一声学模型Encoder-1进行编码，得到当前音频帧的第一音频特征编码。其中，第一声学模型Encoder-1可以为隐马尔科夫模型(HMM)、深度神经网络(DNN)、卷积神经网络(CNN)以及循环神经网络(RNN)等等。优选地，第一声学模型Encoder-1可以为BiLSTM模型，第一声学模型Encoder-1的输出层采用ctc网络构建方式，在训练阶段采用ctc网络对每一帧语音特征与标签进行对齐。第一声学模型Encoder-1的输入为Fbank特征和Pitch特征融合得到的第一音频特征x1ti，即fbank+pitch。In a specific embodiment, the first audio feature x1ti of the current audio frame is input into the first acoustic model Encoder-1 for encoding to obtain the first audio feature encoding of the current audio frame. The first acoustic model Encoder-1 may be a hidden Markov model (HMM), a deep neural network (DNN), a convolutional neural network (CNN), a recurrent neural network (RNN), etc. Preferably, the first acoustic model Encoder-1 may be a BiLSTM model whose output layer is built with a ctc network; in the training stage the ctc network is used to align the speech features of each frame with the labels. The input of the first acoustic model Encoder-1 is the first audio feature x1ti obtained by fusing the Fbank feature and the Pitch feature, i.e. fbank+pitch (see the sketch below).
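As a hedged illustration of steps (1) and (2) above, the sketch below computes frame-level Fbank (log-mel) features and a pitch track and concatenates them into the first audio feature x1ti. The use of librosa, the 40 mel bands and the 80-400 Hz pitch search range are assumptions made only for this example; the resulting frame-level vectors would then be fed to the first acoustic model Encoder-1.

import numpy as np
import librosa  # assumed here only as a convenient reference implementation

def fbank_pitch_features(wav, sr=16000, n_mels=40):
    """Frame-level Fbank (log-mel) features concatenated with a pitch (F0) track."""
    hop = int(0.010 * sr)   # 10 ms frame shift
    win = int(0.025 * sr)   # 25 ms frame length
    mel = librosa.feature.melspectrogram(y=wav, sr=sr, n_fft=win, hop_length=hop,
                                         win_length=win, n_mels=n_mels)
    fbank = np.log(mel + 1e-6).T                         # (T, n_mels)
    f0 = librosa.yin(wav, fmin=80, fmax=400, sr=sr,
                     frame_length=win, hop_length=hop)   # (T,) fundamental frequency
    T = min(len(fbank), len(f0))
    return np.concatenate([fbank[:T], f0[:T, None]], axis=1)  # x1_ti for each frame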
在一个具体的实施例中,历史音频帧的文本特征信息可以是人工标注的。In a specific embodiment, the text feature information of historical audio frames may be manually annotated.
在另一个具体的实施例中,获取当前音频帧的第一音频特征编码和历史音频帧的文本特征信息,可以包括:In another specific embodiment, obtaining the first audio feature code of the current audio frame and the text feature information of the historical audio frame may include:
(1)判断历史音频帧之前是否存在音频帧。(1) Determine whether there is an audio frame before the historical audio frame.
(2)若历史音频帧之前不存在音频帧,获取历史音频帧的第一音频特征编码和预设文本特征编码。(2) If there is no audio frame before the historical audio frame, obtain the first audio feature code and the preset text feature code of the historical audio frame.
若历史音频帧之前不存在音频帧,说明历史音频帧是第一帧,没有可用文本特征信息,则获取预设文本特征编码。预设文本特征编码为人工预先设定的默认文本特征编码。If there is no audio frame before the historical audio frame, it means that the historical audio frame is the first frame and there is no text feature information available, and the preset text feature encoding is obtained. The default text feature encoding is a manually preset default text feature encoding.
(3)基于历史音频帧的第一音频特征编码和预设文本特征编码确定历史音频帧的文本特征信息。(3) Determine the text feature information of the historical audio frame based on the first audio feature coding and the preset text feature coding of the historical audio frame.
在一个具体的实施例中,基于历史音频帧的第一音频特征编码和预设文本特征编码确定历史音频帧的文本特征信息,包括:融合历史音频帧的第一音频特征编码和预设文本特征编码,得到历史融合特征向量;将历史融合特征向量输入Softmax层,得到历史音频帧的历史预测文本概率分布;基于历史预测文本概率分布确定历史音频帧的文本特征信息。具体的,将历史预测文本概率分布概率最高的文本确定为历史音频帧的文本特征信息yui-1In a specific embodiment, determining the text feature information of the historical audio frame based on the first audio feature coding and the preset text feature coding of the historical audio frame includes: fusing the first audio feature coding and the preset text feature of the historical audio frame Encoding to obtain the historical fusion feature vector; input the historical fusion feature vector into the Softmax layer to obtain the historical prediction text probability distribution of the historical audio frame; determine the text feature information of the historical audio frame based on the historical prediction text probability distribution. Specifically, the text with the highest probability of the historical predicted text probability distribution is determined as the text feature information y ui-1 of the historical audio frame.
在另一个具体的实施例中,使用语音转文字软件将历史音频帧的第一音频特征编码进行语音转文字,得到历史音频帧的文本特征信息yui-1In another specific embodiment, speech-to-text software is used to encode the first audio feature of the historical audio frame and convert it into text to obtain the text feature information y ui-1 of the historical audio frame.
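A small sketch of the first embodiment above: for the first frame (when no earlier audio frame exists), the historical frame's first audio feature encoding is fused with a preset text feature encoding, a Softmax gives the historical predicted text probability distribution, and the most probable entry is kept as the text feature information yui-1. The joint and vocab_softmax modules below are stand-ins assumed only for illustration.

import torch.nn.functional as F

def history_text_token(history_audio_enc, preset_text_enc, joint, vocab_softmax):
    """Determine y_{ui-1} when the historical frame has no preceding audio frame."""
    fused = joint(history_audio_enc, preset_text_enc)   # historical fusion feature vector
    probs = F.softmax(vocab_softmax(fused), dim=-1)     # historical predicted text distribution
    return probs.argmax(dim=-1)                         # text feature information y_{ui-1}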
S202、基于历史音频帧的文本特征信息预测当前音频帧的文本特征编码。S202. Predict the text feature coding of the current audio frame based on the text feature information of the historical audio frame.
在一个具体的实施例中,将历史音频帧的文本特征信息yui-1输入预设语言模型,得到当前音频帧的文本特征编码pui。其中,预设语言模型可以为BERT模型、LSTM模型,xlnet,GPT等。LSTM的全称是Long Short-Term Memory,它是RNN(Recurrent Neural Network)的一种。LSTM由于其设计的特点,非常适合用于对时序数据的建模,如文本数据。BiLSTM是Bi-directional Long Short-Term Memory的缩写,是由前向LSTM与后向LSTM组合而成。两者在自然语言处理任务中都常被用来建模上下文信息。In a specific embodiment, the text feature information y ui-1 of the historical audio frame is input into the preset language model to obtain the text feature code p ui of the current audio frame. Among them, the preset language model can be BERT model, LSTM model, xlnet, GPT, etc. The full name of LSTM is Long Short-Term Memory, which is a type of RNN (Recurrent Neural Network). Due to its design characteristics, LSTM is very suitable for modeling time series data, such as text data. BiLSTM is the abbreviation of Bi-directional Long Short-Term Memory, which is a combination of forward LSTM and backward LSTM. Both are often used to model contextual information in natural language processing tasks.
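One possible reading of step S202 is an RNN-T style prediction network: the previous frame's text feature information yui-1 is embedded and passed through an LSTM to give the text feature encoding pui of the current frame. The sketch below is only one variant; as the text notes, BERT, XLNet or GPT style language models could play the same role, and the dimensions are illustrative.

import torch
import torch.nn as nn

class TextPredictor(nn.Module):
    """Predicts the text feature encoding p_ui from the previous text token y_{ui-1}."""
    def __init__(self, vocab_size, emb_dim=256, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True)

    def forward(self, y_prev, state=None):
        # y_prev: (batch, 1) index of the historical frame's text feature information
        emb = self.embed(y_prev)
        p_ui, state = self.lstm(emb, state)   # p_ui: (batch, 1, hidden_dim)
        return p_ui, state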
S203、融合当前音频帧的第一音频特征编码和文本特征编码,得到融合特征向量。S203. Fusion of the first audio feature coding and text feature coding of the current audio frame to obtain a fusion feature vector.
在一个具体的实施例中，将当前音频帧的第一音频特征编码和文本特征编码pui输入预设共享网络模型Joint net，得到融合特征向量hti。其中，预设共享网络模型Joint net为Transformer网络。预设共享网络模型Joint net也可以为BiLSTM模型。In a specific embodiment, the first audio feature encoding and the text feature encoding pui of the current audio frame are input into the preset shared network model Joint net to obtain the fusion feature vector hti. The preset shared network model Joint net is a Transformer network; it may also be a BiLSTM model.
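The fusion of step S203 can be sketched as a small joint network that projects the first audio feature encoding and the text feature encoding pui into a shared space and combines them into hti. The additive feed-forward form below is an assumption made for illustration; the text itself names Transformer or BiLSTM networks for the preset shared network model Joint net.

import torch
import torch.nn as nn

class JointNet(nn.Module):
    """Fuses the audio feature encoding and text feature encoding into h_ti."""
    def __init__(self, audio_dim, text_dim, joint_dim=512):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, joint_dim)
        self.text_proj = nn.Linear(text_dim, joint_dim)
        self.act = nn.Tanh()

    def forward(self, audio_enc, text_enc):
        # audio_enc: (batch, audio_dim), text_enc: (batch, text_dim)
        return self.act(self.audio_proj(audio_enc) + self.text_proj(text_enc))  # h_ti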
S204、基于融合特征向量进行语音情绪识别,得到当前音频帧的语音情绪识别结果。S204. Perform speech emotion recognition based on the fusion feature vector to obtain the speech emotion recognition result of the current audio frame.
在一个具体的实施例中,将融合特征向量hti输入目标情绪识别模型Mood classfier进行分类,得到分类结果,将分类结果确定为当前音频帧的语音情绪识别结果。其中,目标情绪分类模型为通过情绪分类训练集对预设分类神经网络模型训练得到的,情绪分类训练集包括多个情绪分类训练样本,情绪分类训练样本包括情绪样本特征和对应的样本标签。预设分类神经网络模型可以为DNN。样本标签可以包括开心(happiness),难过(sadness),生气(anger),恶心(disgust),害怕(fear),惊讶(surprise)等多个类别。In a specific embodiment, the fusion feature vector h ti is input into the target emotion recognition model Mood classfier for classification, and the classification result is obtained, and the classification result is determined as the speech emotion recognition result of the current audio frame. Among them, the target emotion classification model is obtained by training a preset classification neural network model through an emotion classification training set. The emotion classification training set includes multiple emotion classification training samples. The emotion classification training samples include emotion sample characteristics and corresponding sample labels. The default classification neural network model can be DNN. Sample labels can include multiple categories such as happiness, sadness, anger, disgust, fear, surprise, etc.
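The classification step can be pictured as a small DNN head over the fusion feature vector hti, as sketched below. The six emotion labels follow the examples given in the text; everything else (names, layer sizes) is assumed for illustration only.

import torch
import torch.nn as nn

EMOTIONS = ["happiness", "sadness", "anger", "disgust", "fear", "surprise"]

class MoodClassifierHead(nn.Module):
    """Maps the fusion feature vector h_ti to emotion logits."""
    def __init__(self, in_dim=512, hidden_dim=256, num_classes=len(EMOTIONS)):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_classes),
        )

    def forward(self, h_ti):
        return self.net(h_ti)  # EMOTIONS[logits.argmax(-1)] is the recognition result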
为了提高情绪分类准确度,在另一个具体的实施例中,基于第二音频特征编码和融合特征向量进行语音情绪识别,得到当前音频帧的语音情绪识别结果,可以包括: In order to improve the accuracy of emotion classification, in another specific embodiment, speech emotion recognition is performed based on the second audio feature encoding and the fusion feature vector to obtain the speech emotion recognition result of the current audio frame, which may include:
(1)基于Teager能量算子对当前音频帧进行特征提取,得到第二音频特征。(1) Extract features from the current audio frame based on the Teager energy operator to obtain the second audio feature.
Teager能量算子是一种非线性算子，能够跟踪信号的瞬时能量。Teager能量算子满足公式(1)（实现示例见下文步骤(3)之后）：The Teager energy operator is a nonlinear operator that can track the instantaneous energy of a signal. The Teager energy operator satisfies formula (1) (an implementation sketch is given below, after step (3)):
ψ[x(n)] = x²(n) - x(n+1)x(n-1)    (1)
其中,x(n)为当前音频帧的信号,Ψ为Teager能量算子。Among them, x(n) is the signal of the current audio frame, and Ψ is the Teager energy operator.
基于Teager能量算子对当前音频帧进行特征提取,得到第二音频特征x2ti。第二音频特征x2ti比第一音频特征x1ti更高阶,特征更丰富,引入高阶特征能够提高对语音情感特征向量表征能力,提高情绪分类准确度。Feature extraction is performed on the current audio frame based on the Teager energy operator to obtain the second audio feature x2 ti . The second audio feature x2 ti is higher-order and has richer features than the first audio feature x1 ti . The introduction of high-order features can improve the ability to represent speech emotion feature vectors and improve the accuracy of emotion classification.
(2)对第二音频特征进行编码,得到当前音频帧的第二音频特征编码。(2) Encode the second audio feature to obtain the second audio feature code of the current audio frame.
具体的,将第二音频特征x2ti输入第二声学模型Encoder-2进行编码,得到当前音频帧的第二音频特征编码。其中,第二声学模型Encoder-2可以为隐马尔科夫模型(HMM)、深度神经网络(DNN)、卷积神经网络(CNN)以及循环神经网络(RNN)等等。优选地,第二声学模型Encoder-2可以为BiLSTM模型。Specifically, the second audio feature x2 ti is input to the second acoustic model Encoder-2 for encoding to obtain the second audio feature encoding of the current audio frame. Among them, the second acoustic model Encoder-2 can be a hidden Markov model (HMM), a deep neural network (DNN), a convolutional neural network (CNN), a recurrent neural network (RNN), etc. Preferably, the second acoustic model Encoder-2 can be a BiLSTM model.
(3)基于第二音频特征编码和融合特征向量进行语音情绪识别,得到当前音频帧的语音情绪识别结果。(3) Perform speech emotion recognition based on the second audio feature encoding and fusion feature vector to obtain the speech emotion recognition result of the current audio frame.
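An implementation sketch of the Teager energy operator of formula (1), applied per frame to obtain the higher-order second audio feature x2ti before it is encoded by Encoder-2. The function name and the use of NumPy are assumptions of this example.

import numpy as np

def teager_energy(frame):
    """Teager energy operator, formula (1): psi[x(n)] = x(n)^2 - x(n+1) * x(n-1),
    evaluated for the interior samples 1 <= n <= N-2 of one audio frame."""
    x = np.asarray(frame, dtype=np.float64)
    return x[1:-1] ** 2 - x[2:] * x[:-2]   # instantaneous-energy profile of the frame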
参阅图4,图4是本申请实施例中提供的语音情绪识别方法一个实施例中基于第二音频特征编码和融合特征向量进行语音情绪识别,得到当前音频帧的语音情绪识别结果的流程示意图。在一个具体的实施例中,基于第二音频特征编码和融合特征向量进行语音情绪识别,得到当前音频帧的语音情绪识别结果,可以包括S301-S303:Referring to Figure 4, Figure 4 is a schematic flow chart of performing voice emotion recognition based on the second audio feature encoding and fusion feature vector to obtain the voice emotion recognition result of the current audio frame in one embodiment of the voice emotion recognition method provided in the embodiment of the present application. In a specific embodiment, performing speech emotion recognition based on the second audio feature encoding and fusion feature vector to obtain the speech emotion recognition result of the current audio frame may include S301-S303:
S301、基于预设注意力层对融合特征向量进行权重调整,得到第一目标特征向量。S301. Adjust the weight of the fusion feature vector based on the preset attention layer to obtain the first target feature vector.
预设注意力层为基于Attention机制的模型。Attention的通用定义如下:给定一组向量集合Value,以及一个向量Query,Attention机制是一种根据该Query计算Value的加权求和的机制。Attention机制的具体计算过程可以归纳为两个过程:第一个过程是根据Query和Key计算权重系数,第二个过程根据权重系数对Value进行加权求和。而第一个过程又可以细分为两个阶段:第一个阶段根据Query和Key计算两者的相似性或者相关性;第二个阶段对第一阶段的原始分值进行归一化处理。 The default attention layer is a model based on the Attention mechanism. The general definition of Attention is as follows: Given a set of vector sets Value and a vector Query, the Attention mechanism is a mechanism that calculates the weighted sum of Value based on the Query. The specific calculation process of the Attention mechanism can be summarized into two processes: the first process is to calculate the weight coefficient based on Query and Key, and the second process is to perform a weighted sum of Value based on the weight coefficient. The first process can be subdivided into two stages: the first stage calculates the similarity or correlation between the two based on Query and Key; the second stage normalizes the original scores of the first stage.
其中,预设注意力层可以为自注意力层。The preset attention layer may be a self-attention layer.
在一个具体的实施例中,基于预设注意力层对融合特征向量进行权重调整,得到第一目标特征向量,可以包括:根据融合特征向量得到向量Query、向量Key、向量Value,三个向量;将向量Query、向量Key、向量Value输入预设注意力层Attention,对融合特征向量进行权重调整,得到第一目标特征向量。In a specific embodiment, adjusting the weight of the fused feature vector based on the preset attention layer to obtain the first target feature vector may include: obtaining three vectors: vector Query, vector Key, and vector Value based on the fused feature vector; Input the vector Query, vector Key, and vector Value into the preset attention layer Attention, and adjust the weight of the fused feature vector to obtain the first target feature vector.
在另一个具体的实施例中,基于预设注意力层对融合特征向量进行权重调整,得到第一目标特征向量,可以包括:In another specific embodiment, adjusting the weight of the fusion feature vector based on the preset attention layer to obtain the first target feature vector may include:
(1)将融合特征向量输入Softmax层,得到预测文本概率分布。(1) Input the fused feature vector into the Softmax layer to obtain the predicted text probability distribution.
具体的,将融合特征向量hti输入Softmax层,得到预测文本概率分布P(yui|yu-1…y0,X)。Specifically, the fusion feature vector h ti is input into the Softmax layer to obtain the predicted text probability distribution P(y ui |y u-1 ...y 0 , X).
进一步的，根据预测文本概率分布P(yui|yu-1…y0,X)确定当前音频帧的文本特征信息。存储当前音频帧的文本特征信息，在预测当前音频帧的下一帧时，根据当前音频帧的下一帧的第一音频特征编码和当前音频帧的文本特征信息，预测当前音频帧的下一帧的语音情绪识别结果。Further, the text feature information of the current audio frame is determined according to the predicted text probability distribution P(yui|yu-1…y0,X). The text feature information of the current audio frame is stored; when predicting the next frame of the current audio frame, the speech emotion recognition result of that next frame is predicted based on the first audio feature encoding of the next frame and the text feature information of the current audio frame.
(2)将预测文本概率分布和融合特征向量输入预设注意力层对融合特征向量进行权重调整,得到第一目标特征向量。(2) Input the predicted text probability distribution and fusion feature vector into the preset attention layer to adjust the weight of the fusion feature vector to obtain the first target feature vector.
具体的，将预测文本概率分布P(yui|yu-1…y0,X)作为向量Query，将融合特征向量hti作为向量Value输入预设注意力层，对融合特征向量hti进行权重调整，得到第一目标特征向量cti。Specifically, the predicted text probability distribution P(yui|yu-1…y0,X) is used as the vector Query and the fusion feature vector hti is used as the vector Value; they are input into the preset attention layer, the weight of the fusion feature vector hti is adjusted, and the first target feature vector cti is obtained.
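A hedged sketch of the attention step just described: the predicted text probability distribution (assumed here to have been projected to the same dimension as hti) acts as the Query, and the fusion feature vectors hti act as the Value, with Key taken equal to Value, which is one possible reading of the preset attention layer. The result is the re-weighted first target feature vector cti.

import torch
import torch.nn.functional as F

def attention_reweight(query, value):
    """Scaled dot-product attention re-weighting of the fusion feature vectors.
    query: (batch, 1, d) projected predicted-text distribution
    value: (batch, T, d) fusion feature vectors h_ti (Key = Value here)."""
    key = value
    scores = torch.matmul(query, key.transpose(-2, -1)) / key.size(-1) ** 0.5
    weights = F.softmax(scores, dim=-1)     # attention weights over the frames
    c_ti = torch.matmul(weights, value)     # first target feature vector c_ti
    return c_ti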
S302、融合第一目标特征向量和第二音频特征编码,得到第二目标特征向量。S302. Fusion of the first target feature vector and the second audio feature encoding to obtain a second target feature vector.
具体的，将第一目标特征向量cti和第二音频特征编码输入预设共享网络模型Joint net，得到第二目标特征向量。其中，预设共享网络模型为Transformer网络。预设共享网络模型也可以为BiLSTM模型。Specifically, the first target feature vector cti and the second audio feature encoding are input into the preset shared network model Joint net to obtain the second target feature vector. The preset shared network model is a Transformer network; it may also be a BiLSTM model.
S303、将第二目标特征向量输入目标情绪识别模型,得到当前音频帧的语音情绪识别结果。S303. Input the second target feature vector into the target emotion recognition model to obtain the speech emotion recognition result of the current audio frame.
进一步的,本申请的模型训练分为4个阶段:Furthermore, the model training of this application is divided into 4 stages:
a.第一阶段：对第一声学模型Encoder-1、第二声学模型Encoder-2、预设语言模型进行预训练与迁移学习。基于通用bert的预训练语言模型和电话客服领域海量文本数据进行训练，得到训练的语言模型，对训练的语言模型进行finetune，预训练语言模型。a. The first stage: pre-training and transfer learning for the first acoustic model Encoder-1, the second acoustic model Encoder-2 and the preset language model. Training is performed based on a general BERT pre-trained language model and massive text data from the telephone customer-service domain to obtain a trained language model, which is then fine-tuned to give the pre-trained language model.
b.第二阶段：分别对第一声学模型Encoder-1、第二声学模型Encoder-2、预训练语言模型以及预设共享网络模型训练。其中，第一声学模型Encoder-1和第二声学模型Encoder-2首先通过ctc网络进行端到端对齐，整个训练包括两个阶段：预训练阶段和finetune阶段，预训练只涉及第一声学模型Encoder-1、第二声学模型Encoder-2的训练以及语言模型的训练。b. The second stage: the first acoustic model Encoder-1, the second acoustic model Encoder-2, the pre-trained language model and the preset shared network model are trained separately. The first acoustic model Encoder-1 and the second acoustic model Encoder-2 are first aligned end-to-end through the ctc network (a CTC pre-training sketch is given after this list). The whole training includes two stages, a pre-training stage and a fine-tuning stage; pre-training only involves the training of the first acoustic model Encoder-1, the second acoustic model Encoder-2 and the language model.
c.第三阶段学习:对目标情绪识别模型预训练,通过第二声学模型Encoder-2提取特征,对目标情绪识别模型预训练,得到预训练语音情绪分类模型。c. The third stage of learning: pre-train the target emotion recognition model, extract features through the second acoustic model Encoder-2, pre-train the target emotion recognition model, and obtain a pre-trained speech emotion classification model.
d.第四阶段学习：在预训练基础上，基于语音对应文本（带情感标签）的finetune，实现语音与语言数据的联合训练，得到最终的目标情绪分类模型。情绪识别任务的训练给予语言模型和声学模型联合训练以后，与asr任务并行训练，这样做的理由是充分利用ASR行业的公开数据，目的是提高声学模型的表示能力。d. The fourth stage of learning: on the basis of the pre-training, fine-tuning on the text corresponding to the speech (with emotion labels) is used to achieve joint training of speech and language data, giving the final target emotion classification model. After the joint training of the language model and the acoustic model, the emotion recognition task is trained in parallel with the ASR task; the reason for this is to make full use of the public data of the ASR industry, with the aim of improving the representation ability of the acoustic model.
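A minimal sketch of the CTC alignment used in training stage b, where each frame's encoder output is aligned with the label sequence by the CTC loss during pre-training of Encoder-1 and Encoder-2. The encoder is assumed to be an LSTM-style module returning (outputs, state); all names and shapes are illustrative, not the application's actual training code.

import torch
import torch.nn as nn

ctc_loss = nn.CTCLoss(blank=0, zero_infinity=True)

def ctc_pretrain_step(encoder, ctc_head, feats, feat_lens, targets, target_lens, optimizer):
    """One CTC pre-training step for an acoustic encoder (Encoder-1 or Encoder-2)."""
    enc_out, _ = encoder(feats)                         # (batch, T, hidden), LSTM-style encoder
    log_probs = ctc_head(enc_out).log_softmax(dim=-1)   # (batch, T, vocab)
    loss = ctc_loss(log_probs.transpose(0, 1),          # CTC expects (T, batch, vocab)
                    targets, feat_lens, target_lens)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()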
本申请提供一种语音情绪识别方法及装置，该语音情绪识别方法包括：获取当前音频帧的第一音频特征编码和历史音频帧的文本特征信息，其中，历史音频帧在当前音频帧之前；基于历史音频帧的文本特征信息预测当前音频帧的文本特征编码；融合当前音频帧的第一音频特征编码和文本特征编码，得到融合特征向量；基于融合特征向量进行语音情绪识别，得到当前音频帧的语音情绪识别结果。本申请先利用历史音频帧的文本特征信息预测出当前音频帧的文本特征编码，然后把当前音频帧的文本特征编码和第一音频特征编码融合后进行语音情绪识别，将音频信息与文本信息进行深度融合，能够提高语音情绪识别的准确度。This application provides a speech emotion recognition method and apparatus. The speech emotion recognition method includes: obtaining the first audio feature encoding of the current audio frame and the text feature information of a historical audio frame, where the historical audio frame precedes the current audio frame; predicting the text feature encoding of the current audio frame based on the text feature information of the historical audio frame; fusing the first audio feature encoding and the text feature encoding of the current audio frame to obtain a fusion feature vector; and performing speech emotion recognition based on the fusion feature vector to obtain the speech emotion recognition result of the current audio frame. This application first uses the text feature information of the historical audio frame to predict the text feature encoding of the current audio frame, then fuses the text feature encoding of the current audio frame with the first audio feature encoding before performing speech emotion recognition; deeply fusing the audio information with the text information can improve the accuracy of speech emotion recognition.
本申请基于联合任务统一建模实现信息深度融合表征,通过联合任务学习,融合了情绪声学特征和语言特征的信息,有效的提高了情绪识别的准确率。This application implements deep fusion representation of information based on joint task unified modeling. Through joint task learning, the information of emotional acoustic features and language features is integrated, effectively improving the accuracy of emotion recognition.
为了更好实施本申请实施例中语音情绪识别方法,在语音情绪识别方法基础之上,本申请实施例中还提供一种语音情绪识别装置,如图5所示,语音情绪识别装置500包括:In order to better implement the voice emotion recognition method in the embodiment of the present application, based on the voice emotion recognition method, the embodiment of the present application also provides a voice emotion recognition device. As shown in Figure 5, the voice emotion recognition device 500 includes:
获取单元501,用于获取当前音频帧的第一音频特征编码和历史音频帧的文本特征信息,其中,历史音频帧在当前音频帧之前;The acquisition unit 501 is used to acquire the first audio feature code of the current audio frame and the text feature information of the historical audio frame, where the historical audio frame is before the current audio frame;
预测单元502,用于基于历史音频帧的文本特征信息预测当前音频帧的文本特征编码;Prediction unit 502, used to predict the text feature coding of the current audio frame based on the text feature information of historical audio frames;
融合单元503，用于融合当前音频帧的第一音频特征编码和文本特征编码，得到融合特征向量;The fusion unit 503 is used to fuse the first audio feature coding and the text feature coding of the current audio frame to obtain a fusion feature vector;
识别单元504,用于基于融合特征向量进行语音情绪识别,得到当前音频帧的语音情绪识别结果。The recognition unit 504 is used to perform speech emotion recognition based on the fusion feature vector to obtain the speech emotion recognition result of the current audio frame.
可选地,识别单元504,用于:Optionally, the identification unit 504 is used for:
基于Teager能量算子对当前音频帧进行特征提取,得到第二音频特征;Feature extraction is performed on the current audio frame based on the Teager energy operator to obtain the second audio feature;
对第二音频特征进行编码,得到当前音频帧的第二音频特征编码;Encode the second audio feature to obtain the second audio feature code of the current audio frame;
基于第二音频特征编码和融合特征向量进行语音情绪识别,得到当前音频帧的语音情绪识别结果。Speech emotion recognition is performed based on the second audio feature encoding and fusion feature vector, and the speech emotion recognition result of the current audio frame is obtained.
可选地,识别单元504,用于:Optionally, the identification unit 504 is used for:
基于预设注意力层对融合特征向量进行权重调整,得到第一目标特征向量;Adjust the weight of the fusion feature vector based on the preset attention layer to obtain the first target feature vector;
融合第一目标特征向量和第二音频特征编码,得到第二目标特征向量;Fusion of the first target feature vector and the second audio feature encoding to obtain the second target feature vector;
将第二目标特征向量输入目标情绪识别模型,得到当前音频帧的语音情绪识别结果。The second target feature vector is input into the target emotion recognition model to obtain the speech emotion recognition result of the current audio frame.
可选地,识别单元504,用于:Optionally, the identification unit 504 is used for:
将融合特征向量输入Softmax层,得到预测文本概率分布;Input the fused feature vector into the Softmax layer to obtain the predicted text probability distribution;
将预测文本概率分布和融合特征向量输入预设注意力层对融合特征向量进行权重调整,得到第一目标特征向量。Input the predicted text probability distribution and fusion feature vector into the preset attention layer to adjust the weight of the fusion feature vector to obtain the first target feature vector.
可选地,获取单元501,用于:Optionally, the acquisition unit 501 is used for:
对当前音频帧进行Fbank特征提取,得到当前音频帧的第一音频特征;Perform Fbank feature extraction on the current audio frame to obtain the first audio feature of the current audio frame;
对当前音频帧的第一音频特征进行编码,得到当前音频帧的第一音频特征编码。Encode the first audio feature of the current audio frame to obtain the first audio feature code of the current audio frame.
可选地,获取单元501,用于:Optionally, the acquisition unit 501 is used for:
判断历史音频帧之前是否存在音频帧;Determine whether there is an audio frame before the historical audio frame;
若历史音频帧之前不存在音频帧,则获取历史音频帧的第一音频特征编码和预设文本特征编码;If there is no audio frame before the historical audio frame, obtain the first audio feature code and the preset text feature code of the historical audio frame;
基于历史音频帧的第一音频特征编码和预设文本特征编码确定历史音频帧的文本特征信息。Text feature information of the historical audio frame is determined based on the first audio feature coding and the preset text feature coding of the historical audio frame.
可选地,获取单元501,用于:Optionally, the acquisition unit 501 is used for:
融合历史音频帧的第一音频特征编码和预设文本特征编码,得到历史融合特征向量;Fusion of the first audio feature code and the preset text feature code of the historical audio frame to obtain a historical fusion feature vector;
将历史融合特征向量输入Softmax层,得到历史音频帧的历史预测文本概率分布; Input the historical fusion feature vector into the Softmax layer to obtain the historical predicted text probability distribution of historical audio frames;
基于历史预测文本概率分布确定历史音频帧的文本特征信息。Determine text feature information of historical audio frames based on historical predicted text probability distribution.
本申请实施例还提供一种计算机设备,其集成了本申请实施例所提供的任一种语音情绪识别装置,计算机设备包括:An embodiment of the present application also provides a computer device that integrates any voice emotion recognition device provided by the embodiment of the present application. The computer device includes:
一个或多个处理器;one or more processors;
存储器;以及memory; and
一个或多个应用程序,其中一个或多个应用程序被存储于存储器中,并配置为由处理器执行上述语音情绪识别方法实施例中任一实施例中的语音情绪识别方法的步骤。One or more application programs, wherein one or more application programs are stored in the memory and configured to execute by the processor the steps of the voice emotion recognition method in any of the above voice emotion recognition method embodiments.
如图6所示,其示出了本申请实施例所涉及的计算机设备的结构示意图,具体来讲:As shown in Figure 6, it shows a schematic structural diagram of the computer equipment involved in the embodiment of the present application. Specifically:
该计算机设备可以包括一个或者一个以上处理核心的处理器601、一个或一个以上计算机可读存储介质的存储器602、电源603和输入单元604等部件。本领域技术人员可以理解,图中示出的计算机设备结构并不构成对计算机设备的限定,可以包括比图示更多或更少的部件,或者组合某些部件,或者不同的部件布置。其中:The computer device may include components such as a processor 601 of one or more processing cores, a memory 602 of one or more computer-readable storage media, a power supply 603, and an input unit 604. Those skilled in the art can understand that the structure of the computer equipment shown in the figures does not constitute a limitation on the computer equipment, and may include more or fewer components than shown in the figures, or combine certain components, or arrange different components. in:
处理器601是该计算机设备的控制中心,利用各种接口和线路连接整个计算机设备的各个部分,通过运行或执行存储在存储器602内的软件程序和/或模块,以及调用存储在存储器602内的数据,执行计算机设备的各种功能和处理数据,从而对计算机设备进行整体监控。可选的,处理器601可包括一个或多个处理核心;处理器601可以是中央处理单元(Central Processing Unit,CPU),还可以是其他通用处理器、数字信号处理器(Digital Signal Processor,DSP)、专用集成电路(Application Specific Integrated Circuit,ASIC)、现成可编程门阵列(Field-Programmable Gate Array,FPGA)或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件等。通用处理器可以是微处理器或者该处理器也可以是任何常规的处理器等,优选的,处理器601可集成应用处理器和调制解调处理器,其中,应用处理器主要处理操作系统、用户界面和应用程序等,调制解调处理器主要处理无线通信。可以理解的是,上述调制解调处理器也可以不集成到处理器601中。The processor 601 is the control center of the computer equipment, using various interfaces and lines to connect various parts of the entire computer equipment, by running or executing software programs and/or modules stored in the memory 602, and calling software programs stored in the memory 602. Data, perform various functions of the computer equipment and process the data to conduct overall monitoring of the computer equipment. Optionally, the processor 601 may include one or more processing cores; the processor 601 may be a central processing unit (Central Processing Unit, CPU), or other general-purpose processor, digital signal processor (Digital Signal Processor, DSP). ), Application Specific Integrated Circuit (ASIC), off-the-shelf programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc. The general-purpose processor can be a microprocessor or the processor can be any conventional processor, etc. Preferably, the processor 601 can integrate an application processor and a modem processor, where the application processor mainly processes the operating system, User interfaces and applications, etc. The modem processor mainly handles wireless communications. It can be understood that the above modem processor may not be integrated into the processor 601.
存储器602可用于存储软件程序以及模块,处理器601通过运行存储在存储器602的软件程序以及模块,从而执行各种功能应用以及数据处理。存储器602可主要包括存储程序区和存储数据区,其中,存储程序区可存储操作系统、至少一个功能所需的应用程序(比如声音播放功能、图像播放功能等)等;存储数据区可存储根据计算机设备的使用所创建的数据等。此外,存储器602可以包括高速随机存取存储器,还可以包括非易失性存储器,例如 至少一个磁盘存储器件、闪存器件、或其他易失性固态存储器件。相应地,存储器602还可以包括存储器控制器,以提供处理器601对存储器602的访问。The memory 602 can be used to store software programs and modules. The processor 601 executes various functional applications and data processing by running the software programs and modules stored in the memory 602 . The memory 602 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function (such as a sound playback function, an image playback function, etc.), etc.; the storage data area may store a program based on Data created by the use of computer equipment, etc. In addition, memory 602 may include high-speed random access memory and may also include non-volatile memory, such as At least one disk storage device, flash memory device, or other volatile solid-state storage device. Accordingly, the memory 602 may also include a memory controller to provide the processor 601 with access to the memory 602 .
计算机设备还包括给各个部件供电的电源603,优选的,电源603可以通过电源管理系统与处理器601逻辑相连,从而通过电源管理系统实现管理充电、放电、以及功耗管理等功能。电源603还可以包括一个或一个以上的直流或交流电源、再充电系统、电源故障检测电路、电源转换器或者逆变器、电源状态指示器等任意组件。The computer equipment also includes a power supply 603 that supplies power to various components. Preferably, the power supply 603 can be logically connected to the processor 601 through a power management system, so that functions such as charging, discharging, and power consumption management can be implemented through the power management system. The power supply 603 may also include one or more DC or AC power supplies, recharging systems, power failure detection circuits, power converters or inverters, power status indicators, and other arbitrary components.
该计算机设备还可包括输入单元604,该输入单元604可用于接收输入的数字或字符信息,以及产生与用户设置以及功能控制有关的键盘、鼠标、操作杆、光学或者轨迹球信号输入。The computer device may also include an input unit 604 operable to receive input numeric or character information and generate keyboard, mouse, joystick, optical, or trackball signal inputs related to user settings and functional controls.
尽管未示出,计算机设备还可以包括显示单元等,在此不再赘述。具体在本实施例中,计算机设备中的处理器601会按照如下的指令,将一个或一个以上的应用程序的进程对应的可执行文件加载到存储器602中,并由处理器601来运行存储在存储器602中的应用程序,从而实现各种功能,如下:Although not shown, the computer device may also include a display unit and the like, which will not be described again here. Specifically, in this embodiment, the processor 601 in the computer device will load the executable files corresponding to the processes of one or more application programs into the memory 602 according to the following instructions, and the processor 601 will run the executable files stored in The application program in the memory 602 implements various functions, as follows:
获取当前音频帧的第一音频特征编码和历史音频帧的文本特征信息，其中，历史音频帧在当前音频帧之前；基于历史音频帧的文本特征信息预测当前音频帧的文本特征编码；融合当前音频帧的第一音频特征编码和文本特征编码，得到融合特征向量；基于融合特征向量进行语音情绪识别，得到当前音频帧的语音情绪识别结果。Obtain the first audio feature encoding of the current audio frame and the text feature information of a historical audio frame, where the historical audio frame precedes the current audio frame; predict the text feature encoding of the current audio frame based on the text feature information of the historical audio frame; fuse the first audio feature encoding and the text feature encoding of the current audio frame to obtain a fusion feature vector; and perform speech emotion recognition based on the fusion feature vector to obtain the speech emotion recognition result of the current audio frame.
本领域普通技术人员可以理解,上述实施例的各种方法中的全部或部分步骤可以通过指令来完成,或通过指令控制相关的硬件来完成,该指令可以存储于一计算机可读存储介质中,并由处理器进行加载和执行。Those of ordinary skill in the art can understand that all or part of the steps in the various methods of the above embodiments can be completed by instructions, or by controlling relevant hardware through instructions. The instructions can be stored in a computer-readable storage medium, and loaded and executed by the processor.
为此,本申请实施例提供一种计算机可读存储介质,该存储介质可以包括:只读存储器(ROM,Read Only Memory)、随机存取记忆体(RAM,Random Access Memory)、磁盘或光盘等。其上存储有计算机程序,计算机程序被处理器进行加载,以执行本申请实施例所提供的任一种语音情绪识别方法中的步骤。例如,计算机程序被处理器进行加载可以执行如下步骤:To this end, embodiments of the present application provide a computer-readable storage medium, which may include: read-only memory (ROM, Read Only Memory), random access memory (RAM, Random Access Memory), magnetic disk or optical disk, etc. . A computer program is stored thereon, and the computer program is loaded by the processor to execute the steps in any of the speech emotion recognition methods provided by the embodiments of the present application. For example, a computer program loaded by a processor may perform the following steps:
获取当前音频帧的第一音频特征编码和历史音频帧的文本特征信息，其中，历史音频帧在当前音频帧之前；基于历史音频帧的文本特征信息预测当前音频帧的文本特征编码；融合当前音频帧的第一音频特征编码和文本特征编码，得到融合特征向量；基于融合特征向量进行语音情绪识别，得到当前音频帧的语音情绪识别结果。Obtain the first audio feature encoding of the current audio frame and the text feature information of a historical audio frame, where the historical audio frame precedes the current audio frame; predict the text feature encoding of the current audio frame based on the text feature information of the historical audio frame; fuse the first audio feature encoding and the text feature encoding of the current audio frame to obtain a fusion feature vector; and perform speech emotion recognition based on the fusion feature vector to obtain the speech emotion recognition result of the current audio frame.
在上述实施例中,对各个实施例的描述都各有侧重,某个实施例中没有详述的部分,可以参见上文针对其他实施例的详细描述,此处不再赘述。In the above embodiments, each embodiment is described with its own emphasis. For parts that are not described in detail in a certain embodiment, please refer to the above detailed descriptions of other embodiments and will not be described again here.
具体实施时，以上各个单元或结构可以作为独立的实体来实现，也可以进行任意组合，作为同一或若干个实体来实现，以上各个单元或结构的具体实施可参见前面的方法实施例，在此不再赘述。During specific implementation, each of the above units or structures may be implemented as an independent entity, or combined in any manner and implemented as one or several entities. For the specific implementation of each of the above units or structures, reference may be made to the foregoing method embodiments, which will not be repeated here.
以上各个操作的具体实施可参见前面的实施例,在此不再赘述。For the specific implementation of each of the above operations, please refer to the previous embodiments and will not be described again here.
以上对本申请实施例所提供的一种语音情绪识别方法及装置进行了详细介绍，本文中应用了具体个例对本申请的原理及实施方式进行了阐述，以上实施例的说明只是用于帮助理解本申请的方法及其核心思想；同时，对于本领域的技术人员，依据本申请的思想，在具体实施方式及应用范围上均会有改变之处，综上，本说明书内容不应理解为对本申请的限制。The speech emotion recognition method and apparatus provided by the embodiments of this application have been introduced in detail above. Specific examples are used herein to explain the principles and implementations of this application, and the description of the above embodiments is only intended to help understand the method of this application and its core idea. Meanwhile, those skilled in the art may make changes to the specific implementations and the scope of application in accordance with the idea of this application. In summary, the content of this specification should not be construed as a limitation of this application.

Claims (10)

  1. 一种语音情绪识别方法,其特征在于,所述语音情绪识别方法包括:A voice emotion recognition method, characterized in that the voice emotion recognition method includes:
    获取当前音频帧的第一音频特征编码和历史音频帧的文本特征信息,其中,所述历史音频帧在所述当前音频帧之前;Obtain the first audio feature code of the current audio frame and the text feature information of the historical audio frame, where the historical audio frame precedes the current audio frame;
    基于所述历史音频帧的文本特征信息预测所述当前音频帧的文本特征编码;Predict the text feature encoding of the current audio frame based on the text feature information of the historical audio frame;
    融合所述当前音频帧的第一音频特征编码和所述文本特征编码,得到融合特征向量;Fusion of the first audio feature code of the current audio frame and the text feature code to obtain a fusion feature vector;
    基于所述融合特征向量进行语音情绪识别,得到所述当前音频帧的语音情绪识别结果。Perform speech emotion recognition based on the fused feature vector to obtain the speech emotion recognition result of the current audio frame.
  2. 根据权利要求1所述的语音情绪识别方法,其特征在于,所述基于所述融合特征向量进行语音情绪识别,得到所述当前音频帧的语音情绪识别结果,包括:The speech emotion recognition method according to claim 1, wherein the speech emotion recognition based on the fusion feature vector to obtain the speech emotion recognition result of the current audio frame includes:
    基于Teager能量算子对所述当前音频帧进行特征提取,得到第二音频特征;Perform feature extraction on the current audio frame based on the Teager energy operator to obtain the second audio feature;
    对所述第二音频特征进行编码,得到所述当前音频帧的第二音频特征编码;Encode the second audio feature to obtain the second audio feature code of the current audio frame;
    基于所述第二音频特征编码和所述融合特征向量进行语音情绪识别,得到所述当前音频帧的语音情绪识别结果。Speech emotion recognition is performed based on the second audio feature encoding and the fusion feature vector to obtain a speech emotion recognition result of the current audio frame.
  3. 根据权利要求2所述的语音情绪识别方法,其特征在于,所述基于所述第二音频特征编码和所述融合特征向量进行语音情绪识别,得到所述当前音频帧的语音情绪识别结果,包括:The speech emotion recognition method according to claim 2, characterized in that the speech emotion recognition is performed based on the second audio feature encoding and the fusion feature vector to obtain the speech emotion recognition result of the current audio frame, including :
    基于预设注意力层对所述融合特征向量进行权重调整,得到第一目标特征向量;Adjust the weight of the fusion feature vector based on the preset attention layer to obtain the first target feature vector;
    融合所述第一目标特征向量和所述第二音频特征编码,得到第二目标特征向量;Fusion of the first target feature vector and the second audio feature encoding to obtain a second target feature vector;
    将所述第二目标特征向量输入目标情绪识别模型,得到所述当前音频帧的语音情绪识别结果。The second target feature vector is input into the target emotion recognition model to obtain the speech emotion recognition result of the current audio frame.
  4. 根据权利要求3所述的语音情绪识别方法,其特征在于,所述基于预设注意力层对所述融合特征向量进行权重调整,得到第一目标特征向量,包括:The speech emotion recognition method according to claim 3, characterized in that the weight adjustment of the fusion feature vector based on the preset attention layer to obtain the first target feature vector includes:
    将所述融合特征向量输入Softmax层,得到预测文本概率分布;Input the fused feature vector into the Softmax layer to obtain the predicted text probability distribution;
    将所述预测文本概率分布和所述融合特征向量输入所述预设注意力层对所述融合特征向量进行权重调整,得到所述第一目标特征向量。The predicted text probability distribution and the fusion feature vector are input into the preset attention layer to adjust the weight of the fusion feature vector to obtain the first target feature vector.
  5. 根据权利要求1所述的语音情绪识别方法,其特征在于,所述获取当前音频帧的第一音频特征编码和历史音频帧的文本特征信息,包括:The speech emotion recognition method according to claim 1, wherein said obtaining the first audio feature encoding of the current audio frame and the text feature information of historical audio frames includes:
    对所述当前音频帧进行Fbank特征提取,得到所述当前音频帧的第一音频特征; Perform Fbank feature extraction on the current audio frame to obtain the first audio feature of the current audio frame;
    对所述当前音频帧的第一音频特征进行编码,得到所述当前音频帧的第一音频特征编码。The first audio feature of the current audio frame is encoded to obtain the first audio feature encoding of the current audio frame.
  6. 根据权利要求1所述的语音情绪识别方法,其特征在于,所述获取当前音频帧的第一音频特征编码和历史音频帧的文本特征信息,包括:The speech emotion recognition method according to claim 1, wherein said obtaining the first audio feature encoding of the current audio frame and the text feature information of historical audio frames includes:
    判断所述历史音频帧之前是否存在音频帧;Determine whether there is an audio frame before the historical audio frame;
    若所述历史音频帧之前不存在音频帧,则获取历史音频帧的第一音频特征编码和预设文本特征编码;If there is no audio frame before the historical audio frame, obtain the first audio feature code and the preset text feature code of the historical audio frame;
    基于所述历史音频帧的第一音频特征编码和所述预设文本特征编码确定所述历史音频帧的文本特征信息。The text feature information of the historical audio frame is determined based on the first audio feature code of the historical audio frame and the preset text feature code.
  7. 根据权利要求6所述的语音情绪识别方法,其特征在于,所述基于所述历史音频帧的第一音频特征编码和所述预设文本特征编码确定所述历史音频帧的文本特征信息,包括:The speech emotion recognition method according to claim 6, wherein the text feature information of the historical audio frame is determined based on the first audio feature coding and the preset text feature coding of the historical audio frame, including :
    融合所述历史音频帧的第一音频特征编码和预设文本特征编码,得到历史融合特征向量;Fusion of the first audio feature code and the preset text feature code of the historical audio frame to obtain a historical fusion feature vector;
    将所述历史融合特征向量输入Softmax层,得到所述历史音频帧的历史预测文本概率分布;Input the historical fusion feature vector into the Softmax layer to obtain the historical predicted text probability distribution of the historical audio frame;
    基于所述历史预测文本概率分布确定所述历史音频帧的文本特征信息。Text feature information of the historical audio frame is determined based on the historical predicted text probability distribution.
  8. 一种语音情绪识别装置,其特征在于,所述语音情绪识别装置包括:A voice emotion recognition device, characterized in that the voice emotion recognition device includes:
    获取单元,用于获取当前音频帧的第一音频特征编码和历史音频帧的文本特征信息,其中,所述历史音频帧在所述当前音频帧之前;An acquisition unit, configured to acquire the first audio feature encoding of the current audio frame and the text feature information of the historical audio frame, where the historical audio frame precedes the current audio frame;
    预测单元,用于基于所述历史音频帧的文本特征信息预测所述当前音频帧的文本特征编码;A prediction unit, configured to predict the text feature encoding of the current audio frame based on the text feature information of the historical audio frame;
    融合单元,用于融合所述当前音频帧的第一音频特征编码和所述文本特征编码,得到融合特征向量;a fusion unit, configured to fuse the first audio feature coding of the current audio frame and the text feature coding to obtain a fusion feature vector;
    识别单元,用于基于所述融合特征向量进行语音情绪识别,得到所述当前音频帧的语音情绪识别结果。A recognition unit, configured to perform speech emotion recognition based on the fusion feature vector, and obtain a speech emotion recognition result of the current audio frame.
  9. 一种计算机设备,其特征在于,所述计算机设备包括:A computer device, characterized in that the computer device includes:
    一个或多个处理器;one or more processors;
    存储器;以及 memory; and
    一个或多个应用程序,其中所述一个或多个应用程序被存储于所述存储器中,并配置为由所述处理器执行以实现权利要求1至7中任一项所述的语音情绪识别方法。One or more application programs, wherein the one or more application programs are stored in the memory and configured to be executed by the processor to implement the speech emotion recognition of any one of claims 1 to 7 method.
  10. 一种计算机可读存储介质,其特征在于,其上存储有计算机程序,所述计算机程序被处理器进行加载,以执行权利要求1至7中任一项所述的语音情绪识别方法的步骤。 A computer-readable storage medium, characterized in that a computer program is stored thereon, and the computer program is loaded by a processor to execute the steps of the speech emotion recognition method described in any one of claims 1 to 7.
PCT/CN2023/117475 2022-07-08 2023-09-07 Speech emotion recognition method and apparatus WO2024008215A2 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210806418.6A CN117409818A (en) 2022-07-08 2022-07-08 Speech emotion recognition method and device
CN202210806418.6 2022-07-08

Publications (2)

Publication Number Publication Date
WO2024008215A2 true WO2024008215A2 (en) 2024-01-11
WO2024008215A3 WO2024008215A3 (en) 2024-02-29

Family

ID=89454303

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/117475 WO2024008215A2 (en) 2022-07-08 2023-09-07 Speech emotion recognition method and apparatus

Country Status (2)

Country Link
CN (1) CN117409818A (en)
WO (1) WO2024008215A2 (en)

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108305642B (en) * 2017-06-30 2019-07-19 腾讯科技(深圳)有限公司 The determination method and apparatus of emotion information
US11205444B2 (en) * 2019-08-16 2021-12-21 Adobe Inc. Utilizing bi-directional recurrent encoders with multi-hop attention for speech emotion recognition
CN110570879A (en) * 2019-09-11 2019-12-13 深圳壹账通智能科技有限公司 Intelligent conversation method and device based on emotion recognition and computer equipment
CN111028827B (en) * 2019-12-10 2023-01-24 深圳追一科技有限公司 Interaction processing method, device, equipment and storage medium based on emotion recognition
CN111524534B (en) * 2020-03-20 2021-04-09 北京捷通华声科技股份有限公司 Voice analysis method, system, device and storage medium
CN113506586B (en) * 2021-06-18 2023-06-20 杭州摸象大数据科技有限公司 Method and system for identifying emotion of user
CN114022192A (en) * 2021-10-20 2022-02-08 百融云创科技股份有限公司 Data modeling method and system based on intelligent marketing scene
CN114492579A (en) * 2021-12-25 2022-05-13 浙江大华技术股份有限公司 Emotion recognition method, camera device, emotion recognition device and storage device
CN114639150A (en) * 2022-03-16 2022-06-17 平安科技(深圳)有限公司 Emotion recognition method and device, computer equipment and storage medium

Also Published As

Publication number Publication date
CN117409818A (en) 2024-01-16
WO2024008215A3 (en) 2024-02-29


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23834980

Country of ref document: EP

Kind code of ref document: A2