WO2024008215A2 - Speech emotion recognition method and apparatus - Google Patents

Speech emotion recognition method and apparatus

Info

Publication number
WO2024008215A2
WO2024008215A2 (PCT/CN2023/117475)
Authority
WO
WIPO (PCT)
Prior art keywords
audio frame
feature
emotion recognition
historical
text
Prior art date
Application number
PCT/CN2023/117475
Other languages
English (en)
Chinese (zh)
Other versions
WO2024008215A3 (fr)
Inventor
刘汝洲
Original Assignee
顺丰科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 顺丰科技有限公司 filed Critical 顺丰科技有限公司
Publication of WO2024008215A2 publication Critical patent/WO2024008215A2/fr
Publication of WO2024008215A3 publication Critical patent/WO2024008215A3/fr

Links

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/02Feature extraction for speech recognition; Selection of recognition unit
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/26Speech to text systems
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/63Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04MTELEPHONIC COMMUNICATION
    • H04M3/00Automatic or semi-automatic exchanges
    • H04M3/42Systems providing special services or facilities to subscribers
    • H04M3/50Centralised arrangements for answering calls; Centralised arrangements for recording messages for absent or busy subscribers ; Centralised arrangements for recording messages
    • H04M3/527Centralised call answering arrangements not requiring operator intervention

Definitions

  • This application mainly relates to the field of artificial intelligence technology, and specifically relates to a speech emotion recognition method and device.
  • This application provides a voice emotion recognition method and device, aiming to solve the problem of low accuracy of voice emotion recognition in the prior art.
  • this application provides a voice emotion recognition method.
  • the voice emotion recognition method includes:
  • performing speech emotion recognition based on the fused feature vector to obtain the speech emotion recognition result of the current audio frame includes:
  • Speech emotion recognition is performed based on the second audio feature encoding and the fusion feature vector to obtain a speech emotion recognition result of the current audio frame.
  • performing speech emotion recognition based on the second audio feature encoding and the fused feature vector to obtain the speech emotion recognition result of the current audio frame includes:
  • the second target feature vector is input into the target emotion recognition model to obtain the speech emotion recognition result of the current audio frame.
  • the weight adjustment of the fused feature vector based on the preset attention layer to obtain the first target feature vector includes:
  • the predicted text probability distribution and the fusion feature vector are input into the preset attention layer to adjust the weight of the fusion feature vector to obtain the first target feature vector.
  • the obtaining the first audio feature encoding of the current audio frame and the text feature information of the historical audio frame includes:
  • the first audio feature of the current audio frame is encoded to obtain the first audio feature encoding of the current audio frame.
  • the obtaining the first audio feature encoding of the current audio frame and the text feature information of the historical audio frame includes:
  • the text feature information of the historical audio frame is determined based on the first audio feature code of the historical audio frame and the preset text feature code.
  • determining the text feature information of the historical audio frame based on the first audio feature coding of the historical audio frame and the preset text feature coding includes:
  • Text feature information of the historical audio frame is determined based on the historical predicted text probability distribution.
  • this application provides a voice emotion recognition device.
  • the voice emotion recognition device includes:
  • An acquisition unit configured to acquire the first audio feature encoding of the current audio frame and the text feature information of the historical audio frame, where the historical audio frame precedes the current audio frame;
  • a prediction unit configured to predict the text feature encoding of the current audio frame based on the text feature information of the historical audio frame
  • a fusion unit configured to fuse the first audio feature coding of the current audio frame and the text feature coding to obtain a fusion feature vector
  • a recognition unit configured to perform speech emotion recognition based on the fusion feature vector, and obtain a speech emotion recognition result of the current audio frame.
  • the identification unit is used for:
  • Speech emotion recognition is performed based on the second audio feature encoding and the fusion feature vector to obtain a speech emotion recognition result of the current audio frame.
  • the identification unit is used for:
  • the second target feature vector is input into the target emotion recognition model to obtain the speech emotion recognition result of the current audio frame.
  • the identification unit is used for:
  • the predicted text probability distribution and the fusion feature vector are input into the preset attention layer to adjust the weight of the fusion feature vector to obtain the first target feature vector.
  • the acquisition unit is used for:
  • the first audio feature of the current audio frame is encoded to obtain the first audio feature encoding of the current audio frame.
  • the acquisition unit is used for:
  • the text feature information of the historical audio frame is determined based on the first audio feature code of the historical audio frame and the preset text feature code.
  • the acquisition unit is used for:
  • Text feature information of the historical audio frame is determined based on the historical predicted text probability distribution.
  • this application provides a computer device, which includes:
  • one or more processors and a memory; and
  • one or more application programs, wherein the one or more application programs are stored in the memory and configured to be executed by the processor to implement the voice emotion recognition method according to any one of the first aspect.
  • the present application provides a computer-readable storage medium that stores a plurality of instructions, and the instructions are suitable for being loaded by a processor to execute the steps in any of the speech emotion recognition methods described in the first aspect.
  • the present application provides a speech emotion recognition method and device.
  • the speech emotion recognition method includes: obtaining the first audio feature encoding of the current audio frame and the text feature information of the historical audio frame, where the historical audio frame precedes the current audio frame; predicting the text feature encoding of the current audio frame based on the text feature information of the historical audio frame; fusing the first audio feature encoding and the text feature encoding of the current audio frame to obtain a fused feature vector; and performing speech emotion recognition based on the fused feature vector to obtain the speech emotion recognition result of the current audio frame.
  • This application first uses the text feature information of the historical audio frame to predict the text feature encoding of the current audio frame, and then fuses the text feature encoding of the current audio frame with the first audio feature encoding to perform speech emotion recognition. Deeply integrating audio information and text information can improve the accuracy of speech emotion recognition.
  • Figure 1 is a schematic scene diagram of the speech emotion recognition system provided by the embodiment of the present application.
  • Figure 2 is a schematic flow diagram of an embodiment of the speech emotion recognition method provided in the embodiment of the present application.
  • Figure 3 is a schematic module diagram of an embodiment of the speech emotion recognition method provided in the embodiment of the present application.
  • Figure 4 is a schematic flow chart of performing speech emotion recognition based on the second audio feature encoding and fusion feature vector to obtain the speech emotion recognition result of the current audio frame in one embodiment of the speech emotion recognition method provided in the embodiment of the present application;
  • Figure 5 is a schematic structural diagram of an embodiment of the speech emotion recognition device provided in the embodiment of the present application.
  • FIG. 6 is a schematic structural diagram of an embodiment of a computer device provided in an embodiment of the present application.
  • The terms "first" and "second" are used for descriptive purposes only and cannot be understood as indicating or implying relative importance or implicitly indicating the quantity of the indicated technical features. Therefore, features defined by "first" and "second" may explicitly or implicitly include one or more such features. In the description of this application, "plurality" means two or more, unless otherwise expressly and specifically limited.
  • the embodiment of the present application provides a speech emotion recognition method and device, which will be described in detail below.
  • Figure 1 is a schematic diagram of a scene of a voice emotion recognition system provided by an embodiment of the present application.
  • the voice emotion recognition system may include a computer device 100, and a voice emotion recognition device is integrated in the computer device 100.
  • the computer device 100 may be an independent server, or a server network or server cluster composed of servers.
  • the computer device 100 described in the embodiment of the present application includes, but is not limited to, a computer, a network host, a single network server, a set of multiple network servers, or a cloud server composed of multiple servers.
  • the cloud server is composed of a large number of computers or network servers based on cloud computing.
  • the above-mentioned computer device 100 may be a general-purpose computer device or a special-purpose computer device.
  • the computer device 100 may be a desktop computer, a portable computer, a network server, a personal digital assistant (PDA), a mobile phone, a tablet computer, a wireless terminal device, a communication device, an embedded device, etc.
  • PDA personal digital assistant
  • This embodiment does not limit the type of the computer device 100.
  • Figure 1 is only one application scenario of the solution of the present application and does not constitute a limitation on the application scenarios of the solution of the present application.
  • Other application environments may also include more or fewer computer devices than those shown in Figure 1.
  • the speech emotion recognition system can also include one or more other computer devices that can process data, which is not specifically limited here.
  • the voice emotion recognition system may also include a memory 200 for storing data.
  • an embodiment of the present application provides a speech emotion recognition method.
  • the speech emotion recognition method includes: obtaining the first audio feature encoding of the current audio frame and the text feature information of the historical audio frame, where the historical audio frame precedes the current audio frame; predicting the text feature encoding of the current audio frame based on the text feature information of the historical audio frame; fusing the first audio feature encoding and the text feature encoding of the current audio frame to obtain a fused feature vector; and performing speech emotion recognition based on the fused feature vector to obtain the speech emotion recognition result of the current audio frame.
  • Figure 2 is a schematic flow diagram of an embodiment of the speech emotion recognition method provided in the embodiment of the present application
  • Figure 3 is a module schematic diagram of an embodiment of the speech emotion recognition method provided in the embodiment of the present application.
  • the speech emotion recognition method includes the following steps S201 to S204:
  • the historical audio frame is before the current audio frame.
  • the historical audio frame is the audio frame preceding the current audio frame.
  • the current audio frame and the historical audio frame have the same length, for example, both are 10 ms to 30 ms, which can be set according to specific requirements.
  • the audio to be recognized is obtained; the audio to be recognized is divided into frames to obtain multiple audio frames, and the current audio frame is obtained from the multiple audio frames.
  • For framing, generally 10–30 ms is taken as one frame.
  • For frame overlap, a portion of overlap is required between adjacent frames.
  • Typically, half of the frame length is used as the frame shift; that is, each time the window is moved by half a frame before the next frame is taken, which prevents the characteristics from changing too much from frame to frame.
  • The usual choice is a 25 ms frame length with a 10 ms frame shift.
  • Framing is necessary because the speech signal changes rapidly, and the Fourier transform is suitable for analyzing stationary signals.
  • the frame length is generally set to 10 to 30ms, so that there are enough cycles in one frame without changing too drastically.
  • Each frame signal is usually multiplied by a smooth window function to allow both ends of the frame to smoothly attenuate to zero. This can reduce the intensity of the side lobes after Fourier transform and obtain a higher quality spectrum.
  • the time difference between frames is often taken as 10 ms, so that adjacent frames overlap; otherwise, because the signal at the junction between frames is attenuated by windowing, that part of the information would be lost.
  • the Fourier transform is performed frame by frame in order to obtain the spectrum of each frame. Generally, only the amplitude spectrum is retained and the phase spectrum is discarded.
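  • As a minimal illustration of the framing, windowing and frame-by-frame Fourier transform described above (a sketch only: the 25 ms frame length, 10 ms frame shift and Hamming window are the typical values mentioned here, not values mandated by the application), the per-frame magnitude spectra can be computed as follows:

```python
import numpy as np

def frame_magnitude_spectra(signal, sample_rate, frame_ms=25, shift_ms=10):
    """Split a 1-D speech signal into overlapping frames, window each frame,
    and return the per-frame magnitude spectrum (phase is discarded)."""
    frame_len = int(sample_rate * frame_ms / 1000)    # e.g. 400 samples at 16 kHz
    frame_shift = int(sample_rate * shift_ms / 1000)  # e.g. 160 samples at 16 kHz
    window = np.hamming(frame_len)                    # smooth window to reduce side lobes

    n_frames = 1 + max(0, (len(signal) - frame_len) // frame_shift)
    spectra = []
    for i in range(n_frames):
        frame = signal[i * frame_shift : i * frame_shift + frame_len]
        spectra.append(np.abs(np.fft.rfft(frame * window)))  # keep amplitude spectrum only
    return np.stack(spectra)

# usage: 1 second of 16 kHz audio -> roughly 98 frames of magnitude spectra
spectra = frame_magnitude_spectra(np.random.randn(16000), 16000)
```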
  • obtaining the first audio feature code of the current audio frame and the text feature information of the historical audio frame may include:
  • Fbank is FilterBank.
  • the response of the human ear to the sound spectrum is nonlinear.
  • Fbank is a front-end processing algorithm that processes audio in a manner similar to the human ear, which can improve the performance of speech recognition.
  • the general steps to obtain the Fbank features of a speech signal are: pre-emphasis, framing, windowing, short-time Fourier transform (STFT), mel filtering, de-meaning, etc.
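  • A hedged sketch of these Fbank steps, using librosa purely for illustration; the application does not prescribe a particular library, and the sampling rate, filter count and frame parameters below are assumptions:

```python
import numpy as np
import librosa

def fbank_features(y, sr=16000, n_mels=40, frame_ms=25, shift_ms=10, preemph=0.97):
    """Log-mel filter bank (Fbank) features with per-utterance mean removal."""
    y = np.append(y[0], y[1:] - preemph * y[:-1])           # pre-emphasis
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=int(sr * frame_ms / 1000),
        hop_length=int(sr * shift_ms / 1000),
        window="hamming", n_mels=n_mels)                    # framing + windowing + STFT + mel filtering
    logmel = np.log(mel + 1e-8)                             # log compression
    return (logmel - logmel.mean(axis=1, keepdims=True)).T  # de-mean, shape (frames, n_mels)
```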
  • Fbank feature extraction and fundamental frequency feature extraction are performed on the current audio frame respectively to obtain the Fbank features and the Pitch features, and the Fbank features and Pitch features are fused to obtain the first audio feature x1_ti of the current audio frame.
  • the pitch period (Pitch) is the reciprocal of the vibration frequency of the vocal cords. When a person produces a voiced sound, airflow through the vocal tract causes the vocal cords to vibrate, and the period of this vocal cord vibration is the pitch period.
  • the estimation of the pitch period is called pitch detection (Pitch Detection).
  • The fundamental frequency contains a large number of features that characterize speech emotion and is crucial in speech emotion recognition. Commonly used fundamental frequency feature extraction methods include the autocorrelation function method (ACF), the time-domain average magnitude difference method (AMDF), and the wavelet method (frequency domain).
  • ACF autocorrelation function method
  • AMDF time-domain average magnitude difference method
  • wavelet method frequency-domain method
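  • A minimal sketch of the autocorrelation function (ACF) approach to pitch detection, together with concatenation of the Pitch value onto an Fbank vector to form a per-frame feature in the spirit of x1_ti; the 60–400 Hz search range and the simple voiced/unvoiced handling are assumptions:

```python
import numpy as np

def acf_pitch(frame, sr=16000, fmin=60.0, fmax=400.0):
    """Estimate the fundamental frequency of one frame via the autocorrelation function."""
    frame = frame - frame.mean()
    acf = np.correlate(frame, frame, mode="full")[len(frame) - 1:]  # lags >= 0
    lo, hi = int(sr / fmax), int(sr / fmin)                         # plausible pitch lags
    if hi >= len(acf) or acf[0] <= 0:
        return 0.0                                                  # treat as unvoiced
    lag = lo + int(np.argmax(acf[lo:hi]))
    return sr / lag                                                 # pitch in Hz

def frame_feature(fbank_vec, frame, sr=16000):
    """Concatenate Fbank features and the pitch value into one frame-level feature (x1-style)."""
    return np.concatenate([fbank_vec, [acf_pitch(frame, sr)]])
```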
  • the first audio feature x1_ti of the current audio frame is input to the first acoustic model Encoder-1 for encoding to obtain the first audio feature encoding of the current audio frame.
  • the first acoustic model Encoder-1 can be a hidden Markov model (HMM), a deep neural network (DNN), a convolutional neural network (CNN), a recurrent neural network (RNN), etc.
  • the first acoustic model Encoder-1 can be a BiLSTM model.
  • the output layer of the first acoustic model Encoder-1 adopts a CTC (Connectionist Temporal Classification) network structure.
  • the CTC network is used to align the speech features and labels of each frame.
  • the input of the first acoustic model Encoder-1 is the first audio feature x1_ti obtained by fusing the Fbank feature and the Pitch feature, that is, fbank+pitch.
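  • A minimal PyTorch sketch of an Encoder-1-style module under the options listed above (a BiLSTM encoder with a CTC projection layer); the feature dimension (40 Fbank + 1 Pitch), hidden size and vocabulary size are illustrative assumptions, not taken from the application:

```python
import torch
import torch.nn as nn

class AudioEncoder(nn.Module):
    """BiLSTM acoustic encoder with a CTC projection layer (Encoder-1 style)."""
    def __init__(self, feat_dim=41, hidden=256, vocab_size=5000):
        super().__init__()
        self.blstm = nn.LSTM(feat_dim, hidden, num_layers=2,
                             batch_first=True, bidirectional=True)
        self.ctc_proj = nn.Linear(2 * hidden, vocab_size + 1)  # +1 for the CTC blank symbol

    def forward(self, x):                 # x: (batch, frames, feat_dim), e.g. fbank+pitch
        enc, _ = self.blstm(x)            # frame-level audio feature encoding
        return enc, self.ctc_proj(enc)    # encoding + CTC logits used for alignment training

# illustrative usage: log-probs over time in the layout expected by nn.CTCLoss
encoder = AudioEncoder()
feats = torch.randn(2, 120, 41)
enc, logits = encoder(feats)
log_probs = logits.log_softmax(-1).transpose(0, 1)   # (T, batch, vocab+1)
```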
  • the text feature information of historical audio frames may be manually annotated.
  • obtaining the first audio feature code of the current audio frame and the text feature information of the historical audio frame may include:
  • the preset text feature encoding is a manually specified default text feature encoding.
  • determining the text feature information of the historical audio frame based on the first audio feature encoding of the historical audio frame and the preset text feature encoding includes: fusing the first audio feature encoding of the historical audio frame and the preset text feature encoding to obtain a historical fusion feature vector; inputting the historical fusion feature vector into the Softmax layer to obtain the historical predicted text probability distribution of the historical audio frame; and determining the text feature information of the historical audio frame based on the historical predicted text probability distribution. Specifically, the text with the highest probability in the historical predicted text probability distribution is determined as the text feature information y_(ui-1) of the historical audio frame.
  • Alternatively, speech-to-text software can be used to convert the first audio feature encoding of the historical audio frame into text to obtain the text feature information y_(ui-1) of the historical audio frame.
  • the text feature information y_(ui-1) of the historical audio frame is input into the preset language model to obtain the text feature encoding p_ui of the current audio frame.
  • the preset language model can be a BERT model, an LSTM model, XLNet, GPT, etc.
  • the full name of LSTM is Long Short-Term Memory, which is a type of RNN (Recurrent Neural Network). Due to its design characteristics, LSTM is very suitable for modeling time series data, such as text data.
  • BiLSTM is the abbreviation of Bi-directional Long Short-Term Memory, which is a combination of forward LSTM and backward LSTM. Both are often used to model contextual information in natural language processing tasks.
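  • The preceding paragraphs feed the previous text feature information y_(ui-1) into a preset language model to obtain the text feature encoding p_ui of the current frame. A minimal sketch with an LSTM prediction network, which is one of the listed options; the embedding and hidden sizes are assumptions:

```python
import torch
import torch.nn as nn

class TextPredictor(nn.Module):
    """LSTM language model: previous text token(s) -> text feature encoding p_ui."""
    def __init__(self, vocab_size=5000, emb_dim=256, hidden=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True)

    def forward(self, prev_tokens, state=None):
        # prev_tokens: (batch, 1) id of y_(ui-1); state carries the earlier history
        emb = self.embed(prev_tokens)
        out, state = self.lstm(emb, state)
        return out[:, -1], state          # p_ui for the current frame, plus updated state

predictor = TextPredictor()
p_ui, state = predictor(torch.tensor([[42]]))   # hypothetical previous-token id
```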
  • the first audio feature encoding of the current audio frame and the text feature encoding p_ui are input into the preset shared network model (Joint Net) to obtain the fusion feature vector h_ti.
  • the preset shared network model Joint Net is a Transformer network.
  • the preset shared network model Joint Net can also be a BiLSTM model.
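  • A hedged sketch of a Joint Net-style shared network that fuses the first audio feature encoding with the text feature encoding p_ui into a fusion feature vector h_ti; a simple feed-forward combination is shown for brevity, whereas the application mentions Transformer or BiLSTM variants, and all dimensions are assumptions:

```python
import torch
import torch.nn as nn

class JointNet(nn.Module):
    """Fuse a frame's audio encoding and text encoding into h_ti."""
    def __init__(self, audio_dim=512, text_dim=256, joint_dim=512, vocab_size=5000):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, joint_dim)
        self.text_proj = nn.Linear(text_dim, joint_dim)
        self.out = nn.Linear(joint_dim, vocab_size)   # feeds the Softmax text distribution

    def forward(self, audio_enc, text_enc):
        h_ti = torch.tanh(self.audio_proj(audio_enc) + self.text_proj(text_enc))
        return h_ti, self.out(h_ti).softmax(-1)       # fusion vector + predicted text distribution

joint = JointNet()
h_ti, text_dist = joint(torch.randn(1, 512), torch.randn(1, 256))
```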
  • the fusion feature vector h_ti is input into the target emotion recognition model (Mood Classifier) for classification to obtain the classification result, and the classification result is determined as the speech emotion recognition result of the current audio frame.
  • the target emotion classification model is obtained by training a preset classification neural network model through an emotion classification training set.
  • the emotion classification training set includes multiple emotion classification training samples.
  • the emotion classification training samples include emotion sample characteristics and corresponding sample labels.
  • the preset classification neural network model can be a DNN. Sample labels can include multiple categories such as happiness, sadness, anger, disgust, fear, surprise, etc.
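  • A minimal sketch of a DNN emotion classifier of the kind described; the six example labels follow the list above, while the layer sizes and input dimension are assumptions:

```python
import torch
import torch.nn as nn

EMOTIONS = ["happiness", "sadness", "anger", "disgust", "fear", "surprise"]

classifier = nn.Sequential(                 # preset classification neural network (DNN)
    nn.Linear(512, 256), nn.ReLU(),
    nn.Linear(256, len(EMOTIONS)))

criterion = nn.CrossEntropyLoss()
h = torch.randn(8, 512)                     # a batch of fusion/target feature vectors
labels = torch.randint(0, len(EMOTIONS), (8,))
loss = criterion(classifier(h), labels)     # training step on the emotion classification set
pred = EMOTIONS[int(classifier(h[:1]).argmax())]   # speech emotion recognition result
```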
  • speech emotion recognition is performed based on the second audio feature encoding and the fusion feature vector to obtain the speech emotion recognition result of the current audio frame, which may include:
  • the Teager energy operator is a nonlinear operator that can track the instantaneous energy of a signal.
  • For a discrete signal, the Teager energy operator is commonly defined as ψ[x(n)] = x²(n) − x(n+1)·x(n−1), where x(n) is the signal of the current audio frame and ψ[·] is the Teager energy operator.
  • Feature extraction is performed on the current audio frame based on the Teager energy operator to obtain the second audio feature x2_ti.
  • the second audio feature x2_ti is of higher order and has richer features than the first audio feature x1_ti.
  • introducing high-order features can improve the representation ability of the speech emotion feature vector and improve the accuracy of emotion classification.
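  • A small sketch of deriving a frame-level second audio feature from the Teager energy operator defined above; summarizing the per-sample TEO values with simple statistics is an assumption for illustration:

```python
import numpy as np

def teager_energy(x):
    """Teager energy operator: psi[x(n)] = x(n)^2 - x(n+1) * x(n-1)."""
    return x[1:-1] ** 2 - x[2:] * x[:-2]

def teo_feature(frame):
    """Illustrative higher-order feature x2_ti: statistics of the TEO profile of one frame."""
    teo = teager_energy(frame)
    return np.array([teo.mean(), teo.std(), teo.max(), teo.min()])
```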
  • the second audio feature x2_ti is input to the second acoustic model Encoder-2 for encoding to obtain the second audio feature encoding of the current audio frame.
  • the second acoustic model Encoder-2 can be a hidden Markov model (HMM), a deep neural network (DNN), a convolutional neural network (CNN), a recurrent neural network (RNN), etc.
  • the second acoustic model Encoder-2 can be a BiLSTM model.
  • Figure 4 is a schematic flow chart of performing voice emotion recognition based on the second audio feature encoding and fusion feature vector to obtain the voice emotion recognition result of the current audio frame in one embodiment of the voice emotion recognition method provided in the embodiment of the present application.
  • performing speech emotion recognition based on the second audio feature encoding and fusion feature vector to obtain the speech emotion recognition result of the current audio frame may include S301-S303:
  • the preset attention layer is a model based on the Attention mechanism.
  • the general definition of Attention is as follows: given a set of vectors Value and a query vector Query, the Attention mechanism computes a weighted sum of the Values according to the Query.
  • the specific calculation process of the Attention mechanism can be summarized into two processes: the first process is to calculate the weight coefficient based on Query and Key, and the second process is to perform a weighted sum of Value based on the weight coefficient.
  • the first process can be subdivided into two stages: the first stage calculates the similarity or correlation between the two based on Query and Key; the second stage normalizes the original scores of the first stage.
  • the preset attention layer may be a self-attention layer.
  • adjusting the weight of the fused feature vector based on the preset attention layer to obtain the first target feature vector may include: obtaining three vectors, Query, Key and Value, from the fused feature vector; and inputting Query, Key and Value into the preset attention layer to adjust the weight of the fused feature vector and obtain the first target feature vector.
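  • A minimal sketch of the Query/Key/Value weighted-sum computation described above, in scaled dot-product form (one common realization); projecting the fusion feature vectors into Query, Key and Value with linear layers is shown as an assumption:

```python
import torch
import torch.nn as nn

class SimpleAttention(nn.Module):
    """Adjust the weights of fusion feature vectors via a Query/Key/Value weighted sum."""
    def __init__(self, dim=512):
        super().__init__()
        self.q = nn.Linear(dim, dim)   # Query projection
        self.k = nn.Linear(dim, dim)   # Key projection
        self.v = nn.Linear(dim, dim)   # Value projection

    def forward(self, h):              # h: (batch, frames, dim) fusion feature vectors
        Q, K, V = self.q(h), self.k(h), self.v(h)
        scores = Q @ K.transpose(-2, -1) / (h.size(-1) ** 0.5)  # Query-Key similarity
        weights = scores.softmax(dim=-1)                        # normalized weight coefficients
        return weights @ V                                      # weighted sum of Value

attention = SimpleAttention()
c_ti = attention(torch.randn(1, 10, 512))   # weight-adjusted first target feature vectors
```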
  • adjusting the weight of the fusion feature vector based on the preset attention layer to obtain the first target feature vector may include:
  • the fusion feature vector h_ti is input into the Softmax layer to obtain the predicted text probability distribution P(y_ui ...) of the current audio frame.
  • the text feature information of the current audio frame is determined according to the predicted text probability distribution P(y_ui ...).
  • For the next audio frame after the current audio frame, prediction is performed based on the first audio feature encoding of that frame and the text feature information of the current audio frame.
  • the weight of the fusion feature vector is adjusted to obtain the first target feature vector c_ti.
  • the first target feature vector c_ti and the second audio feature encoding are input into the preset shared network model Joint Net to obtain the second target feature vector.
  • the preset shared network model is a Transformer network.
  • the preset shared network model can also be a BiLSTM model.
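  • A hedged end-to-end sketch of S302–S303: the second audio feature encoding and the attention-adjusted first target feature vector c_ti are combined by a shared network to form the second target feature vector, which is then classified; the module sizes and the feed-forward form of the shared network are assumptions:

```python
import torch
import torch.nn as nn

# Combine the second audio feature encoding with c_ti, then classify the emotion.
joint2 = nn.Sequential(nn.Linear(512 + 512, 512), nn.Tanh())   # preset shared network (assumed sizes)
emotion_head = nn.Linear(512, 6)                               # target emotion recognition model

enc2_ti = torch.randn(1, 512)            # second audio feature encoding of the current frame
c_ti = torch.randn(1, 512)               # first target feature vector from the attention layer
z_ti = joint2(torch.cat([enc2_ti, c_ti], dim=-1))    # second target feature vector
emotion = emotion_head(z_ti).softmax(-1)             # speech emotion recognition result
```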
  • model training of this application is divided into 4 stages:
  • In the first stage, pre-training and transfer learning are performed for the first acoustic model Encoder-1, the second acoustic model Encoder-2, and the preset language model.
  • A general BERT pre-trained language model is trained and fine-tuned on massive text data from the telephone customer service domain to obtain the pre-trained language model.
  • In the second stage, the first acoustic model Encoder-1, the second acoustic model Encoder-2, the pre-trained language model and the preset shared network model are trained respectively.
  • the first acoustic model Encoder-1 and the second acoustic model Encoder-2 are first aligned end-to-end through the CTC network.
  • the entire training includes two stages: the pre-training stage and the fine-tuning stage.
  • the pre-training stage only involves training the first acoustic model Encoder-1, the second acoustic model Encoder-2, and the language model.
  • In the third stage, features are extracted through the second acoustic model Encoder-2 and the target emotion recognition model is pre-trained to obtain a pre-trained speech emotion classification model.
  • In the fourth stage, on the basis of the pre-training, the model is fine-tuned on speech-text data with emotion labels, so that joint training on speech and language data is achieved to obtain the final target emotion classification model. For the emotion recognition task, the language model and the acoustic model are jointly trained and then trained in parallel with the ASR task; this makes full use of public data in the ASR field and improves the representation ability of the acoustic model.
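  • A heavily hedged sketch of how such staged training could be organized in PyTorch, using parameter freezing per stage; which modules are updated in each stage follows the description above only loosely, and the losses, optimizers and data are not specified here:

```python
import torch.nn as nn

def set_trainable(module: nn.Module, trainable: bool) -> None:
    """Freeze or unfreeze one sub-model for a given training stage."""
    for p in module.parameters():
        p.requires_grad = trainable

def configure_stage(stage, encoder1, encoder2, language_model, joint_net, emotion_head):
    """Illustrative stage switches; the exact per-stage module selection is an assumption."""
    everything = [encoder1, encoder2, language_model, joint_net, emotion_head]
    for m in everything:
        set_trainable(m, False)
    if stage == 1:      # pre-training / transfer learning of the encoders and language model
        for m in (encoder1, encoder2, language_model):
            set_trainable(m, True)
    elif stage == 2:    # train encoders, language model and shared network (CTC-aligned)
        for m in (encoder1, encoder2, language_model, joint_net):
            set_trainable(m, True)
    elif stage == 3:    # pre-train the emotion classifier on Encoder-2 features
        set_trainable(emotion_head, True)
    else:               # stage 4: joint fine-tuning on emotion-labelled speech-text data
        for m in everything:
            set_trainable(m, True)
```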
  • the present application provides a speech emotion recognition method and device.
  • the speech emotion recognition method includes: obtaining the first audio feature encoding of the current audio frame and the text feature information of the historical audio frame, where the historical audio frame precedes the current audio frame; predicting the text feature encoding of the current audio frame based on the text feature information of the historical audio frame; fusing the first audio feature encoding and the text feature encoding of the current audio frame to obtain a fused feature vector; and performing speech emotion recognition based on the fused feature vector to obtain the speech emotion recognition result of the current audio frame.
  • This application first uses the text feature information of the historical audio frame to predict the text feature encoding of the current audio frame, and then fuses the text feature encoding of the current audio frame with the first audio feature encoding to perform speech emotion recognition. Deeply fusing the audio information with the text information can improve the accuracy of speech emotion recognition.
  • This application implements deep fusion representation of information based on joint task unified modeling. Through joint task learning, the information of emotional acoustic features and language features is integrated, effectively improving the accuracy of emotion recognition.
  • the embodiment of the present application also provides a voice emotion recognition device.
  • the voice emotion recognition device 500 includes:
  • the acquisition unit 501 is used to acquire the first audio feature code of the current audio frame and the text feature information of the historical audio frame, where the historical audio frame is before the current audio frame;
  • the prediction unit 502 is used to predict the text feature encoding of the current audio frame based on the text feature information of the historical audio frame;
  • the fusion unit 503 is used to fuse the first audio feature encoding and the text feature encoding of the current audio frame to obtain the fusion feature vector;
  • the recognition unit 504 is used to perform speech emotion recognition based on the fusion feature vector to obtain the speech emotion recognition result of the current audio frame.
  • the identification unit 504 is used for:
  • Feature extraction is performed on the current audio frame based on the Teager energy operator to obtain the second audio feature
  • Speech emotion recognition is performed based on the second audio feature encoding and fusion feature vector, and the speech emotion recognition result of the current audio frame is obtained.
  • the identification unit 504 is used for:
  • the second target feature vector is input into the target emotion recognition model to obtain the speech emotion recognition result of the current audio frame.
  • the identification unit 504 is used for:
  • the acquisition unit 501 is used for:
  • the acquisition unit 501 is used for:
  • Text feature information of the historical audio frame is determined based on the first audio feature coding and the preset text feature coding of the historical audio frame.
  • the acquisition unit 501 is used for:
  • An embodiment of the present application also provides a computer device that integrates any voice emotion recognition device provided by the embodiment of the present application.
  • the computer device includes:
  • one or more processors and a memory; and
  • one or more application programs, wherein the one or more application programs are stored in the memory and configured to be executed by the processor to implement the steps of the voice emotion recognition method in any of the above voice emotion recognition method embodiments.
  • FIG. 6 shows a schematic structural diagram of the computer equipment involved in the embodiment of the present application. Specifically:
  • the computer device may include components such as a processor 601 of one or more processing cores, a memory 602 of one or more computer-readable storage media, a power supply 603, and an input unit 604.
  • the processor 601 is the control center of the computer device; it uses various interfaces and lines to connect the various parts of the entire computer device, and by running or executing the software programs and/or modules stored in the memory 602 and calling the data stored in the memory 602, it performs the various functions of the computer device and processes data, so as to monitor the computer device as a whole.
  • the processor 601 may include one or more processing cores; the processor 601 may be a central processing unit (CPU), another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc.
  • CPU Central Processing Unit
  • DSP Digital Signal Processor
  • ASIC Application Specific Integrated Circuit
  • FPGA Field-Programmable Gate Array
  • the general-purpose processor can be a microprocessor or the processor can be any conventional processor, etc.
  • the processor 601 can integrate an application processor and a modem processor, where the application processor mainly handles the operating system, user interfaces, applications, etc.
  • the modem processor mainly handles wireless communications. It can be understood that the above modem processor may not be integrated into the processor 601.
  • the memory 602 can be used to store software programs and modules.
  • the processor 601 executes various functional applications and data processing by running the software programs and modules stored in the memory 602 .
  • the memory 602 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function (such as a sound playback function, an image playback function, etc.), etc.; the storage data area may store data created according to the use of the computer device, etc.
  • the memory 602 may include high-speed random access memory and may also include non-volatile memory, such as at least one disk storage device, flash memory device, or other non-volatile solid-state storage device. Accordingly, the memory 602 may also include a memory controller to provide the processor 601 with access to the memory 602.
  • the computer equipment also includes a power supply 603 that supplies power to various components.
  • the power supply 603 can be logically connected to the processor 601 through a power management system, so that functions such as charging, discharging, and power consumption management can be implemented through the power management system.
  • the power supply 603 may also include one or more DC or AC power supplies, recharging systems, power failure detection circuits, power converters or inverters, power status indicators, and other arbitrary components.
  • the computer device may also include an input unit 604 operable to receive input numeric or character information and generate keyboard, mouse, joystick, optical, or trackball signal inputs related to user settings and functional controls.
  • the computer device may also include a display unit and the like, which will not be described again here.
  • the processor 601 in the computer device loads the executable files corresponding to the processes of one or more application programs into the memory 602 according to the following instructions, and the processor 601 runs the application programs stored in the memory 602 to implement various functions, as follows:
  • obtain the first audio feature encoding of the current audio frame and the text feature information of the historical audio frame, where the historical audio frame precedes the current audio frame; predict the text feature encoding of the current audio frame based on the text feature information of the historical audio frame; fuse the first audio feature encoding and the text feature encoding of the current audio frame to obtain a fusion feature vector; and perform speech emotion recognition based on the fusion feature vector to obtain the speech emotion recognition result of the current audio frame.
  • embodiments of the present application provide a computer-readable storage medium, which may include: read-only memory (ROM), random access memory (RAM), magnetic disk, optical disk, etc.
  • a computer program is stored thereon, and the computer program is loaded by the processor to execute the steps in any of the speech emotion recognition methods provided by the embodiments of the present application.
  • a computer program loaded by a processor may perform the following steps:
  • obtain the first audio feature encoding of the current audio frame and the text feature information of the historical audio frame, where the historical audio frame precedes the current audio frame; predict the text feature encoding of the current audio frame based on the text feature information of the historical audio frame; fuse the first audio feature encoding and the text feature encoding of the current audio frame to obtain a fusion feature vector; and perform speech emotion recognition based on the fusion feature vector to obtain the speech emotion recognition result of the current audio frame.
  • each of the above units or structures can be implemented as an independent entity, or can be combined in any way and implemented as one or several entities.
  • For the specific implementation of each of the above units or structures, please refer to the foregoing method embodiments, which will not be repeated here.

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • General Health & Medical Sciences (AREA)
  • Hospice & Palliative Care (AREA)
  • Psychiatry (AREA)
  • Child & Adolescent Psychology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present application relates to a speech emotion recognition method and apparatus. The method comprises: obtaining a first audio feature encoding of a current audio frame and text feature information of a historical audio frame, the historical audio frame preceding the current audio frame; predicting a text feature encoding of the current audio frame according to the text feature information of the historical audio frame; fusing the first audio feature encoding and the text feature encoding of the current audio frame to obtain a fused feature vector; and performing speech emotion recognition according to the fused feature vector to obtain a speech emotion recognition result of the current audio frame. The present application uses the text feature information of the historical audio frame to predict the text feature encoding of the current audio frame and, after fusing the first audio feature encoding and the text feature encoding of the current audio frame, performs speech emotion recognition; audio information and text information are deeply fused, and the accuracy of speech emotion recognition can be improved.
PCT/CN2023/117475 2022-07-08 2023-09-07 Speech emotion recognition method and apparatus WO2024008215A2 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210806418.6 2022-07-08
CN202210806418.6A CN117409818A (zh) 2022-07-08 2022-07-08 语音情绪识别方法及装置

Publications (2)

Publication Number Publication Date
WO2024008215A2 true WO2024008215A2 (fr) 2024-01-11
WO2024008215A3 WO2024008215A3 (fr) 2024-02-29

Family

ID=89454303

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/117475 WO2024008215A2 (fr) 2022-07-08 2023-09-07 Speech emotion recognition method and apparatus

Country Status (2)

Country Link
CN (1) CN117409818A (fr)
WO (1) WO2024008215A2 (fr)

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108305642B (zh) * 2017-06-30 2019-07-19 腾讯科技(深圳)有限公司 情感信息的确定方法和装置
US11205444B2 (en) * 2019-08-16 2021-12-21 Adobe Inc. Utilizing bi-directional recurrent encoders with multi-hop attention for speech emotion recognition
CN110570879A (zh) * 2019-09-11 2019-12-13 深圳壹账通智能科技有限公司 基于情绪识别的智能会话方法、装置及计算机设备
CN111028827B (zh) * 2019-12-10 2023-01-24 深圳追一科技有限公司 基于情绪识别的交互处理方法、装置、设备和存储介质
CN111524534B (zh) * 2020-03-20 2021-04-09 北京捷通华声科技股份有限公司 一种语音分析方法、系统、设备及存储介质
CN113506586B (zh) * 2021-06-18 2023-06-20 杭州摸象大数据科技有限公司 用户情绪识别的方法和系统
CN114022192A (zh) * 2021-10-20 2022-02-08 百融云创科技股份有限公司 一种基于智能营销场景的数据建模方法及系统
CN114492579A (zh) * 2021-12-25 2022-05-13 浙江大华技术股份有限公司 情绪识别方法、摄像装置、情绪识别装置及存储装置
CN114639150A (zh) * 2022-03-16 2022-06-17 平安科技(深圳)有限公司 情绪识别方法、装置、计算机设备和存储介质

Also Published As

Publication number Publication date
WO2024008215A3 (fr) 2024-02-29
CN117409818A (zh) 2024-01-16

Similar Documents

Publication Publication Date Title
US11450312B2 (en) Speech recognition method, apparatus, and device, and storage medium
WO2021093449A1 (fr) Procédé et appareil de détection de mot de réveil employant l'intelligence artificielle, dispositif, et support
US20200321008A1 (en) Voiceprint recognition method and device based on memory bottleneck feature
CN110534099B (zh) 语音唤醒处理方法、装置、存储介质及电子设备
CN111312245B (zh) 一种语音应答方法、装置和存储介质
EP3355303A1 (fr) Procédé et appareil de reconnaissance vocale
CN108735201A (zh) 连续语音识别方法、装置、设备和存储介质
WO2019037700A1 (fr) Procédé et appareil de détection d'émotion dans la parole, dispositif informatique, et support de stockage
US20240021202A1 (en) Method and apparatus for recognizing voice, electronic device and medium
CN111832308B (zh) 语音识别文本连贯性处理方法和装置
WO2022178969A1 (fr) Procédé et appareil de traitement de données vocales de conversation, dispositif informatique et support de stockage
WO2022105553A1 (fr) Procédé et appareil de synthèse de la parole, support lisible et dispositif électronique
CN111081230A (zh) 语音识别方法和设备
CN113314119B (zh) 语音识别智能家居控制方法及装置
WO2020220824A1 (fr) Procédé et dispositif de reconnaissance vocale
CN112259089A (zh) 语音识别方法及装置
KR20220130565A (ko) 키워드 검출 방법 및 장치
CN114127849A (zh) 语音情感识别方法和装置
CN113129867A (zh) 语音识别模型的训练方法、语音识别方法、装置和设备
Gupta et al. Speech emotion recognition using SVM with thresholding fusion
CN112489623A (zh) 语种识别模型的训练方法、语种识别方法及相关设备
CN113468857B (zh) 风格转换模型的训练方法、装置、电子设备以及存储介质
CN116090474A (zh) 对话情绪分析方法、装置和计算机可读存储介质
CN114022192A (zh) 一种基于智能营销场景的数据建模方法及系统
WO2024008215A2 (fr) Speech emotion recognition method and apparatus

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23834980

Country of ref document: EP

Kind code of ref document: A2