WO2021127982A1 - Speech emotion recognition method, intelligent device, and computer-readable storage medium - Google Patents

Speech emotion recognition method, intelligent device, and computer-readable storage medium

Info

Publication number
WO2021127982A1
WO2021127982A1 PCT/CN2019/127923 CN2019127923W WO2021127982A1 WO 2021127982 A1 WO2021127982 A1 WO 2021127982A1 CN 2019127923 W CN2019127923 W CN 2019127923W WO 2021127982 A1 WO2021127982 A1 WO 2021127982A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
recognized
neural network
emotion
emotion recognition
Prior art date
Application number
PCT/CN2019/127923
Other languages
English (en)
Chinese (zh)
Inventor
李柏
丁万
黄东延
熊友军
Original Assignee
深圳市优必选科技股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳市优必选科技股份有限公司 filed Critical 深圳市优必选科技股份有限公司
Priority to PCT/CN2019/127923 priority Critical patent/WO2021127982A1/fr
Priority to CN201980003195.6A priority patent/CN111357051B/zh
Publication of WO2021127982A1 publication Critical patent/WO2021127982A1/fr

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/63: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Definitions

  • the present invention relates to the field of artificial intelligence technology, in particular to a voice emotion recognition method, an intelligent device and a computer-readable storage medium.
  • the main task of speech emotion recognition is to extract the emotion information contained in the speech and identify the emotion category.
  • convolutional neural networks and recurrent neural networks are often used for speech emotion recognition.
  • convolutional neural networks have two inherent drawbacks, namely translation invariance and the pooling layer, both of which cause valuable information to be lost and lead to a low recognition rate.
  • recurrent neural networks have limited ability to remember long-range information.
  • a voice emotion recognition method, comprising: acquiring voice data to be recognized and extracting low-level feature data of the voice data to be recognized; inputting the low-level feature data into a pre-trained feature extraction network to obtain high-level feature data of the voice data to be recognized, the feature extraction network including at least two layers of neural networks, one of which is a capsule neural network; and inputting the high-level feature data into a pre-trained emotion recognition neural network to recognize the emotion data of the voice data to be recognized.
  • An intelligent device includes: an acquisition module for acquiring voice data to be recognized and extracting low-level feature data of the voice data to be recognized; a feature extraction module for inputting the low-level feature data into a pre-trained feature extraction network to obtain high-level feature data of the voice data to be recognized, the feature extraction network including at least two layers of neural networks, one of which is a capsule neural network; and a recognition module for inputting the high-level feature data into the pre-trained emotion recognition neural network and obtaining the emotion data of the voice data to be recognized according to the output result of the emotion recognition neural network.
  • An intelligent device includes: an acquisition circuit, a processor, and a memory; the processor is coupled to the memory and the acquisition circuit, a computer program is stored in the memory, and the processor executes the computer program to implement the method described above.
  • a computer-readable storage medium stores a computer program, and the computer program can be executed by a processor to implement the above-mentioned method.
  • the feature extraction network includes at least two layers of neural networks, one of which is the capsule neural network; the capsule network can carry more feature information and has excellent generalization ability.
  • the extracted high-level information therefore includes more feature information.
  • inputting this high-level information, which includes more feature information, into the pre-trained emotion recognition neural network makes the output result of the emotion recognition neural network more accurate, so that more accurate emotion data of the voice data to be recognized can be obtained according to the output result of the emotion recognition neural network, which effectively improves the accuracy of emotion recognition.
  • FIG. 1 is an application environment diagram of a voice emotion recognition method in an embodiment of the present invention.
  • FIG. 2 is a schematic flowchart of a first embodiment of the voice emotion recognition method provided by the present invention.
  • FIG. 3 is a schematic diagram of the principle of a capsule neural network.
  • FIG. 4 is a schematic flowchart of a second embodiment of the voice emotion recognition method provided by the present invention.
  • FIG. 5 is a schematic flowchart of a third embodiment of the voice emotion recognition method provided by the present invention.
  • FIG. 6 is a schematic diagram of the principle of the attention mechanism.
  • FIG. 7 is a schematic structural diagram of a first embodiment of a smart device provided by the present invention.
  • FIG. 8 is a schematic structural diagram of a second embodiment of a smart device provided by the present invention.
  • FIG. 9 is a schematic structural diagram of an embodiment of a computer-readable storage medium provided by the present invention.
  • convolutional neural networks and recurrent neural networks are often used for speech emotion recognition.
  • convolutional neural networks have two inherent drawbacks, namely translation invariance and the pooling layer, both of which cause valuable information to be lost and lead to a low recognition rate.
  • recurrent neural networks have limited ability to remember long-range information.
  • a voice emotion recognition method which can effectively improve the accuracy of emotion recognition.
  • FIG. 1 is an application environment diagram of a voice emotion recognition method in an embodiment of the present invention.
  • the voice emotion recognition method is applied to a voice emotion recognition system.
  • the voice emotion recognition system includes a terminal 110 and a server 120.
  • the terminal 110 and the server 120 are connected through a network.
  • the terminal 110 may specifically be a desktop terminal or a mobile terminal, and the mobile terminal may specifically be at least one of a mobile phone, a tablet computer, and a notebook computer.
  • the server 120 may be implemented as an independent server or a server cluster composed of multiple servers.
  • the terminal 110 is used to obtain the voice data to be recognized
  • the server 120 is used to extract the low-level feature data of the voice data to be recognized.
  • the low-level feature data is input into the pre-trained feature extraction network to obtain the high-level feature data of the voice data to be recognized.
  • the feature extraction network includes at least two layers of neural networks, one of which is a capsule neural network; the high-level feature data is input into the pre-trained emotion recognition neural network, and the emotion data of the speech data to be recognized is obtained according to the output result of the emotion recognition neural network.
  • FIG. 2 is a schematic flowchart of a first embodiment of a voice emotion recognition method provided by the present invention.
  • the voice emotion recognition method provided by the present invention includes the following steps:
  • S101 Acquire voice data to be recognized, and extract low-level feature data of the voice data to be recognized.
  • the voice data to be recognized is obtained.
  • the voice data to be recognized may be recorded on site by the user, or extracted from a database, or intercepted from a certain piece of audio.
  • the voice data to be recognized may be sent by the user terminal or actively acquired by the smart terminal.
  • the low-level feature data of the voice data to be recognized are extracted, such as the frequency, amplitude, duration, pitch, etc. of the voice data to be recognized.
  • the low-level feature data of the voice data to be recognized can be obtained through tool software, for example through the openSMILE software. openSMILE is a command-line tool that is mainly used to extract audio features according to a config file.
  • the voice data to be recognized may also be input to a pre-trained low-level feature extraction neural network, and the output result of the low-level feature extraction neural network may be used as the low-level feature data of the voice data to be recognized.
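  • Purely as an illustration (not part of the original disclosure), the following Python sketch extracts a few of the low-level features mentioned above, such as pitch, amplitude, and duration, using the librosa library; the file name and parameter values are assumptions.

```python
# Illustrative sketch only: extract simple low-level features (pitch,
# amplitude/RMS energy, duration) from the speech data to be recognized.
import librosa
import numpy as np

def extract_low_level_features(path: str, sr: int = 16000) -> dict:
    y, sr = librosa.load(path, sr=sr)                 # waveform of the speech to be recognized
    duration = librosa.get_duration(y=y, sr=sr)       # total duration in seconds
    rms = librosa.feature.rms(y=y)[0]                 # frame-level amplitude (RMS energy)
    f0 = librosa.yin(y, fmin=50, fmax=500, sr=sr)     # frame-level fundamental frequency (pitch)
    return {
        "duration": duration,
        "rms_mean": float(np.mean(rms)),
        "f0_mean": float(np.nanmean(f0)),
    }

features = extract_low_level_features("speech_to_recognize.wav")  # hypothetical file name
print(features)
```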
  • S102 Input low-level feature data into a pre-trained feature extraction network to obtain high-level feature data of the voice data to be recognized.
  • the feature extraction network includes at least two layers of neural networks, one of which is a capsule neural network.
  • the low-level feature data of the voice data to be recognized is input into the pre-trained feature extraction network to obtain the high-level feature data of the voice data to be recognized.
  • the high-level feature data is the Mel frequency cepstrum.
  • the Mel-Frequency Cepstrum is a linear transformation of the logarithmic energy spectrum based on the non-linear Mel scale of sound frequency. The frequency bands of the Mel cepstrum are divided at equal intervals on the Mel scale, which approximates the human auditory system more closely than the linearly spaced frequency bands used in the normal cepstrum.
  • Mel-frequency cepstral coefficients (MFCC) are widely used in speech recognition.
  • the feature extraction network includes at least two layers of neural networks, one of which is the capsule neural network.
  • Each neuron in the capsule network is a vector.
  • This vector can not only represent the basic characteristics of the data but also include many detailed features of the data, so it can carry more feature information and has excellent generalization ability, which makes it very suitable for the voice field: because voice features are very subtle, using a capsule neural network retains more of the feature information in the low-level feature data.
  • Figure 3 is a schematic diagram of the principle of the capsule neural network.
  • the final output vector $V_j$ of the $j$-th capsule has a length between 0 and 1 and is obtained by applying the squashing function to the capsule's input vector $S_j$: $V_j = \dfrac{\|S_j\|^2}{1+\|S_j\|^2} \cdot \dfrac{S_j}{\|S_j\|}$, where $j$ denotes the $j$-th capsule, $V_j$ is the output vector of the $j$-th capsule, $S_j$ is the input vector of the $j$-th capsule, and $\|S_j\|$ is the modulus (length) of the vector $S_j$.
  • both the inputs $U_i$ and the outputs $V_j$ are vectors.
  • the coupling coefficients $C_{ij}$ are calculated from the logits $b_{ij}$, for example $C_{ij} = \operatorname{softmax}_j(b_{ij})$, and the update of $b_{ij}$ is the core of the capsule network and of the dynamic routing algorithm.
  • the update formula of $b_{ij}$ is $b_{ij} \leftarrow b_{ij} + \hat{U}_{j|i} \cdot V_j$: calculating the inner product changes $b_{ij}$, which in turn changes $C_{ij}$.
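  • As an illustrative sketch only (array shapes, the routing iteration count, and helper names are assumptions, not the patented implementation), the squashing function and dynamic routing update described above can be written as:

```python
# Minimal NumPy sketch of capsule squashing and dynamic routing.
import numpy as np

def squash(s, eps=1e-8):
    """V_j = (||S_j||^2 / (1 + ||S_j||^2)) * (S_j / ||S_j||), applied per capsule."""
    norm = np.linalg.norm(s, axis=-1, keepdims=True)
    return (norm ** 2 / (1.0 + norm ** 2)) * (s / (norm + eps))

def dynamic_routing(u_hat, num_iterations=3):
    """u_hat: prediction vectors, shape (num_input_capsules, num_output_capsules, dim)."""
    num_in, num_out, _ = u_hat.shape
    b = np.zeros((num_in, num_out))                            # routing logits b_ij
    for _ in range(num_iterations):
        c = np.exp(b) / np.exp(b).sum(axis=1, keepdims=True)   # C_ij = softmax of b_ij
        s = (c[..., None] * u_hat).sum(axis=0)                 # S_j = sum_i C_ij * u_hat_{j|i}
        v = squash(s)                                          # V_j = squash(S_j)
        b = b + np.einsum('iod,od->io', u_hat, v)              # b_ij += u_hat_{j|i} . V_j
    return v

# toy usage: route 8 input capsules to 4 output capsules of dimension 16
v = dynamic_routing(np.random.randn(8, 4, 16))
print(v.shape)  # (4, 16)
```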
  • the low-level feature data of the voice data to be recognized is input into the pre-trained feature extraction network to obtain the Mel frequency cepstrum of the voice data to be recognized.
  • operations such as high-pass filtering (pre-emphasis), Fourier transform, Mel filtering, and the inverse discrete Fourier transform are used to obtain the Mel frequency cepstrum of the speech data to be recognized.
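  • For illustration only, the classical transform chain just mentioned can be reproduced with librosa as follows; in the patent the Mel frequency cepstrum is produced by the trained capsule-based feature extraction network, so this is simply a sketch of the underlying signal processing:

```python
# Illustration: pre-emphasis (high-pass filtering), then STFT -> Mel
# filterbank -> log -> DCT, which is what librosa.feature.mfcc computes.
import librosa
import numpy as np

y, sr = librosa.load("speech_to_recognize.wav", sr=16000)   # hypothetical input file
y = np.append(y[0], y[1:] - 0.97 * y[:-1])                   # pre-emphasis (high-pass filtering)
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)           # Mel-frequency cepstral coefficients
print(mfcc.shape)  # (13, number_of_frames)
```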
  • S103 Input the high-level feature data into the pre-trained emotion recognition neural network, and obtain emotion data of the voice data to be recognized according to the output result of the emotion recognition neural network.
  • the high-level feature data of the voice data to be recognized, such as the Mel frequency cepstrum, is input into the pre-trained emotion recognition neural network, and the emotion data of the voice data to be recognized is obtained according to the output result of the emotion recognition network.
  • the emotion recognition network needs to be trained.
  • Prepare multiple pieces of training high-level feature data, for example multiple Mel frequency cepstrums, and label each piece of training high-level feature data with its emotion data.
  • Define the structure of the emotion recognition neural network to be trained: the number of layers can be defined, such as 2 layers, and the type of the emotion recognition neural network can be defined, such as a fully connected neural network, a bidirectional long short-term memory neural network, etc.
  • Define the training loss function and the termination conditions, for example, training is terminated after 2000 iterations. Input the multiple pieces of high-level feature data and their corresponding emotion data into the emotion recognition neural network for training, as sketched below.
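  • As an illustration only (layer sizes, the optimizer, and the emotion label set are assumptions, not the patented configuration), a minimal training loop for such a 2-layer emotion recognition network might look like this:

```python
# Sketch of training a small emotion recognition network on labeled
# high-level features (e.g. MFCC vectors); all hyperparameters are assumed.
import torch
import torch.nn as nn

NUM_EMOTIONS = 4      # e.g. neutral, happy, angry, sad (assumed label set)
FEATURE_DIM = 13      # e.g. 13 MFCC coefficients per training example (assumed)

model = nn.Sequential(                      # a 2-layer network, as mentioned above
    nn.Linear(FEATURE_DIM, 64),
    nn.ReLU(),
    nn.Linear(64, NUM_EMOTIONS),
)
loss_fn = nn.CrossEntropyLoss()             # training loss function
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

features = torch.randn(256, FEATURE_DIM)            # stand-in for labeled training features
labels = torch.randint(0, NUM_EMOTIONS, (256,))     # stand-in for the emotion labels

for step in range(2000):                    # termination condition: stop after 2000 iterations
    optimizer.zero_grad()
    loss = loss_fn(model(features), labels)
    loss.backward()
    optimizer.step()
```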
  • the low-level feature data is input into the pre-trained feature extraction network to obtain the high-level feature data of the voice data to be recognized.
  • the feature extraction network includes at least two layers of neural networks, one of which is the capsule neural network.
  • the capsule network can carry more feature information and has excellent generalization ability.
  • the extracted high-level information therefore includes more feature information; inputting this high-level information into the pre-trained emotion recognition neural network makes the output result of the emotion recognition neural network more accurate, so that more accurate emotion data of the voice data to be recognized can be obtained according to the output result of the emotion recognition neural network, which can effectively improve the accuracy of emotion recognition.
  • FIG. 4 is a schematic flowchart of a second embodiment of a voice emotion recognition method provided by the present invention.
  • the voice emotion recognition method provided by the present invention includes the following steps:
  • S201 Acquire voice data to be recognized, and extract low-level feature data of the voice data to be recognized.
  • this step is basically the same as step S101 in the first embodiment of the voice emotion recognition method provided by the present invention, and will not be repeated here.
  • S202 Input low-level feature data into a pre-trained feature extraction network to obtain high-level feature data of the voice data to be recognized.
  • the feature extraction network includes at least two layers of neural networks, one of which is a capsule neural network and another of which is a bidirectional long short-term memory neural network.
  • low-level feature data is input to a pre-trained feature extraction network, which includes a layer of convolutional neural network and a layer of capsule neural network.
  • A Convolutional Neural Network is a type of feedforward neural network that includes convolution calculations and has a deep structure; it is one of the representative algorithms of deep learning. Convolutional neural networks have the ability of representation learning and can perform shift-invariant classification of input information according to their hierarchical structure, so they are also called "Shift-Invariant Artificial Neural Networks (SIANN)".
  • the convolutional neural network mimics the biological visual perception mechanism and can perform both supervised and unsupervised learning.
  • the sharing of convolution kernel parameters within hidden layers and the sparsity of inter-layer connections allow the convolutional neural network to learn grid-like topological features, such as pixels and audio, with a small amount of computation, with stable results and without additional feature engineering requirements on the data.
  • the low-level feature data of the voice data to be recognized is input into the convolutional neural network to obtain the middle-level feature data of the voice data to be recognized.
  • the middle-level feature data of the voice data to be recognized is then input into the capsule neural network to obtain the high-level feature data of the voice data to be recognized; a minimal sketch of this two-stage pipeline is given below.
  • Each neuron in the capsule network is a vector. This vector can not only represent the basic characteristics of the data but can also include many detailed features of the data, so it can carry more feature information and has excellent generalization ability, which makes it very suitable for the speech field: because voice features are very subtle, using capsule neural networks retains more of the feature information in the low-level feature data.
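  • The sketch below (layer sizes and the routing-free "primary capsule" reshaping are illustrative assumptions, not the patented design) shows the two-stage idea: a convolutional layer produces middle-level features, and a capsule layer turns them into high-level feature vectors.

```python
# Sketch: convolutional layer -> capsule layer for feature extraction.
import torch
import torch.nn as nn

class ConvCapsuleExtractor(nn.Module):
    def __init__(self, in_channels=1, num_capsules=8, capsule_dim=16):
        super().__init__()
        self.conv = nn.Conv1d(in_channels, num_capsules * capsule_dim,
                              kernel_size=9, stride=2, padding=4)
        self.num_capsules = num_capsules
        self.capsule_dim = capsule_dim

    @staticmethod
    def squash(s, eps=1e-8):
        norm = s.norm(dim=-1, keepdim=True)
        return (norm ** 2 / (1 + norm ** 2)) * (s / (norm + eps))

    def forward(self, x):                        # x: (batch, 1, time) low-level features
        mid = torch.relu(self.conv(x))           # middle-level feature data
        b, _, t = mid.shape
        capsules = mid.view(b, self.num_capsules, self.capsule_dim, t).permute(0, 3, 1, 2)
        return self.squash(capsules)             # high-level capsule vectors: (batch, time, capsules, dim)

high_level = ConvCapsuleExtractor()(torch.randn(2, 1, 200))
print(high_level.shape)  # torch.Size([2, 100, 8, 16])
```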
  • S203 Input the high-level feature data into the pre-trained emotion recognition neural network, and obtain the emotion data of the voice data to be recognized according to the output result of the emotion recognition neural network.
  • this step is basically the same as step S103 in the first embodiment of the voice emotion recognition method provided by the present invention, and will not be repeated here.
  • the low-level feature data is input into the convolutional neural network to obtain the middle-level feature data of the voice data to be recognized.
  • the middle-level feature data is then input into the capsule neural network to obtain the high-level feature data of the voice data to be recognized.
  • the capsule network can carry more feature information and has excellent generalization ability.
  • the extracted high-level information therefore includes more feature information; inputting this high-level information, which contains more feature information, into the pre-trained emotion recognition neural network makes the output result of the emotion recognition neural network more accurate, so that more accurate emotion data of the speech data to be recognized can be obtained according to the output result of the emotion recognition neural network, effectively improving the accuracy of emotion recognition.
  • FIG. 5 is a schematic flowchart of a third embodiment of a voice emotion recognition method provided by the present invention.
  • the voice emotion recognition method provided by the present invention includes the following steps:
  • S301 Acquire voice data to be recognized, and extract low-level feature data of the voice data to be recognized.
  • S302 Input low-level feature data into a pre-trained feature extraction network to obtain high-level feature data of the voice data to be recognized.
  • the feature extraction network includes at least two layers of neural networks, one of which is a capsule neural network.
  • steps S301-S302 are basically the same as steps S101-S102 in the first embodiment of the voice emotion recognition method provided by the present invention, and will not be repeated here.
  • S303 Input the high-level feature data into the pre-trained emotion recognition neural network, and obtain the emotion classification matrix of the speech data to be recognized.
  • the high-level feature data is input into the pre-trained emotion recognition neural network, where the emotion recognition neural network is a bidirectional long short-term memory neural network (BLSTM, Bidirectional Long Short-Term Memory).
  • the high-level feature data is input to the pre-trained emotion recognition neural network, and the emotion recognition neural network outputs the emotion classification matrix of the speech data to be recognized.
  • Each unit of the emotion classification matrix is a vector, and each vector represents a part of the characteristics of the speech data to be recognized.
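  • A minimal sketch (assuming PyTorch and arbitrary sizes) of how a BLSTM can turn the sequence of high-level feature vectors into such a per-frame emotion classification matrix:

```python
# Sketch: a bidirectional LSTM maps the high-level feature sequence to an
# "emotion classification matrix" with one vector per frame; sizes are assumed.
import torch
import torch.nn as nn

blstm = nn.LSTM(input_size=16, hidden_size=32, batch_first=True, bidirectional=True)

high_level = torch.randn(1, 100, 16)      # (batch, frames, feature_dim) from the feature extractor
emotion_matrix, _ = blstm(high_level)     # (1, 100, 64): each unit/row is a vector
print(emotion_matrix.shape)
```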
  • S304 Obtain a weight matrix of the emotion classification matrix, and multiply the weight matrix by the emotion classification matrix to obtain a feature matrix of the voice data to be recognized.
  • in order for the neural network to remember more information, the model would have to become very complicated; however, computing power limits how far the network can be expanded, so the attention mechanism is used to pay more attention to the effective information, thereby simplifying the model and improving the recognition rate. Therefore, the weight matrix of the emotion classification matrix is obtained, for example by performing a self-attention operation on the emotion classification matrix, and the emotion classification matrix is multiplied by its weight matrix to obtain the feature matrix of the speech data to be recognized.
  • Figure 6 is a schematic diagram of the principle of the attention mechanism.
  • the essence of the attention mechanism can be described as a mapping from a query to a series of (key, value) pairs; based on this essence, many variants have been developed.
  • the present invention adopts a self-attention mechanism, whose computation proceeds as follows:
  • the weight coefficient of the value $V_i$ corresponding to each key $K_i$ is obtained by calculating the correlation between the query $Q$ and each $K_i$; commonly used correlation (similarity) functions include the dot product, cosine similarity, and evaluation by a neural network (MLP), for example $s_i = Q \cdot K_i$.
  • the weights are then normalized by the Softmax function, which highlights the weights of important elements: $a_i = \operatorname{softmax}(s_i) = \dfrac{e^{s_i}}{\sum_k e^{s_k}}$, where $a_i$ is the weight coefficient.
  • finally, the weights $a_i$ and the corresponding values $V_i$ are weighted and summed to obtain the final attention value: $\operatorname{Attention}(Q, K, V) = \sum_i a_i V_i$.
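  • A minimal sketch (the dot-product similarity and the tensor shapes are assumptions for illustration) of applying such a self-attention weighting to the emotion classification matrix:

```python
# Sketch: dot-product self-attention over the emotion classification matrix.
# The matrix's own rows serve as queries, keys, and values; the softmax weight
# matrix multiplied with the values gives the feature matrix.
import torch

def self_attention(emotion_matrix: torch.Tensor) -> torch.Tensor:
    q = k = v = emotion_matrix                   # (batch, frames, dim)
    scores = q @ k.transpose(-2, -1)             # correlation between each Q and each K
    weights = torch.softmax(scores, dim=-1)      # weight matrix a_i via Softmax
    return weights @ v                           # weighted sum: feature matrix

feature_matrix = self_attention(torch.randn(1, 100, 64))
print(feature_matrix.shape)  # torch.Size([1, 100, 64])
```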
  • S305 Acquire emotion data of the voice data to be recognized according to the feature matrix.
  • the feature matrix is input into a preset operation function, the probability values of various emotions of the voice data to be recognized are obtained, and the emotions of the voice data to be recognized are determined according to the probability values of various emotions.
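  • The preset operation function is not specified in detail here; purely as an illustration, pooling the feature matrix, applying a linear layer, and taking a Softmax could yield such per-emotion probability values:

```python
# Illustration only: one possible "preset operation function" -- pool the
# feature matrix over frames, apply a linear layer, and take Softmax to get
# a probability value for each emotion; the most probable emotion is chosen.
import torch
import torch.nn as nn

EMOTIONS = ["neutral", "happy", "angry", "sad"]   # assumed label set
head = nn.Linear(64, len(EMOTIONS))

feature_matrix = torch.randn(1, 100, 64)          # from the self-attention step above
pooled = feature_matrix.mean(dim=1)               # average over frames
probs = torch.softmax(head(pooled), dim=-1)       # probability value of each emotion
print(EMOTIONS[int(probs.argmax())], probs.tolist())
```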
  • the high-level feature data is input into the pre-trained emotion recognition neural network; after the emotion classification matrix is obtained, the weight matrix of the emotion classification matrix is obtained through the attention algorithm, and the emotion classification matrix is multiplied by the weight matrix.
  • the attention mechanism is used to pay more attention to the effective information, thereby simplifying the model and improving the recognition rate.
  • FIG. 7 is a schematic structural diagram of the first embodiment of the smart device provided by the present invention.
  • the smart device 10 includes an acquisition module 11, a feature extraction module 12, and an identification module 13.
  • the acquiring module 11 is used to acquire voice data to be recognized, and extract low-level feature data of the voice data to be recognized.
  • the feature extraction module 12 is configured to input low-level feature data into a pre-trained feature extraction network to obtain high-level feature data of the voice data to be recognized.
  • the feature extraction network includes at least two layers of neural networks, one of which is a capsule neural network.
  • the recognition module 13 is used to input high-level feature data into the pre-trained emotion recognition neural network to recognize the emotion data of the voice data to be recognized.
  • the acquisition module acquires the low-level feature data of the voice data to be recognized, and the feature extraction module inputs the low-level feature data into the pre-trained feature extraction network.
  • the feature extraction network includes at least two layers of neural networks, one of which is a capsule neural network.
  • the capsule network can carry more feature information and has excellent generalization ability.
  • the extracted high-level information includes more feature information.
  • inputting this high-level information, which includes more feature information, into the pre-trained emotion recognition neural network makes the output result of the emotion recognition neural network more accurate, which can effectively improve the accuracy of emotion recognition.
  • the other layer of neural networks is a bidirectional long short-term memory neural network.
  • the identification module 13 includes a matrix sub-module 131, a weight sub-module 132, and an identification sub-module 133.
  • the matrix sub-module 131 is used to input high-level feature data into the pre-trained emotion recognition neural network to obtain the emotion classification matrix of the speech data to be recognized.
  • the weight sub-module 132 is used to obtain the weight matrix of the emotion classification matrix, and multiply the weight matrix by the emotion classification matrix to obtain the feature matrix of the speech data to be recognized.
  • the recognition sub-module 133 is used to obtain the emotion of the voice data to be recognized according to the feature matrix.
  • the weight sub-module 132 performs a self-attention operation on the emotion classification matrix to obtain the weight matrix of the emotion classification matrix.
  • the emotion recognition neural network is a bidirectional long short-term memory neural network.
  • the recognition module 13 also includes a function sub-module 134.
  • the function sub-module 134 is used to input the feature matrix into a preset operation function to obtain the probability values of various emotions of the voice data to be recognized, and to determine the emotion of the voice data to be recognized according to the probability values of the various emotions.
  • the low-level feature data includes the frequency and amplitude of the voice data to be recognized.
  • the high-level feature data includes the Mel frequency cepstrum of the voice data to be recognized.
  • the obtaining module 11 is used to obtain low-level feature data of the voice data to be recognized by using the opensmile tool.
  • the smart device also includes a training module 14 for training the emotion recognition neural network.
  • the training module 14 includes a preparation sub-module 141, a definition sub-module 142, and an input sub-module 143.
  • the preparation sub-module 141 is used to prepare a plurality of training high-level feature data, and label the emotional data of each training high-level feature data.
  • the definition sub-module 142 is used to define the structure, loss function and termination conditions of the trained emotion recognition neural network.
  • the input sub-module 143 is used to input multiple high-level feature data and corresponding emotion data into the emotion recognition neural network for training.
  • the feature extraction module of the smart device in this embodiment inputs low-level feature data into the pre-trained feature extraction network.
  • the feature extraction network includes at least two layers of neural networks, one of which is a capsule neural network, which can carry more feature information and has excellent generalization ability.
  • the extracted high-level information includes more feature information, which can effectively improve the accuracy of recognition.
  • the recognition module pays more attention to the effective information through the attention mechanism, thereby simplifying the model and improving the recognition rate.
  • FIG. 8 is a schematic structural diagram of a second embodiment of a smart device provided by the present invention.
  • the smart device 20 provided by the present invention includes an acquisition circuit 21, a processor 22, and a memory 23.
  • the processor 22 is coupled to the acquisition circuit 21 and the memory 23.
  • a computer program is stored in the memory 23, and the processor 22 executes the computer program when working to implement the methods shown in FIG. 2, FIG. 4, and FIG. 5.
  • for the detailed method, refer to the description above, which will not be repeated here.
  • after the smart terminal extracts the low-level feature data of the voice data to be recognized, it inputs the low-level feature data into the pre-trained feature extraction network to obtain the high-level feature data of the voice data to be recognized.
  • the feature extraction network includes at least two layers of neural networks, one of which is a capsule neural network.
  • the capsule network can carry more feature information and has excellent generalization ability.
  • the extracted high-level information includes more feature information, and this high-level information containing more feature information is input into the pre-trained emotion recognition neural network, so that the output result of the emotion recognition neural network is more accurate, which can effectively improve the accuracy of emotion recognition.
  • FIG. 9 is a schematic structural diagram of an embodiment of a computer-readable storage medium provided by the present invention.
  • At least one computer program 31 is stored in the computer-readable storage medium 30.
  • the computer program 31 is used to be executed by the processor to implement the methods shown in FIGS. 2, 4, and 5.
  • the computer-readable storage medium 30 may be a storage chip in a terminal, a hard disk, or a mobile hard disk, or other readable and writable storage tools such as a USB flash drive, or an optical disk, and may also be a server or the like.
  • the computer program stored in the computer-readable storage medium in this embodiment can be used, after the low-level feature data of the voice data to be recognized has been extracted, to input the low-level feature data into the pre-trained feature extraction network to obtain the high-level feature data of the voice data to be recognized.
  • the feature extraction network includes at least two layers of neural networks, one of which is a capsule neural network.
  • the capsule network can carry more feature information and has excellent generalization capabilities; the extracted high-level information therefore includes more feature information, and this high-level information containing more characteristic information is input into the pre-trained emotion recognition neural network, so that the output result of the emotion recognition neural network is more accurate, which can effectively improve the accuracy of emotion recognition.
  • the present invention extracts the low-level feature data of the voice data to be recognized, inputs the low-level feature data into the pre-trained feature extraction neural network including the capsule neural network, and obtains the high-level feature data of the voice data to be recognized.
  • the capsule network can carry more feature information and has excellent generalization ability.
  • the extracted high-level information includes more feature information; inputting this high-level information, which includes more feature information, into the pre-trained emotion recognition neural network makes the output result of the emotion recognition neural network more accurate, which can effectively improve the accuracy of emotion recognition.

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Signal Processing (AREA)
  • Multimedia (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Child & Adolescent Psychology (AREA)
  • General Health & Medical Sciences (AREA)
  • Hospice & Palliative Care (AREA)
  • Psychiatry (AREA)
  • Image Analysis (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

The invention relates to a speech emotion recognition method, an intelligent device, and a computer-readable storage medium, the speech emotion recognition method comprising the steps of: obtaining speech data to be recognized and extracting low-level feature data of the speech data (S101); inputting the low-level feature data into a pre-trained feature extraction network and obtaining high-level feature data of the speech data, the feature extraction network comprising at least two layers of neural networks, one of which is a capsule neural network (S102); and inputting the high-level feature data into a pre-trained emotion recognition neural network and obtaining emotion data of the speech data according to an output result of the emotion recognition neural network (S103). The described method can effectively improve the accuracy of emotion recognition.
PCT/CN2019/127923 2019-12-24 2019-12-24 Speech emotion recognition method, intelligent device, and computer-readable storage medium WO2021127982A1 (fr)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/CN2019/127923 WO2021127982A1 (fr) 2019-12-24 2019-12-24 Speech emotion recognition method, intelligent device, and computer-readable storage medium
CN201980003195.6A CN111357051B (zh) 2019-12-24 2019-12-24 语音情感识别方法、智能装置和计算机可读存储介质

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2019/127923 WO2021127982A1 (fr) 2019-12-24 2019-12-24 Speech emotion recognition method, intelligent device, and computer-readable storage medium

Publications (1)

Publication Number Publication Date
WO2021127982A1 true WO2021127982A1 (fr) 2021-07-01

Family

ID=71197848

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/127923 WO2021127982A1 (fr) 2019-12-24 2019-12-24 Speech emotion recognition method, intelligent device, and computer-readable storage medium

Country Status (2)

Country Link
CN (1) CN111357051B (fr)
WO (1) WO2021127982A1 (fr)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111862956B (zh) * 2020-07-27 2022-07-12 腾讯科技(深圳)有限公司 一种数据处理方法、装置、设备及存储介质
CN113362857A (zh) * 2021-06-15 2021-09-07 厦门大学 一种基于CapCNN的实时语音情感识别方法及应用装置
CN113555038B (zh) * 2021-07-05 2023-12-29 东南大学 基于无监督领域对抗学习的说话人无关语音情感识别方法及系统
CN116304585B (zh) * 2023-05-18 2023-08-15 中国第一汽车股份有限公司 情感识别及模型训练方法、装置、电子设备及存储介质

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106653020A (zh) * 2016-12-13 2017-05-10 中山大学 一种基于深度学习的智慧视听设备多业务控制方法及系统
CN106782602A (zh) * 2016-12-01 2017-05-31 南京邮电大学 基于长短时间记忆网络和卷积神经网络的语音情感识别方法
CN109389992A (zh) * 2018-10-18 2019-02-26 天津大学 一种基于振幅和相位信息的语音情感识别方法
CN109523994A (zh) * 2018-11-13 2019-03-26 四川大学 一种基于胶囊神经网络的多任务语音分类方法
CN109817246A (zh) * 2019-02-27 2019-05-28 平安科技(深圳)有限公司 情感识别模型的训练方法、情感识别方法、装置、设备及存储介质
CN110491416A (zh) * 2019-07-26 2019-11-22 广东工业大学 一种基于lstm和sae的电话语音情感分析与识别方法
CN110534132A (zh) * 2019-09-23 2019-12-03 河南工业大学 一种基于谱图特征的并行卷积循环神经网络的语音情感识别方法

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108597539B (zh) * 2018-02-09 2021-09-03 桂林电子科技大学 基于参数迁移和语谱图的语音情感识别方法
CN110400579B (zh) * 2019-06-25 2022-01-11 华东理工大学 基于方向自注意力机制和双向长短时网络的语音情感识别

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106782602A (zh) * 2016-12-01 2017-05-31 南京邮电大学 基于长短时间记忆网络和卷积神经网络的语音情感识别方法
CN106653020A (zh) * 2016-12-13 2017-05-10 中山大学 一种基于深度学习的智慧视听设备多业务控制方法及系统
CN109389992A (zh) * 2018-10-18 2019-02-26 天津大学 一种基于振幅和相位信息的语音情感识别方法
CN109523994A (zh) * 2018-11-13 2019-03-26 四川大学 一种基于胶囊神经网络的多任务语音分类方法
CN109817246A (zh) * 2019-02-27 2019-05-28 平安科技(深圳)有限公司 情感识别模型的训练方法、情感识别方法、装置、设备及存储介质
CN110491416A (zh) * 2019-07-26 2019-11-22 广东工业大学 一种基于lstm和sae的电话语音情感分析与识别方法
CN110534132A (zh) * 2019-09-23 2019-12-03 河南工业大学 一种基于谱图特征的并行卷积循环神经网络的语音情感识别方法

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
CHEN XIAOMING: "RESEARCH ON SPEECH EMOTION RECOGNITION METHOD BASED ON TIME SERIES DEEP LEARNING MODEL", MASTER THESIS, TIANJIN POLYTECHNIC UNIVERSITY, CN, 1 January 2019 (2019-01-01), CN, XP055827147, ISSN: 1674-0246 *
MIAO YUQING, ZOU WEI;LIU TONGLAI;ZHOU MING;CAI GUOYONG: "Speech Emotion Recognition Model Based on Parameter Transfer and Convolutional Recurrent Neural Network", COMPUTER ENGINEERING AND APPLICATIONS, HUABEI JISUAN JISHU YANJIUSUO, CN, vol. 55, no. 10, 1 January 2019 (2019-01-01), CN, pages 135 - 198, XP055827152, ISSN: 1002-8331, DOI: 10.3778/j.issn.1002-8331.1802-0089 *
WU XIXIN; LIU SONGXIANG; CAO YUEWEN; LI XU; YU JIANWEI; DAI DONGYANG; MA XI; HU SHOUKANG; WU ZHIYONG; LIU XUNYING; MENG HELEN: "Speech Emotion Recognition Using Capsule Networks", ICASSP 2019 - 2019 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), IEEE, 12 May 2019 (2019-05-12), pages 6695 - 6699, XP033565716, DOI: 10.1109/ICASSP.2019.8683163 *

Also Published As

Publication number Publication date
CN111357051A (zh) 2020-06-30
CN111357051B (zh) 2024-02-02


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19957780

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19957780

Country of ref document: EP

Kind code of ref document: A1