CN115240713A - Voice emotion recognition method and device based on multi-modal features and contrast learning - Google Patents

Voice emotion recognition method and device based on multi-modal features and contrast learning

Info

Publication number
CN115240713A
CN115240713A (application CN202210825038.7A)
Authority
CN
China
Prior art keywords
emotion
voice
text
speech
features
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210825038.7A
Other languages
Chinese (zh)
Other versions
CN115240713B (en)
Inventor
谭真
张俊丰
赵翔
唐九阳
王俞涵
吴菲
葛斌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National University of Defense Technology filed Critical National University of Defense Technology
Priority to CN202210825038.7A priority Critical patent/CN115240713B/en
Publication of CN115240713A publication Critical patent/CN115240713A/en
Application granted granted Critical
Publication of CN115240713B publication Critical patent/CN115240713B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/63: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/26: Speech to text systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Acoustics & Sound (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Multimedia (AREA)
  • Child & Adolescent Psychology (AREA)
  • General Health & Medical Sciences (AREA)
  • Hospice & Palliative Care (AREA)
  • Psychiatry (AREA)
  • Signal Processing (AREA)
  • Image Analysis (AREA)

Abstract

The application relates to a voice emotion recognition method and device based on multi-modal features and contrast learning. The method comprises the following steps: the constructed speech emotion recognition model uses a Fast RCNN preprocessing model, a bidirectional GRU model and a 3D convolutional network to extract features, obtaining speech emotion features, text emotion features and high-level emotion features; the emotion features are enhanced by a contrast learning method, the enhanced emotion features are spliced and then decoded to output a probability distribution over emotion categories; a cross entropy loss function is constructed from the probability distribution of the emotion categories and the labeled real emotion category labels, the pre-constructed speech emotion recognition model is trained with the cross entropy loss function together with the loss function used in contrast learning, and speech emotion recognition is performed on the speech-video data to be recognized with the trained model. The method improves the accuracy of speech emotion recognition.

Description

Voice emotion recognition method and device based on multi-modal features and contrast learning
Technical Field
The present application relates to the field of signal processing and artificial intelligence technologies, and in particular to a speech emotion recognition method and apparatus, computer device, and storage medium based on multi-modal features and contrast learning.
Background
In daily life, people convey emotion mainly through voice, and voice is likewise one of the main channels of human-computer interaction. Recognizing the emotion carried in voice therefore allows a machine to better understand the user's intentions and ideas, making the machine more intelligent and humanized.
However, early research on speech emotion recognition was limited to speech data alone, which led to a bottleneck in recognition accuracy. In fact, when humans convey emotion through voice, changes in facial expression and hand movement generally accompany the speech, and besides acoustic information the voice also carries textual information; each such channel of information is called a modality. This gave rise to speech emotion recognition based on multi-modal features, in which the task is aided by video capturing the facial expressions and hand movements made while speaking and by the text transcribed from the voice. Because the voice, the synchronously captured video and the transcribed text carry the same emotional information, the multi-modal features share a certain similarity in their emotional content. However, existing research on speech emotion recognition based on multi-modal features ignores the relations among the multi-modal features, so the multi-modal emotion feature representation is not accurate enough and the speech emotion recognition accuracy is low.
Disclosure of Invention
In view of the foregoing, it is desirable to provide a speech emotion recognition method, apparatus, computer device and storage medium based on multi-modal feature and contrast learning, which can improve speech emotion recognition accuracy.
A speech emotion recognition method based on multi-modal feature and contrast learning, the method comprising:
acquiring voice video data to be recognized; the voice video data comprises voice text and video data;
constructing a speech emotion recognition model; the speech emotion recognition model comprises a Fast RCNN preprocessing model, a bidirectional GRU model, a 3D convolution network and a full-connection layer;
performing data preprocessing on the voice text to obtain a voice vector and a word vector;
extracting local features of the speaker in the video data by using a Fast RCNN preprocessing model, expanding the local features to the size of the global feature map of the video data and fusing them with it to obtain fused video data; the local features include facial expressions and hand movements;
respectively extracting emotional characteristics of the voice vector and the word vector according to the bidirectional GRU model to obtain voice emotional characteristics and text emotional characteristics; extracting emotional characteristics of the fused video data by using a 3D convolutional network to obtain high-grade emotional characteristics;
performing enhancement representation on the voice emotion feature, the text emotion feature and the high-level emotion feature according to a comparison learning method to obtain the enhanced voice emotion feature, the text emotion feature and the high-level emotion feature;
splicing the enhanced voice emotion characteristics, text emotion characteristics and high-level emotion characteristics, decoding by a decoder consisting of a full connection layer, and outputting probability distribution of emotion types;
constructing a cross entropy loss function according to the probability distribution of the emotion types and the labeled real emotion type labels, and training a pre-constructed speech emotion recognition model by using the cross entropy loss function and a loss function in comparison learning to obtain a trained speech emotion recognition model;
and performing voice emotion recognition on the voice video data to be recognized according to the trained voice emotion recognition model.
In one embodiment, the voice text comprises voice data and text data; carrying out data preprocessing on the voice text to obtain a voice vector and a word vector, wherein the method comprises the following steps:
dividing the voice data into frames using a window of fixed duration, sliding the window backwards so that each window position overlaps the previous one, to obtain a plurality of single-frame voice data segments;
converting the single-frame voice data according to an OpenSMILE tool to obtain a voice vector;
taking the length of the sentence containing the most words in the text data as the maximum length, and zero-padding the sentences in the text data that are shorter than the maximum length to obtain a plurality of sentences of equal length;
converting words in the sentences with equal length according to the Bert preprocessing model to obtain word vectors; the word vector includes semantic information of the context of the word corresponding thereto.
In one embodiment, the extracting of emotion features of the speech vector and the word vector according to the bidirectional GRU model to obtain speech emotion features and text emotion features includes:
according to the bidirectional GRU model, the voice vector and the word vector are each fed simultaneously into a forward GRU model and a backward GRU model, and the state information vectors output by the two GRU models at each time step are spliced, yielding the corresponding voice emotion feature and text emotion feature respectively.
In one embodiment, the method for enhancing and representing the speech emotion feature, the text emotion feature and the high-level emotion feature according to a contrast learning method to obtain the enhanced speech emotion feature, the text emotion feature and the high-level emotion feature includes:
and according to the contrast learning method, the distances among the speech emotion features, the text emotion features and the high-level emotion features are drawn closer by training the model to reduce a loss function, so that the enhanced speech emotion features, text emotion features and high-level emotion features are obtained.
In one embodiment, the loss function is loss_cons = log(exp(D(S, T)/τ) + exp(D(S, V)/σ)), where D is the L2 distance function measuring the distance between two emotion feature representations, S is the speech emotion feature, T is the text emotion feature, V is the high-level emotion feature, and τ and σ are parameters that adjust the feature representation level.
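For illustration only, a minimal PyTorch sketch of this contrast learning loss, assuming S, T and V are batched feature tensors of equal dimension and taking D to be the L2 norm; the function name and the default values of τ and σ are assumptions rather than part of the disclosure:

import torch

def contrastive_alignment_loss(S, T, V, tau=1.0, sigma=1.0):
    # Sketch of loss_cons = log(exp(D(S, T)/tau) + exp(D(S, V)/sigma)).
    # S, T, V: (batch, dim) speech, text and high-level video emotion features.
    d_st = torch.norm(S - T, p=2, dim=-1)   # L2 distance between speech and text features
    d_sv = torch.norm(S - V, p=2, dim=-1)   # L2 distance between speech and video features
    loss = torch.log(torch.exp(d_st / tau) + torch.exp(d_sv / sigma))
    return loss.mean()                      # minimizing this draws the three features closer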
In one embodiment, splicing the enhanced speech emotion features, text emotion features and high-level emotion features, decoding the spliced features through a decoder consisting of full connection layers, and outputting the probability distribution of emotion categories includes:
splicing the enhanced voice emotion features, text emotion features and high-level emotion features, decoding the spliced features through a decoder consisting of full connection layers, and outputting the probability distribution of emotion categories as

p_j = exp(F_r^j) / Σ_i exp(F_r^i)

wherein F_r represents the multi-modal features after splicing, p_j represents the probability that the currently recognized emotion belongs to the j-th category, F_r^j is the j-th feature parameter of the multi-modal features, and F_r^i is the i-th feature parameter of the multi-modal features.
In one embodiment, the cross entropy loss function is constructed according to the probability distribution of the emotion classes and the labeled real emotion class labels, and comprises the following steps:
constructing a cross entropy loss function according to the probability distribution of the emotion classes and the labeled real emotion class labels as
loss_crE = -Σ_{i=1}^{n} y(x_i) log(p(x_i))
Wherein y represents the real distribution of the emotion types, x represents the emotion type labels, n represents the number of the emotion type labels, and i represents the emotion type label serial number.
A speech emotion recognition apparatus based on multi-modal features and contrast learning, the apparatus comprising:
the model building module is used for acquiring voice video data to be recognized; the voice video data comprises voice text and video data; constructing a speech emotion recognition model; the speech emotion recognition model comprises a Fast RCNN preprocessing model, a bidirectional GRU model, a 3D convolution network and a full-connection layer;
the data preprocessing module is used for preprocessing the data of the voice text to obtain a voice vector and a word vector; extracting local characteristics of speakers in the video data by using a Fast RCNN preprocessing model, and fusing the local characteristics with a global characteristic diagram of the video data after expanding the size to obtain fused video data; local features include facial expressions and hand movements;
the feature extraction module is used for respectively extracting the emotional features of the voice vector and the word vector according to the bidirectional GRU model to obtain voice emotional features and text emotional features; extracting emotional characteristics of the fused video data by using a 3D convolutional network to obtain high-grade emotional characteristics;
the comparison learning module is used for performing enhancement representation on the voice emotion characteristics, the text emotion characteristics and the high-level emotion characteristics according to a comparison learning method to obtain enhanced voice emotion characteristics, text emotion characteristics and high-level emotion characteristics; splicing the enhanced voice emotion characteristics, text emotion characteristics and high-level emotion characteristics, decoding by a decoder consisting of a full connection layer, and outputting probability distribution of emotion types;
the speech emotion recognition module is used for constructing a cross entropy loss function according to the probability distribution of emotion types and the labeled real emotion type labels, and training a pre-constructed speech emotion recognition model by using the cross entropy loss function and a loss function in comparison learning to obtain a trained speech emotion recognition model; and performing voice emotion recognition on the voice video data to be recognized according to the trained voice emotion recognition model.
A computer device comprising a memory and a processor, the memory storing a computer program, the processor implementing the following steps when executing the computer program:
acquiring voice video data to be recognized; the voice video data comprises voice text and video data;
constructing a speech emotion recognition model; the speech emotion recognition model comprises a Fast RCNN preprocessing model, a bidirectional GRU model, a 3D convolution network and a full-connection layer;
performing data preprocessing on the voice text to obtain a voice vector and a word vector;
extracting local characteristics of speakers in the video data by using a Fast RCNN preprocessing model, and fusing the local characteristics with a global characteristic diagram of the video data after expanding the size to obtain fused video data; local features include facial expressions and hand movements;
respectively extracting emotional characteristics of the voice vector and the word vector according to the bidirectional GRU model to obtain voice emotional characteristics and text emotional characteristics; extracting emotional characteristics of the fused video data by using a 3D convolutional network to obtain high-grade emotional characteristics;
performing enhancement representation on the voice emotion characteristics, the text emotion characteristics and the high-level emotion characteristics according to a comparison learning method to obtain enhanced voice emotion characteristics, text emotion characteristics and high-level emotion characteristics;
splicing the enhanced voice emotion characteristics, text emotion characteristics and high-level emotion characteristics, decoding by a decoder consisting of a full connection layer, and outputting probability distribution of emotion types;
constructing a cross entropy loss function according to the probability distribution of the emotion types and the labeled real emotion type labels, and training a pre-constructed speech emotion recognition model by using the cross entropy loss function and a loss function in comparison learning to obtain a trained speech emotion recognition model;
and performing voice emotion recognition on the voice video data to be recognized according to the trained voice emotion recognition model.
A computer-readable storage medium, on which a computer program is stored which, when executed by a processor, carries out the steps of:
acquiring voice video data to be recognized; the voice video data comprises voice text and video data;
constructing a speech emotion recognition model; the speech emotion recognition model comprises a Fast RCNN preprocessing model, a bidirectional GRU model, a 3D convolution network and a full-connection layer;
performing data preprocessing on the voice text to obtain a voice vector and a word vector;
extracting local characteristics of speakers in the video data by using a Fast RCNN preprocessing model, and fusing the local characteristics with a global characteristic diagram of the video data after expanding the size to obtain fused video data; local features include facial expressions and hand movements;
respectively extracting emotional characteristics of the voice vector and the word vector according to the bidirectional GRU model to obtain voice emotional characteristics and text emotional characteristics; extracting the emotional characteristics of the fused video data by using a 3D convolution network to obtain high-grade emotional characteristics;
performing enhancement representation on the voice emotion characteristics, the text emotion characteristics and the high-level emotion characteristics according to a comparison learning method to obtain enhanced voice emotion characteristics, text emotion characteristics and high-level emotion characteristics;
splicing the enhanced voice emotion characteristics, text emotion characteristics and high-level emotion characteristics, decoding by a decoder consisting of a full connection layer, and outputting probability distribution of emotion types;
constructing a cross entropy loss function according to the probability distribution of the emotion types and the labeled real emotion type labels, and training a pre-constructed speech emotion recognition model by using the cross entropy loss function and a loss function in comparison learning to obtain a trained speech emotion recognition model;
and performing voice emotion recognition on the voice video data to be recognized according to the trained voice emotion recognition model.
According to the voice emotion recognition method, device, computer equipment and storage medium based on multi-modal features and contrast learning, features are extracted from the audio, text and video data with the Fast RCNN preprocessing model, the bidirectional GRU model and the 3D convolutional network, yielding multi-modal features that express the emotion in the speech and making effective use of emotion information from multiple modalities. The contrast learning method effectively reduces the distances between the multi-modal emotion features, giving a more accurate emotion feature representation; the probability distribution of emotion categories is then computed from this representation, a cross entropy loss function is constructed with the real emotion category labels, the pre-constructed speech emotion recognition model is trained with this loss together with the loss function used in contrast learning, and speech emotion recognition is performed with the trained model, improving the accuracy of speech emotion recognition.
Drawings
FIG. 1 is a schematic flow chart of a speech emotion recognition method based on multi-modal feature and contrast learning in one embodiment;
FIG. 2 is a block diagram of a method for speech emotion recognition based on multi-modal features and contrast learning, under an embodiment;
FIG. 3 is a schematic diagram of a speech emotion recognition apparatus based on multi-modal features and contrast learning, under an embodiment;
FIG. 4 is a diagram of the internal structure of a computer device in one embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more clearly understood, the present application is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
In one embodiment, as shown in fig. 1, a speech emotion recognition method based on multi-modal features and contrast learning is provided, which comprises the following steps:
step 102, acquiring voice video data to be recognized; the voice video data comprises voice text and video data; constructing a speech emotion recognition model; the speech emotion recognition model comprises a Fast RCNN preprocessing model, a bidirectional GRU model, a 3D convolution network and a full-connection layer;
Step 104, performing data preprocessing on the voice text to obtain a voice vector and a word vector; extracting local features of the speaker in the video data by using a Fast RCNN preprocessing model, expanding the local features to the size of the global feature map of the video data and fusing them with it to obtain fused video data; the local features include facial expressions and hand movements.
The voice text contains the words spoken by the speaker together with their emotional characteristics, and the voice vector and word vector obtained by preprocessing the voice text are better suited to feature extraction. In addition, when a speaker produces speech, it is generally accompanied by changes in facial expression and hand movement, which is why the local features extracted from the video include facial expressions and hand movements.
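As a hedged sketch of the fusion described above (not necessarily the exact implementation of the invention), local feature patches produced for the detected face and hand regions can be expanded to the spatial size of the global feature map and fused element-wise; the function names and the additive fusion are illustrative assumptions:

import torch
import torch.nn.functional as F

def fuse_local_global(global_map, local_patches):
    # global_map:    (C, H, W) global feature map of a video frame.
    # local_patches: list of (C, h, w) feature patches for the detected face/hand regions.
    fused = global_map.clone()
    for patch in local_patches:
        # expand the local patch to the spatial size of the global feature map
        expanded = F.interpolate(patch.unsqueeze(0), size=global_map.shape[-2:],
                                 mode="bilinear", align_corners=False).squeeze(0)
        fused = fused + expanded            # element-wise fusion of local and global information
    return fused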
Step 106, extracting emotion features from the voice vector and the word vector according to the bidirectional GRU model to obtain voice emotion features and text emotion features; and performing emotion feature extraction on the fused video data by using a 3D convolutional network to obtain high-level emotion features.
The bidirectional GRU model comprises a forward GRU model and a backward GRU model. The GRU is an effective variant of the long short-term memory network: it has a simpler structure than the LSTM while performing well. Using the bidirectional GRU for feature extraction yields a state information vector that combines the state information before and after the current time step, so the extracted features incorporate more context and are more accurate. Meanwhile, video data is in essence a multi-channel image sequence formed by its frames. The method extracts emotion features from the video with 3D convolution, applying a 3D convolution kernel to the whole multi-channel three-dimensional volume; compared with applying 2D convolution to each 2D channel separately, 3D convolution models temporal information better. The invention uses multiple 3D CNN layers and 3D pooling layers to extract the high-level emotion features of the video.
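For concreteness, a minimal sketch of a 3D convolutional feature extractor in the spirit of the description above, built from stacked 3D convolution and 3D pooling layers; the channel counts, kernel sizes and output dimension are illustrative assumptions rather than the configuration of the invention:

import torch.nn as nn

class Video3DEncoder(nn.Module):
    # Stacked 3D convolutions and 3D pooling producing a high-level emotion feature vector.
    def __init__(self, in_channels=3, feat_dim=256):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(in_channels, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d(kernel_size=(1, 2, 2)),
            nn.Conv3d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d(kernel_size=2),
            nn.AdaptiveAvgPool3d(1),
        )
        self.proj = nn.Linear(64, feat_dim)

    def forward(self, clips):               # clips: (batch, C, T, H, W) fused video data
        h = self.features(clips).flatten(1)
        return self.proj(h)                 # (batch, feat_dim) high-level emotion features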
Step 108, performing enhanced representation of the voice emotion features, the text emotion features and the high-level emotion features according to a contrast learning method to obtain enhanced voice emotion features, text emotion features and high-level emotion features; and splicing the enhanced voice emotion features, text emotion features and high-level emotion features, decoding them with a decoder consisting of full connection layers, and outputting the probability distribution of emotion categories.
The video synchronously acquired with the voice and the text transcribed from the voice carry the same emotion information, so a certain similarity exists between the extracted multi-modal features. The contrast learning method draws the multi-modal features closer together, so the emotion information can be learned more accurately and a higher-order feature representation is obtained, which improves the accuracy of emotion feature recognition. The enhanced voice emotion features, text emotion features and high-level emotion features are then spliced, and the probability distribution of emotion categories output by a decoder consisting of full connection layers is used together with the labeled real emotion category labels to construct a cross entropy loss function for training the pre-constructed speech emotion recognition model, improving the accuracy with which the model recognizes the category of the speech emotion.
Step 110, constructing a cross entropy loss function according to the probability distribution of emotion types and labeled real emotion type labels, and training a pre-constructed speech emotion recognition model by using the cross entropy loss function and a loss function in comparison learning to obtain a trained speech emotion recognition model; and performing voice emotion recognition on the voice video data to be recognized according to the trained voice emotion recognition model.
The probability distribution of the emotion categories and the one-hot codes of the real emotion category labels are used to construct a cross entropy loss function, whose expression is as follows:
loss_crE = -Σ_{i=1}^{n} y(x_i) log(p(x_i))
the real emotion category label is a label which is accurately labeled in advance, and the probability distribution of the emotion categories and the real emotion category label are used for constructing a cross entropy loss function to train the model, so that the model is trainedMore accurate emotion classification can be output during emotion recognition, and loss function loss in comparison learning is realized cons In combination with the cross-entropy loss function, the loss function is then minimized by continuously updating the parameters with a random gradient descent, as follows:
loss_θ = loss_cons + loss_crE
θ ← θ - η ∂loss_θ/∂θ
where θ represents all trainable parameters in the model and η is the learning rate of the model during training.
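A hedged sketch of one training step that combines the two losses and applies the parameter update above, reusing the contrastive_alignment_loss sketch given earlier; the assumption that the model returns the three modality features together with the softmax probabilities, and the names in the batch dictionary, are illustrative:

import torch

def cross_entropy_from_probs(probs, labels, eps=1e-8):
    # loss_crE = -sum_i y(x_i) * log(p(x_i)), with y the one-hot real emotion labels.
    one_hot = torch.nn.functional.one_hot(labels, num_classes=probs.size(-1)).float()
    return -(one_hot * torch.log(probs + eps)).sum(dim=-1).mean()

def train_step(model, optimizer, batch, tau=1.0, sigma=1.0):
    # One update minimizing loss_theta = loss_cons + loss_crE.
    S, T, V, probs = model(batch["speech"], batch["text"], batch["video"])
    loss_cons = contrastive_alignment_loss(S, T, V, tau, sigma)   # see the earlier sketch
    loss_cre = cross_entropy_from_probs(probs, batch["labels"])
    loss = loss_cons + loss_cre
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()        # with SGD: theta <- theta - eta * d(loss_theta)/d(theta)
    return float(loss.detach())

Pairing this with optimizer = torch.optim.SGD(model.parameters(), lr=eta) makes the update performed by optimizer.step() match the gradient descent formula above.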
Training the pre-constructed speech emotion recognition model with the loss function from contrast learning reduces the errors that arise in contrast learning in the prior art and makes the model more accurate, so that more accurate emotion categories are obtained when speech emotion recognition is performed on the speech-video data to be recognized with the trained speech emotion recognition model. The process of speech emotion recognition is shown in fig. 2.
In the speech emotion recognition method based on multi-modal features and contrast learning, features are extracted from the speech, text and video data with the Fast RCNN preprocessing model, the bidirectional GRU model and the 3D convolutional network, yielding multi-modal features that express the emotion in the speech and making effective use of emotion information from multiple modalities. The contrast learning method effectively reduces the distances between the multi-modal emotion features, giving a more accurate emotion feature representation; the probability distribution of emotion categories is then computed from this representation, a cross entropy loss function is constructed with the real emotion category labels, the pre-constructed speech emotion recognition model is trained with this loss together with the loss function used in contrast learning, and speech emotion recognition is performed with the trained model, improving the accuracy of speech emotion recognition.
In one embodiment, the voice text comprises voice data and text data; carrying out data preprocessing on the voice text to obtain a voice vector and a word vector, and comprising the following steps:
taking the voice data as a frame according to a window with a fixed time period length, sliding the window backwards, wherein the position of the window after each movement and the position of the previous window have an overlapping part, and obtaining a plurality of single-frame voice data;
converting the single-frame voice data according to an OpenSMILE tool to obtain a voice vector;
taking the sentence length containing most words in the text data as the maximum length, and carrying out zero filling processing on the sentences with the length less than the maximum length in the text data to obtain a plurality of sentences with equal length;
converting words in sentences with equal length according to a Bert preprocessing model to obtain word vectors; the word vector contains semantic information of the context of the word corresponding to the word vector.
In a specific embodiment, the voice is divided into frames with a window of a certain fixed duration that is slid backwards; to make the transition between successive frames natural, each window position overlaps the previous one. Each voice frame is then converted by the OpenSMILE tool into Mel-frequency cepstral coefficient (MFCC) feature parameters, i.e. the voice is vectorized. Meanwhile, because the model can only process text vectors of equal-length sequences, the length of the sentence containing the most words in the text is taken as the maximum length and shorter sentences are zero-padded so that all sentences have equal length; the words in the equal-length sentences are then converted into word vectors by the Bert preprocessing model, and each word vector contains the semantic information of its context.
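Purely as an illustration of the framing and padding described in this embodiment, a short sketch; the window length, hop length and padding id are assumed values, and the OpenSMILE and Bert conversions are only indicated by comments:

import numpy as np

def frame_speech(signal, sr, win_s=0.025, hop_s=0.010):
    # Split a waveform into overlapping fixed-length frames (window and hop are assumed values).
    win, hop = int(win_s * sr), int(hop_s * sr)
    frames = [signal[i:i + win] for i in range(0, len(signal) - win + 1, hop)]
    return np.stack(frames)                 # each frame would then be passed to OpenSMILE for MFCCs

def pad_sentences(token_id_seqs, pad_id=0):
    # Zero-pad tokenised sentences to the length of the longest one (Bert then maps tokens to vectors).
    max_len = max(len(s) for s in token_id_seqs)
    return np.array([list(s) + [pad_id] * (max_len - len(s)) for s in token_id_seqs])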
In one embodiment, the extracting of emotion features of the speech vector and the word vector according to the bidirectional GRU model to obtain speech emotion features and text emotion features includes:
according to the bidirectional GRU model, the voice vector and the word vector are each fed simultaneously into a forward GRU model and a backward GRU model, and the state information vectors output by the two GRU models at each time step are spliced, yielding the corresponding voice emotion feature and text emotion feature respectively.
The GRU model is an effective variant of the long short-term memory network; it has a simpler structure than the LSTM network while performing well.
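A minimal sketch of a bidirectional GRU encoder consistent with this description, in which the forward and backward state vectors at each time step are concatenated; the hidden dimension is an assumption:

import torch.nn as nn

class BiGRUEncoder(nn.Module):
    # Bidirectional GRU; the forward and backward state vectors are concatenated at every time step.
    def __init__(self, input_dim, hidden_dim=128):
        super().__init__()
        self.gru = nn.GRU(input_dim, hidden_dim, batch_first=True, bidirectional=True)

    def forward(self, x):                   # x: (batch, seq_len, input_dim) voice or word vectors
        out, _ = self.gru(x)                # (batch, seq_len, 2 * hidden_dim)
        return out                          # emotion features combining past and future context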
In one embodiment, the enhancing representation of the speech emotion feature, the text emotion feature and the high-level emotion feature according to a contrast learning method to obtain enhanced speech emotion feature, text emotion feature and high-level emotion feature includes:
and according to the contrast learning method, the distances among the speech emotion features, the text emotion features and the high-level emotion features are drawn closer by training the model to reduce a loss function, so that the enhanced speech emotion features, text emotion features and high-level emotion features are obtained.
In one embodiment, the loss function is loss_cons = log(exp(D(S, T)/τ) + exp(D(S, V)/σ)), where D is the L2 distance function measuring the distance between two emotion feature representations, S is the speech emotion feature, T is the text emotion feature, V is the high-level emotion feature, and τ and σ are parameters that adjust the feature representation level.
In one embodiment, splicing the enhanced speech emotion features, text emotion features and high-level emotion features, decoding the spliced features through a decoder consisting of full connection layers, and outputting the probability distribution of emotion categories includes:
splicing the enhanced voice emotion features, text emotion features and high-level emotion features, decoding the spliced features through a decoder consisting of full connection layers, and outputting the probability distribution of emotion categories as

p_j = exp(F_r^j) / Σ_i exp(F_r^i)

wherein F_r represents the multi-modal features after splicing, p_j represents the probability that the currently recognized emotion belongs to the j-th category, F_r^j is the j-th feature parameter of the multi-modal features, and F_r^i is the i-th feature parameter of the multi-modal features.
In this specific embodiment, the video synchronously acquired with the voice and the text transcribed from the voice contain the same emotion information, so a certain similarity exists between the extracted multi-modal features. Adopting the idea of contrast learning, the distances between the emotion features of the speech, the text and the video are then shortened by training to reduce a loss function, where the loss function is:
loss_cons = log(exp(D(S, T)/τ) + exp(D(S, V)/σ))
wherein D is the L2 distance function measuring the distance between two emotion feature representations, S is the speech emotion feature, T is the text emotion feature, V is the high-level emotion feature, and τ and σ are parameters that adjust the feature representation level. The higher-order emotion feature representations of the three modalities obtained after contrast learning are then spliced, with the expression:
F = concat(S_C, T_C, M_C)
wherein S_C, T_C and M_C respectively represent the higher-order emotion features of the voice, the text and the video obtained after contrast learning. The spliced multi-modal emotion feature F is input into the full connection layer and the ReLU activation function layer; the formula is as follows:
F_r = ReLU(W^T F) = max(0, W^T F)
where W represents a trainable parameter matrix.
Finally, the probability distribution over the emotion categories is output through a softmax function, and the emotion category with the maximum probability is selected as the recognized emotion category. The expression is as follows:
p_j = exp(F_r^j) / Σ_i exp(F_r^i)
By drawing the multi-modal features closer together through contrast learning, the emotion information can be learned more accurately, a higher-order feature representation is obtained, and the accuracy of emotion recognition is enhanced.
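A hedged sketch of the decoding step just described: the three enhanced modality features are spliced, passed through a full connection layer with a ReLU activation, and converted into a probability distribution by softmax; the class name, layer sizes and variable names are assumptions:

import torch
import torch.nn as nn

class EmotionDecoder(nn.Module):
    # Splice the three enhanced features, apply a full connection layer with ReLU, then softmax.
    def __init__(self, feat_dim, num_classes):
        super().__init__()
        self.fc = nn.Linear(3 * feat_dim, num_classes)

    def forward(self, S_c, T_c, M_c):
        F_cat = torch.cat([S_c, T_c, M_c], dim=-1)     # F = concat(S_C, T_C, M_C)
        F_r = torch.relu(self.fc(F_cat))               # F_r = ReLU(W^T F) = max(0, W^T F)
        return torch.softmax(F_r, dim=-1)              # p_j = exp(F_r^j) / sum_i exp(F_r^i)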
In one embodiment, the cross entropy loss function is constructed according to the probability distribution of the emotion classes and the labeled real emotion class labels, and comprises the following steps:
constructing a cross entropy loss function according to the probability distribution of the emotion classes and the labeled real emotion class labels as
loss_crE = -Σ_{i=1}^{n} y(x_i) log(p(x_i))
Wherein y represents the real distribution of the emotion types, x represents the emotion type labels, n represents the number of the emotion type labels, and i represents the serial numbers of the emotion type labels.
It should be understood that, although the steps in the flowchart of fig. 1 are shown in order as indicated by the arrows, the steps are not necessarily performed in order as indicated by the arrows. The steps are not performed in the exact order shown and described, and may be performed in other orders, unless explicitly stated otherwise. Moreover, at least some of the steps in fig. 1 may include multiple sub-steps or multiple stages that are not necessarily performed at the same time, but may be performed at different times, and the order of performing the sub-steps or stages is not necessarily sequential, but may be performed alternately or alternately with other steps or at least some of the sub-steps or stages of other steps.
In one embodiment, as shown in fig. 3, there is provided a speech emotion recognition apparatus based on multi-modal feature and contrast learning, including: a model construction module 302, a data preprocessing module 304, a feature extraction module 306, a comparison learning module 308 and a speech emotion recognition module 310, wherein:
a model building module 302, configured to obtain voice and video data to be recognized; the voice video data comprises voice text and video data; constructing a speech emotion recognition model; the speech emotion recognition model comprises a Fast RCNN preprocessing model, a bidirectional GRU model, a 3D convolution network and a full-connection layer;
the data preprocessing module 304 is configured to perform data preprocessing on the voice text to obtain a voice vector and a word vector; extracting local characteristics of speakers in the video data by using a Fast RCNN preprocessing model, and fusing the local characteristics with a global characteristic diagram of the video data after expanding the size to obtain fused video data; local features include facial expressions and hand movements;
the feature extraction module 306 is configured to perform emotion feature extraction on the speech vector and the word vector according to the bidirectional GRU model, so as to obtain speech emotion features and text emotion features; extracting the emotional characteristics of the fused video data by using a 3D convolution network to obtain high-grade emotional characteristics;
the contrast learning module 308 is configured to perform enhancement representation on the speech emotion feature, the text emotion feature and the high-level emotion feature according to a contrast learning method to obtain an enhanced speech emotion feature, a text emotion feature and a high-level emotion feature; splicing the enhanced voice emotion characteristics, text emotion characteristics and high-level emotion characteristics, decoding by a decoder consisting of a full connection layer, and outputting probability distribution of emotion types;
the speech emotion recognition module 310 is configured to construct a cross entropy loss function according to the probability distribution of emotion categories and the labeled real emotion category labels, and train a pre-constructed speech emotion recognition model by using the cross entropy loss function and a loss function in comparison learning to obtain a trained speech emotion recognition model; and performing voice emotion recognition on the voice video data to be recognized according to the trained voice emotion recognition model.
In one embodiment, the voice text comprises voice data and text data, and the data preprocessing module 304 is further configured to perform data preprocessing on the voice text to obtain a voice vector and a word vector, including:
taking the voice data as a frame according to a window with a fixed time period length, sliding the window backwards, wherein the position of the window after each movement and the position of the previous window have an overlapping part, and obtaining a plurality of single-frame voice data;
converting the single-frame voice data according to an OpenSMILE tool to obtain a voice vector;
taking the sentence length containing most words in the text data as the maximum length, and carrying out zero filling processing on the sentences with the length less than the maximum length in the text data to obtain a plurality of sentences with equal length;
converting words in sentences with equal length according to a Bert preprocessing model to obtain word vectors; the word vector contains semantic information of the context of the word corresponding to the word vector.
In one embodiment, the feature extraction module 306 is further configured to perform emotion feature extraction on the speech vector and the word vector according to the bidirectional GRU model, respectively, to obtain speech emotion features and text emotion features, including:
according to the bidirectional GRU model, the voice vector and the word vector are simultaneously input into a forward GRU model and a backward GRU model respectively, and state information vectors output by the two GRU models at the same time are spliced to obtain corresponding voice emotion characteristics and text emotion characteristics respectively.
In one embodiment, the contrast learning module 308 is further configured to perform enhanced representation on the speech emotional feature, the text emotional feature and the high-level emotional feature according to a contrast learning method, so as to obtain an enhanced speech emotional feature, a text emotional feature and a high-level emotional feature, including:
and according to the contrast learning method, the distances among the speech emotion features, the text emotion features and the high-level emotion features are drawn closer by training the model to reduce a loss function, so that the enhanced speech emotion features, text emotion features and high-level emotion features are obtained.
In one embodiment, the loss function is loss_cons = log(exp(D(S, T)/τ) + exp(D(S, V)/σ)), where D is the L2 distance function measuring the distance between two emotion feature representations, S is the speech emotion feature, T is the text emotion feature, V is the high-level emotion feature, and τ and σ are parameters that adjust the feature representation level.
In one embodiment, the contrast learning module 308 is further configured to concatenate the enhanced speech emotion features, text emotion features, and high-level emotion features, decode the concatenated features by a decoder composed of full concatenation layers, and output a probability distribution of emotion classes, where the probability distribution includes:
splicing the enhanced voice emotion characteristics, text emotion characteristics and high-level emotion characteristics, decoding the spliced voice emotion characteristics, text emotion characteristics and high-level emotion characteristics through a decoder consisting of all connection layers, and outputting probability distribution of emotion types
p_j = exp(F_r^j) / Σ_i exp(F_r^i)

wherein F_r represents the multi-modal features after splicing, p_j represents the probability that the currently recognized emotion belongs to the j-th category, F_r^j is the j-th feature parameter of the multi-modal features, and F_r^i is the i-th feature parameter of the multi-modal features.
In one embodiment, the speech emotion recognition module 310 is further configured to construct a cross entropy loss function according to the probability distribution of emotion classes and the labeled real emotion class labels, including:
constructing a cross entropy loss function according to the probability distribution of the emotion classes and the labeled real emotion class labels as
loss_crE = -Σ_{i=1}^{n} y(x_i) log(p(x_i))
Wherein y represents the real distribution of the emotion types, x represents the emotion type labels, n represents the number of the emotion type labels, and i represents the emotion type label serial number.
For specific limitations of the speech emotion recognition device based on multi-modal features and contrast learning, reference may be made to the above limitations of the speech emotion recognition method based on multi-modal features and contrast learning, and details are not repeated here. The modules in the speech emotion recognition device based on multi-modal features and contrast learning can be wholly or partially realized by software, hardware and a combination thereof. The modules can be embedded in a hardware form or independent of a processor in the computer device, and can also be stored in a memory in the computer device in a software form, so that the processor can call and execute operations corresponding to the modules.
In one embodiment, a computer device is provided, which may be a terminal, and its internal structure diagram may be as shown in fig. 4. The computer device includes a processor, a memory, a network interface, a display screen, and an input device connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a speech emotion recognition method based on multi-modal features and contrast learning. The display screen of the computer equipment can be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer equipment can be a touch layer covered on the display screen, a key, a track ball or a touch pad arranged on the shell of the computer equipment, an external keyboard, a touch pad or a mouse and the like.
Those skilled in the art will appreciate that the architecture shown in fig. 4 is merely a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computing devices to which the disclosed aspects apply, as particular computing devices may include more or less components than those shown, or may combine certain components, or have a different arrangement of components.
In an embodiment, a computer device is provided, comprising a memory storing a computer program and a processor implementing the steps of the method in the above embodiments when the processor executes the computer program.
In an embodiment, a computer storage medium is provided, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method in the above-mentioned embodiments.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above may be implemented by hardware instructions of a computer program, which may be stored in a non-volatile computer-readable storage medium, and when executed, may include the processes of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory, among others. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically Programmable ROM (EPROM), electrically Erasable Programmable ROM (EEPROM), or flash memory. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as Static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double Data Rate SDRAM (DDRSDRAM), enhanced SDRAM (ESDRAM), synchronous Link DRAM (SLDRAM), rambus (Rambus) direct RAM (RDRAM), direct memory bus dynamic RAM (DRDRAM), and memory bus dynamic RAM (RDRAM).
All possible combinations of the technical features in the above embodiments may not be described for the sake of brevity, but should be considered as being within the scope of the present disclosure as long as there is no contradiction between the combinations of the technical features.
The above-mentioned embodiments only express several embodiments of the present application, and the description thereof is more specific and detailed, but not construed as limiting the scope of the invention. It should be noted that, for a person skilled in the art, several variations and modifications can be made without departing from the concept of the present application, which falls within the scope of protection of the present application. Therefore, the protection scope of the present patent shall be subject to the appended claims.

Claims (10)

1. A speech emotion recognition method based on multi-modal features and contrast learning, the method comprising:
acquiring voice video data to be recognized; the voice video data comprises voice text and video data;
constructing a speech emotion recognition model; the speech emotion recognition model comprises a Fast RCNN preprocessing model, a bidirectional GRU model, a 3D convolution network and a full-connection layer;
carrying out data preprocessing on the voice text to obtain a voice vector and a word vector;
extracting local characteristics of speakers in the video data by using a Fast RCNN preprocessing model, and fusing the local characteristics with a global characteristic diagram of the video data after the local characteristics are expanded to obtain fused video data; the local features include facial expressions and hand movements;
extracting the emotional characteristics of the voice vector and the word vector according to a bidirectional GRU model to obtain voice emotional characteristics and text emotional characteristics; extracting emotional characteristics of the fused video data by using a 3D convolutional network to obtain high-grade emotional characteristics;
performing enhancement representation on the voice emotion characteristics, the text emotion characteristics and the high-level emotion characteristics according to a comparison learning method to obtain enhanced voice emotion characteristics, text emotion characteristics and high-level emotion characteristics;
splicing the enhanced voice emotion characteristics, the enhanced text emotion characteristics and the enhanced high-level emotion characteristics, decoding the voice emotion characteristics, the enhanced text emotion characteristics and the enhanced high-level emotion characteristics through a decoder consisting of a full connection layer, and outputting probability distribution of emotion types;
constructing a cross entropy loss function according to the probability distribution of the emotion types and the labeled real emotion type labels, and training a pre-constructed speech emotion recognition model by using the cross entropy loss function and a loss function in comparison learning to obtain a trained speech emotion recognition model;
and performing voice emotion recognition on the voice video data to be recognized according to the trained voice emotion recognition model.
2. The method according to claim 1, wherein the speech text comprises speech data and text data; carrying out data preprocessing on the voice text to obtain a voice vector and a word vector, and comprising the following steps:
taking the voice data as a frame according to a window with a fixed time period length, sliding the window backwards, wherein the position of the window after each movement and the position of the previous window have an overlapping part, and obtaining a plurality of single-frame voice data;
converting the single-frame voice data according to an OpenSMILE tool to obtain a voice vector;
taking the sentence length containing the most words in the text data as the maximum length, and carrying out zero filling processing on the sentences in the text data whose length is less than the maximum length to obtain a plurality of sentences of equal length;
converting words in the sentences with the same length according to a Bert preprocessing model to obtain word vectors; the word vector contains semantic information of the context of the word corresponding to the word vector.
3. The method of claim 1, wherein performing emotion feature extraction on the speech vector and the word vector according to a bidirectional GRU model to obtain speech emotion features and text emotion features respectively comprises:
and according to the bidirectional GRU model, the voice vector and the word vector are respectively and simultaneously input into a forward GRU model and a reverse GRU model, and the state information vectors output by the two GRU models at the same time are spliced to obtain the voice emotion characteristic and the text emotion characteristic which respectively correspond to each other.
4. The method according to any one of claims 1 to 3, wherein the enhancing expression of the speech emotional features, the text emotional features and the high-level emotional features is performed according to a contrast learning method, so as to obtain enhanced speech emotional features, text emotional features and high-level emotional features, and the method comprises:
and according to the contrast learning method, the distances among the speech emotion features, the text emotion features and the high-level emotion features are narrowed by training to reduce a loss function, so that the enhanced speech emotion features, text emotion features and high-level emotion features are obtained.
5. The method of claim 4, wherein the loss function is loss_cons = log(exp(D(S, T)/τ) + exp(D(S, V)/σ)), where D is an L2 distance function measuring the distance between two emotion feature representations, S is the speech emotion feature, T is the text emotion feature, V is the high-level emotion feature, and τ and σ are parameters that adjust the feature representation level.
6. The method of claim 1, wherein the concatenating the enhanced speech emotion feature, the enhanced text emotion feature and the enhanced high-level emotion feature, decoding the concatenated features by a decoder composed of full concatenation layers, and outputting a probability distribution of emotion classes, comprises:
splicing the enhanced voice emotion characteristics, the enhanced text emotion characteristics and the enhanced high-level emotion characteristics, decoding the enhanced voice emotion characteristics, the enhanced text emotion characteristics and the enhanced high-level emotion characteristics through a decoder consisting of all connection layers, and outputting probability distribution of emotion types as
p_j = exp(F_r^j) / Σ_i exp(F_r^i)

wherein F_r represents the multi-modal features after splicing, p_j represents the probability that the currently recognized emotion belongs to the j-th category, F_r^j is the j-th feature parameter of the multi-modal features, and F_r^i is the i-th feature parameter of the multi-modal features.
7. The method of claim 6, wherein constructing the cross-entropy loss function according to the probability distribution of the emotion classes and the labeled true emotion class labels comprises:
constructing the cross-entropy loss function according to the probability distribution of the emotion classes and the labeled true emotion class labels as

loss_ce = -Σ_{i=1}^{n} y(x_i) log(p(x_i))

where y denotes the true distribution of the emotion classes, x denotes the emotion class labels, n denotes the number of emotion class labels, and i denotes the index of an emotion class label.
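The sketch below shows one way the cross-entropy term and the joint training objective could be computed, reusing the contrastive_loss sketch under claim 5; the weighting between the two terms is an assumption, since the claims only state that both losses are used during training.

```python
# Cross-entropy over the predicted emotion distribution plus the contrastive term.
import torch

def cross_entropy_loss(probs, labels, eps=1e-8):
    """-sum_i y_i * log(p_i), with y the one-hot true emotion distribution."""
    one_hot = torch.nn.functional.one_hot(labels, num_classes=probs.size(-1)).float()
    return -(one_hot * torch.log(probs + eps)).sum(dim=-1).mean()

def total_loss(probs, labels, S, T, V, lam=0.1):   # lam is an assumed weighting
    return cross_entropy_loss(probs, labels) + lam * contrastive_loss(S, T, V)
```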
8. A speech emotion recognition apparatus based on multi-modal features and contrast learning, the apparatus comprising:
a model building module, configured to acquire voice video data to be recognized, the voice video data comprising speech text and video data, and to construct a speech emotion recognition model, the speech emotion recognition model comprising a Fast RCNN preprocessing model, a bidirectional GRU model, a 3D convolutional network and a fully connected layer;
a data preprocessing module, configured to preprocess the speech text to obtain a speech vector and word vectors, to extract local features of the speaker in the video data with the Fast RCNN preprocessing model, and to fuse the local features, after enlarging them, with the global feature map of the video data to obtain fused video data, the local features comprising facial expressions and hand movements;
a feature extraction module, configured to perform emotion feature extraction on the speech vector and the word vectors with the bidirectional GRU model to obtain speech emotion features and text emotion features, and to perform emotion feature extraction on the fused video data with the 3D convolutional network to obtain high-level emotion features;
a contrast learning module, configured to perform enhanced representation of the speech emotion features, the text emotion features and the high-level emotion features according to the contrast learning method to obtain enhanced speech emotion features, text emotion features and high-level emotion features, to concatenate the enhanced features, to decode them with a decoder composed of fully connected layers, and to output the probability distribution of emotion classes; and
a speech emotion recognition module, configured to construct a cross-entropy loss function according to the probability distribution of the emotion classes and the labeled true emotion class labels, to train the pre-constructed speech emotion recognition model with the cross-entropy loss function and the loss function of contrast learning to obtain a trained speech emotion recognition model, and to perform speech emotion recognition on the voice video data to be recognized according to the trained speech emotion recognition model.
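As a rough orientation only, the skeleton below suggests how the apparatus modules could fit together in PyTorch; the stand-in 3D convolutional network, the pooling of GRU outputs, and all dimensions are simplified assumptions rather than the patented implementation, and the Fast RCNN fusion is presumed to have been applied to the video frames beforehand.

```python
# High-level skeleton of the multi-modal speech emotion recognizer (assumed design).
import torch
import torch.nn as nn

class SpeechEmotionRecognizer(nn.Module):
    def __init__(self, speech_dim, text_dim, hidden_dim, num_classes):
        super().__init__()
        self.speech_gru = nn.GRU(speech_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.text_gru = nn.GRU(text_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.video_cnn = nn.Sequential(                    # stand-in for the 3D convolutional network
            nn.Conv3d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1), nn.Flatten(),
            nn.Linear(16, 2 * hidden_dim),
        )
        self.decoder = nn.Sequential(                      # fully connected decoder
            nn.Linear(6 * hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, num_classes),
        )

    def forward(self, speech_vec, word_vec, fused_video):
        # fused_video: [batch, 3, frames, H, W], frames in which the Fast RCNN local
        # features (face, hands) have already been fused with the global feature map
        S = self.speech_gru(speech_vec)[0].mean(dim=1)     # pooled speech emotion feature
        T = self.text_gru(word_vec)[0].mean(dim=1)         # pooled text emotion feature
        V = self.video_cnn(fused_video)                    # high-level (visual) emotion feature
        probs = torch.softmax(self.decoder(torch.cat([S, T, V], dim=-1)), dim=-1)
        return probs, (S, T, V)                            # features reused by the contrastive loss
```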
9. A computer device comprising a memory and a processor, the memory storing a computer program, wherein the processor, when executing the computer program, carries out the steps of the method according to any one of claims 1 to 7.
10. A computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, carries out the steps of the method according to any one of claims 1 to 7.
CN202210825038.7A 2022-07-14 2022-07-14 Voice emotion recognition method and device based on multi-modal characteristics and contrast learning Active CN115240713B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210825038.7A CN115240713B (en) 2022-07-14 2022-07-14 Voice emotion recognition method and device based on multi-modal characteristics and contrast learning

Publications (2)

Publication Number Publication Date
CN115240713A true CN115240713A (en) 2022-10-25
CN115240713B CN115240713B (en) 2024-04-16

Family

ID=83674345

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210825038.7A Active CN115240713B (en) 2022-07-14 2022-07-14 Voice emotion recognition method and device based on multi-modal characteristics and contrast learning

Country Status (1)

Country Link
CN (1) CN115240713B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115662440A (en) * 2022-12-27 2023-01-31 广州佰锐网络科技有限公司 Voiceprint feature identification method and system based on machine learning

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108766462A (en) * 2018-06-21 2018-11-06 浙江中点人工智能科技有限公司 A kind of phonic signal character learning method based on Meier frequency spectrum first derivative
US11194972B1 (en) * 2021-02-19 2021-12-07 Institute Of Automation, Chinese Academy Of Sciences Semantic sentiment analysis method fusing in-depth features and time sequence models
CN114511906A (en) * 2022-01-20 2022-05-17 重庆邮电大学 Cross-modal dynamic convolution-based video multi-modal emotion recognition method and device and computer equipment

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
JUNFENG ZHANG et al.: "Multi-head attention fusion networks for multi-modal speech emotion recognition", COMPUTERS & INDUSTRIAL ENGINEERING, vol. 168, 10 March 2022 (2022-03-10), XP087028022, DOI: 10.1016/j.cie.2022.108078 *
ZHANG JUNFENG: "Research on speech emotion recognition methods based on multi-feature and multi-modal fusion", CHINA MASTER'S THESES FULL-TEXT DATABASE, INFORMATION SCIENCE AND TECHNOLOGY SERIES (MONTHLY), 15 June 2023 (2023-06-15) *
CAO QI: "Research on speech emotion recognition methods based on multi-feature and multi-modal fusion", CHINA MASTER'S THESES FULL-TEXT DATABASE, INFORMATION SCIENCE AND TECHNOLOGY SERIES (MONTHLY), 15 March 2022 (2022-03-15), pages 1-59 *

Also Published As

Publication number Publication date
CN115240713B (en) 2024-04-16

Similar Documents

Publication Publication Date Title
CN110444193B (en) Method and device for recognizing voice keywords
CN111368565B (en) Text translation method, text translation device, storage medium and computer equipment
CN110188343B (en) Multi-mode emotion recognition method based on fusion attention network
CN107979764B (en) Video subtitle generating method based on semantic segmentation and multi-layer attention framework
CN108305612B (en) Text processing method, text processing device, model training method, model training device, storage medium and computer equipment
CN111898670B (en) Multi-mode emotion recognition method, device, equipment and storage medium
CN111833845B (en) Multilingual speech recognition model training method, device, equipment and storage medium
CN112528637B (en) Text processing model training method, device, computer equipment and storage medium
KR20230040951A (en) Speech recognition method, apparatus and device, and storage medium
CN114245203B (en) Video editing method, device, equipment and medium based on script
CN111985243B (en) Emotion model training method, emotion analysis device and storage medium
CN111950275B (en) Emotion recognition method and device based on recurrent neural network and storage medium
CN115599901B (en) Machine question-answering method, device, equipment and storage medium based on semantic prompt
CN112214585A (en) Reply message generation method, system, computer equipment and storage medium
WO2023226239A1 (en) Object emotion analysis method and apparatus and electronic device
JP2022503812A (en) Sentence processing method, sentence decoding method, device, program and equipment
CN111145914B (en) Method and device for determining text entity of lung cancer clinical disease seed bank
CN114882862A (en) Voice processing method and related equipment
CN114694255B (en) Sentence-level lip language recognition method based on channel attention and time convolution network
CN114550239A (en) Video generation method and device, storage medium and terminal
CN112183106A (en) Semantic understanding method and device based on phoneme association and deep learning
CN116310983A (en) Multi-mode emotion recognition method and device
CN115240713B (en) Voice emotion recognition method and device based on multi-modal characteristics and contrast learning
CN114446324A (en) Multi-mode emotion recognition method based on acoustic and text features
CN115374784A (en) Chinese named entity recognition method based on multi-mode information selective fusion

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant