CN112418034A - Multi-modal emotion recognition method and device, electronic equipment and storage medium - Google Patents

Multi-modal emotion recognition method and device, electronic equipment and storage medium

Info

Publication number
CN112418034A
CN112418034A (application CN202011262785.1A)
Authority
CN
China
Prior art keywords
text
matrix
image
data
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011262785.1A
Other languages
Chinese (zh)
Inventor
曾祥云
顾文元
张雪源
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Yuanmeng Intelligent Technology Co ltd
Yuanmeng Humanistic Intelligence International Co Ltd
Original Assignee
Yuanmeng Human Intelligence International Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Yuanmeng Human Intelligence International Co ltd filed Critical Yuanmeng Human Intelligence International Co ltd
Priority to CN202011262785.1A
Publication of CN112418034A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 - Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/16 - Human faces, e.g. facial parts, sketches or expressions
    • G06V 40/174 - Facial expression recognition
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/24 - Classification techniques
    • G06F 18/241 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/2411 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 - Handling natural language data
    • G06F 40/30 - Semantic analysis
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/26 - Speech to text systems
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/48 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L 25/51 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L 25/63 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state


Abstract

The invention relates to the technical fields of speech recognition and image processing, and provides a multi-modal emotion recognition method and device, an electronic device, and a storage medium. The method comprises the following steps: de-duplicating video data of an object to be recognized to obtain face time sequence image data of the object; acquiring text data of the object in real time while the video data are acquired; and inputting the aligned face time sequence image data and text data into a multi-modal emotion recognition model to perform multi-modal emotion recognition on the object. By capturing in real time the facial expression of the user of the virtual human and the text content of the conversation, and by feeding the image and text signals jointly into the model, rich multi-dimensional features are obtained, which improves the accuracy and robustness of emotion classification and detection. The accuracy is particularly high in sarcastic scenarios.

Description

Multi-modal emotion recognition method and device, electronic equipment and storage medium
Technical Field
The invention relates to the technical fields of speech recognition and image processing, and in particular to a multi-modal emotion recognition method and device, an electronic device, and a storage medium.
Background
In point-to-point chat between a virtual human and a person, the person's emotion needs to be recognized in real time, a corresponding answer needs to be generated according to the emotion recognition result, and the output of voice, text and action needs to be guided in multiple dimensions. Emotion recognition is therefore very important for improving the emotional-companionship experience of the virtual human.
Most existing emotion recognition methods are text-based. In virtual-human interaction, however, the text comes entirely from speech recognition results, and because the accuracy of speech recognition is not always high, the resulting text contains a certain amount of noise.
Moreover, when a person speaks sarcastically, the sarcastic tone is completely lost once the speech is converted into text by speech recognition, which makes emotion recognition inaccurate.
Disclosure of Invention
The invention aims to provide a multi-modal emotion recognition method and device, an electronic device, and a storage medium whose recognition accuracy is high, particularly in sarcastic scenarios.
The technical scheme provided by the invention is as follows:
a multi-modal emotion recognition method comprises the following steps:
removing duplication of video data of an object to be recognized, and acquiring face time sequence image data of the object to be recognized;
when the video data of the object to be recognized is obtained, the text data of the object to be recognized is obtained in real time;
and inputting the aligned face time sequence image data and the aligned text data into a multi-modal emotion recognition model so as to perform multi-modal emotion recognition on the object to be recognized.
Further preferably, when the video data of the object to be recognized is obtained, the text data of the object to be recognized is obtained in real time, and the method specifically includes the steps of:
acquiring voice data input by the object to be recognized in each round of conversation;
and translating the voice data into text data in real time through a voice recognition interface.
Further preferably, the step of inputting the aligned face time sequence image data and the aligned text data into a multi-modal emotion recognition model to perform multi-modal emotion recognition on the object to be recognized specifically includes the steps of:
extracting a first dual-mode feature taking an image as a core and a second dual-mode feature taking a text as a core by utilizing the multi-mode emotion recognition model;
performing feature splicing on the first dual-mode feature and the second dual-mode feature to obtain a target feature;
inputting the target features into a softmax classifier of the multi-modal emotion recognition model for classification and loss calculation, so as to obtain the multi-modal emotion of the object to be recognized.
Further preferably, the extracting of the first dual-modal feature with an image as a core by using the multi-modal emotion recognition model specifically includes the steps of:
convolving the image semantic time sequence vector in the face time sequence image data by adopting a defined image convolution layer to obtain image time sequence characteristics;
compressing the image time sequence characteristics on a channel to obtain image characteristic vectors;
respectively normalizing the text feature vector and the image feature vector obtained based on the text data;
and interacting the image characteristic vector and the text characteristic vector through a cross-modal attention mechanism layer of the multi-modal emotion recognition model to obtain the first bi-modal characteristic.
Further preferably, before the normalizing the text feature vector and the image feature vector obtained based on the text data, respectively, the method further includes the steps of:
normalizing the image feature vector and multiplying the normalized image feature vector by a preset coefficient;
carrying out position coding on the image characteristic vector, carrying out point-to-point addition on the obtained position coding vector and the image characteristic vector, and randomly setting the position coding vector and the image characteristic vector to be zero according to a preset probability so as to obtain an initial image characteristic matrix;
wherein the initial image feature matrix is used to duplicate the text feature vector.
Further preferably, the interacting the image feature vector and the text feature vector through a cross-modal attention mechanism layer of the multi-modal emotion recognition model to obtain the first bi-modal feature specifically includes:
after copying the text characteristic vector, respectively carrying out linear transformation on the text characteristic vector and the image characteristic vector to obtain a first text matrix and a first image matrix, and carrying out linear transformation on the initial image characteristic matrix to obtain a current image characteristic matrix;
respectively changing the shapes of the first text matrix, the first image matrix and the current image feature matrix to obtain the current image feature matrix with the changed shape;
carrying out matrix multiplication on the current image feature matrix with the changed shape and the first text matrix to obtain a first weight matrix;
converting the first weight matrix into a probability matrix, and setting elements on the probability matrix to be zero to obtain a second weight matrix;
multiplying the second weight matrix with the first image matrix to obtain a first dual-mode matrix;
and transforming the first dual-mode matrix by using a linear transformer, and normalizing to obtain the first dual-mode characteristic.
Further preferably, the extracting of the second dual-modal feature with text as a core by using the multi-modal emotion recognition model comprises the following steps:
convolving the text semantic vectors in the text data by adopting a defined text convolution layer to obtain text features;
compressing the text features on a channel to obtain text feature vectors;
respectively normalizing the text feature vector and the image feature vector;
and interacting the image feature vector and the text feature vector through a cross-modal attention mechanism layer of the multi-modal emotion recognition model to obtain the second dual-modal feature.
Further preferably, before the normalizing the text feature vector and the image feature vector respectively, the method further includes the steps of:
normalizing the text feature vector and multiplying the normalized text feature vector by a preset coefficient;
carrying out position coding on the text characteristic vector, carrying out point-to-point addition on the obtained position coding vector and the text characteristic vector, and randomly setting the position coding vector and the text characteristic vector to be zero according to a preset probability so as to obtain an initial text characteristic matrix;
wherein the initial text feature matrix is used to duplicate the image feature vector.
Further preferably, the interacting the image feature vector and the text feature vector through a cross-modal attention mechanism layer of the multi-modal emotion recognition model to obtain the second bi-modal feature specifically includes:
after copying the image characteristic vector, respectively carrying out linear transformation on the text characteristic vector and the image characteristic vector to obtain a second text matrix and a second image matrix, and carrying out linear transformation on the initial text characteristic matrix to obtain a current text characteristic matrix;
respectively changing the shapes of the second text matrix, the second image matrix and the current text feature matrix to obtain the current text feature matrix with the changed shapes;
carrying out matrix multiplication on the current text characteristic matrix with the changed shape and the second image matrix to obtain a third weight matrix;
converting the third weight matrix into a probability matrix, and setting elements on the probability matrix to be zero to obtain a fourth weight matrix;
multiplying the fourth weight matrix with the second text matrix to obtain a second bimodal matrix;
and transforming the second dual-mode matrix by using a linear transformer, and normalizing to obtain the second dual-mode characteristic.
Further preferably, the removing duplicate of the video data of the object to be recognized to obtain the face time sequence image data of the object to be recognized specifically includes the steps of:
performing background modeling with a ViBe algorithm, and extracting a binarized grayscale contour map relative to the static background;
performing corresponding morphological operations and a bitwise AND operation between the binarized grayscale contour map and the corresponding original image, and removing the background to keep the foreground picture;
after the foreground pictures in the video data of the object to be recognized are extracted, calculating the similarity between consecutive frames with a perceptual hash algorithm, and deleting a picture when the similarity exceeds a preset threshold;
carrying out face detection on the obtained picture to obtain a face picture;
and storing the face picture into face time sequence image data according to the time sequence of the video data of the object to be recognized.
Further preferably, after the acquiring the text data of the object to be recognized in real time when the video data of the object to be recognized is acquired, the method further includes the steps of:
and converting the face time sequence image data and the text data into feature vectors with preset dimensions for alignment, so as to provide a data format and high-dimensional data for aligning the face time sequence image data and the text data for the multi-modal emotion recognition model.
A multi-modal emotion recognition device, comprising:
the duplication removing module is used for carrying out duplication removal on the video data of the object to be recognized and acquiring face time sequence image data of the object to be recognized;
the acquisition module is used for acquiring the text data of the object to be identified in real time when the video data of the object to be identified is acquired;
and the identification module is used for inputting the aligned face time sequence image data and the aligned text data into a multi-modal emotion identification model so as to perform multi-modal emotion identification on the object to be identified.
An electronic device, the electronic device comprising:
a processor; and a memory storing computer-executable instructions that, when executed, cause the processor to perform the multimodal emotion recognition method.
A storage medium having stored therein at least one instruction that is loaded and executed by a processor to perform operations performed by the multimodal emotion recognition method.
The multi-modal emotion recognition method, the device, the electronic equipment and the storage medium provided by the invention at least have the following beneficial effects:
1) The expression of the user of the virtual human and the text content of the conversation are captured in real time during the conversation, and rich multi-dimensional features are obtained by feeding the image and text signals jointly into the model, which improves the accuracy and robustness of emotion classification and detection. The accuracy is particularly high in sarcastic scenarios.
2) In order to enable the emotion-companion virtual human to better sense the user's emotion dynamically and to generate more humanized and warmer chat content, the invention fuses video expression information with the conversation text content, so that the raw information obtained has more dimensions and is more faithful, and the noise interference caused by incorrect speech recognition is reduced.
3) During video data acquisition, in order to speed up the whole chat pipeline, the input video is de-duplicated: highly repetitive frames, abnormal frames and frames without a face are deleted, which reduces the computation of the subsequent steps.
4) A pre-trained BERT model is used to extract features of the input text, yielding a representation of every token; the first token [CLS] is taken as the representation of the whole sentence, so the user's textual information is captured accurately.
Drawings
The foregoing features, technical features, and advantages of the multimodal emotion recognition method, apparatus, electronic device, and storage medium, and implementations thereof will be further described in the following detailed description of preferred embodiments in conjunction with the accompanying drawings.
FIG. 1 is a flow diagram of one embodiment of a multi-modal emotion recognition method in the present invention;
FIG. 2 is a flow diagram of another embodiment of a method for multi-modal emotion recognition in the present invention;
FIG. 3 is a schematic diagram of an embodiment of a multi-modal emotion recognition apparatus in the present invention;
FIG. 4 is a block diagram of a multi-modal emotion recognition model in the present invention;
fig. 5 is a schematic structural diagram of the electronic device of the present invention.
Detailed Description
In the following description, for purposes of explanation and not limitation, specific details are set forth, such as particular system structures, techniques, etc. in order to provide a thorough understanding of the embodiments of the present application. However, it will be apparent to one skilled in the art that the present application may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present application with unnecessary detail.
It will be understood that the terms "comprises" and/or "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
For the sake of simplicity, the drawings only schematically show the parts relevant to the present invention, and they do not represent the actual structure as a product. In addition, in order to make the drawings concise and understandable, components having the same structure or function in some of the drawings are only schematically illustrated or only labeled. In this document, "one" means not only "only one" but also a case of "more than one".
It should be further understood that the term "and/or" as used in this specification and the appended claims refers to and includes any and all possible combinations of one or more of the associated listed items.
In addition, in the description of the present application, the terms "first", "second", and the like are used only for distinguishing the description, and are not intended to indicate or imply relative importance.
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the following description will be made with reference to the accompanying drawings. It is obvious that the drawings in the following description are only some examples of the invention, and that for a person skilled in the art, other drawings and embodiments can be derived from them without inventive effort.
Example one
One embodiment of the present invention, as shown in fig. 1, is a method for multi-modal emotion recognition, including:
s100, duplicate removal is carried out on video data of an object to be recognized, and face time sequence image data of the object to be recognized are obtained.
Specifically, at the data input layer of the training phase, the input consists of video data and text conversation data collected in real time during the chat. The video data are captured in real time by the virtual human's camera, and a sequence of face images of the chat partner is then extracted by an algorithm.
The de-duplication of the video data of the object to be recognized to obtain the face time sequence image data of the object to be recognized specifically comprises the following steps:
Background modeling is performed with the ViBe algorithm, and a binarized grayscale contour map relative to the static background is extracted; corresponding morphological operations and a bitwise AND operation with the corresponding original image are applied to the contour map, and the background is removed to keep the foreground picture; after the foreground pictures in the video data of the object to be recognized are extracted, the similarity between consecutive frames is computed with a perceptual hash algorithm, and a picture is deleted when the similarity exceeds a preset threshold; face detection is then performed on the remaining pictures to obtain face pictures, which are stored as face time sequence image data in the time order of the video data of the object to be recognized.
An exemplary video deduplication algorithm of a video data input layer of the multi-modal emotion recognition model mainly comprises the following steps:
and (3) performing background modeling by adopting a Vibe algorithm, and extracting a binary gray level profile map relative to a static background.
And performing responsive morphological operation and position and operation on the obtained binary gray level contour map and the original image, removing the background and keeping the foreground image.
After foreground pictures in the video stream are extracted, the similarity between the following pictures and the previous frames of pictures is calculated by adopting a perceptual hash algorithm, and if the similarity exceeds a threshold value, the pictures are deleted.
And carrying out face detection on the obtained picture to obtain a face picture, wherein other parts of the picture are not lost.
And storing the obtained picture set according to the time sequence of the video.
In the embodiment, in the video data acquisition process, in order to accelerate the whole chat process, the input video is subjected to de-duplication, and the video with high duplication degree, the abnormal video and the video without the human face are deleted, so that the calculation amount of the subsequent steps is reduced.
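For illustration, the frame-similarity part of this de-duplication step can be sketched in Python with OpenCV as follows. This is not the patent's code: the hash size, the similarity threshold and the use of a DCT-based perceptual hash are assumptions consistent with the description above.

```python
import cv2
import numpy as np

def phash(image, hash_size=8):
    """Perceptual hash: DCT of a down-scaled grayscale frame, thresholded at its median."""
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    small = cv2.resize(gray, (hash_size * 4, hash_size * 4)).astype(np.float32)
    dct = cv2.dct(small)[:hash_size, :hash_size]      # keep the low-frequency block
    return (dct > np.median(dct)).flatten()           # boolean fingerprint of the frame

def dedup_frames(frames, sim_threshold=0.9):          # threshold value is an assumption
    """Keep a foreground frame only if it differs enough from the last kept frame."""
    kept, last_hash = [], None
    for frame in frames:
        h = phash(frame)
        if last_hash is not None:
            similarity = np.mean(h == last_hash)       # fraction of matching hash bits
            if similarity > sim_threshold:             # too similar to the previous frame: drop it
                continue
        kept.append(frame)
        last_hash = h
    return kept
```

Face detection would then be applied to the kept frames, and the face pictures stored in the time order of the video as described above.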
S200, when the video data of the object to be recognized is obtained, the text data of the object to be recognized is obtained in real time.
Specifically, the text data are obtained by collecting the chat speech data in real time; the chat data take the form of single sentences and are converted into text through speech recognition.
S300, inputting the aligned face time sequence image data and the aligned text data into a multi-modal emotion recognition model to perform multi-modal emotion recognition on the object to be recognized.
Specifically, the text information, i.e. the text data, is aligned with the picture information obtained from the video data input layer and stored as a specific data structure.
In this embodiment, in order to enable the emotion-companion virtual human to better sense the user's emotion dynamically and to generate more humanized and warmer chat content, the invention fuses the video expression information with the conversation text content, so that the raw information obtained has more dimensions and is more faithful, and the noise interference caused by incorrect speech recognition is reduced. In addition, a single bi-LSTM layer is used to extract features from the de-duplicated pictures within a sentence, giving an image representation vector of fixed shape and dimension.
Example two
Based on the foregoing embodiment, the same parts as those in the foregoing embodiment are not repeated in detail in this embodiment, and as shown in fig. 2, this embodiment provides a multi-modal emotion recognition method, which specifically includes:
s100, duplicate removal is carried out on video data of an object to be recognized, and face time sequence image data of the object to be recognized are obtained.
S201, voice data input by the object to be recognized in each round of conversation is obtained.
S202, the voice data is translated into text data in real time through a voice recognition interface.
Illustratively, the text data input layer of the multi-modal emotion recognition model:
1. During the chat, each sentence spoken by the user is captured, and the system waits for the user to finish a round of conversation so as to obtain a complete single round, since a single round may contain several sentences.
2. The acquired speech data are translated into text in real time through a speech recognition interface.
3. The text obtained in step 2 is aligned with the picture information produced by the video data input layer, and both are stored as a specific data structure.
S300, inputting the aligned face time sequence image data and the aligned text data into a multi-modal emotion recognition model to perform multi-modal emotion recognition on the object to be recognized.
In this embodiment, a pre-trained BERT model is used to extract features of the input text, yielding a representation of each token; the first token [CLS] is taken as the representation of the whole sentence, so the user's textual information is captured accurately.
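A minimal sketch of this text-characterization step with the Hugging Face transformers library is shown below; the checkpoint name bert-base-chinese is an assumption, and the patent's BERT would additionally be fine-tuned on chat data.

```python
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")   # assumed checkpoint
bert = BertModel.from_pretrained("bert-base-chinese")

def sentence_representation(text: str) -> torch.Tensor:
    """Return the [CLS] vector as the representation of the whole sentence."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        outputs = bert(**inputs)
    return outputs.last_hidden_state[:, 0]    # position 0 is the [CLS] token (768-dim)

cls_vec = sentence_representation("今天心情不错")   # shape: [1, 768]
```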
Example three
Based on the foregoing embodiment, in this embodiment, the inputting the aligned face time series image data and the aligned text data into a multi-modal emotion recognition model to perform multi-modal emotion recognition on the object to be recognized specifically includes:
extracting, with the multi-modal emotion recognition model, a first dual-mode feature taking the image as its core and a second dual-mode feature taking the text as its core; performing feature splicing on the first dual-mode feature and the second dual-mode feature to obtain a target feature; and inputting the target feature into the softmax classifier of the multi-modal emotion recognition model for classification and loss calculation, so as to obtain the multi-modal emotion of the object to be recognized.
Preferably, the extracting of the first dual-modal feature with an image as a core by using the multi-modal emotion recognition model specifically includes the steps of:
convolving the image semantic time sequence vector in the face time sequence image data by adopting a defined image convolution layer to obtain image time sequence characteristics; and compressing the image time sequence characteristics on a channel to obtain an image characteristic vector.
Normalizing the image feature vector and multiplying the normalized image feature vector by a preset coefficient; carrying out position coding on the image characteristic vector, carrying out point-to-point addition on the obtained position coding vector and the image characteristic vector, and randomly setting the position coding vector and the image characteristic vector to be zero according to a preset probability so as to obtain an initial image characteristic matrix; wherein the initial image feature matrix is used to duplicate the text feature vector.
Respectively normalizing the text feature vector and the image feature vector obtained based on the text data; and interacting the image characteristic vector and the text characteristic vector through a cross-modal attention mechanism layer of the multi-modal emotion recognition model to obtain the first bi-modal characteristic.
The interaction of the image feature vector and the text feature vector through the cross-modal attention mechanism layer of the multi-modal emotion recognition model to obtain the first bi-modal feature specifically comprises the following steps:
after copying the text characteristic vector, respectively carrying out linear transformation on the text characteristic vector and the image characteristic vector to obtain a first text matrix and a first image matrix, and carrying out linear transformation on the initial image characteristic matrix to obtain a current image characteristic matrix.
The shapes of the first text matrix, the first image matrix and the current image feature matrix are changed respectively, so that the current image feature matrix takes the shape [W, B×H, D].
Here W is the image width × image height × 3, B is the batch size, H is the number of attention heads, D is the feature dimension, and H × D = 400. The reshaped current image feature matrix is matrix-multiplied with the first text matrix to obtain a first weight matrix; the first weight matrix is converted into a probability matrix, and elements of the probability matrix are set to zero to obtain a second weight matrix; the second weight matrix is multiplied with the first image matrix to obtain a first dual-mode matrix; and the first dual-mode matrix is transformed by a linear transformer and normalized to obtain the first dual-mode feature.
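The cross-modal attention described in this embodiment can be sketched in PyTorch as follows. This is an illustrative re-implementation, not the patent's code: the scaled dot product, head count, dropout rate and the [B, heads, length, d_head] reshape convention (instead of the [W, B×H, D] layout above) are assumptions, but the flow is the same: queries come from the image features, keys and values from the duplicated text features, some attention weights are zeroed, and the result is linearly transformed and normalized.

```python
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    """Image-centered cross-modal attention: Q from image features, K/V from text features."""
    def __init__(self, dim=400, heads=8, dropout=0.1):   # heads/dropout are assumed values
        super().__init__()
        assert dim % heads == 0
        self.heads, self.d_head = heads, dim // heads
        self.to_q = nn.Linear(dim, dim)   # linear transform of the (initial) image feature matrix
        self.to_k = nn.Linear(dim, dim)   # linear transforms of the copied text features
        self.to_v = nn.Linear(dim, dim)
        self.out = nn.Linear(dim, dim)    # final linear transformer
        self.drop = nn.Dropout(dropout)   # randomly zeroes elements of the weight matrix
        self.norm = nn.LayerNorm(dim)

    def forward(self, img_feat, txt_feat):
        # img_feat: [B, Li, dim] image tokens, txt_feat: [B, Lt, dim] text tokens
        B = img_feat.size(0)
        def split(x):   # reshape to [B, heads, length, d_head]
            return x.view(B, -1, self.heads, self.d_head).transpose(1, 2)
        q = split(self.to_q(img_feat))
        k, v = split(self.to_k(txt_feat)), split(self.to_v(txt_feat))
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.d_head ** 0.5, dim=-1)
        attn = self.drop(attn)             # second weight matrix with some elements set to zero
        fused = (attn @ v).transpose(1, 2).reshape(B, -1, self.heads * self.d_head)
        return self.norm(self.out(fused))  # first dual-mode feature (image as core)
```

The text-centered branch described next is symmetric: it swaps the roles of the two modalities, taking Q from the text features and K/V from the copied image features.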
Preferably, the extracting of the second dual-modal feature taking the text as the core by using the multi-modal emotion recognition model comprises the following steps:
and convolving the text semantic vectors in the text data by adopting a defined text convolution layer to obtain text features, compressing the text features on a channel to obtain text feature vectors, respectively normalizing the text feature vectors and the image feature vectors, and interacting the image feature vectors and the text feature vectors through a cross-modal attention mechanism layer of the multi-modal emotion recognition model to obtain the second dual-modal features.
Preferably, before the normalizing the text feature vector and the image feature vector respectively, the method further includes the steps of:
and normalizing the text feature vector, multiplying the normalized text feature vector by a preset coefficient, performing position coding on the text feature vector, performing point-to-point addition on the obtained position coding vector and the text feature vector, and randomly setting the position coding vector and the text feature vector to be zero according to a preset probability so as to obtain an initial text feature matrix. Wherein the initial text feature matrix is used to duplicate the image feature vector.
Preferably, the interacting the image feature vector and the text feature vector through a cross-modal attention mechanism layer of the multi-modal emotion recognition model to obtain the second dual-modal feature specifically includes:
after copying the image characteristic vector, respectively carrying out linear transformation on the text characteristic vector and the image characteristic vector to obtain a second text matrix and a second image matrix, and carrying out linear transformation on the initial text characteristic matrix to obtain a current text characteristic matrix.
The shapes of the second text matrix, the second image matrix and the current text feature matrix are changed respectively, so that the current text feature matrix takes the shape [W, B×H, D].
Here W is the image width × image height × 3, B is the batch size, H is the number of attention heads, D is the feature dimension, and H × D = 400.
And performing matrix multiplication on the current text feature matrix and the second image matrix to obtain a third weight matrix, converting the third weight matrix into a probability matrix, setting elements on the probability matrix to be zero to obtain a fourth weight matrix, multiplying the fourth weight matrix and the second text matrix to obtain a second double-modal matrix, converting the second double-modal matrix by using a linear converter, and normalizing to obtain the second double-modal feature.
Preferably, after the acquiring the text data of the object to be recognized in real time while acquiring the video data of the object to be recognized, the method further includes the steps of:
and converting the face time sequence image data and the text data into feature vectors with preset dimensions for alignment, so as to provide a data format and high-dimensional data for aligning the face time sequence image data and the text data for the multi-modal emotion recognition model.
Example four
Based on the foregoing embodiment, parts in this embodiment that are the same as those in the foregoing embodiment are not repeated, and this embodiment provides a multimodal emotion recognition method, which specifically includes:
a training stage:
one, data input layer
The data input consists of video data and text conversation data collected in real time during the chat. The video data are captured in real time by the virtual human's camera, and a sequence of face images of the chat partner is then extracted by an algorithm. The text data are obtained by collecting the chat speech in real time; the chat data take the form of single sentences and are converted into text through speech recognition.
Video data input layer:
During video data acquisition, in order to speed up the whole chat pipeline, the input video is de-duplicated: highly repetitive frames, abnormal frames and frames without a face are deleted, which reduces the computation of the subsequent steps. The video de-duplication algorithm mainly comprises the following steps:
Background modeling is performed with the ViBe algorithm, and a binarized grayscale contour map relative to the static background is extracted. Corresponding morphological operations and a bitwise AND operation are applied between the obtained contour map and the original image; the background is removed and the foreground picture is kept. After the foreground pictures in the video stream are extracted, the similarity between each picture and the preceding frames is computed with a perceptual hash algorithm; if the similarity exceeds a threshold, the picture is deleted. Face detection is then performed on the remaining pictures to obtain face pictures, while the other parts of each picture are retained. Finally, the resulting picture set is stored in the time order of the video.
Text data input layer:
1. During the chat, each sentence spoken by the user is captured, and the system waits for the user to finish a round of conversation so as to obtain a complete single round, since a single round may contain several sentences.
2. The acquired speech data are translated into text in real time through a speech recognition interface.
3. The text obtained in step 2 is aligned with the picture information produced by the video data input layer, and both are stored as a specific data structure (a sketch of one possible structure follows).
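The specific data structure is not defined in the description; one plausible minimal sketch of an aligned single-round sample (the field names are assumptions) is:

```python
from dataclasses import dataclass, field
from typing import List
import numpy as np

@dataclass
class AlignedSample:
    """One single round of dialogue: recognized text aligned with its de-duplicated face frames."""
    text: str                                                     # speech-recognition result
    face_frames: List[np.ndarray] = field(default_factory=list)   # face pictures in video order
    start_time: float = 0.0                                       # round start time, used for alignment
```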
Second, data characterization layer
The data characterization layer converts the time sequence images and texts processed by the data input layer into feature vectors of fixed dimensions, providing aligned video and text data formats and high-dimensional data for the model computation.
Video data characterization layer:
1. The time sequence image data obtained from the video data input layer are fed into a 12-layer transformer feature extractor to extract features with temporal characteristics.
2. The transformer model of step 1 can be divided, from bottom to top along the flow from image data input to feature output, into an image padding layer, a feature deformation layer, a parameter initialization layer, a random discarding layer, a pixel-point characterization layer, a position characterization layer and a linear transformation layer.
3. These modules are combined as follows:
A. First, two matrices M and N are randomly initialized with a normal distribution, where M has the shape (image width × image height × number of channels) × 512 and is used to represent each pixel, and N has the shape (image width × number of channels × image height) × 512.
B. The initialized matrix M is replaced with a pre-trained matrix M' that carries mutual semantic relations; N' is obtained in the same way.
C. The processed picture data of a single-round dialogue are read; the pixel values are distributed between 0 and 255 and are projected channel-wise into the range (0, image width × image height × 2), giving an input data format of shape N × image width × image height × 2.
D. The width and the number of channels of this batch of image data are deformed by the feature deformation layer, giving a data format of shape N × image height × (image width × number of channels).
E. Each pixel is looked up in the semantic matrix M' and converted into a 512-dimensional vector, giving a matrix K.
F. Pictures of different sizes within the same batch pass through the padding layer: pictures whose width and height are smaller than the maximum width and height of the batch are zero-padded on the four sides (top, bottom, left and right), so that all pictures have a uniform size after padding.
G. Each pixel position is looked up in the semantic matrix N' and represented as a 512-dimensional vector, giving a matrix P.
H. The matrices K and P are added point-to-point to obtain a comprehensive semantic vector Q.
I. A random layer-dropping method is applied to the semantic vector Q: with a certain probability, entire layers of Q are set to zero, yielding a vector Q' that carries temporal, pixel-position and pixel-semantic correlations.
J. The semantic matrix Q' is fed into a pre-trained 4-layer transformer to extract features, giving a higher-order semantic feature vector O whose shape is batch_size × (image width × image height × number of channels) × 512, i.e. the image matrix O'.
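A condensed PyTorch sketch of steps A to J is given below. The embedding tables stand in for the pre-trained semantic matrices M' and N', and the encoder depth, table sizes and dropout rate are assumptions; padding and batching details are omitted.

```python
import torch
import torch.nn as nn

class ImageCharacterizer(nn.Module):
    """Embed each projected pixel value and its position, add them, and encode with a transformer."""
    def __init__(self, num_pixel_ids, num_positions, dim=512, layers=4, heads=8, drop=0.1):
        super().__init__()
        self.pixel_emb = nn.Embedding(num_pixel_ids, dim)   # stands in for the semantic matrix M'
        self.pos_emb = nn.Embedding(num_positions, dim)     # stands in for the position matrix N'
        self.drop = nn.Dropout(drop)                        # random dropping, as in step I
        enc_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=layers)

    def forward(self, pixel_ids):
        # pixel_ids: [B, L] integer ids after the channel-wise projection of pixel values (step C)
        positions = torch.arange(pixel_ids.size(1), device=pixel_ids.device).unsqueeze(0)
        q = self.pixel_emb(pixel_ids) + self.pos_emb(positions)   # point-to-point addition K + P
        q = self.drop(q)                                          # vector Q' with random zeroing
        return self.encoder(q)                                    # higher-order semantic features O
```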
Text data characterization layer:
1. The text data use a pre-trained BERT model, fine-tuned on chat data to strengthen the semantic relevance for chat.
2. The fine-tuned BERT language model is loaded; the pre-trained BERT model consists of 12 transformer layers.
3. The text is segmented into words, giving a segmented sentence S.
4. For each word of the sentence S, its word vector, its position information and the information about which sentence it belongs to are added together, giving a representation W of shape number of sentences × number of words per sentence × 768.
5. W is fed into the BERT model to extract features, giving a representation W' of the whole sentence with shape batch_size × sentence length × 768.
Third, model definition
A model M is defined, consisting of a feature extraction layer composed of 6 transformer layers and 4 feature transformation layers; the output features are combined and passed through a linear transformer, the features are strengthened and randomized through a residual connection and dropout, and finally a softmax classifier predicts 21 emotion classes. The structure is shown in figure 4.
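One way the model M could be assembled in PyTorch is sketched below. This is illustrative only: it keeps the two 6-layer transformer extractor stacks, the feature splicing, the residual connection with dropout and the 21-class classifier, while the cross-modal attention inside each stack, the 4 feature transformation layers and the pooling step are simplified or assumed.

```python
import torch
import torch.nn as nn

class MultiModalEmotionModel(nn.Module):
    """Skeleton of model M: two 6-layer extractors, feature splicing, residual + dropout, classifier."""
    def __init__(self, dim=400, num_classes=21, heads=8, drop=0.1):
        super().__init__()
        def stack():
            layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
            return nn.TransformerEncoder(layer, num_layers=6)
        self.image_core = stack()                 # T1: image-centered extractor (simplified)
        self.text_core = stack()                  # T2: text-centered extractor (simplified)
        self.fuse = nn.Linear(2 * dim, 2 * dim)   # linear transformer over the spliced features
        self.drop = nn.Dropout(drop)
        self.classifier = nn.Linear(2 * dim, num_classes)   # softmax over 21 emotion classes

    def forward(self, img_tokens, txt_tokens):
        img_feat = self.image_core(img_tokens).mean(dim=1)   # pooled image-centered feature (assumed pooling)
        txt_feat = self.text_core(txt_tokens).mean(dim=1)    # pooled text-centered feature
        x = torch.cat([img_feat, txt_feat], dim=-1)          # feature splicing
        x = x + self.drop(self.fuse(x))                      # residual connection + dropout
        return self.classifier(x)                            # logits; softmax/loss are applied outside
```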
Fourthly, a model calculation layer
The model calculation layer comprises convolution layers, linear transformation layers, random discarding (dropout) layers, transformer feature coding layers, a cross-modal attention mechanism layer, residual layers, a loss function, an optimizer, layer normalization and gradient truncation.
1. The text semantic vector obtained from the data characterization layer is convolved with a defined text convolution layer and the features are compressed along the channel dimension, giving a text feature vector S of shape number of sentences × sentence length × 100.
2. The image semantic time sequence vector obtained from the data characterization layer is convolved with a defined image convolution layer and the features are compressed along the channel dimension, giving an image feature vector I of shape number of sentences × (image width × image height × number of channels) × 400.
3. The image semantic features I are further extracted by an image feature extractor T1 composed of 6 layers of transformers, and each layer of transformers can be expressed as follows:
A. the image feature vector I is normalized and the features are multiplied by a smaller coefficient.
B. And carrying out position coding on the characteristic vector I, determining the position of the image by the width, the height and the channel number of the image, and carrying out point-to-point addition on the obtained position coding vector and the characteristic vector I.
C. The vector obtained in step B is randomly set to zero with a certain probability, yielding I' with enhanced feature robustness.
D. To obtain bimodal semantic information, the textual information needs to be merged into a T1 feature extractor. Therefore, the text feature vector S is processed by the steps A-C to obtain S'.
E. And respectively adopting corresponding layer normalization to operate the text characteristic vector and the image characteristic vector.
F. And (4) carrying out interaction on the image features and the text features by using a cross-modal attention mechanism layer to obtain a cross-modal semantic feature vector. The cross-modal attention mechanism layer is expressed as follows:
a. The text features S' are copied twice and each copy is linearly transformed, giving the K and V matrices; meanwhile the I' matrix is linearly transformed to give the Q matrix.
b. The shapes of the Q, K and V matrices are changed, giving Q' the shape [W, B×H, D], where W is the image width × image height, B is the batch size, H is the number of attention heads, D is the feature dimension, and H × D = 400.
c. And carrying out matrix multiplication on the Q matrix and the K matrix to obtain a weight matrix W.
d. And converting the W matrix into probability by adopting softmax, setting elements on the W matrix to be zero by adopting a certain probability for the probability matrix to obtain a W' matrix, and increasing the randomness of the characteristics.
e. And multiplying the W 'weight matrix by V to obtain a V' matrix.
f. The V' matrix is transformed with a linear transformer and normalized, giving the final bimodal feature of the image after the attention mechanism is applied.
4. The text features S are further processed by another text feature extractor T2 composed of 6 transformer layers, which obtains semantic features centered on the text and incorporating the pictures; each transformer layer can be expressed as follows:
A. and normalizing the text semantic feature vector S, and multiplying the feature by a smaller coefficient.
B. And carrying out position coding on the characteristic vector S, and carrying out point-to-point addition on the obtained position coding vector and the characteristic vector S.
C. The vector obtained in step B is randomly set to zero with a certain probability, yielding S' with enhanced feature robustness.
G. To obtain bimodal semantic information, the image information needs to be merged into a T2 feature extractor. Therefore, the image feature vector I is processed by the steps A-C to obtain I'.
H. And respectively adopting corresponding layer normalization to operate the text characteristic vector and the image characteristic vector.
I. And with the text features as the center, carrying out interaction on the image features by using a cross-modal attention mechanism layer to obtain cross-modal semantic feature vectors. The cross-modal attention mechanism layer is expressed as follows:
a. and copying 2 the image characteristics I ', respectively carrying out linear transformation to obtain K and V matrixes, and simultaneously carrying out linear transformation on the S' matrix to obtain a Q matrix.
b. And respectively changing the shape of the Q, K and V matrixes to obtain the shape [ W, B H and D ] of Q', wherein W is the width and the height of the image 3, B is the amount of put data (batch size), H is the number of multiple heads, and D is the characteristic dimension.
c. And carrying out matrix multiplication on the Q matrix and the K matrix to obtain a weight matrix W.
d. And converting the W matrix into probability by adopting softmax, setting elements on the W matrix to be zero by adopting a certain probability for the probability matrix to obtain a W' matrix, and increasing the randomness of the characteristics.
e. And multiplying the W 'weight matrix by V to obtain a V' matrix.
f. The V' matrix is transformed with a linear transformer and normalized, giving the final bimodal feature of the text after the attention mechanism is applied.
5. The obtained bimodal features, with the image and the text as their respective cores, are randomly shuffled and spliced together.
6. The features obtained in step 5 are fed into a softmax classifier for classification and loss calculation (a minimal training-step sketch follows).
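For completeness, a training-step sketch for the classification and loss computation of step 6 is given below, assuming a model with the interface sketched in the model-definition section; the Adam optimizer, learning rate and clipping threshold are assumptions (the description only mentions an optimizer and gradient truncation).

```python
import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()   # combines softmax with the negative log-likelihood loss

def train_step(model, optimizer, img_tokens, txt_tokens, labels):
    """One optimization step over a batch of aligned image/text tokens and emotion labels."""
    optimizer.zero_grad()
    logits = model(img_tokens, txt_tokens)              # spliced bimodal features -> 21-class logits
    loss = criterion(logits, labels)                    # classification and loss calculation
    loss.backward()
    nn.utils.clip_grad_norm_(model.parameters(), 1.0)   # gradient truncation (threshold assumed)
    optimizer.step()
    return loss.item()

# usage (illustrative): optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
```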
Fifth, model application
1. The conversation scene video is captured in real time by the virtual human's camera, and the system detects whether a face is present in the current conversation and in what condition. If there are several faces, the face of the current speaker is determined by comparing adjacent video frames; the speaking face is detected and segmented, giving the currently active face.
2. While the face is being segmented, the speech of the currently speaking user is obtained through the speech recognition interface; within a segment of speech, face pictures with a high degree of repetition are removed and the distinctive pictures are kept; the text is then aligned with the retained pictures and packaged into a data structure.
3. A single bi-LSTM layer is used to extract features from the de-duplicated pictures within a sentence, giving an image representation vector of fixed shape and dimension (a sketch is given after this list).
4. The pre-trained BERT is used to extract features of the input text, giving a representation of each token, and the first token [CLS] is taken as the representation of the whole sentence.
5. Because the original BERT takes two sentences as input while emotion classification generally works on a single sentence, the input of step 4 and the original BERT are modified accordingly: the second sentence and its corresponding token representations are removed.
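The single bi-LSTM layer of step 3 can be sketched as follows; the per-frame feature dimension and hidden size are assumptions, and any frame-level feature extractor could feed it.

```python
import torch
import torch.nn as nn

class FrameSequenceEncoder(nn.Module):
    """Single-layer bi-LSTM over per-frame features -> fixed-dimension image representation vector."""
    def __init__(self, frame_dim=512, hidden=256):
        super().__init__()
        self.bilstm = nn.LSTM(frame_dim, hidden, num_layers=1,
                              batch_first=True, bidirectional=True)

    def forward(self, frame_feats):
        # frame_feats: [B, T, frame_dim], T = number of de-duplicated frames kept for the sentence
        _, (h_n, _) = self.bilstm(frame_feats)
        # concatenate the last forward and backward hidden states -> [B, 2 * hidden]
        return torch.cat([h_n[0], h_n[1]], dim=-1)
```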
Example five
As shown in FIG. 3, the present invention provides a multi-modal emotion recognition apparatus, including:
the duplication elimination module 301 is configured to eliminate duplication of video data of an object to be recognized, and acquire face time sequence image data of the object to be recognized.
The obtaining module 302 is configured to obtain text data of the object to be recognized in real time when the video data of the object to be recognized is obtained.
The recognition module 303 is configured to input the aligned face time sequence image data and the aligned text data into a multi-modal emotion recognition model, so as to perform multi-modal emotion recognition on the object to be recognized.
In this embodiment, the conversation scene video is captured in real time by the virtual human's camera, and the system detects whether a face is present in the current conversation and in what condition. If there are several faces, the face of the current speaker is determined by comparing adjacent video frames; the speaking face is detected and segmented, giving the currently active face. While the face is being segmented, the speech of the currently speaking user is obtained through the speech recognition interface; within a segment of speech, face pictures with a high degree of repetition are removed and the distinctive pictures are kept; the text is then aligned with the retained pictures and packaged into a data structure.
At the same time, a model M is defined, consisting of a feature extraction layer composed of 6 transformer layers and 4 feature transformation layers; the output features are combined and passed through a linear transformer, strengthened and randomized through a residual connection and dropout, and finally a softmax classifier predicts 21 emotion classes.
A single bi-LSTM layer is used to extract features from the de-duplicated pictures within a sentence, giving an image representation vector of fixed shape and dimension. The pre-trained BERT is used to extract features of the input text, giving a representation of each token, and the first token [CLS] is taken as the representation of the whole sentence. Because the original BERT takes two sentences as input while emotion classification generally works on a single sentence, the input of step 4 and the original BERT are modified accordingly: the second sentence and its corresponding token representations are removed.
The multi-modal emotion recognition device can fuse video expression information and conversation text content, so that the obtained original information has more dimensionality and is more real, and noise interference caused by incorrect voice recognition is reduced.
On the other hand, as shown in fig. 5, the present invention provides an electronic device 100, which includes a processor 110, a memory 120, wherein the memory 120 is used for storing a computer program 121; the processor 110 is configured to execute the computer program 121 stored in the memory 120 to implement the method in the corresponding method embodiment.
The electronic device 100 may be a desktop computer, a notebook computer, a palm computer, a tablet computer, a mobile phone, a human-computer interaction screen, or the like. The electronic device 100 may include, but is not limited to, a processor 110, a memory 120. Those skilled in the art will appreciate that fig. 5 is merely an example of the electronic device 100, does not constitute a limitation of the electronic device 100, and may include more or fewer components than illustrated, or some components in combination, or different components, for example: electronic device 100 may also include input/output interfaces, display devices, network access devices, communication buses, communication interfaces, and the like. A communication interface and a communication bus, and may further include an input/output interface, wherein the processor 110, the memory 120, the input/output interface and the communication interface complete communication with each other through the communication bus. The memory 120 stores a computer program 121, and the processor 110 is configured to execute the computer program 121 stored in the memory 120 to implement the method in the corresponding method embodiment.
The Processor 110 may be a Central Processing Unit (CPU), other general purpose Processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other Programmable logic device, discrete Gate or transistor logic device, discrete hardware component, or the like. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
The memory 120 may be an internal storage unit of the electronic device 100, for example: a hard disk or a memory of the electronic device. The memory may also be an external storage device of the electronic device, for example: a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, a Flash memory card (Flash Card) and the like provided on the electronic device. Further, the memory 120 may also include both an internal storage unit and an external storage device of the electronic device 100. The memory 120 is used for storing the computer program 121 and other programs and data required by the electronic device 100. The memory may also be used to temporarily store data that has been output or is to be output.
A communication bus is a circuit that connects the described elements and enables transmission between the elements. Illustratively, the processor 110 receives commands from other elements through the communication bus, decrypts the received commands, and performs calculations or data processing according to the decrypted commands. Memory 120 may include program modules, illustratively, a kernel (kernel), middleware (middleware), an Application Programming Interface (API), and applications. The program modules may be comprised of software, firmware or hardware, or at least two of the same. The input/output interface forwards commands or data input by a user via the input/output interface (e.g., sensor, keypad, touch screen). The communication interface connects the electronic device 100 with other network devices, user equipment, networks. For example, the communication interface may be connected to the network by wire or wirelessly to connect to other external network devices or user devices. The wireless communication may include at least one of: wireless fidelity (WiFi), Bluetooth (BT), Near Field Communication (NFC), Global Positioning Satellite (GPS) and cellular communications, among others. The wired communication may include at least one of: universal Serial Bus (USB), high-definition multimedia interface (HDMI), asynchronous transfer standard interface (RS-232), and the like. The network may be a telecommunications network and a communications network. The communication network may be a computer network, the internet of things, a telephone network. The electronic device 100 may be connected to the network through a communication interface, and a protocol by which the electronic device 100 communicates with other network devices may be supported by at least one of an application, an Application Programming Interface (API), middleware, a kernel, and a communication interface.
In another aspect, the present invention provides a storage medium in which at least one instruction is stored; the instruction is loaded and executed by a processor to implement the operations performed in the corresponding method embodiments described above. The storage medium may be, for example, a read-only memory (ROM), a random access memory (RAM), a compact disc read-only memory (CD-ROM), a magnetic tape, a floppy disk, an optical data storage device, and the like.
The steps may be implemented in program code executable by a computing device, so that they are executed by the computing device; alternatively, they may be implemented separately as individual integrated circuit modules, or a plurality of the steps may be integrated into a single integrated circuit module. Thus, the present invention is not limited to any specific combination of hardware and software.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus/device and method may be implemented in other ways. The apparatus/device embodiments described above are merely exemplary; the division into modules or units is only a logical division, and other divisions are possible in actual implementation. For example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices or units, and may be electrical, mechanical or in another form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
If the integrated modules/units are implemented in the form of software functional units and sold or used as separate products, they may be stored in a medium. Based on such understanding, all or part of the flow of the method in the above embodiments may be implemented by a computer program instructing the relevant hardware; the computer program may be stored in a medium, and when the computer program is executed by a processor, the steps of the above method embodiments may be implemented. The computer program may be in source code form, object code form, an executable file, some intermediate form, or the like. The medium may include: any entity or device capable of carrying the computer program, a recording medium, a USB disk, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electrical carrier signal, a telecommunication signal, a software distribution medium, and the like. It should be noted that the content contained in the medium may be increased or decreased as appropriate according to the requirements of legislation and patent practice in the jurisdiction; for example, in some jurisdictions, in accordance with legislation and patent practice, the computer-readable medium does not include electrical carrier signals and telecommunication signals.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above division of program modules is illustrated; in practical applications, the above functions may be distributed among different program modules as needed, that is, the internal structure of the apparatus may be divided into different program units or modules to perform all or part of the above-described functions. The program modules in the embodiments may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one processing unit; the integrated unit may be implemented in the form of hardware or in the form of a software program unit. In addition, the specific names of the program modules are only used to distinguish them from one another and are not intended to limit the protection scope of the application.
In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and reference may be made to the related descriptions of other embodiments for parts that are not described or recited in detail in a certain embodiment.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
It should be noted that the above embodiments can be freely combined as necessary. The foregoing is only a preferred embodiment of the present invention; for those skilled in the art, various modifications and refinements can be made without departing from the principle of the present invention, and these modifications and refinements should also be regarded as falling within the protection scope of the present invention.

Claims (14)

1. A multi-modal emotion recognition method is characterized by comprising the following steps:
performing deduplication on video data of an object to be recognized, and acquiring face time sequence image data of the object to be recognized;
when the video data of the object to be recognized is obtained, the text data of the object to be recognized is obtained in real time;
and inputting the aligned face time sequence image data and the aligned text data into a multi-modal emotion recognition model so as to perform multi-modal emotion recognition on the object to be recognized.
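For illustration only and not as part of the claimed subject matter, the following minimal Python sketch shows how the three steps of claim 1 could be composed. The callables passed in (deduplicate, transcribe, model) are placeholders standing in for the deduplication, speech-to-text and multi-modal recognition components described in this disclosure; they are assumptions, not interfaces the disclosure defines.

    from typing import Callable, Sequence

    def recognize_emotion(video_frames: Sequence,   # raw frames of the object to be recognized
                          audio_chunks: Sequence,   # speech captured during the same session
                          deduplicate: Callable,    # step 1: deduplication -> face time sequence images
                          transcribe: Callable,     # step 2: real-time speech-to-text
                          model: Callable):         # step 3: multi-modal emotion recognition model
        face_images = deduplicate(video_frames)     # face time sequence image data
        text = transcribe(audio_chunks)             # text data acquired in real time
        return model(face_images, text)             # aligned inputs -> multi-modal emotion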
2. The multi-modal emotion recognition method of claim 1, wherein the step of obtaining the text data of the object to be recognized in real time while obtaining the video data of the object to be recognized specifically comprises the steps of:
acquiring voice data input by the object to be recognized in each round of conversation;
and translating the voice data into text data in real time through a voice recognition interface.
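As a hedged, non-limiting sketch of claim 2, the loop below captures voice data per dialogue round and hands it to a speech recognition interface for real-time transcription. The names record_round and asr_interface are hypothetical callables introduced for illustration only.

    def collect_text_per_round(rounds, record_round, asr_interface):
        texts = []
        for round_id in rounds:
            audio = record_round(round_id)        # voice data input in this round of conversation
            texts.append(asr_interface(audio))    # translate the voice data into text in real time
        return texts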
3. The multi-modal emotion recognition method of claim 1, wherein the step of inputting the aligned face time series image data and the aligned text data into a multi-modal emotion recognition model for multi-modal emotion recognition of the object to be recognized specifically comprises the steps of:
extracting a first bimodal feature taking an image as a core and a second bimodal feature taking a text as a core by utilizing the multi-modal emotion recognition model;
performing feature splicing on the first bimodal feature and the second bimodal feature to obtain a target feature;
inputting the target feature into a softmax classifier of the multi-modal emotion recognition model for classification and loss calculation, so as to obtain the multi-modal emotion of the object to be recognized.
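Purely as an illustrative sketch of the fusion-and-classification step of claim 3 (PyTorch is assumed, and the feature dimension of 128 and the seven emotion classes are placeholders rather than values fixed by the disclosure):

    import torch
    import torch.nn as nn

    class FusionClassifier(nn.Module):
        def __init__(self, feat_dim=128, num_emotions=7):
            super().__init__()
            self.fc = nn.Linear(2 * feat_dim, num_emotions)
            self.loss_fn = nn.CrossEntropyLoss()               # loss calculation on the logits

        def forward(self, image_core_feat, text_core_feat, labels=None):
            # feature splicing of the two bimodal features into the target feature
            target = torch.cat([image_core_feat, text_core_feat], dim=-1)
            logits = self.fc(target)
            probs = torch.softmax(logits, dim=-1)              # softmax classifier output
            loss = self.loss_fn(logits, labels) if labels is not None else None
            return probs, loss

Concatenation keeps both bimodal features intact and lets the single linear layer learn how to weigh them against each other.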
4. The method according to claim 3, wherein the extracting of the first bimodal feature with the image as the core by using the multi-modal emotion recognition model comprises:
convolving the image semantic time sequence vector in the face time sequence image data by adopting a defined image convolution layer to obtain image time sequence features;
compressing the image time sequence features on a channel to obtain an image feature vector;
respectively normalizing the text feature vector obtained based on the text data and the image feature vector;
and interacting the image feature vector and the text feature vector through a cross-modal attention mechanism layer of the multi-modal emotion recognition model to obtain the first bimodal feature.
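A minimal sketch of the image branch of claim 4 is given below (the text branch of claim 7 is symmetric). A 1-D convolution over the image semantic time sequence vectors plays the role of the defined image convolution layer, and a 1x1 convolution stands in for the channel compression; the kernel size and channel counts are assumptions, not values fixed by the disclosure.

    import torch
    import torch.nn as nn

    class ImageBranch(nn.Module):
        def __init__(self, in_dim=512, hidden=256, out_dim=128):
            super().__init__()
            self.conv = nn.Conv1d(in_dim, hidden, kernel_size=3, padding=1)  # defined image convolution layer
            self.squeeze = nn.Conv1d(hidden, out_dim, kernel_size=1)         # compression on the channel dimension
            self.norm = nn.LayerNorm(out_dim)

        def forward(self, image_semantic_seq):                 # (batch, time, in_dim)
            x = image_semantic_seq.transpose(1, 2)             # Conv1d expects (batch, channels, time)
            x = torch.relu(self.conv(x))                       # image time sequence features
            x = self.squeeze(x).transpose(1, 2)                # image feature vector, (batch, time, out_dim)
            return self.norm(x)                                # normalization before cross-modal attention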
5. The multi-modal emotion recognition method of claim 4, further comprising, before said normalizing the text feature vector obtained based on the text data and the image feature vector, respectively:
normalizing the image feature vector and multiplying the normalized image feature vector by a preset coefficient;
carrying out position coding on the image feature vector, carrying out point-to-point addition of the obtained position coding vector and the image feature vector, and randomly setting elements of the result to zero according to a preset probability, so as to obtain an initial image feature matrix;
wherein the initial image feature matrix is used to duplicate the text feature vector.
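The sketch below illustrates the preparation step of claim 5: scaling the normalized features by a preset coefficient, point-to-point addition of a position encoding, and random zeroing with a preset probability (dropout). The sinusoidal encoding and the sqrt(d) coefficient are common Transformer-style choices assumed here for concreteness, and an even feature dimension is assumed; none of these choices is mandated by the disclosure.

    import math
    import torch
    import torch.nn.functional as F

    def prepare_initial_matrix(feats, drop_prob=0.1):
        batch, seq_len, dim = feats.shape
        scaled = feats * math.sqrt(dim)                        # multiply by a preset coefficient

        position = torch.arange(seq_len).unsqueeze(1)          # (seq_len, 1)
        div = torch.exp(torch.arange(0, dim, 2) * (-math.log(10000.0) / dim))
        pe = torch.zeros(seq_len, dim)
        pe[:, 0::2] = torch.sin(position * div)                # position coding vector
        pe[:, 1::2] = torch.cos(position * div)

        encoded = scaled + pe.unsqueeze(0)                     # point-to-point addition
        return F.dropout(encoded, p=drop_prob)                 # random zeroing with a preset probability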
6. The method according to claim 5, wherein the interacting the image feature vector and the text feature vector through a cross-modal attention mechanism layer of the multi-modal emotion recognition model to obtain the first bimodal feature comprises:
after copying the text feature vector, respectively carrying out linear transformation on the text feature vector and the image feature vector to obtain a first text matrix and a first image matrix, and carrying out linear transformation on the initial image feature matrix to obtain a current image feature matrix;
respectively changing the shapes of the first text matrix, the first image matrix and the current image feature matrix to obtain the current image feature matrix with the changed shape;
carrying out matrix multiplication on the current image feature matrix with the changed shape and the first text matrix to obtain a first weight matrix;
converting the first weight matrix into a probability matrix, and setting elements on the probability matrix to be zero to obtain a second weight matrix;
multiplying the second weight matrix with the first image matrix to obtain a first bimodal matrix; and transforming the first bimodal matrix by using a linear transformation layer, and normalizing to obtain the first bimodal feature.
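A hedged sketch of the image-cored cross-modal attention layer of claim 6 follows (the text-cored layer of claim 9 is the same with the roles of the two modalities swapped). It keeps the claimed order of operations: linear transformations, reshaping into heads, matrix multiplication to obtain the weight matrix, conversion to a probability matrix whose elements are zeroed (implemented here as dropout, an assumption), multiplication with the image matrix, and a final linear transformation plus normalization. The head count, the sqrt(d_k) scaling and the dimensions are likewise assumptions.

    import torch
    import torch.nn as nn

    class CrossModalAttention(nn.Module):
        def __init__(self, dim=128, heads=4, drop_prob=0.1):
            super().__init__()
            self.heads, self.dk = heads, dim // heads
            self.to_text = nn.Linear(dim, dim)     # -> first text matrix (from the copied text feature vector)
            self.to_image = nn.Linear(dim, dim)    # -> first image matrix
            self.to_query = nn.Linear(dim, dim)    # -> current image feature matrix (from the initial image feature matrix)
            self.out = nn.Linear(dim, dim)
            self.drop = nn.Dropout(drop_prob)
            self.norm = nn.LayerNorm(dim)

        def _reshape(self, x):                     # change shape: (batch, time, dim) -> (batch, heads, time, dk)
            b, t, _ = x.shape
            return x.view(b, t, self.heads, self.dk).transpose(1, 2)

        def forward(self, init_image_matrix, text_feats, image_feats):
            q = self._reshape(self.to_query(init_image_matrix))
            k = self._reshape(self.to_text(text_feats))
            v = self._reshape(self.to_image(image_feats))
            weights = q @ k.transpose(-2, -1) / self.dk ** 0.5   # first weight matrix
            probs = self.drop(torch.softmax(weights, dim=-1))    # probability matrix -> second weight matrix
            bimodal = (probs @ v).transpose(1, 2).flatten(2)     # first bimodal matrix
            return self.norm(self.out(bimodal))                  # linear transformation and normalization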
7. The multi-modal emotion recognition method of claim 4, wherein the step of extracting a second text-centered bimodal feature using the multi-modal emotion recognition model comprises the steps of:
convolving the text semantic vectors in the text data by adopting a defined text convolution layer to obtain text features;
compressing the text features on a channel to obtain text feature vectors;
respectively normalizing the text feature vector and the image feature vector;
and interacting the image feature vector and the text feature vector through a cross-modal attention mechanism layer of the multi-modal emotion recognition model to obtain the second bimodal feature.
8. The multi-modal emotion recognition method of claim 4, further comprising, prior to said normalizing said text feature vector and said image feature vector, respectively, the steps of:
normalizing the text feature vector and multiplying the normalized text feature vector by a preset coefficient;
carrying out position coding on the text feature vector, carrying out point-to-point addition of the obtained position coding vector and the text feature vector, and randomly setting elements of the result to zero according to a preset probability, so as to obtain an initial text feature matrix;
wherein the initial text feature matrix is used to duplicate the image feature vector.
9. The method according to claim 8, wherein the interacting the image feature vector and the text feature vector through a cross-modal attention mechanism layer of the multi-modal emotion recognition model to obtain the second bimodal feature comprises:
after copying the image feature vector, respectively carrying out linear transformation on the text feature vector and the image feature vector to obtain a second text matrix and a second image matrix, and carrying out linear transformation on the initial text feature matrix to obtain a current text feature matrix;
respectively changing the shapes of the second text matrix, the second image matrix and the current text feature matrix to obtain the current text feature matrix with the changed shape;
carrying out matrix multiplication on the current text feature matrix with the changed shape and the second image matrix to obtain a third weight matrix;
converting the third weight matrix into a probability matrix, and setting elements on the probability matrix to be zero to obtain a fourth weight matrix;
multiplying the fourth weight matrix with the second text matrix to obtain a second bimodal matrix;
and transforming the second bimodal matrix by using a linear transformation layer, and normalizing to obtain the second bimodal feature.
10. The multi-modal emotion recognition method according to any one of claims 1 to 9, wherein the step of performing deduplication on the video data of the object to be recognized to obtain the face time series image data of the object to be recognized specifically comprises the steps of:
performing background modeling by adopting the ViBe algorithm, and extracting a binary gray-scale contour map relative to a static background;
performing corresponding morphological operations and a bitwise AND operation on the binary gray-scale contour map and the corresponding original image, and removing the background to retain the foreground picture;
after the foreground pictures in the video data of the object to be recognized are extracted, calculating the similarity between the preceding and following frame pictures by adopting a perceptual hash algorithm, and deleting the duplicate picture when the similarity exceeds a preset threshold;
carrying out face detection on the obtained picture to obtain a face picture;
and storing the face picture into face time sequence image data according to the time sequence of the video data of the object to be recognized.
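To make the deduplication step of claim 10 concrete, here is a small sketch that uses a simple average hash as the perceptual hash and keeps a frame only when it differs sufficiently from the previously kept one. The 8x8 hash size and the similarity threshold are placeholder choices, and the ViBe background modelling, morphological processing and face detection stages are deliberately omitted.

    import numpy as np

    def average_hash(gray_image, hash_size=8):
        # crude nearest-neighbour downsampling to hash_size x hash_size
        h, w = gray_image.shape
        ys = np.arange(hash_size) * h // hash_size
        xs = np.arange(hash_size) * w // hash_size
        small = gray_image[np.ix_(ys, xs)]
        return (small > small.mean()).flatten()               # 64-bit boolean hash

    def deduplicate_frames(frames, threshold=0.9):
        kept, prev_hash = [], None
        for frame in frames:                                  # frames: 2-D grayscale arrays in time order
            cur_hash = average_hash(frame)
            if prev_hash is not None:
                similarity = np.mean(cur_hash == prev_hash)   # fraction of matching hash bits
                if similarity > threshold:                    # too similar to the previous frame
                    continue                                  # skip the near-duplicate picture
            kept.append(frame)
            prev_hash = cur_hash
        return kept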
11. The multi-modal emotion recognition method of claim 10, further comprising, after said acquiring the text data of the object to be recognized in real time while acquiring the video data of the object to be recognized, the steps of:
and converting the face time sequence image data and the text data into feature vectors with preset dimensions for alignment, so as to provide a data format and high-dimensional data for aligning the face time sequence image data and the text data for the multi-modal emotion recognition model.
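As a minimal illustration of the alignment described in claim 11, the helper below brings a sequence of equal-length feature vectors to a preset (length, dimension) shape by truncation and zero padding, so that the face time sequence image data and the text data share one data format. The preset sizes of 32 and 128 are placeholders, not values taken from the disclosure.

    import numpy as np

    def align_to_preset(seq, preset_len=32, preset_dim=128):
        arr = np.asarray(seq, dtype=np.float32)[:preset_len]            # truncate sequences that are too long
        out = np.zeros((preset_len, preset_dim), dtype=np.float32)
        out[:arr.shape[0], :min(arr.shape[1], preset_dim)] = arr[:, :preset_dim]
        return out                                                      # fixed (preset_len, preset_dim) matrix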
12. A multi-modal emotion recognition apparatus, comprising:
the deduplication module is used for performing deduplication on the video data of the object to be recognized and acquiring face time sequence image data of the object to be recognized;
the acquisition module is used for acquiring the text data of the object to be recognized in real time when the video data of the object to be recognized is acquired;
and the identification module is used for inputting the aligned face time sequence image data and the aligned text data into a multi-modal emotion identification model so as to perform multi-modal emotion identification on the object to be identified.
13. An electronic device, characterized in that the electronic device comprises:
a processor; and a memory storing computer executable instructions that, when executed, cause the processor to perform the multi-modal emotion recognition method as recited in any one of claims 1-11.
14. A storage medium having stored therein at least one instruction that is loaded and executed by a processor to perform operations performed by the multi-modal emotion recognition method as recited in any of claims 1-11.
CN202011262785.1A 2020-11-12 2020-11-12 Multi-modal emotion recognition method and device, electronic equipment and storage medium Pending CN112418034A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011262785.1A CN112418034A (en) 2020-11-12 2020-11-12 Multi-modal emotion recognition method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011262785.1A CN112418034A (en) 2020-11-12 2020-11-12 Multi-modal emotion recognition method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN112418034A true CN112418034A (en) 2021-02-26

Family

ID=74832162

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011262785.1A Pending CN112418034A (en) 2020-11-12 2020-11-12 Multi-modal emotion recognition method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112418034A (en)


Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107808146A (en) * 2017-11-17 2018-03-16 北京师范大学 A kind of multi-modal emotion recognition sorting technique
CN108985358A (en) * 2018-06-29 2018-12-11 北京百度网讯科技有限公司 Emotion identification method, apparatus, equipment and storage medium
WO2020135194A1 (en) * 2018-12-26 2020-07-02 深圳Tcl新技术有限公司 Emotion engine technology-based voice interaction method, smart terminal, and storage medium
CN110516696A (en) * 2019-07-12 2019-11-29 东南大学 It is a kind of that emotion identification method is merged based on the adaptive weighting bimodal of voice and expression
CN111164601A (en) * 2019-12-30 2020-05-15 深圳市优必选科技股份有限公司 Emotion recognition method, intelligent device and computer readable storage medium
CN111275085A (en) * 2020-01-15 2020-06-12 重庆邮电大学 Online short video multi-modal emotion recognition method based on attention fusion
CN111898670A (en) * 2020-07-24 2020-11-06 深圳市声希科技有限公司 Multi-mode emotion recognition method, device, equipment and storage medium

Cited By (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112862727A (en) * 2021-03-16 2021-05-28 上海壁仞智能科技有限公司 Cross-mode image conversion method and device
CN112700794A (en) * 2021-03-23 2021-04-23 北京达佳互联信息技术有限公司 Audio scene classification method and device, electronic equipment and storage medium
CN112700794B (en) * 2021-03-23 2021-06-22 北京达佳互联信息技术有限公司 Audio scene classification method and device, electronic equipment and storage medium
WO2022199504A1 (en) * 2021-03-26 2022-09-29 腾讯科技(深圳)有限公司 Content identification method and apparatus, computer device and storage medium
CN113535894A (en) * 2021-06-15 2021-10-22 杭州电子科技大学 Multi-modal ironic detection method based on condition fusion
CN113535894B (en) * 2021-06-15 2022-09-13 杭州电子科技大学 Multi-modal ironic detection method based on condition fusion
CN113490053A (en) * 2021-06-30 2021-10-08 北京奇艺世纪科技有限公司 Play amount prediction method, play amount prediction device, play amount prediction model, electronic equipment and storage medium
CN113536999A (en) * 2021-07-01 2021-10-22 汇纳科技股份有限公司 Character emotion recognition method, system, medium and electronic device
CN113569092A (en) * 2021-07-29 2021-10-29 北京达佳互联信息技术有限公司 Video classification method and device, electronic equipment and storage medium
CN113569092B (en) * 2021-07-29 2023-09-05 北京达佳互联信息技术有限公司 Video classification method and device, electronic equipment and storage medium
CN114898449B (en) * 2022-07-13 2022-10-04 电子科技大学成都学院 Foreign language teaching auxiliary method and device based on big data
CN114898449A (en) * 2022-07-13 2022-08-12 电子科技大学成都学院 Foreign language teaching auxiliary method and device based on big data
CN115114408A (en) * 2022-07-14 2022-09-27 平安科技(深圳)有限公司 Multi-modal emotion classification method, device, equipment and storage medium
CN115114408B (en) * 2022-07-14 2024-05-31 平安科技(深圳)有限公司 Multi-mode emotion classification method, device, equipment and storage medium
CN115186683A (en) * 2022-07-15 2022-10-14 哈尔滨工业大学 Cross-modal translation-based attribute-level multi-modal emotion classification method
CN115239937A (en) * 2022-09-23 2022-10-25 西南交通大学 Cross-modal emotion prediction method
CN115239937B (en) * 2022-09-23 2022-12-20 西南交通大学 Cross-modal emotion prediction method
CN116150383A (en) * 2023-04-21 2023-05-23 湖南工商大学 Rumor detection method and model based on cross-modal attention mechanism
CN117235244A (en) * 2023-11-16 2023-12-15 江西师范大学 Online course learning emotion experience evaluation system based on barrage emotion word classification
CN117235244B (en) * 2023-11-16 2024-02-20 江西师范大学 Online course learning emotion experience evaluation system based on barrage emotion word classification

Similar Documents

Publication Publication Date Title
CN112418034A (en) Multi-modal emotion recognition method and device, electronic equipment and storage medium
CN111695352A (en) Grading method and device based on semantic analysis, terminal equipment and storage medium
CN112233698B (en) Character emotion recognition method, device, terminal equipment and storage medium
EP3851997A1 (en) Method and device for processing information, and storage medium
KR102576344B1 (en) Method and apparatus for processing video, electronic device, medium and computer program
CN111696176A (en) Image processing method, image processing device, electronic equipment and computer readable medium
CN108804427B (en) Voice machine translation method and device
CN113094478B (en) Expression reply method, device, equipment and storage medium
WO2021175040A1 (en) Video processing method and related device
CN112614110B (en) Method and device for evaluating image quality and terminal equipment
CN111445898A (en) Language identification method and device, electronic equipment and storage medium
CN110619334A (en) Portrait segmentation method based on deep learning, architecture and related device
CN109885831B (en) Keyword extraction method, device, equipment and computer readable storage medium
CN112785669B (en) Virtual image synthesis method, device, equipment and storage medium
CN113902838A (en) Animation generation method, animation generation device, storage medium and electronic equipment
KR20180065762A (en) Method and apparatus for deep neural network compression based on manifold constraint condition
CN111898363B (en) Compression method, device, computer equipment and storage medium for long and difficult text sentence
CN110570877B (en) Sign language video generation method, electronic device and computer readable storage medium
CN115169368B (en) Machine reading understanding method and device based on multiple documents
CN115909176A (en) Video semantic segmentation method and device, electronic equipment and storage medium
CN115129877A (en) Method and device for generating punctuation mark prediction model and electronic equipment
CN113554719B (en) Image encoding method, decoding method, storage medium and terminal equipment
CN113010728A (en) Song recommendation method, system, intelligent device and storage medium
CN113469197A (en) Image-text matching method, device, equipment and storage medium
CN113299300A (en) Voice enhancement method, device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20211202

Address after: 200000 room 206, No. 2, Lane 389, Jinkang Road, China (Shanghai) pilot Free Trade Zone, Pudong New Area, Shanghai

Applicant after: Shanghai Yuanmeng Intelligent Technology Co.,Ltd.

Applicant after: Yuanmeng humanistic Intelligence International Co., Ltd

Address before: Room 2807, 28th floor, Bank of America centre, 12 Harcourt Road, central, Hong Kong, China

Applicant before: Yuanmeng Human Intelligence International Co.,Ltd.