CN112699774A - Method and device for recognizing emotion of person in video, computer equipment and medium


Info

Publication number
CN112699774A
CN112699774A (application CN202011577706.6A)
Authority
CN
China
Prior art keywords
emotion
video
images
face
frame
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011577706.6A
Other languages
Chinese (zh)
Other versions
CN112699774B (en)
Inventor
陈海波
罗志鹏
张治广
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenyan Technology Beijing Co ltd
Original Assignee
Shenyan Technology Beijing Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenyan Technology Beijing Co ltd filed Critical Shenyan Technology Beijing Co ltd
Priority to CN202011577706.6A
Publication of CN112699774A
Application granted
Publication of CN112699774B
Legal status: Active (current)
Anticipated expiration

Classifications

    • G06V 40/16 Human faces, e.g. facial parts, sketches or expressions
        • G06V 40/161 Detection; Localisation; Normalisation
        • G06V 40/168 Feature extraction; Face representation
        • G06V 40/172 Classification, e.g. identification
    • G06F 18/00 Pattern recognition
        • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
        • G06F 18/2411 Classification based on the proximity to a decision surface, e.g. support vector machines
        • G06F 18/253 Fusion techniques of extracted features
    • G06N 3/02 Neural networks
        • G06N 3/045 Combinations of networks
        • G06N 3/084 Backpropagation, e.g. using gradient descent
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00
        • G10L 25/30 Analysis technique using neural networks
        • G10L 25/63 Estimating an emotional state

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Human Computer Interaction (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Multimedia (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Molecular Biology (AREA)
  • Software Systems (AREA)
  • Signal Processing (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Acoustics & Sound (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Child & Adolescent Psychology (AREA)
  • Hospice & Palliative Care (AREA)
  • Psychiatry (AREA)
  • Image Analysis (AREA)

Abstract

The embodiment of the application discloses a method and a device for recognizing the emotion of a person in a video, a computer device and a medium. In one embodiment, the method comprises: acquiring, from a video to be recognized, an image containing a human face, N frames of images extracted from the video at predetermined time intervals, at least two frames of images obtained by dividing the video into M segments and then randomly sampling at least one frame of image from each segment, and the sound spectrograms respectively corresponding to these images; extracting a face emotion feature vector from the image containing the face, a frame emotion feature vector from the N frames of images, a video emotion feature vector from the at least two frames of images, and a voice emotion feature vector from the sound spectrograms; performing feature fusion on the four feature vectors to obtain a multi-mode information feature vector; and calling a character emotion recognition model obtained through pre-training to recognize the multi-mode information feature vector and obtain a recognition result. This embodiment can improve recognition accuracy.

Description

Method and device for recognizing emotion of person in video, computer equipment and medium
Technical Field
The present application relates to the field of computer technology, and more particularly to a method and apparatus for recognizing the emotion of a person in a video, a computer device, and a medium.
Background
Emotion recognition has become an important research subject, with great potential application value in fields such as psychology, intelligent robots, intelligent monitoring, virtual reality and synthetic animation. At present, emotion recognition is basically realized by recognizing the facial expression of a person in an image, thereby inferring the person's psychological state. The existing recognition approach performs facial expression recognition on a single static image containing a face, either on the whole image or on a face region segmented from it, and obtains the emotion category in one recognition pass. Because this approach can only process a single static image and relies on a single kind of feature, it suffers from low recognition accuracy and degrades the user experience.
Disclosure of Invention
The application aims to provide a method and a device for recognizing emotion of a person in a video, a computer device and a medium, so as to solve at least one of the problems in the prior art.
In order to achieve the purpose, the following technical scheme is adopted in the application:
a first aspect of the application provides a method for recognizing the emotion of a person in a video, which comprises the following steps:
acquiring an image containing a face in a video to be recognized, N frames of images extracted from the video to be recognized at predetermined time intervals, at least two frames of images obtained by dividing the video to be recognized into M segments and then randomly sampling at least one frame of image from each segment, and the sound spectrograms respectively corresponding to the image containing the face, the N frames of images and the at least two frames of images, wherein N is greater than 1 and M is greater than 1;
extracting face emotion feature vectors from the image containing the face, extracting frame emotion feature vectors from the N frames of images, extracting video emotion feature vectors from the at least two frames of images, and extracting voice emotion feature vectors from the sound spectrograms;
performing feature fusion on the face emotion feature vector, the frame emotion feature vector, the video emotion feature vector and the voice emotion feature vector to obtain a multi-mode information feature vector;
and calling a character emotion recognition model obtained by pre-training, and recognizing the multi-mode information characteristic vector to obtain a character emotion recognition result.
Optionally, the acquiring an image including a face and a sound spectrogram corresponding to the image including the face in the video to be recognized includes:
respectively detecting each frame of image of the video to be recognized by using a preset face detection model to obtain an image containing a face, and recording time information of the image containing the face;
intercepting audio data of corresponding time in the video to be identified according to the time information of the image containing the face;
and carrying out spectrum analysis on the audio data to obtain a sound spectrogram corresponding to the image containing the human face.
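For ease of understanding, a minimal Python sketch of this acquisition step is given below. The use of an OpenCV Haar cascade as the preset face detection model, SciPy for the spectrum analysis, and a separately extracted mono WAV audio track are illustrative assumptions and do not limit this embodiment.

```python
import cv2
import numpy as np
from scipy.io import wavfile
from scipy.signal import spectrogram

def collect_face_frames(video_path, cascade_path="haarcascade_frontalface_default.xml"):
    """Detect faces frame by frame and record the time information of each face-containing frame."""
    detector = cv2.CascadeClassifier(cascade_path)     # assumed stand-in for the preset face detection model
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 25.0
    face_frames, timestamps = [], []
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        if len(detector.detectMultiScale(gray, 1.1, 5)) > 0:
            face_frames.append(frame)
            timestamps.append(idx / fps)               # time information of the image containing the face
        idx += 1
    cap.release()
    return face_frames, timestamps

def sound_spectrogram(wav_path, center_t, window_s=1.0):
    """Intercept the audio around a timestamp and convert it into a sound spectrogram (2-D array)."""
    sr, audio = wavfile.read(wav_path)                 # audio track extracted from the video beforehand
    if audio.ndim > 1:
        audio = audio.mean(axis=1)                     # assume stereo can be folded down to mono
    lo = max(0, int((center_t - window_s / 2) * sr))
    hi = min(len(audio), int((center_t + window_s / 2) * sr))
    _, _, sxx = spectrogram(audio[lo:hi].astype(np.float32), fs=sr)
    return 10.0 * np.log10(sxx + 1e-10)                # log-power spectrogram usable as an image
```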
Optionally, the acquiring N frames of images extracted from a video to be identified at predetermined time intervals and a sound spectrogram corresponding to the N frames of images includes:
extracting N frames of images from a video to be identified at preset time intervals, and recording time information of each frame of image in the N frames of images;
intercepting audio data of corresponding time in the video to be identified according to the time information of each frame of image in the N frames of images;
and carrying out spectrum analysis on the audio data to obtain the sound spectrograms corresponding to the N frames of images.
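A corresponding sketch of the fixed-interval frame extraction is given below; OpenCV and millisecond-based seeking are illustrative assumptions, and the recorded time information is what the subsequent audio interception relies on.

```python
import cv2

def sample_fixed_interval(video_path, n_frames, interval_s):
    """Extract N frames at a predetermined time interval and record each frame's time information."""
    cap = cv2.VideoCapture(video_path)
    frames, times = [], []
    for i in range(n_frames):
        t = i * interval_s
        cap.set(cv2.CAP_PROP_POS_MSEC, t * 1000.0)   # seek to the i-th sampling instant
        ok, frame = cap.read()
        if not ok:
            break
        frames.append(frame)
        times.append(t)                              # later used to intercept the matching audio clip
    cap.release()
    return frames, times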
Optionally, the dividing the video to be identified into M segments, and then randomly sampling at least one frame of image from each segment to obtain at least two frames of images and sound spectrogram corresponding to the at least two frames of images includes:
dividing a video to be identified into M sections, randomly sampling at least one frame of image from each section to obtain at least two frames of images, and recording time information of each frame of image in the at least two frames of images;
intercepting audio data of corresponding time in the video to be identified according to the time information of each frame of image in at least two frames of images;
and carrying out spectrum analysis on the audio data to obtain the sound spectrograms corresponding to the at least two frames of images.
Optionally, the randomly sampling at least one frame of image from each segment includes:
and randomly sampling each segment L times to obtain L groups of at least two frames of images, wherein L is greater than 1.
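The segment-based random sampling, including the optional L sampling rounds, can be sketched as follows; OpenCV frame seeking and one frame per segment by default are illustrative assumptions.

```python
import random
import cv2

def segment_random_sample(video_path, m_segments, frames_per_segment=1, rounds=1):
    """Divide the video into M segments and randomly sample frames from each segment.
    Repeating the procedure `rounds` (L) times yields L independent groups of frames."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    bounds = [(s * total // m_segments, (s + 1) * total // m_segments) for s in range(m_segments)]
    groups = []
    for _ in range(rounds):
        group = []
        for lo, hi in bounds:
            picks = random.sample(range(lo, hi), min(frames_per_segment, hi - lo))
            for idx in sorted(picks):
                cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
                ok, frame = cap.read()
                if ok:
                    group.append(frame)
        groups.append(group)
    cap.release()
    return groups    # L groups, each with at least M frames when frames_per_segment >= 1
```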
Optionally, the extracting, from the image including the face, a face emotion feature vector includes:
inputting the image containing the human face into a human face feature extraction model obtained by pre-training for processing, wherein the human face feature extraction model comprises a first convolutional neural network, a first full-link layer, a first classifier and an image emotion feature fusion sub-model which are sequentially connected;
and the image emotion feature fusion sub-model outputs face emotion feature vectors according to the proportion of emotion classification of each face in the image containing the face.
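One plausible reading of this proportion rule is sketched below, in which the class-proportion histogram together with the dominant class stands in for the face emotion feature vector; the exact vector form is not fixed by this application, so this is an assumption.

```python
import numpy as np

def fuse_face_emotions(per_face_probs):
    """per_face_probs: array of shape (num_faces, num_classes) holding the first classifier's
    SoftMax output for every face in the image. Re-classify by the proportion of faces that
    fall into each emotion category and return that proportion vector plus the dominant class."""
    votes = np.argmax(per_face_probs, axis=1)                               # emotion class of each face
    proportions = np.bincount(votes, minlength=per_face_probs.shape[1]) / len(votes)
    dominant = int(np.argmax(proportions))                                  # e.g. 3 happy vs 2 neutral -> happy
    return proportions, dominant
```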
Optionally, the extracting frame emotion feature vectors from the N frame images includes:
inputting the N frames of images into a frame feature extraction model obtained by pre-training for processing, wherein the frame feature extraction model comprises a second convolutional neural network, a second full connection layer and a second classifier which are sequentially connected;
and determining the feature vector output by the second convolutional neural network as a frame emotion feature vector.
Optionally, the extracting of the sound emotion feature vector from the sound spectrogram includes:
inputting the sound frequency spectrogram into a sound feature extraction model obtained by pre-training for processing, wherein the sound feature extraction model comprises a third convolutional neural network, a third full connection layer and a third classifier which are sequentially connected;
and determining the feature vector output by the third convolutional neural network as a voice emotion feature vector.
Optionally, the extracting a video emotion feature vector from the at least two frames of images includes:
inputting the at least two frames of images into a video feature extraction model obtained by pre-training for processing, wherein the video feature extraction model comprises a fourth convolutional neural network, a fourth full connection layer and a fourth classifier which are sequentially connected;
and determining the feature vector output by the fourth convolutional neural network as a video emotion feature vector.
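The three structures above (the second, third and fourth models) share the same pattern of a convolutional neural network, a fully-connected layer and a SoftMax classifier, with the network's output taken as the emotion feature vector. A generic PyTorch sketch of this pattern is given below; the stand-in CNN and the feature dimension are illustrative.

```python
import torch
import torch.nn as nn

class EmotionFeatureExtractor(nn.Module):
    """Generic form of the frame/sound/video feature extraction models: a convolutional
    neural network, a fully-connected layer and a SoftMax classifier connected in sequence.
    The feature vector output by the convolutional neural network is kept as the emotion
    feature vector; the classifier head is mainly needed for training."""
    def __init__(self, cnn, feat_dim, num_classes=7):
        super().__init__()
        self.cnn = cnn
        self.fc = nn.Linear(feat_dim, num_classes)
    def forward(self, x):
        feats = self.cnn(x)                              # emotion feature vector (frame/sound/video)
        probs = torch.softmax(self.fc(feats), dim=1)     # SoftMax classification used during training
        return feats, probs

# Example with a stand-in CNN mapping an image to a 128-dimensional feature vector
dummy_cnn = nn.Sequential(nn.Conv2d(3, 32, 3, stride=2), nn.ReLU(),
                          nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, 128))
model = EmotionFeatureExtractor(dummy_cnn, feat_dim=128)
features, probabilities = model(torch.randn(4, 3, 224, 224))
```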
Optionally, performing feature fusion on the face emotion feature vector, the frame emotion feature vector, the video emotion feature vector, and the voice emotion feature vector to obtain a multi-mode information feature vector, including:
performing feature fusion on the face emotion feature vector, the frame emotion feature vector, the video emotion feature vector and the voice emotion feature vector;
performing dimensionality reduction on the feature vector after feature fusion;
and carrying out normalization processing on the feature vector obtained after the dimension reduction processing to obtain the multi-mode information feature vector of the four channels.
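The application does not fix the fusion, dimensionality-reduction or normalization operators; the sketch below assumes simple concatenation, PCA and L2 normalization purely for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import normalize

def fuse_multimodal(face_vecs, frame_vecs, video_vecs, sound_vecs, out_dim=256):
    """Concatenate the four emotion feature vectors of each sample, reduce the
    dimensionality, and L2-normalize to obtain multi-mode information feature vectors."""
    fused = np.concatenate([face_vecs, frame_vecs, video_vecs, sound_vecs], axis=1)  # four channels side by side
    pca = PCA(n_components=min(out_dim, fused.shape[0], fused.shape[1]))
    reduced = pca.fit_transform(fused)         # dimensionality reduction (assumed to be PCA)
    return normalize(reduced), pca             # normalization; keep pca to transform new samples
```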
Optionally, when each segment is randomly sampled L times, one of the resulting L video emotion feature vectors is randomly selected as the video emotion feature vector used for feature fusion.
Optionally, the character emotion recognition model is a support vector machine classifier.
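A sketch of such a support vector machine classifier acting on multi-mode information feature vectors is given below; the RBF kernel, the 256-dimensional feature size and the random placeholder data are illustrative assumptions.

```python
import numpy as np
from sklearn.svm import SVC

# Placeholder fused multi-mode feature vectors and emotion labels (seven categories), for illustration only
rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(200, 256)), rng.integers(0, 7, size=200)

svm = SVC(kernel="rbf", probability=True)    # the application only states that an SVM classifier is used
svm.fit(X_train, y_train)

x_new = rng.normal(size=(1, 256))            # multi-mode vector of a video to be recognized
emotion_id = int(svm.predict(x_new)[0])      # index into e.g. [angry, disgust, fear, happy, sad, surprise, neutral]
```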
Optionally, the method further comprises:
and inputting a training set comprising a training image of the facial expression and an emotion class label corresponding to the training image into the first convolutional neural network so as to train the facial feature extraction model.
Optionally, the method further comprises:
inputting a training set comprising N training images and emotion category labels corresponding to the N training images into the second convolutional neural network to train the frame feature extraction model.
Optionally, the method further comprises:
inputting a training set comprising a sound spectrum training image and emotion class labels corresponding to the sound spectrum training image into the third convolutional neural network to train the sound feature extraction model.
Optionally, the method further comprises:
inputting a training set comprising at least two training images and emotion class labels corresponding to the at least two training images into the fourth convolutional neural network to train the video feature extraction model.
A second aspect of the present application provides an apparatus for recognizing the emotion of a person in a video, comprising:
the multimode data acquisition module is used for acquiring an image containing a face in a video to be identified, N frames of images extracted from the video to be identified at preset time intervals, at least two frames of images obtained by randomly sampling at least one frame of image from each section after dividing the video to be identified into M sections, and sound frequency spectrogram images respectively corresponding to the image containing the face, the N frames of images and the at least two frames of images, wherein N is more than 1, and M is more than 1;
a multi-mode feature extraction module for extracting face emotion feature vectors from the image containing the face, extracting frame emotion feature vectors from the N frames of images, extracting video emotion feature vectors from the at least two frames of images, and extracting voice emotion feature vectors from the voice spectrogram;
the multimode characteristic fusion module is used for carrying out characteristic fusion on the face emotion characteristic vector, the frame emotion characteristic vector, the video emotion characteristic vector and the voice emotion characteristic vector to obtain multimode information characteristic vectors;
and the emotion recognition module is used for calling a character emotion recognition model obtained through pre-training, recognizing the multi-mode information characteristic vector and obtaining a character emotion recognition result.
Optionally, the multimode data obtaining module includes:
the first acquisition submodule is used for respectively detecting each frame of image of the video to be recognized by using a preset face detection model to obtain an image containing a face, and recording time information of the image containing the face; intercepting audio data of corresponding time in the video to be identified according to the time information of the image containing the face; carrying out spectrum analysis on the audio data to obtain a sound spectrogram corresponding to the image containing the human face; and/or
The second acquisition submodule is used for extracting N frames of images from the video to be identified at preset time intervals and recording the time information of each frame of image in the N frames of images; intercepting audio data of corresponding time in the video to be identified according to the time information of each frame of image in the N frames of images; carrying out spectrum analysis on the audio data to obtain a sound spectrogram corresponding to the image containing the human face; and/or
The third acquisition submodule is used for dividing the video to be identified into M sections, randomly sampling at least one frame of image from each section to obtain at least two frames of images, and recording the time information of each frame of image in the at least two frames of images; intercepting audio data of corresponding time in the video to be identified according to the time information of each frame of image in at least two frames of images; and carrying out spectrum analysis on the audio data to obtain a sound spectrogram corresponding to the image containing the human face.
Optionally, the multi-modal feature extraction module includes:
the face feature extraction submodule is used for inputting the image containing the face into a face feature extraction model for processing, wherein the face feature extraction model comprises a first convolutional neural network, a first full connection layer, a first classifier and an image emotion feature fusion submodel which are sequentially connected, and the image emotion feature fusion submodel is used for outputting face emotion feature vectors according to the proportion of emotion classification of each face in the image containing the face;
the frame feature extraction submodule is used for inputting the N frames of images into a frame feature extraction model obtained by pre-training for processing, wherein the frame feature extraction model comprises a second convolutional neural network, a second full-link layer and a second classifier which are sequentially connected, and the second convolutional neural network is used for receiving the N frames of images and outputting frame emotion feature vectors;
the sound feature extraction submodule is used for inputting the sound spectrogram into a sound feature extraction model obtained by pre-training for processing, wherein the sound feature extraction model comprises a third convolutional neural network, a third full connection layer and a third classifier which are sequentially connected, and the third convolutional neural network is used for receiving the sound spectrogram and outputting the sound emotion feature vector;
and the video feature extraction submodule is used for inputting the at least two frames of images into a video feature extraction model obtained by pre-training for processing, wherein the video feature extraction model comprises a fourth convolutional neural network, a fourth full connection layer and a fourth classifier which are sequentially connected, and the fourth convolutional neural network is used for receiving the at least two frames of images and outputting the video emotion feature vector.
A third aspect of the present application provides a computer device comprising a processor and a memory storing a program that, when executed by the processor, implements the method for emotion recognition of a person in a video provided by the first aspect of the present application.
A fourth aspect of the present application provides a computer-readable medium storing a program that, when executed, implements the method for recognizing emotion of a person in a video provided by the first aspect of the present application.
The beneficial effect of this application is as follows:
the scheme provided by the application integrates the image information extracted from the video, the image frame sequence information extracted in two different modes and having the time dimension and the imaged sound information, so that the emotion recognition of people in the video is performed based on the multi-dimensional characteristics, the higher recognition precision is realized, and the accuracy and the robustness of the emotion recognition of people in the video under the complex environment can be improved. Furthermore, by independently training the extraction model and the classification model for extracting the four feature vectors, the rule of the data can be learned quickly when the data set is not trained, that is, the whole model has strong generalization capability.
Drawings
The following describes embodiments of the present application in further detail with reference to the accompanying drawings.
FIG. 1 illustrates an exemplary system architecture diagram to which embodiments of the present application may be applied.
Fig. 2 shows a flowchart of a method for recognizing emotion of a person in a video according to an embodiment of the present application.
Fig. 3 shows a network structure diagram of the face feature extraction model.
Fig. 4 shows an example diagram of an image containing a plurality of faces.
Fig. 5 shows a network structure diagram of the frame feature extraction model.
Fig. 6 shows a network structure diagram of the acoustic feature extraction model.
Fig. 7 shows a network structure diagram of a video feature extraction model.
Fig. 8 is a schematic network structure diagram of an overall network model composed of the feature extraction model and the character emotion recognition model.
Fig. 9 shows a schematic diagram of an emotion recognition device for a person in a video according to an embodiment of the present application.
Fig. 10 is a schematic structural diagram of a computer system implementing the emotion recognition apparatus for a person in a video according to an embodiment of the present application.
Detailed Description
In order to more clearly explain the present application, the present application is further described below with reference to the embodiments and the accompanying drawings. Similar parts in the figures are denoted by the same reference numerals. It is to be understood by persons skilled in the art that the following detailed description is illustrative and not restrictive, and is not intended to limit the scope of the present application.
Existing emotion recognition is basically realized by recognizing the facial expression of a person in an image; specifically, facial expression recognition is performed either on the whole image containing a face or on a face image segmented from the whole image, and the emotion category is obtained in a single recognition pass. This recognition approach can only process a single static image and relies on a single kind of feature, so it suffers from low recognition accuracy and degrades the user experience.
In view of this, the embodiment of the present application provides a method for recognizing emotion of a person in a video, where the method includes two stages of model training and performing emotion recognition of the person on an input video by using a trained model.
Wherein,
the model training comprises the following steps:
and training by using the training sample to obtain a feature extraction model and a character emotion recognition model.
The performing of the emotion recognition of the person includes:
acquiring an image containing a face in a video to be recognized, N frames of images extracted from the video to be recognized at predetermined time intervals, at least two frames of images obtained by dividing the video to be recognized into M segments and then randomly sampling at least one frame of image from each segment, and the sound spectrograms respectively corresponding to the image containing the face, the N frames of images and the at least two frames of images, wherein N is greater than 1 and M is greater than 1. For ease of understanding, the random sampling of at least one frame of image from each segment is illustrated with several examples: for example, if 1 frame of image is randomly sampled from each segment, the at least two frames of images obtained are M frames of images; as another example, 2 frames of images may be randomly sampled from each segment; as yet another example, 2 frames may be sampled from the 1st segment, 3 frames from the 2nd segment, 1 frame from the 3rd segment, and so on. That is, the number of image frames randomly sampled from each segment may be the same or different, which is not limited in this embodiment;
calling feature extraction models obtained by pre-training, extracting a face emotion feature vector from the image containing the face, a frame emotion feature vector from the N frames of images, a video emotion feature vector from the at least two frames of images, and a voice emotion feature vector from the sound spectrograms. It should be noted that in this embodiment the four feature vectors are extracted by calling pre-trained feature extraction models, but the feature vectors could also be extracted with image processing algorithms based on the prior art; that is, this embodiment is not limited to extracting the features only with pre-trained feature extraction models;
performing feature fusion on the face emotion feature vector, the frame emotion feature vector, the video emotion feature vector and the voice emotion feature vector to obtain a multi-mode information feature vector;
and calling a character emotion recognition model obtained by pre-training, and recognizing the multi-mode information characteristic vector to obtain a character emotion recognition result.
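For ease of understanding, the recognition stage described above can be summarized by the following sketch, in which the four pre-trained extraction models, the fusion function and the trained character emotion recognition model are passed in as already-prepared callables; the function name and signature are illustrative only.

```python
def recognize_person_emotion(face_imgs, n_frames, sampled_frames, spectrograms,
                             face_extractor, frame_extractor, video_extractor,
                             sound_extractor, fuse_fn, svm):
    """High-level flow of the recognition stage: extract the four emotion feature vectors,
    fuse them into a multi-mode information feature vector, and classify it."""
    face_vec = face_extractor(face_imgs)            # face emotion feature vector
    frame_vec = frame_extractor(n_frames)           # frame emotion feature vector
    video_vec = video_extractor(sampled_frames)     # video emotion feature vector
    sound_vec = sound_extractor(spectrograms)       # voice emotion feature vector
    fused = fuse_fn(face_vec, frame_vec, video_vec, sound_vec)   # multi-mode information feature vector
    return svm.predict(fused[None, :])[0]           # character emotion recognition result
```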
Therefore, the image information extracted from the video, the image frame sequence information extracted in two different modes and having the time dimension and the imaging sound information are combined, so that the emotion recognition of people in the video is performed based on the multidimensional characteristics, the recognition precision is high, and the accuracy and the robustness of the emotion recognition of people in the video under the complex environment are improved.
The emotion recognition method for the people in the video provided by the embodiment can be applied to many fields, such as visual interaction, intelligent control, driving assistance, distance education, accurate advertisement delivery, social networks, instant messaging, psychology fields and the like.
Illustratively, an application scenario of the present embodiment is as follows: in the auxiliary driving field, the emotion recognition method can accurately recognize the emotion of the driver in the video by collecting the video containing the face of the driver, and if the emotion of the driver belongs to the preset emotion related to dangerous driving, corresponding processing can be performed, for example, the driver can be warned to control the emotion of the driver, so that safe driving is guaranteed.
Another application scenario of the present application is as follows: in the field of remote education, videos containing faces of students are collected, emotions of the students in the videos can be accurately recognized through the emotion recognition method provided by the embodiment of the application, if the emotions of the students belong to preset emotions with poor learning states, corresponding processing can be carried out, for example, a teacher can be reminded to inquire or pay attention to learning conditions of the students, or a teaching scheme is improved, so that the teaching effect is improved.
Another application scenario of the present application is as follows: in the remote education field, through collecting the video containing the teacher's face, the method for emotion recognition provided by the embodiment of the application can accurately recognize the emotion of the teacher in the video, and if the face expression of the teacher belongs to the preset emotion with poor teaching state, corresponding processing can be performed, for example, the teacher is reminded to adjust the state of the teacher, so that the teaching effect is improved.
Another application scenario of the present application is as follows: in the field of social networks, by taking microblogs as an example, when a user uses a smart phone to perform self-shooting, shoots a video containing a face of the user and uploads the video to the microblogs, the emotion of the user in the video can be accurately identified by the emotion identification method provided by the embodiment of the application, so that corresponding microblog content can be pushed for the user. For example, poems or other content that conform to a sad emotion may be pushed to the user when the user emotion is identified as sad, and songs or other content that conform to a happy emotion may be pushed to the user when the user emotion is identified as happy.
Another application scenario of the present application is as follows: in the field of criminal psychology, the emotion of a specific person in a video can be accurately recognized by collecting the video containing the face of the specific person under inquiry through the emotion recognition method provided by the embodiment of the application, so that the emotion recognition method can be used as one of judgment bases for judging whether the specific person lies, and further comprehensive judgment can be performed by combining the monitoring results of a lie detector for monitoring physiological characteristics such as pulse, respiration and skin resistance.
Another application scenario of the present application is as follows: in the field of artificial intelligence, by taking an artificial intelligence chat robot as an example, videos containing human faces of users are collected through the artificial intelligence chat robot, and the emotion of the users in the videos can be accurately identified through the emotion identification method provided by the embodiment of the application, so that proper topics are selected for chatting with the users.
The present embodiment may also be applied to other multiple application scenarios, and is not limited herein.
The method for recognizing the emotion of a person in a video provided in this embodiment may be implemented by a processing device with data processing capability. Specifically, the processing device may be a computer with data processing capability, including a Personal Computer (PC), a minicomputer or a mainframe, or it may be a server or a server cluster with data processing capability, which is not limited in this embodiment.
In order to facilitate understanding of the technical solution of the present application, a practical application scenario of the method of the present application is described below with reference to fig. 1. Fig. 1 shows an exemplary application scenario; referring to fig. 1, the scenario includes a training server 10 and a recognition server 20. In this embodiment, the training server 10 trains the feature extraction models and the character emotion recognition model using training samples. With the feature extraction models and the character emotion recognition model pre-trained by the training server 10, the recognition server 20 can recognize the emotion of a person based on the image containing a face obtained from the video to be recognized, the N frames of images extracted from the video to be recognized at predetermined time intervals, the at least two frames of images obtained by dividing the video to be recognized into M segments and randomly sampling at least one frame of image from each segment, and the sound spectrograms respectively corresponding to the image containing the face, the N frames of images and the at least two frames of images. The video to be recognized is input into the recognition server 20, and the result of recognizing the emotion of the person is obtained.
It should be noted that the training server 10 and the recognition server 20 in fig. 1 may be two independent servers in practical applications, or may be a single server integrating the model training function and the emotion recognition function. When the two servers are separate, they may communicate with each other over a network, which may include various types of connections, such as wired or wireless communication links, or fiber optic cables.
Next, the emotion recognition method of a person in a video provided by the present embodiment is explained from the viewpoint of a processing device having a data processing capability. Fig. 2 is a flowchart of a method for recognizing emotion of a person in a video according to this embodiment. As shown in fig. 2, the method for recognizing emotion of a person in a video provided by this embodiment includes steps S100 to S500, where step S100 belongs to a training phase, and steps S200 to S500 belong to a stage of performing emotion recognition of a person, and the flow of the method is specifically as follows:
and S100, training to obtain a feature extraction model and a character emotion recognition model.
In one possible implementation, the feature extraction models include a face feature extraction model, a frame feature extraction model, a sound feature extraction model and a video feature extraction model. Further, training the feature extraction models comprises independently training the face feature extraction model, the frame feature extraction model, the sound feature extraction model and the video feature extraction model; the character emotion recognition model is then trained after these four feature extraction models have been obtained.
Next, the model network structure and the training mode of the face feature extraction model, the frame feature extraction model, the voice feature extraction model, and the video feature extraction model will be described.
In one possible implementation, as shown in fig. 3, the face feature extraction model includes a first convolutional neural network, a first fully-connected layer, and a first classifier, which are connected in sequence.
As shown in fig. 3, the first convolutional neural network may be a MobileFaceNet. The MobileFaceNet is an open source face recognition network based on python language, is a lightweight face recognition network with industrial-level precision and speed, and is applicable to mobile terminal application and the like.
In one particular example, as shown in FIG. 3, the first classifier may employ a SoftMax classifier.
In one possible implementation, the training of the face feature extraction model includes: and inputting a training set comprising a training image of the facial expression and an emotion class label corresponding to the training image into the first convolutional neural network so as to train the facial feature extraction model.
In one specific example, the training images and their corresponding emotion class labels are used as training samples for training the face feature extraction model.
The training sample refers to a data sample for training a model, which may include a large number of training images including facial expressions, and has a pre-labeled emotion category label for each training image. It can be understood that the larger the training sample data size is, the better the model training effect is, but the training sample data size also affects the efficiency of model training, so the specific data size of the training sample is not limited in this embodiment, and may be determined according to the actual business requirement in the specific implementation.
In this implementation, the model may be trained in a supervised learning manner in machine learning, and therefore, the training samples may include: training set samples and verification set samples; it can be understood that, for all sample data in the training sample, a part of the sample data is taken as a training set sample, and another part of the sample data is taken as a verification set sample; the training set samples are used for training the model, and the verification set samples are used for verifying the model in the training process. For example, 80% of the data in the training samples are used as training set samples, and 20% of the data are used as validation set samples.
Wherein, the emotion category label is a label marking the emotion category. Different emotion category labels can be obtained according to different ways of dividing emotion categories. For example, this embodiment can classify emotions into seven categories: Anger, Disgust, Fear, Happiness, Sadness, Surprise and Neutral; accordingly, seven emotion category labels of anger, disgust, fear, happiness, sadness, surprise and neutral may be set in advance. This embodiment can also classify emotions into three categories according to another psychological definition: positive, negative and normal; in this way, three emotion category labels of positive, negative and normal can be set.
In specific implementation, the training samples can be acquired and generated through web crawlers or manual acquisition, manual labeling and the like, the training samples acquired in advance are stored in a sample database established in advance, and based on the method, the training samples can be directly read from the sample database established in advance in specific implementation.
For ease of understanding, the principles of the neural network model will first be briefly described. A neural network model is generally understood to be a nonlinear learning system that models the human brain and is formed by a large number of processing units, i.e., "neurons," that are widely interconnected. Because the network structure of the Convolutional Neural network has the characteristics of sparse connection and weight sharing, a Convolutional Neural network model (CNN) is often adopted in the field of image processing to realize image recognition.
It can be understood that the spatial relationship of the image is local, each neuron only needs to feel a local image area without feeling a global image, and then at a higher level, the neurons which feel different local areas are integrated to obtain global information, so that the number of weight parameters which need to be trained by the convolutional neural network can be reduced. In order to further reduce the weight parameters of the training, a weight sharing mode may be adopted for training, specifically, a feature of an image is extracted by using the same convolution kernel for different regions of the image, for example, an edge along a certain direction, the entire image is convolved by using a plurality of convolution kernels, so that a plurality of features of the entire image may be obtained, and the features are mapped to obtain a classification result of the image.
In this implementation, the training process of the face feature extraction model is, for example:
the pre-established initial face feature extraction model comprises a first convolutional neural network, a first full-connection layer and a first classifier which are sequentially connected.
The training processes of the first convolution neural network, the first full-connection layer and the first classifier of the initial face feature extraction model are as follows: the Loss function Loss of the face feature extraction model is used for measuring the difference between the predicted value and the target value, and the smaller the output value of the Loss is, the closer the predicted value is to the target value, that is, the more accurate the model identification is. Therefore, the training process of the face feature extraction model is actually a process of continuously optimizing the parameters of the model through the training of sample data to continuously reduce the output value of the model Loss. When the output value of Loss tends to be stable, the model is considered to be in a convergence state, and the model trained at the moment can be used as a face feature extraction model to be applied to extracting face emotion feature vectors.
The reduction of the Loss output value is mainly realized by optimizing model parameters by a gradient descent method, and specifically, the Loss value is reduced by continuously moving the weight value to the opposite direction of the gradient corresponding to the current point.
In practical applications, the data set can be divided into five parts or batches for five-fold cross validation.
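Such a five-fold split can be sketched with scikit-learn as follows; the data set size is a placeholder.

```python
import numpy as np
from sklearn.model_selection import KFold

num_samples = 1000                                   # placeholder size of the labeled training set
kf = KFold(n_splits=5, shuffle=True, random_state=0)
for fold, (train_idx, val_idx) in enumerate(kf.split(np.arange(num_samples))):
    # each of the five batches serves once as the validation split
    print(f"fold {fold}: {len(train_idx)} training samples, {len(val_idx)} validation samples")
```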
Inputting one batch of training images into an initial human face feature extraction model, extracting and mapping features of the training images through a first convolutional neural network, a first full-link layer and a first classifier, so that a prediction result of the emotion type of the training images is obtained, and an output value of Loss can be calculated according to the prediction result and the emotion class label of the training images. Based on the output value of Loss, the gradient of each parameter in the initial human face feature extraction model can be calculated through a back propagation algorithm, and the parameter weight of the model is updated according to the gradient.
When all training samples in the sample library have been used, the order of the samples can be shuffled and training repeated for several rounds. When the Loss output value of the model stabilizes at a small value, the pre-divided validation set samples can be used for validation. If the model also yields a small Loss output value when recognizing the validation set samples, the model is considered to have high recognition accuracy; training can then be stopped, and the trained model is used as the face feature extraction model for extracting face emotion feature vectors from images containing faces in the subsequent character emotion recognition stage.
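The training procedure described above (forward pass, Loss computation, back propagation and gradient-descent parameter update) corresponds to a standard supervised loop. The PyTorch sketch below uses placeholder data and a small stand-in network instead of the actual MobileFaceNet-based model.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Placeholder tensors standing in for face-expression training images and emotion labels
images = torch.randn(64, 3, 112, 112)            # 112x112 crops are a common face-network input size
labels = torch.randint(0, 7, (64,))              # seven emotion categories
loader = DataLoader(TensorDataset(images, labels), batch_size=16, shuffle=True)

model = nn.Sequential(nn.Conv2d(3, 16, 3, stride=2), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
                      nn.Flatten(), nn.Linear(16, 7))   # stand-in for CNN + fully-connected layer
criterion = nn.CrossEntropyLoss()                # the Loss measuring predicted vs. target emotion
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)

for epoch in range(5):                           # in practice stop when the Loss value stabilizes
    for x, y in loader:
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()                          # back propagation computes the gradients
        optimizer.step()                         # gradient descent updates the parameter weights
```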
In one possible implementation, as shown in fig. 5, the frame feature extraction model includes a second convolutional neural network, a second fully-connected layer, and a second classifier, which are connected in sequence.
Wherein, as shown in fig. 5, the second convolutional neural network can adopt DenseNet-121. The name DenseNet-121 means that the network has 121 layers in total: (6 + 12 + 24 + 16) × 2 + 3 (transition layers) + 1 (7×7 conv) + 1 (classification layer) = 121. DenseNet establishes dense connections from every preceding layer to each subsequent layer, and feature reuse is achieved by concatenating features along the channel dimension. These properties allow DenseNet to achieve superior performance with fewer parameters and less computation.
In one particular example, as shown in fig. 5, the second classifier may employ a SoftMax classifier.
In one possible implementation, training the frame feature extraction model includes:
acquiring a training video, extracting N frames of training images from the training video at predetermined time intervals, and labeling the N frames of training images with an emotion category label, wherein the N frames of training images are labeled as a whole with a single emotion category label;
inputting a training set comprising N training images and emotion category labels corresponding to the N training images into the second convolutional neural network to train the frame feature extraction model.
In a specific example, N frames of training images extracted from a training video and their corresponding emotion class labels are used as training samples of a training frame feature extraction model.
The modes of obtaining and marking the training samples of the training frame feature extraction model, dividing all sample data in the training samples, and labeling emotion categories are similar to those introduced in the training face feature extraction model, and are not described herein again.
In this implementation, the training process of the frame feature extraction model is, for example:
the pre-established frame feature extraction model comprises a second convolutional neural network, a second full-connection layer and a second classifier which are connected in sequence. The training process of the second convolutional neural network, the second fully-connected layer and the second classifier of the initial frame feature extraction model is as follows: the Loss function Loss of the frame feature extraction model is used for measuring the difference between the predicted value and the target value, and the smaller the output value of the Loss, the closer the predicted value is to the target value, that is, the more accurate the model identification is. Therefore, the training process of the frame feature extraction model is actually a process of continuously optimizing the parameters of the model through the training of sample data to continuously reduce the output value of the model Loss. When the output value of Loss is reduced to a certain degree or tends to be stable, the model is considered to be in a convergence state, and the model trained at the moment can be used as a frame feature extraction model to be applied to extracting frame emotion feature vectors. The specific principle and the training process are similar to those of the aforementioned training face feature extraction model, and are not described herein again.
In one possible implementation, as shown in fig. 6, the acoustic feature extraction model includes a third convolutional neural network, a third fully-connected layer, and a third classifier connected in sequence.
Wherein, as shown in FIG. 6, the third convolutional neural network can adopt EfficientNet-B4. The EfficientNet network utilizes a model scaling method of uniformly scaling all dimensions of a model by using a composite coefficient, and each dimension is expanded by using a fixed group of scaling coefficients, so that the accuracy and efficiency of the model can be greatly improved.
In one particular example, as shown in fig. 6, the third classifier may employ a SoftMax classifier.
In one possible implementation, training the obtained acoustic feature extraction model includes:
acquiring audio training data, converting the audio training data into an image format to obtain a sound spectrum training image, and marking an emotion class label on the sound spectrum training image;
inputting a training set comprising a sound spectrum training image and emotion class labels corresponding to the sound spectrum training image into the third convolutional neural network to train the sound feature extraction model.
In a specific example, a sound spectrum training graph and corresponding emotion class labels are used as training samples for training a sound feature extraction model.
The modes of obtaining and labeling the training samples of the training voice feature extraction model, dividing all sample data in the training samples, and labeling emotion categories are similar to those introduced in the training human face feature extraction model, and are not described herein again.
In this implementation, the training process of the acoustic feature extraction model is, for example:
the pre-established sound feature extraction model comprises a third convolutional neural network, a third full-connection layer and a third classifier which are connected in sequence. The training process of the third convolutional neural network, the third fully-connected layer and the third classifier of the initial sound feature extraction model is as follows: the Loss function Loss of the acoustic feature extraction model is used for measuring the difference between the predicted value and the target value, and the smaller the output value of the Loss, the closer the predicted value is to the target value, that is, the more accurate the model identification is. Therefore, the training process of the frame feature extraction model is actually a process of continuously optimizing the parameters of the model through the training of sample data to continuously reduce the output value of the model Loss. When the output value of Loss is reduced to a certain degree or tends to be stable, the model is considered to be in a convergence state, and the model trained at the moment can be used as a frame feature extraction model to be applied to extracting frame emotion feature vectors. The specific principle and the training process are similar to those of the aforementioned training face feature extraction model, and are not described herein again.
In one possible implementation, as shown in fig. 7, the video feature extraction model includes a fourth convolutional neural network, a fourth fully-connected layer, and a fourth classifier, which are connected in sequence.
Wherein, as shown in fig. 7, the fourth convolutional neural network may employ ResNet-101. The 101-layer deep residual network ResNet-101 adds residual units through a shortcut mechanism, so that residual learning alleviates the degradation problem of deep networks; this permits a deeper network while ensuring the extraction effect.
In one particular example, as shown in FIG. 7, the fourth classifier may employ a SoftMax classifier.
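For illustration only, the three backbones above that have standard torchvision implementations could be instantiated as follows; removing their stock classification heads so that each network outputs its pre-classifier feature vector is an implementation choice, not something prescribed by this application, and MobileFaceNet would come from a third-party implementation.

```python
import torch.nn as nn
from torchvision import models

# Frame branch: DenseNet-121, 1024-dimensional feature before its classifier
densenet = models.densenet121(weights=None)
densenet.classifier = nn.Identity()            # expose the 1024-d frame emotion feature vector

# Sound branch: EfficientNet-B4 applied to sound spectrogram images, 1792-d feature
effnet = models.efficientnet_b4(weights=None)
effnet.classifier = nn.Identity()

# Video branch: ResNet-101, 2048-d feature before its fully-connected layer
resnet = models.resnet101(weights=None)
resnet.fc = nn.Identity()

# Each backbone would then be wrapped with its own fully-connected layer and SoftMax classifier
# (second/third/fourth), e.g. EmotionFeatureExtractor(densenet, feat_dim=1024) as sketched earlier.
```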
In one possible implementation, training the video feature extraction model includes:
acquiring a training video, dividing the training video into M segments to obtain M sub-training videos, i.e. S1, S2, ..., SM in fig. 7, randomly sampling at least one frame of training image from each sub-training video to obtain at least two frames of training images, and labeling the at least two frames of training images with emotion category labels;
inputting a training set comprising at least two training images and emotion class labels corresponding to the at least two training images into the fourth convolutional neural network to train the video feature extraction model.
In a specific example, at least two training images extracted from a training video and corresponding emotion class labels are used as training samples of a training video feature extraction model.
The modes of obtaining and marking the training samples of the training video feature extraction model, dividing all sample data in the training samples, and labeling emotion categories are similar to those introduced in the training face feature extraction model, and are not repeated here.
In this implementation, the training process of the video feature extraction model is, for example:
the pre-established video feature extraction model comprises a fourth convolutional neural network, a fourth full-connection layer and a fourth classifier which are sequentially connected. The training process of the fourth convolutional neural network, the fourth fully-connected layer and the fourth classifier of the initial video feature extraction model is as follows: the Loss function Loss of the video feature extraction model is used for measuring the difference between the predicted value and the target value, and the smaller the output value of the Loss, the closer the predicted value is to the target value, that is, the more accurate the model identification is. Therefore, the training process of the video feature extraction model is actually a process of continuously optimizing the parameters of the model through the training of sample data to continuously reduce the output value of the model Loss. When the output value of Loss is reduced to a certain degree or tends to be stable, the model is considered to be in a convergence state, and the model trained at the moment can be used as a video feature extraction model to be applied to extracting the video emotion feature vector. The specific principle and the training process are similar to those of the aforementioned training face feature extraction model, and are not described herein again.
Through the above, a trained face feature extraction model, a frame feature extraction model, a voice feature extraction model and a video feature extraction model can be obtained, and then a character emotion recognition model is trained through the outputs of the face feature extraction model, the frame feature extraction model, the voice feature extraction model and the video feature extraction model.
Fig. 3 shows the network structure of the face feature extraction model during its training. In the subsequent stages of training the character emotion recognition model and performing character emotion recognition, the face feature extraction model further includes an image emotion feature fusion sub-model connected to the first classifier. The image emotion feature fusion sub-model does not need to be trained; its rule can be configured to re-classify, based on the proportions of the respective classes, the plurality of classification results obtained by the first classifier. For example, for an image containing the facial expressions of five characters as shown in fig. 4, if the recognized emotion of three characters is happy (or positive) and the emotion of the other two characters is neutral (or normal, shown as "quiet" in the figure), the face emotion feature vector output by the image emotion feature fusion sub-model represents happy (or positive) as the overall emotion of the image.
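A minimal sketch of such a proportion-based re-classification rule; the label set and the proportion-vector output format are assumptions for illustration:

```python
# Re-classify the whole image from the proportions of the per-face emotion results.
from collections import Counter
import numpy as np

EMOTIONS = ["positive", "neutral", "negative"]        # assumed label set

def fuse_face_emotions(per_face_labels):
    """per_face_labels: one predicted emotion label per detected face in the image."""
    counts = Counter(per_face_labels)                  # e.g. {"positive": 3, "neutral": 2}
    overall = max(EMOTIONS, key=lambda e: counts.get(e, 0))          # largest proportion wins
    vector = np.array([counts.get(e, 0) / len(per_face_labels) for e in EMOTIONS])
    return overall, vector                             # overall image emotion + proportion vector

# Example from the text: three positive faces and two neutral faces -> "positive"
print(fuse_face_emotions(["positive"] * 3 + ["neutral"] * 2))
```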
In one possible implementation, as shown in fig. 8, training a character emotion recognition model includes:
for a training video, determining the face emotion feature vector output by the image emotion feature fusion sub-model of the face feature extraction model, the frame emotion feature vector output by the second convolutional neural network of the frame feature extraction model, the video emotion feature vector output by the fourth convolutional neural network of the video feature extraction model, and the voice emotion feature vector output by the third convolutional neural network of the voice feature extraction model, and performing feature fusion on them to obtain a multi-mode information feature vector, for example by inputting them into a multi-mode feature fusion module. For the four extraction models, the corresponding input data can be obtained in the manners described above: an image containing a face can be recognized and extracted from the training video as the input of the face feature extraction model, N frames of images are extracted at predetermined time intervals as the input of the frame feature extraction model, at least one frame of image is randomly sampled from each of M segments of the training video to obtain at least two frames of images as the input of the video feature extraction model, and the audio data parsed from the training video is converted into a sound spectrogram as the input of the sound feature extraction model;
and taking the multi-mode information feature vector as a training sample to train the character emotion recognition model.
In one particular example, as shown in FIG. 8, the character emotion recognition model may employ a Support Vector Machine (SVM) classifier.
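A minimal sklearn sketch of training such an SVM classifier on the fused multi-mode information feature vectors; the feature dimension and label encoding are placeholders for illustration:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 128))      # placeholder fused multi-mode information feature vectors
y_train = rng.integers(0, 3, size=200)     # 0 = positive, 1 = normal, 2 = negative (assumed encoding)

svm = SVC(kernel="rbf", C=1.0)             # support vector machine classifier
svm.fit(X_train, y_train)

x_new = rng.normal(size=(1, 128))          # fused vector of a video to be recognized
print(svm.predict(x_new))                  # character emotion recognition result
```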
In the model training process, the training samples for training the five models, i.e., the face feature extraction model, the frame feature extraction model, the voice feature extraction model, the video feature extraction model and the character emotion recognition model, may be from the same batch of training videos, for example, a group of images including a face, the N frames of images, the at least two frames of images and the voice spectrogram are respectively obtained from each of a batch of training videos, and the five models are trained. In addition, the training samples of the above five models may also respectively adopt different training sets, which is not limited in this embodiment.
Therefore, the emotion recognition method for persons in video provided by this embodiment independently trains the four feature-vector extraction models and the classification model; when facing data sets other than the training samples, the rules of the data can be learned faster, that is, the overall model has a strong generalization capability.
Step S200, obtaining an image containing a face in a video to be recognized, N frames of images extracted from the video to be recognized at predetermined time intervals, at least two frames of images obtained by dividing the video to be recognized into M segments and randomly sampling at least one frame of image from each segment, and the sound spectrograms respectively corresponding to the image containing the face, the N frames of images and the at least two frames of images.
In a possible implementation manner, the acquiring an image including a face and a sound spectrogram corresponding to the image including the face in a video to be recognized includes:
respectively detecting each frame of image of the video to be recognized by using a preset face detection model to obtain an image containing a face, and recording time information of the image containing the face;
intercepting audio data of corresponding time in the video to be identified according to the time information of the image containing the face;
and carrying out spectrum analysis on the audio data to obtain a sound spectrogram corresponding to the image containing the human face.
In a specific example, a face detection model based on edge features or a face detection model based on a statistical theory method (e.g., Adaboost detection algorithm) may be used to detect each frame of image of a video to be recognized, select an image including a face, and record time information of the image including the face, where the time information of the image may also be represented by parameters such as the number of frames in the video in which the image is located.
In a specific example, in this implementation, for example, for a video to be recognized that is 30 seconds long in total, if it is recognized that the 5 th to 10 th second images include a human face, the 5 th to 10 th second image frames are taken as images including the human face, the 5 th to 10 th second audio data are intercepted, and the 5 th to 10 th second audio data are subjected to spectrum analysis, so as to obtain a sound spectrogram corresponding to the 5 th to 10 th second images.
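As a non-limiting sketch of this procedure, the following assumes an AdaBoost-based Haar cascade detector from OpenCV and an audio track already demultiplexed into a WAV file; librosa is used for the spectrum analysis:

```python
# Detect face-bearing frames with their time stamps, then intercept and analyse
# the audio of the corresponding time span to obtain the sound spectrogram.
import cv2
import librosa

def detect_face_times(video_path: str):
    """Return (frame, time_in_seconds) for every frame that contains a face."""
    detector = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    results, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        if len(detector.detectMultiScale(gray, 1.1, 5)) > 0:
            results.append((frame, idx / fps))          # record the time information
        idx += 1
    cap.release()
    return results

def spectrogram_for_span(wav_path: str, start_s: float, end_s: float):
    """Intercept the audio of the corresponding time and run spectrum analysis."""
    y, sr = librosa.load(wav_path, sr=None, offset=start_s, duration=end_s - start_s)
    spec = librosa.feature.melspectrogram(y=y, sr=sr)
    return librosa.power_to_db(spec)                    # sound spectrogram (dB scale)
```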
In one possible implementation manner, the acquiring N frames of images extracted from a video to be identified at predetermined time intervals and a sound spectrogram corresponding to the N frames of images includes:
extracting N frames of images from a video to be identified at preset time intervals, and recording time information of each frame of image in the N frames of images;
intercepting audio data of corresponding time in the video to be identified according to the time information of each frame of image in the N frames of images;
and carrying out spectrum analysis on the audio data to obtain the sound spectrogram corresponding to the N frames of images.
Following the above example, for example, if the predetermined time interval is 5 seconds, the images of 1 st, 5 th, 10 th, 15 th, 20 th, 25 th, and 30 th seconds are extracted from the video to be recognized that is 30 seconds long, and the audio data of 1 st, 5 th, 10 th, 15 th, 20 th, 25 th, and 30 th seconds are intercepted and subjected to spectrum analysis, so as to obtain the sound spectrogram corresponding to the "N-frame image".
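A brief sketch (OpenCV assumed) of extracting N frames at a predetermined time interval while recording each frame's time information:

```python
import cv2

def sample_frames_at_interval(video_path: str, interval_s: float = 5.0):
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    frames, t = [], 0.0
    while t * fps < total:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(t * fps))
        ok, frame = cap.read()
        if ok:
            frames.append((frame, t))        # frame plus its time information
        t += interval_s
    cap.release()
    return frames                            # the "N frames of images" with time stamps
```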
In a possible implementation manner, the acquiring of the at least two frames of images, obtained by dividing the video to be identified into M segments and randomly sampling at least one frame of image from each segment, and of the sound spectrogram corresponding to the at least two frames of images includes:
dividing a video to be identified into M sections, randomly sampling at least one frame of image from each section to obtain at least two frames of images, and recording time information of each frame of image in the at least two frames of images;
intercepting audio data of corresponding time in the video to be identified according to the time information of each frame of image in at least two frames of images;
and carrying out spectrum analysis on the audio data to obtain the sound spectrogram corresponding to the at least two frames of images.
Continuing with the previous example, suppose M is set to 5 and 1 frame of image is randomly sampled from each segment. The video to be identified, which is 30 seconds long in total, is then divided into 5 segments of 6 seconds each, 1 frame of image is randomly sampled from each of the 5 segments to obtain the "at least two frames of images", and the audio data at the times of these 5 sampled frames is intercepted and subjected to spectrum analysis to obtain the sound spectrogram corresponding to the "at least two frames of images".
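A sketch of the M-segment random sampling at recognition time, paralleling the training-stage sampling described earlier; OpenCV is an assumption and the default M follows the example above:

```python
import random
import cv2

def sample_segment_frames(video_path: str, m_segments: int = 5):
    """Divide the video into M equal segments and randomly sample one frame per segment."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    fps = cap.get(cv2.CAP_PROP_FPS)
    seg_len = total // m_segments
    sampled = []
    for s in range(m_segments):
        idx = random.randrange(s * seg_len, (s + 1) * seg_len)   # random frame inside segment s
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ok, frame = cap.read()
        if ok:
            sampled.append((frame, idx / fps))    # keep the time for the audio interception
    cap.release()
    return sampled                                # the "at least two frames of images"
```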
In one possible implementation, the randomly sampling at least one frame of image from each segment includes:
and randomly sampling each segment L times to obtain L sets of "at least two frames of images", wherein L is more than 1.
Step S300, calling a feature extraction model obtained by pre-training (obtained by training in step S100), extracting face emotion feature vectors from the image containing the face, extracting frame emotion feature vectors from the N frames of images, extracting video emotion feature vectors from the at least two frames of images and extracting voice emotion feature vectors from the voice spectrogram;
in a possible implementation manner, the invoking a feature extraction model obtained by pre-training, and extracting a face emotion feature vector from the image including the face includes:
inputting the image containing the human face into a human face feature extraction model obtained by pre-training for processing, wherein the human face feature extraction model comprises a first convolutional neural network, a first full-link layer, a first classifier and an image emotion feature fusion sub-model which are sequentially connected;
and the image emotion feature fusion sub-model outputs face emotion feature vectors according to the proportion of emotion classification of each face in the image containing the face.
In one possible implementation, the first convolutional neural network is MobileFaceNet.
In a specific example, the pre-trained face feature extraction model includes a first convolutional neural network, a first fully-connected layer, a first classifier, and an image emotion feature fusion sub-model connected to the first classifier, which are sequentially connected as shown in fig. 3, and a determined face emotion feature vector can be output through the image emotion feature fusion sub-model no matter whether an image including a face includes only one face or a plurality of faces.
In a possible implementation manner, the invoking a feature extraction model obtained by pre-training, and extracting a frame emotion feature vector from the N frames of images includes:
inputting the N frames of images into a frame feature extraction model obtained by pre-training for processing, wherein the frame feature extraction model comprises a second convolutional neural network, a second full connection layer and a second classifier which are sequentially connected;
and determining the feature vector output by the second convolutional neural network as a frame emotion feature vector.
In one possible implementation, the second convolutional neural network is DenseNet-121.
In one specific example, the frame feature extraction model obtained by pre-training is shown in fig. 5.
In a possible implementation manner, the invoking a pre-trained feature extraction model to extract a voice emotion feature vector from the voice spectrogram includes:
inputting the sound frequency spectrogram into a sound feature extraction model obtained by pre-training for processing, wherein the sound feature extraction model comprises a third convolutional neural network, a third full connection layer and a third classifier which are sequentially connected;
and determining the feature vector output by the third convolutional neural network as a voice emotion feature vector.
In one possible implementation, the third convolutional neural network is EfficientNet-B4.
In one specific example, the pre-trained acoustic feature extraction model is shown in fig. 6.
In a possible implementation manner, the invoking a feature extraction model obtained by pre-training to extract a video emotion feature vector from the at least two frames of images includes:
inputting the at least two frames of images into a video feature extraction model obtained by pre-training for processing, wherein the video feature extraction model comprises a fourth convolutional neural network, a fourth full connection layer and a fourth classifier which are sequentially connected;
and determining the feature vector output by the fourth convolutional neural network as a video emotion feature vector.
In one possible implementation, the fourth convolutional neural network is ResNet 101.
In a specific example, the video feature extraction model obtained by pre-training is shown in fig. 7, and samples of at least two frames of images span the whole video, so that long-term time relation modeling is supported.
In one specific example, the number of output neurons of the first to fourth convolutional neural networks is 1024. The image including the human face input into the first convolutional neural network is 640 x 640 x 3 x 1 (the length, width, number of channels and number of image frames, respectively); the first convolutional neural network outputs a 1 x 1024 feature vector, and the first fully connected layer outputs a 1 x 7 feature vector. The N frames of images input into the second convolutional neural network are 640 x 640 x 3 x 10 (the length, width, number of channels and the number of frames input at a single time, respectively); the second convolutional neural network outputs a 10 x 1024 feature vector, and the second fully connected layer outputs a 10 x 3 feature vector. The sound spectrogram input into the third convolutional neural network is 640 x 640 x 3 x 10 (the length, width, number of channels and number of windows of the spectrogram, respectively); the third convolutional neural network outputs a 1 x 1024 feature vector, and the third fully connected layer outputs a 1 x 3 feature vector. The at least two frames of images input into the fourth convolutional neural network are 640 x 640 x 3 x 5 (the length, width, number of channels and number of image frames, respectively, the number of image frames being the number of segments into which the video is divided); the fourth convolutional neural network outputs a 1 x 1024 feature vector, and the fourth fully connected layer outputs a 1 x 3 feature vector.
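The example dimensions above can be checked with a shape-only sketch; the dummy backbone below is purely illustrative and stands in for MobileFaceNet, DenseNet-121, EfficientNet-B4 and ResNet101:

```python
import torch
import torch.nn as nn

class DummyBackbone(nn.Module):
    """Toy module mapping each input frame to a 1024-dimensional feature vector."""
    def __init__(self, out_dim: int = 1024):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.proj = nn.Linear(3, out_dim)       # shapes only, not a real backbone

    def forward(self, x):                        # x: (frames, 3, 640, 640)
        v = self.pool(x).flatten(1)              # (frames, 3)
        return self.proj(v)                      # (frames, 1024)

face = torch.randn(1, 3, 640, 640)               # image containing a face
frames = torch.randn(10, 3, 640, 640)            # N = 10 frame images
backbone, fc_face, fc_frame = DummyBackbone(), nn.Linear(1024, 7), nn.Linear(1024, 3)
print(fc_face(backbone(face)).shape)             # torch.Size([1, 7])
print(fc_frame(backbone(frames)).shape)          # torch.Size([10, 3])
```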
In summary, as shown in fig. 8, the face emotion feature vector, the frame emotion feature vector, the voice emotion feature vector, and the video emotion feature vector are respectively output by the image emotion feature fusion sub-model of the face feature extraction model, the second convolutional neural network of the frame feature extraction model, the third convolutional neural network of the voice feature extraction model, and the fourth convolutional neural network of the video feature extraction model.
Step S400, performing feature fusion on the face emotion feature vector, the frame emotion feature vector, the video emotion feature vector and the voice emotion feature vector (obtained in step S300) to obtain a multi-mode information feature vector.
In one possible implementation, step S400 further includes:
carrying out feature fusion on the face emotion feature vector, the frame emotion feature vector, the video emotion feature vector and the voice emotion feature vector to obtain a multi-mode information feature vector, wherein the method comprises the following steps:
performing feature fusion on the face emotion feature vector, the frame emotion feature vector, the video emotion feature vector and the voice emotion feature vector;
performing dimensionality reduction on the feature vector after feature fusion, wherein in a specific example the dimensionality reduction of the fused feature vector can be realized by using the PCA (principal component analysis) method packaged in the sklearn tool library (a minimal sketch follows this list);
and carrying out normalization processing on the feature vector obtained after the dimension reduction processing to obtain the multi-mode information feature vector of the four channels.
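A minimal sketch (sklearn assumed) of this fusion step: concatenate the four emotion feature vectors, reduce dimensionality with PCA, then normalize. The sample count, fused dimensionality and target dimension are placeholders for illustration:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import normalize

def concat(face_vec, frame_vec, video_vec, sound_vec):
    """Feature fusion: flatten and concatenate the four emotion feature vectors."""
    return np.concatenate([np.ravel(face_vec), np.ravel(frame_vec),
                           np.ravel(video_vec), np.ravel(sound_vec)])

# Fit the PCA projection on fused vectors from the training videos (placeholder data).
rng = np.random.default_rng(0)
train_fused = rng.normal(size=(200, 4096))          # assumed fused dimensionality
pca = PCA(n_components=128).fit(train_fused)        # dimensionality reduction

def fuse_features(face_vec, frame_vec, video_vec, sound_vec):
    fused = concat(face_vec, frame_vec, video_vec, sound_vec)[None, :]
    reduced = pca.transform(fused)                   # PCA dimensionality reduction
    return normalize(reduced)                        # normalization -> multi-mode information feature vector
```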
In a possible implementation manner, in the case that each segment is sampled L times at random, the video emotion feature vector fused in step S400 is 1 video emotion feature vector randomly selected from the L video emotion feature vectors. It can be understood that, when each segment is randomly sampled L times, the fourth convolutional neural network performs feature extraction on the L sets of "at least two frames of images" obtained by the L rounds of random sampling, and outputs and stores L video emotion feature vectors. Thereby, the diversity of features can be increased.
Step S500, calling the character emotion recognition model obtained by pre-training (obtained by training in step S100), and recognizing the multi-mode information feature vector (obtained in step S400) to obtain a character emotion recognition result.
In one possible implementation, the character emotion recognition model is a Support Vector Machine (SVM) classifier. In a specific example, the SVM classifier outputs a character emotion recognition result representing whether the emotion of a character in the video to be recognized is positive, negative or normal according to the multi-mode information feature vector.
In summary, the method for recognizing the emotion of a person in a video provided by this embodiment integrates the image information extracted from the video, the image frame sequence information with a time dimension extracted in two different ways, and the sound information converted into images, so that emotion recognition of persons in the video is performed on the basis of multi-dimensional features. The method therefore has higher recognition accuracy and helps to improve the accuracy and robustness of emotion recognition of persons in video in complex environments.
As shown in fig. 9, another embodiment of the present application provides an emotion recognition apparatus for a person in a video, including:
the multimode data acquisition module is used for acquiring an image containing a face in a video to be identified, N frames of images extracted from the video to be identified at preset time intervals, at least two frames of images obtained by randomly sampling at least one frame of image from each section after dividing the video to be identified into M sections, and sound frequency spectrogram images respectively corresponding to the image containing the face, the N frames of images and the at least two frames of images, wherein N is more than 1, and M is more than 1;
the multimode feature extraction module is used for calling a feature extraction model obtained by pre-training, extracting face emotion feature vectors from the images containing the faces, extracting frame emotion feature vectors from the N frames of images, extracting video emotion feature vectors from the at least two frames of images and extracting sound emotion feature vectors from the sound spectrogram;
the multimode characteristic fusion module is used for carrying out characteristic fusion on the face emotion characteristic vector, the frame emotion characteristic vector, the video emotion characteristic vector and the voice emotion characteristic vector to obtain multimode information characteristic vectors;
and the emotion recognition module is used for calling a character emotion recognition model obtained through pre-training, recognizing the multi-mode information characteristic vector and obtaining a character emotion recognition result.
In one possible implementation, the multimode data acquisition module includes:
the first acquisition submodule is used for respectively detecting each frame of image of the video to be recognized by using a preset face detection model to obtain an image containing a face, and recording time information of the image containing the face; intercepting audio data of corresponding time in the video to be identified according to the time information of the image containing the face; and carrying out spectrum analysis on the audio data to obtain a sound spectrogram corresponding to the image containing the human face.
In one possible implementation, the multimode data acquisition module includes:
the second acquisition submodule is used for extracting N frames of images from the video to be identified at preset time intervals and recording the time information of each frame of image in the N frames of images; intercepting audio data of corresponding time in the video to be identified according to the time information of each frame of image in the N frames of images; and carrying out spectrum analysis on the audio data to obtain a sound spectrogram corresponding to the image containing the human face.
In one possible implementation, the multimode data acquisition module includes:
the third acquisition submodule is used for dividing the video to be identified into M sections, randomly sampling at least one frame of image from each section to obtain at least two frames of images, and recording the time information of each frame of image in the at least two frames of images; intercepting audio data of corresponding time in the video to be identified according to the time information of each frame of image in at least two frames of images; and carrying out spectrum analysis on the audio data to obtain a sound spectrogram corresponding to the image containing the human face.
In one possible implementation, the third acquisition submodule is configured to randomly sample at least one frame of image from each segment by randomly sampling each segment L times to obtain L sets of at least two frames of images, wherein L is more than 1.
In one possible implementation, the multimodal feature extraction module includes:
the face feature extraction submodule is used for inputting the image containing the face into a face feature extraction model for processing, wherein the face feature extraction model comprises a first convolution neural network, a first full connection layer, a first classifier and an image emotion feature fusion submodel which are sequentially connected, the first convolution neural network is used for receiving the face image, the image emotion feature fusion submodel is used for outputting face emotion feature vectors according to the proportion of emotion classification of each face in the image containing the face, and the first convolution neural network can adopt a MobileFaceNet;
the frame feature extraction submodule is used for inputting the N frames of images into a frame feature extraction model obtained by pre-training for processing, wherein the frame feature extraction model comprises a second convolutional neural network, a second full-connection layer and a second classifier which are sequentially connected, the second convolutional neural network is used for receiving the N frames of images and outputting frame emotion feature vectors, and the second convolutional neural network can adopt DenseNet-121;
the sound feature extraction submodule is used for inputting the sound spectrogram into a sound feature extraction model obtained by pre-training for processing, wherein the sound feature extraction model comprises a third convolutional neural network, a third full connection layer and a third classifier which are sequentially connected, the third convolutional neural network is used for receiving the sound spectrogram and outputting the sound emotion feature vector, and the third convolutional neural network can adopt EfficientNet-B4;
and the video feature extraction submodule is used for inputting the at least two frames of images into a video feature extraction model obtained by pre-training for processing, wherein the video feature extraction model comprises a fourth convolutional neural network, a fourth full connection layer and a fourth classifier which are sequentially connected, the fourth convolutional neural network is used for receiving the at least two frames of images and outputting the video emotion feature vector, and the fourth convolutional neural network is ResNet 101.
In a possible implementation manner, the multi-mode feature fusion module is configured to perform feature fusion on the face emotion feature vector, the frame emotion feature vector, the video emotion feature vector and the voice emotion feature vector; perform dimensionality reduction on the fused feature vector, wherein in a specific example the dimensionality reduction can be realized by using the PCA (principal component analysis) method packaged in the sklearn tool library; and perform normalization processing on the feature vector obtained after the dimensionality reduction to obtain the multi-mode information feature vector of the four channels.
In a possible implementation manner, in the case of randomly sampling each segment L times, the video emotion feature vector fused by the multi-mode feature fusion module is the one extracted by the feature extraction model from 1 set of "at least two frames of images" randomly selected from the L sets of "at least two frames of images".
In one possible implementation, the character emotion recognition model is a support vector machine classifier.
It should be noted that the principle and the work flow of the emotion recognition apparatus for people in video provided in this embodiment are similar to the emotion recognition stage in the emotion recognition method for people in video, and reference may be made to the above description for relevant points, which is not described herein again.
As shown in fig. 10, a computer system suitable for implementing the emotion recognition apparatus for a person in a video provided by the above-described embodiment includes a central processing unit (CPU) that can perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) or a program loaded from a storage section into a random access memory (RAM). The RAM also stores various programs and data necessary for the operation of the computer system. The CPU, the ROM and the RAM are connected to one another via a bus, and an input/output (I/O) interface is also connected to the bus.
An input section includes a keyboard, a mouse, and the like; an output section includes a liquid crystal display (LCD), a speaker, and the like; a storage section includes a hard disk and the like; and a communication section includes a network interface card such as a LAN card or a modem. The communication section performs communication processing via a network such as the Internet. A drive is also connected to the I/O interface as needed. A removable medium such as a magnetic disk, an optical disk, a magneto-optical disk or a semiconductor memory is mounted on the drive as necessary, so that a computer program read out therefrom is installed into the storage section as needed.
In particular, the processes described in the above flowcharts may be implemented as computer software programs according to the present embodiment. For example, the present embodiments include a computer program product comprising a computer program tangibly embodied on a computer-readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network via the communication section, and/or installed from a removable medium.
The flowchart and schematic diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to the present embodiments. In this regard, each block in the flowchart or schematic diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the schematic and/or flowchart illustration, and combinations of blocks in the schematic and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The modules described in the present embodiment may be implemented by software or hardware. The described modules may also be provided in a processor, which may be described as: a processor comprises a multimode data acquisition module, a multimode feature extraction module, a multimode feature fusion module and an emotion recognition module. Wherein the names of the modules do not in some cases constitute a limitation of the module itself. For example, the multimodal feature fusion module may also be described as a "multimodal feature concatenation module".
On the other hand, the present embodiment also provides a nonvolatile computer storage medium, which may be the nonvolatile computer storage medium included in the apparatus in the foregoing embodiment, or may be a nonvolatile computer storage medium that exists separately and is not assembled into a terminal. The non-volatile computer storage medium stores one or more programs that, when executed by a device, cause the device to: the method comprises the steps of obtaining an image containing a face in a video to be identified, N frames of images extracted from the video to be identified at preset time intervals, at least two frames of images obtained by randomly sampling at least one frame of image from each section after dividing the video to be identified into M sections, and sound frequency spectrogram images respectively corresponding to the image containing the face, the N frames of images and the at least two frames of images, wherein N is more than 1, and M is more than 1; calling a feature extraction model obtained by pre-training, extracting face emotion feature vectors from the image containing the face, extracting frame emotion feature vectors from the N frames of images, extracting video emotion feature vectors from the at least two frames of images and extracting sound emotion feature vectors from the sound spectrogram; performing feature fusion on the face emotion feature vector, the frame emotion feature vector, the video emotion feature vector and the voice emotion feature vector to obtain a multi-mode information feature vector; and calling a character emotion recognition model obtained by pre-training, and recognizing the multi-mode information characteristic vector to obtain a character emotion recognition result.
In the description of the present application, it should be noted that the terms "upper", "lower", and the like indicate orientations or positional relationships based on the orientations or positional relationships shown in the drawings, which are only for convenience in describing the present application and simplifying the description, and do not indicate or imply that the referred device or element must have a specific orientation, be constructed in a specific orientation, and operate, and thus, should not be construed as limiting the present application. Unless expressly stated or limited otherwise, the terms "mounted," "connected," and "connected" are intended to be inclusive and mean, for example, that they may be fixedly connected, detachably connected, or integrally connected; can be mechanically or electrically connected; they may be connected directly or indirectly through intervening media, or they may be interconnected between two elements. The specific meaning of the above terms in the present application can be understood by those of ordinary skill in the art as appropriate.
It is further noted that, in the description of the present application, relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
It should be understood that the above-mentioned examples are given for the purpose of illustrating the present application clearly and not for the purpose of limiting the same, and that various other modifications and variations of the present invention may be made by those skilled in the art in light of the above teachings, and it is not intended to be exhaustive or to limit the invention to the precise form disclosed.

Claims (21)

1. A method for recognizing emotion of a person in a video, comprising:
the method comprises the steps of obtaining an image containing a face in a video to be identified, N frames of images extracted from the video to be identified at preset time intervals, at least two frames of images obtained by randomly sampling at least one frame of image from each section after dividing the video to be identified into M sections, and sound frequency spectrogram images respectively corresponding to the image containing the face, the N frames of images and the at least two frames of images, wherein N is more than 1, and M is more than 1;
extracting face emotion feature vectors from the image containing the face, extracting frame emotion feature vectors from the N frame images, extracting video emotion feature vectors from the at least two frame images, and extracting voice emotion feature vectors from the voice spectrogram;
performing feature fusion on the face emotion feature vector, the frame emotion feature vector, the video emotion feature vector and the voice emotion feature vector to obtain a multi-mode information feature vector;
and calling a character emotion recognition model obtained by pre-training, and recognizing the multi-mode information characteristic vector to obtain a character emotion recognition result.
2. The method according to claim 1, wherein the obtaining of the image containing the face and the sound spectrogram corresponding to the image containing the face in the video to be recognized comprises:
respectively detecting each frame of image of the video to be recognized by using a preset face detection model to obtain an image containing a face, and recording time information of the image containing the face;
intercepting audio data of corresponding time in the video to be identified according to the time information of the image containing the face;
and carrying out spectrum analysis on the audio data to obtain a sound spectrogram corresponding to the image containing the human face.
3. The method according to claim 1, wherein the obtaining of the N frames of images extracted from the video to be identified at the predetermined time intervals and the sound spectrogram corresponding to the N frames of images comprises:
extracting N frames of images from a video to be identified at preset time intervals, and recording time information of each frame of image in the N frames of images;
intercepting audio data of corresponding time in the video to be identified according to the time information of each frame of image in the N frames of images;
and carrying out spectrum analysis on the audio data to obtain the sound spectrogram corresponding to the N frames of images.
4. The method of claim 1, wherein the dividing the video to be identified into M segments, at least two frames of images obtained by randomly sampling at least one frame of image from each segment, and the sound spectrogram corresponding to the at least two frames of images comprises:
dividing a video to be identified into M sections, randomly sampling at least one frame of image from each section to obtain at least two frames of images, and recording time information of each frame of image in the at least two frames of images;
intercepting audio data of corresponding time in the video to be identified according to the time information of each frame of image in at least two frames of images;
and carrying out spectrum analysis on the audio data to obtain the sound spectrogram corresponding to the at least two frames of images.
5. The method of claim 4, wherein randomly sampling at least one frame of image from each segment comprises:
and randomly sampling each segment for L times to obtain L at least two frame images, wherein L is more than 1.
6. The method according to any one of claims 1-5, wherein said extracting a face emotion feature vector from the image containing the face comprises:
inputting the image containing the human face into a human face feature extraction model obtained by pre-training for processing, wherein the human face feature extraction model comprises a first convolutional neural network, a first full-link layer, a first classifier and an image emotion feature fusion sub-model which are sequentially connected;
and the image emotion feature fusion sub-model outputs face emotion feature vectors according to the proportion of emotion classification of each face in the image containing the face.
7. The method according to any one of claims 1-5, wherein said extracting frame emotion feature vectors from said N frame images comprises:
inputting the N frames of images into a frame feature extraction model obtained by pre-training for processing, wherein the frame feature extraction model comprises a second convolutional neural network, a second full connection layer and a second classifier which are sequentially connected;
and determining the feature vector output by the second convolutional neural network as a frame emotion feature vector.
8. The method according to any one of claims 1-5, wherein said extracting a vocal emotional feature vector from said vocal spectrogram comprises:
inputting the sound frequency spectrogram into a sound feature extraction model obtained by pre-training for processing, wherein the sound feature extraction model comprises a third convolutional neural network, a third full connection layer and a third classifier which are sequentially connected;
and determining the feature vector output by the third convolutional neural network as a voice emotion feature vector.
9. The method according to any one of claims 1-5, wherein said extracting video emotion feature vectors from said at least two frame images comprises:
inputting the at least two frames of images into a video feature extraction model obtained by pre-training for processing, wherein the video feature extraction model comprises a fourth convolutional neural network, a fourth full connection layer and a fourth classifier which are sequentially connected;
and determining the feature vector output by the fourth convolutional neural network as a video emotion feature vector.
10. The method according to any one of claims 1-5, wherein performing feature fusion on the face emotion feature vector, the frame emotion feature vector, the video emotion feature vector and the voice emotion feature vector to obtain a multi-mode information feature vector comprises:
performing feature fusion on the face emotion feature vector, the frame emotion feature vector, the video emotion feature vector and the voice emotion feature vector;
performing dimensionality reduction on the feature vector after feature fusion;
and carrying out normalization processing on the feature vector obtained after the dimension reduction processing to obtain the multi-mode information feature vector of the four channels.
11. The method according to claim 10, wherein in the case of randomly sampling each segment L times, randomly selecting 1 from the L video emotion feature vectors as the video feature vector for feature fusion.
12. The method of claim 1, wherein the human emotion recognition model is a support vector machine classifier.
13. The method of claim 6, further comprising:
and inputting a training set comprising a training image of the facial expression and an emotion class label corresponding to the training image into the first convolutional neural network so as to train the facial feature extraction model.
14. The method of claim 7, further comprising:
inputting a training set comprising N training images and emotion category labels corresponding to the N training images into the second convolutional neural network to train the frame feature extraction model.
15. The method of claim 8, further comprising:
inputting a training set comprising a sound spectrum training image and emotion class labels corresponding to the sound spectrum training image into the third convolutional neural network to train the sound feature extraction model.
16. The method of claim 9, further comprising:
inputting a training set comprising at least two training images and emotion class labels corresponding to the at least two training images into the fourth convolutional neural network to train the video feature extraction model.
17. An emotion recognition apparatus for a person in a video, comprising:
the multimode data acquisition module is used for acquiring an image containing a face in a video to be identified, N frames of images extracted from the video to be identified at preset time intervals, at least two frames of images obtained by randomly sampling at least one frame of image from each section after dividing the video to be identified into M sections, and sound frequency spectrogram images respectively corresponding to the image containing the face, the N frames of images and the at least two frames of images, wherein N is more than 1, and M is more than 1;
a multi-mode feature extraction module for extracting face emotion feature vectors from the image containing the face, extracting frame emotion feature vectors from the N frames of images, extracting video emotion feature vectors from the at least two frames of images, and extracting voice emotion feature vectors from the voice spectrogram;
the multimode characteristic fusion module is used for carrying out characteristic fusion on the face emotion characteristic vector, the frame emotion characteristic vector, the video emotion characteristic vector and the voice emotion characteristic vector to obtain multimode information characteristic vectors;
and the emotion recognition module is used for calling a character emotion recognition model obtained through pre-training, recognizing the multi-mode information characteristic vector and obtaining a character emotion recognition result.
18. The apparatus of claim 17, wherein the multi-mode data acquisition module comprises:
the first acquisition submodule is used for respectively detecting each frame of image of the video to be recognized by using a preset face detection model to obtain an image containing a face, and recording time information of the image containing the face; intercepting audio data of corresponding time in the video to be identified according to the time information of the image containing the face; carrying out spectrum analysis on the audio data to obtain a sound spectrogram corresponding to the image containing the human face; and/or
The second acquisition submodule is used for extracting N frames of images from the video to be identified at preset time intervals and recording the time information of each frame of image in the N frames of images; intercepting audio data of corresponding time in the video to be identified according to the time information of each frame of image in the N frames of images; carrying out spectrum analysis on the audio data to obtain the sound spectrogram corresponding to the N frames of images; and/or
The third acquisition submodule is used for dividing the video to be identified into M sections, randomly sampling at least one frame of image from each section to obtain at least two frames of images, and recording the time information of each frame of image in the at least two frames of images; intercepting audio data of corresponding time in the video to be identified according to the time information of each frame of image in the at least two frames of images; and carrying out spectrum analysis on the audio data to obtain the sound spectrogram corresponding to the at least two frames of images.
19. The apparatus of claim 17, wherein the multi-mode feature extraction module comprises:
the face feature extraction submodule is used for inputting the image containing the face into a face feature extraction model for processing, wherein the face feature extraction model comprises a first convolutional neural network, a first full connection layer, a first classifier and an image emotion feature fusion submodel which are sequentially connected, and the image emotion feature fusion submodel is used for outputting face emotion feature vectors according to the proportion of emotion classification of each face in the image containing the face;
the frame feature extraction submodule is used for inputting the N frames of images into a frame feature extraction model obtained by pre-training for processing, wherein the frame feature extraction model comprises a second convolutional neural network, a second full-link layer and a second classifier which are sequentially connected, and the second convolutional neural network is used for receiving the N frames of images and outputting frame emotion feature vectors;
the sound feature extraction submodule is used for inputting the sound spectrogram into a sound feature extraction model obtained by pre-training for processing, wherein the sound feature extraction model comprises a third convolutional neural network, a third full connection layer and a third classifier which are sequentially connected, and the third convolutional neural network is used for receiving the sound spectrogram and outputting the sound emotion feature vector;
and the video feature extraction submodule is used for inputting the at least two frames of images into a video feature extraction model obtained by pre-training for processing, wherein the video feature extraction model comprises a fourth convolutional neural network, a fourth full connection layer and a fourth classifier which are sequentially connected, and the fourth convolutional neural network is used for receiving the at least two frames of images and outputting the video emotion feature vector.
20. A computer device comprising a processor and a memory storing a program, wherein the program when executed by the processor implements the method of any one of claims 1-16.
21. A computer-readable medium storing a program, characterized in that the program, when executed, implements the method of any of claims 1-16.
CN202011577706.6A 2020-12-28 2020-12-28 Emotion recognition method and device for characters in video, computer equipment and medium Active CN112699774B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011577706.6A CN112699774B (en) 2020-12-28 2020-12-28 Emotion recognition method and device for characters in video, computer equipment and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011577706.6A CN112699774B (en) 2020-12-28 2020-12-28 Emotion recognition method and device for characters in video, computer equipment and medium

Publications (2)

Publication Number Publication Date
CN112699774A true CN112699774A (en) 2021-04-23
CN112699774B CN112699774B (en) 2024-05-24

Family

ID=75512286

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011577706.6A Active CN112699774B (en) 2020-12-28 2020-12-28 Emotion recognition method and device for characters in video, computer equipment and medium

Country Status (1)

Country Link
CN (1) CN112699774B (en)

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113392722A (en) * 2021-05-24 2021-09-14 北京爱奇艺科技有限公司 Method and device for recognizing emotion of object in video, electronic equipment and storage medium
CN113435518A (en) * 2021-06-29 2021-09-24 青岛海尔科技有限公司 Feature fusion interaction method and device based on multiple modes
CN113469044A (en) * 2021-06-30 2021-10-01 上海歆广数据科技有限公司 Dining recording system and method
CN113536999A (en) * 2021-07-01 2021-10-22 汇纳科技股份有限公司 Character emotion recognition method, system, medium and electronic device
CN113569740A (en) * 2021-07-27 2021-10-29 合肥综合性国家科学中心人工智能研究院(安徽省人工智能实验室) Video recognition model training method and device and video recognition method and device
CN113673325A (en) * 2021-07-14 2021-11-19 南京邮电大学 Multi-feature character emotion recognition method
CN114298121A (en) * 2021-10-09 2022-04-08 腾讯科技(深圳)有限公司 Multi-mode-based text generation method, model training method and device
CN114581570A (en) * 2022-03-01 2022-06-03 浙江同花顺智能科技有限公司 Three-dimensional face action generation method and system
CN114699777A (en) * 2022-04-13 2022-07-05 南京晓庄学院 Control method and system of toy dancing robot
WO2024111775A1 (en) * 2022-11-21 2024-05-30 Samsung Electronics Co., Ltd. Method and electronic device for identifying emotion in video content

Citations (32)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2003085572A (en) * 2001-09-11 2003-03-20 Nippon Hoso Kyokai <Nhk> Comic generation device and comic generation program
US20140063236A1 (en) * 2012-08-29 2014-03-06 Xerox Corporation Method and system for automatically recognizing facial expressions via algorithmic periocular localization
US20140079297A1 (en) * 2012-09-17 2014-03-20 Saied Tadayon Application of Z-Webs and Z-factors to Analytics, Search Engine, Learning, Recognition, Natural Language, and Other Utilities
US20140201126A1 (en) * 2012-09-15 2014-07-17 Lotfi A. Zadeh Methods and Systems for Applications for Z-numbers
WO2014193161A1 (en) * 2013-05-28 2014-12-04 삼성전자 주식회사 User interface method and device for searching for multimedia content
WO2017185630A1 (en) * 2016-04-27 2017-11-02 乐视控股(北京)有限公司 Emotion recognition-based information recommendation method and apparatus, and electronic device
US20170330029A1 (en) * 2010-06-07 2017-11-16 Affectiva, Inc. Computer based convolutional processing for image analysis
US9858340B1 (en) * 2016-04-11 2018-01-02 Digital Reasoning Systems, Inc. Systems and methods for queryable graph representations of videos
US20180204111A1 (en) * 2013-02-28 2018-07-19 Z Advanced Computing, Inc. System and Method for Extremely Efficient Image and Pattern Recognition and Artificial Intelligence Platform
CN108537160A (en) * 2018-03-30 2018-09-14 平安科技(深圳)有限公司 Risk Identification Method, device, equipment based on micro- expression and medium
US20190012599A1 (en) * 2010-06-07 2019-01-10 Affectiva, Inc. Multimodal machine learning for emotion metrics
CN109190487A (en) * 2018-08-07 2019-01-11 平安科技(深圳)有限公司 Face Emotion identification method, apparatus, computer equipment and storage medium
CN109344781A (en) * 2018-10-11 2019-02-15 上海极链网络科技有限公司 Expression recognition method in a kind of video based on audio visual union feature
CN109508638A (en) * 2018-10-11 2019-03-22 平安科技(深圳)有限公司 Face Emotion identification method, apparatus, computer equipment and storage medium
CN109784277A (en) * 2019-01-17 2019-05-21 南京大学 A kind of Emotion identification method based on intelligent glasses
CN109919001A (en) * 2019-01-23 2019-06-21 深圳壹账通智能科技有限公司 Customer service monitoring method, device, equipment and storage medium based on Emotion identification
CN109993093A (en) * 2019-03-25 2019-07-09 山东大学 Road anger monitoring method, system, equipment and medium based on face and respiratory characteristic
WO2019157344A1 (en) * 2018-02-12 2019-08-15 Avodah Labs, Inc. Real-time gesture recognition method and apparatus
CN110175526A (en) * 2019-04-28 2019-08-27 平安科技(深圳)有限公司 Dog Emotion identification model training method, device, computer equipment and storage medium
CN110298212A (en) * 2018-03-21 2019-10-01 腾讯科技(深圳)有限公司 Model training method, Emotion identification method, expression display methods and relevant device
CN110414323A (en) * 2019-06-14 2019-11-05 平安科技(深圳)有限公司 Mood detection method, device, electronic equipment and storage medium
WO2020024400A1 (en) * 2018-08-02 2020-02-06 平安科技(深圳)有限公司 Class monitoring method and apparatus, computer device, and storage medium
CN110781916A (en) * 2019-09-18 2020-02-11 平安科技(深圳)有限公司 Video data fraud detection method and device, computer equipment and storage medium
WO2020034902A1 (en) * 2018-08-11 2020-02-20 昆山美卓智能科技有限公司 Smart desk having status monitoring function, monitoring system server, and monitoring method
WO2020054945A1 (en) * 2018-09-14 2020-03-19 엘지전자 주식회사 Robot and method for operating same
KR20200054613A (en) * 2018-11-12 2020-05-20 주식회사 코난테크놀로지 Video metadata tagging system and method thereof
US20200184278A1 (en) * 2014-03-18 2020-06-11 Z Advanced Computing, Inc. System and Method for Extremely Efficient Image and Pattern Recognition and Artificial Intelligence Platform
CN111339847A (en) * 2020-02-14 2020-06-26 福建帝视信息科技有限公司 Face emotion recognition method based on graph convolution neural network
CN111339913A (en) * 2020-02-24 2020-06-26 湖南快乐阳光互动娱乐传媒有限公司 Method and device for recognizing emotion of character in video
WO2020143156A1 (en) * 2019-01-11 2020-07-16 平安科技(深圳)有限公司 Hotspot video annotation processing method and apparatus, computer device and storage medium
CN111626126A (en) * 2020-04-26 2020-09-04 腾讯科技(北京)有限公司 Face emotion recognition method, device, medium and electronic equipment
CN111914594A (en) * 2019-05-08 2020-11-10 四川大学 Group emotion recognition method based on motion characteristics

Patent Citations (34)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2003085572A (en) * 2001-09-11 2003-03-20 Nippon Hoso Kyokai <Nhk> Comic generation device and comic generation program
US20170330029A1 (en) * 2010-06-07 2017-11-16 Affectiva, Inc. Computer based convolutional processing for image analysis
US20190012599A1 (en) * 2010-06-07 2019-01-10 Affectiva, Inc. Multimodal machine learning for emotion metrics
US20140063236A1 (en) * 2012-08-29 2014-03-06 Xerox Corporation Method and system for automatically recognizing facial expressions via algorithmic periocular localization
US20140201126A1 (en) * 2012-09-15 2014-07-17 Lotfi A. Zadeh Methods and Systems for Applications for Z-numbers
US20140079297A1 (en) * 2012-09-17 2014-03-20 Saied Tadayon Application of Z-Webs and Z-factors to Analytics, Search Engine, Learning, Recognition, Natural Language, and Other Utilities
US20180204111A1 (en) * 2013-02-28 2018-07-19 Z Advanced Computing, Inc. System and Method for Extremely Efficient Image and Pattern Recognition and Artificial Intelligence Platform
WO2014193161A1 (en) * 2013-05-28 2014-12-04 삼성전자 주식회사 User interface method and device for searching for multimedia content
US20200184278A1 (en) * 2014-03-18 2020-06-11 Z Advanced Computing, Inc. System and Method for Extremely Efficient Image and Pattern Recognition and Artificial Intelligence Platform
US9858340B1 (en) * 2016-04-11 2018-01-02 Digital Reasoning Systems, Inc. Systems and methods for queryable graph representations of videos
WO2017185630A1 (en) * 2016-04-27 2017-11-02 乐视控股(北京)有限公司 Emotion recognition-based information recommendation method and apparatus, and electronic device
WO2019157344A1 (en) * 2018-02-12 2019-08-15 Avodah Labs, Inc. Real-time gesture recognition method and apparatus
CN110298212A (en) * 2018-03-21 2019-10-01 腾讯科技(深圳)有限公司 Model training method, Emotion identification method, expression display methods and relevant device
CN108537160A (en) * 2018-03-30 2018-09-14 平安科技(深圳)有限公司 Risk Identification Method, device, equipment based on micro- expression and medium
WO2019184125A1 (en) * 2018-03-30 2019-10-03 平安科技(深圳)有限公司 Micro-expression-based risk identification method and device, equipment and medium
WO2020024400A1 (en) * 2018-08-02 2020-02-06 平安科技(深圳)有限公司 Class monitoring method and apparatus, computer device, and storage medium
WO2020029406A1 (en) * 2018-08-07 2020-02-13 平安科技(深圳)有限公司 Human face emotion identification method and device, computer device and storage medium
CN109190487A (en) * 2018-08-07 2019-01-11 平安科技(深圳)有限公司 Face Emotion identification method, apparatus, computer equipment and storage medium
WO2020034902A1 (en) * 2018-08-11 2020-02-20 昆山美卓智能科技有限公司 Smart desk having status monitoring function, monitoring system server, and monitoring method
WO2020054945A1 (en) * 2018-09-14 2020-03-19 엘지전자 주식회사 Robot and method for operating same
CN109508638A (en) * 2018-10-11 2019-03-22 平安科技(深圳)有限公司 Face Emotion identification method, apparatus, computer equipment and storage medium
CN109344781A (en) * 2018-10-11 2019-02-15 上海极链网络科技有限公司 Expression recognition method in a kind of video based on audio visual union feature
KR20200054613A (en) * 2018-11-12 2020-05-20 주식회사 코난테크놀로지 Video metadata tagging system and method thereof
WO2020143156A1 (en) * 2019-01-11 2020-07-16 平安科技(深圳)有限公司 Hotspot video annotation processing method and apparatus, computer device and storage medium
CN109784277A (en) * 2019-01-17 2019-05-21 南京大学 A kind of Emotion identification method based on intelligent glasses
CN109919001A (en) * 2019-01-23 2019-06-21 深圳壹账通智能科技有限公司 Customer service monitoring method, device, equipment and storage medium based on Emotion identification
CN109993093A (en) * 2019-03-25 2019-07-09 山东大学 Road anger monitoring method, system, equipment and medium based on face and respiratory characteristic
CN110175526A (en) * 2019-04-28 2019-08-27 平安科技(深圳)有限公司 Dog Emotion identification model training method, device, computer equipment and storage medium
CN111914594A (en) * 2019-05-08 2020-11-10 四川大学 Group emotion recognition method based on motion characteristics
CN110414323A (en) * 2019-06-14 2019-11-05 平安科技(深圳)有限公司 Mood detection method, device, electronic equipment and storage medium
CN110781916A (en) * 2019-09-18 2020-02-11 平安科技(深圳)有限公司 Video data fraud detection method and device, computer equipment and storage medium
CN111339847A (en) * 2020-02-14 2020-06-26 福建帝视信息科技有限公司 Face emotion recognition method based on graph convolution neural network
CN111339913A (en) * 2020-02-24 2020-06-26 湖南快乐阳光互动娱乐传媒有限公司 Method and device for recognizing emotion of character in video
CN111626126A (en) * 2020-04-26 2020-09-04 腾讯科技(北京)有限公司 Face emotion recognition method, device, medium and electronic equipment

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
RAO, K. SREENIVASA et al.: "Recognition of emotions from video using acoustic and facial features", Signal, Image and Video Processing, 31 December 2015 (2015-12-31), pages 1029-1045 *
WU, Qianzhen (吴乾震): "Research on Video Expression Recognition Based on Fusion Algorithms", China Master's Theses Full-text Database, Information Science and Technology Series, no. 5, 15 May 2020 (2020-05-15), pages 138-139 *

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113392722A (en) * 2021-05-24 2021-09-14 北京爱奇艺科技有限公司 Method and device for recognizing emotion of object in video, electronic equipment and storage medium
CN113435518A (en) * 2021-06-29 2021-09-24 青岛海尔科技有限公司 Feature fusion interaction method and device based on multiple modes
CN113435518B (en) * 2021-06-29 2024-03-22 青岛海尔科技有限公司 Multi-mode-based interaction method and device for feature fusion
CN113469044A (en) * 2021-06-30 2021-10-01 上海歆广数据科技有限公司 Dining recording system and method
CN113469044B (en) * 2021-06-30 2022-07-01 上海歆广数据科技有限公司 Dining recording system and method
CN113536999A (en) * 2021-07-01 2021-10-22 汇纳科技股份有限公司 Character emotion recognition method, system, medium and electronic device
CN113673325B (en) * 2021-07-14 2023-08-15 南京邮电大学 Multi-feature character emotion recognition method
CN113673325A (en) * 2021-07-14 2021-11-19 南京邮电大学 Multi-feature character emotion recognition method
CN113569740A (en) * 2021-07-27 2021-10-29 合肥综合性国家科学中心人工智能研究院(安徽省人工智能实验室) Video recognition model training method and device and video recognition method and device
CN113569740B (en) * 2021-07-27 2023-11-21 合肥综合性国家科学中心人工智能研究院(安徽省人工智能实验室) Video recognition model training method and device, and video recognition method and device
CN114298121A (en) * 2021-10-09 2022-04-08 腾讯科技(深圳)有限公司 Multi-mode-based text generation method, model training method and device
CN114581570B (en) * 2022-03-01 2024-01-26 浙江同花顺智能科技有限公司 Three-dimensional face action generation method and system
CN114581570A (en) * 2022-03-01 2022-06-03 浙江同花顺智能科技有限公司 Three-dimensional face action generation method and system
CN114699777A (en) * 2022-04-13 2022-07-05 南京晓庄学院 Control method and system of toy dancing robot
WO2024111775A1 (en) * 2022-11-21 2024-05-30 Samsung Electronics Co., Ltd. Method and electronic device for identifying emotion in video content

Also Published As

Publication number Publication date
CN112699774B (en) 2024-05-24

Similar Documents

Publication Publication Date Title
CN112699774B (en) Emotion recognition method and device for characters in video, computer equipment and medium
CN112348075B (en) Multi-mode emotion recognition method based on contextual attention neural network
CN110188343B (en) Multi-mode emotion recognition method based on fusion attention network
KR102071582B1 (en) Method and apparatus for classifying a class to which a sentence belongs by using deep neural network
CN110569795B (en) Image identification method and device and related equipment
CN109508375A (en) A kind of social affective classification method based on multi-modal fusion
CN111292765B (en) Bimodal emotion recognition method integrating multiple deep learning models
Kaluri et al. An enhanced framework for sign gesture recognition using hidden Markov model and adaptive histogram technique.
CN116564338B (en) Voice animation generation method, device, electronic equipment and medium
CN114724224A (en) Multi-mode emotion recognition method for medical care robot
Rwelli et al. Gesture based Arabic sign language recognition for impaired people based on convolution neural network
Avula et al. CNN based recognition of emotion and speech from gestures and facial expressions
Madan et al. Intelligent and personalized factoid question and answer system
CN116543798A (en) Emotion recognition method and device based on multiple classifiers, electronic equipment and medium
Prasath Design of an integrated learning approach to assist real-time deaf application using voice recognition system
Mandal et al. AI-Based mock interview evaluator: An emotion and confidence classifier model
Nunes Deep emotion recognition through upper body movements and facial expression
CN113642446A (en) Detection method and device based on face dynamic emotion recognition
CN113191135A (en) Multi-category emotion extraction method fusing facial characters
Machanje et al. A 2d-approach towards the detection of distress using fuzzy k-nearest neighbor
Zim OpenCV and Python for Emotion Analysis of Face Expressions
Smitha et al. Ensemble Convolution Neural Network for Robust Video Emotion Recognition Using Deep Semantics
Elbarougy et al. Continuous audiovisual emotion recognition using feature selection and lstm
Bhargava et al. Action Recognition on American Sign Language using Deep Learning
Saravanan et al. EduVigil: Shaping the Future of Education with AI-An Intriguing Case Study

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant