CN113392781A - Video emotion semantic analysis method based on graph neural network - Google Patents

Video emotion semantic analysis method based on graph neural network

Info

Publication number
CN113392781A
Authority
CN
China
Prior art keywords
emotion, character, video, graph, relationship
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110676126.0A
Other languages
Chinese (zh)
Inventor
孙善宝 (Sun Shanbao)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong Inspur Scientific Research Institute Co Ltd
Original Assignee
Shandong Inspur Scientific Research Institute Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong Inspur Scientific Research Institute Co Ltd filed Critical Shandong Inspur Scientific Research Institute Co Ltd
Priority to CN202110676126.0A priority Critical patent/CN113392781A/en
Priority to PCT/CN2021/112475 priority patent/WO2022262098A1/en
Publication of CN113392781A publication Critical patent/CN113392781A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks

Abstract

The invention provides a video emotion semantic analysis method based on a graph neural network. The method takes full account of the fact that the emotion of an individual character in a video is influenced by the related characters and objects that the character attends to. Deep learning is used to construct graph-structured association relations between the characters and objects in the video, emotion features are extracted with a classical 3D convolutional neural network, and a graph convolutional neural network combines the current character relationship graph structure with the character-object association graph structure, so that the real emotional state of a target character can be judged more accurately. By first rapidly modeling the relationships between characters and objects and then performing deep emotion analysis on the video, the method achieves better results in personalized scenarios such as investigation and interrogation, interviews, and face-to-face signing.

Description

Video emotion semantic analysis method based on graph neural network
Technical Field
The invention relates to a video emotion semantic analysis method based on a graph neural network, and belongs to the technical field of graph neural networks, emotion analysis and machine vision.
Background
With the rapid development of deep learning and the support of massive data and efficient computing power in the era of the internet and cloud computing, deep learning techniques represented by the CNN convolutional neural network have been used to train and construct large-scale neural networks resembling the structure of the human brain. Breakthrough progress has been made in computer vision, speech recognition, natural language understanding and other fields, which is expected to bring disruptive change to society as a whole, and deep learning has become an important development strategy for many countries.
Conventional convolutional neural networks have brought improvements in the text and image domains, but they can only process Euclidean-space data. The graph neural network (GNN) is a class of methods for performing deep learning on graph data and encompasses the various models that apply neural networks to graphs. A graph consists of a number of nodes and of edges connecting pairs of nodes, and is used to describe the relationships between different nodes. Graph data is a kind of non-Euclidean data and has gradually attracted attention because of its ubiquity. The graph convolutional network (GCN) is a type of neural network that applies convolution on graphs; as an important branch of graph neural networks, it has shown advantages in the field of computer vision.
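For orientation, a single graph convolution layer typically propagates node features with the rule H' = σ(D̂^(-1/2) Â D̂^(-1/2) H W), where Â is the adjacency matrix with added self-loops and D̂ is its degree matrix. The following is only a minimal illustrative sketch of such a layer in PyTorch; it is not the specific network of the invention, and the names and dimensions in it are assumptions for illustration.

```python
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    """Minimal graph convolution layer: H' = relu(D^-1/2 (A + I) D^-1/2 H W)."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, adj, feats):
        # adj: (N, N) adjacency matrix; feats: (N, in_dim) node features
        a_hat = adj + torch.eye(adj.size(0))             # add self-loops
        d_inv_sqrt = a_hat.sum(dim=1).pow(-0.5)          # D^-1/2 as a vector
        norm_adj = d_inv_sqrt.unsqueeze(1) * a_hat * d_inv_sqrt.unsqueeze(0)
        return torch.relu(self.linear(norm_adj @ feats))

# Example: 4 nodes (e.g. characters) with 16-dimensional features
adj = torch.tensor([[0., 1, 0, 0],
                    [1, 0, 1, 1],
                    [0, 1, 0, 0],
                    [0, 1, 0, 0]])
feats = torch.randn(4, 16)
print(GCNLayer(16, 8)(adj, feats).shape)   # torch.Size([4, 8])
```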
Video emotion analysis is an important research direction in video understanding. Discovering the emotional state of the characters in a scene through video analysis has important application value in human-computer interaction, interviews, medical diagnosis, robotics, investigation and interrogation, and other fields. Ekman and Friesen constructed a discrete classification model that defines six basic emotions: anger, disgust, fear, happiness, sadness and surprise; contempt was later added to the basic emotions. As service scenarios keep changing, practical applications need to analyze the deeper emotions of individuals in a video, look beneath the surface emotions of the characters, and discover their real emotional states in order to meet the personalized requirements of new scenarios. Under these circumstances, modeling the relationships between the characters and objects in a video, making effective use of a graph neural network, analyzing the emotional features of video characters in combination with a CNN convolutional neural network, and judging an individual's emotional state more accurately has become a problem to be solved urgently.
Disclosure of Invention
The invention aims to provide a video emotion semantic analysis method based on a graph neural network that can judge the real emotional state of a target character more accurately, with high processing efficiency and good timeliness.
In order to achieve the purpose, the invention is realized by the following technical scheme:
a) constructing graph structure association relations among the characters and the objects in the video,
b) people and objects in the video are identified through target detection,
c) in the video emotion analysis, emotion data in a video is extracted through a 3D-CNN three-dimensional convolution neural network based on the identified person,
d) judging the real emotional state of the target character through graph convolution operations, combined with the current character relationship graph structure and the character-object association relationship graph structure.
Preferably, the specific steps of step a are as follows:
step 101, designing the emotion types of characters, the relationship between the characters and objects according to the requirements of scenes in the target research field;
102, collecting a large amount of video data of scenes in the field to perform data annotation, training a target detection module ObjDet aiming at the types of interested persons and objects in an annotation data set based on a universal target detection model, and obtaining a target detection model;
step 103, carrying out character emotion category annotation on the video data, designing the network structures of the character emotion feature extractor PFExtract and the emotion Classifier, combining the two networks, and training them with the annotated data to obtain the PFExtract model and the Classifier model;
the character emotional feature extractor PFextract adopts 3D-CNN as a core network and is used for extracting the emotional features of characters in a video to form a feature vector;
the core of the emotion Classifier is a linear Classifier, and the emotion state classification is judged by using the characteristics formed by the character emotion characteristic extractor PFextract.
Preferably, the specific steps of step b are as follows:
step 201, according to the set character relationship, based on the characters obtained by video target detection and recognition, forming a character relationship diagram structure data set, training the character relationship generator PRGen, and obtaining a character relationship diagram structure generation model;
the character relationship graph structure is a graph structure (V, E) of the related characters appearing in the video, wherein V represents the characters and E represents the relationships between characters, and the character features are expressed as d-dimensional vectors;
the character relation generator PRGen is responsible for forming the identified characters into a character basic relation graph structure for describing the relation between main target characters;
step 202, according to the set relationship between the person and the object, based on the person and the object obtained by video target detection and recognition, forming a data set of a structure of a relationship diagram between the person and the object, and training the relation generator PAORGen between the person and the object to obtain a structure generation model of the relationship diagram between the person and the object;
the person-object relationship generator PAORGen is responsible for forming the identified persons and objects into a person-object basic relationship graph structure for describing the relationship between the main target person and the object of interest.
Preferably, the specific steps of step c are as follows:
301, training by combining the graph convolution emotion generator GCNGen and the emotion discriminator MDTR based on a character emotion feature vector, a character relation diagram structure and a character and object relation diagram structure extracted by a character emotion feature extractor PFExtract to obtain a graph convolution emotion generator model and an emotion discriminator model;
step 302, based on the emotional feature vectors generated by the graph convolution emotion generator GCNGen and the existing character relationship graph structure, the character relationship adjuster PRTuning adjusts the character relationship graph structure so that it conforms to the emotional features generated by the graph convolution emotion generator;
the character relationship adjuster PRTuning adjusts and updates the existing character relationship graph structure according to the character relationships identified in the current video segment, combined with the graph convolution emotion feature vector output for the preceding segment;
step 303, based on the emotional feature vectors generated by the graph convolution emotion generator GCNGen and the existing character-object relationship graph structure, the character-object relationship adjuster PAORTuning adjusts the character-object relationship graph structure so that it conforms to the emotional features generated by the graph convolution emotion generator;
the character-object relationship adjuster PAORTuning adjusts and updates the existing character-object relationship graph structure according to the relationships between characters and objects identified in the current video segment, combined with the graph convolution emotion feature vector output for the preceding segment;
and step 304, combining the models trained in the preceding steps for video emotion semantic analysis and judgment.
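The following sketch illustrates, under assumed dimensions and module names, how a graph convolution emotion generator can fuse the character emotion feature vectors with the two relationship graph structures and how a discriminator can map the result to an emotional state; it is a simplified stand-in for GCNGen and MDTR, not their actual trained architecture.

```python
import torch
import torch.nn as nn

def gcn_step(adj, feats, linear):
    """One symmetric-normalized graph convolution step."""
    a_hat = adj + torch.eye(adj.size(0))
    d = a_hat.sum(1).pow(-0.5)
    return torch.relu(linear((d.unsqueeze(1) * a_hat * d.unsqueeze(0)) @ feats))

class GCNGen(nn.Module):
    """Fuses character emotion features with the character graph and character-object graph."""
    def __init__(self, d=128):
        super().__init__()
        self.person_gc = nn.Linear(d, d)     # graph conv over the character relationship graph
        self.pao_gc = nn.Linear(d, d)        # graph conv over the character-object graph
        self.fuse = nn.Linear(3 * d, d)      # fuse both graph outputs with the raw features

    def forward(self, fev, person_adj, pao_feats, pao_adj, person_idx):
        h_person = gcn_step(person_adj, fev, self.person_gc)
        h_pao = gcn_step(pao_adj, pao_feats, self.pao_gc)[person_idx]   # keep character nodes
        return self.fuse(torch.cat([fev, h_person, h_pao], dim=1))      # emV per character

class MDTR(nn.Module):
    """Emotion discriminator: graph-convolved emotion vector -> emotional state."""
    def __init__(self, d=128, num_emotions=8):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(d, 64), nn.ReLU(), nn.Linear(64, num_emotions))

    def forward(self, emv):
        return self.net(emv)

# Two characters and one object; character nodes occupy indices 0 and 1 of the joint graph
fev = torch.randn(2, 128)
person_adj = torch.tensor([[0., 1], [1, 0]])
pao_feats = torch.cat([fev, torch.randn(1, 128)])
pao_adj = torch.tensor([[0., 1, 1], [1, 0, 0], [1, 0, 0]])
emv = GCNGen()(fev, person_adj, pao_feats, pao_adj, person_idx=[0, 1])
print(MDTR()(emv).shape)  # torch.Size([2, 8])
```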
Preferably, the specific steps of step d are as follows:
step 401, segmenting the video, extracting the people and the objects in the video by using the target detection ObjDet module, forming a people set and an object set based on the recognition result, and forming a people basic relationship graph structure and a people and object basic relationship graph structure through the people relationship generator PRGen and the people and object relationship generator PAORGen;
step 402, pruning the two relationship graph structures formed in step 401, fine-tuning them according to prior knowledge, and selecting the characters and objects of interest as the initial relationship graph structures for video emotion semantic analysis;
step 403, using the target detection ObjDet module to perform target detection on the video again, acquiring the characters and objects of the video at a set time interval, and obtaining the emotional feature vectors feV of the characters appearing in the video segment through the character emotional feature extractor PFExtract;
step 404, inputting a character emotion feature vector feV set, a character relation graph structure and a character and object relation graph structure extracted by a character emotion feature extractor PFextract into the graph convolution emotion generator GCNGen to obtain an emotion feature vector emV subjected to graph convolution;
step 405, inputting the emotion feature vector subjected to graph convolution into the emotion discriminator MDTR, and outputting the emotion state of the character in the current video segment;
step 406, acquiring the next video segment and identifying the characters and objects in it; based on the graph convolution emotion feature vector emV output by the graph convolution emotion generator GCNGen for the previous video segment, inputting the characters and objects of the segment, the emotion feature vector emV and the character relationship graph structure into the character relationship adjuster PRTuning to update the character relationship graph structure, and inputting the characters and objects of the segment, the emotion feature vector emV and the character-object relationship graph structure into the character-object relationship adjuster PAORTuning to update the character-object relationship graph structure;
step 407, obtaining the emotional feature vectors feV of the characters appearing in the video segment through the character emotional feature extractor PFExtract, and returning to step 404;
step 408, repeating steps 401 to 407, and continuously outputting the emotional states of the video characters;
and 409, continuously collecting data in the process of judging the video emotion, and simultaneously feeding back the correctness of an output result for continuous optimization of the model.
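For clarity, the segment-by-segment judgment loop of steps 401 to 409 can be summarized as in the sketch below. All module functions in it are hypothetical stand-ins for the trained ObjDet, PFExtract, GCNGen, MDTR, PRTuning and PAORTuning models; only the control flow is intended to be illustrative.

```python
import torch

# Hypothetical stand-ins for the trained modules named in the method; the real models are
# obtained by the training steps described above.
def obj_det(segment):                       return ["p0", "p1"], ["document"]
def pf_extract(segment, people):            return torch.randn(len(people), 128)
def gcn_gen(fev, graphs):                   return fev
def mdtr(emv):                              return emv.argmax(dim=1)
def pr_tuning(graphs, emv, people):         return graphs
def paor_tuning(graphs, emv, people, objs): return graphs

def analyse(video_segments):
    """Segment-by-segment emotion judgment loop (simplified sketch of steps 401-409)."""
    graphs = {"person": None, "person_object": None}    # built by PRGen / PAORGen in practice
    emv = None
    for segment in video_segments:
        people, objects = obj_det(segment)              # detect characters and objects
        if emv is not None:                             # adjust both graphs with previous emV
            graphs = pr_tuning(graphs, emv, people)
            graphs = paor_tuning(graphs, emv, people, objects)
        fev = pf_extract(segment, people)               # per-character emotion features feV
        emv = gcn_gen(fev, graphs)                      # graph-convolved emotion vectors emV
        yield mdtr(emv)                                 # emotional state of each character

for states in analyse(["segment-1", "segment-2", "segment-3"]):
    print(states)
```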
The invention has the advantages that: the method takes full account of the fact that the emotion of an individual character in a video is influenced by the related characters and objects that the character attends to. Deep learning is used to construct graph-structured association relations between the characters and objects in the video, emotion features are extracted with a classical 3D convolutional neural network, and a graph convolutional neural network combines the current character relationship graph structure with the character-object association graph structure, so that the real emotional state of a target character can be judged more accurately. The temporal nature of video is fully considered: the 3D convolutional neural network makes video emotion analysis more accurate, and segmenting the video into sequences reduces complexity and improves processing efficiency. Compared with the traditional approach of judging the emotional state directly from video or image frames, the graph convolutional neural network introduces external knowledge and internal association factors, expresses the deeper emotional state in the video more comprehensively, and adapts better to real service scenarios. The timeliness of emotional state changes is also fully considered: as the video frames advance, the emotional states of the characters are output continuously, and the character relationship graph structure and the character-object relationship graph structure are continuously updated. Characters and objects are identified with a target detection algorithm, so target characters and objects can be located quickly and useless frames filtered out, which reduces the amount of 3D convolution computation needed for extracting emotional features and speeds up video processing. In addition, by first rapidly modeling the relationships between characters and objects and then performing deep emotion analysis on the video, better results can be achieved in personalized scenarios such as investigation and interrogation, interviews, and face-to-face signing.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the principles of the invention and not to limit the invention.
FIG. 1 is a schematic structural diagram of a video emotion semantic analysis model according to the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In one embodiment, as shown in fig. 1, graph structure association relations between the characters and objects in a video are constructed; the characters and objects in the video are identified through target detection; in video emotion analysis, emotion data in the video is extracted through a 3D-CNN three-dimensional convolutional neural network based on the identified characters; and the real emotional state of the target character is judged through graph convolution operations, combined with the current character relationship graph structure and the character-object association relationship graph structure. Wherein:
the character relationship graph structure is a graph structure (V, E) of the related characters appearing in the video, wherein V represents the characters and E represents the relationships between characters, and the character features are expressed as d-dimensional vectors;
the character-object relationship graph structure describes the relationships between characters and objects, and is likewise described with a graph structure;
the core of the target detection module ObjDet is a neural network; for video detection, target detection algorithms such as SSD or YOLO can be used to identify the characters and objects of interest in the video;
the character relationship generator PRGen is responsible for forming the identified characters into a basic character relationship graph structure describing the relationships between the main target characters;
the character-object relationship generator PAORGen is responsible for forming the identified characters and objects into a basic character-object relationship graph structure describing the relationships between the main target characters and the objects of interest;
the character emotional feature extractor PFExtract adopts a 3D-CNN as its core network and is used to extract the emotional features of the characters in the video to form a feature vector;
the core of the emotion Classifier is a linear classifier, which judges the emotional state category from the features produced by the character emotional feature extractor PFExtract;
the character relationship adjuster PRTuning adjusts and updates the existing character relationship graph structure according to the character relationships identified in the current video segment, combined with the graph convolution emotion feature vector output for the preceding segment;
the character-object relationship adjuster PAORTuning adjusts and updates the existing character-object relationship graph structure according to the relationships between characters and objects identified in the current video segment, combined with the graph convolution emotion feature vector output for the preceding segment;
the graph convolution emotion generator GCNGen comprises a graph convolution operation module for the character relationship graph structure, a graph convolution operation module for the character-object relationship graph structure, and a fusion module that combines the graph convolution results with the character emotion feature vectors generated by the character emotional feature extractor PFExtract, and it generates the graph convolution emotion feature vectors of all target characters;
the core of the emotion discriminator MDTR is a neural network, which judges the real emotional state of a character from the character emotion feature vector generated by the graph convolution emotion generator GCNGen.
The method provided by the invention will be described in detail with reference to specific examples.
Firstly, analyzing and judging video emotion semantics
The video emotion semantic analysis and judgment method comprises the following steps:
step 101, designing, according to the requirements of the scenes in the target research field, the emotion categories of the characters, such as calmness, joy, surprise, sadness, anger, disgust, fear and contempt, as well as the relationships between characters and between characters and objects;
102, collecting a large amount of video data of scenes in the field to perform data annotation, training a target detection module ObjDet aiming at the types of interested persons and objects in an annotation data set based on a universal target detection model, and obtaining a target detection model;
step 103, carrying out character emotion category annotation on the video data, designing the network structures of the character emotion feature extractor PFExtract and the emotion Classifier, combining the two networks, and training them with the annotated data to obtain the PFExtract model and the Classifier model;
104, according to the set character relationship, detecting and identifying the obtained characters based on the video target to form a character relationship diagram structure data set, and training the character relationship generator PRGen to obtain a character relationship diagram structure generation model;
105, according to the set relationship between the person and the object, detecting and identifying the obtained person and the object based on the video target, forming a data set of a structure of a relationship graph between the person and the object, and training the relation generator PAORGen between the person and the object to obtain a structure generation model of the relationship graph between the person and the object;
106, training by combining the graph convolution emotion generator GCNGen and the emotion discriminator MDTR based on a character emotion feature vector, a character relation graph structure and a character and object relation graph structure extracted by a character emotion feature extractor PFextract to obtain a graph convolution emotion generator model and an emotion discriminator model;
step 107, based on the emotional feature vectors generated by the graph convolution emotion generator GCNGen and the existing character relationship graph structure, the character relationship adjuster PRTuning adjusts the character relationship graph structure so that it conforms to the emotional features generated by the graph convolution emotion generator;
step 108, based on the emotional feature vectors generated by the graph convolution emotion generator GCNGen and the existing character-object relationship graph structure, the character-object relationship adjuster PAORTuning adjusts the character-object relationship graph structure so that it conforms to the emotional features generated by the graph convolution emotion generator;
step 109, combining the models formed by training in the steps 101 to 108 for video emotion semantic analysis and judgment;
step 110, segmenting the video, extracting the people and the objects in the video by using the target detection ObjDet module, forming a people set and an object set based on the identification result, and forming a people basic relationship graph structure and a people and object basic relationship graph structure through the people relationship generator PRGen and the people and object relationship generator PAORGen;
step 111, pruning the two relationship graph structures formed in step 110, fine-tuning them according to prior knowledge, and selecting the characters and objects of interest as the initial relationship graph structures for video emotion semantic analysis;
step 112, using the target detection ObjDet module to perform target detection on the video again, acquiring characters and objects of the video according to a set time interval, and acquiring emotional feature vectors feV of characters appearing in the video segment through the character emotional feature extractor PFextract;
113, inputting a character emotion feature vector feV set, a character relation graph structure and a character and object relation graph structure extracted by a character emotion feature extractor PFextract into the graph convolution emotion generator GCNGen to obtain an emotion feature vector emV subjected to graph convolution;
step 114, inputting the emotion feature vector subjected to graph convolution into the emotion discriminator MDTR, and outputting the emotion state of the person in the current video segment;
step 115, acquiring the next video segment and identifying the characters and objects of the current segment; based on the graph convolution emotion feature vector emV output by the graph convolution emotion generator GCNGen for the previous segment, inputting the characters and objects of the current segment, the emotion feature vector emV and the character relationship graph structure into the character relationship adjuster PRTuning to update the character relationship graph structure, and inputting the characters and objects of the current segment, the emotion feature vector emV and the character-object relationship graph structure into the character-object relationship adjuster PAORTuning to update the character-object relationship graph structure;
step 116, obtaining an emotional feature vector feV of a character appearing in the video clip through the character emotional feature extractor PFextract, and turning to step 113;
step 117, repeating the steps 110 to 116, and continuously outputting the emotion state of the video character;
and step 118, continuously collecting data in the process of judging the video emotion, and simultaneously feeding back the correctness of the output result for continuous optimization of the model.
The above-described embodiment is only one specific embodiment of the present invention, and general changes and substitutions by those skilled in the art within the technical scope of the present invention are included in the protection scope of the present invention.

Claims (5)

1. A video emotion semantic analysis method based on a graph neural network is characterized by comprising the following steps:
a) constructing graph structure association relations among the characters and the objects in the video,
b) people and objects in the video are identified through target detection,
c) in the video emotion analysis, emotion data in a video is extracted through a 3D-CNN three-dimensional convolution neural network based on the identified person,
d) judging the real emotional state of the target character through graph convolution operations, combined with the current character relationship graph structure and the character-object association relationship graph structure.
2. The method for analyzing video emotion semantics based on graph neural network according to claim 1, wherein the specific steps of the step a are as follows:
step 101, designing the emotion types of characters, the relationship between the characters and objects according to the requirements of scenes in the target research field;
102, collecting a large amount of video data of scenes in the field to perform data annotation, training a target detection module ObjDet aiming at the types of interested persons and objects in an annotation data set based on a universal target detection model, and obtaining a target detection model;
step 103, carrying out character emotion category annotation on the video data, designing the network structures of the character emotion feature extractor PFExtract and the emotion Classifier, combining the two networks, and training them with the annotated data to obtain the PFExtract model and the Classifier model;
the character emotional feature extractor PFextract adopts 3D-CNN as a core network and is used for extracting the emotional features of characters in a video to form a feature vector;
the core of the emotion Classifier is a linear Classifier, and the emotion state classification is judged by using the characteristics formed by the character emotion characteristic extractor PFextract.
3. The method for analyzing video emotion semantics based on graph neural network according to claim 2, wherein the specific steps of the step b are as follows:
step 201, according to the set character relationship, based on the characters obtained by video target detection and recognition, forming a character relationship diagram structure data set, training the character relationship generator PRGen, and obtaining a character relationship diagram structure generation model;
the character relationship graph structure is a graph structure (V, E) of the related characters appearing in the video, wherein V represents the characters and E represents the relationships between characters, and the character features are expressed as d-dimensional vectors;
the character relation generator PRGen is responsible for forming the identified characters into a character basic relation graph structure for describing the relation between main target characters;
step 202, according to the set relationship between the person and the object, based on the person and the object obtained by video target detection and recognition, forming a data set of a structure of a relationship diagram between the person and the object, and training the relation generator PAORGen between the person and the object to obtain a structure generation model of the relationship diagram between the person and the object;
the person-object relationship generator PAORGen is responsible for forming the identified persons and objects into a person-object basic relationship graph structure for describing the relationship between the main target person and the object of interest.
4. The method for video emotion semantic analysis based on graph neural network according to claim 3, characterized in that, the specific steps of the step c are as follows:
301, training by combining the graph convolution emotion generator GCNGen and the emotion discriminator MDTR based on a character emotion feature vector, a character relation diagram structure and a character and object relation diagram structure extracted by a character emotion feature extractor PFExtract to obtain a graph convolution emotion generator model and an emotion discriminator model;
step 302, based on the emotional feature vectors generated by the graph convolution emotion generator GCNGen and the existing character relationship graph structure, the character relationship adjuster PRTuning adjusts the character relationship graph structure so that it conforms to the emotional features generated by the graph convolution emotion generator;
the character relationship adjuster PRTuning adjusts and updates the existing character relationship graph structure according to the character relationships identified in the current video segment, combined with the graph convolution emotion feature vector output for the preceding segment;
step 303, based on the emotional feature vectors generated by the graph convolution emotion generator GCNGen and the existing character-object relationship graph structure, the character-object relationship adjuster PAORTuning adjusts the character-object relationship graph structure so that it conforms to the emotional features generated by the graph convolution emotion generator;
the character-object relationship adjuster PAORTuning adjusts and updates the existing character-object relationship graph structure according to the relationships between characters and objects identified in the current video segment, combined with the graph convolution emotion feature vector output for the preceding segment;
and step 304, combining the models trained in the preceding steps for video emotion semantic analysis and judgment.
5. The method for video emotion semantic analysis based on graph neural network according to claim 4, characterized in that, the specific steps of the step d are as follows:
step 401, segmenting the video, extracting the people and the objects in the video by using the target detection ObjDet module, forming a people set and an object set based on the recognition result, and forming a people basic relationship graph structure and a people and object basic relationship graph structure through the people relationship generator PRGen and the people and object relationship generator PAORGen;
step 402, pruning the two relationship graph structures formed in step 401, fine-tuning them according to prior knowledge, and selecting the characters and objects of interest as the initial relationship graph structures for video emotion semantic analysis;
step 403, using the target detection ObjDet module to perform target detection on the video again, acquiring the characters and objects of the video at a set time interval, and obtaining the emotional feature vectors feV of the characters appearing in the video segment through the character emotional feature extractor PFExtract;
step 404, inputting a character emotion feature vector feV set, a character relation graph structure and a character and object relation graph structure extracted by a character emotion feature extractor PFextract into the graph convolution emotion generator GCNGen to obtain an emotion feature vector emV subjected to graph convolution;
step 405, inputting the emotion feature vector subjected to graph convolution into the emotion discriminator MDTR, and outputting the emotion state of the character in the current video segment;
step 406, acquiring the next video segment and identifying the characters and objects in it; based on the graph convolution emotion feature vector emV output by the graph convolution emotion generator GCNGen for the previous video segment, inputting the characters and objects of the segment, the emotion feature vector emV and the character relationship graph structure into the character relationship adjuster PRTuning to update the character relationship graph structure, and inputting the characters and objects of the segment, the emotion feature vector emV and the character-object relationship graph structure into the character-object relationship adjuster PAORTuning to update the character-object relationship graph structure;
step 407, obtaining the emotional feature vectors feV of the characters appearing in the video segment through the character emotional feature extractor PFExtract, and returning to step 404;
step 408, repeating steps 401 to 407, and continuously outputting the emotional states of the video characters;
and 409, continuously collecting data in the process of judging the video emotion, and simultaneously feeding back the correctness of an output result for continuous optimization of the model.
CN202110676126.0A 2021-06-18 2021-06-18 Video emotion semantic analysis method based on graph neural network Pending CN113392781A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202110676126.0A CN113392781A (en) 2021-06-18 2021-06-18 Video emotion semantic analysis method based on graph neural network
PCT/CN2021/112475 WO2022262098A1 (en) 2021-06-18 2021-08-13 Video emotion semantic analysis method based on graph neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110676126.0A CN113392781A (en) 2021-06-18 2021-06-18 Video emotion semantic analysis method based on graph neural network

Publications (1)

Publication Number Publication Date
CN113392781A true CN113392781A (en) 2021-09-14

Family

ID=77621793

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110676126.0A Pending CN113392781A (en) 2021-06-18 2021-06-18 Video emotion semantic analysis method based on graph neural network

Country Status (2)

Country Link
CN (1) CN113392781A (en)
WO (1) WO2022262098A1 (en)

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110378381A (en) * 2019-06-17 2019-10-25 华为技术有限公司 Object detecting method, device and computer storage medium
US20210000404A1 (en) * 2019-07-05 2021-01-07 The Penn State Research Foundation Systems and methods for automated recognition of bodily expression of emotion
US20210125067A1 (en) * 2019-10-29 2021-04-29 Kabushiki Kaisha Toshiba Information processing device, information processing method, and program
CN111310672A (en) * 2020-02-19 2020-06-19 广州数锐智能科技有限公司 Video emotion recognition method, device and medium based on time sequence multi-model fusion modeling
CN112182209A (en) * 2020-09-24 2021-01-05 东北大学 GCN-based cross-domain emotion analysis method under lifelong learning framework
CN112348075A (en) * 2020-11-02 2021-02-09 大连理工大学 Multi-mode emotion recognition method based on contextual attention neural network
CN112712127A (en) * 2021-01-07 2021-04-27 北京工业大学 Image emotion polarity classification method combined with graph convolution neural network
CN112733764A (en) * 2021-01-15 2021-04-30 天津大学 Method for recognizing video emotion information based on multiple modes
CN112818861A (en) * 2021-02-02 2021-05-18 南京邮电大学 Emotion classification method and system based on multi-mode context semantic features

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114154558A (en) * 2021-11-12 2022-03-08 山东浪潮科学研究院有限公司 Distributed energy power generation load prediction system and method based on graph neural network
CN114154558B (en) * 2021-11-12 2024-05-21 山东浪潮科学研究院有限公司 Distributed energy power generation load prediction system and method based on graph neural network
WO2023227141A1 (en) * 2022-05-25 2023-11-30 清华大学 Confrontation scene semantic analysis method and apparatus based on target-attribute-relationship
WO2023226755A1 (en) * 2022-05-26 2023-11-30 东南大学 Emotion recognition method based on human-object spatio-temporal interaction behavior

Also Published As

Publication number Publication date
WO2022262098A1 (en) 2022-12-22

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20210914