CN113673325A - Multi-feature character emotion recognition method - Google Patents

Multi-feature character emotion recognition method

Info

Publication number
CN113673325A
CN113673325A
Authority
CN
China
Prior art keywords
input
dictionary
feature
sparse
video
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110793285.9A
Other languages
Chinese (zh)
Other versions
CN113673325B (en)
Inventor
钟谭媛
陈志�
李玲娟
岳文静
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Posts and Telecommunications
Original Assignee
Nanjing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Posts and Telecommunications filed Critical Nanjing University of Posts and Telecommunications
Priority to CN202110793285.9A priority Critical patent/CN113673325B/en
Publication of CN113673325A publication Critical patent/CN113673325A/en
Application granted granted Critical
Publication of CN113673325B publication Critical patent/CN113673325B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/28 Determining representative reference patterns, e.g. by averaging or distorting; Generating dictionaries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/047 Probabilistic or stochastic networks
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Health & Medical Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a multi-feature character emotion recognition method. The method first extracts local spatio-temporal features of the face and body in a video using a 3D convolutional neural network, then performs dictionary learning on the extracted feature vectors with the MOD algorithm under the framework of a sparse coding tree to obtain sparse codes, and finally trains SVM classifiers at the nodes of the sparse coding tree with the sparse codes as input, classifying repeatedly until an emotion representation of a single category is output. The invention adapts well to different scenes, has strong generalization ability, and also improves the accuracy of emotion recognition for people in videos with heavy occlusion.

Description

Multi-feature character emotion recognition method
Technical Field
The invention relates to the technical field of feature recognition, and in particular to a multi-feature character emotion recognition method.
Background
Emotion recognition is a rapidly developing and actively researched application area of computer vision, drawing on a range of related disciplines such as pattern recognition, machine learning, psychology and medicine. In recent years, emotion recognition has become an important research topic in computer vision and human-computer interaction, with significant theoretical and practical application value.
Emotion recognition for people in a video mainly involves the following techniques:
(1) 3D convolutional neural network (C3D): the extracted features encapsulate information about objects, scenes and actions in the video, making them useful for different tasks without fine-tuning the model for each task. C3D is a good descriptor: it is generic, compact, simple and efficient. The method uses a 3D convolutional network to extract local spatio-temporal features of the face and body of the person in the video, greatly improving efficiency and effectiveness;
(2) sparse coding tree: each node uses a node-specific dictionary and classifier to direct the input vector to its child nodes, which in turn have their own specialized dictionaries and classifiers, enabling progressively more accurate classification;
(3) MOD dictionary learning: a dictionary learning method (the Method of Optimal Directions) that iteratively updates the dictionary atoms during training so that the residual of the sparse representation keeps decreasing until a convergence condition is satisfied, finally producing a dictionary with good discriminative power;
(4) support vector machine (SVM): used to train the classifiers.
Based on the above, the invention provides a multi-feature character emotion recognition method based on facial expressions and body movements, aiming to improve the accuracy of emotion recognition for people in videos.
Disclosure of Invention
The purpose of the invention is as follows: the invention provides a multi-feature character emotion recognition method that first extracts local features of the face and body in a video using a 3D convolutional neural network, then performs dictionary learning on the extracted feature vectors with the MOD algorithm under the framework of a sparse coding tree to obtain sparse codes, and finally trains SVM classifiers at the nodes of the sparse coding tree with the sparse codes as input to complete emotion classification and recognition.
The technical scheme is as follows: to achieve the above purpose, the invention adopts the following technical scheme:
a multi-feature character emotion recognition method comprises the following steps:
Step S1, the user inputs a video; all frames of the video are traversed with a sampling stride of 1 frame to create multiple 16-frame clips; these 16-frame clips serve as the input to the 3D convolutional neural network;
Step S2, a 3D convolutional neural network is used to extract local features of the facial expressions and body movements of the people in the video; for each input, a 7 × 7 × 512 feature map is constructed at the conv5b layer, the spatial position of each feature is extracted separately, and the values at each spatial position are concatenated along the 512 channels to obtain the final local features of the input; the input video thus yields 7 × 7 = 49 local features in total, each of which is a 512-dimensional vector;
Step S3, for the input final local features, dictionary learning is performed at the root node of the sparse coding tree using the MOD algorithm; the MOD objective function is:

    min_{D, ω_i} Σ_i ||x_i − D ω_i||_2^2   subject to   ||ω_i||_0 ≤ T_0

where D = [g_1, g_2, …, g_n]^T denotes the dictionary matrix and g_i is a dictionary atom; x_i is an input feature vector; ω_i denotes the sparse coefficient vector of x_i over the dictionary atoms g_i; and T_0 is the maximum number of non-zero elements in the sparse representation coefficients;
Step S3.1, the training sample set is X = [x_1, x_2, …, x_N];
Step S3.2, initialize the dictionary: randomly construct an initial dictionary D^(0) ∈ R^(n×m) and normalize the columns of D^(0);
s3.3, approximating a solution by using a tracking algorithm to obtain a sparse coefficient omegaiThe following were used:
Figure BDA0003161868540000023
s3.4, according to the sparse coefficient matrix W(k)The dictionary is updated as follows:
Figure BDA0003161868540000024
Step S3.5, when the representation error ||X − D^(k) W^(k)||_F^2 is less than 10^(−6), stop the iteration and output the final dictionary D;
Step S4, learn classifiers using a support vector machine (SVM) and train the sparse coding tree; specifically,
s4.1, initializing a root node of the sparse coding tree into an active node a; at a, encoding the input local features into sparse codes using the dictionary D output at step S3; carrying out coarse classification on the coded input features at the active node a by adopting a Support Vector Machine (SVM) classifier;
s4.2, classifying according to a branch rule based on the rough classification label; and when branching to the next-level child node, taking the child node as the next active node a, repeating the sparse coding and coarse classification steps until all emotion classifications are finished, and outputting a final result.
Further, the branching rule in step S4.2 specifically includes:
When the coarse classification result consists of 2 or more confusable classes, the current node hands these samples over to a newly trained child node, which further sub-classifies the coarse result; finally, a recognition result containing only a single class is output.
Further, the 3D convolutional neural network in step S2 is configured to capture temporal and spatial feature information in the video, and comprises 8 convolutional layers, 5 pooling layers, 2 fully-connected layers and 1 softmax output layer; the 3D convolution kernels of all layers have size 3 × 3 × 3 and stride 1; the first pooling layer has size 1 × 2 and stride 1, and the remaining pooling layers have size 2 × 2 and stride 2; the input video is resized to 128 × 171 and cut into non-overlapping 16-frame clips that serve as the network input.
Further, in step S2, conv5b is the last convolutional layer of the 3D convolutional neural network and is used for feature visualization; its feature maps have a spatial size of 7 × 7 and 512 channels, and the layer produces two such feature maps.
Beneficial effects: compared with the prior art, the above technical scheme of the invention has the following technical effects:
the method comprises the steps of extracting local features of facial expressions and body actions of characters in videos, then performing dictionary learning on extracted feature vectors by using an MOD algorithm under the framework of a sparse coding tree to obtain sparse codes, finally training an SVM classifier at nodes of the sparse coding tree by using the sparse codes as input, continuously classifying, and finally outputting emotion representations of single categories; the invention can be well suitable for different scenes, has stronger generalization capability and can also improve the accuracy of human mood identification in the video of a multi-shielding environment.
(1) The invention uses a 3D convolutional neural network to extract local features, which encapsulate information about objects, scenes and actions in the video, greatly improving efficiency and effectiveness.
(2) The invention uses a sparse coding tree together with the MOD algorithm, classifying repeatedly at the nodes of the sparse coding tree and continuously reducing the error through MOD iterations, so that person emotion recognition is completed more accurately.
(3) The invention uses both facial expression features and body movement features as cues for emotion recognition, which improves recognition accuracy when a person's face or body is occluded in the video and strengthens generalization ability.
Drawings
FIG. 1 is a flow chart of a multi-feature character emotion recognition method provided by the present invention;
FIG. 2 is a schematic diagram of a sparse coding tree according to an embodiment of the present invention.
Detailed Description
The invention is further described below with reference to an embodiment and the accompanying drawings.
The multi-feature character emotion recognition method shown in fig. 1 comprises the following steps:
and step S1, inputting a video by a user, traversing all frames of the video by using the sampling step length of 1 frame, and creating a plurality of clips with the length of 16 frames as the input of the 3D convolutional neural network.
Step S2: a 3D convolutional neural network is used to extract local features of the facial expressions and body movements of the people in the video. The 3D convolutional neural network captures temporal and spatial feature information in the video and comprises 8 convolutional layers, 5 pooling layers, 2 fully-connected layers and 1 softmax output layer; the 3D convolution kernels of all layers have size 3 × 3 × 3 and stride 1; the first pooling layer has size 1 × 2 and stride 1, and the remaining pooling layers have size 2 × 2 and stride 2; the video is resized to 128 × 171 and cut into non-overlapping 16-frame clips that serve as the network input.
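For concreteness, a minimal PyTorch sketch of a C3D-style backbone with the layer counts, kernel sizes and pooling arrangement described above is given here. The channel widths (64/128/256/512), the 4096-unit fully-connected layers, the 112 × 112 input crop and the pooling padding follow the commonly used C3D configuration and are assumptions rather than details stated in this embodiment.

```python
import torch
import torch.nn as nn

class C3DFeatures(nn.Module):
    """C3D-style backbone: 8 conv layers, 5 pooling layers, 2 FC layers, softmax output.

    All 3D convolution kernels are 3x3x3 with stride 1; the first pooling layer keeps
    the temporal dimension (kernel 1x2x2), the remaining pooling layers halve all
    dimensions (kernel 2x2x2, stride 2). Channel widths follow the usual C3D choices
    and are an assumption here.
    """

    def __init__(self, num_classes=4):
        super().__init__()
        def block(cin, cout):
            return nn.Sequential(nn.Conv3d(cin, cout, 3, stride=1, padding=1),
                                 nn.ReLU(inplace=True))
        self.conv1a = block(3, 64)
        self.pool1 = nn.MaxPool3d(kernel_size=(1, 2, 2), stride=(1, 2, 2))
        self.conv2a = block(64, 128)
        self.pool2 = nn.MaxPool3d(2, 2)
        self.conv3a, self.conv3b = block(128, 256), block(256, 256)
        self.pool3 = nn.MaxPool3d(2, 2)
        self.conv4a, self.conv4b = block(256, 512), block(512, 512)
        self.pool4 = nn.MaxPool3d(2, 2)
        self.conv5a, self.conv5b = block(512, 512), block(512, 512)
        self.pool5 = nn.MaxPool3d(2, 2, padding=(0, 1, 1))
        self.fc6 = nn.Linear(512 * 1 * 4 * 4, 4096)
        self.fc7 = nn.Linear(4096, 4096)
        self.fc8 = nn.Linear(4096, num_classes)   # softmax output layer

    def forward(self, x, return_conv5b=False):
        # x: (batch, 3, 16, 112, 112) clips; 112x112 crops of the resized frames (an assumption)
        x = self.pool1(self.conv1a(x))
        x = self.pool2(self.conv2a(x))
        x = self.pool3(self.conv3b(self.conv3a(x)))
        x = self.pool4(self.conv4b(self.conv4a(x)))
        x = self.conv5b(self.conv5a(x))           # (batch, 512, 2, 7, 7): two 7x7x512 feature maps
        if return_conv5b:
            return x                              # used for the local features of step S2
        x = torch.flatten(self.pool5(x), 1)
        x = torch.relu(self.fc7(torch.relu(self.fc6(x))))
        return torch.softmax(self.fc8(x), dim=1)
```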
For each input, a 7 × 7 × 512 feature map is constructed at the conv5b layer; the spatial position of each feature is extracted separately and the values at each spatial position are concatenated along the 512 channels to obtain the final local features of the input. The input video thus yields 7 × 7 = 49 local features in total, each of which is a 512-dimensional vector. conv5b is the last convolutional layer of the 3D convolutional neural network and is used for feature visualization; its feature maps have a spatial size of 7 × 7 and 512 channels, and the layer produces two such feature maps.
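The flattening of a conv5b activation into the 7 × 7 = 49 local 512-dimensional vectors could then look like the sketch below; averaging over the two temporal feature maps is an assumption, since this embodiment does not state how the two maps are combined.

```python
import numpy as np

def conv5b_to_local_features(conv5b):
    """Turn one conv5b activation into 49 local 512-dimensional feature vectors.

    `conv5b` is assumed to have shape (512, T, 7, 7). The T temporal maps are
    averaged (an assumption), then the values at each of the 7 x 7 spatial
    positions are taken along the 512 channels, giving a (49, 512) array.
    """
    conv5b = np.asarray(conv5b)
    c, t, h, w = conv5b.shape            # (512, T, 7, 7)
    pooled = conv5b.mean(axis=1)         # (512, 7, 7)
    return pooled.reshape(c, h * w).T    # (49, 512): one 512-d vector per spatial position
```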
Step S3: dictionary learning is performed on the input final local features at the root node of the sparse coding tree using the MOD algorithm. The MOD objective function is:

    min_{D, ω_i} Σ_i ||x_i − D ω_i||_2^2   subject to   ||ω_i||_0 ≤ T_0

where D = [g_1, g_2, …, g_n]^T denotes the dictionary matrix and g_i is a dictionary atom; x_i is an input feature vector; ω_i denotes the sparse coefficient vector of x_i over the dictionary atoms g_i; and T_0 is the maximum number of non-zero elements in the sparse representation coefficients. Specifically,
Step S3.1: the training sample set is X = [x_1, x_2, …, x_N];
Step S3.2: initialize the dictionary; randomly construct an initial dictionary D^(0) ∈ R^(n×m) and normalize the columns of D^(0);
s3.3, approximating a solution by using a tracking algorithm to obtain a sparse coefficient omegaiThe following were used:
Figure BDA0003161868540000043
s3.4, according to the sparse coefficient matrix W(k)The dictionary is updated as follows:
Figure BDA0003161868540000051
Step S3.5: when the representation error ||X − D^(k) W^(k)||_F^2 is less than 10^(−6), stop the iteration and output the final dictionary D.
In this step, the MOD algorithm alternately updates the sparse coefficient matrix W^(k) and the dictionary matrix D over the iterations. First, a pursuit algorithm is used to approximate the solution, which updates the sparse coefficients; the dictionary is then updated from the input local features and the sparse coefficient matrix W^(k); once the change in the sparse coefficient matrix W^(k) is small enough, the final dictionary D is obtained. A compact sketch of this loop is given below.
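The following numpy/scikit-learn sketch illustrates the MOD loop of step S3. Orthogonal matching pursuit stands in for the pursuit algorithm, and the dictionary size, sparsity level and iteration cap are illustrative parameters; none of these specific choices are fixed by this embodiment.

```python
import numpy as np
from sklearn.linear_model import orthogonal_mp

def mod_dictionary_learning(X, n_atoms=128, T0=5, tol=1e-6, max_iter=50, seed=0):
    """Method of Optimal Directions (MOD) dictionary learning.

    X : (n_features, n_samples) matrix whose columns are the local features x_i.
    Returns the learned dictionary D (n_features, n_atoms) and the sparse codes W.
    """
    rng = np.random.default_rng(seed)
    n_features, _ = X.shape

    # S3.2: random initial dictionary D(0) with unit-norm columns
    D = rng.standard_normal((n_features, n_atoms))
    D /= np.linalg.norm(D, axis=0, keepdims=True)

    prev_err = np.inf
    for _ in range(max_iter):
        # S3.3: sparse coding step, pursuit with at most T0 non-zero coefficients per sample
        W = orthogonal_mp(D, X, n_nonzero_coefs=T0)            # (n_atoms, n_samples)

        # S3.4: MOD dictionary update D = X W^T (W W^T)^(-1), pseudo-inverse for stability
        D = X @ W.T @ np.linalg.pinv(W @ W.T)
        D /= np.linalg.norm(D, axis=0, keepdims=True) + 1e-12

        # S3.5: stop once the representation error (or its change) is below the tolerance
        err = np.linalg.norm(X - D @ W, 'fro') ** 2
        if err < tol or abs(prev_err - err) < tol:
            break
        prev_err = err
    return D, W
```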
And step S4, learning a classifier by using a Support Vector Machine (SVM), and training the sparse coding tree. As shown in fig. 2:
s4.1, initializing a root node of the sparse coding tree into an active node a; at a, encoding the input local features into sparse codes using the dictionary D output at step S3; and carrying out coarse classification on the coded input features at the active node a by adopting a Support Vector Machine (SVM) classifier. The SVM classifier is a rough classifier aiming at four emotions (anger, distraction, heart injury and neutrality) under study; at this time, the classifier only performs rough classification on the input data, and further classification is transmitted to the child node, so the classification is called rough classification.
Step S4.2: classify according to the branch rule based on the coarse classification labels. The branch rule is specifically as follows:
When the coarse classification result consists of 2 or more confusable classes, the current node hands these samples over to a newly trained child node, which further sub-classifies the coarse result; finally, a recognition result containing only a single class is output. For example, when the output of the coarse classifier contains only the class "angry", that class is output as a final result, while the remaining samples labelled "happy", "sad" and "neutral" are directed to the next new child node, where an SVM classifier is trained again.
When branching to a next-level child node, that child node becomes the next active node a, and the sparse coding and classification steps are repeated until all emotions are classified, after which the final result is output.
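As an illustration of steps S4.1 and S4.2, the sketch below trains one node of the sparse coding tree and recursively spawns a child node for the classes the node still confuses. `SparseCodeTreeNode` is a hypothetical name; the sketch reuses the `mod_dictionary_learning` function from the step S3 sketch, uses orthogonal matching pursuit for node-level sparse coding, takes a linear SVM as the node classifier, and treats per-class training accuracy as a toy notion of "confusable"; all of these are assumptions rather than details fixed by this embodiment.

```python
import numpy as np
from sklearn.linear_model import orthogonal_mp
from sklearn.svm import LinearSVC

class SparseCodeTreeNode:
    """One active node of the sparse coding tree (simplified sketch).

    Each node holds its own MOD-learned dictionary and an SVM trained on the sparse
    codes of the samples routed to it; classes the node still confuses are handed
    down to a newly trained child node, following the branch rule of step S4.2.
    """

    def __init__(self, n_atoms=128, T0=5, max_depth=3):
        self.n_atoms, self.T0, self.max_depth = n_atoms, T0, max_depth
        self.dictionary, self.svm, self.child = None, None, None

    def _encode(self, X):
        # Sparse-code the (n_samples, 512) local features with this node's dictionary.
        return orthogonal_mp(self.dictionary, X.T, n_nonzero_coefs=self.T0).T

    def fit(self, X, y):
        X, y = np.asarray(X), np.asarray(y)
        # Node-specific dictionary, reusing the MOD sketch from step S3.
        self.dictionary, _ = mod_dictionary_learning(X.T, n_atoms=self.n_atoms, T0=self.T0)
        codes = self._encode(X)
        self.svm = LinearSVC().fit(codes, y)                    # coarse classification at this node

        # Branch rule: classes that are not yet cleanly separated go to a new child node.
        pred = self.svm.predict(codes)
        confused = [c for c in np.unique(y) if np.mean(pred[y == c] == c) < 1.0]
        if len(confused) >= 2 and self.max_depth > 0:
            mask = np.isin(y, confused)
            self.child = SparseCodeTreeNode(self.n_atoms, self.T0, self.max_depth - 1)
            self.child.fit(X[mask], y[mask])
        return self

    def predict(self, X):
        X = np.asarray(X)
        pred = self.svm.predict(self._encode(X))
        if self.child is not None:
            # Samples whose coarse label belongs to a confusable class are refined in the child.
            mask = np.isin(pred, self.child.svm.classes_)
            if mask.any():
                pred[mask] = self.child.predict(X[mask])
        return pred
```

With four emotion labels such as angry, happy, sad and neutral, `SparseCodeTreeNode().fit(features, labels)` would keep the classes the root SVM already distinguishes at the root and grow child nodes only for the remaining confusable ones, mirroring the example above.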
The above describes only preferred embodiments of the present invention. It should be noted that those skilled in the art can make various modifications and adaptations without departing from the principles of the invention, and such modifications are also intended to fall within the scope of the invention.

Claims (4)

1. A multi-feature character emotion recognition method is characterized by comprising the following steps:
step S1, the user inputs a video; all frames of the video are traversed with a sampling stride of 1 frame to create multiple 16-frame clips; these 16-frame clips serve as the input to the 3D convolutional neural network;
step S2, a 3D convolutional neural network is used to extract local features of the facial expressions and body movements of the people in the video; for each input, a 7 × 7 × 512 feature map is constructed at the conv5b layer, the spatial position of each feature is extracted separately, and the values at each spatial position are concatenated along the 512 channels to obtain the final local features of the input; the input video thus yields 7 × 7 = 49 local features in total, each of which is a 512-dimensional vector;
step S3, for the input final local features, dictionary learning is performed at the root node of the sparse coding tree using the MOD algorithm; the MOD objective function is:

    min_{D, ω_i} Σ_i ||x_i − D ω_i||_2^2   subject to   ||ω_i||_0 ≤ T_0

where D = [g_1, g_2, …, g_n]^T denotes the dictionary matrix and g_i is a dictionary atom; x_i is an input feature vector; ω_i denotes the sparse coefficient vector of x_i over the dictionary atoms g_i; and T_0 is the maximum number of non-zero elements in the sparse representation coefficients;
step S3.1, the training sample set is X = [x_1, x_2, …, x_N];
step S3.2, initialize the dictionary: randomly construct an initial dictionary D^(0) ∈ R^(n×m) and normalize the columns of D^(0);
s3.3, approximating a solution by using a tracking algorithm to obtain a sparse coefficient omegaiThe following were used:
Figure FDA0003161868530000013
s3.4, according to the sample X and the sparse coefficient matrix W(k)The dictionary is updated as follows:
Figure FDA0003161868530000014
step S3.5, when the representation error ||X − D^(k) W^(k)||_F^2 is less than 10^(−6), stop the iteration and output the final dictionary D;
step S4, learn classifiers using a support vector machine (SVM) and train the sparse coding tree; specifically,
s4.1, initializing a root node of the sparse coding tree into an active node a; at a, encoding the input local features into sparse codes using the dictionary D output at step S3; carrying out coarse classification on the coded input features at the active node a by adopting a Support Vector Machine (SVM) classifier;
s4.2, classifying according to a branch rule based on the rough classification label; and when branching to the next-level child node, taking the child node as the next active node a, repeating the sparse coding and coarse classification steps until all emotion classifications are finished, and outputting a final result.
2. The method for multi-feature character emotion recognition according to claim 1, wherein the branching rule in step S4.2 specifically includes:
When the coarse classification result consists of 2 or more confusable classes, the current node hands these samples over to a newly trained child node, which further sub-classifies the coarse result; finally, a recognition result containing only a single class is output.
3. The multi-feature character emotion recognition method according to claim 1, wherein the 3D convolutional neural network in step S2 is used to capture temporal and spatial feature information in the video and comprises 8 convolutional layers, 5 pooling layers, 2 fully-connected layers and 1 softmax output layer; the 3D convolution kernels of all layers have size 3 × 3 × 3 and stride 1; the first pooling layer has size 1 × 2 and stride 1, and the remaining pooling layers have size 2 × 2 and stride 2; the input video is resized to 128 × 171 and cut into non-overlapping 16-frame clips that serve as the network input.
4. The multi-feature character emotion recognition method according to claim 1, wherein in step S2, conv5b is the last convolutional layer of the 3D convolutional neural network and is used for feature visualization; its feature maps have a spatial size of 7 × 7 and 512 channels, and the layer produces two such feature maps.
CN202110793285.9A 2021-07-14 2021-07-14 Multi-feature character emotion recognition method Active CN113673325B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110793285.9A CN113673325B (en) 2021-07-14 2021-07-14 Multi-feature character emotion recognition method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110793285.9A CN113673325B (en) 2021-07-14 2021-07-14 Multi-feature character emotion recognition method

Publications (2)

Publication Number Publication Date
CN113673325A true CN113673325A (en) 2021-11-19
CN113673325B CN113673325B (en) 2023-08-15

Family

ID=78539262

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110793285.9A Active CN113673325B (en) 2021-07-14 2021-07-14 Multi-feature character emotion recognition method

Country Status (1)

Country Link
CN (1) CN113673325B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114565964A (en) * 2022-03-03 2022-05-31 网易(杭州)网络有限公司 Emotion recognition model generation method, recognition method, device, medium and equipment

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107784293A (en) * 2017-11-13 2018-03-09 中国矿业大学(北京) A kind of Human bodys' response method classified based on global characteristics and rarefaction representation
CN108319891A (en) * 2017-12-07 2018-07-24 国网新疆电力有限公司信息通信公司 Face feature extraction method based on sparse expression and improved LDA
US20190042952A1 (en) * 2017-08-03 2019-02-07 Beijing University Of Technology Multi-task Semi-Supervised Online Sequential Extreme Learning Method for Emotion Judgment of User
CN109711283A (en) * 2018-12-10 2019-05-03 广东工业大学 A kind of joint doubledictionary and error matrix block Expression Recognition algorithm
CN112699774A (en) * 2020-12-28 2021-04-23 深延科技(北京)有限公司 Method and device for recognizing emotion of person in video, computer equipment and medium

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190042952A1 (en) * 2017-08-03 2019-02-07 Beijing University Of Technology Multi-task Semi-Supervised Online Sequential Extreme Learning Method for Emotion Judgment of User
CN107784293A (en) * 2017-11-13 2018-03-09 中国矿业大学(北京) A kind of Human bodys' response method classified based on global characteristics and rarefaction representation
CN108319891A (en) * 2017-12-07 2018-07-24 国网新疆电力有限公司信息通信公司 Face feature extraction method based on sparse expression and improved LDA
CN109711283A (en) * 2018-12-10 2019-05-03 广东工业大学 A kind of joint doubledictionary and error matrix block Expression Recognition algorithm
CN112699774A (en) * 2020-12-28 2021-04-23 深延科技(北京)有限公司 Method and device for recognizing emotion of person in video, computer equipment and medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
汪伟鸣; 邵洁: "融合面部表情和肢体动作特征的情绪识别" (Emotion recognition fusing facial expression and body movement features), 电视技术, no. 01 *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114565964A (en) * 2022-03-03 2022-05-31 网易(杭州)网络有限公司 Emotion recognition model generation method, recognition method, device, medium and equipment

Also Published As

Publication number Publication date
CN113673325B (en) 2023-08-15

Similar Documents

Publication Publication Date Title
Fathallah et al. Facial expression recognition via deep learning
CN108596039B (en) Bimodal emotion recognition method and system based on 3D convolutional neural network
Ghosh et al. Learning human motion models for long-term predictions
Zhang et al. Spatial–temporal recurrent neural network for emotion recognition
Baradel et al. Glimpse clouds: Human activity recognition from unstructured feature points
Shao et al. Feature learning for image classification via multiobjective genetic programming
CN106782602B (en) Speech emotion recognition method based on deep neural network
CN110532861B (en) Behavior recognition method based on framework-guided multi-mode fusion neural network
Liu et al. Facial expression recognition based on fusion of multiple Gabor features
CN107122752B (en) Human body action comparison method and device
CN111950455B (en) Motion imagery electroencephalogram characteristic identification method based on LFFCNN-GRU algorithm model
CN112667080A (en) Electroencephalogram signal unmanned platform intelligent control method based on deep convolution countermeasure network
CN110309861A (en) A kind of multi-modal mankind's activity recognition methods based on generation confrontation network
CN112949647B (en) Three-dimensional scene description method and device, electronic equipment and storage medium
CN113749657B (en) Brain electricity emotion recognition method based on multi-task capsule
CN112784929B (en) Small sample image classification method and device based on double-element group expansion
Wang et al. A deep clustering via automatic feature embedded learning for human activity recognition
CN112200110A (en) Facial expression recognition method based on deep interference separation learning
CN111523367B (en) Intelligent facial expression recognition method and system based on facial attribute analysis
CN115273236A (en) Multi-mode human gait emotion recognition method
Ullah et al. Emotion recognition from occluded facial images using deep ensemble model
CN113673325A (en) Multi-feature character emotion recognition method
Rawat et al. A novel convolutional neural network-gated recurrent unit approach for image captioning
CN111950592B (en) Multi-modal emotion feature fusion method based on supervised least square multi-class kernel canonical correlation analysis
Albrici et al. G2-VER: Geometry guided model ensemble for video-based facial expression recognition

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant