CN111488856B

CN111488856B - Multimodal 2D and 3D facial expression recognition method based on orthogonal guide learning

Info

Publication number: CN111488856B
Application number: CN202010347655.1A
Authority: CN
Inventors: 沈琳琳; 肖建安
Original assignee: Jiangxi Ji Wei Technology Co ltd
Current assignee: Jiangxi Ji Wei Technology Co ltd
Priority date: 2020-04-28
Filing date: 2020-04-28
Publication date: 2023-04-18
Anticipated expiration: 2040-04-28
Also published as: CN111488856A

Abstract

A multi-modal 2D and 3D facial expression recognition method based on orthogonal guide learning relates to the technical field of computer vision. The method utilizes face point cloud data to generate three attribute maps which are respectively a depth map, a direction map and an elevation map, wherein the depth map, the direction map and the elevation map are combined into a three-channel RGB map, and the RGB map is used as the input of a certain branch in a network, so that the parameter quantity of a model is reduced. By the multi-modal 2D and 3D facial expression recognition method based on orthogonal guide learning, the complexity of a deep learning network is reduced, the redundancy among features extracted from different branches in the network is inhibited, and good economic and social benefits are generated.

Description

Multimodal 2D and 3D facial expression recognition method based on orthogonal guide learning

Technical Field

The invention relates to the technical field of computer vision, in particular to a multimodal 2D and 3D facial expression recognition method based on orthogonal guide learning.

Background

With the rapid development of deep learning, multimodal 2D and 3D Facial Expression Recognition (FER) are receiving wide attention in the field of computer vision. The methods based on deep learning are that a plurality of 3D attribute graphs are extracted by utilizing 3D point cloud data, the attribute graphs and a 2D face graph are used as input and are respectively sent into each characteristic extraction branch of a CNN network, and finally, the extracted characteristics of each branch are fused to be used as the input of a classifier. However, since the 2D color map and the 3D attribute map are both from the same sample, there may be redundancy in the learned features of each branch, which is not conducive to direct feature fusion, and in addition, one branch is used for each Zhang Shuxing map to extract features, which greatly increases the complexity of the model.

Disclosure of Invention

The invention aims to provide a multimodal 2D and 3D facial expression recognition method based on orthogonal guide learning, aiming at the defects and shortcomings of the prior art, and the method can be used for reducing the complexity of a deep learning network and inhibiting the redundancy among the features extracted from different branches in the network.

In order to achieve the purpose, the invention adopts the following technical scheme: three attribute maps, namely a depth map, a direction map and an elevation map, are generated by using face point cloud data, the depth map, the direction map and the elevation map are combined into a three-channel RGB map, and the RGB map is used as the input of a certain branch in a network, so that the parameter quantity of the model is reduced.

The multi-modal 2D and 3D facial expression recognition method based on orthogonal guide learning introduces an orthogonal module to ensure that the features are orthogonal during feature fusion.

According to the multi-mode 2D and 3D facial expression recognition method based on orthogonal guide learning, the feature extraction part uses two network branches with different structures to respectively extract features of a 2D face graph and a 3D attribute graph, the features are respectively defined as FE2DNet and FE3DNet, the FE2DNet is a VGG network deformation, and the FE3DNet is a Resnet derivation.

In the multimodal 2D and 3D facial expression recognition method based on orthogonal guide learning, a Global Weighted Pooling (GWP) layer is adopted to replace a GAP layer, which is different from a general object, in the facial expression recognition task, images input into a CNN network are aligned through key points, so that in a deep feature map, each pixel represents a certain specific area of the input image and contains fixed semantic information. Important areas such as mouth, nose, eyes, etc. play a crucial role in the correct classification of expressions, and additional attention needs to be paid to semantic information of these areas. Using GAP directly, averaging all pixels directly, the semantic information of these key regions is likely to be ignored.

According to the multi-modal 2D and 3D facial expression recognition method based on orthogonal guide learning, each feature map is provided with a weight map with the same size, the weight in the weight map can be updated by gradient descent, the output feature vector is obtained by calculating the dot product of the feature map and the weight map, and the calculation formula is shown as follows:

(wherein x ^k ,y ^k ,w ^k Respectively representing the values of the feature map, the weight map and the corresponding feature vector elements), after training with a large amount of face data, the weight map will focus more on a specific spatial region, and a larger weight in the weight map indicates that the spatial region contributes more to the final classification result.

The multimode 2D and 3D facial expression recognition method based on orthogonal guide learning is characterized in that input images of two channels are 2D gray level images and 3D attribute images from the same face, and feature vectors V extracted by a feature extractor ₁ And V ₂ There may be redundancy, let V before feature fusion ₁ And V ₂ Passing through an orthogonal guide module to output a feature vector F ₁ And F ₂ Orthogonal, removing the redundant part between the two vectors.The orthogonal guide module is composed of a full connection layer and a Relu layer.

The multi-modal 2D and 3D facial expression recognition method based on orthogonal guide learning,

orthogonal steering modules are respectively provided with V ₁ And V ₂ As input, it is converted by the full connection layer and two orthogonal features F are output ₁ And F ₂ Designing a quadrature loss function L _orth To supervise updating of orthogonal bootstrap module weights to ensure F ₁ And F ₂ Orthogonality therebetween. L is _orth Is defined as follows:

wherein θ is F ₁ And F ₂ The included angle therebetween. When loss function L _orth The closer to 0, the closer to 90 degrees the angle θ is represented, and F is ₁ And F ₂ The more orthogonal, i.e. uncorrelated, there is.

The working principle of the invention is as follows: a multi-mode 2D and 3D face expression recognition method based on orthogonal guide learning utilizes face point cloud data to generate three attribute maps which are a depth map, a direction map and an elevation map respectively, the depth map, the direction map and the elevation map are combined into a three-channel RGB map, an orthogonal module is introduced to ensure that features are orthogonal during feature fusion, and before feature fusion, V is firstly led to ₁ And V ₂ Passing through an orthogonal guide module to output a feature vector F ₁ And F ₂ Orthogonal, removing the redundant part between the two vectors.

After the technical scheme is adopted, the invention has the beneficial effects that: by the multi-modal 2D and 3D facial expression recognition method based on orthogonal guide learning, the complexity of a deep learning network is reduced, the redundancy among features extracted from different branches in the network is inhibited, and good economic and social benefits are generated.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.

FIG. 1 is a schematic diagram of the network architecture of the present invention and its flow chart;

FIG. 2 is a schematic diagram of the network structure of the FE2DNet and FE3DNet of the present invention;

figure 3 is a schematic flow diagram of the GWP operation structure of the present invention;

fig. 4 is a flow chart of the structure of the orthogonal guide module of the present invention.

Detailed Description

Referring to fig. 1 to 4, the technical solution adopted by the present embodiment is: the method utilizes face point cloud data to generate three attribute maps which are respectively a depth map, a direction map and an elevation map, wherein the depth map, the direction map and the elevation map are combined into a three-channel RGB map, and the RGB map is used as the input of a certain branch in a network, so that the parameter quantity of a model is reduced.

Furthermore, the multimodal 2D and 3D facial expression recognition method based on orthogonal guide learning introduces an orthogonal module to ensure that the features are orthogonal during feature fusion.

Further, in the method for recognizing the multi-modal 2D and 3D facial expressions based on the orthogonal guide learning, the feature extraction part uses two network branches with different structures to respectively extract features of the 2D face map and the 3D attribute map, which are respectively defined as FE2DNet and FE3DNet, where FE2DNet is a variant of the VGG network and FE3DNet is a derivative of Resnet.

Further, in the multimodal 2D and 3D facial expression recognition method based on orthogonal guide learning, a Global Weighted Pooling (GWP) layer is used to replace a GAP layer, which is different from a general object, in the facial expression recognition task, images input to the CNN network are aligned through key points, so that in a deep feature map, each pixel represents a specific area of the input image and contains fixed semantic information. Important areas such as mouth, nose, eyes, etc. play a crucial role in the correct classification of expressions, and additional attention needs to be paid to semantic information of these areas. Using GAP directly, averaging all pixels directly, the semantic information of these key regions is likely to be ignored.

Furthermore, each feature map is provided with a weight map with the same size, the weight in the weight map can be updated by gradient descent, the output feature vector is obtained by calculating the dot product of the feature map and the weight map, and the calculation formula is as follows:

Further, according to the multi-modal 2D and 3D facial expression recognition method based on orthogonal guide learning, input images of two channels are both a 2D gray level image and a 3D attribute image of the same face, and a feature vector V extracted by the feature extractor ₁ And V ₂ There may be redundancy, so V is left before feature fusion takes place ₁ And V ₂ Passing through an orthogonal guide module to output a feature vector F ₁ And F ₂ Orthogonal, removing the redundant part between the two vectors. The orthogonal guide module is composed of a full connection layer and a Relu layer.

Further, in the multi-modal 2D and 3D facial expression recognition method based on orthogonal guide learning, the orthogonal guide modules respectively adopt V ₁ And V ₂ As input, it is converted by the full connection layer and two orthogonal features F are output ₁ And F ₂ Design a quadrature loss function L _orth To supervise updating of orthogonal bootstrap module weights to ensure F ₁ And F ₂ Orthogonality therebetween. L is _orth The formula of (c) is defined as follows:

wherein θ is F ₁ And F ₂ The included angle therebetween. When loss function L _orth The closer to 0, the closer to 90 degrees the angle θ is represented, in which case F ₁ And F ₂ The more orthogonal, i.e. uncorrelated, there is.

After the technical scheme is adopted, the invention has the beneficial effects that: the multi-modal 2D and 3D facial expression recognition method based on orthogonal guide learning reduces the complexity of a deep learning network and inhibits the redundancy among the features extracted from different branches in the network, thereby generating good economic benefit and social benefit.

The above description is only for the purpose of illustrating the technical solutions of the present invention and not for the purpose of limiting the same, and other modifications or equivalent substitutions made by those skilled in the art to the technical solutions of the present invention should be covered within the scope of the claims of the present invention without departing from the spirit and scope of the technical solutions of the present invention.

Claims

1. A multi-modal 2D and 3D facial expression recognition method based on orthogonal guide learning is characterized in that: the method comprises the steps of generating three attribute maps which are respectively a depth map, a direction map and an elevation map by using face point cloud data, synthesizing the depth map, the direction map and the elevation map into a three-channel RGB map, using the RGB map as the input of a branch in a network, and using two attribute maps as a feature extraction partIn the human face expression recognition task, images input into a CNN network are aligned through key points, each pixel in a deep feature map represents a specific area of an input image, input images of two channels are 2D gray level maps and 3D attribute maps from the same human face, and feature vectors V extracted by a feature extractor are respectively 2D face maps and 3D attribute maps ₁ And V ₂ Having redundancy, let V before feature fusion ₁ And V ₂ Passing through an orthogonal guide module to output a feature vector F ₁ And F ₂ Orthogonal to eliminate redundant part between two vectors, and the orthogonal guide module is composed of a layer of full connection layer and Relu layer, and the orthogonal guide modules are respectively divided into V ₁ And V ₂ As input, it is converted by the full connection layer and two orthogonal features F are output ₁ And F ₂ Designing a quadrature loss function L _orth To supervise updating of orthogonal bootstrap module weights to ensure F ₁ And F ₂ Orthogonality therebetween.

2. The multi-modal 2D and 3D facial expression recognition method based on orthogonal guide learning of claim 1, wherein: an orthogonality module is introduced to ensure that the features are orthogonal when the features are fused.

3. The multi-modal 2D and 3D facial expression recognition method based on orthogonal guide learning of claim 1, wherein: each feature map is provided with a weight map with the same size, the weight in the weight map is updated by gradient descent, and the output feature vector is calculated by the dot product of the feature map and the weight map.