CN115471885A - Action unit correlation learning method and device, electronic device and storage medium - Google Patents

Action unit correlation learning method and device, electronic device and storage medium

Info

Publication number
CN115471885A
Authority
CN
China
Prior art keywords
center
attention
adjacency matrix
matrix
relationship
Prior art date
Legal status
Pending
Application number
CN202211017801.XA
Other languages
Chinese (zh)
Inventor
苗瑞
周波
邹小刚
梁书玉
莫少锋
Current Assignee
Shenzhen HQVT Technology Co Ltd
Original Assignee
Shenzhen HQVT Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Shenzhen HQVT Technology Co Ltd filed Critical Shenzhen HQVT Technology Co Ltd
Priority to CN202211017801.XA priority Critical patent/CN115471885A/en
Publication of CN115471885A publication Critical patent/CN115471885A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/40 Extraction of image or video features
    • G06V 10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/764 Arrangements using pattern recognition or machine learning using classification, e.g. of video objects
    • G06V 10/82 Arrangements using pattern recognition or machine learning using neural networks
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V 40/174 Facial expression recognition

Abstract

The application provides an action unit (AU) correlation learning method and apparatus, an electronic device, and a storage medium, relating to the technical field of image processing. The method comprises the following steps: acquiring a face image, and determining local features corresponding to AU center points based on the face image; determining an attention score matrix of AU-center association relationships through an attention mechanism based on the local features, where an AU-center association relationship is an association relationship between AU center points; acquiring a first initial adjacency matrix and a second initial adjacency matrix of the AU-center association relationships, set for a spatial relationship and a non-spatial relationship respectively; and learning, through a potential relationship network, the correlation between the AU center points and the AUs based on the attention score matrix, the first initial adjacency matrix, and the second initial adjacency matrix. The method improves the accuracy of identifying whether an AU occurs.

Description

Action unit correlation learning method and device, electronic device and storage medium
Technical Field
The present application relates to the field of image processing technologies, and in particular, to an action unit correlation learning method and apparatus, an electronic device, and a storage medium.
Background
In the field of facial expression analysis, there are two main ways of labeling human expressions: labeling based on emotion categories (e.g., joy, anger, etc.) and labeling based on the Facial Action Coding System (FACS). The latter decomposes a facial expression into a number of different Action Units (AUs) according to facial muscle movement, giving a more objective and quantitative annotation, and is better suited to the synthesis and control of facial expressions.
Since it is impractical to explicitly define all AU combinations, prior knowledge needs to be incorporated into the AU detection process. Existing prior knowledge has contributed significantly to progress in AU detection, but it is constructed from the co-occurrence of training labels and therefore depends to a large extent on the type and number of AUs being modeled. In addition, the prior art does not account for the changes in appearance caused by AU co-occurrence when modeling the relationships among multiple AUs.
Therefore, the prior knowledge adopted by existing relationship learning methods depends on the training data, additional co-occurrences cannot be learned, and it ultimately remains very challenging for an independent deep learning model to infer whether an AU occurs.
Disclosure of Invention
The application provides an action unit correlation learning method and apparatus, an electronic device, and a storage medium, which are used to solve the technical problem that the prior knowledge adopted by existing relationship learning methods depends on the training data, that additional co-occurrences cannot be learned, and that it is consequently very challenging for an independent deep learning model to infer whether an AU occurs.
According to a first aspect of the present application, there is provided an action unit correlation learning method including:
acquiring a face image, and determining local features corresponding to face points based on the face image; wherein the face image comprises a plurality of action units (AUs), each AU comprises at least one face point, and each face point is an AU center point;
determining an attention score matrix of AU center association relation through an attention mechanism based on the local features; the AU center association relationship is an association relationship between the AU center points, and elements in the attention score matrix are used for reflecting attention of the association relationship between the AU center points;
acquiring a first initial adjacency matrix and a second initial adjacency matrix of the AU-center association relationships, set for a spatial relationship and a non-spatial relationship respectively;
learning, by a potential relationship network, a correlation between the AU center point and the AU based on the attention score matrix, the first initial adjacency matrix, and the second initial adjacency matrix.
According to a second aspect of the present application, there is provided an action unit correlation learning apparatus including:
an acquisition determining module, configured to acquire a face image and determine local features corresponding to face points based on the face image; wherein the face image comprises a plurality of action units (AUs), each AU comprises at least one face point, and each face point is an AU center point;
a determining module, configured to determine an attention score matrix of AU-center association relationships through an attention mechanism based on the local features; wherein an AU-center association relationship is an association relationship between AU center points, and elements in the attention score matrix reflect the attention of the association relationships between the AU center points;
an acquiring module, configured to acquire a first initial adjacency matrix and a second initial adjacency matrix of the AU-center association relationships, set for a spatial relationship and a non-spatial relationship respectively;
a learning module, configured to learn, through a potential relationship network, the correlation between the AU center points and the AUs based on the attention score matrix, the first initial adjacency matrix, and the second initial adjacency matrix.
According to a third aspect of the present application, there is provided an electronic device comprising: at least one processor and memory;
the memory stores computer-executable instructions;
the at least one processor executing the computer-executable instructions stored by the memory causes the at least one processor to perform the action unit correlation learning method as described above in the first aspect.
According to a fourth aspect of the present application, there is provided a computer-readable storage medium having stored therein computer-executable instructions which, when executed by a processor, implement the action unit correlation learning method of the first aspect described above.
According to the action unit correlation learning method and apparatus, electronic device, and storage medium provided by the application, the acquired first initial adjacency matrix and second initial adjacency matrix are set for a spatial relationship and a non-spatial relationship respectively, so that more co-occurrences can be learned through the different relationship types. Meanwhile, the attention score matrix determined from the local features and the attention mechanism is used as side information to update the structural information represented by the different initial adjacency matrices, so that the potential relationship network can be trained on the true association relationships between the AU centers and the AUs, and whether an AU occurs can be accurately identified.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present application, nor do they limit the scope of the present application. Other features of the present application will become apparent from the following description.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present application and together with the description, serve to explain the principles of the application.
Fig. 1 is a schematic diagram of an application scenario related to an embodiment of the present application;
fig. 2 is a schematic structural diagram of an action unit correlation learning model according to an embodiment of the present disclosure;
fig. 3 is a schematic flowchart of a method for learning correlation of action units according to an embodiment of the present disclosure;
fig. 4 is a schematic structural diagram of an action unit correlation learning apparatus according to an embodiment of the present disclosure;
fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
The above figures show specific embodiments of the present application, which are described in more detail below. The drawings and written description are not intended to limit the scope of the inventive concept in any manner, but rather to illustrate the concept of the application to those skilled in the art with reference to specific embodiments.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present application.
For ease of understanding, an application scenario of the embodiment of the present application is first described.
Fig. 1 is a schematic diagram of an application scenario related to an embodiment of the present application. As shown in fig. 1, the application scenario of this embodiment relates to AU relationship learning, which refers to learning the associations between different AUs. In the prior art, there are mainly two methods for AU relationship learning: one is local relationship modeling, which selects a set of discriminative key points for each AU by allowing group sparsity; the other is AU relationship modeling based on co-occurrence statistics, which can be subdivided into three types: (a) Bayesian networks and deep Bayesian networks that incorporate knowledge of AU co-occurrence statistics; (b) knowledge-graph modeling of AU semantic relationships, in which, due to dataset limitations, not all important AU relationships can be captured in the training data, so some AU relationships must be added manually in the form of edges; and (c) gated graph neural networks, which learn correlations through a knowledge graph and predict the corresponding AU relationships.
Two major drawbacks of the existing AU relationship learning methods are as follows. First, these methods incorporate prior knowledge at various stages of the AU detection pipeline, since it is impractical to explicitly define all AU combinations. While prior knowledge contributes significantly to progress in AU detection, it is constructed from the co-occurrence of training labels and depends to a large extent on the type and number of AUs modeled. Second, modeling the relationships of multiple co-occurring AUs does not account for the changes in appearance caused by AU co-occurrence. In particular, inferring AU relationships with independent deep learning models is extremely challenging given the more than 7000 unique combinations of AUs, the lack of labeled data covering all combinations, and the unbalanced distribution of AUs in spontaneous datasets.
To solve the problem that the prior knowledge adopted by existing AU relationship modeling depends on the training data or on the AU set, the application provides an action unit correlation learning method based on a potential multi-relation graph, applied in the field of image processing. The method introduces a potential multi-relation graph (which can be understood as the graph jointly reflected by the first initial adjacency matrix corresponding to the spatial relationship and the second initial adjacency matrix corresponding to the non-spatial relationship; the representation form of this graph structure is the feature result between the AU center points and the AUs) and learns the potential correlations between AU center points from data instead of manually defining the graph structure. By combining facial morphology, prior knowledge is extracted in the form of a graph structure through a potential region graph (i.e., the graph represented by the attention score matrix) that is independent of training statistics, and local relationships (relationships between local features) that are independent of the modeled correlation between the AU center points and the AUs are constructed. The structure of the multi-relation graph in the method consists of two initial adjacency structures that reflect strong prior knowledge of facial morphology; the potential edge strengths are learned from the local neighborhood saliency, and these edge strengths are used to learn the correlation between the AU center points and the AUs, so that transfer learning across datasets can be better exploited.
As shown in fig. 2, the structure of the action unit correlation learning model employed in the application may be as follows: three main modules for backbone feature extraction (or global feature extraction), local feature extraction, and correlation learning (i.e., relationship learning), respectively, after which a classification module can be arranged for AU activation classification. The specific processes of these four modules are described in the embodiments below.
The following describes the technical solution of the present application and how to solve the above technical problems in detail by specific embodiments. These several specific embodiments may be combined with each other below, and details of the same or similar concepts or processes may not be repeated in some embodiments. Embodiments of the present application will be described below with reference to the accompanying drawings.
Fig. 3 is a flowchart illustrating an action unit correlation learning method according to an embodiment of the present disclosure. As shown in fig. 3, the method of the present embodiment includes:
S301: acquiring a face image, and determining local features corresponding to face points based on the face image; wherein the face image comprises a plurality of action units (AUs), each AU comprises at least one face point, and each face point is an AU center point.
An action unit is also called a facial action unit or AU action, and the AU center point is also referred to as the AU center. The FACS standard manual decomposes facial movement into 26 AU actions, and each AU action can be decomposed into local motions of several face points. For example, the smiling action involves local changes at the two mouth corners, the center of the upper lip, and the center of the lower lip; detecting the occurrence of these local changes makes it possible to identify whether the smiling action has occurred.
To facilitate subsequent processing, after the face image is acquired, unified processing may be performed on all input face images, including but not limited to: cropping the face image and performing face alignment in a uniform manner, so that all faces are aligned to a common position. After processing, the face image may be resized to 170 × 170 pixels. The processing may further include data augmentation such as rotation and horizontal flipping, as sketched below.
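A minimal sketch of this preprocessing, for illustration only (it is not part of the patent text); the five-point alignment template, the rotation range, and the function names are assumptions, using OpenCV:

```python
# Hypothetical preprocessing sketch: align each face to a common position,
# resize to 170x170, and apply the rotation/flip augmentation mentioned above.
import cv2
import numpy as np

def align_face(img: np.ndarray, landmarks5: np.ndarray,
               template5: np.ndarray, size: int = 170) -> np.ndarray:
    """Warp img with a similarity transform so five key landmarks
    (eyes, nose tip, mouth corners) match a fixed template."""
    M, _ = cv2.estimateAffinePartial2D(landmarks5.astype(np.float32),
                                       template5.astype(np.float32))
    return cv2.warpAffine(img, M, (size, size))

def augment(img: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Random horizontal flip and a small random rotation (range assumed)."""
    if rng.random() < 0.5:
        img = cv2.flip(img, 1)
    angle = rng.uniform(-15.0, 15.0)
    h, w = img.shape[:2]
    M = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
    return cv2.warpAffine(img, M, (w, h))
```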
It should be understood that, in the present application, the local features corresponding to the face points may be directly determined through the face images, and the local features corresponding to the face points may also be indirectly determined through the face images, where the indirect determination manner is similar to the following manners S401 to S402, and details are not described here.
S302: determining an attention score matrix of AU center association relation through an attention mechanism based on the local features; wherein the AU center association relationship is an association relationship between the AU center points, and elements in the attention score matrix are used for reflecting attention of the association relationship between the AU center points; it should be understood that the specific implementation manner of S302 is similar to that of S502 to S503, and is not described herein again.
S303: acquiring a first initial adjacency matrix and a second initial adjacency matrix of an AU center association relation set for a spatial relation and a non-spatial relation respectively;
in the embodiment of the application, based on the public prior knowledge, the edges of the potential relationship graph corresponding to the association relationship of the AU center are divided into two types: spatial relationships and non-spatial relationships, the partitioning of relationship types is essential to provide the learning model with the ability to focus on the particular strength of learning the corresponding relationships. Further, the spatial relationship and the non-spatial relationship are analyzed as follows:
the spatial relationship is as follows: the embodiment of the application divides the human face into an upper area and a lower area. The spatial relationship corresponds to a spatial map. Two AUs under the spatial relationship are interesting regions ROI which are close to each other in a connection space and belong to an upper face region or a lower face region at the same time. That is, the present application divides the ROI corresponding to the center of the AU defined in table 2 into two sets S1 (regions 1-6,9, 10) and S2 (regions 7,8, 11-20), which correspond to the upper region and the lower region, respectively, which are Strongly Connected Components (SCC) in a spatial map, and each element in the first initial adjacency matrix may be defined as equation (1):
Figure 984841DEST_PATH_IMAGE001
(1)
the non-spatial relationship: the non-spatial relationship corresponds to a non-spatial graph which is a bipartite graph, and the non-spatial graph is used for connecting the discrete return on investment rates among nodes far away from each other, so that the co-occurrence between two remote AU centers in emotional expression (namely expression) can be embodied and used as the non-spatial relationship based on emotion. Each element in the second initial adjacency matrix may be defined as formula (2):
Figure 668764DEST_PATH_IMAGE002
(2)
from the above, the first initial adjacency matrix corresponds to a spatial map, the second initial adjacency matrix corresponds to a non-spatial map, and the non-spatial map and the spatial map can form a feature result between the AU central point and the AU corresponding to the potential multi-relation map in S304.
In the embodiment of the present application, the first initial adjacency matrix and the second initial adjacency matrix are both a priori knowledge map matrices related to facial motion morphology. That is, knowledge of facial morphology is incorporated into the present application, which can create spatial and non-spatial maps based on the morphology of facial muscle linkages and the co-occurrence of certain AUs in human expressions. Correspondingly, the two types of initial adjacency matrixes with different spatial relationship types are obtained according to the formulas (1) and (2), and a data basis is provided for obtaining spatial relationship features and non-spatial relationship features through the following steps S604-S605.
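For illustration, the two initial adjacency matrices of equations (1) and (2), as reconstructed above, can be built directly from the two ROI index sets given in the text; this sketch assumes 1-based region indices:

```python
import numpy as np

S1 = {1, 2, 3, 4, 5, 6, 9, 10}                       # upper-face AU-center ROIs
S2 = {7, 8, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20}  # lower-face AU-center ROIs
N = 20

A1 = np.zeros((N, N))  # first initial adjacency matrix, eq. (1): spatial graph
A2 = np.zeros((N, N))  # second initial adjacency matrix, eq. (2): bipartite graph
for i in range(1, N + 1):
    for j in range(1, N + 1):
        same_half = (i in S1 and j in S1) or (i in S2 and j in S2)
        A1[i - 1, j - 1] = 1.0 if same_half else 0.0
        A2[i - 1, j - 1] = 0.0 if same_half else 1.0
```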
S304: learning, by a potential relationship network, a correlation between the AU center point and the AU based on the attention score matrix, the first initial adjacency matrix, and the second initial adjacency matrix. It should be understood that the specific implementation manner of S304 is similar to that of S604 to S605 described below, and the detailed description thereof is omitted here.
As can be seen from the descriptions of S301 to S304, in the action unit correlation learning method provided by the embodiment of the application, because the acquired first initial adjacency matrix and second initial adjacency matrix are set for a spatial relationship and a non-spatial relationship respectively, more co-occurrences are learned through the different relationship types. Meanwhile, the attention score matrix determined from the local features and the attention mechanism is used as side information to update the structural information represented by the different initial adjacency matrices, so that the potential relationship network can accurately identify the true association relationships between the AU center points and the AUs, and thus accurately identify whether an AU occurs. That is to say, in the embodiment of the application, with 20 specified AU centers, changes in the 20 AU centers can be detected, a relationship network between the 20 AU centers and the 26 AUs is trained, and the final overall model can accurately identify whether each of the 26 AUs occurs.
In one possible implementation, after learning, at S304, the correlation between the AU center points and the AUs through the potential relationship network based on the attention score matrix, the first initial adjacency matrix, and the second initial adjacency matrix, the method further includes S305: inputting the feature result between the AU center points and the AUs into a classification layer, so as to output a classification result representing whether each AU occurs.
Since there are a plurality of action units, the classification scenario belongs to multi-label classification. On this basis, the function applied to the multi-label classification is not specifically limited in the embodiment of the application. As can be seen from fig. 2, the classification module in the embodiment of the application may be a sigmoid-activated classification module; this module executes S305, and the obtained classification result reflects whether each action unit occurs.
On the basis of the above embodiments, the technical solutions of the present application are described in more detail below with reference to several specific embodiments. The present embodiment focuses on refining S301 in fig. 3. The method of the embodiment comprises the following steps:
S401: extracting the global features of the face image.
In an embodiment of the present application, the global features include at least one of: color features, texture features, and shape features. The model type used for extracting the global features is not specifically limited in the application. Deep convolutional neural networks exhibit superior performance in vision-based tasks, because a series of convolutional layers can extract hierarchical information from a high-dimensional input space and obtain an optimal representation for a given task. Therefore, to extract a preliminary global representation of the face image, a DenseNet network can be adopted. The specific DenseNet121 architecture is exemplified as follows:
The backbone network DenseNet121 belongs to the class of network architectures that use dense skip connections, and comprises the blocks Block-1, Block-2, and Block-3, whose structural parameters can be as shown in Table 1:
TABLE 1 Global feature network architecture parameters
(The structural parameters of Block-1, Block-2, and Block-3 are given as an image in the original publication.)
More specifically, within any one block, each dense layer is connected to its subsequent layers; the purpose of connecting the initial layers to the subsequent layers is to facilitate information flow between the layers. If $H_l$ denotes the set of non-linear functions applied in the $l$-th layer, then the output of layer $l$ is given by equation (3):

$$x_l = H_l([x_0, x_1, \ldots, x_{l-1}]) \quad (3)$$
the application can extract features from DenseNet121 in Block-3 behind the pooling layer. Exemplary, firstiThe size of the global feature Fi of the human face image is C multiplied by W multiplied by H, wherein C is the total number of channels, and W multiplied by H is the space size of the feature. When the size of the global feature Fi is 256 × 10 × 10, the spatial size of each feature is 10 × 10, 256 is the total number of channels, and 10 × 10 is the feature size of each pixel at Block-3. In the global feature extraction stage, the global feature extraction network (e.g., denseNet 121) can treat all regions of the face equally, including the background. The extracted global features may need further processing as intermediate features to understand the subtle movements of the facial muscles. The information of the global features is transmitted to the local feature extraction module in S402 for learning the kernel of the region of interest, and the specific implementation is as shown in S402 below:
s402, extracting local features corresponding to the face points based on the global features.
In the present embodiment, S402 (extracting local features corresponding to the face points based on the global features) may include the following two steps. S4021: with each face point as the center, selecting a region of a preset range from the face image constructed by the global features as a region of interest. S4022: cropping the region of interest, and performing a max-pooling operation on the cropped region of interest to extract the local features corresponding to the face points.
The above-mentioned face points are short for face key points (landmarks). Before local feature extraction, the output of the DenseNet network can be scaled up from 10 × 10 to 20 × 20, with the total number of channels unchanged at 256. The local feature extraction module employed in S402 may be the ROI-net in fig. 2. The application may take the positions of the N = 20 main face key points as the positions of the AU centers. That is, the application may acquire the center position of each AU based on the face key-point information, as shown in Table 2:
TABLE 2 Positions of the centers of the 20 AUs

AU center | Position
[1, 2]    | proportional position 1 above the inner eyebrow
[3, 4]    | 1/3 proportional position above the outer eyebrow
[5, 6]    | 1/3 proportional position above the outer eyebrow
[7, 8]    | middle position below the eye
[9, 10]   | center of the eye
[11, 12]  | upper mouth corner
[13, 14]  | lower mouth corner
[15, 16]  | middle of the lips
[17, 18]  | proportional position 1 above the upper lip
[19, 20]  | proportional position 1 below the lower lip
In the embodiment of the present application, S4021 may be understood as follows: around each AU center, a bounding-box region of a preset range is defined as its region of interest (ROI). The preset range can be set by the user, and is, for example, 4 × 4. S4022 can be understood as cropping the ROIs from the global feature $F_i$; the input of each local feature extraction module is $r_i$ of size 4 × 4 × 256, where $R_i = \{r_1, r_2, \ldots, r_{20}\}$ and 20 is the number of regions of interest. After cropping, a single-kernel convolution filter with a window of 3 may be applied to perform a max-pooling operation on all regions of interest to extract the local feature $x_i$ corresponding to each region of interest, with size 1 × 256, $i = 1, 2, \ldots, 20$. The regions of interest guide the potential relationship network to pay more attention to the regions related to AU detection, so this selection process is an important stage in combining AU-center association learning with attention.
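A sketch of this ROI cropping and pooling, assuming the 20 × 20 upsampled feature map described above; the crop geometry and the single window-3 convolution follow the text, the rest is an assumption:

```python
import torch

class ROINet(torch.nn.Module):
    """Crop a 4x4 ROI around each of the 20 AU centers on a (256, 20, 20)
    feature map, apply one window-3 convolution, and max-pool to 256-d."""
    def __init__(self):
        super().__init__()
        self.conv = torch.nn.Conv2d(256, 256, kernel_size=3)

    def forward(self, F: torch.Tensor, centers: torch.Tensor) -> torch.Tensor:
        feats = []
        for cy, cx in centers.long():
            y0 = int(cy.clamp(2, F.shape[1] - 2)) - 2   # keep the 4x4 box in bounds
            x0 = int(cx.clamp(2, F.shape[2] - 2)) - 2
            roi = F[:, y0:y0 + 4, x0:x0 + 4].unsqueeze(0)  # (1, 256, 4, 4)
            h = self.conv(roi)                              # (1, 256, 2, 2)
            feats.append(h.amax(dim=(2, 3)))                # max pool -> (1, 256)
        return torch.cat(feats, dim=0)                      # (20, 256)
```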
S403: determining an attention score matrix of AU center association relation through an attention mechanism based on the local features; wherein the AU center association relationship is an association relationship between the AU center points, and elements in the attention score matrix are used for reflecting attention of the association relationship between the AU center points; it should be understood that the specific implementation manner of S403 is similar to that of S502 to S503, and is not described herein again.
S404: acquiring a first initial adjacency matrix and a second initial adjacency matrix of an AU center association relation set for a spatial relation and a non-spatial relation respectively; it should be understood that the specific implementation manner of S404 is similar to that of S303 described above, and is not described herein again.
S405: learning, by a potential relationship network, a correlation between the AU center point and the AU based on the attention score matrix, the first initial adjacency matrix, and the second initial adjacency matrix. It should be understood that the specific implementation manner of S405 is similar to that of S604 to S605 described below, and is not described herein again.
The global features of the face image can be extracted through S401; these global features retain the overall attributes of the face image. On this basis, the local features can be further extracted through S402 to provide the node and side information used to obtain the attention score matrix.
The present embodiment focuses on refining S302 in fig. 3. The method of the embodiment comprises the following steps:
S501: acquiring a face image, and determining local features corresponding to face points based on the face image; wherein the face image comprises a plurality of action units (AUs), each AU comprises at least one face point, and each face point is an AU center point. It should be understood that the specific implementation of S501 is similar to S401 to S402 and is not repeated here.
S502: determining queries, keys, and values of the AU-center association relationships through an attention mechanism based on the local features. The formulas used by the attention mechanism are as follows:

$$Q = x W_q, \quad K = x W_k, \quad V = x W_v$$

where $Q$, $K$, and $V$ respectively represent the query, key, and value of each AU-center association relationship; $x$ is the output of the local feature extraction module (i.e., the local features), with size 20 × 256; and $W_q$, $W_k$, and $W_v$ are Gaussian-initialized matrices of size 256 × 20.
S503: determining the attention score matrix of the AU-center association relationships according to the queries, keys, and values of the AU-center association relationships, as shown in equation (4):

$$E = \mathrm{softmax}\!\left(\frac{Q K^{T}}{\sqrt{n}}\right) V \quad (4)$$

where $E$ is the attention score matrix of the AU-center association relationships and $n = 20$. Because softmax is a normalized exponential function, $E$ makes the output of the scaled dot-product attention module a weighted sum of the values.
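A sketch of S502 to S503 under the reconstruction above; the Gaussian initialization and the 256 × 20 projection sizes follow the text, and everything else is an assumption:

```python
import torch

def attention_score_matrix(x: torch.Tensor, Wq: torch.Tensor,
                           Wk: torch.Tensor, Wv: torch.Tensor) -> torch.Tensor:
    """x: (20, 256) local features; Wq/Wk/Wv: (256, 20) projections.
    Returns E, the attention score matrix of AU-center associations (eq. 4)."""
    Q, K, V = x @ Wq, x @ Wk, x @ Wv          # each (20, 20)
    n = Q.shape[0]                             # n = 20 AU centers
    scores = torch.softmax(Q @ K.T / n ** 0.5, dim=-1)
    return scores @ V                          # weighted sum of the values

x = torch.randn(20, 256)                       # stand-in local features
Wq, Wk, Wv = (torch.randn(256, 20) for _ in range(3))  # Gaussian init per the text
E = attention_score_matrix(x, Wq, Wk, Wv)      # (20, 20)
```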
S504: acquiring a first initial adjacency matrix and a second initial adjacency matrix of an AU center association relation set for a spatial relation and a non-spatial relation respectively; it should be understood that the specific implementation manner of S504 is similar to that of S303 described above, and is not described herein again.
S505: learning, by a potential relationship network, a correlation between the AU center point and the AU based on the attention score matrix, the first initial adjacency matrix, and the second initial adjacency matrix. It should be understood that the specific implementation manner of S505 is similar to that of S604 to S605 described below, and is not described herein again.
As can be seen from the descriptions of S502 to S503, the present embodiment provides a specific way of combining the local features with the attention mechanism, and a specific formula for determining the attention score matrix of the AU-center association relationships, thereby ensuring that attention to the regions of interest is realized through the attention score matrix.
The present embodiment focuses on refining S304 in fig. 3. The method of the embodiment comprises the following steps:
S601: acquiring a face image, and determining local features corresponding to face points based on the face image; wherein the face image comprises a plurality of action units (AUs), each AU comprises at least one face point, and each face point is an AU center point. It should be understood that the specific implementation of S601 is similar to S401 to S402 and is not repeated here.
S602: determining an attention score matrix of AU center association relation through an attention mechanism based on the local features; wherein the AU center association relationship is an association relationship between the AU center points, and elements in the attention score matrix are used for reflecting attention of the association relationship between the AU center points; it should be understood that the specific implementation manner of S602 is similar to that of S502 to S503, and is not described herein again.
After S602, as can be seen from fig. 2, the attention score matrix $E$ undergoes a dot-product operation with the first initial adjacency matrix or the second initial adjacency matrix acquired in S603, and the dot-product result corresponding to each initial adjacency matrix is used as an input for learning the spatial graph and the non-spatial graph; these two types of potential graphs finally form the potential multi-relation graph. The side information of the potential multi-relation graph is the feature result of the AU-center association relationships. The nodes of the potential multi-relation graph are defined as N = 20 and correspond to the 20 local features output in the local feature extraction stage (representations of the regions of interest in the salient-region set). In each potential graph, all nodes constitute $(x_1, x_2, \ldots, x_{20})$, and each node is a 256-dimensional vector.
S603: acquiring a first initial adjacency matrix and a second initial adjacency matrix of the AU-center association relationships, set for a spatial relationship and a non-spatial relationship respectively; it should be understood that the specific implementation of S603 is similar to S303 above and is not repeated here.
S604: when the potential relationship network comprises a scaled dot-product attention module and a graph convolutional neural network module connected in sequence, calculating, by the scaled dot-product attention module, a first dot-product result of the attention score matrix and the first initial adjacency matrix, and a second dot-product result of the attention score matrix and the second initial adjacency matrix, as shown in equation (5):

$$X_r = (E \odot A_r)\, x \quad (5)$$

where $E$ is the attention score matrix of the AU-center association relationships and $A_r$ is the first initial adjacency matrix or the second initial adjacency matrix. When $A_r$ is the first initial adjacency matrix, $X_r$ is the first dot-product result; when $A_r$ is the second initial adjacency matrix, $X_r$ is the second dot-product result. Because $x$ is 20 × 256, $X_r$ is also 20 × 256.
The embodiment of the application uses an SDP (Scaled Dot-Product) attention module. In each potential graph, $E \odot A_r$ can be understood as an adjacency matrix among the 20 nodes, whose side information indicates the strength of the association relationships. That is to say, each piece of side information in $E$ can be used as a weight to adjust the strength of the corresponding association in $A_r$, so as to obtain a new association strength.
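Equation (5), as reconstructed above, reduces to one line; the element-wise product re-weights the prior adjacency with the learned attention before propagating it onto the node features:

```python
import torch

def sdp_edge_features(E: torch.Tensor, A_r: torch.Tensor,
                      x: torch.Tensor) -> torch.Tensor:
    """E, A_r: (20, 20); x: (20, 256). E * A_r acts as a re-weighted adjacency
    matrix, whose product with x gives X_r of size (20, 256), per eq. (5)."""
    return (E * A_r) @ x
```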
It is an object of the application to infer the potential multi-relation graph from visual data (i.e., the first initial adjacency matrix and the second initial adjacency matrix) and to learn parameterized models (i.e., the first and second graph convolutional neural networks described below). In the work of the application, the learning of the potential multi-relation graph can be imagined as starting from the spatial graph represented by the first initial adjacency matrix and the non-spatial graph represented by the second initial adjacency matrix; the task is then, for each potential graph, to learn the potential edge weights that represent the potential pairwise relationships between action units. The spatial graph and the non-spatial graph are graphs with a multi-edge set and a fixed node set.
S605: determining a feature result between the AU center point and the AU through a graph convolution neural network module based on the first dot product result and the second dot product result; wherein the feature result is used to reflect a correlation between the AU center point and the AU.
In the present embodiment, S605 (determining, by the graph convolutional neural network module, a feature result between the AU center points and the AUs based on the first dot-product result and the second dot-product result) includes:
S6051: when the graph convolutional neural network module comprises a first graph convolutional neural network and a second graph convolutional neural network connected in parallel, inputting the first dot-product result into the first graph convolutional neural network, learning the correlation between the AU center points and the AUs under the spatial relationship type through the first graph convolutional neural network, and outputting spatial relationship features. The first and second graph convolutional neural networks are both potential graph networks with hidden layers; for example, a potential graph network may include an input layer of 20 nodes, two hidden layers of 256 nodes, and an output layer of 50 nodes.
S6052: inputting the second dot-product result into the second graph convolutional neural network, learning the correlation between the AU center points and the AUs under the non-spatial relationship type through the second graph convolutional neural network, and outputting non-spatial relationship features. As used here, the inputs of the first and second graph convolutional neural networks are the first and second dot-product results, respectively. The two dot-product results are the potential edges, and modeling the local features with attention can improve the performance of the corresponding graph convolutional neural networks.
In the present application, a potential graph, also referred to as a relational GCN network, has relationship strengths and distinguishes relationship types in addition to having nodes. The purpose of the first and second graph convolutional neural networks is to train the side information (i.e., the relationship strengths) in the potential graph under the spatial relationship type and the non-spatial relationship type, respectively. In one example, taking the first graph convolutional neural network, its implementation is analyzed as follows:
The first dot-product result $E \odot A_1$ represents structural information in the form of an adjacency matrix, obtained as the dot product of the input $E$ and the facial morphology matrix $A_1$; it can serve as the adjacency matrix of the first graph convolutional neural network. Further, the output $H^{(l+1)}$ of each layer in the first graph convolutional neural network can be described as the result of applying a function $f$ to the previous layer's output $H^{(l)}$ and the adjacency matrix $\hat{A} = E \odot A_1$, as shown in equations (6) to (7):

$$H^{(l+1)} = f\left(H^{(l)}, \hat{A}\right) \quad (6)$$

$$f\left(H^{(l)}, \hat{A}\right) = \sigma\left(\hat{A}\, H^{(l)} W^{(l)}\right) \quad (7)$$

where $H^{(l+1)}$ is the output of the next layer, $H^{(l)}$ is the output of the current layer, $W^{(l)}$ is the weight of the $l$-th layer, $X$ is the input of layer 0 (the original node feature annotations), and $Z$ is the output of the last layer. The purpose is to learn, from the data, the association relationships of each action unit under the spatial relationship.
The embodiment of the present application may adopt the propagation rule of equation (8) to obtain the output $h_v^{(l+1)}$ of each node $v$ at the $(l+1)$-th layer:

$$h_v^{(l+1)} = \rho\left(\sum_{u \in N(v)} \frac{1}{c_{vu}}\, h_u^{(l)} W^{(l)}\right) \quad (8)$$

where $\rho$ is the nonlinear activation function ReLU (Rectified Linear Unit), which activates part of the neurons and increases sparsity: when $x < 0$ the output value is 0, and when $x > 0$ the output value is $x$; $c_{vu}$ is a normalization constant defined from the adjacency matrix $\hat{A}$ and the degree matrix $D$; and $W^{(l)}$ is a weight kernel shared by all edges in layer $l$, thus representing a single type of relationship, which is obtained by training the model. The execution process of the second graph convolutional neural network is similar to that of the first and is not repeated here.
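A sketch of the layer rule of equations (6) to (8) as reconstructed above; folding the normalization constant into a row normalization of the adjacency matrix is an assumption:

```python
import torch

class GCNLayer(torch.nn.Module):
    """One layer of eq. (7): H' = rho(A_hat H W), with rho = ReLU.
    The constant 1/c of eq. (8) is approximated here by dividing each row
    of A_hat by the node degree; a sketch, not the patented implementation."""
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.W = torch.nn.Linear(in_dim, out_dim, bias=False)

    def forward(self, H: torch.Tensor, A_hat: torch.Tensor) -> torch.Tensor:
        deg = A_hat.sum(dim=-1).clamp(min=1.0)   # degree of each node
        A_norm = A_hat / deg.unsqueeze(-1)       # row-normalized adjacency
        return torch.relu(A_norm @ self.W(H))
```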
To embody the two different relationships (the spatial relationship and the non-spatial relationship), the embodiment of the application lets the function learn a weight for each edge type, so that the different relationships can be learned. In the embodiment of the application, $h_v^{(l+1)}$ can therefore be expressed as equation (9):

$$h_v^{(l+1)} = \rho\left(\beta\, W_0^{(l)} h_v^{(l)} + (1-\beta) \sum_{r \in R} \sum_{u \in N_r(v)} \frac{1}{c_{v,r}}\, W_r^{(l)} h_u^{(l)}\right) \quad (9)$$

where $\beta$ is a hyperparameter, $\beta$ and $1-\beta$ are the weights of the respective terms, $W_0$ represents the node's self-centered relationship, $h_v^{(l)}$ is the output of each node $v$ at the $l$-th layer, and $N_r(v)$ denotes the set of indices of the neighbor nodes connected to node $v$ by an edge of relation type $r$ ($r \in R$), which can be understood as $r = 1$ or $r = 2$: when $r = 1$, the formula corresponds to a layer output of the first graph convolutional neural network; when $r = 2$, it corresponds to a layer output of the second graph convolutional neural network.
In addition to learning the spatial and non-spatial relationships, the application also learns the weight $W_0$, which represents the node's self-centered relationship, so that the neighborhood information and the node's own information are balanced by the hyperparameter $\beta$. The relational GCN network thus helps define the neighborhood information of a node in a particular context.
S6053: splicing the spatial relationship features and the non-spatial relationship features to obtain the feature result between the AU center points and the AUs. The feature result is a representation form of the potential multi-relation graph.
For activation of the classification task, the application outputs the representation of the set of all nodes in each of the two potential spaces (spatial and non-spatial). When detecting each facial action AU, multiple nodes (i.e., AU centers) may contribute differently to each AU, so the final representation Z is the concatenation of all node features (20 in the above embodiment) rather than a merged one, and no pooling is required. Detection of whether each of the 26 AUs in FACS occurs can then be achieved by sigmoid-activated classification.
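A sketch of this classification head, assuming the 50-dimensional node outputs from the earlier example; the single linear layer mapping the concatenated embedding to the 26 AUs is an assumption:

```python
import torch

class AUClassifier(torch.nn.Module):
    """Concatenate the 20 node embeddings from both potential graphs (no
    pooling) and predict the 26 FACS AUs with a sigmoid multi-label head."""
    def __init__(self, node_dim: int = 50, n_nodes: int = 20, n_aus: int = 26):
        super().__init__()
        self.fc = torch.nn.Linear(2 * n_nodes * node_dim, n_aus)

    def forward(self, Z_spatial: torch.Tensor,
                Z_nonspatial: torch.Tensor) -> torch.Tensor:
        Z = torch.cat([Z_spatial.flatten(-2), Z_nonspatial.flatten(-2)], dim=-1)
        return torch.sigmoid(self.fc(Z))   # per-AU occurrence probabilities
```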
The action unit correlation learning method of the application is based on a potential multi-relation graph, and the adopted model comprises a global feature extraction stage, a local feature extraction stage, a relationship learning stage, and a classification layer. The application establishes facial muscle relationships for each region of interest based on a mixed attention mechanism, and learns the facial morphology prior knowledge (the first initial adjacency matrix and the second initial adjacency matrix) using first-order approximations in the graph convolutional neural networks. The backbone network DenseNet121 for extracting global features processes the input image to extract global spatial features (i.e., an intermediate representation of the input image) and feeds them into ROI-net for local feature construction. The relationship learning module uses a two-layer relational GCN network to learn a potential multi-relation graph with two sets of edges (spatial and non-spatial). The SDP attention module calculates the strength of the potential correlations over each ROI in the form of a strength matrix; the final face embedding is the concatenation (i.e., feature splicing) of the individual node features, which is fed into the classification layer for AU prediction. In summary, the embodiment of the application learns the potential edge strengths from the local neighborhood saliency by combining the node information and the structural prior knowledge of the local regions (i.e., the regions of interest, also called attention regions); meanwhile, by modeling the dot-product attention conveyed by the local-region information in a multi-relation message-passing neural network (i.e., the potential relationship network), the relationship-strength learning model and the graph-structure learning model can be separated using the strong prior knowledge of facial morphology.
Fig. 4 is a schematic structural diagram of an action unit correlation learning apparatus according to an embodiment of the present disclosure. The apparatus of the present embodiment may be in the form of software and/or hardware. As shown in fig. 4, the action unit correlation learning apparatus provided in this embodiment includes: an acquisition determination module 41, a determination module 42, an acquisition module 43, and a learning module 44, wherein:
an acquisition determining module 41, configured to acquire a face image and determine local features corresponding to face points based on the face image; wherein the face image comprises a plurality of action units (AUs), each AU comprises at least one face point, and each face point is an AU center point;
a determining module 42, configured to determine an attention score matrix of AU-center association relationships through an attention mechanism based on the local features; wherein an AU-center association relationship is an association relationship between AU center points, and elements in the attention score matrix reflect the attention of the association relationships between the AU center points;
an obtaining module 43, configured to obtain a first initial adjacency matrix and a second initial adjacency matrix of the AU-center association relationships, set for a spatial relationship and a non-spatial relationship respectively;
a learning module 44, configured to learn, through a potential relationship network, the correlation between the AU center points and the AUs based on the attention score matrix, the first initial adjacency matrix, and the second initial adjacency matrix.
In one possible implementation, the learning module 44 includes: a calculation unit and a first determination unit, wherein:
a calculating unit, configured to calculate, by the scaled dot product attention module, a first dot product result of the attention score matrix and the first initial adjacency matrix, and a second dot product result of the attention score matrix and the second initial adjacency matrix when the potential relationship network includes a scaled dot product attention module and a graph convolution neural network module that are connected in sequence;
a first determining unit, configured to determine, by the graph convolutional neural network module, a feature result between the AU center points and the AUs based on the first dot-product result and the second dot-product result; wherein the feature result is used to reflect the correlation between the AU center points and the AUs.
In a possible implementation manner, the first determining unit includes: a first learning subunit, a second learning subunit, and a stitching subunit, wherein:
a first learning subunit, configured to, when the graph convolutional neural network module comprises a first graph convolutional neural network and a second graph convolutional neural network connected in parallel, input the first dot-product result into the first graph convolutional neural network, learn, through the first graph convolutional neural network, the correlation between the AU center points and the AUs under the spatial relationship type, and output spatial relationship features;
a second learning subunit, configured to input the second dot product result to the second graph convolution neural network, learn, by using the second graph convolution neural network, a correlation between the AU central point having a non-spatial relationship type and the AU, and output a non-spatial relationship feature;
and the splicing subunit is used for splicing the spatial relation characteristic and the non-spatial relation characteristic to obtain a characteristic result between the central point of the AU and the AU.
In one possible implementation manner, the obtaining determination module 41 includes a first extraction unit and a second extraction unit, where:
the first extraction unit is used for extracting the global features of the face image;
a second extraction unit configured to extract a local feature corresponding to the face point based on the global feature.
In a possible implementation manner, the second extraction unit includes a selection subunit and a clipping operation subunit, where:
the selection subunit is configured to select, with each face point as the center, a region of a preset range from the face image constructed by the global features as a region of interest;
and the cropping operation subunit is configured to crop the region of interest and perform a max-pooling operation on the cropped region of interest, so as to extract the local features corresponding to the face points.
In one possible implementation, the determining module 42 includes: a second determination unit and a third determination unit, wherein:
a second determining unit, configured to determine, based on the local feature, a query, a key, and a value of an AU center association relationship through an attention mechanism;
and the third determining unit is configured to determine the attention score matrix of the AU-center association relationships according to the queries, keys, and values of the AU-center association relationships.
In a possible implementation manner, the action unit correlation learning apparatus further includes: and the classification unit is used for inputting the characteristic result between the central point of the AU and the AU into a classification layer so as to output a classification result for representing whether the AU occurs or not.
The action unit correlation learning apparatus provided in this embodiment may be configured to execute the action unit correlation learning method provided in any of the above method embodiments, and the implementation principle and the technical effect are similar, which are not described herein again.
In the technical scheme of the application, the collection, storage, use, processing, transmission, provision, disclosure and other processing of the personal information of the related user are all in accordance with the regulations of related laws and regulations and do not violate the good custom of the public order.
According to an embodiment of the present application, an electronic device and a readable storage medium are also provided.
Fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present application. The electronic device comprises a receiver 50, a transmitter 51, at least one processor 52 and a memory 53, and the electronic device formed by the above components can be used to implement several specific embodiments of the present application, which are not described herein again.
An embodiment of the present application further provides a computer-readable storage medium storing computer-executable instructions which, when executed by a processor, implement the steps of the method in the foregoing embodiments.
An embodiment of the present application further provides a computer program product comprising computer instructions which, when executed by a processor, implement the steps of the method in the foregoing embodiments.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel, sequentially, or in different orders, as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved, and the present disclosure is not limited herein.
The above-described embodiments should not be construed as limiting the scope of the present application. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims (10)

1. An action unit correlation learning method is characterized by comprising the following steps:
acquiring a face image, and determining, based on the face image, local features corresponding to facial points; wherein the face image comprises a plurality of action units (AUs), each AU comprises at least one facial point, and each facial point is an AU center point;
determining an attention score matrix of an AU center association relationship through an attention mechanism based on the local features; wherein the AU center association relationship is an association relationship between the AU center points, and elements in the attention score matrix are used for reflecting the attention of the association relationship between the AU center points;
acquiring a first initial adjacency matrix and a second initial adjacency matrix of the AU center association relationship, set for a spatial relationship and a non-spatial relationship, respectively;
learning, through a latent relationship network, a correlation between the AU center point and the AU based on the attention score matrix, the first initial adjacency matrix, and the second initial adjacency matrix.
2. The action unit correlation learning method of claim 1, wherein the learning, through a latent relationship network, of the correlation between the AU center point and the AU based on the attention score matrix, the first initial adjacency matrix, and the second initial adjacency matrix comprises:
when the latent relationship network comprises a scaled dot-product attention module and a graph convolutional neural network module connected in sequence, calculating, through the scaled dot-product attention module, a first dot product result of the attention score matrix and the first initial adjacency matrix, and a second dot product result of the attention score matrix and the second initial adjacency matrix;
determining, through the graph convolutional neural network module, a feature result between the AU center point and the AU based on the first dot product result and the second dot product result; wherein the feature result is used to reflect the correlation between the AU center point and the AU.
3. The action unit correlation learning method of claim 2, wherein the determining, through the graph convolutional neural network module, of the feature result between the AU center point and the AU based on the first dot product result and the second dot product result comprises:
when the graph convolutional neural network module comprises a first graph convolutional neural network and a second graph convolutional neural network connected in parallel, inputting the first dot product result to the first graph convolutional neural network, learning, through the first graph convolutional neural network, the correlation between the AU center point with the spatial relationship type and the AU, and outputting a spatial relationship feature;
inputting the second dot product result to the second graph convolutional neural network, learning, through the second graph convolutional neural network, the correlation between the AU center point with the non-spatial relationship type and the AU, and outputting a non-spatial relationship feature;
and concatenating the spatial relationship feature and the non-spatial relationship feature to obtain the feature result between the AU center point and the AU.
4. The action unit correlation learning method according to claim 1, wherein the determining, based on the face image, of the local features corresponding to facial points comprises:
extracting global features of the face image;
and extracting, based on the global features, the local features corresponding to the facial points.
5. The action unit correlation learning method according to claim 4, wherein the extracting, based on the global features, of the local features corresponding to the facial points comprises:
selecting, from a face image constructed from the global features and with a facial point as the center, a region within a preset range as a region of interest;
and cropping the region of interest, and performing a max pooling operation on the cropped region of interest, so as to extract the local feature corresponding to the facial point.
6. The action unit correlation learning method according to claim 1, wherein the determining an attention score matrix of the AU center association relationship through an attention mechanism based on the local features comprises:
determining, based on the local features, the query, key, and value of the AU center association relationship through the attention mechanism;
and determining the attention score matrix of the AU center association relationship according to the query, key, and value of the AU center association relationship.
7. The action unit correlation learning method of claim 1, further comprising, after the learning, through a latent relationship network, of the correlation between the AU center point and the AU based on the attention score matrix, the first initial adjacency matrix, and the second initial adjacency matrix:
inputting the feature result between the AU center point and the AU into a classification layer, so as to output a classification result indicating whether each AU occurs.
8. An action unit correlation learning device, comprising:
an acquisition determining module, configured to acquire a face image and determine, based on the face image, local features corresponding to facial points; wherein the face image comprises a plurality of action units (AUs), each AU comprises at least one facial point, and each facial point is an AU center point;
a determining module, configured to determine an attention score matrix of an AU center association relationship through an attention mechanism based on the local features; wherein the AU center association relationship is an association relationship between the AU center points, and elements in the attention score matrix are used for reflecting the attention of the association relationship between the AU center points;
an acquisition module, configured to acquire a first initial adjacency matrix and a second initial adjacency matrix of the AU center association relationship, set for a spatial relationship and a non-spatial relationship, respectively;
and a learning module, configured to learn, through a latent relationship network, a correlation between the AU center point and the AU based on the attention score matrix, the first initial adjacency matrix, and the second initial adjacency matrix.
9. An electronic device, comprising: at least one processor and memory;
the memory stores computer-executable instructions;
the at least one processor executes the computer-executable instructions stored in the memory, causing the at least one processor to perform the action unit correlation learning method of any one of claims 1 to 7.
10. A computer-readable storage medium having stored thereon computer-executable instructions for implementing the action unit correlation learning method of any one of claims 1 to 7 when executed by a processor.
CN202211017801.XA 2022-08-24 2022-08-24 Action unit correlation learning method and device, electronic device and storage medium Pending CN115471885A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211017801.XA CN115471885A (en) 2022-08-24 2022-08-24 Action unit correlation learning method and device, electronic device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211017801.XA CN115471885A (en) 2022-08-24 2022-08-24 Action unit correlation learning method and device, electronic device and storage medium

Publications (1)

Publication Number Publication Date
CN115471885A (en) 2022-12-13

Family

ID=84366095

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211017801.XA Pending CN115471885A (en) 2022-08-24 2022-08-24 Action unit correlation learning method and device, electronic device and storage medium

Country Status (1)

Country Link
CN (1) CN115471885A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116071809A (en) * 2023-03-22 2023-05-05 鹏城实验室 Face space-time representation generation method based on multi-class representation space-time interaction
CN116071809B (en) * 2023-03-22 2023-07-14 鹏城实验室 Face space-time representation generation method based on multi-class representation space-time interaction
CN116416667A (en) * 2023-04-25 2023-07-11 天津大学 Facial action unit detection method based on dynamic association information embedding
CN116416667B (en) * 2023-04-25 2023-10-24 天津大学 Facial action unit detection method based on dynamic association information embedding

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination