CN112800903A - Dynamic expression recognition method and system based on space-time graph convolutional neural network - Google Patents

Dynamic expression recognition method and system based on space-time graph convolutional neural network

Info

Publication number
CN112800903A
Authority
CN
China
Prior art keywords
space
key points
time
expression
graph
Prior art date
Legal status
Granted
Application number
CN202110067161.2A
Other languages
Chinese (zh)
Other versions
CN112800903B (en)
Inventor
卢官明
缪远俊
卢峻禾
Current Assignee
Nanjing University of Posts and Telecommunications
Original Assignee
Nanjing University of Posts and Telecommunications
Priority date
Filing date
Publication date
Application filed by Nanjing University of Posts and Telecommunications filed Critical Nanjing University of Posts and Telecommunications
Priority to CN202110067161.2A priority Critical patent/CN112800903B/en
Publication of CN112800903A publication Critical patent/CN112800903A/en
Application granted granted Critical
Publication of CN112800903B publication Critical patent/CN112800903B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/174Facial expression recognition
    • G06V40/176Dynamic expression
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/44Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168Feature extraction; Face representation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • General Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Human Computer Interaction (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a dynamic expression recognition method and system based on a spatio-temporal graph convolutional neural network. First, face key points are detected in each frame of a dynamic expression sequence to obtain normalized coordinates and numbers for the key points; a local texture feature vector is extracted at each key point and concatenated with the key point's normalized coordinates to form its local fusion feature vector. Next, key points within the same frame are connected to form spatial-domain edges, identically numbered key points in adjacent frames are connected to form temporal-domain edges, and the edges and key points together form a space-time topological graph. A spatio-temporal graph convolutional neural network is then constructed and trained on the generated space-time topological graphs. Finally, a space-time topological graph generated from a new expression sequence is taken as input and the trained network model performs expression recognition. By exploiting the position information of the face key points, the method can overcome the influence of illumination, skin color and posture changes, improving the accuracy and robustness of expression recognition.

Description

Dynamic expression recognition method and system based on space-time graph convolutional neural network
Technical Field
The invention relates to a dynamic expression recognition method and system based on a spatio-temporal graph convolutional neural network, and belongs to the field of image processing and pattern recognition.
Background
As computers play an ever larger role in people's daily life, human-computer interaction has become an inevitable trend in technological development. To enhance the human-computer interaction experience, computers need the ability to recognize human emotions. Research by the psychologist Mehrabian showed that in daily life the facial expression is an important carrier of emotion, conveying more information than language and voice combined. Expression recognition is therefore an essential link in human-computer interaction: by extracting expression information, a machine can judge a person's emotional state and thus respond to human emotional needs.
As facial expression recognition technology has matured, it has become a research hotspot in computer vision and pattern recognition. To extract the temporal and spatial information of a dynamic expression sequence, most mainstream methods use a convolutional neural network (CNN) to extract the spatial information of each frame and a long short-term memory network (LSTM) to extract the temporal information of the sequence, or directly apply a three-dimensional convolutional neural network (3D-CNN) that convolves the input sequence in the spatial and temporal dimensions simultaneously, so that the extracted features contain information both within and between frames. These methods typically take raw images as input and learn features for the expression recognition task through supervised training. However, a raw image carries much interference information irrelevant to expression recognition, such as age, gender and illumination, and the mapping from raw image to the low-dimensional feature vector finally used for classification amounts to a supervised dimension-reduction process for mining useful information; this process is often complex and requires training a large number of parameters. The face contour formed by face key points is a higher-level representation than the whole image, and the contour changes of different individuals under a given expression share the same characteristic pattern, so a model trained on face key points has a degree of robustness to changes in skin color, illumination and posture; in addition, the number of key points is far smaller than the number of pixels in the whole image, yielding a simpler model.
In recent years, graph convolutional networks (GCN) have proven able to process data with a graph structure, such as social networks, communication networks and molecular structures, and to map such data onto low-dimensional vectors, something a traditional convolutional neural network (CNN) cannot do. A spatio-temporal topological graph generated from face key points can therefore be processed with a graph convolutional network, which can learn higher-level features to classify dynamic expressions.
The Chinese patent application "A facial expression recognition method based on a graph convolutional neural network" (application No. 201910091261.1, publication No. CN 110008819 A) takes each pixel of a facial expression grayscale image as a graph node, constructs a topological graph according to certain rules, and inputs it into a graph convolutional neural network model to obtain the expression classification result. A topological graph built from every pixel of an image is too complex and hinders information fusion between distant nodes; moreover, the method applies only to still images and cannot classify dynamic expression sequences in video.
The Chinese patent application "A dynamic expression recognition method based on facial feature point data enhancement" (application No. 202010776415.3, publication No. CN 111931630 A) obtains the recognition result by feeding the initial frame and the peak frame of the dynamic expression sequence, together with a trajectory graph constructed from the face key points, into convolutional neural networks. Because the trajectory graph is designed manually from the face key points, the feature extraction process is cumbersome and complex, which hurts the real-time performance of the model.
Disclosure of Invention
The purpose of the invention is as follows: to address the inability of existing expression recognition methods to exploit face key point information effectively, the invention provides a dynamic expression recognition method and system based on a spatio-temporal graph convolutional neural network that makes full use of the position information of face key points and overcomes the influence of illumination, skin color and posture changes, thereby effectively improving the accuracy and robustness of dynamic expression recognition.
Technical scheme: to achieve the above purpose, the invention adopts the following technical scheme:
a dynamic expression recognition method based on a space-time diagram convolutional neural network comprises the following steps:
(1) preprocessing each expression sequence in the dynamic expression data set to obtain expression sequences with equal length;
(2) detecting key points of the face of each frame of image in the preprocessed expression sequence to obtain position coordinates and serial numbers of each key point, and normalizing the coordinates of the key points;
(3) extracting local texture feature vectors of each key point in the expression sequence, and splicing the local texture feature vectors with the normalized coordinates of the key points to obtain local fusion feature vectors of the key points;
(4) connecting key points within each frame of the expression sequence to form spatial-domain edges and connecting identically numbered key points in adjacent frames to form temporal-domain edges, which together form the edge set of a space-time topological graph; the space-time topological graph is constructed by taking the key point set of the expression sequence as its node set;
(5) constructing a spatio-temporal graph convolutional neural network comprising a plurality of sequentially connected spatio-temporal graph convolution blocks, a global average pooling layer, two fully connected layers and a classification layer; in addition to fusing the features of adjacent nodes in the spatial domain, the spatial graph convolution in each block first computes the similarity between nodes to obtain a similarity matrix, then multiplies this matrix with the input features to fuse the features of similar nodes in the spatial domain;
(6) training a space-time graph convolutional neural network by utilizing the constructed space-time topological graph and the corresponding expression categories to obtain a trained space-time graph convolutional neural network model;
(7) taking a space-time topological graph generated from a new expression sequence as input, performing recognition with the trained network model, and outputting the final classification result.
Further, the preprocessing in the step (1) comprises the following sub-steps:
(1.1) converting each expression sequence into a frame sequence of length S: sequences longer than S frames are truncated to their last S frames, and sequences shorter than S frames are padded to S frames by repeating the last frame; S is the set frame length of an expression sequence;
(1.2) normalizing the size of each frame in the sequence so that every frame is m × n pixels, where m and n are the set image width and height.
Further, normalizing the coordinates of the key points in step (2) comprises the following sub-steps:
(2.1) performing face key point detection on each frame of the preprocessed expression sequence to obtain the key point set V = { v_{t,i} | 1 ≤ t ≤ S, 1 ≤ i ≤ N }, where v_{t,i} = (x_{t,i}, y_{t,i}) denotes the coordinates of the key point numbered i in the t-th frame, S is the frame length of the expression sequence, and N is the number of key points per frame; the key points are distributed over the mouth, eyes, eyebrows and nose;
(2.2) subtracting the coordinates of the first frame's nose-tip key point from the coordinates of all key points to obtain the coordinate-normalized key point set V′ = { v′_{t,i} | 1 ≤ t ≤ S, 1 ≤ i ≤ N }.
Further, the local fusion feature vector in step (3) is obtained as follows:
denote by l_{t,i} the local texture feature vector extracted at the key point numbered i in the t-th frame; concatenating it with the key point's normalized coordinates v′_{t,i} gives the local fusion feature vector m_{t,i} = (v′_{t,i}, l_{t,i}); performing the same operation on all key points of the dynamic expression sequence yields the key point set M = { m_{t,i} | 1 ≤ t ≤ S, 1 ≤ i ≤ N }.
Further, the method of connecting key points within each frame of the expression sequence in step (4) to form spatial-domain edges is as follows: first, connect the key points distributed over the mouth, eyes, eyebrows and nose according to the geometric structure of the face to form the edges of each part's subgraph; then, to allow information to flow between the subgraphs of the parts, connect the subgraphs to one another to form edges between subgraphs.
Further, the spatio-temporal graph convolution block in step (5) is computed as follows:
(5.1) apply a dimension transformation to the input feature map:
f = g(f_in)
where f_in is the input feature map, of dimensions C_in × T × N, in which C_in is the number of channels of the node features, T is the temporal dimension of the feature map, and N is the number of spatial-domain nodes; g(·) is a dimension transformation function that reshapes f_in to N × C_in·T;
(5.2) compute the normalized similarity matrix B = { b_{i,j} | 1 ≤ i, j ≤ N }, where b_{i,j} measures the similarity of nodes i and j; this similarity measure generates new edges for the space-time topological graph:
b_{i,j} = |f_i f_j^T| / (‖f_i‖ · ‖f_j‖)
where f_i denotes the i-th row vector of the matrix f, |·| is the absolute-value operation, and ‖·‖ is the modulus operation;
(5.3) construct the spatial graph convolution, expressed as:
f_out = W · u( h(f_in) (Â + B) ), where Â = Λ^(−1/2) A Λ^(−1/2)
in which A = { a_{i,j} | 1 ≤ i, j ≤ N } is the adjacency matrix, of dimensions N × N, where a_{i,j} = 0 indicates that key points i and j are not connected, a_{i,j} = 1 indicates that they are connected, and a_{i,i} = 1; Λ is the diagonal degree matrix with diagonal elements Λ_i = Σ_j a_{i,j}; f_in, the input feature map of the spatial graph convolution, is identical to the input of step (5.1); h(·) and u(·) are dimension transformation functions: h(·) reshapes the input feature map to C_in·T × N and u(·) reshapes the computation result back to C_in × T × N; W is a 1 × 1 convolution kernel that transforms the number of channels of the node features to C_out; f_out is the output of the spatial graph convolution, of dimensions C_out × T × N;
(5.4) pass the output of step (5.3) through a batch normalization (BN) layer and a ReLU activation layer in sequence;
(5.5) construct a residual connection between the output feature map of step (5.4) and the input feature map f_in of step (5.1);
(5.6) construct the temporal convolution layer: the output feature map of step (5.5) has dimensions C_out × T × N; the temporal convolution kernel size is set to [m × 1], so each convolution covers m key frames of one node, with m chosen from the values 2, 3 and 4; the stride is s, so the kernel moves s frames at a time and proceeds to the next key point after finishing one; through a padding operation, the output feature map of the temporal convolution has dimensions C_out × (T/s) × N;
(5.7) pass the output of step (5.6) through the batch normalization (BN) layer and ReLU activation layer in sequence.
Based on the same inventive concept, the invention discloses a dynamic expression recognition system based on a spatio-temporal graph convolutional neural network, comprising:
the preprocessing module is used for preprocessing each expression sequence in the dynamic expression data set to obtain expression sequences with equal length;
the key point detection module is used for detecting the key points of the face of each frame of image in the preprocessed expression sequence to obtain the position coordinates and the serial numbers of each key point and normalizing the coordinates of the key points;
the key point feature fusion module is used for extracting a local texture feature vector of each key point in the expression sequence, and splicing the local texture feature vector with the normalized coordinates of the key points to obtain a local fusion feature vector of the key points;
the space-time topological graph construction module is used for connecting key points within each frame of the expression sequence to form spatial-domain edges and connecting identically numbered key points in adjacent frames to form temporal-domain edges, which together form the edge set of the space-time topological graph, and for constructing the space-time topological graph by taking the key point set of the expression sequence as its node set;
the spatio-temporal graph convolutional neural network module comprises a plurality of sequentially connected spatio-temporal graph convolution blocks, a global average pooling layer, two fully connected layers and a classification layer; in addition to fusing the features of adjacent nodes in the spatial domain, the spatial graph convolution in each block first computes the similarity between nodes to obtain a similarity matrix, then multiplies this matrix with the input features to fuse the features of similar nodes in the spatial domain;
the network training module is used for training the spatiotemporal graph convolutional neural network by utilizing the constructed spatiotemporal topological graph and the corresponding expression categories to obtain a trained spatiotemporal graph convolutional neural network model;
and the expression recognition module is used for taking a space-time topological graph generated based on the new expression sequence as input, recognizing by using a trained network model and outputting a final classification result.
Based on the same inventive concept, the disclosed dynamic expression recognition system based on a spatio-temporal graph convolutional neural network comprises at least one computing device, the computing device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the computer program, when loaded into the processor, implementing the above dynamic expression recognition method based on a spatio-temporal graph convolutional neural network.
Beneficial effects: compared with the prior art, the invention has the following technical effects:
(1) a spatio-temporal graph convolutional neural network extracts the temporal-domain and spatial-domain features of the face key points, extending feature extraction from static images to image sequences; its parameters are adjusted adaptively during training, so dynamic features reflecting temporal information are extracted automatically and better characterize the dynamic changes of facial expressions. With this method, expressions such as happiness, sadness and anger can be recognized effectively, providing a new avenue for developing systems such as intelligent human-computer interaction.
(2) By adopting a similarity measure, the invention generates new edges for the space-time topological graph, effectively compensating for the limitations of a manually designed topology; each node can then fuse information both from its adjacent nodes and from similar nodes, improving the flexibility and applicability of the model.
(3) A space-time topological graph is first generated from the dynamic expression sequence, and the trained spatio-temporal graph convolutional neural network model then performs recognition. Compared with the whole dynamic expression sequence, the space-time topological graph formed by the face key points is a higher-level representation; since every frame of the sequence carries much interference information irrelevant to expression recognition, such as age, gender and illumination, the network model gains a degree of robustness to changes in skin color and illumination.
(4) Fusing each key point's local texture feature vector with its coordinate information enhances the expressive power of the key point features; key point features based purely on coordinates capture only the motion of the key points, so this fusion improves the accuracy of dynamic expression classification.
(5) The number of face key points is far smaller than the number of pixels in the whole image, so converting the network input from the whole dynamic expression sequence to a space-time topological graph formed by face key points greatly reduces the complexity of the model and improves its real-time performance.
(6) The key point detection algorithm can accurately locate the key points of a face deflected by a certain angle and construct the space-time topological graph, so dynamic expressions can still be recognized and the method has a degree of robustness to posture changes.
Drawings
FIG. 1 is a general flow diagram of a method of an embodiment of the invention.
Fig. 2 is the spatial topological graph constructed in an embodiment of the present invention.
Fig. 3 is the structure of the network model constructed in an embodiment of the present invention.
Fig. 4 is a screenshot of a partial sequence of images of a CK + expression data set used in an embodiment of the invention.
Detailed Description
The technical solution of the present invention is further explained with reference to the accompanying drawings and the specific embodiments.
As shown in fig. 1, the dynamic expression recognition method based on a spatio-temporal graph convolutional neural network disclosed in this embodiment of the present invention comprises the following steps:
Step (1): preprocess each expression sequence in the dynamic expression data set so that every sample is represented by a sequence of the same length, obtaining the preprocessed expression sequences. This step comprises the following sub-steps:
(1.1) converting each expression sequence into a frame sequence of length S: sequences longer than S frames are truncated to their last S frames, and sequences shorter than S frames are padded to S frames by repeating the last frame; S is the set frame length of an expression sequence;
(1.2) normalizing the size of each frame in the sequence so that every frame is m × n pixels, where m and n are the set image width and height.
This embodiment uses the CK+ dynamic expression data set; some of its samples are shown in fig. 4, and its expression categories are anger, disgust, fear, happiness, sadness, surprise and neutral. In practice, other video data sets may be used, or facial expression videos may be captured with a camera to build an expression video library with emotion category labels. The expression sequences in the data set are preprocessed: each sequence is converted into a frame sequence of length 16, keeping the last 16 frames of sequences longer than 16 frames and padding shorter sequences to 16 frames by repeating the last frame; meanwhile, each frame is resized to 64 × 64 pixels.
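The following minimal Python sketch illustrates this preprocessing step, assuming OpenCV is available and each sequence is given as a list of frames; the function name and defaults are illustrative, not taken from the patent.

```python
import cv2

def preprocess_sequence(frames, S=16, size=(64, 64)):
    """Truncate or pad an expression sequence to S frames, then resize
    every frame to `size` pixels (64 x 64 in this embodiment)."""
    if len(frames) >= S:
        frames = frames[-S:]                                 # keep the last S frames
    else:
        frames = frames + [frames[-1]] * (S - len(frames))   # repeat the last frame
    return [cv2.resize(frame, size) for frame in frames]
```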
Step (2): perform face key point detection on each frame of the preprocessed expression sequence, return the position coordinates and number of each key point, and normalize the key point coordinates. This step comprises the following sub-steps:
(2.1) performing face key point detection on each frame of the preprocessed expression sequence to obtain the key point set V = { v_{t,i} | 1 ≤ t ≤ S, 1 ≤ i ≤ N }, where v_{t,i} = (x_{t,i}, y_{t,i}) denotes the coordinates of the key point numbered i in the t-th frame, S is the frame length of the expression sequence, and N is the number of key points per frame; the key points are distributed over the mouth, eyes, eyebrows and nose;
(2.2) subtracting the coordinates of the first frame's nose-tip key point from the coordinates of all key points to obtain the coordinate-normalized key point set V′ = { v′_{t,i} | 1 ≤ t ≤ S, 1 ≤ i ≤ N }.
In this embodiment, the Dlib open-source toolkit detects the key points in each frame and returns the coordinates and numbers of 68 key points such as the nose tip, mouth corners and eye corners; to reduce complexity, the key points numbered 1-17 (the jaw contour) are deleted and the remaining 51 key points are renumbered in their original order.
Let V_t = { v_{t,i} | 1 ≤ i ≤ 51 } be the set of 51 key points detected in the t-th frame, and subtract the coordinates of the key point numbered 14 (the nose tip) in the first frame from the coordinates of all key points; performing the same operation on each frame gives the coordinate-normalized key point set V′ = { v_{t,i} − v_{1,14} | 1 ≤ t ≤ 16, 1 ≤ i ≤ 51 }.
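A sketch of key point detection and coordinate normalization with Dlib follows; the 68-point predictor file is the standard model distributed with Dlib, while the single-face-per-frame assumption and the helper names are illustrative.

```python
import dlib
import numpy as np

detector = dlib.get_frontal_face_detector()
# standard 68-point landmark model distributed with Dlib
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def detect_keypoints(gray):
    """Return the 51 renumbered key points of one face (jaw points 1-17 dropped)."""
    rect = detector(gray, 1)[0]                      # assume one face per frame
    shape = predictor(gray, rect)
    pts = np.array([[p.x, p.y] for p in shape.parts()], dtype=np.float32)
    return pts[17:]                                  # drop the 17 jaw-contour points

def normalize_coordinates(seq_pts):
    """seq_pts: array of shape (16, 51, 2); subtract the first frame's
    nose tip (key point No. 14 after renumbering, i.e. index 13)."""
    return seq_pts - seq_pts[0, 13]
```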
Step (3): extract the local texture feature vector of each key point in the expression sequence and concatenate it with the key point's normalized coordinates to obtain the local fusion feature vector. Specifically, let l_{t,i} be the local texture feature vector extracted at the key point numbered i in the t-th frame; concatenating it with the normalized coordinates v′_{t,i} gives the local fusion feature vector m_{t,i} = (v′_{t,i}, l_{t,i}); performing the same operation on all key points of the dynamic expression sequence yields the key point set M = { m_{t,i} | 1 ≤ t ≤ S, 1 ≤ i ≤ N }. In this embodiment, a rotation-invariant LBP operator with radius 1 and 8 sampling points computes the minimum LBP value at each key point as its local texture feature vector.
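A sketch of this feature fusion using scikit-image's rotation-invariant LBP ('ror' method, radius 1, 8 sampling points) is given below; sampling the LBP map at the key point's pixel is an assumption about how the per-key-point value is read off.

```python
import numpy as np
from skimage.feature import local_binary_pattern

def fuse_features(gray, coords, norm_coords):
    """Build 3-channel node features (x', y', lbp) for the 51 key points of
    one frame; `coords` are pixel positions, `norm_coords` the normalized ones."""
    lbp_map = local_binary_pattern(gray, 8, 1, method="ror")  # rotation-invariant LBP
    nodes = []
    for (x, y), (xn, yn) in zip(coords, norm_coords):
        nodes.append([xn, yn, lbp_map[int(y), int(x)]])       # (v'_{t,i}, l_{t,i})
    return np.asarray(nodes, dtype=np.float32)                # shape (51, 3)
```

Each node thus carries three channels (x′, y′, LBP), matching the 3 input channels of block C1 in the network described below.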
Step (4): construct the space-time topological graph from the key points of the expression sequence, comprising the following sub-steps:
(4.1) spatial-domain edges are formed by connecting key points within the same frame: first, the key points distributed over the eyebrows, eyes, nose and mouth are connected according to the geometric structure of the face to form the edges of each organ subgraph; then, to allow information to flow between the organ subgraphs, the subgraphs are connected to one another to form edges between them, as shown in fig. 2; temporal-domain edges are formed by connecting identically numbered key points in adjacent frames; together these edges form the edge set E of the space-time topological graph (a sketch of the adjacency construction follows step (4.3));
(4.2) the key point set of the dynamic expression sequence, M = { m_{t,i} | 1 ≤ t ≤ 16, 1 ≤ i ≤ 51 }, serves as the node set of the space-time topological graph;
(4.3) the edge set E and the node set M form the space-time topological graph Q = (M, E).
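A sketch of the spatial adjacency construction is shown below; the exact wiring of fig. 2 is not reproduced in the text, so the `edges` argument is left to the caller and only the self-loop convention a_ii = 1 is fixed.

```python
import numpy as np

def build_adjacency(edges, num_nodes=51):
    """Spatial adjacency matrix A with self-loops; `edges` is a list of
    1-based key point number pairs wired as in fig. 2 (organ subgraphs
    plus the links between subgraphs)."""
    A = np.eye(num_nodes, dtype=np.float32)          # a_ii = 1
    for i, j in edges:
        A[i - 1, j - 1] = A[j - 1, i - 1] = 1.0      # undirected spatial edge
    return A
```

Temporal-domain edges between identically numbered key points of adjacent frames are not stored in A; in the network sketched later they are realized by the [m × 1] temporal convolution.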
Step (5): construct the spatio-temporal graph convolutional neural network. The network comprises k sequentially connected spatio-temporal graph convolution blocks, with k chosen from the values 6, 8 and 10, followed by a global average pooling layer, two fully connected layers and a classification layer. In addition to fusing the features of adjacent nodes in the spatial domain, the spatial graph convolution in each block first computes the similarity between nodes to obtain a similarity matrix, then multiplies this matrix with the input features to fuse the features of similar nodes in the spatial domain.
The spatio-temporal graph convolutional neural network constructed in this embodiment comprises six sequentially connected spatio-temporal graph convolution blocks, a global average pooling layer, two fully connected layers and a softmax classification layer. The spatio-temporal graph convolution block is specified as follows:
the space domain edge of the space-time topological graph is artificially designed according to the natural structure of the face, and a new edge cannot be generated in the whole network; for example, in the smiling process, a left mouth corner key point and a right mouth corner key point of the face are deformed similarly, the characteristics of the two key points have higher similarity, and if an edge is constructed between the two key points, the fusion of key point information is facilitated; therefore, a new edge can be generated for the space-time topological graph by adopting a similarity measurement mode, and the flexibility of the model is improved; the calculation steps of the spatio-temporal map volume block are as follows:
(5.1) apply a dimension transformation to the input feature map:
f = g(f_in)
where f_in is the input feature map, of dimensions C_in × T × N, in which C_in is the number of channels of the node features, T is the temporal dimension of the feature map, and N is the number of spatial-domain nodes; g(·) is a dimension transformation function that reshapes f_in to N × C_in·T;
(5.2) compute the normalized similarity matrix B = { b_{i,j} | 1 ≤ i, j ≤ 51 }, where b_{i,j} measures the similarity of nodes i and j:
b_{i,j} = |f_i f_j^T| / (‖f_i‖ · ‖f_j‖)
where f_i denotes the i-th row vector of the matrix f, |·| is the absolute-value operation, and ‖·‖ is the modulus operation; because all key points are normalized by the first frame's nose-tip key point, the absolute-value operation gives face key points that are vertically symmetric about the nose tip similar position coordinates, so such pairs of key points obtain a high similarity;
(5.3) construct the spatial graph convolution, expressed as:
f_out = W · u( h(f_in) (Â + B) ), where Â = Λ^(−1/2) A Λ^(−1/2)
in which A = { a_{i,j} | 1 ≤ i, j ≤ 51 } is the adjacency matrix, of dimensions 51 × 51, where a_{i,j} = 0 indicates that key points i and j are not connected, a_{i,j} = 1 indicates that they are connected, and a_{i,i} = 1; Λ is the diagonal degree matrix with diagonal elements Λ_i = Σ_j a_{i,j}; f_in, the input feature map of the spatial graph convolution, is identical to the input of step (5.1); h(·) and u(·) are dimension transformation functions: h(·) reshapes the input feature map to C_in·T × 51 and u(·) reshapes the computation result back to C_in × T × 51; W is a 1 × 1 convolution kernel that transforms the number of channels of the node features to C_out; f_out is the output of the spatial graph convolution, of dimensions C_out × T × 51;
(5.4) pass the output of step (5.3) through a batch normalization (BN) layer and a ReLU activation layer in sequence;
(5.5) construct a residual connection: the output feature map of step (5.4) has dimensions C_out × T × 51 and the input feature map f_in of step (5.1) has dimensions C_in × T × 51; when C_in = C_out, the two feature maps are added directly as f_in + f_out; when C_in ≠ C_out, the channel number of f_in must first be transformed to C_out before the addition;
(5.6) construct the temporal convolution layer: the graph convolution above fuses only the information of a node's adjacent and similar nodes in the spatial domain and cannot fuse the information of its temporal-domain neighbors; by analogy with image convolution, the temporal convolution kernel has size [m × 1], so each step completes the convolution of m key frames for one node; the stride is s, so the kernel moves s frames at a time and proceeds to the next node after finishing one; through a padding operation, the output feature map of the temporal convolution has dimensions C_out × (T/s) × 51;
(5.7) pass the output of step (5.6) through the batch normalization (BN) layer and ReLU activation layer in sequence.
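The following PyTorch sketch implements steps (5.1) through (5.7) under the assumptions stated above: the similarity is taken as the absolute cosine similarity, and Â as the symmetrically normalized adjacency used in standard ST-GCN; class names and the choice m = 3 are illustrative.

```python
import torch
import torch.nn as nn

def similarity_matrix(f, eps=1e-8):
    """Step (5.2): b_ij = |f_i f_j^T| / (||f_i|| ||f_j||) for f of shape
    (batch, N, C_in*T), i.e. after the reshape g of step (5.1)."""
    dot = torch.abs(torch.matmul(f, f.transpose(1, 2)))
    norm = f.norm(dim=2, keepdim=True)
    return dot / torch.matmul(norm, norm.transpose(1, 2)).clamp(min=eps)

class SpatialGraphConv(nn.Module):
    """Steps (5.1)-(5.3): f_out = W * u(h(f_in)(A_hat + B))."""

    def __init__(self, c_in, c_out, A):
        super().__init__()
        lam = A.sum(dim=1)                                   # degree Lambda_i
        a_hat = torch.diag(lam.pow(-0.5)) @ A @ torch.diag(lam.pow(-0.5))
        self.register_buffer("a_hat", a_hat)                 # fixed normalized adjacency
        self.w = nn.Conv2d(c_in, c_out, kernel_size=1)       # 1x1 convolution W

    def forward(self, x):                                    # x: (batch, C_in, T, N)
        b, c, t, n = x.shape
        f = x.permute(0, 3, 1, 2).reshape(b, n, c * t)       # g: -> N x C_in*T
        B = similarity_matrix(f)                             # (batch, N, N)
        h = x.reshape(b, c * t, n)                           # h: -> C_in*T x N
        y = torch.matmul(h, self.a_hat + B)                  # h(f_in)(A_hat + B)
        y = y.reshape(b, c, t, n)                            # u: -> C_in x T x N
        return self.w(y)                                     # W: channels -> C_out

class TemporalConv(nn.Module):
    """Steps (5.6)-(5.7): [m x 1] convolution along time, stride s, padded
    so the output temporal length is T/s; m = 3 here, one of {2, 3, 4}."""

    def __init__(self, channels, m=3, s=1):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=(m, 1),
                              stride=(s, 1), padding=((m - 1) // 2, 0))
        self.bn = nn.BatchNorm2d(channels)

    def forward(self, x):                                    # x: (batch, C_out, T, N)
        return torch.relu(self.bn(self.conv(x)))
```

The BN and ReLU layers after the spatial convolution, and the residual connection of step (5.5), are assembled in the block sketch given after the layer listing below.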
The spatio-temporal graph convolutional neural network of this embodiment is shown in fig. 3; the configuration of each layer is as follows:
Spatio-temporal graph convolution block C1: 3 input channels, 32 output channels, stride 1;
Spatio-temporal graph convolution block C2: 32 input channels, 32 output channels, stride 1;
Spatio-temporal graph convolution block C3: 32 input channels, 64 output channels, stride 2;
Spatio-temporal graph convolution block C4: 64 input channels, 64 output channels, stride 1;
Spatio-temporal graph convolution block C5: 64 input channels, 128 output channels, stride 2;
Spatio-temporal graph convolution block C6: 128 input channels, 128 output channels, stride 1;
Global average pooling layer: the feature map output by the six spatio-temporal graph convolution blocks has dimensions 128 × 4 × 51; averaging over all nodes yields a 128-dimensional vector;
Fully connected layers: the first fully connected layer has input dimension 128 and output dimension 64, and the second has input dimension 64 and output dimension 7.
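Assembling the pieces, a sketch of one full block and of the six-block network follows, reusing SpatialGraphConv and TemporalConv from the previous sketch; the 1 × 1 convolution used to match channel counts in the residual branch is an assumption consistent with step (5.5).

```python
import torch
import torch.nn as nn

class STGCBlock(nn.Module):
    """One spatio-temporal graph convolution block: spatial graph convolution
    -> BN -> ReLU -> residual with the block input -> temporal convolution."""

    def __init__(self, c_in, c_out, A, stride=1):
        super().__init__()
        self.gcn = SpatialGraphConv(c_in, c_out, A)
        self.bn = nn.BatchNorm2d(c_out)
        # 1x1 projection when channel counts differ, direct addition otherwise
        self.res = nn.Conv2d(c_in, c_out, 1) if c_in != c_out else nn.Identity()
        self.tcn = TemporalConv(c_out, m=3, s=stride)

    def forward(self, x):
        y = torch.relu(self.bn(self.gcn(x)))     # steps (5.3)-(5.4)
        y = y + self.res(x)                      # step (5.5) residual connection
        return self.tcn(y)                       # steps (5.6)-(5.7)

class STGCN(nn.Module):
    """Blocks C1-C6, global average pooling, two fully connected layers."""

    def __init__(self, A, num_classes=7):
        super().__init__()
        cfg = [(3, 32, 1), (32, 32, 1), (32, 64, 2),      # C1-C3
               (64, 64, 1), (64, 128, 2), (128, 128, 1)]  # C4-C6
        self.blocks = nn.Sequential(*[STGCBlock(i, o, A, s) for i, o, s in cfg])
        self.fc1 = nn.Linear(128, 64)
        self.fc2 = nn.Linear(64, num_classes)

    def forward(self, x):                        # x: (batch, 3, 16, 51)
        y = self.blocks(x)                       # -> (batch, 128, 4, 51)
        y = y.mean(dim=(2, 3))                   # global average pooling -> 128-d
        return self.fc2(torch.relu(self.fc1(y)))  # class scores for softmax
```

With strides 1, 1, 2, 1, 2, 1 the temporal dimension shrinks from 16 to 4, matching the 128 × 4 × 51 feature map stated above.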
Step (6): train the spatio-temporal graph convolutional neural network with the constructed space-time topological graphs and their corresponding expression categories to obtain the trained model. During training, the Adam method is adopted as the optimization strategy, and cross entropy is selected as the loss function for gradient back-propagation.
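A minimal training loop sketch under the stated choices (Adam, cross entropy) is shown below, reusing build_adjacency and STGCN from the earlier sketches; the spatial edge list, data loader, learning rate and epoch count are assumptions, not taken from the patent.

```python
import torch
import torch.nn as nn

A = torch.tensor(build_adjacency(edges=spatial_edges))     # fig. 2 wiring assumed given
model = STGCN(A)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # learning rate assumed
criterion = nn.CrossEntropyLoss()                          # cross-entropy loss

model.train()
for epoch in range(50):                                    # epoch count assumed
    for x, labels in train_loader:                         # x: (batch, 3, 16, 51)
        optimizer.zero_grad()
        loss = criterion(model(x), labels)                 # softmax + cross entropy
        loss.backward()                                    # gradient back-propagation
        optimizer.step()
```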
Step (7): take a space-time topological graph generated from a new expression sequence as input, perform recognition with the trained network model, and output the final classification result.
Based on the same inventive concept, this embodiment of the invention discloses a dynamic expression recognition system based on a spatio-temporal graph convolutional neural network, comprising:
the preprocessing module is used for preprocessing each expression sequence in the dynamic expression data set to obtain expression sequences with equal length;
the key point detection module is used for detecting the key points of the face of each frame of image in the preprocessed expression sequence to obtain the position coordinates and the serial numbers of each key point and normalizing the coordinates of the key points;
the key point feature fusion module is used for extracting a local texture feature vector of each key point in the expression sequence, and splicing the local texture feature vector with the normalized coordinates of the key points to obtain a local fusion feature vector of the key points;
the space-time topological graph construction module is used for connecting key points within each frame of the expression sequence to form spatial-domain edges and connecting identically numbered key points in adjacent frames to form temporal-domain edges, which together form the edge set of the space-time topological graph, and for constructing the space-time topological graph by taking the key point set of the expression sequence as its node set;
the spatio-temporal graph convolutional neural network module comprises a plurality of sequentially connected spatio-temporal graph convolution blocks, a global average pooling layer, two fully connected layers and a classification layer; in addition to fusing the features of adjacent nodes in the spatial domain, the spatial graph convolution in each block first computes the similarity between nodes to obtain a similarity matrix, then multiplies this matrix with the input features to fuse the features of similar nodes in the spatial domain;
the network training module is used for training the spatiotemporal graph convolutional neural network by utilizing the constructed spatiotemporal topological graph and the corresponding expression categories to obtain a trained spatiotemporal graph convolutional neural network model;
and the expression recognition module is used for taking a space-time topological graph generated based on the new expression sequence as input, recognizing by using a trained network model and outputting a final classification result.
Based on the same inventive concept, the disclosed dynamic expression recognition system based on a spatio-temporal graph convolutional neural network comprises at least one computing device, the computing device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the computer program, when loaded into the processor, implementing the dynamic expression recognition method based on a spatio-temporal graph convolutional neural network of this embodiment.
The above description is only an embodiment of the present invention, and the protection scope of the present invention is not limited thereto; any modification or substitution that a person skilled in the art can readily conceive within the technical scope disclosed by the present invention falls within the protection scope of the present invention, which is therefore defined by the claims.

Claims (8)

1. A dynamic expression recognition method based on a spatio-temporal graph convolutional neural network is characterized by comprising the following steps:
(1) preprocessing each expression sequence in the dynamic expression data set to obtain expression sequences with equal length;
(2) detecting key points of the face of each frame of image in the preprocessed expression sequence to obtain position coordinates and serial numbers of each key point, and normalizing the coordinates of the key points;
(3) extracting local texture feature vectors of each key point in the expression sequence, and splicing the local texture feature vectors with the normalized coordinates of the key points to obtain local fusion feature vectors of the key points;
(4) connecting key points within each frame of the expression sequence to form spatial-domain edges and connecting identically numbered key points in adjacent frames to form temporal-domain edges, which together form the edge set of a space-time topological graph; the space-time topological graph is constructed by taking the key point set of the expression sequence as its node set;
(5) constructing a spatio-temporal graph convolutional neural network comprising a plurality of sequentially connected spatio-temporal graph convolution blocks, a global average pooling layer, two fully connected layers and a classification layer; in addition to fusing the features of adjacent nodes in the spatial domain, the spatial graph convolution in each block first computes the similarity between nodes to obtain a similarity matrix, then multiplies this matrix with the input features to fuse the features of similar nodes in the spatial domain;
(6) training a space-time graph convolutional neural network by utilizing the constructed space-time topological graph and the corresponding expression categories to obtain a trained space-time graph convolutional neural network model;
(7) taking a space-time topological graph generated from a new expression sequence as input, performing recognition with the trained network model, and outputting the final classification result.
2. The dynamic expression recognition method based on a spatio-temporal graph convolutional neural network as claimed in claim 1, wherein the preprocessing in step (1) comprises the following sub-steps:
(1.1) converting each expression sequence into a frame sequence of length S: sequences longer than S frames are truncated to their last S frames, and sequences shorter than S frames are padded to S frames by repeating the last frame; S is the set frame length of an expression sequence;
(1.2) normalizing the size of each frame in the sequence so that every frame is m × n pixels, where m and n are the set image width and height.
3. The dynamic expression recognition method based on a spatio-temporal graph convolutional neural network as claimed in claim 1, wherein normalizing the coordinates of the key points in step (2) comprises the following sub-steps:
(2.1) performing face key point detection on each frame of the preprocessed expression sequence to obtain the key point set V = { v_{t,i} | 1 ≤ t ≤ S, 1 ≤ i ≤ N }, where v_{t,i} = (x_{t,i}, y_{t,i}) denotes the coordinates of the key point numbered i in the t-th frame, S is the frame length of the expression sequence, and N is the number of key points per frame; the key points are distributed over the mouth, eyes, eyebrows and nose;
(2.2) subtracting the coordinates of the first frame's nose-tip key point from the coordinates of all key points to obtain the coordinate-normalized key point set V′ = { v′_{t,i} | 1 ≤ t ≤ S, 1 ≤ i ≤ N }.
4. The dynamic expression recognition method based on a spatio-temporal graph convolutional neural network as claimed in claim 1, wherein the local fusion feature vector in step (3) is obtained as follows:
denote by l_{t,i} the local texture feature vector extracted at the key point numbered i in the t-th frame; concatenating it with the key point's normalized coordinates v′_{t,i} gives the local fusion feature vector m_{t,i} = (v′_{t,i}, l_{t,i}); performing the same operation on all key points of the dynamic expression sequence yields the key point set M = { m_{t,i} | 1 ≤ t ≤ S, 1 ≤ i ≤ N }.
5. The dynamic expression recognition method based on a spatio-temporal graph convolutional neural network as claimed in claim 1, wherein the method of connecting key points within each frame of the expression sequence in step (4) to form spatial-domain edges is as follows: first, connect the key points distributed over the mouth, eyes, eyebrows and nose according to the geometric structure of the face to form the edges of each part's subgraph; then, to allow information to flow between the subgraphs of the parts, connect the subgraphs to one another to form edges between subgraphs.
6. The dynamic expression recognition method based on a spatio-temporal graph convolutional neural network as claimed in claim 1, wherein the spatio-temporal graph convolution block in step (5) is computed as follows:
(5.1) apply a dimension transformation to the input feature map:
f = g(f_in)
where f_in is the input feature map, of dimensions C_in × T × N, in which C_in is the number of channels of the node features, T is the temporal dimension of the feature map, and N is the number of spatial-domain nodes; g(·) is a dimension transformation function that reshapes f_in to N × C_in·T;
(5.2) compute the normalized similarity matrix B = { b_{i,j} | 1 ≤ i, j ≤ N }, where b_{i,j} measures the similarity of nodes i and j; this similarity measure generates new edges for the space-time topological graph:
b_{i,j} = |f_i f_j^T| / (‖f_i‖ · ‖f_j‖)
where f_i denotes the i-th row vector of the matrix f, |·| is the absolute-value operation, and ‖·‖ is the modulus operation;
(5.3) construct the spatial graph convolution, expressed as:
f_out = W · u( h(f_in) (Â + B) ), where Â = Λ^(−1/2) A Λ^(−1/2)
in which A = { a_{i,j} | 1 ≤ i, j ≤ N } is the adjacency matrix, of dimensions N × N, where a_{i,j} = 0 indicates that key points i and j are not connected, a_{i,j} = 1 indicates that they are connected, and a_{i,i} = 1; Λ is the diagonal degree matrix with diagonal elements Λ_i = Σ_j a_{i,j}; f_in, the input feature map of the spatial graph convolution, is identical to the input of step (5.1); h(·) and u(·) are dimension transformation functions: h(·) reshapes the input feature map to C_in·T × N and u(·) reshapes the computation result back to C_in × T × N; W is a 1 × 1 convolution kernel that transforms the number of channels of the node features to C_out; f_out is the output of the spatial graph convolution, of dimensions C_out × T × N;
(5.4) pass the output of step (5.3) through a batch normalization (BN) layer and a ReLU activation layer in sequence;
(5.5) construct a residual connection between the output feature map of step (5.4) and the input feature map f_in of step (5.1);
(5.6) construct the temporal convolution layer: the output feature map of step (5.5) has dimensions C_out × T × N; the temporal convolution kernel size is set to [m × 1], so each convolution covers m key frames of one node, with m chosen from the values 2, 3 and 4; the stride is s, so the kernel moves s frames at a time and proceeds to the next key point after finishing one; through a padding operation, the output feature map of the temporal convolution has dimensions C_out × (T/s) × N;
(5.7) pass the output of step (5.6) through the batch normalization (BN) layer and ReLU activation layer in sequence.
7. A dynamic expression recognition system based on a space-time graph convolutional neural network is characterized by comprising:
the preprocessing module is used for preprocessing each expression sequence in the dynamic expression data set to obtain expression sequences with equal length;
the key point detection module is used for detecting the key points of the face of each frame of image in the preprocessed expression sequence to obtain the position coordinates and the serial numbers of each key point and normalizing the coordinates of the key points;
the key point feature fusion module is used for extracting a local texture feature vector of each key point in the expression sequence, and splicing the local texture feature vector with the normalized coordinates of the key points to obtain a local fusion feature vector of the key points;
the space-time topological graph construction module is used for connecting key points within each frame of the expression sequence to form spatial-domain edges and connecting identically numbered key points in adjacent frames to form temporal-domain edges, which together form the edge set of the space-time topological graph, and for constructing the space-time topological graph by taking the key point set of the expression sequence as its node set;
the spatio-temporal graph convolutional neural network module comprises a plurality of sequentially connected spatio-temporal graph convolution blocks, a global average pooling layer, two fully connected layers and a classification layer; in addition to fusing the features of adjacent nodes in the spatial domain, the spatial graph convolution in each block first computes the similarity between nodes to obtain a similarity matrix, then multiplies this matrix with the input features to fuse the features of similar nodes in the spatial domain;
the network training module is used for training the spatiotemporal graph convolutional neural network by utilizing the constructed spatiotemporal topological graph and the corresponding expression categories to obtain a trained spatiotemporal graph convolutional neural network model;
and the expression recognition module is used for taking a space-time topological graph generated based on the new expression sequence as input, recognizing by using a trained network model and outputting a final classification result.
8. A dynamic expression recognition system based on a space-time graph convolutional neural network, comprising at least one computing device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the computer program, when loaded into the processor, implements the method for dynamic expression recognition based on a space-time graph convolutional neural network according to any one of claims 1 to 6.
CN202110067161.2A 2021-01-19 2021-01-19 Dynamic expression recognition method and system based on space-time graph convolutional neural network Active CN112800903B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110067161.2A CN112800903B (en) 2021-01-19 2021-01-19 Dynamic expression recognition method and system based on space-time graph convolutional neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110067161.2A CN112800903B (en) 2021-01-19 2021-01-19 Dynamic expression recognition method and system based on space-time graph convolutional neural network

Publications (2)

Publication Number Publication Date
CN112800903A true CN112800903A (en) 2021-05-14
CN112800903B CN112800903B (en) 2022-08-26

Family

ID=75810344

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110067161.2A Active CN112800903B (en) Dynamic expression recognition method and system based on space-time graph convolutional neural network

Country Status (1)

Country Link
CN (1) CN112800903B (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109684911A (en) * 2018-10-30 2019-04-26 百度在线网络技术(北京)有限公司 Expression recognition method, device, electronic equipment and storage medium
CN110796110A (en) * 2019-11-05 2020-02-14 西安电子科技大学 Human behavior identification method and system based on graph convolution network
CN111325099A (en) * 2020-01-21 2020-06-23 南京邮电大学 Sign language identification method and system based on double-current space-time diagram convolutional neural network

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
ZOU Jiancheng et al.: "A facial expression recognition method based on an improved convolutional neural network", Journal of North China University of Technology *

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113468980A (en) * 2021-06-11 2021-10-01 浙江大华技术股份有限公司 Human behavior recognition method and related device
CN113468980B (en) * 2021-06-11 2024-05-31 浙江大华技术股份有限公司 Human behavior recognition method and related device
CN113435576A (en) * 2021-06-24 2021-09-24 中国人民解放军陆军工程大学 Double-speed space-time graph convolution neural network architecture and data processing method
CN113159007A (en) * 2021-06-24 2021-07-23 之江实验室 Gait emotion recognition method based on adaptive graph convolution
CN113569675B (en) * 2021-07-15 2023-05-23 郑州大学 ConvLSTM network-based mouse open field experimental behavior analysis method
CN113569675A (en) * 2021-07-15 2021-10-29 郑州大学 Mouse open field experimental behavior analysis method based on ConvLSTM network
CN113469144A (en) * 2021-08-31 2021-10-01 北京文安智能技术股份有限公司 Video-based pedestrian gender and age identification method and model
CN113469144B (en) * 2021-08-31 2021-11-09 北京文安智能技术股份有限公司 Video-based pedestrian gender and age identification method and model
CN114050975A (en) * 2022-01-10 2022-02-15 苏州浪潮智能科技有限公司 Heterogeneous multi-node interconnection topology generation method and storage medium
CN115272943A (en) * 2022-09-29 2022-11-01 南通双和食品有限公司 Livestock and poultry feeding abnormity identification method based on data processing
CN115861822B (en) * 2023-02-07 2023-05-12 海豚乐智科技(成都)有限责任公司 Target local point and global structured matching method and device
CN115861822A (en) * 2023-02-07 2023-03-28 海豚乐智科技(成都)有限责任公司 Target local point and global structured matching method and device
CN116311472A (en) * 2023-04-07 2023-06-23 湖南工商大学 Micro-expression recognition method and device based on multi-level graph convolution network
CN116311472B (en) * 2023-04-07 2023-10-31 湖南工商大学 Micro-expression recognition method and device based on multi-level graph convolution network

Also Published As

Publication number Publication date
CN112800903B (en) 2022-08-26

Similar Documents

Publication Publication Date Title
CN112800903B (en) Dynamic expression recognition method and system based on space-time graph convolutional neural network
CN109359538B (en) Training method of convolutional neural network, gesture recognition method, device and equipment
Zhang et al. Multimodal learning for facial expression recognition
Youssif et al. Automatic facial expression recognition system based on geometric and appearance features
CN110728209A (en) Gesture recognition method and device, electronic equipment and storage medium
Murtaza et al. Analysis of face recognition under varying facial expression: a survey.
CN111553267B (en) Image processing method, image processing model training method and device
CN111401216A (en) Image processing method, model training method, image processing device, model training device, computer equipment and storage medium
Li et al. Learning symmetry consistent deep cnns for face completion
Yang et al. Facial expression recognition based on dual-feature fusion and improved random forest classifier
CN113989890A (en) Face expression recognition method based on multi-channel fusion and lightweight neural network
Zhao et al. Applying contrast-limited adaptive histogram equalization and integral projection for facial feature enhancement and detection
Yu Emotion monitoring for preschool children based on face recognition and emotion recognition algorithms
CN112906520A (en) Gesture coding-based action recognition method and device
Podder et al. Time efficient real time facial expression recognition with CNN and transfer learning
Jin et al. Learning facial expressions with 3D mesh convolutional neural network
Tautkutė et al. Classifying and visualizing emotions with emotional DAN
CN112800979B (en) Dynamic expression recognition method and system based on characterization flow embedded network
CN113076905B (en) Emotion recognition method based on context interaction relation
Ling et al. Human object inpainting using manifold learning-based posture sequence estimation
CN112016592B (en) Domain adaptive semantic segmentation method and device based on cross domain category perception
CN115862120B (en) Face action unit identification method and equipment capable of decoupling separable variation from encoder
Kale et al. Face age synthesis: A review on datasets, methods, and open research areas
Dembani et al. UNSUPERVISED FACIAL EXPRESSION DETECTION USING GENETIC ALGORITHM.
Moran Classifying emotion using convolutional neural networks

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant