CN117238019A - Video facial expression category identification method and system based on space-time relative transformation

Video facial expression category identification method and system based on space-time relative transformation

Info

Publication number
CN117238019A
Authority
CN
China
Prior art keywords
space
relative transformation
time sequence
features
video
Prior art date
Legal status
Pending
Application number
CN202311250870.XA
Other languages
Chinese (zh)
Inventor
文贵华
陈栋梁
Current Assignee
South China University of Technology SCUT
Original Assignee
South China University of Technology SCUT
Priority date
Filing date
Publication date
Application filed by South China University of Technology SCUT filed Critical South China University of Technology SCUT
Priority to CN202311250870.XA priority Critical patent/CN117238019A/en
Publication of CN117238019A publication Critical patent/CN117238019A/en
Pending legal-status Critical Current

Landscapes

  • Image Analysis (AREA)

Abstract

The application discloses a method and a system for identifying facial expression categories of video based on space-time relative transformation, wherein the method comprises the steps of extracting local spatial characteristics of each frame of image in the video; carrying out space relative transformation on each local space feature based on Euclidean distance, and carrying out weighted fusion and aggregation learning on the obtained space relative transformation features to obtain the space features of each frame of image; based on Euclidean distance and hyperbolic distance, carrying out time sequence relative transformation on each frame of image space characteristics, and carrying out weighted fusion and aggregation learning on the obtained time sequence relative transformation characteristics to obtain space-time emotion characteristics of the video; and identifying facial expression categories according to the space-time emotion characteristics of the video. The application can better extract the space and time sequence characteristics of the facial expression video, thereby encoding the space and time sequence geometric information into the time and space characteristics, and promoting more effective space-time characteristic learning.

Description

Video facial expression category identification method and system based on space-time relative transformation
Technical Field
The application relates to the technical field of facial expression recognition, in particular to a video facial expression category recognition method and system based on space-time relative transformation.
Background
Facial expression recognition is a technique for understanding human emotion from a facial image or video sequence, through which the emotion of the other party can be understood and their inner psychological state perceived. Enabling a machine to recognize human emotion as a human does can better promote interaction between robots and humans and make the machine truly intelligent.
In the real world, facial expressions are dynamic, and video-based data can provide richer facial information. In recent years, with the release of many large-scale video facial expression datasets, real-world dynamic facial expression recognition faces various challenges such as scene changes, pose changes and illumination differences.
Currently, advanced deep expression recognition models have been proposed; for example, Zhao et al. proposed the Former-DFER method based on the Transformer, and Wang et al. proposed a dual-path multi-stimulus collaborative network. However, existing video expression recognition methods have weak feature extraction capability and find it difficult to extract accurate and discriminative space-time features, so their accuracy is low and further improvement is still needed.
Disclosure of Invention
In view of the above, the application provides a method and a system for identifying facial expression categories of video based on space-time relative transformation, which aim to improve the feature extraction capability and further improve the accuracy of facial expression identification in video.
In order to achieve the above purpose, the present application adopts the following technical scheme:
in one aspect, the application provides a method for identifying video facial expression categories based on space-time relative transformation, which comprises the following steps:
extracting local spatial characteristics of each frame of image in the video;
carrying out space relative transformation on each local space feature based on Euclidean distance, and carrying out weighted fusion and aggregation learning on the obtained space relative transformation features to obtain the space features of each frame of image;
performing time sequence relative transformation on the spatial features of each frame of image based on Euclidean distance and hyperbolic distance, and performing weighted fusion and aggregation learning on the obtained time sequence relative transformation features to obtain space-time emotion features of the video;
and identifying facial expression categories according to the space-time emotion characteristics of the video.
Preferably, the time sequence relative transformation is performed on the image space features of each frame based on the Euclidean distance to obtain a first time sequence relative transformation feature in the Euclidean space, the time sequence relative transformation is performed on the image space features of each frame based on the hyperbolic distance to obtain a second time sequence relative transformation feature in the hyperbolic space, and the time-space emotion feature of the video is obtained after the weighted fusion and the aggregation learning of the first time sequence relative transformation feature and the second time sequence relative transformation feature.
Preferably, the spatial relative transformation feature is a one-dimensional vector formed by Euclidean distances between the corresponding local spatial feature and other local spatial features;
the first time sequence relative transformation characteristic is a one-dimensional vector formed by Euclidean distances of the corresponding frame image space characteristic and other frame image space characteristics;
the second time sequence relative transformation characteristic is a one-dimensional vector formed by hyperbolic distances of the corresponding frame image space characteristic and other frame image space characteristics.
Preferably, when calculating the hyperbolic distance, the spatial feature of each frame image is first mapped into the hyperbolic geometric space through the Poincaré disk model according to the following formula:
s_hyi = tanh(√c·‖s_fi‖) · s_fi / (√c·‖s_fi‖)
wherein tanh(·) represents the hyperbolic tangent function, c represents the negative curvature of the sphere, and s_fi represents the spatial feature of the i-th frame image.
Preferably, in the hyperbolic space, the hyperbolic distance between the spatial features of any two frame images is calculated according to the following formula:
d_hy(s_hyi, s_hyj) = (2/√c)·artanh(√c·‖(−s_hyi) ⊕_c s_hyj‖)
wherein d_hy(s_hyi, s_hyj) represents the hyperbolic distance between features s_hyi and s_hyj, s_hyi and s_hyj respectively represent the time sequence features corresponding to the spatial features of the i-th and j-th frame images in the hyperbolic space, and ⊕_c represents addition in the hyperbolic space, defined for any two points x and y in the hyperbolic space as
x ⊕_c y = ((1 + 2c⟨x,y⟩ + c‖y‖²)x + (1 − c‖x‖²)y) / (1 + 2c⟨x,y⟩ + c²‖x‖²‖y‖²).
Preferably, the process of weighted fusion includes:
learning, by a fully connected layer, the attention weights of the spatial relative transformation feature/the first time sequence relative transformation feature/the second time sequence relative transformation feature,
normalizing them by the Sigmoid function to obtain the corresponding attention coefficients,
weighting the spatial relative transformation features according to the corresponding attention coefficients and then splicing them with the corresponding local spatial features to obtain the weighted spatial relative transformation features; or
weighting the first time sequence relative transformation feature/the second time sequence relative transformation feature according to the corresponding attention coefficient and then splicing it with the image spatial feature of the corresponding frame to obtain the weighted time sequence relative transformation feature.
Preferably, the weighted and fused transformation features are subjected to aggregation learning sequentially through a multi-head attention and a multi-layer perceptron; wherein,
before the aggregation learning of the spatial relative transformation characteristics, a spatial dynamic category token is firstly set and spliced into the weighted fusion spatial relative transformation characteristics,
before the relative transformation characteristics of the time sequence are subjected to aggregation learning, a dynamic time sequence category token is set, and is spliced into the weighted fused relative transformation characteristics of the time sequence, and the time sequence position codes are determined and embedded into the corresponding weighted fused relative transformation characteristics of the time sequence.
Preferably, after the spatial features of each frame of image are obtained, sorting is performed according to time, and the relative transformation features of the time sequence are obtained according to the sorted spatial features of the image.
On the other hand, the application discloses a video facial expression category recognition system based on space-time relative transformation, which comprises,
the space relative transformation module is used for carrying out space relative transformation on the local space characteristics of each frame of image according to the Euclidean distance to obtain space relative transformation characteristics;
the spatial feature interaction module is used for carrying out weighted fusion and aggregation learning on the spatial relative transformation features to obtain spatial features of each frame of image;
the time sequence relative transformation module is used for performing time sequence transformation on the spatial characteristics of each frame of image according to the Euclidean distance and the hyperbolic distance to obtain time sequence relative transformation characteristics;
and the time sequence feature interaction module is used for carrying out weighted fusion and aggregation learning on the time sequence relative transformation features to obtain the space-time emotion features of the video.
Preferably, the method further comprises the steps of,
the local spatial feature extraction module is used for extracting local spatial features of each frame of image in the video; and
and the facial expression type recognition module is used for recognizing facial expression types according to the space-time emotion characteristics of the video.
Compared with the prior art, the application discloses a video facial expression category recognition method and system based on space-time relative transformation. Spatial relative transformation features are obtained through Euclidean distance calculation and combined with the original spatial features by weighted splicing using self-learned fusion attention coefficients, encoding spatial geometric information into the spatial features. Dual time sequence relative transformation features are further calculated through the Euclidean distance and the hyperbolic distance and combined with the original video time sequence features by weighted splicing using separately learned fusion attention coefficients, encoding dual dynamic time sequence geometric information into the time sequence features to generate more accurate video expression features, from which the emotion category of the facial expression in the video is finally obtained.
The technical scheme of the application can better extract the space and time sequence characteristics of the facial expression video; thereby better focusing on important space and timing characteristics and ignoring noise; encoding spatial and temporal geometric information into temporal and spatial features may facilitate more efficient spatio-temporal feature learning.
Additional features and advantages of the application will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the application. The objectives and other advantages of the application will be realized and attained by the structure particularly pointed out in the written description and claims thereof as well as the appended drawings.
The technical scheme of the application is further described in detail through the drawings and the embodiments.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings that are required to be used in the embodiments or the description of the prior art will be briefly described below, and it is obvious that the drawings in the following description are only embodiments of the present application, and that other drawings can be obtained according to the provided drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of a video facial expression class identification method based on space-time relative transformation;
FIG. 2 is a flow chart for weighted fusion and aggregate learning of spatial relative transformation features;
FIG. 3 is a schematic diagram of a timing video feature of the present application;
FIG. 4 is a flow chart for weighted fusion and aggregate learning of timing relative transformation features;
fig. 5 is a schematic diagram of a video facial expression category recognition process based on spatiotemporal relative transformation.
Detailed Description
The following description of the embodiments of the present application will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present application, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
The embodiment of the application discloses a video facial expression category recognition method and a system based on space-time relative transformation, which further encode space and time sequence geometric information into time and space characteristics by focusing on the space characteristics and time sequence characteristics in facial expression local characteristics so as to reduce noise interference, accurately extract emotion characteristics in facial expressions and promote effective learning of the space-time characteristics.
As shown in fig. 1, the method of the present application comprises the steps of:
extracting local spatial characteristics of each frame of image in the video;
carrying out space relative transformation on each local space feature based on Euclidean distance, and carrying out weighted fusion and aggregation learning on the obtained space relative transformation features to obtain the space features of each frame of image;
performing time sequence relative transformation on the spatial features of each frame of image based on Euclidean distance and hyperbolic distance, and performing weighted fusion and aggregation learning on the obtained time sequence relative transformation features to obtain space-time emotion features of the video;
and identifying facial expression categories according to the space-time emotion characteristics of the video.
The following is a description of specific examples.
Example 1
Firstly, extracting local spatial characteristics of each frame of image in a video;
in the embodiment, the local spatial features of each frame of image in the input facial expression video data are extracted by convolution; feature extraction is performed on video images of each frame using ResNet as a convolutional backbone network.
In one embodiment, ResNet-18 is adopted as the backbone convolutional neural network. It contains 5 stages (Stage), and all stages except the first are composed of residual blocks with different output feature scales. Given an input facial expression video of T frames, the spatial features {Z_1, Z_2, ..., Z_T} of all frame images are obtained through the backbone convolutional neural network.
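As an illustrative sketch only (the patent gives no code; the ResNet-18 backbone and 112×112 input follow the description, while the tensor names and the clip length T = 16 are assumptions), per-frame spatial feature extraction in PyTorch might look like:

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

class FrameBackbone(nn.Module):
    """Per-frame convolutional backbone; keeps the spatial map of the last stage."""
    def __init__(self):
        super().__init__()
        base = resnet18()                          # ResNet-18 with 5 stages
        # Drop global average pooling and the classifier, keep the convolutional stages.
        self.stem = nn.Sequential(*list(base.children())[:-2])

    def forward(self, video):                      # video: (B, T, 3, H, W)
        b, t, c, h, w = video.shape
        feat = self.stem(video.reshape(b * t, c, h, w))   # (B*T, 512, h', w')
        return feat.reshape(b, t, *feat.shape[1:])        # per-frame spatial features Z_1..Z_T

clip = torch.randn(2, 16, 3, 112, 112)             # assumed clip length T = 16
z = FrameBackbone()(clip)
print(z.shape)                                     # torch.Size([2, 16, 512, 4, 4])
```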
Secondly, carrying out space relative transformation on each local space feature based on Euclidean distance, and carrying out weighted fusion and aggregation learning on the obtained space relative transformation features to obtain the space features of each frame of image;
overall, the second step comprises the following steps:
S21, calculating the Euclidean distance between every two different local spatial feature blocks according to the input local spatial features;
S22, representing the spatial relative transformation feature of each local spatial feature by a one-dimensional vector formed by the distances between that local spatial feature and the other local spatial features, and connecting all the one-dimensional vectors in parallel to obtain a group of spatial relative transformation features;
S23, calculating the fused attention coefficients of the spatial relative transformation features based on an attention learning operation to obtain the weighted spatial relative transformation features;
S24, splicing the weighted spatial relative transformation features with the corresponding original local spatial features to obtain the weighted fused spatial relative transformation features;
S25, performing aggregation learning on the weighted fused spatial relative transformation features to obtain the spatial features of the image.
Local areas of the face usually convey richer emotion information. Each frame image spatial feature Z_i obtained through the backbone convolutional neural network has dimension c×h×w. Each frame feature Z_i is then divided into L local spatial features along the spatial dimensions (h and w), i.e. {z_1, ..., z_L}, where L = hw. The local spatial features are then input into the spatial relative transformation module to learn spatially relative-transformation-enhanced features.
S21, calculating Euclidean distance between every two different local spatial feature blocks
The original features do not make good use of, or preserve, the manifold geometric structure among different local features, which can well reflect the structural information of the facial image and convey high-level, rich emotion features. Therefore, the spatial relative transformation module provided by the application constructs an intrinsically important spatial geometric structure by calculating the Euclidean distance between every two different local spatial feature blocks. The Euclidean distance between any two local spatial features z_i and z_j can be expressed as:
d_E(z_i, z_j) = √( Σ_{k=1}^{m} (z_{i,k} − z_{j,k})² )
where m is the dimension of the feature, d_E(z_i, z_j) represents the Euclidean distance between the two local spatial features z_i and z_j, and L is the total number of local spatial features. The application explores important structural relations by calculating the distance between each local spatial feature and the other local spatial features, so as to discover useful facial spatial structure information relevant to the expression.
S22, determining the spatial relative transformation feature of each local spatial feature according to the Euclidean distances;
For each local spatial feature, a set of Euclidean distances to the other local features is obtained, so that each local spatial feature can be represented by a feature vector made up of these distances. This feature vector covers the inherent similarity relations between the local feature and the other features and can effectively express the geometric structure information between them. The spatial relative transformation vector of any local spatial feature i is expressed as:
s_i = [d_E(z_i, z_1), d_E(z_i, z_2), ..., d_E(z_i, z_L)]
Then, all the spatial relative transformation vectors are connected in parallel to obtain a group of corresponding spatial relative transformation features, specifically expressed as:
S = {s_1, s_2, ..., s_L}
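A minimal sketch of S21–S22 (the shapes are illustrative assumptions): the c×h×w frame feature map is flattened into L = hw local features, and the L×L matrix of pairwise Euclidean distances is the spatial relative transformation feature S:

```python
import torch

def spatial_relative_transform(frame_feat):
    """frame_feat: (c, h, w) feature map of one frame.
    Returns the L local features and the (L, L) matrix whose row i holds the
    Euclidean distances from z_i to every other local feature, L = h*w."""
    c, h, w = frame_feat.shape
    z = frame_feat.reshape(c, h * w).t()           # (L, c) local spatial features z_1..z_L
    s = torch.cdist(z, z, p=2)                     # (L, L) pairwise Euclidean distances
    return z, s

z, S = spatial_relative_transform(torch.randn(512, 4, 4))
print(z.shape, S.shape)                            # torch.Size([16, 512]) torch.Size([16, 16])
```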
S23, weighting the spatial relative transformation features;
After a group of spatial relative transformation features is learned, in order to better embed the learned spatial geometric knowledge into the original spatial features, the application designs an effective fusion attention mechanism that adaptively learns suitable fusion attention coefficients and finds an optimal way to combine them with the original features by weighted splicing. The fusion attention module is shown in FIG. 2.
Specifically, the learned spatial relative transformation feature matrix S is input into a fully connected layer (FC) to adaptively learn the attention weight of each local spatial relative transformation feature vector s_i; the learned attention weights are then normalized to between 0 and 1 through a Sigmoid function to obtain the final fused attention coefficients α_i. Finally, a group of attention weights α = {α_1, α_2, α_3, ..., α_L} is obtained and multiplied onto the spatial relative transformation matrix S. The above procedure can be summarized as:
α = σ(FC(S))
and the weighted spatial relative transformation feature is obtained as α ⊙ S, where ⊙ represents the element-by-element multiplication operation and σ(·) is defined as the Sigmoid function. By adaptively adjusting and learning the weight coefficient of each local spatial relative transformation feature vector, a more comprehensive weighted spatial relative transformation feature is further obtained by fusion.
S24, splicing the weighted spatial relative transformation features with the corresponding original local spatial features;
Further, the captured weighted spatial relative transformation features are spliced with the original features to effectively embed the learned geometric knowledge into the original spatial features, which is expressed as follows:
s_Ri = (α_i · s_i) ∥ z_i
where z_i represents the original local feature and ∥ represents the feature-vector concatenation (tandem splicing) operation.
Through the above operations, the weighted fused spatial relative transformation features are finally obtained.
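Steps S23–S24 can be sketched as follows (layer sizes are assumptions; the FC-plus-Sigmoid fusion attention and the concatenation with the original local features follow the formulas above):

```python
import torch
import torch.nn as nn

class FusionAttention(nn.Module):
    """alpha = sigmoid(FC(S)); each weighted relative vector is spliced with z_i."""
    def __init__(self, rel_dim):
        super().__init__()
        self.fc = nn.Linear(rel_dim, 1)            # one attention weight per relative vector

    def forward(self, S, Z):                       # S: (L, L) relative vectors, Z: (L, c) locals
        alpha = torch.sigmoid(self.fc(S))          # (L, 1) fused attention coefficients
        return torch.cat([alpha * S, Z], dim=-1)   # (L, L + c) weighted fused features S_R

S_R = FusionAttention(rel_dim=16)(torch.randn(16, 16), torch.randn(16, 512))
print(S_R.shape)                                   # torch.Size([16, 528])
```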
S25: performing aggregation learning on the weighted and fused spatial relative transformation characteristics, and learning the spatial characteristics of each frame of expression image in the video;
and (3) obtaining weighted fusion relative transformation characteristics embedded with space geometric knowledge after the step S24. And then, effectively fusing the original features and geometric knowledge of the feature matrix through aggregation learning, and simultaneously realizing effective interaction between different local features so as to learn effective spatial feature representation of each frame of image.
Specifically, a dynamic fusion feature token s is introduced first f It is defined as:
then splice it into weighted fusion spatial relative transformation features, so that S R ={s f, s R1 ,s R2 ,...,s RL }. The different local space plus fusion features are then aggregated, and interaction of local space information is achieved through multi-head space attention (MHSA) to better integrate geometric knowledge into the features, and the process can be described by the formula:
S_RM = S_R + LN(MHSA(S_R))
S_RS = S_RM + LN(MLP(S_RM))
where MHSA denotes the multi-head spatial attention, in which the attention of the j-th head is computed as ρ((S_R W_Qj)(S_R W_Kj)^T / √d_k)·(S_R W_Vj) with d_k the dimension of each head, and the heads are concatenated and fused by W_O; ρ represents the Softmax function, W_Qj, W_Kj and W_Vj represent the feature embedding matrices of the j-th head, W_O represents the multi-head spatial attention feature fusion transformation matrix, MLP represents a multi-layer perceptron, and LN(·) represents layer normalization (Layer Normalization). Through the above operations, the important spatial features of each frame expression image can be effectively explored and captured.
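A sketch of the S25 aggregation under assumed dimensions, mirroring S_RM = S_R + LN(MHSA(S_R)) and S_RS = S_RM + LN(MLP(S_RM)); PyTorch's built-in multi-head attention stands in for the MHSA described above:

```python
import torch
import torch.nn as nn

class SpatialAggregation(nn.Module):
    """Dynamic fusion token + multi-head spatial attention + MLP."""
    def __init__(self, dim, heads=8, mlp_ratio=4):
        super().__init__()
        self.token = nn.Parameter(torch.zeros(1, 1, dim))    # learnable fusion token s_f
        self.mhsa = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ln1, self.ln2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, mlp_ratio * dim), nn.GELU(),
                                 nn.Linear(mlp_ratio * dim, dim))

    def forward(self, s_r):                        # s_r: (B, L, dim) weighted fused features
        x = torch.cat([self.token.expand(s_r.size(0), -1, -1), s_r], dim=1)
        attn, _ = self.mhsa(x, x, x)
        x = x + self.ln1(attn)                     # S_RM = S_R + LN(MHSA(S_R))
        x = x + self.ln2(self.mlp(x))              # S_RS = S_RM + LN(MLP(S_RM))
        return x[:, 0]                             # token position -> frame feature s_f

s_f = SpatialAggregation(dim=528)(torch.randn(2, 16, 528))
print(s_f.shape)                                   # torch.Size([2, 528])
```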
In one embodiment, before the third step, the spatial features of the multiple frames of images are ordered according to time sequence to form the video time sequence feature. As shown in fig. 3.
For each frame image, after spatial relative transformation, weighted fusion and aggregation learning, the learned dynamic fusion feature token s_f = S_RS,0 is extracted as the final feature of that frame image. Then, the features of all frame images are connected in parallel to form a group of time sequence features of the facial expression video, S_F = {s_f1, s_f2, ..., s_fT}.
Thirdly, carrying out time sequence relative transformation on the spatial features of each frame of image based on Euclidean distance and hyperbolic distance, and carrying out weighted fusion and aggregation learning on the obtained time sequence relative transformation features to obtain the space-time emotion features of the video;
video data typically contains dynamic time sequence information changes, and exploration and utilization of useful time sequence geometry information can help effectively capture facial expression dynamic information in video, which helps learn richer time sequence information to obtain more efficient video expression features.
In this regard, after the time sequence features are obtained, the application designs a dual time sequence relative transformation module to capture and construct useful time sequence geometric information from different spaces, and embed the useful time sequence geometric information into the original time sequence features for weighting and fusion, so that the time sequence features can contain richer multi-level information.
The method comprises the following specific steps:
S31: calculating the Euclidean distance between every two different time sequence sub-features in the input video time sequence features; here a time sequence sub-feature refers to the spatial feature of one frame image obtained in the second step;
S32: for each time sequence sub-feature, forming a one-dimensional vector from its distances to the other time sequence sub-features to represent the corresponding Euclidean-space time sequence relative transformation feature, and obtaining in parallel a group of Euclidean-distance time sequence relative transformation features;
S33: mapping the input video time sequence features into the hyperbolic space to obtain hyperbolic time sequence features, and calculating the hyperbolic distance between every two hyperbolic time sequence sub-features;
S34: for each hyperbolic time sequence sub-feature, forming a one-dimensional vector from its distances to the other hyperbolic time sequence sub-features to represent the corresponding hyperbolic-distance time sequence relative transformation feature, and obtaining in parallel a group of hyperbolic-distance time sequence relative transformation features;
S35: applying an attention learning operation to respectively calculate the fused attention coefficients of the Euclidean and hyperbolic time sequence relative transformation features, obtaining the weighted time sequence relative transformation features;
S36: splicing the learned weighted time sequence relative transformation features with the original time sequence features;
S37: performing aggregation learning on the weighted fused time sequence features to finally obtain the space-time emotion features of the video.
Specifically, the dual-time sequence relative transformation module mainly comprises three parts, namely Euclidean distance time sequence relative transformation, hyperbolic distance time sequence relative transformation and fusion attention learning.
In S31, the original time sequence features S_F = {s_f1, s_f2, ..., s_fT} do not make good use of, or preserve, the dynamic time sequence manifold geometric structure among the different time sequence sub-features, which can well reflect the subtle action changes between different facial expression frames and convey high-level, rich emotion features.
Therefore, the Euclidean-distance time sequence relative transformation module provided by the application constructs an important dynamic time sequence geometric structure by calculating the Euclidean distance between every two different time sequence sub-features. For any two time sequence sub-features s_fi and s_fj, the Euclidean distance can be expressed as:
d_E(s_fi, s_fj) = √( Σ_{k=1}^{n} (s_fi,k − s_fj,k)² )
where n denotes the dimension of a time sequence sub-feature and d_E(s_fi, s_fj) represents the Euclidean distance between the two time sequence sub-features s_fi and s_fj. The application explores important structural relations by calculating the Euclidean distance between each time sequence sub-feature and the other time sequence sub-features, so as to discover useful dynamic time sequence geometric structure information of the facial video frames.
In S32, for each time sequence sub-feature, a set of Euclidean distances to the other time sequence sub-features is obtained, and the one-dimensional vector formed by these distances can be used to represent the dynamic geometric information structure of that time sequence sub-feature. This feature vector covers the inherent similarity relations between the time sequence sub-feature and the other time sequence features and can effectively describe the dynamic time sequence geometric changes between them. The Euclidean-distance time sequence relative transformation vector of any time sequence sub-feature i is expressed as:
v_i = [d_E(s_fi, s_f1), d_E(s_fi, s_f2), ..., d_E(s_fi, s_fT)]
The Euclidean-distance time sequence relative transformation vectors of all time sequence sub-features are connected in parallel to obtain a group of corresponding Euclidean-distance time sequence relative transformation features, specifically expressed as:
V = {v_1, v_2, ..., v_T}
S33, mapping the input video time sequence features into the hyperbolic space to obtain hyperbolic time sequence features;
For facial expression video data, facial expression changes are mainly presented in a few small facial action-unit areas, so the facial expression images of different frames differ only slightly. The Euclidean distance is linear and symmetric, may ignore important characteristics in the data, and may fail to capture the rich yet subtle action-change differences of video expressions. The hyperbolic distance, in contrast, is nonlinear and asymmetric, so it can reflect subtle differences and directionality in the data; at the same time, the hyperbolic space is highly deformable, can better capture similarity and distance in the data, and can better adapt to its dynamic changes. Accordingly, the hyperbolic-distance time sequence relative transformation converts the original time sequence features into the hyperbolic space, so that the differences between different frame images of the video can be enhanced to better capture the subtle action changes of facial expressions.
For the input time sequence features S_F = {s_f1, s_f2, ..., s_fT}, they are first mapped into the hyperbolic geometric space through the Poincaré disk model (Poincare ball model) to obtain the hyperbolic time sequence features S_HY = {s_hy1, s_hy2, ..., s_hyT}. Specifically, each time sequence sub-feature s_fi is mapped from the Euclidean space to the hyperbolic geometric space as shown in the following formula:
s_hyi = tanh(√c·‖s_fi‖) · s_fi / (√c·‖s_fi‖)
where tanh(·) represents the hyperbolic tangent function and c represents the negative curvature of the sphere. Through the above calculation, a group of hyperbolic time sequence features S_HY is obtained. Then, in the hyperbolic space, the hyperbolic distances between the different hyperbolic time sequence sub-features are calculated. For any two hyperbolic time sequence sub-features s_hyi and s_hyj, the hyperbolic distance can be expressed as:
d_hy(s_hyi, s_hyj) = (2/√c)·artanh(√c·‖(−s_hyi) ⊕_c s_hyj‖)
where d_hy(s_hyi, s_hyj) represents the hyperbolic distance between the two time sequence sub-features s_hyi and s_hyj, and ⊕_c represents addition in the hyperbolic space, defined for any two points x and y in the hyperbolic space as
x ⊕_c y = ((1 + 2c⟨x,y⟩ + c‖y‖²)x + (1 − c‖x‖²)y) / (1 + 2c⟨x,y⟩ + c²‖x‖²‖y‖²)
The application explores important structural relations by calculating the hyperbolic distance between each time sequence sub-feature and the other time sequence sub-features, so as to discover useful dynamic time sequence geometric structure information of the facial video frames.
In S34, after step S33, for each time sequence sub-feature a set of hyperbolic distances to the other time sequence sub-features is obtained, and the one-dimensional vector formed by these hyperbolic distances expresses the dynamic geometric information structure of that time sequence sub-feature in the hyperbolic space, so the dynamic time sequence geometric changes between time sequence sub-features can be described more clearly. Thus, the hyperbolic-distance time sequence relative transformation vector of any time sequence sub-feature i can be obtained, expressed as:
u_i = [d_hy(s_hyi, s_hy1), d_hy(s_hyi, s_hy2), ..., d_hy(s_hyi, s_hyT)]
The hyperbolic-distance time sequence relative transformation vectors of all time sequence sub-features are connected in parallel to obtain a group of corresponding hyperbolic-distance time sequence relative transformation features, specifically expressed as:
U = {u_1, u_2, ..., u_T}
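A sketch of the hyperbolic branch (S33–S34), assuming the Poincaré-ball exponential map at the origin and the Möbius-addition distance written above; the feature sizes are illustrative:

```python
import torch

def to_poincare(x, c=1.0, eps=1e-6):
    """Exponential map at the origin of the Poincare ball (assumed mapping)."""
    norm = x.norm(dim=-1, keepdim=True).clamp_min(eps)
    return torch.tanh(c ** 0.5 * norm) * x / (c ** 0.5 * norm)

def mobius_add(x, y, c=1.0):
    """Mobius addition x (+)_c y."""
    xy = (x * y).sum(-1, keepdim=True)
    x2 = (x * x).sum(-1, keepdim=True)
    y2 = (y * y).sum(-1, keepdim=True)
    num = (1 + 2 * c * xy + c * y2) * x + (1 - c * x2) * y
    den = 1 + 2 * c * xy + c ** 2 * x2 * y2
    return num / den.clamp_min(1e-6)

def hyperbolic_dist(x, y, c=1.0):
    """d_hy(x, y) = (2 / sqrt(c)) * artanh(sqrt(c) * ||(-x) (+)_c y||)."""
    arg = (c ** 0.5 * mobius_add(-x, y, c).norm(dim=-1)).clamp(max=1 - 1e-6)
    return 2 / c ** 0.5 * torch.atanh(arg)

S_F = torch.randn(16, 528)                                 # T = 16 per-frame features (illustrative)
S_HY = to_poincare(S_F)                                    # hyperbolic time sequence features
U = hyperbolic_dist(S_HY.unsqueeze(1), S_HY.unsqueeze(0))  # (T, T) hyperbolic relative transform
print(U.shape)
```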
S35, weighting and fusing the time sequence relative transformation features;
In order to better embed the learned time sequence geometric knowledge into the original time sequence features, the application deploys a dual time sequence fusion attention mechanism to adaptively learn suitable fusion attention coefficients, and the dual-distance time sequence relative transformation features are weighted, fused and spliced with the original time sequence features; the dual time sequence fusion attention mechanism is shown in FIG. 3.
Specifically, for the learned dual time sequence relative transformation feature matrices V and U, the attention weight of any time sequence relative transformation feature vector v_i or u_i is adaptively learned through a fully connected layer, and the learned attention weights are then normalized to between 0 and 1 through the Sigmoid function to obtain the final fused attention coefficients β_i and γ_i. Finally, two groups of attention weights β = {β_1, β_2, β_3, ..., β_T} and γ = {γ_1, γ_2, γ_3, ..., γ_T} are obtained. The above procedure can be summarized as:
β=σ(FC(V)),γ=σ(FC(U))
wherein,representing element-by-element multiplication operation, sigma (·) is defined as a Sigmoid function, and through adaptive adjustment and learning of each time sequence relative transformation feature vector weight coefficient, more comprehensive weighted time sequence relative transformation features are further obtained through fusion.
S36, splicing the learned weighted time sequence relative transformation characteristic with the original time sequence characteristic.
Further, the captured dual weighted time sequence relative transformation features are spliced with the original time sequence features to effectively embed the learned dynamic time sequence geometric knowledge into the original time sequence features, which is expressed as follows:
s_Pi = (β_i · v_i) ∥ (γ_i · u_i) ∥ s_fi
where ∥ represents the feature-vector concatenation (tandem splicing) operation. Through the above operations, the weighted fused time sequence relative transformation features S_P are finally obtained.
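Steps S35–S36 can be sketched as follows (sizes are assumptions; U is filled with placeholder values standing in for the hyperbolic relative transformation from the previous sketch):

```python
import torch
import torch.nn as nn

T, dim = 16, 528                                     # illustrative number of frames / feature size
S_F = torch.randn(T, dim)                            # per-frame features from the spatial branch
V = torch.cdist(S_F, S_F)                            # Euclidean time sequence relative transform (T, T)
U = torch.rand(T, T)                                 # stands in for the hyperbolic counterpart

fc_e, fc_h = nn.Linear(T, 1), nn.Linear(T, 1)        # one fusion-attention FC per distance space
beta = torch.sigmoid(fc_e(V))                        # (T, 1) Euclidean fusion coefficients
gamma = torch.sigmoid(fc_h(U))                       # (T, 1) hyperbolic fusion coefficients
S_P = torch.cat([beta * V, gamma * U, S_F], dim=-1)  # (T, 2T + dim) weighted fused timing features
print(S_P.shape)                                     # torch.Size([16, 560])
```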
S37: and performing aggregation learning on the time sequence characteristics after weighted fusion to finally obtain the space-time emotion characteristics of the video.
Performing aggregation learning on the weighted and fused time sequence features to efficiently fuse the original time sequence features and dynamic geometric knowledge, and simultaneously realizing effective interaction among different time sequence features so as to obtain final emotion feature representation of the facial expression video;
specifically, first one is introducedDynamic class token s cls It is defined as:
then, the time sequence feature matrix { s } is obtained by splicing the time sequence feature matrix { s } into the weighted fusion space relative transformation features cls ,S P1 ,S P2 ,...,S PL }。
At the same time, a leachable time sequence position code embedded e is defined pos Where pos= {1,2,3,.. P ={s cls ,s P1 ,s P2 ,...,s PL }+e pos . And then different weighted fusion time sequence relative transformation characteristics are aggregated, interaction of different time sequence information is realized through multi-head time sequence attention (MHTA), and time sequence dynamic geometric knowledge is better fused into the characteristics, and the process can be expressed as follows:
S_PM = S_P + LN(MHTA(S_P))
S_PT = S_PM + LN(MLP(S_PM))
where MHTA denotes the multi-head time sequence attention, computed analogously to the multi-head spatial attention above; ρ represents the Softmax function, W_Qj, W_Kj and W_Vj represent the feature embedding matrices of the j-th head, W_O represents the multi-head time sequence attention feature fusion transformation matrix, MLP represents a multi-layer perceptron, and LN(·) represents layer normalization (Layer Normalization).
Through aggregation learning, not only can interactive learning be carried out between the weighted fused time sequence sub-features, but the dynamic class token s_cls = S_PT,0 can also sufficiently select and fuse the useful time sequence features to obtain a useful classification feature representation.
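The S37 aggregation with the dynamic class token and the learnable time sequence position embedding can be sketched like this (dimensions and head count are assumptions), mirroring S_PM = S_P + LN(MHTA(S_P)) and S_PT = S_PM + LN(MLP(S_PM)):

```python
import torch
import torch.nn as nn

class TemporalAggregation(nn.Module):
    """Dynamic class token + learnable position embedding + MHTA + MLP."""
    def __init__(self, dim, num_frames, heads=8, mlp_ratio=4):
        super().__init__()
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))              # s_cls
        self.pos = nn.Parameter(torch.zeros(1, num_frames + 1, dim)) # e_pos
        self.mhta = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ln1, self.ln2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, mlp_ratio * dim), nn.GELU(),
                                 nn.Linear(mlp_ratio * dim, dim))

    def forward(self, s_p):                        # s_p: (B, T, dim) weighted fused timing features
        x = torch.cat([self.cls.expand(s_p.size(0), -1, -1), s_p], dim=1) + self.pos
        attn, _ = self.mhta(x, x, x)
        x = x + self.ln1(attn)                     # S_PM = S_P + LN(MHTA(S_P))
        x = x + self.ln2(self.mlp(x))              # S_PT = S_PM + LN(MLP(S_PM))
        return x[:, 0]                             # S_PT,0 : space-time emotion feature of the video

video_feat = TemporalAggregation(dim=560, num_frames=16)(torch.randn(2, 16, 560))
print(video_feat.shape)                            # torch.Size([2, 560])
```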
And fourthly, identifying facial expression categories according to the space-time emotion characteristics of the video.
In this embodiment, after the space-time emotion feature S_PT,0 of each video is obtained, it is input into a linear classifier for emotion recognition to obtain the predicted emotion category Ŷ of the video. In addition, video facial expression recognition is an image/video classification task, so a back-propagation optimization algorithm is required to minimize the classification loss as much as possible. The training loss of this embodiment is mainly realized by a categorical cross-entropy loss, defined as:
L = − Σ_i Y_i · log(Ŷ_i)
where Y represents the true emotion label of the input video data and Y_i represents the probability of the i-th expression, so its value is either 0 or 1.
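A sketch of the classification and loss (the number of expression categories and the feature size are assumptions):

```python
import torch
import torch.nn as nn

num_classes = 7                                    # assumed number of expression categories
classifier = nn.Linear(560, num_classes)           # linear classifier over the video feature

video_feat = torch.randn(4, 560)                   # S_PT,0 for a batch of 4 videos (illustrative)
logits = classifier(video_feat)
pred = logits.argmax(dim=-1)                       # predicted emotion category

labels = torch.tensor([0, 3, 2, 6])                # ground-truth emotion labels
loss = nn.CrossEntropyLoss()(logits, labels)       # categorical cross-entropy
print(pred, loss.item())
```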
The application discloses a video facial expression emotion type recognition method based on space-time relative transformation, which is realized by adopting an end-to-end neural network, and is characterized in that a facial expression video frame is input and a recognized emotion type is output.
The overall architecture is shown in fig. 4. The proposed method first uses ResNet as the convolutional backbone network to extract features from each video frame; after the spatial relative transformation, the spatial relative transformation features based on the Euclidean distance are obtained, weighted and spliced with the original spatial features using fusion attention, and the spliced fused features are further transformed and learned to obtain the spatial features of each frame image.
And then, parallelly connecting the obtained spatial features of each frame of image to form a time sequence feature of a video sequence, carrying out double time sequence relative transformation to obtain double time sequence relative transformation features based on Euclidean distance and hyperbolic distance, carrying out weighted splicing fusion on the double time sequence relative transformation features and the original time sequence features by utilizing fusion attention, and carrying out transformation learning on the spliced fusion features to obtain final video emotion features.
Finally, the video emotion features are input into the classifier at the last layer of the neural network to obtain the predicted emotion category.
Example two
The embodiment further discloses a video facial expression category recognition system based on space-time relative transformation, wherein a frame diagram is shown in fig. 5, and the system comprises:
the space relative transformation module is used for carrying out space relative transformation on the local space characteristics of each frame of image according to the Euclidean distance to obtain space relative transformation characteristics;
the spatial feature interaction module is used for carrying out weighted fusion and aggregation learning on the spatial relative transformation features to obtain spatial features of each frame of image; preferably, the module is constituted by a space Transformer;
the space Transformer sequentially comprises vector embedding, multi-head spatial attention, a first Add & Norm layer, a multi-layer perceptron and a second Add & Norm layer;
the time sequence relative transformation module is used for performing time sequence transformation on the spatial characteristics of each frame of image according to the Euclidean distance and the hyperbolic distance to obtain time sequence relative transformation characteristics;
and the time sequence feature interaction module is used for carrying out weighted fusion and aggregation learning on the time sequence relative transformation features to obtain the space-time emotion features of the video. Preferably, the time sequence feature interaction module mainly comprises a time sequence Transformer;
in one embodiment, the method further comprises:
the local spatial feature extraction module is used for extracting local spatial features of each frame of image in the video; and
and the facial expression type recognition module is used for recognizing facial expression types according to the space-time emotion characteristics of the video.
The present embodiment is implemented using the deep learning framework PyTorch, and all experiments are run on a server equipped with 2 NVIDIA RTX 3090 GPUs. In addition, the CPU is an Intel i9-10850K, the memory is 64 GB, and the operating system is Ubuntu 18.04. The implementation uses the same hyper-parameter settings as ResNet-18. During training, a stochastic gradient descent algorithm with a weight decay of 1e-4, a momentum of 0.9 and a batch size of 32 is used for parameter learning. The model is trained for 60 epochs in total; the learning rate starts from 0.01 and is decayed every 30 epochs. The model is trained using only general video data augmentation, i.e., the size of the input set of video frame images is scaled to 112 × 112, 112 × 112 copies are randomly cropped out, and the set of video images is flipped horizontally with probability 0.5. During the test phase, the images are scaled to 112 × 112.
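The reported training setup maps onto a standard PyTorch loop roughly as follows; the model and data loader are placeholders, and the learning-rate decay factor is an assumption since it is not legible in the text:

```python
import torch
import torch.nn as nn

# Placeholders standing in for the full network and the real video data loader.
model = nn.Sequential(nn.Flatten(), nn.Linear(16 * 3 * 112 * 112, 7))
train_loader = [(torch.randn(4, 16, 3, 112, 112), torch.randint(0, 7, (4,)))]

optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.9, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)  # assumed factor
criterion = nn.CrossEntropyLoss()

for epoch in range(60):                            # 60 training epochs in total
    for video, label in train_loader:              # batch size 32 in the reported setup
        optimizer.zero_grad()
        loss = criterion(model(video), label)
        loss.backward()
        optimizer.step()
    scheduler.step()
```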
Experiments show that the method provided by the application surpasses existing video facial expression recognition methods and achieves better performance with a small number of parameters, which also demonstrates the efficiency of the method.
In the present specification, each embodiment is described in a progressive manner, and each embodiment is mainly described in a different point from other embodiments, and identical and similar parts between the embodiments are all enough to refer to each other. For the device disclosed in the embodiment, since it corresponds to the method disclosed in the embodiment, the description is relatively simple, and the relevant points refer to the description of the method section.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. A method for identifying the facial expression category of video based on space-time relative transformation is characterized by comprising the following steps,
extracting local spatial characteristics of each frame of image in the video;
carrying out space relative transformation on each local space feature based on Euclidean distance, and carrying out weighted fusion and aggregation learning on the obtained space relative transformation features to obtain the space features of each frame of image;
performing time sequence relative transformation on the spatial features of each frame of image based on Euclidean distance and hyperbolic distance, and performing weighted fusion and aggregation learning on the obtained time sequence relative transformation features to obtain space-time emotion features of the video;
and identifying facial expression categories according to the space-time emotion characteristics of the video.
2. The method for recognizing the facial expression category of the video based on the space-time relative transformation according to claim 1, wherein the method is characterized in that the time sequence relative transformation is carried out on the spatial characteristics of each frame of image based on the Euclidean distance, the first time sequence relative transformation characteristics in the Euclidean space are obtained, the time sequence relative transformation is carried out on the spatial characteristics of each frame of image based on the hyperbolic distance, the second time sequence relative transformation characteristics in the hyperbolic space are obtained, and the time-time emotional characteristics of the video are obtained after the weighted fusion and the aggregation learning of the first time sequence relative transformation characteristics and the second time sequence relative transformation characteristics.
3. The method for identifying facial expression categories of video based on space-time relative transformation according to claim 2, wherein,
the space relative transformation feature is a one-dimensional vector formed by Euclidean distances between the corresponding local space feature and other local space features;
the first time sequence relative transformation characteristic is a one-dimensional vector formed by Euclidean distances of the corresponding frame image space characteristic and other frame image space characteristics;
the second time sequence relative transformation characteristic is a one-dimensional vector formed by hyperbolic distances of the corresponding frame image space characteristic and other frame image space characteristics.
4. The method for recognizing facial expression categories of video based on space-time relative transformation according to claim 3, wherein when calculating the hyperbolic distance, the spatial features of each frame image are first mapped into the hyperbolic geometric space through the Poincaré disk model according to the following formula:
s_hyi = tanh(√c·‖s_fi‖) · s_fi / (√c·‖s_fi‖)
wherein tanh(·) represents the hyperbolic tangent function, c represents the negative curvature of the sphere, and s_fi represents the spatial feature of the i-th frame image.
5. The method for identifying video facial expression categories based on space-time relative transformation according to claim 4, wherein in the hyperbolic geometric space, the hyperbolic distance between the spatial features of any two frame images is calculated according to the following formula:
d_hy(s_hyi, s_hyj) = (2/√c)·artanh(√c·‖(−s_hyi) ⊕_c s_hyj‖)
wherein d_hy(s_hyi, s_hyj) represents the hyperbolic distance between features s_hyi and s_hyj, s_hyi and s_hyj respectively represent the time sequence features corresponding to the spatial features of the i-th and j-th frame images in the hyperbolic space, and ⊕_c represents addition in the hyperbolic space, defined for any two points x and y in the hyperbolic space as
x ⊕_c y = ((1 + 2c⟨x,y⟩ + c‖y‖²)x + (1 − c‖x‖²)y) / (1 + 2c⟨x,y⟩ + c²‖x‖²‖y‖²).
6. The method for identifying video facial expression categories based on spatiotemporal relative transformation of claim 2, wherein the process of weighted fusion comprises:
learning, by a fully connected layer, the attention weights of the spatial relative transformation feature/the first time sequence relative transformation feature/the second time sequence relative transformation feature,
normalizing them by the Sigmoid function to obtain the corresponding attention coefficients,
weighting the spatial relative transformation features according to the corresponding attention coefficients and then splicing them with the corresponding local spatial features to obtain the weighted spatial relative transformation features; or
weighting the first time sequence relative transformation feature/the second time sequence relative transformation feature according to the corresponding attention coefficient and then splicing it with the image spatial feature of the corresponding frame to obtain the weighted time sequence relative transformation feature.
7. The method for identifying the facial expression categories of the video based on the space-time relative transformation according to claim 1, wherein the weighted fusion transformation characteristics are subjected to aggregation learning sequentially through a multi-head attention and a multi-layer perceptron; wherein,
before the aggregation learning of the spatial relative transformation characteristics, a spatial dynamic category token is firstly set and spliced into the weighted fusion spatial relative transformation characteristics,
before the relative transformation characteristics of the time sequence are subjected to aggregation learning, a dynamic time sequence category token is set, and is spliced into the weighted fused relative transformation characteristics of the time sequence, and the time sequence position codes are determined and embedded into the corresponding weighted fused relative transformation characteristics of the time sequence.
8. The method for identifying video facial expression categories based on space-time relative transformation according to claim 1, wherein each frame of image space features are obtained and then are sorted according to time, and time sequence relative transformation features are obtained according to the sorted image space features.
9. A video facial expression category recognition system based on space-time relative transformation is characterized by comprising,
the space relative transformation module is used for carrying out space relative transformation on the local space characteristics of each frame of image according to the Euclidean distance to obtain space relative transformation characteristics;
the spatial feature interaction module is used for carrying out weighted fusion and aggregation learning on the spatial relative transformation features to obtain spatial features of each frame of image;
the time sequence relative transformation module is used for performing time sequence transformation on the spatial characteristics of each frame of image according to the Euclidean distance and the hyperbolic distance to obtain time sequence relative transformation characteristics;
and the time sequence feature interaction module is used for carrying out weighted fusion and aggregation learning on the time sequence relative transformation features to obtain the space-time emotion features of the video.
10. The video facial expression category recognition system based on spatiotemporal relative transformation of claim 9, further comprising,
the local spatial feature extraction module is used for extracting local spatial features of each frame of image in the video; and
and the facial expression type recognition module is used for recognizing facial expression types according to the space-time emotion characteristics of the video.
CN202311250870.XA 2023-09-26 2023-09-26 Video facial expression category identification method and system based on space-time relative transformation Pending CN117238019A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311250870.XA CN117238019A (en) 2023-09-26 2023-09-26 Video facial expression category identification method and system based on space-time relative transformation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311250870.XA CN117238019A (en) 2023-09-26 2023-09-26 Video facial expression category identification method and system based on space-time relative transformation

Publications (1)

Publication Number Publication Date
CN117238019A true CN117238019A (en) 2023-12-15

Family

ID=89098099

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311250870.XA Pending CN117238019A (en) 2023-09-26 2023-09-26 Video facial expression category identification method and system based on space-time relative transformation

Country Status (1)

Country Link
CN (1) CN117238019A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117958765A (en) * 2024-04-01 2024-05-03 华南理工大学 Multi-mode voice viscera organ recognition method based on hyperbolic space alignment


Similar Documents

Publication Publication Date Title
Tian et al. Cctrans: Simplifying and improving crowd counting with transformer
CN111507311B (en) Video character recognition method based on multi-mode feature fusion depth network
CN113688723B (en) Infrared image pedestrian target detection method based on improved YOLOv5
CN110378208B (en) Behavior identification method based on deep residual error network
CN109190479A (en) A kind of video sequence expression recognition method based on interacting depth study
CN114596520A (en) First visual angle video action identification method and device
CN110175248B (en) Face image retrieval method and device based on deep learning and Hash coding
CN117238019A (en) Video facial expression category identification method and system based on space-time relative transformation
CN113870160B (en) Point cloud data processing method based on transformer neural network
WO2022116616A1 (en) Behavior recognition method based on conversion module
CN116311483B (en) Micro-expression recognition method based on local facial area reconstruction and memory contrast learning
CN115222998B (en) Image classification method
CN111126155B (en) Pedestrian re-identification method for generating countermeasure network based on semantic constraint
CN114092926A (en) License plate positioning and identifying method in complex environment
CN116580440A (en) Lightweight lip language identification method based on visual transducer
CN113850182A (en) Action identification method based on DAMR-3 DNet
Dastbaravardeh et al. Channel Attention‐Based Approach with Autoencoder Network for Human Action Recognition in Low‐Resolution Frames
Jiang et al. Cross-level reinforced attention network for person re-identification
CN110782503B (en) Face image synthesis method and device based on two-branch depth correlation network
Zhang et al. Image deblurring based on lightweight multi-information fusion network
CN114780767A (en) Large-scale image retrieval method and system based on deep convolutional neural network
CN115115028A (en) Mixed data generation method based on generation countermeasure network
Xing et al. Dual attention based feature pyramid network
CN113936333A (en) Action recognition algorithm based on human body skeleton sequence
CN112434615A (en) Time sequence action detection method based on Tensorflow deep learning framework

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination