CN116665308B - Double interaction space-time feature extraction method - Google Patents
- Publication number
- CN116665308B CN116665308B CN202310741806.5A CN202310741806A CN116665308B CN 116665308 B CN116665308 B CN 116665308B CN 202310741806 A CN202310741806 A CN 202310741806A CN 116665308 B CN116665308 B CN 116665308B
- Authority
- CN
- China
- Prior art keywords
- time
- space
- feature
- double interaction
- convolution
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/20—Movements or behaviour, e.g. gesture recognition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/01—Input arrangements or combined input and output arrangements for interaction between user and computer
- G06F3/017—Gesture based interaction, e.g. based on a set of recognized hand gestures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
- G06N3/0455—Auto-encoder networks; Encoder-decoder networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0464—Convolutional networks [CNN, ConvNet]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/048—Activation functions
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/44—Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/80—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
- G06V10/806—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02A—TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
- Y02A90/00—Technologies having an indirect contribution to adaptation to climate change
- Y02A90/10—Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation
Abstract
The invention discloses a double interaction space-time feature extraction method, and relates to the technical field of machine vision. The method comprises the following steps: preprocessing skeleton data of a dataset and extracting double interaction action categories to obtain action tensors; extracting double interaction space-time features through a space-time graph convolution network, capturing global and local information; performing feature fusion on the feature tensor through an STCP module based on three-branch pooling to obtain a fine-grained double interaction space-time feature tensor; and passing the final feature tensor through a fully connected layer and a Softmax layer to help the network converge and output the double interaction category. The method has the advantage of high recognition accuracy.
Description
Technical Field
The invention relates to the technical field of machine vision, and in particular to a double interaction space-time feature extraction method based on a Transformer and multi-scale position awareness.
Background
In the human-centered field of Computer Vision (CV) research, the Human Action Recognition (HAR) task has become an important research topic owing to its wide application in many fields such as human-computer interaction, smart homes, autonomous driving, and virtual reality. At present, video-based single-person behavior recognition has been studied relatively extensively, while double (two-person) interactive behavior recognition is still at an exploratory stage. Compared with single-person actions, double interaction behavior recognition must not only cope with illumination changes, scene switching, and camera-viewpoint conversion, but also consider changes in the relative relationship between the two persons, limb occlusion, and changes in the space-time relationship during the interaction. Double interactive behavior recognition therefore remains a challenging problem in the field of computer vision, and how to effectively extract features and build a reasonable action recognition model has long been the research focus of related researchers at home and abroad.
Traditional action recognition mainly comprises a feature extraction part and a classifier, with features designed manually to extract targeted picture features. With the development of action recognition, action data has evolved from two-dimensional plane diagrams to three-dimensional skeleton data, the classification of single actions has expanded to the interactive recognition of double and even group actions, and recognition scenes have grown increasingly complex. With the development of deep learning, neural network models, particularly deep networks, have achieved wide success in complex action recognition. To create a skeleton graph representing a two-person relationship, Liu Xing et al. propose representing a single skeleton and an interactive-relationship skeleton separately in a coordinate system using a method of relative views. Pei Xiaomin et al. propose using the camera as the coordinate center and computing the Euclidean distances of the single and double skeletons themselves and of the interaction joints to represent double-skeleton features. Li Jianan et al. propose constructing knowledge-given graphs, knowledge-learning graphs, and natural connection graphs to learn interactions with minimal prior knowledge. Zhu L et al. propose constructing a binary relationship interaction graph to generate a relationship adjacency matrix for modeling double interaction. Yoshiki Ito et al. propose feeding intra-body and inter-body graphs into a multi-stream network to extract interactions. However, none of these methods considers the influence of long-range joint feature information and long-range dependence on recognition accuracy, and fine local joint information is ignored.
Disclosure of Invention
The technical problem to be solved by the invention is to provide a double interaction space-time feature extraction method with high recognition accuracy based on a Transformer and multi-scale position awareness.
In order to solve the above technical problems, the invention adopts the following technical scheme: a double interaction space-time feature extraction method comprising the following steps:
S1: preprocessing skeleton data of a dataset, and extracting double interaction action categories to obtain action tensors;
S2: extracting double interaction space-time features through a space-time graph convolution network, and capturing global and local information;
S3: performing feature fusion processing on the feature tensor through an STCP module based on three-branch pooling to obtain a fine-grained double interaction space-time feature tensor;
S4: passing the finally obtained feature tensor through the fully connected layer and the Softmax layer to help the network converge and output the double interaction category.
The further technical proposal is that: a double interaction spatial feature extraction module combining a Transformer and a lightweight spatial graph convolution is constructed to extract the double interaction spatial features; and a multi-scale position-aware temporal graph convolution module, which has a larger temporal receptive field and focuses on important joint position information, is constructed to extract the double interaction temporal features.
The further technical proposal is that: the STCP module based on three-branch pooling performs feature fusion processing on the feature tensor; the module comprises spatial, temporal, and channel branches for processing the feature tensor, and a more accurate feature map is obtained by running the three branches in parallel and fusing their features by concatenation.
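As an illustrative aid (not part of the claimed method), the overall pipeline of steps S1-S4 can be walked through as a minimal NumPy shape sketch; the NTU-RGB+D-style input layout (3 coordinates, 64 frames, 25 joints, 2 persons), the 256-dimensional backbone feature, the class count of 26, and the random weights are all hypothetical placeholders, not values fixed by the invention:

```python
import numpy as np

# Hypothetical shape walk-through of steps S1-S4. The input layout
# (C=3 coordinates, T=64 frames, V=25 joints, M=2 persons) mimics
# NTU-RGB+D-style skeleton data and is an assumption, not the patent's spec.
rng = np.random.default_rng(0)

# S1: preprocessed action tensor for one two-person sample
x = rng.standard_normal((3, 64, 25, 2))               # (C, T, V, M)

# S2 + S3 stand-in: suppose the ST-GCN backbone plus STCP emit a 256-d vector
feat = rng.standard_normal(256)

# S4: fully connected layer + Softmax over a hypothetical 26 action classes
W, b = 0.01 * rng.standard_normal((26, 256)), np.zeros(26)
logits = W @ feat + b
probs = np.exp(logits - logits.max())
probs /= probs.sum()                                  # softmax output
print(x.shape, probs.shape)
```

The softmax output sums to one, so the largest entry can be read directly as the predicted double interaction category.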
The beneficial effects produced by adopting the above technical scheme are as follows: in the space-time feature extraction process, the method combines local information with global information and captures fine, important joint details, which improves the accuracy of the reference model in recognizing double interaction actions; moreover, the proposed modules are highly embeddable and can be conveniently embedded into other network models.
Drawings
The invention will be described in further detail with reference to the drawings and the detailed description.
FIG. 1 is a flow chart of a method according to an embodiment of the invention;
FIG. 2 is a schematic block diagram of a Transformer-based spatial feature extraction module in a method according to an embodiment of the present invention;
FIG. 3 is a schematic block diagram of a multi-scale location-aware temporal feature extraction module according to an embodiment of the present invention;
fig. 4 is a schematic block diagram of an STCP attention module in the method according to an embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are only some, but not all embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention, but the present invention may be practiced in other ways other than those described herein, and persons skilled in the art will readily appreciate that the present invention is not limited to the specific embodiments disclosed below.
As shown in fig. 1, the embodiment of the invention discloses a double interaction space-time feature extraction method, which comprises the following steps:
S1: preprocessing skeleton data of a dataset, and extracting double interaction action categories to obtain action tensors;
S2: extracting double interaction space-time features through a space-time graph convolution network, and capturing global and local information;
the space-time feature extraction comprises extracting double interaction spatial features through a Transformer-based spatial graph convolution and extracting double interaction temporal features through a multi-scale position-aware temporal graph convolution, so that deep extraction of the space-time features is realized and both local and global information is captured.
As shown in fig. 2, the embodiment of the invention further discloses a Transformer-based spatial graph convolution module for extracting the spatial features of the bone joint points. Firstly, a 1×1 convolution is applied to the input skeleton graph to introduce more nonlinear factors, a preliminary double interaction spatial feature extraction is performed on the input vector using a lightweight spatial graph convolution, and the feature vector is then normalized by a Batch Normalization layer. The process is defined as:

f_out = Σ_{d=0..2} Λ_d^(-1/2) A_d Λ_d^(-1/2) f_in W_d (1)

F_out = BN(f_out) (2)

where Λ_d normalizes the adjacency matrix A_d, f_in and f_out represent the input and output features, d represents the graph distance metric function with a maximum of 2, W_d is the learnable weight of the d-th partition, and BN represents the batch normalization layer;
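As an illustrative sketch of equations (1)-(2), the normalized multi-hop graph convolution can be written in NumPy as follows; the 5-joint chain graph, the channel sizes, and the random weights W_d are toy assumptions, and batch normalization is approximated by per-channel standardization:

```python
import numpy as np

def norm_adj(A):
    # Lambda^(-1/2) A Lambda^(-1/2): symmetric degree normalization
    deg = A.sum(axis=1)
    d = np.where(deg > 0, deg ** -0.5, 0.0)
    return d[:, None] * A * d[None, :]

def spatial_graph_conv(f_in, adjs, weights):
    # eq. (1): f_out = sum over d of norm(A_d) f_in W_d, graph distance d = 0, 1, 2
    return sum(norm_adj(A) @ f_in @ W for A, W in zip(adjs, weights))

rng = np.random.default_rng(1)
V, C_in, C_out = 5, 3, 8                       # toy sizes: 5 joints, 3 -> 8 channels
A0 = np.eye(V)                                 # d = 0: self-connections
A1 = np.zeros((V, V))
A1[np.arange(V - 1), np.arange(1, V)] = 1.0
A1 += A1.T                                     # d = 1: a simple chain skeleton
A2 = ((np.linalg.matrix_power(A1, 2) > 0) & (A0 == 0) & (A1 == 0)).astype(float)  # d = 2

f_in = rng.standard_normal((V, C_in))
Ws = [rng.standard_normal((C_in, C_out)) for _ in range(3)]
f_out = spatial_graph_conv(f_in, [A0, A1, A2], Ws)

# eq. (2) stand-in: batch normalization approximated by per-channel standardization
F_out = (f_out - f_out.mean(axis=0)) / (f_out.std(axis=0) + 1e-5)
print(f_out.shape)
```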
the encoder is a transducer encoder which enters a transducer immediately, the number of layers of the encoder is defined as 2, the module consists of a multi-head attention mechanism, a feedforward neural network, a normalization layer and residual error connection, the most core of the encoder part is the multi-head attention module, a single sub-attention mechanism can be split into a plurality of subspaces, the sub-attention mechanism is executed on each subspace, so that the characteristics of different layers and different angles can be captured better, global modeling can be carried out on the spatial characteristics, and different weights can be distributed to the characteristic diagram in a self-adaptive mode; the feed-forward neural network sublayer consists of two linear transforms and an activation function, wherein the first linear transform converts the input vector into an intermediate representation vector and the second linear transform converts the intermediate representation vector into a final representation vector; the residual connection and normalization layer is used for accelerating model convergence and improving model expression capacity, the residual connection can enable the model to be trained more easily, gradient disappearance and gradient explosion are avoided, the normalization layer can accelerate model convergence, and meanwhile robustness and generalization capacity of the model are improved. In addition, an additional residual error connection is adopted for the whole module, so that the over fitting of the model is prevented, the network parameter number is reduced, and the time complexity is reduced. The process is defined as:
(3)
(4)
(5)
(6)
wherein:for inputting information +.>For content information->For the information itself +.>The attention moment array is converted into standard state distribution, < >>Normalization is achieved. />The input to each layer of neurons is translated into a mean variance,representing the input vector +.>Representing input vector +.>Representing the final characteristics of the output through the encoder,
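Equations (3)-(6) can be sketched as a single-head NumPy encoder layer; the multi-head split and the learned Q/K/V projections of the actual encoder are omitted for brevity, and the widths and weights are toy assumptions:

```python
import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)

def attention(Q, K, V):
    # eq. (3): softmax(Q K^T / sqrt(d_k)) V
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

def encoder_layer(X, W1, b1, W2, b2):
    X = layer_norm(X + attention(X, X, X))           # eq. (4), single-head stand-in
    ffn = np.maximum(0.0, X @ W1 + b1) @ W2 + b2     # eq. (5): ReLU between two linear maps
    return layer_norm(X + ffn)                       # residual + normalization

rng = np.random.default_rng(2)
N, d, d_ff = 25, 16, 32                              # 25 joint tokens, toy widths
X = rng.standard_normal((N, d))
W1, b1 = 0.1 * rng.standard_normal((d, d_ff)), np.zeros(d_ff)
W2, b2 = 0.1 * rng.standard_normal((d_ff, d)), np.zeros(d)

f_tran = encoder_layer(X, W1, b1, W2, b2)
Y = X + f_tran                                       # eq. (6): outer residual connection
print(Y.shape)
```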
as shown in fig. 3, the embodiment of the invention also discloses a convolution module based on the multi-scale position sensing map for extracting the time characteristics of the bone joint points. Firstly, 4 parallel time convolution branches are carried out on the extracted space feature diagram, and each branch starts with convolution of 1 multiplied by 1; then pass through batch layer and ReLU activation function pairNormalizing the feature map; the first two branches are then convolved with 2 3 x 1 times and 2 different syndromes are applied to fuse features between different channels to obtain a multi-scale time-receptive field; while the third branch extracts the most significant feature information in successive frames through a 3 x 1 max pooling layer; the last branch contains a residual connection to maintain the gradient during back propagation; the four branches are subjected to multi-scale feature fusion through product operation; residual connection is added to the outer layer of the multi-scale convolution to help the network to converge rapidly, and the multi-scale convolution is combined with the multi-scale features through weighted summation operation; taking as input a multi-scale temporal feature, using a pooled convolution kernel of two spatial dimensionsOr->Each channel is encoded along a horizontal coordinate and a vertical coordinate, respectively, to generate a pair of feature maps with direction sensing capability. The process is defined as:
(7)
(8)
in the method, in the process of the invention,and->Represents->The height in the individual channels is +.>Width is +.>Output of->Indicate->Characteristic tensor of the channel.
The generated fusion feature maps are concatenated and transmitted to a shared 1×1 convolution transform function F_1. The process is defined as:

f = δ(F_1([z^h, z^w])) (9)

where [·,·] represents concatenation along the spatial dimension, δ represents a nonlinear activation function, and f represents the intermediate feature map encoding spatial information in the horizontal and vertical directions.
f is then divided along the spatial dimension into two independent tensors f^h and f^w, which are transformed through two 1×1 convolutions F_h and F_w into tensors with the same number of channels as the input tensor X. The process is defined as:

g^h = σ(F_h(f^h)) (10)

g^w = σ(F_w(f^w)) (11)

where σ represents the sigmoid activation function.
The outputs g^h and g^w are used as attention weights, and the coordinate attention block finally outputs Y. The process is defined as:

y_c(i, j) = x_c(i, j) × g_c^h(i) × g_c^w(j) (12)
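The coordinate-attention computation of equations (7)-(12) can be sketched in NumPy as follows; for brevity the 1×1 convolutions F_1, F_h, and F_w are replaced by identity maps and δ by ReLU, so only the pooling, split, and gating structure is illustrated:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def coordinate_attention(x):
    # x: (C, H, W) with H = frames and W = joints
    C, H, W = x.shape
    z_h = x.mean(axis=2)                             # eq. (7): pool along width  -> (C, H)
    z_w = x.mean(axis=1)                             # eq. (8): pool along height -> (C, W)
    f = np.maximum(0.0, np.concatenate([z_h, z_w], axis=1))  # eq. (9), delta = ReLU, F1 = identity
    f_h, f_w = f[:, :H], f[:, H:]                    # split back along the spatial dimension
    g_h, g_w = sigmoid(f_h), sigmoid(f_w)            # eqs. (10)-(11), Fh = Fw = identity
    return x * g_h[:, :, None] * g_w[:, None, :]     # eq. (12): direction-aware reweighting

rng = np.random.default_rng(3)
x = rng.standard_normal((8, 64, 25))                 # toy (channels, frames, joints) tensor
y = coordinate_attention(x)
print(y.shape)
```

Since both gates lie in (0, 1), each output entry is an attenuated copy of the corresponding input entry.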
s3: the feature fusion processing is carried out on the feature tensor by the STCP module based on three-branch pooling, so that the joint with the most abundant information in the specific frame is distinguished from the whole time frame sequence, and the fine-granularity double interaction space-time feature tensor is obtained. Firstly, carrying out average pooling operation on input features on a frame level and an articular level respectively, and carrying out local average pooling and local partial pooling on the feature vectors subjected to time dimension pooling to obtain the articulation point data with different importance corresponding to double interaction actions. The process is defined as:
(13)
(14)
(15)
in the middle ofIndicated are pooling operations of the corresponding dimensions.
The space-time dimension feature vectors are then pooled along the channel dimension as input, after which the three branch feature vectors are combined and concatenated together, and the information is compressed through a fully connected layer. The process is defined as:

f_c = pool_p(f_t) ⊙ pool_v(f_in) (16)

f̃ = θ(W[f_p, f_v, f_c]) (17)

where ⊙ represents the dot product, [·,·] represents the concatenation operation, θ represents the HardSwish activation function, and W represents a trainable parameter;
And then, three independent fully connected layers are used to obtain the attention scores of the time-frame dimension, the joint dimension, and the channel dimension, and finally the three are multiplied to obtain the space-time-channel local attention map as the attention score of the whole action sequence:

Att = σ(FC_t(f̃)) ⊗ σ(FC_v(f̃)) ⊗ φ(FC_c(f̃)) (18)

where f̃ is the fused feature vector, FC_t, FC_v, and FC_c are the three fully connected layers, σ represents the sigmoid activation function, and φ represents the Swish activation function.
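The three-branch STCP attention of equations (13)-(18) can be loosely sketched in NumPy; the fully connected layers are replaced by identity stand-ins and the local partial pooling by a plain mean, so only the pooling and broadcasting structure is shown:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def swish(z):
    return z * sigmoid(z)

def stcp_attention(f_in):
    # f_in: (C, T, V). Only the pooling/broadcast structure of eqs. (13)-(18)
    # is shown; the fully connected layers are identity stand-ins.
    f_t = f_in.mean(axis=1)                          # eq. (13): pool frames -> (C, V)
    f_v = f_in.mean(axis=2)                          # eq. (14): pool joints -> (C, T)
    f_p = f_t.mean(axis=1)                           # eq. (15) stand-in: pooled f_t -> (C,)
    f_c = f_p * f_v.mean(axis=1)                     # eq. (16) stand-in: channel branch -> (C,)
    a_t = sigmoid(f_v)                               # frame-dimension attention scores
    a_v = sigmoid(f_t)                               # joint-dimension attention scores
    a_c = swish(f_c)                                 # channel-dimension attention scores
    att = a_c[:, None, None] * a_t[:, :, None] * a_v[:, None, :]   # eq. (18)
    return f_in * att

rng = np.random.default_rng(4)
f = rng.standard_normal((8, 64, 25))
out = stcp_attention(f)
print(out.shape)
```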
S4: and the finally obtained characteristic tensor helps the network to converge through the full connection layer and the Softmax layer so as to output the double interaction category.
In order to extract the double interaction spatial features more effectively, the method adds a Transformer to the backbone network: after the spatial graph convolution performs primary feature extraction, the spatial feature vectors are extracted again through the Transformer feature extractor, capturing important joint information that would otherwise be lost, so that the spatial graph convolution part of the backbone network fully retains the detail information; residual connections are added inside, which greatly shortens the model training time. The model therefore outperforms other networks in the spatial feature extraction part.
In order to enlarge the temporal receptive field and address the long-term dependence problem, the method introduces multi-scale convolution to obtain multi-scale information; meanwhile, to enhance the sensitivity of the network model to informative channels and improve the position-awareness capability, a position-aware attention module is added.
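The four-branch multi-scale temporal convolution described above can be sketched structurally in NumPy; the shared 3-tap kernel, the dilation rates 1 and 2, and the sum fusion are simplifying assumptions rather than the patented layer itself:

```python
import numpy as np

def temporal_conv(x, k, dilation=1):
    # depthwise temporal convolution along axis 1 of x (C, T, V), zero-padded to keep T
    C, T, V = x.shape
    r = (len(k) - 1) * dilation // 2
    xp = np.pad(x, ((0, 0), (r, r), (0, 0)))
    out = np.zeros_like(x)
    for i, w in enumerate(k):
        out += w * xp[:, i * dilation : i * dilation + T, :]
    return out

def multi_scale_temporal(x):
    # four parallel branches fused by summation -- a structural sketch only
    k = np.array([0.25, 0.5, 0.25])                  # toy shared 3-tap kernel
    b1 = temporal_conv(x, k, dilation=1)             # 3x1 conv, small receptive field
    b2 = temporal_conv(x, k, dilation=2)             # 3x1 conv, dilated receptive field
    C, T, V = x.shape
    xp = np.pad(x, ((0, 0), (1, 1), (0, 0)), constant_values=-np.inf)
    b3 = np.max(np.stack([xp[:, i:i + T, :] for i in range(3)]), axis=0)  # 3x1 max pooling
    b4 = x                                           # residual branch keeps the gradient path
    return b1 + b2 + b3 + b4

rng = np.random.default_rng(5)
x = rng.standard_normal((4, 16, 25))                 # toy (channels, frames, joints)
y = multi_scale_temporal(x)
print(y.shape)
```

Dilation widens the temporal receptive field without adding kernel taps, which is the point of using two different dilation rates in the first two branches.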
After constructing the double interaction space-time feature extraction method based on the Transformer and multi-scale position awareness, the method considers the importance of different body parts across the whole action sequence and the importance of time frames and channels to the weighted bone joints at different action stages, and accordingly designs the STCP module. The module is divided into a three-branch structure over the time, space, and channel dimensions, performing pooling operations in space and time respectively; in the time dimension, joint-point data of different importance corresponding to the double interaction actions is obtained through partial segmentation and partial pooling. A channel-dimension pooling operation is then applied to both; the obtained feature vectors are concatenated, and the attention scores on the spatial local joints, time, and channels are obtained through three fully connected layers; finally the three are multiplied to obtain the space-time-channel local attention map.
In summary, in the space-time feature extraction process the method combines local information with global information and captures fine, important joint details, improving the accuracy of the reference model in recognizing double interaction actions; the proposed modules are highly embeddable and can be conveniently embedded into other network models.
Claims (1)
1. A double interaction space-time feature extraction method, characterized by comprising the following steps:
S1: preprocessing skeleton data of a dataset, and extracting double interaction action categories to obtain action tensors;
S2: extracting double interaction space-time features through a space-time graph convolution network, and capturing global and local information;
S3: performing feature fusion processing on the feature tensor through an STCP module based on three-branch pooling to obtain a fine-grained double interaction space-time feature tensor;
S4: passing the finally obtained feature tensor through the fully connected layer and the Softmax layer to help the network converge and output the double interaction action category;
the extracting the double interaction space-time characteristics through the space-time diagram convolution network comprises the following steps:
double interaction spatial features are extracted through a Transformer-based spatial graph convolution, and double interaction temporal features are extracted through a multi-scale position-aware temporal graph convolution, so that deep extraction of the space-time features is achieved and both local and global information is captured;
the method for extracting the double interaction space features comprises the following steps:
firstly, carrying out 1×1 convolution on the input skeleton graph to introduce more nonlinear factors, carrying out preliminary double interaction spatial feature extraction on the input vector by using a lightweight spatial graph convolution, and then normalizing the feature vector through a Batch Normalization layer, wherein the process is defined as:

f_out = Σ_{d=0..2} Λ_d^(-1/2) A_d Λ_d^(-1/2) f_in W_d (1)

F_out = BN(f_out) (2)

wherein Λ_d normalizes the adjacency matrix A_d, f_in and f_out represent the input and output features, d represents the graph distance metric function with a maximum of 2, W_d is the learnable weight of the d-th partition, and BN represents the batch normalization layer;

the feature then enters a Transformer encoder, the number of encoder layers being defined as 2, wherein the encoder comprises a multi-head attention mechanism, a feed-forward neural network, a normalization layer, and residual connections;

an additional residual connection is adopted around the whole encoder to prevent the model from overfitting, reduce the number of network parameters, and lower the time complexity, the process being defined as:

Attention(Q, K, V) = softmax(QK^T / √d_k) V (3)

X_Add = LayerNorm(X + MultiHeadAttention(X)) (4)

FFN(Z) = max(0, ZW_1 + b_1)W_2 + b_2 (5)

Y = add(f_in, f_tran) (6)

wherein: Q is the input information, K is the content information, V is the information itself, √d_k scales the attention matrix toward a standard distribution, and softmax realizes the normalization; LayerNorm normalizes the input of each layer of neurons to a given mean and variance, Z represents the input vector of the feed-forward sublayer, f_in represents the input vector, and f_tran represents the double interaction space-time features output by the encoder;
the extraction method of the double interaction temporal features comprises the following steps:

firstly, the extracted spatial feature map is passed through 4 parallel temporal convolution branches, each branch beginning with a 1×1 convolution; the feature map is then normalized through a batch layer and a ReLU activation function; the first two branches then apply two 3×1 convolutions with 2 different dilation rates to fuse features between different channels and obtain a multi-scale temporal receptive field; the third branch extracts the most significant feature information in successive frames through a 3×1 max pooling layer; the last branch contains a residual connection to maintain the gradient during back propagation; the four branches undergo multi-scale feature fusion through a product operation; a residual connection is added to the outer layer of the multi-scale convolution to help the network converge rapidly and is combined with the multi-scale features through a weighted summation operation; taking the multi-scale temporal feature as input, pooling kernels of the two spatial dimensions (H, 1) and (1, W) encode each channel along the horizontal and vertical coordinates respectively to generate a pair of direction-aware feature maps, the process being defined as:

z_c^h(h) = (1/W) Σ_{0≤i<W} x_c(h, i) (7)

z_c^w(w) = (1/H) Σ_{0≤j<H} x_c(j, w) (8)

wherein z_c^h(h) and z_c^w(w) represent the output of the c-th channel at height h and at width w respectively, and x_c represents the feature tensor of the c-th channel;

the generated fusion feature maps are concatenated and transmitted to a shared 1×1 convolution transform function F_1, the process being defined as:

f = δ(F_1([z^h, z^w])) (9)

wherein [·,·] represents concatenation along the spatial dimension, δ represents a nonlinear activation function, and f represents the intermediate feature map encoding spatial information in the horizontal and vertical directions;

f is then divided along the spatial dimension into two independent tensors f^h and f^w, which are transformed through two 1×1 convolutions F_h and F_w into tensors with the same number of channels as the input tensor X, the process being defined as:

g^h = σ(F_h(f^h)) (10)

g^w = σ(F_w(f^w)) (11)

wherein σ represents the sigmoid activation function;

the outputs g^h and g^w are used as the attention weights, and the coordinate attention block finally outputs Y, the process being defined as:

y_c(i, j) = x_c(i, j) × g_c^h(i) × g_c^w(j) (12)
in the step S3:
firstly, carrying out average pooling operation on input features on a frame level and an articular level respectively, and carrying out local average pooling and local partial pooling on feature vectors subjected to time dimension pooling to obtain articular point data with different importance corresponding to double interaction actions, wherein the process is defined as follows:
f t =pool t (f in ) (13)
f v =pool v (f in ) (14)
f p =pool p (f t ) (15)
wherein pool represents pooling operation of corresponding dimension;
Then, the pooled space-time feature vectors are taken as input and pooled along the channel dimension, the three branch feature vectors are concatenated together, and the information is compressed through a fully connected layer; the process is defined as:

f_c = θ(W[pool_p(f_t) ⊙ pool_v(f_in) ⊙ pool_c(f_in)]) (16)

where ⊙ denotes the concatenation operation, θ denotes the HardSwish activation function, and W denotes a trainable parameter matrix;
The attention scores of the time-frame dimension, the joint dimension and the channel dimension are obtained by three independent fully connected layers, where σ denotes the sigmoid activation function and φ denotes the Swish activation function; finally the three scores are multiplied to obtain the spatio-temporal channel local attention map, which serves as the attention score of the whole action sequence.
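The final combination step can be sketched as follows; this is a minimal sketch in which the weight matrices W_t, W_v, W_c are hypothetical names for the three fully connected layers, and the assignment of σ to the frame and joint branches and φ to the channel branch is an assumption based on the symbols defined in the text:

```python
import math

# Minimal sketch of the spatio-temporal channel local attention map: three
# independent fully connected layers map the compressed feature f_c to
# frame-, joint- and channel-wise scores, which are broadcast-multiplied
# into a T x V x C attention map for the whole action sequence.

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def swish(x):
    """phi in the text: x * sigmoid(x)."""
    return x * sigmoid(x)

def fc(vec, weights):
    """One fully connected layer without bias: weights is rows x len(vec)."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]

def st_channel_attention(f_c, W_t, W_v, W_c):
    a_t = [sigmoid(s) for s in fc(f_c, W_t)]  # frame scores, length T
    a_v = [sigmoid(s) for s in fc(f_c, W_v)]  # joint scores, length V
    a_c = [swish(s) for s in fc(f_c, W_c)]    # channel scores, length C
    # outer product of the three score vectors -> T x V x C attention map
    return [[[t * v * c for c in a_c] for v in a_v] for t in a_t]
```

Multiplying the three one-dimensional score vectors lets every (frame, joint, channel) cell receive a joint weight without materializing a dense T × V × C parameter tensor.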
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310741806.5A CN116665308B (en) | 2023-06-21 | 2023-06-21 | Double interaction space-time feature extraction method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN116665308A CN116665308A (en) | 2023-08-29 |
CN116665308B true CN116665308B (en) | 2024-01-23 |
Family
ID=87727903
Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104011723A (en) * | 2011-12-15 | 2014-08-27 | Micron Technology, Inc. | Boolean logic in a state machine lattice |
CN111680606A (en) * | 2020-06-03 | 2020-09-18 | Hydrology Bureau of the Huaihe River Commission (Information Center) | Low-power-consumption water level remote measuring system based on artificial intelligence cloud identification water gauge |
CN111950540A (en) * | 2020-07-24 | 2020-11-17 | Zhejiang Normal University | Knowledge point extraction method, system, device and medium based on deep learning |
CN112560712A (en) * | 2020-12-18 | 2021-03-26 | Xidian University | Behavior identification method, device and medium based on time-enhanced graph convolutional network |
CN112906545A (en) * | 2021-02-07 | 2021-06-04 | Institute of Intelligent Manufacturing, Guangdong Academy of Sciences | Real-time action recognition method and system for multi-person scene |
CN113657349A (en) * | 2021-09-01 | 2021-11-16 | Chongqing University of Posts and Telecommunications | Human body behavior identification method based on multi-scale space-time graph convolutional neural network |
CN114694174A (en) * | 2022-03-02 | 2022-07-01 | Beijing University of Posts and Telecommunications | Human body interaction behavior identification method based on space-time graph convolution |
CN114882421A (en) * | 2022-06-01 | 2022-08-09 | Jiangnan University | Method for recognizing skeleton behavior based on space-time feature enhancement graph convolutional network |
WO2023024438A1 (en) * | 2021-08-24 | 2023-03-02 | Shanghai SenseTime Intelligent Technology Co., Ltd. | Behavior recognition method and apparatus, electronic device, and storage medium |
CN115841697A (en) * | 2022-09-19 | 2023-03-24 | Shanghai University | Motion recognition method based on skeleton and image data fusion |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||