CN116385945B - Video interaction action detection method and system based on random frame complement and attention - Google Patents

Video interaction action detection method and system based on random frame complement and attention

Info

Publication number
CN116385945B
CN116385945B CN202310657865.4A
Authority
CN
China
Prior art keywords
attention
feature
video
features
pyramid
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310657865.4A
Other languages
Chinese (zh)
Other versions
CN116385945A (en)
Inventor
高文杰
高赞
周冕
赵一博
卓涛
李志慧
程志勇
李传森
刘冬冬
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong Zhonglian Audio Visual Information Technology Co ltd
Original Assignee
Shandong Zhonglian Audio Visual Information Technology Co ltd
Tianjin University of Technology
Shandong Institute of Artificial Intelligence
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong Zhonglian Audio Visual Information Technology Co ltd, Tianjin University of Technology, Shandong Institute of Artificial Intelligence filed Critical Shandong Zhonglian Audio Visual Information Technology Co ltd
Priority to CN202310657865.4A priority Critical patent/CN116385945B/en
Publication of CN116385945A publication Critical patent/CN116385945A/en
Application granted granted Critical
Publication of CN116385945B publication Critical patent/CN116385945B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G06N3/0442 Recurrent networks, e.g. Hopfield networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/0464 Convolutional networks [CNN, ConvNet]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774 Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The invention belongs to the technical field of computer vision and pattern recognition, and particularly relates to a video interaction action detection method and system based on random frame complement and attention. The method comprises the following steps: (1) selection of a feature extraction network; (2) self-attention global information modeling; (3) random frame complement data enhancement; (4) generation of pyramid features; (5) boundary localization and classification. The invention can simultaneously aggregate global temporal information and multi-scale local temporal information, and performs efficient action localization through the generated pyramid features. The method uses random-frame-based frame complement for data enhancement and, through the combination of LSTM and Transformer, solves the problem that a single model performs differently on datasets of different sizes, so as to obtain more accurate action localization and classification results.

Description

Video interaction action detection method and system based on random frame complement and attention
Technical Field
The invention belongs to the technical field of computer vision and pattern recognition, and particularly relates to a video interaction action detection method and system based on random frame complement and attention.
Background
In recent years, with the rapid development of deep learning, many scholars have proposed temporal action localization methods based on deep learning techniques. Identifying action instances in time and recognizing their categories, i.e., temporal action localization (TAL), remains a challenging problem in video understanding. Significant progress has been made in the development of deep models for TAL. Most previous work relies on action proposals [BMN] or anchor windows [GTAN], and develops convolutional neural networks [CDC, SSN], recurrent neural networks [SS-TAD] and graph neural networks [BC-GNN, G-TAD] for TAL. Despite steady progress on the main benchmarks, the accuracy of existing methods usually comes at the cost of modeling complexity, including increasingly complex proposal generation, anchor design, loss functions, network architectures and output decoding processes. Meanwhile, because action boundaries in videos are not clear, existing methods often suffer from inaccurate boundary prediction.
Some solutions to the temporal action localization problem have been given by previously proposed methods, but problems remain. Anchor-based methods require strong prior knowledge, and the number of anchors defined for each dataset also differs, which affects the final result. Although action-guided methods can achieve good results, they are too computationally intensive. An anchor-free approach can therefore be a good solution.
Disclosure of Invention
The invention aims to solve the problem of temporal action localization: previous temporal action localization methods either require strong prior knowledge about the dataset or involve a large amount of computation. The invention provides a video interaction action detection method and system based on random frame complement and attention to address the problem that temporal action localization methods need strong prior knowledge or are computationally expensive.
The technical scheme for solving the technical problems is as follows:
a video interaction detection method based on random frame complement and attention comprises the following steps:
step 10, selection of a feature extraction network
Selecting an I3D network pre-trained on the Kinetics dataset to extract features: 16 consecutive frames are taken as the I3D input, 1024-D features are extracted before the last fully connected layer using a sliding window with a step size of 4, and the two-stream features are further concatenated (2048-D) as the model input;
step 20, self-attention global information modeling
Modeling global temporal information on the output of the I3D network selected as the base network in step 10; Polarized Self-Attention is used to find the relations between frames and weight them, and through this self-attention-based weighting strategy more important frames can be found and given higher weight;
the 1D volume is added before the transducer network, so that the training of the local context information and the stable visual transducer can be better combined, and the modeling of the global information is realized;
step 30 random frame complement data enhancement
At the output of the feature network of step 10, a video is divided into T/k segments; one frame is randomly taken from each segment, and the remaining k-1 frames are replaced by copies of the taken frame, forming a new feature vector with larger variation, which is equivalent to accelerating the video while the actual position of the action remains unchanged;
calculating an mse loss by using the new feature vector passing through the backup and the original video feature vector, restraining the new feature vector and the original video feature vector, and making the new feature vector and the original video feature vector be pulled close to learn some information mutually so as to achieve the aim of data enhancement;
step 40, pyramid feature generation
On the basis of the network of step 20, the features passing through the multi-scale information aggregation module are encoded into a 6-layer feature pyramid by a multi-scale Transformer, and LSTM and Transformer are combined so that the complementary history information provided by the LSTM and the attention-based representation provided by the Transformer module are fused; this improves the model capacity and solves the problem that a single model performs differently on datasets of different sizes, since LSTM performs better than the Transformer on small datasets while the Transformer performs very prominently after pre-training;
step 50. Boundary locating and classifying
After obtaining the pyramid features of 6 scales, the pyramid features of each scale are respectively input into different 1D convolutions to obtain localization and classification features; classification is then performed with the classification features and boundary regression is performed with the localization features; a focal loss is used as the constraint during classification training, and a GIoU loss is used as the constraint during regression training.
Based on the above video interaction detection method based on random frame complement and attention, the formulas in step 30 are as follows:

Original video feature vector: $X = \{x_1, x_2, \ldots, x_T\} \in \mathbb{R}^{T \times D}$;

X is divided into T/k segments: $X = \{S_1, S_2, \ldots, S_{T/k}\}$, where each $S_i$ contains k frames;

A frame is randomly taken from each segment and replicated k times: $S_i' = \mathrm{copy}_k(\mathrm{rand}(S_i))$, where $\mathrm{rand}(\cdot)$ denotes taking a random frame and $\mathrm{copy}_k(\cdot)$ denotes copying it k times;

$L_{mse} = \alpha \cdot \mathrm{MSE}\big(f(X), f(X')\big)$, where $f(X)$ and $f(X')$ denote the new feature vectors obtained after the vectors X and X' pass through the backbone network, and MSE is the mean square loss function.
Based on the video interaction detection method based on random frame complement and attention, the extracted features are operated on by a Channel-only branch and a Spatial-only branch in Polarized Self-Attention, wherein the Channel-only branch is defined as follows:

$A^{ch}(X) = F_{SG}\big[ W_z\big( \sigma_1(W_v(X)) \times F_{SM}(\sigma_2(W_q(X))) \big) \big]$

where $W_q$, $W_v$ and $W_z$ are 1×1 convolutional layers, $\sigma_1$ and $\sigma_2$ are reshape operators, that is, the feature dimension is changed from C/2×H×W to C/2×HW, $F_{SM}$ is the Softmax operator, × is the matrix dot-product operation, $F_{SG}$ is the Sigmoid function, and the number of internal channels between $W_q$, $W_v$ and $W_z$ is C/2; the output of the channel branch is $Z^{ch} = A^{ch}(X) \odot^{ch} X$, where $\odot^{ch}$ is the channel multiplication operator;

the Spatial-only branch is defined as follows:

$A^{sp}(X) = F_{SG}\big[ \sigma_3\big( F_{SM}\big(\sigma_1(F_{GP}(W_q(X)))\big) \times \sigma_2(W_v(X)) \big) \big]$

where $W_q$ and $W_v$ are standard 1×1 convolutions, $\sigma_1$, $\sigma_2$ and $\sigma_3$ are three reshape operators, $F_{SM}$ is the Softmax operator, $F_{GP}$ is the global pooling operator, and $F_{SG}$ is the Sigmoid function; the output of the spatial branch is $Z^{sp} = A^{sp}(X) \odot^{sp} X$, where $\odot^{sp}$ is the spatial multiplication operator;

the outputs of the channel branch and the spatial branch are composed in parallel layout: $\mathrm{PSA}_p(X) = Z^{ch} + Z^{sp}$.

Based on the above video interaction action detection method based on random frame complement and attention, each video loss is defined as follows:

$L = \sum_t \frac{ L_{cls} + \lambda_{reg}\, \mathbb{1}_{c_t}\, L_{reg} }{ T_+ }$

where T is the length of the input sequence, $\mathbb{1}_{c_t}$ is an indicator function indicating whether time step t is within an action, i.e., a positive sample, $T_+$ is the total number of positive samples, the loss is applied to all levels of the output pyramid and averaged over all video samples during training, $\lambda_{reg}$ is a coefficient balancing the classification loss and the regression loss, $L_{cls}$ is the focal loss for classification, and $L_{reg}$ is the GIoU loss for distance regression.
Based on the video interaction detection method based on random frame complement and attention, the pyramid features are obtained using 6 Transformer layers, each consisting of alternating LSTM, local multi-head self-attention (MSA) and MLP blocks; LayerNorm (LN) is applied before each MSA or MLP, and a residual connection is added after each block; the channel MLP has two linear layers with a GELU activation in between; the downsampling operation is implemented with a strided depth-separable 1D convolution, with the model using a downsampling ratio of 2. The specific formulas are as follows:

$\bar{Z}^{\ell} = \alpha^{\ell}\,\mathrm{MSA}\big(\mathrm{LN}(\mathrm{LSTM}(Z^{\ell-1}))\big)\!\downarrow +\ Z^{\ell-1}\!\downarrow$
$Z^{\ell} = \bar{\alpha}^{\ell}\,\mathrm{MLP}\big(\mathrm{LN}(\bar{Z}^{\ell})\big) + \bar{Z}^{\ell}$

where $\alpha^{\ell}$ and $\bar{\alpha}^{\ell}$ are learnable per-channel scaling factors initialized to 0, and $\downarrow$ denotes the downsampling operation whose downsampling ratio is 2.
The embodiment of the invention also provides a video interaction detection system based on random frame complement and attention, which comprises a feature extraction module for extracting global temporal information; a temporal self-attention module for modeling the global temporal information to obtain features containing multi-scale local information; a random frame complement data enhancement module for making the actions and boundaries of the original video clear; a pyramid feature generation module for encoding the features of the multi-scale local information into a 6-layer feature pyramid through a multi-scale Transformer and combining LSTM with the Transformer; and a classification module for inputting the pyramid features of each scale into different 1D convolutions to obtain localization and classification features.
In an embodiment of the present invention, there is also provided a computer-readable storage medium storing a computer program, where the video interaction detection method is implemented when the computer program is executed by a processor.
In an embodiment of the present invention, there is also provided a computing device including: at least one processor; at least one memory storing a computer program which, when executed by the at least one processor, implements the video interaction detection method.
The effects provided in the summary of the invention are merely effects of embodiments, not all effects of the invention, and the above technical solution has the following advantages or beneficial effects:
1) Modeling of global information can be achieved by looking for more important frames and giving higher weight through a self-attention mechanism.
2) The original video features are subjected to random frame complement, so that the video exhibits larger variation, thereby achieving data enhancement.
3) By combining LSTM and Transformer, the model capability is improved and the problem that a single model performs differently on datasets of different sizes is solved.
Drawings
The accompanying drawings are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification; they illustrate the invention and, together with the embodiments of the invention, serve to explain the invention.
Fig. 1 is a structural diagram of the present invention.
Detailed Description
In order to clearly illustrate the technical features of the present solution, the present invention will be described in detail below with reference to the following detailed description and the accompanying drawings.
Embodiment 1: as shown in Fig. 1, the operation flow chart of the video interaction detection method based on random frame complement and attention according to the present invention includes the following steps:
step 10, selection of a feature extraction network
In the temporal action localization task, an excellent feature extractor needs to be selected first to obtain robust features; due to the characteristics of this task, a feature extractor capable of capturing temporal information is needed. A two-stream I3D network is therefore adopted here for feature extraction. The input of the RGB stream is consecutive video frames, so temporal and spatial characteristics can be extracted at the same time; for the Flow stream, the input is consecutive optical-flow frames, so temporal information can be further extracted and modeled. An I3D network pre-trained on the Kinetics dataset is selected to extract features: 16 consecutive frames are taken as the I3D input, 1024-D features are extracted before the last fully connected layer using a sliding window with a step size of 4, and the two-stream features are further concatenated (2048-D) as the model input;
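For illustration, the following minimal PyTorch sketch shows the sliding-window extraction and two-stream concatenation described above; the callables `i3d_rgb` and `i3d_flow` are hypothetical stand-ins for the Kinetics-pretrained I3D trunks, and the tensor layout is an assumption.

```python
import torch

def extract_clip_features(rgb_frames, flow_frames, i3d_rgb, i3d_flow,
                          clip_len=16, stride=4):
    """Slide a 16-frame window with stride 4 over the video and concatenate
    the 1024-D RGB and 1024-D Flow features into 2048-D clip features.

    rgb_frames / flow_frames: tensors of shape (T, C, H, W).
    i3d_rgb / i3d_flow: hypothetical pretrained I3D trunks that map a
    (1, 16, C, H, W) clip to a (1, 1024) feature vector.
    """
    feats = []
    total = rgb_frames.shape[0]
    for start in range(0, total - clip_len + 1, stride):
        rgb_clip = rgb_frames[start:start + clip_len]     # (16, C, H, W)
        flow_clip = flow_frames[start:start + clip_len]
        with torch.no_grad():
            f_rgb = i3d_rgb(rgb_clip.unsqueeze(0))         # (1, 1024)
            f_flow = i3d_flow(flow_clip.unsqueeze(0))      # (1, 1024)
        feats.append(torch.cat([f_rgb, f_flow], dim=-1))   # (1, 2048)
    return torch.cat(feats, dim=0)                         # (T', 2048)
```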
step 20, self-attention global information modeling
Modeling global temporal information on the output of the I3D network selected as the base network in step 10; Polarized Self-Attention is used to find the relations between frames and weight them, and through this self-attention-based weighting strategy more important frames can be found and given higher weight;
the 1D volume is added before the transducer network, so that the training of the local context information and the stable visual transducer can be better combined, and the modeling of the global information is realized;
step 30 random frame complement data enhancement
Untrimmed videos contain background of irrelevant activities, which makes the action boundaries unclear; in order to expand the variation of the video and make the boundaries more obvious, random frame complement is proposed for data enhancement;

At the output of the feature network of step 10, a video is divided into T/k segments; one frame is randomly taken from each segment, and the remaining k-1 frames are replaced by copies of the taken frame, forming a new feature vector with larger variation, which is equivalent to accelerating the video while the actual position of the action remains unchanged;

Calculating an MSE loss between the new feature vector after passing through the backbone and the original video feature vector, constraining the two so that they are pulled close and learn some information from each other, thereby achieving the goal of data enhancement;
step 40, pyramid feature generation
On the basis of the network of step 20, the features passing through the multi-scale information aggregation module are encoded into a 6-layer feature pyramid by a multi-scale Transformer, and LSTM and Transformer are combined so that the complementary history information provided by the LSTM and the attention-based representation provided by the Transformer module are fused; this improves the model capacity and solves the problem that a single model performs differently on datasets of different sizes, since LSTM performs better than the Transformer on small datasets while the Transformer performs very prominently after pre-training;
step 50. Boundary locating and classifying
After obtaining the pyramid features in step 40, the classification head examines each time step t of every pyramid layer and predicts the action probability p(c_t) at each time t. This head is implemented using a lightweight 1D convolutional network attached to each pyramid layer, with parameters shared across all levels; the classification network uses three 1D convolution layers with kernel size 3, layer normalization (for the first two layers) and ReLU activation; a sigmoid function is attached to each output dimension to predict the probability of C action categories. The regression head is similar to the classification head and examines each time step t of all L pyramid layers.

The difference is that the regression head predicts the distances to the onset and offset of the action, $(d_t^s, d_t^e)$, only when the current time step t lies within an action; each pyramid level is pre-assigned an output regression range. The regression head adopts the same design as the classification network, using a 1D convolutional network, but adds a ReLU at the end for distance estimation. For each time t, the model output includes the action category probability p(c_t) and the boundary distances $(d_t^s, d_t^e)$. The loss function likewise follows a very simple design, with only two terms: (1) $L_{cls}$, a focal loss for C-way classification; (2) $L_{reg}$, a GIoU loss for distance regression;
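A minimal sketch of such a shared prediction head follows; GroupNorm with a single group stands in for the layer normalization of the first two convolutions, and the channel width and class count are assumptions.

```python
import torch.nn as nn

class PredictionHead(nn.Module):
    """Shared-weight 1D conv head applied to every pyramid level: three conv
    layers of kernel size 3 with normalization on the first two; the
    classification variant ends in a sigmoid over the class dimension, the
    regression variant in a ReLU over the two boundary distances."""
    def __init__(self, dim=512, out_dim=20, regression=False):
        super().__init__()
        layers = []
        for _ in range(2):
            layers += [nn.Conv1d(dim, dim, 3, padding=1),
                       nn.GroupNorm(1, dim),       # layer-norm-like over channels
                       nn.ReLU(inplace=True)]
        layers += [nn.Conv1d(dim, out_dim, 3, padding=1)]
        self.net = nn.Sequential(*layers)
        self.out_act = nn.ReLU(inplace=True) if regression else nn.Sigmoid()

    def forward(self, pyramid):                    # list of (B, C, T_l) tensors
        return [self.out_act(self.net(z)) for z in pyramid]
```

Under these assumptions, a classification head would be instantiated as `PredictionHead(out_dim=C)` and a regression head as `PredictionHead(out_dim=2, regression=True)`, both applied to every pyramid level with shared weights.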
step 60 time sequence action positioning effect
On the THUMOS14 dataset, video features are extracted using the two-stream I3D pre-trained on Kinetics; 16 consecutive frames are taken as the I3D input, and 1024-D features are extracted before the last fully connected layer using a sliding window with a step size of 4; the two-stream features are further concatenated (2048-D) as the model input; mAP@0.3:0.1:0.7 is used to evaluate the model of the invention. 50 epochs are trained, with a linear warm-up of 5 epochs; the initial learning rate is 1e-4, with cosine learning-rate decay; the mini-batch size is 2 and the weight decay is 1e-4; based on ablation, the window size of local self-attention is 19; external classification scores from UntrimmedNet are also incorporated. For the ActivityNet1.3 dataset, feature extraction is likewise performed with the two-stream I3D, but the step size of the sliding window is increased to 16; the extracted features are downsampled to a fixed length of 128 by linear interpolation; for evaluation, mAP@0.5:0.05:0.95 is used and the average mAP is reported; the model is trained for 15 epochs, with a linear warm-up of 5 epochs; the learning rate is 1e-3, the mini-batch size is 16, and the weight decay is 1e-4; the window size of local self-attention is 25; furthermore, external classification results are combined; similarly, the invention adopts the pre-training approach from TSP and compares the model with the same set of baselines, including the closest competing single-stage models.
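As an illustration of the optimization schedule quoted above (50 epochs, 5 warm-up epochs, initial learning rate 1e-4, cosine decay, weight decay 1e-4 for THUMOS14), a sketch follows; the choice of AdamW as the optimizer is an assumption not stated in the text.

```python
import math
import torch

def make_optimizer_and_scheduler(model, epochs=50, warmup_epochs=5,
                                 base_lr=1e-4, weight_decay=1e-4):
    """Linear warm-up for the first `warmup_epochs`, then cosine decay to 0,
    stepping once per epoch (AdamW is an assumed optimizer choice)."""
    opt = torch.optim.AdamW(model.parameters(), lr=base_lr,
                            weight_decay=weight_decay)

    def lr_lambda(epoch):
        if epoch < warmup_epochs:
            return (epoch + 1) / warmup_epochs
        progress = (epoch - warmup_epochs) / max(1, epochs - warmup_epochs)
        return 0.5 * (1.0 + math.cos(math.pi * progress))

    sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda)
    return opt, sched
```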
In the test procedure, at inference time the complete sequence is input into the model, since no position embeddings are used in the model. The model takes the input video X and outputs $\{(p(c_t), d_t^s, d_t^e)\}$ at every time step t of all pyramid layers. Each time step t is further decoded into an action instance $(s_t, e_t, p(c_t))$, where $s_t = t - d_t^s$ and $e_t = t + d_t^e$ are the start and end of the action and $p(c_t)$ is the action confidence score. The resulting action candidates are further processed with Soft-NMS to remove highly overlapping instances, yielding the final action output.
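A minimal Gaussian Soft-NMS sketch over decoded (start, end) candidates is given below; the decay parameter sigma and the score threshold are assumed values, since the text only states that Soft-NMS is used to suppress highly overlapping instances.

```python
import math

def soft_nms(segments, scores, sigma=0.5, score_thresh=0.001):
    """Gaussian Soft-NMS: repeatedly keep the highest-scoring segment and
    decay the scores of the remaining segments by exp(-iou^2 / sigma)
    instead of discarding them outright."""
    segments = [list(s) for s in segments]
    scores = list(scores)
    keep = []
    while segments:
        best = max(range(len(scores)), key=scores.__getitem__)
        s, e = segments.pop(best)
        sc = scores.pop(best)
        if sc < score_thresh:
            break
        keep.append((s, e, sc))
        for i, (s2, e2) in enumerate(segments):
            inter = max(0.0, min(e, e2) - max(s, s2))
            union = (e - s) + (e2 - s2) - inter
            iou = inter / union if union > 0 else 0.0
            scores[i] *= math.exp(-(iou ** 2) / sigma)
    return keep
```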
The experimental results of the present invention are compared with other methods on the THUMOS14 and ActivityNet1.3 datasets as follows:

The invention achieves the best result on the THUMOS14 dataset, with an average mAP of 68.3 when averaging over tIoU thresholds from 0.3 to 0.7. On the ActivityNet1.3 dataset, the average mAP of 36.18 over tIoU thresholds from 0.5 to 0.95 is still a good result; although the invention does not achieve the best result there, it exceeds the vast majority of existing methods.
In this embodiment, the extracted features are operated on by the Channel-only branch and the Spatial-only branch in Polarized Self-Attention. The Channel-only branch is defined as follows:

$A^{ch}(X) = F_{SG}\big[ W_z\big( \sigma_1(W_v(X)) \times F_{SM}(\sigma_2(W_q(X))) \big) \big]$

where $W_q$, $W_v$ and $W_z$ are 1×1 convolutional layers, $\sigma_1$ and $\sigma_2$ are reshape operators (the feature dimension is changed from C/2×H×W to C/2×HW), $F_{SM}$ is the Softmax operator, × is the matrix dot-product operation, $F_{SG}$ is the Sigmoid function, and the number of internal channels between $W_q$, $W_v$ and $W_z$ is C/2; the output of the channel branch is $Z^{ch} = A^{ch}(X) \odot^{ch} X$, where $\odot^{ch}$ is the channel multiplication operator. The principle is as follows: the input features X are first transformed into Q and V with a one-dimensional convolution with kernel size 1, where the channel dimension of Q is fully compressed while the channel dimension of V remains at a relatively high level (i.e., C/2); because the channel dimension of Q is compressed, its information needs to be enhanced by HDR, so the information of Q is enhanced with Softmax; Q and V are then matrix multiplied, a one-dimensional convolution with kernel size 1 followed by LayerNorm raises the channel dimension from C/2 to C, and finally the Sigmoid function keeps all parameters between 0 and 1.

The Spatial-only branch is defined as follows:

$A^{sp}(X) = F_{SG}\big[ \sigma_3\big( F_{SM}\big(\sigma_1(F_{GP}(W_q(X)))\big) \times \sigma_2(W_v(X)) \big) \big]$

where $W_q$ and $W_v$ are standard 1×1 convolutions, $\sigma_1$, $\sigma_2$ and $\sigma_3$ are three reshape operators, $F_{SM}$ is the Softmax operator, $F_{GP}$ is the global pooling operator, and $F_{SG}$ is the Sigmoid function; the output of the spatial branch is $Z^{sp} = A^{sp}(X) \odot^{sp} X$, where $\odot^{sp}$ is the spatial multiplication operator. Similar to the Channel-only branch, the input features are first converted to Q and V using a one-dimensional convolution with kernel size 1; for the Q features, global pooling is also used to compress the time dimension to a size of 1, while the time dimension of the V features is maintained at a larger level; since the time dimension of Q is compressed, Softmax is used to enhance the information of Q; Q and V are then matrix multiplied, followed by reshape and Sigmoid so that all parameters remain between 0 and 1.

The outputs of the channel branch and the spatial branch are composed in parallel layout: $\mathrm{PSA}_p(X) = Z^{ch} + Z^{sp}$.
passing the enhanced features containing global information through a shallow convolutional neural network is helpful for better merging local context information and training of stable visual transducers for time series data.
In this embodiment, the formulas in step 30 are as follows:

Original video feature vector: $X = \{x_1, x_2, \ldots, x_T\} \in \mathbb{R}^{T \times D}$;

X is divided into T/k segments: $X = \{S_1, S_2, \ldots, S_{T/k}\}$, where each $S_i$ contains k frames;

A frame is randomly taken from each segment and replicated k times: $S_i' = \mathrm{copy}_k(\mathrm{rand}(S_i))$, where $\mathrm{rand}(\cdot)$ denotes taking a random frame and $\mathrm{copy}_k(\cdot)$ denotes copying it k times;

$L_{mse} = \alpha \cdot \mathrm{MSE}\big(f(X), f(X')\big)$, where $f(X)$ and $f(X')$ denote the new feature vectors obtained after the vectors X and X' pass through the backbone network, and MSE is the mean square loss function.
In this embodiment, each video loss is defined as follows:

$L = \sum_t \frac{ L_{cls} + \lambda_{reg}\, \mathbb{1}_{c_t}\, L_{reg} }{ T_+ }$

where T is the length of the input sequence, $\mathbb{1}_{c_t}$ is an indicator function indicating whether time step t is within the action range, i.e., a positive sample, $T_+$ is the total number of positive samples, the loss is applied to all levels of the output pyramid and averaged over all video samples during training, $\lambda_{reg}$ is a coefficient balancing the classification loss and the regression loss, $L_{cls}$ is the focal loss for classification, and $L_{reg}$ is the GIoU loss for distance regression.
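A hedged sketch of this per-video loss is given below: a sigmoid focal loss over C classes at every time step plus a 1D GIoU term on the (start, end) distances of positive time steps, normalized by the number of positives; the focal-loss hyper-parameters and the positive-sample test are assumptions.

```python
import torch
import torch.nn.functional as F

def video_loss(cls_logits, reg_pred, labels, gt_dist,
               lambda_reg=1.0, alpha=0.25, gamma=2.0):
    """cls_logits: (T, C) raw scores; labels: (T, C) multi-hot targets;
    reg_pred, gt_dist: (T, 2) distances to action start/end. A time step is
    treated as positive when both ground-truth distances are > 0."""
    # sigmoid focal loss for classification
    ce = F.binary_cross_entropy_with_logits(cls_logits, labels, reduction="none")
    p_t = torch.exp(-ce)
    alpha_t = alpha * labels + (1 - alpha) * (1 - labels)
    l_cls = (alpha_t * (1 - p_t) ** gamma * ce).sum()

    # 1D GIoU on positive time steps: both segments contain t, so the
    # enclosing interval equals the union and GIoU reduces to IoU
    pos = (gt_dist > 0).all(dim=-1)
    t_pos = max(int(pos.sum()), 1)
    if pos.any():
        p, g = reg_pred[pos], gt_dist[pos]
        inter = torch.min(p, g).sum(-1)
        union = torch.max(p, g).sum(-1)
        giou = inter / union.clamp(min=1e-6)
        l_reg = (1.0 - giou).sum()
    else:
        l_reg = cls_logits.sum() * 0.0
    return (l_cls + lambda_reg * l_reg) / t_pos
```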
In this embodiment, the pyramid features are obtained using 6 Transformer layers, each consisting of alternating LSTM, local multi-head self-attention (MSA) and MLP blocks; LayerNorm (LN) is applied before each MSA or MLP, and a residual connection is added after each block; the channel MLP has two linear layers with a GELU activation in between; the downsampling operation uses a strided depth-separable 1D convolution, with the model using a downsampling ratio of 2. The specific formulas are as follows:

$\bar{Z}^{\ell} = \alpha^{\ell}\,\mathrm{MSA}\big(\mathrm{LN}(\mathrm{LSTM}(Z^{\ell-1}))\big)\!\downarrow +\ Z^{\ell-1}\!\downarrow$
$Z^{\ell} = \bar{\alpha}^{\ell}\,\mathrm{MLP}\big(\mathrm{LN}(\bar{Z}^{\ell})\big) + \bar{Z}^{\ell}$

where $\alpha^{\ell}$ and $\bar{\alpha}^{\ell}$ are learnable per-channel scaling factors initialized to 0, and $\downarrow$ denotes the downsampling operation whose downsampling ratio is 2.
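A minimal sketch of one such pyramid level is shown below; full multi-head attention stands in for the windowed local self-attention, the LSTM placement inside the attention branch follows the formula above, and the channel width, head count and MLP ratio are assumptions.

```python
import torch
import torch.nn as nn

class PyramidTransformerBlock(nn.Module):
    """One pyramid level: LSTM -> self-attention -> MLP, with LayerNorm
    before attention/MLP, residual connections, per-channel scales
    initialized to 0, and a stride-2 depthwise 1D conv as downsampling."""
    def __init__(self, dim=512, heads=4, mlp_ratio=4):
        super().__init__()
        self.lstm = nn.LSTM(dim, dim, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio), nn.GELU(),
            nn.Linear(dim * mlp_ratio, dim))
        self.alpha1 = nn.Parameter(torch.zeros(dim))   # per-channel scale, init 0
        self.alpha2 = nn.Parameter(torch.zeros(dim))
        self.down = nn.Conv1d(dim, dim, kernel_size=3, stride=2,
                              padding=1, groups=dim)   # depthwise, ratio 2

    def downsample(self, x):                           # x: (B, T, C)
        return self.down(x.transpose(1, 2)).transpose(1, 2)

    def forward(self, x):                              # x: (B, T, C)
        h, _ = self.lstm(x)
        a, _ = self.attn(self.norm1(h), self.norm1(h), self.norm1(h))
        x = self.downsample(x) + self.alpha1 * self.downsample(a)
        return x + self.alpha2 * self.mlp(self.norm2(x))
```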
Embodiment 2: the present invention further provides a video interaction detection system based on random frame complement and attention, which includes a feature extraction module for extracting global temporal information; a temporal self-attention module for modeling the global temporal information to obtain features containing multi-scale local information; a random frame complement data enhancement module for making the actions and boundaries of the original video clear; a pyramid feature generation module for encoding the features of the multi-scale local information into a 6-layer feature pyramid through a multi-scale Transformer and combining LSTM with the Transformer; and a classification module for inputting the pyramid features of each scale into different 1D convolutions to obtain localization and classification features.
Embodiment 3: in an embodiment of the present invention, there is further provided a computer-readable storage medium storing a computer program, wherein the video interaction detection method is implemented when the computer program is executed by a processor.
Embodiment 4: in an embodiment of the present invention, there is further provided a computing device comprising: at least one processor; and at least one memory storing a computer program which, when executed by the at least one processor, implements the video interaction detection method.
While the foregoing describes specific embodiments of the present invention with reference to the drawings, this is not intended to limit the scope of protection of the invention; it is apparent that various modifications or variations can be made by those skilled in the art on the basis of the technical solutions of the present invention without inventive effort.

Claims (7)

1. A video interaction detection method based on random frame complement and attention is characterized in that: the method comprises the following steps:
step 10, selection of a feature extraction network
Selecting an I3D network pre-trained based on a Kinetics data set to extract features;
step 20, self-attention global information modeling
Modeling global temporal information on the output of the I3D network selected as the base network in step 10; searching for the relations between frames using Polarized Self-Attention and weighting them;
Adding a 1D convolution before the Transformer network;
step 30 random frame complement data enhancement
On the output of the feature network of step 10, a video is divided into a plurality of segments, a frame is randomly taken from each segment, and the other frames are replaced by the taken frame, so that a new feature vector with larger variation is formed;
Calculating an MSE loss between the new feature vector after passing through the backbone and the original video feature vector;
the formulas in step 30 are as follows:

original video feature vector: $X = \{x_1, x_2, \ldots, x_T\} \in \mathbb{R}^{T \times D}$, where T represents the length of the video feature sequence and D represents the feature dimension;

X is divided into T/k segments: $X = \{S_1, S_2, \ldots, S_{T/k}\}$, where each $S_i$ contains k frames and i represents the i-th video feature segment;

a frame is randomly taken from each segment and replicated k times: $S_i' = \mathrm{copy}_k(\mathrm{rand}(S_i))$, where $\mathrm{rand}(\cdot)$ represents taking a random frame and $\mathrm{copy}_k(\cdot)$ represents copying it k times;

$L_{mse} = \alpha \cdot \mathrm{MSE}\big(f(X), f(X')\big)$, where $f(X)$ and $f(X')$ represent the new feature vectors obtained after the vectors X and X' pass through the backbone network, MSE is the mean square loss function, and α represents an adjustment factor, typically 1;
step 40, pyramid feature generation
On the basis of the network of step 20, the features after passing through the multi-scale information aggregation module are encoded into a 6-layer feature pyramid through a multi-scale Transformer, and LSTM is combined with the Transformer;
step 50. Boundary locating and classifying
After obtaining the pyramid features of 6 scales, the pyramid features of each scale are respectively input into different 1D convolutions to obtain localization and classification features; classification is then performed with the classification features and boundary regression is performed with the localization features; a focal loss is used as the constraint during classification training, and a GIoU loss is used as the constraint during regression training.
2. The method for detecting video interaction based on random frame complement and attention according to claim 1, characterized in that: the extracted features are operated on by a Channel-only branch and a Spatial-only branch in Polarized Self-Attention, the Channel-only branch being defined as follows:

$A^{ch}(X) = F_{SG}\big[ W_z\big( \sigma_1(W_v(X)) \times F_{SM}(\sigma_2(W_q(X))) \big) \big]$

wherein $W_q$, $W_v$ and $W_z$ are 1×1 convolutional layers, $\sigma_1$ and $\sigma_2$ are reshape operators, that is, the feature dimension is changed from C/2×H×W to C/2×HW, C denotes the channel dimension, H denotes the height of the picture, W denotes the width of the picture, $F_{SM}$ is the Softmax operator, × is the matrix dot-product operation, the number of internal channels between $W_q$, $W_v$ and $W_z$ is C/2, and the output of the channel branch is $Z^{ch} = A^{ch}(X) \odot^{ch} X$, wherein $\odot^{ch}$ is the channel multiplication operator;

the Spatial-only branch is defined as follows:

$A^{sp}(X) = F_{SG}\big[ \sigma_3\big( F_{SM}\big(\sigma_1(F_{GP}(W_q(X)))\big) \times \sigma_2(W_v(X)) \big) \big]$

wherein $F_{SG}$ denotes the Sigmoid function, $W_q$ and $W_v$ are standard 1×1 convolutions, $\sigma_1$, $\sigma_2$ and $\sigma_3$ are three reshape operators, $F_{SM}$ is the Softmax operator, $F_{GP}$ is the global pooling operator, and the output of the spatial branch is $Z^{sp} = A^{sp}(X) \odot^{sp} X$, wherein $\odot^{sp}$ is the spatial multiplication operator;

the outputs of the channel branch and the spatial branch are composed in parallel layout: $\mathrm{PSA}_p(X) = Z^{ch} + Z^{sp}$.
3. the method for detecting video interaction based on random frame complement and attention according to claim 1, wherein the method comprises the following steps: each video loss is defined as follows:
wherein the method comprises the steps ofIs the length of the input sequence, < > is->Is an indication function indicating whether the time step t is within the motion range, i.e. positive samples,/->Is the total number of positive samples, +.>Applied to all levels on the output pyramid and averaged over all video samples during training, +.>Is a coefficient of balance classification loss and regression loss, < ->One for distance regression>Lcls is denoted as classification loss.
4. The method for detecting video interaction based on random frame complement and attention according to claim 1, characterized in that: the pyramid features are obtained using 6 Transformer layers, each consisting of alternating LSTM, local multi-head self-attention (MSA) and MLP blocks, with LayerNorm (LN) applied before each MSA or MLP and a residual connection added after each block; the channel MLP has two linear layers with a GELU activation in between; the downsampling operation uses a strided depth-separable 1D convolution with a downsampling ratio of 2; the specific formulas are as follows:

$\bar{Z}^{\ell} = \alpha^{\ell}\,\mathrm{MSA}\big(\mathrm{LN}(\mathrm{LSTM}(Z^{\ell-1}))\big)\!\downarrow +\ Z^{\ell-1}\!\downarrow$
$Z^{\ell} = \bar{\alpha}^{\ell}\,\mathrm{MLP}\big(\mathrm{LN}(\bar{Z}^{\ell})\big) + \bar{Z}^{\ell}$

wherein $\alpha^{\ell}$ and $\bar{\alpha}^{\ell}$ are learnable per-channel scaling factors initialized to 0, $\downarrow$ denotes the downsampling operation with ratio 2, $T^{\ell}$ denotes the time series length of layer $\ell$, and $T^{\ell-1}$ denotes the time series length of layer $\ell-1$, with $T^{\ell} = T^{\ell-1}/2$.
5. A video interaction detection system based on random frame complement and attention is characterized in that:
the device comprises a feature extraction module, a feature extraction module and a feature extraction module, wherein the feature extraction module is used for extracting global time sequence information;
the time sequence self-attention module is used for modeling global time sequence information to obtain characteristics containing multi-scale local information;
the random frame supplementing data enhancement module is used for enabling the actions and boundaries of the original video to be clear;
the pyramid feature generation module is used for encoding the features of the multi-scale local information into a 6-layer feature pyramid through a multi-scale transducer and combining LSTM with the transducer;
the classification module is used for inputting pyramid features of each scale into different 1D convolutions to obtain positioning and classifying features;
the formula used in the random frame complement data enhancement module is as follows:
original video feature vector:
t represents the length of the video feature sequence, and D represents the feature dimension;
handleXDivided into t/k segments:
each of which isContaining k frames, i representing the ith video feature; a frame is randomly taken from each segment, and replicated k times,
representing a random frame fetch,/->Representative copykPerforming secondary operation;
representative vector->And->New eigenvectors after passing back bone network,/->Mean square loss function>Represents an adjustment factor, typically 1.
6. A computer readable storage medium storing a computer program, wherein the video interaction detection method of any of claims 1 to 4 is implemented when the computer program is executed by a processor.
7. A computing device, comprising: at least one processor; at least one memory storing a computer program which, when executed by the at least one processor, implements the video interaction detection method of any of claims 1 to 4.
CN202310657865.4A 2023-06-06 2023-06-06 Video interaction action detection method and system based on random frame complement and attention Active CN116385945B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310657865.4A CN116385945B (en) 2023-06-06 2023-06-06 Video interaction action detection method and system based on random frame complement and attention

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310657865.4A CN116385945B (en) 2023-06-06 2023-06-06 Video interaction action detection method and system based on random frame complement and attention

Publications (2)

Publication Number Publication Date
CN116385945A CN116385945A (en) 2023-07-04
CN116385945B true CN116385945B (en) 2023-08-25

Family

ID=86981016

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310657865.4A Active CN116385945B (en) 2023-06-06 2023-06-06 Video interaction action detection method and system based on random frame complement and attention

Country Status (1)

Country Link
CN (1) CN116385945B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117138455B (en) * 2023-10-31 2024-02-27 克拉玛依曜诚石油科技有限公司 Automatic liquid filtering system and method
CN117354525B (en) * 2023-12-05 2024-03-15 深圳市旭景数字技术有限公司 Video coding method and system for realizing efficient storage and transmission of digital media

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109784269A (en) * 2019-01-11 2019-05-21 中国石油大学(华东) One kind is based on the united human action detection of space-time and localization method
CN111259795A (en) * 2020-01-16 2020-06-09 河南职业技术学院 Human behavior recognition method based on multi-stream deep learning
WO2021051545A1 (en) * 2019-09-16 2021-03-25 平安科技(深圳)有限公司 Behavior identification model-based fall-down action determining method and apparatus, computer device, and storage medium
CN114973416A (en) * 2022-06-07 2022-08-30 哈尔滨理工大学 Sign language recognition algorithm based on three-dimensional convolution network

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11883245B2 (en) * 2021-03-22 2024-01-30 Verb Surgical Inc. Deep-learning-based real-time remaining surgery duration (RSD) estimation

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109784269A (en) * 2019-01-11 2019-05-21 中国石油大学(华东) One kind is based on the united human action detection of space-time and localization method
WO2021051545A1 (en) * 2019-09-16 2021-03-25 平安科技(深圳)有限公司 Behavior identification model-based fall-down action determining method and apparatus, computer device, and storage medium
CN111259795A (en) * 2020-01-16 2020-06-09 河南职业技术学院 Human behavior recognition method based on multi-stream deep learning
CN114973416A (en) * 2022-06-07 2022-08-30 哈尔滨理工大学 Sign language recognition algorithm based on three-dimensional convolution network

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
"基于多特征融合及 Transformer 的人体跌倒动作检测算法";刘文龙等;《应用科技》;第49-62页 *

Also Published As

Publication number Publication date
CN116385945A (en) 2023-07-04

Similar Documents

Publication Publication Date Title
CN116385945B (en) Video interaction action detection method and system based on random frame complement and attention
CN110263912B (en) Image question-answering method based on multi-target association depth reasoning
CN109685819B (en) Three-dimensional medical image segmentation method based on feature enhancement
CN111950455B (en) Motion imagery electroencephalogram characteristic identification method based on LFFCNN-GRU algorithm model
CN108133188A (en) A kind of Activity recognition method based on motion history image and convolutional neural networks
CN109443382A (en) Vision SLAM closed loop detection method based on feature extraction Yu dimensionality reduction neural network
CN110175551B (en) Sign language recognition method
CN115896817A (en) Production method and system of fluorine-nitrogen mixed gas
CN111462191B (en) Non-local filter unsupervised optical flow estimation method based on deep learning
CN110321805B (en) Dynamic expression recognition method based on time sequence relation reasoning
CN110276784B (en) Correlation filtering moving target tracking method based on memory mechanism and convolution characteristics
CN115222998B (en) Image classification method
CN113569805A (en) Action recognition method and device, electronic equipment and storage medium
CN112149645A (en) Human body posture key point identification method based on generation of confrontation learning and graph neural network
CN113837229B (en) Knowledge-driven text-to-image generation method
CN112668543B (en) Isolated word sign language recognition method based on hand model perception
CN113850182A (en) Action identification method based on DAMR-3 DNet
Dastbaravardeh et al. Channel Attention‐Based Approach with Autoencoder Network for Human Action Recognition in Low‐Resolution Frames
CN113537240B (en) Deformation zone intelligent extraction method and system based on radar sequence image
Ma et al. Convolutional transformer network for fine-grained action recognition
CN114463614A (en) Significance target detection method using hierarchical significance modeling of generative parameters
Shi et al. Attention-YOLOX: Improvement in On-Road Object Detection by Introducing Attention Mechanisms to YOLOX
Zeng et al. Expression Recognition Based on Multi-Scale Adaptive Parallel Integration Network
WO2024093466A1 (en) Person image re-identification method based on autonomous model structure evolution
CN116012388B (en) Three-dimensional medical image segmentation method and imaging method for acute ischemic cerebral apoplexy

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
TR01 Transfer of patent right
TR01 Transfer of patent right

Effective date of registration: 20231225

Address after: Building A6-211, Hanyu Jingu, No. 7000 Jingshi Road, Jinan Area, China (Shandong) Pilot Free Trade Zone, Jinan City, Shandong Province, 250000

Patentee after: Shandong Zhonglian Audio-Visual Information Technology Co.,Ltd.

Address before: No.19 Keyuan Road, Lixia District, Jinan City, Shandong Province

Patentee before: Shandong Institute of artificial intelligence

Patentee before: TIANJIN University OF TECHNOLOGY

Patentee before: Shandong Zhonglian Audio-Visual Information Technology Co.,Ltd.