Disclosure of Invention
The invention aims to solve the problem of temporal action localization: previous temporal action localization methods either require strong prior knowledge of a data set or involve a large amount of computation. The invention provides a video interaction action detection method and system based on random frame supplementing and attention, which solve the problems that temporal action localization methods require strong prior knowledge or involve a large amount of computation.
The technical scheme for solving the technical problems is as follows:
A video interaction action detection method based on random frame supplementing and attention comprises the following steps:
Step 10: selection of a feature extraction network
An I3D network pre-trained on the Kinetics data set is selected for feature extraction; 16 consecutive frames are taken as the I3D input, 1024-D features are extracted before the last fully connected layer using a sliding window with a stride of 4, and the two-stream features are concatenated (2048-D) as the model input;
Step 20: self-attention global information modeling
Global temporal information is modeled on the output of the I3D network selected in Step 10; Polarized Self-Attention is used to find the relations between frames and weight them, and through this self-attention-based weighting strategy the more important frames can be found and given higher weight;
A 1D convolution is added before the Transformer network, which better fuses local context information and stabilizes the training of the visual Transformer, realizing the modeling of global information;
Step 30: random frame supplementing data enhancement
At the output of the Step 10 feature network, a video is divided into T/k segments; one frame is randomly taken from each segment, and the remaining k-1 frames are set identical to the taken frame, forming a new feature vector with larger variation, equivalent to accelerating the video while the actual positions of the actions remain unchanged;
An MSE loss is computed between the new feature vector after it passes through the backbone and the original video feature vector, constraining the two so that they are pulled close and learn information from each other, thereby achieving data enhancement;
Step 40: pyramid feature generation
On the basis of the Step 20 network, the features output by the multi-scale information aggregation module are encoded into a 6-layer feature pyramid through a multi-scale Transformer, and an LSTM is combined and fused with the Transformer so that the supplementary history information provided by the LSTM and the attention-based information representation provided by the Transformer module together improve model capacity. This also alleviates the problem that a single model performs differently on data sets of different sizes: the LSTM performs better than the Transformer on small data sets, while the Transformer performs very prominently after pre-training;
Step 50: boundary localization and classification
After the pyramid features of 6 scales are obtained, the pyramid features of each scale are respectively input into different 1D convolutions to obtain localization and classification features; the classification features are then used for classification and boundary regression is performed using the localization features; a focal loss is used as the constraint when training classification, and a GIoU loss is used as the constraint when training regression.
Based on the above video interaction action detection method based on random frame supplementing and attention, the formulas in Step 30 are as follows:
The original video feature vector is

$$X = [x_1, x_2, \ldots, x_T];$$

$X$ is divided into T/k segments,

$$X = [X_1, X_2, \ldots, X_{T/k}],$$

where each $X_i$ contains k frames; a frame is randomly taken from each segment and replicated k times,

$$X_i' = \mathrm{Copy}_k(\mathrm{Rand}(X_i)),$$

where $\mathrm{Rand}(\cdot)$ represents taking a random frame and $\mathrm{Copy}_k(\cdot)$ represents copying it k times;

$$L_{mse} = \mathrm{MSE}(Z, Z'),$$

where $Z$ and $Z'$ represent the new feature vectors obtained after the vectors $X$ and $X'$ pass through the backbone network, and $\mathrm{MSE}(\cdot)$ is the mean square loss function.
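For illustration, a minimal PyTorch sketch of this augmentation follows; the tensor layout, the requirement that T be divisible by k, and the backbone interface are assumptions, since only the formulas above are specified.

```python
import torch
import torch.nn.functional as F

def random_frame_supplement(x: torch.Tensor, k: int) -> torch.Tensor:
    """x: (C, T) feature sequence, T divisible by k (assumption).

    Splits the T frames into T//k segments, picks one random frame per
    segment, and repeats it k times, i.e. X'_i = Copy_k(Rand(X_i)).
    """
    c, t = x.shape
    segments = x.view(c, t // k, k)                  # (C, T/k, k)
    idx = torch.randint(0, k, (t // k,))             # one random frame per segment
    picked = segments[:, torch.arange(t // k), idx]  # (C, T/k)
    return picked.repeat_interleave(k, dim=1)        # (C, T)

def supplement_loss(backbone, x: torch.Tensor, k: int) -> torch.Tensor:
    # L_mse = MSE(Z, Z'): pull the two representations close together.
    z = backbone(x)
    z_aug = backbone(random_frame_supplement(x, k))
    return F.mse_loss(z_aug, z)
```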
Based on the above video interaction action detection method based on random frame supplementing and attention, the extracted features are operated on by a Channel-only branch and a Spatial-only branch in Polarized Self-Attention, where the Channel-only branch is defined as follows:
$$A^{ch}(X) = F_{SG}\left[ W_z\left( \sigma_1(W_v(X)) \times F_{SM}\left(\sigma_2(W_q(X))\right) \right) \right],$$

where $W_q$, $W_v$ and $W_z$ are 1×1 convolutional layers, $\sigma_1$ and $\sigma_2$ are reshape operators ($\sigma_1$ changes the feature dimension from C/2×H×W to C/2×HW), $F_{SM}$ is the Softmax operator, $\times$ is a matrix dot-product operation, $F_{SG}$ is the Sigmoid operator, and the number of internal channels between $W_q$, $W_v$ and $W_z$ is C/2; the output of the channel branch is

$$Z^{ch} = A^{ch}(X) \odot^{ch} X,$$

where $\odot^{ch}$ is a channel multiplication operator;
The Spatial-only branch is defined as follows:

$$A^{sp}(X) = F_{SG}\left[ \sigma_3\left( F_{SM}\left(\sigma_1(F_{GP}(W_q(X)))\right) \times \sigma_2(W_v(X)) \right) \right],$$

where $W_q$ and $W_v$ are standard 1×1 convolutions, $\sigma_1$, $\sigma_2$ and $\sigma_3$ are three reshape operators, $F_{SM}$ is the Softmax operator, $F_{GP}$ is a global pooling operator, and $F_{SG}$ is the Sigmoid operator; the output of the spatial branch is

$$Z^{sp} = A^{sp}(X) \odot^{sp} X,$$

where $\odot^{sp}$ is a spatial multiplication operator;
the outputs of the channel branches and the spatial branches are composed in parallel layout:
each video interaction action detection method based on random frame supplement and attentionA video loss is defined as follows:
;
wherein the method comprises the steps of
Is the length of the input sequence. />
Is an indication function indicating whether the time step t is within the motion range, i.e. positive samples,/->
Is the total number of positive samples, +.>
Applied to all levels on the output pyramid and averaged over all video samples during training, +.>
Is a coefficient of balance classification loss and regression loss, < ->
One for distance regression>
。
Based on the above video interaction action detection method based on random frame supplementing and attention, the pyramid features are obtained using 6 Transformer layers; each layer consists of alternating LSTM, local multi-head self-attention (MSA) and MLP blocks, LayerNorm (LN) is applied before each MSA or MLP, a residual connection is added after each block, the channel MLP has two linear layers with a GELU activation in between, and the downsampling operation is realized using a strided depthwise separable 1D convolution with a downsampling ratio of 2; the specific formulas are as follows:

$$\bar{Z}^{\ell} = \alpha^{\ell}\,\mathrm{MSA}(\mathrm{LN}(Z^{\ell-1})) + Z^{\ell-1},$$
$$Z^{\ell} = \downarrow\left(\bar{\alpha}^{\ell}\,\mathrm{MLP}(\mathrm{LN}(\bar{Z}^{\ell})) + \bar{Z}^{\ell}\right),$$

where $\alpha^{\ell}$ and $\bar{\alpha}^{\ell}$ are learnable per-channel scaling factors initialized to 0, and $\downarrow(\cdot)$ is the downsampling operator with the stated downsampling ratio.
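The following PyTorch sketch illustrates one such pyramid layer under the formulas above; the head count, MLP width, kernel size and the placement of the LSTM residual are assumptions, and the local-window attention mask is omitted for brevity.

```python
import torch
import torch.nn as nn

class PyramidTransformerLayer(nn.Module):
    """Sketch of one pyramid layer: LSTM, pre-LN MSA, pre-LN channel MLP
    with GELU, zero-initialized per-channel scales, and a stride-2
    depthwise separable 1D convolution for downsampling."""

    def __init__(self, d: int, heads: int = 4):
        super().__init__()
        self.lstm = nn.LSTM(d, d, batch_first=True)
        self.norm1 = nn.LayerNorm(d)
        self.attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d)
        self.mlp = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))
        self.alpha1 = nn.Parameter(torch.zeros(d))   # learnable per-channel scale, init 0
        self.alpha2 = nn.Parameter(torch.zeros(d))
        self.down = nn.Sequential(                    # depthwise separable, ratio 2
            nn.Conv1d(d, d, 3, stride=2, padding=1, groups=d),
            nn.Conv1d(d, d, 1),
        )

    def forward(self, z):                             # z: (B, T, D)
        z = self.lstm(z)[0] + z                       # supplementary history information
        h = self.norm1(z)
        z = self.alpha1 * self.attn(h, h, h)[0] + z   # Z_bar = alpha * MSA(LN(Z)) + Z
        z = self.alpha2 * self.mlp(self.norm2(z)) + z # Z = alpha_bar * MLP(LN(Z_bar)) + Z_bar
        return self.down(z.transpose(1, 2)).transpose(1, 2)   # (B, T/2, D)
```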
The embodiment of the invention also provides a video interaction action detection system based on random frame supplementing and attention, which comprises a feature extraction module for extracting global temporal information; a temporal self-attention module for modeling the global temporal information to obtain features containing multi-scale local information; a random frame supplementing data enhancement module for making the actions and boundaries of the original video clearer; a pyramid feature generation module for encoding the features containing multi-scale local information into a 6-layer feature pyramid through a multi-scale Transformer and combining an LSTM with the Transformer; and a classification module for inputting the pyramid features of each scale into different 1D convolutions to obtain localization and classification features.
In an embodiment of the present invention, there is also provided a computer-readable storage medium storing a computer program, wherein the above video interaction action detection method is implemented when the computer program is executed by a processor.
In an embodiment of the present invention, there is also provided a computing device comprising: at least one processor; and at least one memory storing a computer program which, when executed by the at least one processor, implements the above video interaction action detection method.
The effects provided in the summary of the invention are merely effects of embodiments, not all effects of the invention, and the above technical solution has the following advantages or beneficial effects:
1) Modeling of global information is achieved by finding the more important frames through a self-attention mechanism and giving them higher weight.
2) Random frame supplementing applied to the original video features produces larger variations of the original video, achieving data enhancement.
3) Combining the LSTM and the Transformer improves model capacity and alleviates the problem that a single model performs differently on data sets of different sizes.
Detailed Description
In order to clearly illustrate the technical features of the present solution, the present invention will be described in detail below with reference to the following detailed description and the accompanying drawings.
Embodiment 1: as shown in Fig. 1, the operation flow of the video interaction action detection method based on random frame supplementing and attention according to the present invention includes the following steps:
Step 10: selection of a feature extraction network
In the temporal action localization task, an excellent feature extractor must first be selected to obtain robust features, and because of the characteristics of this task, a feature extractor capable of extracting temporal information is needed. A two-stream I3D network is therefore adopted for feature extraction. The input of the RGB stream is consecutive video frames, so temporal and spatial features can be extracted at the same time; for the Flow stream, the input is consecutive optical flow frames, so temporal information can be further extracted and modeled. An I3D network pre-trained on the Kinetics data set is selected for feature extraction; 16 consecutive frames are taken as the I3D input, 1024-D features are extracted before the last fully connected layer using a sliding window with a stride of 4, and the two-stream features are concatenated (2048-D) as the model input; a sketch of this extraction is given below.
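A minimal sketch of this sliding-window, two-stream extraction follows; the i3d_rgb/i3d_flow callables and the tensor layout are assumptions for illustration, with each callable assumed to map one clip to a 1024-D feature taken before the final fully connected layer.

```python
import torch

def extract_clip_features(frames, flows, i3d_rgb, i3d_flow,
                          clip_len: int = 16, stride: int = 4):
    """frames: (3, T, H, W) RGB, flows: (2, T, H, W) optical flow.

    Slides a 16-frame window with stride 4 and concatenates the two
    1024-D stream features into a 2048-D feature per window.
    """
    feats = []
    t = frames.shape[1]
    for start in range(0, t - clip_len + 1, stride):
        rgb = i3d_rgb(frames[:, start:start + clip_len].unsqueeze(0))   # (1, 1024)
        flo = i3d_flow(flows[:, start:start + clip_len].unsqueeze(0))   # (1, 1024)
        feats.append(torch.cat([rgb, flo], dim=-1))                     # (1, 2048)
    return torch.cat(feats, dim=0)             # (num_windows, 2048) model input
```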
Step 20: self-attention global information modeling
Global temporal information is modeled on the output of the I3D network selected in Step 10; Polarized Self-Attention is used to find the relations between frames and weight them, and through this self-attention-based weighting strategy the more important frames can be found and given higher weight;
the 1D volume is added before the transducer network, so that the training of the local context information and the stable visual transducer can be better combined, and the modeling of the global information is realized;
Step 30: random frame supplementing data enhancement
Untrimmed videos contain background of irrelevant activities, which makes the action boundaries unclear; in order to enlarge the variation of the video and make the boundaries more obvious, random frame supplementing is proposed for data enhancement;
At the output of the Step 10 feature network, a video is divided into T/k segments; one frame is randomly taken from each segment, and the remaining k-1 frames are set identical to the taken frame, forming a new feature vector with larger variation, equivalent to accelerating the video while the actual positions of the actions remain unchanged;
An MSE loss is computed between the new feature vector after it passes through the backbone and the original video feature vector, constraining the two so that they are pulled close and learn information from each other, thereby achieving data enhancement;
Step 40: pyramid feature generation
On the basis of the Step 20 network, the features output by the multi-scale information aggregation module are encoded into a 6-layer feature pyramid through a multi-scale Transformer, and an LSTM is combined and fused with the Transformer so that the supplementary history information provided by the LSTM and the attention-based information representation provided by the Transformer module together improve model capacity. This also alleviates the problem that a single model performs differently on data sets of different sizes: the LSTM performs better than the Transformer on small data sets, while the Transformer performs very prominently after pre-training;
Step 50: boundary localization and classification
After the pyramid features are obtained in Step 40, the classification head examines every moment t of all L pyramid levels and predicts the probability of action $p(c_t)$ at each moment t. This head is implemented using a lightweight 1D convolutional network attached to each pyramid level, with parameters shared across all levels; the classification network is implemented using three layers of 1D convolution with kernel size 3, layer normalization (for the first 2 layers) and ReLU activation; a sigmoid function is attached to each output dimension to predict the probabilities of the C action categories. The regression head is similar to the classification head and likewise examines every moment t of all L pyramid levels; the difference is that the regression head predicts the distances to the onset and offset of the action, $(d_t^s, d_t^e)$, and only when the current time step t lies within an action; each pyramid level is pre-assigned an output regression range, and the regression head adopts the same design as the classification network using a 1D convolutional network, but adds a ReLU at the end for distance estimation. For each moment t the model output $\hat{o}_t = (p(c_t), d_t^s, d_t^e)$ includes the action category probability and the boundary distances. The loss function likewise follows a very simple design, with only two terms: (1) $\mathcal{L}_{cls}$, a focal loss for the C-way classification; and (2) $\mathcal{L}_{reg}$, a GIoU loss for distance regression. A sketch of these heads follows.
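The following sketch illustrates such shared classification and regression heads; the channel width, the use of GroupNorm as a channel-wise layer normalization for 1D features, and the output kernel size are assumptions.

```python
import torch.nn as nn

class Heads(nn.Module):
    """Sketch of the shared heads: three kernel-size-3 1D convolutions,
    layer normalization on the first two, ReLU activations; sigmoid for
    the C class probabilities, ReLU for the two non-negative boundary
    distances. One instance is applied to every pyramid level, so the
    parameters are shared across levels."""

    def __init__(self, d: int, num_classes: int):
        super().__init__()
        def tower():
            return nn.Sequential(
                nn.Conv1d(d, d, 3, padding=1), nn.GroupNorm(1, d), nn.ReLU(),
                nn.Conv1d(d, d, 3, padding=1), nn.GroupNorm(1, d), nn.ReLU(),
            )
        self.cls_tower, self.reg_tower = tower(), tower()
        self.cls_out = nn.Conv1d(d, num_classes, 3, padding=1)
        self.reg_out = nn.Conv1d(d, 2, 3, padding=1)

    def forward(self, feat):                    # feat: (B, D, T), one pyramid level
        p = self.cls_out(self.cls_tower(feat)).sigmoid()   # (B, C, T) class probs
        d = self.reg_out(self.reg_tower(feat)).relu()      # (B, 2, T): (d_s, d_e)
        return p, d
```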
Step 60: temporal action localization results
On the THUMOS14 data set, video features are extracted using the two-stream I3D pre-trained on Kinetics; 16 consecutive frames are taken as the I3D input, and 1024-D features are extracted before the last fully connected layer using a sliding window with a stride of 4; the two-stream features are further concatenated (2048-D) as the model input. mAP@0.3:0.1:0.7 is used to evaluate the model of the invention. The model is trained for 50 epochs with a linear warm-up of 5 epochs; the initial learning rate is 1e-4 with cosine learning rate decay; the mini-batch size is 2 and the weight decay is 1e-4; based on ablation studies, the window size of the local self-attention is 19; external classification scores from UntrimmedNet are also incorporated. For the ActivityNet1.3 data set, feature extraction is also performed using the two-stream I3D, but the stride of the sliding window is increased to 16; the extracted features are downsampled to a fixed length of 128 by linear interpolation; for evaluation, mAP@0.5:0.05:0.95 is used and the average mAP is reported; the model is trained for 15 epochs with a linear warm-up of 5 epochs; the learning rate is 1e-3, the mini-batch size is 16, and the weight decay is 1e-4; the window size of the local self-attention is 25; furthermore, external classification results are combined; similarly, the invention considers a pre-training approach from TSP and compares the model with the same set of baselines, including the closest competing single-stage models. The training settings are collected below for reference.
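For reference, the training settings just described can be gathered into a small configuration sketch; the dictionary grouping and key names are assumptions, while the values are those stated above.

```python
# Training configurations as described in the text (values as stated;
# grouping and key names are illustrative assumptions).
THUMOS14_CFG = dict(
    epochs=50, warmup_epochs=5, lr=1e-4, lr_schedule="cosine",
    batch_size=2, weight_decay=1e-4, attn_window=19,
    feature_stride=4, eval_tiou=(0.3, 0.4, 0.5, 0.6, 0.7),
)
ACTIVITYNET13_CFG = dict(
    epochs=15, warmup_epochs=5, lr=1e-3,
    batch_size=16, weight_decay=1e-4, attn_window=25,
    feature_stride=16, resize_len=128,
    eval_tiou=tuple(round(0.5 + 0.05 * i, 2) for i in range(10)),
)
```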
In the test procedure, at inference time, the complete sequence is input into the model, since no position embedding is used in the model. The model takes an input video $X$ and outputs $\{(p(c_t), d_t^s, d_t^e)\}$ at each time step t of all pyramid levels. Each time step t is further decoded into an action instance $(s_t, e_t, p(c_t))$, where $s_t = t - d_t^s$ and $e_t = t + d_t^e$ are the onset and offset of the action, and $p(c_t)$ is the action confidence score. The resulting action candidates are further processed using Soft-NMS to delete highly overlapping instances, yielding the final action outputs; a sketch of this decoding is given below.
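A sketch of the per-level decoding step follows; the stride bookkeeping that maps pyramid positions back to the input time scale and the score threshold are assumptions, and Soft-NMS itself is not shown.

```python
import torch

def decode_actions(cls_probs, regs, strides, score_thresh: float = 0.1):
    """cls_probs[l]: (C, T_l) class probabilities, regs[l]: (2, T_l)
    distances (d_s, d_e) for pyramid level l; strides[l] maps level-l
    positions back to input time steps (assumption).

    Every time step t on every level yields a candidate
    (t - d_s, t + d_e, score, label); Soft-NMS then removes highly
    overlapping instances to produce the final output.
    """
    candidates = []
    for p, d, s in zip(cls_probs, regs, strides):
        scores, labels = p.max(dim=0)              # best class per time step
        t = torch.arange(p.shape[1]) * s           # positions in input scale
        keep = scores > score_thresh
        starts = t[keep] - d[0, keep] * s
        ends = t[keep] + d[1, keep] * s
        candidates += list(zip(starts.tolist(), ends.tolist(),
                               scores[keep].tolist(), labels[keep].tolist()))
    return candidates                              # feed into Soft-NMS
```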
A comparison of the experimental results of the present invention with other methods on the THUMOS14 data set and the ActivityNet1.3 data set is given in the following table:
The invention achieves the best result on the THUMOS14 data set, with an average mAP of 68.3 when tIoU is computed from 0.3 to 0.7; on the ActivityNet1.3 data set, an average mAP of 36.18 when tIoU is computed from 0.5 to 0.95 is still a good result; although the invention does not achieve the best result there, it exceeds the vast majority of methods.
In this embodiment, the extracted features are operated on by the Channel-only branch and the Spatial-only branch in Polarized Self-Attention. The Channel-only branch is defined as follows:
$$A^{ch}(X) = F_{SG}\left[ W_z\left( \sigma_1(W_v(X)) \times F_{SM}\left(\sigma_2(W_q(X))\right) \right) \right],$$

where $W_q$, $W_v$ and $W_z$ are 1×1 convolutional layers, $\sigma_1$ and $\sigma_2$ are reshape operators ($\sigma_1$ changes the feature dimension from C/2×H×W to C/2×HW), $F_{SM}$ is the Softmax operator, $\times$ is a matrix dot-product operation, $F_{SG}$ is the Sigmoid operator, and the number of internal channels between $W_q$, $W_v$ and $W_z$ is C/2; the output of the channel branch is

$$Z^{ch} = A^{ch}(X) \odot^{ch} X,$$

where $\odot^{ch}$ is a channel multiplication operator. The principle is as follows: the input features X are first transformed into Q and V with a one-dimensional convolution with kernel size 1, where the channel dimension of Q is fully compressed while the channel dimension of V remains at a relatively high level (i.e., C/2); because the channel dimension of Q is compressed, as described above, the information must be enhanced by HDR, so the information of Q is enhanced with Softmax; Q and V are then matrix-multiplied, followed by a one-dimensional convolution with kernel size 1 and LN to raise the channel dimension from C/2 to C; finally a Sigmoid function keeps all parameters between 0 and 1.
The Spatial-only branch is defined as follows:

$$A^{sp}(X) = F_{SG}\left[ \sigma_3\left( F_{SM}\left(\sigma_1(F_{GP}(W_q(X)))\right) \times \sigma_2(W_v(X)) \right) \right],$$

where $W_q$ and $W_v$ are standard 1×1 convolutions, $\sigma_1$, $\sigma_2$ and $\sigma_3$ are three reshape operators, $F_{SM}$ is the Softmax operator, $F_{GP}$ is a global pooling operator, and $F_{SG}$ is the Sigmoid operator; the output of the spatial branch is

$$Z^{sp} = A^{sp}(X) \odot^{sp} X,$$

where $\odot^{sp}$ is a spatial multiplication operator. It can be seen that, similar to the Channel-only branch, the input features are first converted into Q and V using a one-dimensional convolution with kernel size 1; for the Q features, global pooling is additionally used to compress the time dimension to a size of 1, while the time dimension of the V features is maintained at a larger level; since the time dimension of Q is compressed, Softmax is used to enhance the Q information; Q and V are then matrix-multiplied, followed by reshape and Sigmoid so that all parameters remain between 0 and 1.
The outputs of the channel branch and the spatial branch are composed in a parallel layout:

$$PSA_p(X) = Z^{ch} + Z^{sp}.$$
passing the enhanced features containing global information through a shallow convolutional neural network is helpful for better merging local context information and training of stable visual transducers for time series data.
In this embodiment, the formulas in Step 30 are as follows:
The original video feature vector is

$$X = [x_1, x_2, \ldots, x_T];$$

$X$ is divided into T/k segments,

$$X = [X_1, X_2, \ldots, X_{T/k}],$$

where each $X_i$ contains k frames; a frame is randomly taken from each segment and replicated k times,

$$X_i' = \mathrm{Copy}_k(\mathrm{Rand}(X_i)),$$

where $\mathrm{Rand}(\cdot)$ represents taking a random frame and $\mathrm{Copy}_k(\cdot)$ represents copying it k times;

$$L_{mse} = \mathrm{MSE}(Z, Z'),$$

where $Z$ and $Z'$ represent the new feature vectors obtained after the vectors $X$ and $X'$ pass through the backbone network, and $\mathrm{MSE}(\cdot)$ is the mean square loss function.
In this embodiment, the loss of each video is defined as follows:
$$\mathcal{L} = \sum_t \frac{\mathcal{L}_{cls} + \lambda_{reg}\,\mathbb{1}_{c_t}\,\mathcal{L}_{reg}}{T_+},$$

where $T$ is the length of the input sequence, $\mathbb{1}_{c_t}$ is an indicator function indicating whether time step t lies within an action range (i.e., is a positive sample), $T_+$ is the total number of positive samples, the loss is applied to all levels of the output pyramid and averaged over all video samples during training, $\lambda_{reg}$ is a coefficient balancing the classification loss and the regression loss, and $\mathcal{L}_{reg}$ is a GIoU loss for distance regression.
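A sketch of this per-video loss computation follows; the per-time-step focal and GIoU terms are assumed to be precomputed elsewhere, and the value of the balancing coefficient is an assumption.

```python
import torch

def video_loss(cls_loss: torch.Tensor, reg_loss: torch.Tensor,
               is_action: torch.Tensor, lam: float = 1.0) -> torch.Tensor:
    """cls_loss, reg_loss: (T,) per-time-step focal / GIoU terms;
    is_action: (T,) indicator 1_{c_t} marking positive samples;
    lam: the balancing coefficient lambda_reg (value assumed).

    Implements L = sum_t (L_cls + lam * 1_{c_t} * L_reg) / T_+.
    """
    t_pos = is_action.sum().clamp(min=1)                 # T_+, guarded against 0
    total = cls_loss.sum() + lam * (reg_loss * is_action).sum()
    return total / t_pos
```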
In this embodiment, the pyramid features are obtained using 6 Transformer layers; each layer consists of alternating LSTM, local multi-head self-attention (MSA) and MLP blocks, LayerNorm (LN) is applied before each MSA or MLP, a residual connection is added after each block, the channel MLP has two linear layers with a GELU activation in between, and the downsampling operation is realized using a strided depthwise separable 1D convolution with a downsampling ratio of 2; the specific formulas are as follows:

$$\bar{Z}^{\ell} = \alpha^{\ell}\,\mathrm{MSA}(\mathrm{LN}(Z^{\ell-1})) + Z^{\ell-1},$$
$$Z^{\ell} = \downarrow\left(\bar{\alpha}^{\ell}\,\mathrm{MLP}(\mathrm{LN}(\bar{Z}^{\ell})) + \bar{Z}^{\ell}\right),$$

where $\alpha^{\ell}$ and $\bar{\alpha}^{\ell}$ are learnable per-channel scaling factors initialized to 0, and $\downarrow(\cdot)$ is the downsampling operator with the stated downsampling ratio.
Embodiment 2: the present invention further provides a video interaction action detection system based on random frame supplementing and attention, which comprises a feature extraction module for extracting global temporal information; a temporal self-attention module for modeling the global temporal information to obtain features containing multi-scale local information; a random frame supplementing data enhancement module for making the actions and boundaries of the original video clearer; a pyramid feature generation module for encoding the features containing multi-scale local information into a 6-layer feature pyramid through a multi-scale Transformer and combining an LSTM with the Transformer; and a classification module for inputting the pyramid features of each scale into different 1D convolutions to obtain localization and classification features.
Embodiment 3: in an embodiment of the present invention, there is further provided a computer-readable storage medium storing a computer program, wherein the above video interaction action detection method is implemented when the computer program is executed by a processor.
Embodiment 4: in an embodiment of the present invention, there is further provided a computing device comprising: at least one processor; and at least one memory storing a computer program which, when executed by the at least one processor, implements the above video interaction action detection method.
While the foregoing describes embodiments of the present invention with reference to the drawings, it is not intended to limit the scope of the invention; various modifications or variations that can be made by those skilled in the art without inventive work on the basis of the technical solutions of the present invention still fall within the protection scope of the invention.