CN116385945A - Video interaction action detection method and system based on random frame complement and attention - Google Patents

Video interaction action detection method and system based on random frame complement and attention

Info

Publication number
CN116385945A
CN116385945A (application CN202310657865.4A)
Authority
CN
China
Prior art keywords
attention
features
pyramid
random frame
video
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202310657865.4A
Other languages
Chinese (zh)
Other versions
CN116385945B (en)
Inventor
高文杰
高赞
周冕
赵一博
卓涛
李志慧
程志勇
李传森
刘冬冬
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong Zhonglian Audio Visual Information Technology Co ltd
Original Assignee
Shandong Zhonglian Audio Visual Information Technology Co ltd
Tianjin University of Technology
Shandong Institute of Artificial Intelligence
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong Zhonglian Audio Visual Information Technology Co ltd, Tianjin University of Technology, Shandong Institute of Artificial Intelligence filed Critical Shandong Zhonglian Audio Visual Information Technology Co ltd
Priority to CN202310657865.4A priority Critical patent/CN116385945B/en
Publication of CN116385945A publication Critical patent/CN116385945A/en
Application granted granted Critical
Publication of CN116385945B publication Critical patent/CN116385945B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/46Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • G06N3/0442Recurrent networks, e.g. Hopfield networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The invention belongs to the technical field of computer vision and pattern recognition, and in particular relates to a video interaction action detection method and system based on random frame complement and attention. The method comprises the following steps: (1) selection of a feature extraction network; (2) self-attention global information modeling; (3) random frame complement data enhancement; (4) generation of pyramid features; (5) boundary localization and classification. The invention aggregates global temporal information and multi-scale local temporal information simultaneously, and performs efficient action localization through the generated pyramid features. Random frame complement is used for data enhancement, and the combination of an LSTM with a Transformer addresses the problem that a single model performs differently on datasets of different sizes, yielding more accurate action localization and classification results.

Description

Video interaction action detection method and system based on random frame complement and attention
Technical Field
The invention belongs to the technical field of computer vision and pattern recognition, and in particular relates to a video interaction action detection method and system based on random frame complement and attention.
Background
In recent years, with the rapid development of deep learning, many scholars have proposed temporal action localization methods based on deep learning. Identifying action instances in time and recognizing their categories, i.e., temporal action localization (TAL), remains a challenging problem in video understanding. Significant progress has been made in developing deep models for TAL. Most previous work relies on action proposals [BMN] or anchor windows [GTAN], and develops convolutional neural networks [CDC, SSN], recurrent neural networks [SS-TAD], and graph neural networks [BC-GNN, G-TAD] for TAL. Despite steady progress on the main benchmarks, the accuracy of existing methods usually comes at the cost of modeling complexity, including increasingly complex proposal generation, anchor design, loss functions, network architectures, and output decoding processes. Meanwhile, because action boundaries in video are not clear, existing methods often suffer from inaccurate boundary prediction.
Previously proposed methods give some solutions to the temporal action localization problem, but several issues remain. Anchor-based methods require strong prior knowledge, and the number of anchors defined for each dataset differs, which affects the final result. Although actionness-guided methods can achieve good results, they are too computationally intensive. Anchor-free approaches may offer a good solution.
Disclosure of Invention
The invention aims to solve the problem of temporal action localization: previous temporal action localization methods either require strong prior knowledge about the dataset or incur a large computational cost. The invention provides a video interaction action detection method and system based on random frame complement and attention, to address the problems that temporal action localization methods require strong prior knowledge or have a large computational cost.
The technical scheme for solving the technical problems is as follows:
a video interaction detection method based on random frame complement and attention comprises the following steps:
step 10, selection of a feature extraction network
Selecting an I3D network pre-trained based on a Kinetics data set to extract features, taking 16 continuous frames as I3D input, extracting 1024-D features before the last full connection layer by using a sliding window with a step length of 4, and further connecting (2048-D) double-flow features as model input;
step 20, self-attention global information modeling
Modeling global time sequence information on the basis of the selection of the basic network in the step 10, and outputting an I3D network; the weighted Self-Attention Polarized Attention is used for searching the relation between frames and weighting, and more important frames can be searched and given higher weight through the Self-Attention-based weighting strategy;
the 1D volume is added before the transducer network, so that the training of the local context information and the stable visual transducer can be better combined, and the modeling of the global information is realized;
step 30 random frame complement data enhancement
At the output of the step 1 feature network, a video is divided intoT/kEach segment, randomly taking one frame from each segment, and the restk-1 frame is identical to the taken frame to form a new feature vector with a large variation, equivalent to the acceleration of the video, but with unchanged actual position of the motion;
calculating an mse loss by using the new feature vector passing through the backup and the original video feature vector, restraining the new feature vector and the original video feature vector, and making the new feature vector and the original video feature vector be pulled close to learn some information mutually so as to achieve the aim of data enhancement;
step 40, pyramid feature generation
On the basis of the step 20 network, the characteristics after passing through the multi-scale information aggregation module are encoded into a 6-layer characteristic pyramid through a multi-scale transducer, and LSTM and the transducer are combined and fused to provide supplementary history information and attention-based information representation provided by the LSTM and the transducer module, so that the model capacity is improved, the problem that the performances of a single model on data sets with different sizes are different can be solved, and the LSTM performs better on a small data set than the transducer, but the transducer performs very prominently after pre-training;
step 50, boundary localization and classification
After the pyramid features of 6 scales are obtained, the pyramid features of each scale are input into separate 1D convolutions to obtain localization and classification features; classification is then performed with the classification features and boundary regression with the localization features; the classification is constrained with a focal loss during training, and the regression is constrained with a GIoU loss $\mathcal{L}_{reg}$ during training.
Based on the above video interaction action detection method based on random frame complement and attention, the formulas in step 30 are as follows:
Original video feature vector: $X = \{x_1, x_2, \dots, x_T\}$;
$X$ is divided into $T/k$ segments, $X = \{X_1, X_2, \dots, X_{T/k}\}$, where each $X_i$ contains $k$ frames;
a frame is randomly taken from each segment and replicated $k$ times: $X' = \{\mathrm{repeat}(\mathrm{rand}(X_1), k), \dots, \mathrm{repeat}(\mathrm{rand}(X_{T/k}), k)\}$, where $\mathrm{rand}(\cdot)$ represents taking a random frame and $\mathrm{repeat}(\cdot, k)$ represents copying it $k$ times;
$\mathcal{L}_{mse} = \mathrm{MSE}\big(F(X), F(X')\big)$, where $F(X)$ and $F(X')$ are the new feature vectors obtained after $X$ and $X'$ pass through the backbone network, and $\mathrm{MSE}$ is the mean square loss function.
Based on the video interaction action detection method based on random frame complement and attention, the extracted features are operated on by the Channel-only branch and the Spatial-only branch of Polarized Self-Attention, where the Channel-only branch is defined as follows:
$A^{ch}(X) = F_{SG}\big[ W_{z}\big( \sigma_1(W_v(X)) \times F_{SM}(\sigma_2(W_q(X))) \big) \big]$
where $W_q$, $W_v$ and $W_z$ are 1×1 convolutional layers, $\sigma_1$ and $\sigma_2$ are reshape operators (the feature dimension is changed from C/2×H×W to C/2×HW), $F_{SM}$ is the Softmax operator, × is a matrix dot-product operation, $F_{SG}$ is the Sigmoid operator, and the number of internal channels between $W_q$, $W_v$ and $W_z$ is C/2; the output of the channel branch is $Z^{ch} = A^{ch}(X) \odot^{ch} X$, where $\odot^{ch}$ is a channel multiplication operator;
the Spatial-only branch is defined as follows:
$A^{sp}(X) = F_{SG}\big[ \sigma_3\big( F_{SM}(\sigma_1(F_{GP}(W_q(X)))) \times \sigma_2(W_v(X)) \big) \big]$
where $W_q$ and $W_v$ are standard 1×1 convolutions, $\sigma_1$, $\sigma_2$ and $\sigma_3$ are three reshape operators, $F_{SM}$ is the Softmax operator, $F_{GP}$ is a global pooling operator, and $F_{SG}$ is the Sigmoid operator; the output of the spatial branch is $Z^{sp} = A^{sp}(X) \odot^{sp} X$, where $\odot^{sp}$ is a spatial multiplication operator;
the outputs of the channel branch and the spatial branch are composed in parallel layout:
$\mathrm{PSA}_p(X) = Z^{ch} + Z^{sp}$
each video interaction action detection method based on random frame supplement and attentionA video loss is defined as follows:
Figure SMS_36
;
wherein the method comprises the steps of
Figure SMS_37
Is the length of the input sequence. />
Figure SMS_38
Is an indication function indicating whether the time step t is within the motion range, i.e. positive samples,/->
Figure SMS_39
Is the total number of positive samples, +.>
Figure SMS_40
Applied to all levels on the output pyramid and averaged over all video samples during training, +.>
Figure SMS_41
Is a coefficient of balance classification loss and regression loss, < ->
Figure SMS_42
One for distance regression>
Figure SMS_43
Based on the video interaction action detection method based on random frame complement and attention, the pyramid features are obtained with 6 Transformer layers; each layer consists of alternating LSTM, local multi-head self-attention (MSA) and MLP blocks, LayerNorm (LN) is applied before each MSA or MLP, and a residual connection is added after each block; the channel MLP has two linear layers with a GELU activation in between, and the downsampling operation is realized with a strided depth-separable 1D convolution, with a downsampling ratio of 2 for the model; the specific formulas are as follows:
$\bar{Z}^{\ell} = \alpha^{\ell}\,\mathrm{MSA}\big(\mathrm{LN}(Z^{\ell-1})\big) + Z^{\ell-1}$,
$Z^{\ell} = {\downarrow}\big(\bar{\alpha}^{\ell}\,\mathrm{MLP}\big(\mathrm{LN}(\bar{Z}^{\ell})\big) + \bar{Z}^{\ell}\big)$,
where $\alpha^{\ell}$ and $\bar{\alpha}^{\ell}$ are learnable per-channel scaling factors initialized to 0, and ${\downarrow}$ is the downsampling operator with ratio 2.
The embodiment of the invention also provides a video interaction action detection system based on random frame complement and attention, which comprises: a feature extraction module for extracting global temporal information; a temporal self-attention module for modeling the global temporal information to obtain features containing multi-scale local information; a random frame complement data enhancement module for making the actions and boundaries of the original video clear; a pyramid feature generation module for encoding the features containing multi-scale local information into a 6-layer feature pyramid through a multi-scale Transformer and combining the LSTM with the Transformer; and a classification module for inputting the pyramid features of each scale into different 1D convolutions to obtain localization and classification features.
In an embodiment of the present invention, there is also provided a computer-readable storage medium storing a computer program, where the video interaction detection method is implemented when the computer program is executed by a processor.
In an embodiment of the present invention, there is also provided a computing device including: at least one processor; at least one memory storing a computer program which, when executed by the at least one processor, implements the video interaction detection method.
The effects stated in this summary are merely effects of the embodiments, not all effects of the invention. The above technical solution has the following advantages or beneficial effects:
1) Modeling of global information is achieved by finding the more important frames and giving them higher weights through a self-attention mechanism.
2) Random frame complement is applied to the original video features, so that the original video varies more, achieving data enhancement.
3) Combining the LSTM and the Transformer improves model capacity and addresses the problem that a single model performs differently on datasets of different sizes.
Drawings
The accompanying drawings are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate the invention and together with the embodiments of the invention, serve to explain the invention.
Fig. 1 is a structural diagram of the present invention.
Detailed Description
In order to clearly illustrate the technical features of the present solution, the present invention will be described in detail below with reference to the following detailed description and the accompanying drawings.
Embodiment 1: as shown in fig. 1, the operation flow chart of the video interaction action detection method based on random frame complement and attention according to the present invention comprises the following steps:
step 10, selection of a feature extraction network
In the temporal action localization task, an excellent feature extractor must first be selected to obtain robust features, and because of the nature of the task, a feature extractor that can extract temporal information is needed. A two-stream I3D network is therefore adopted here for feature extraction. The input of the RGB stream is consecutive video frames, so spatial and temporal features can be extracted simultaneously; for the Flow stream, the input is consecutive optical flow frames, so temporal information can be further extracted and modeled. An I3D network pre-trained on the Kinetics dataset is selected to extract features: 16 consecutive frames are taken as the I3D input, 1024-D features are extracted before the last fully connected layer using a sliding window with a stride of 4, and the two-stream features are concatenated (2048-D) as the model input;
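As an illustration, a minimal sketch of this sliding-window two-stream feature extraction is given below; the `i3d_rgb` and `i3d_flow` callables are hypothetical stand-ins for pre-trained I3D backbones truncated before the last fully connected layer, and the frame-tensor layout is an assumption.

```python
import torch

def extract_two_stream_features(rgb, flow, i3d_rgb, i3d_flow,
                                clip_len=16, stride=4):
    """Sliding-window two-stream I3D feature extraction (sketch).

    rgb, flow: tensors of shape (T, C, H, W) holding video / optical-flow frames.
    i3d_rgb, i3d_flow: callables mapping a (1, C, clip_len, H, W) clip to a
        (1, 1024) feature taken before the last fully connected layer (assumed).
    Returns an (N, 2048) tensor, one concatenated feature per window.
    """
    feats = []
    T = rgb.shape[0]
    for start in range(0, T - clip_len + 1, stride):
        # (clip_len, C, H, W) -> (1, C, clip_len, H, W), the layout I3D expects
        rgb_clip = rgb[start:start + clip_len].permute(1, 0, 2, 3).unsqueeze(0)
        flow_clip = flow[start:start + clip_len].permute(1, 0, 2, 3).unsqueeze(0)
        f_rgb = i3d_rgb(rgb_clip)     # (1, 1024)
        f_flow = i3d_flow(flow_clip)  # (1, 1024)
        feats.append(torch.cat([f_rgb, f_flow], dim=-1))  # (1, 2048)
    return torch.cat(feats, dim=0)

if __name__ == "__main__":
    # Dummy backbones standing in for the pre-trained I3D networks.
    dummy = lambda clip: torch.zeros(1, 1024)
    rgb = torch.randn(64, 3, 224, 224)
    flow = torch.randn(64, 2, 224, 224)
    print(extract_two_stream_features(rgb, flow, dummy, dummy).shape)  # (13, 2048)
```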
step 20, self-attention global information modeling
Global temporal information is modeled on the output of the I3D network selected in step 10; Polarized Self-Attention is used to find the relations between frames and weight them, and this self-attention-based weighting strategy finds the more important frames and gives them higher weights;
A 1D convolution is added before the Transformer network, which better incorporates local context information, stabilizes the training of the vision Transformer, and realizes the modeling of global information;
step 30, random frame complement data enhancement
Untrimmed videos contain background of irrelevant activities, which makes action boundaries unclear; in order to enlarge the variation of the video and make the boundaries more obvious, random frame complement is proposed for data enhancement;
On the output of the step 10 feature network, a video is divided into T/k segments, one frame is randomly taken from each segment, and the remaining k-1 frames are replaced with copies of the taken frame, forming a new feature vector with larger variation, equivalent to speeding up the video while leaving the actual positions of the actions unchanged;
An MSE loss is computed between the new feature vector after the backbone and the original video feature vector, constraining the two to be pulled close and learn information from each other, so as to achieve the goal of data enhancement;
step 40, pyramid feature generation
On the basis of the step 20 network, the features output by the multi-scale information aggregation module are encoded into a 6-layer feature pyramid through a multi-scale Transformer, and the LSTM is combined with the Transformer to fuse the supplementary history information provided by the LSTM with the attention-based representation provided by the Transformer module; this improves model capacity and addresses the problem that a single model performs differently on datasets of different sizes, since the LSTM performs better than the Transformer on small datasets, while the Transformer performs very strongly after pre-training;
step 50, boundary localization and classification
After the pyramid features are obtained in step 40, the classification head examines every moment $t$ of all $L$ pyramid layers and predicts the action probability $p(c_t)$ at every moment $t$. This head is implemented with a lightweight 1D convolutional network attached to each pyramid layer, with parameters shared across all levels. The classification network is implemented with three layers of 1D convolution with kernel size 3, layer normalization (for the first two layers) and ReLU activation; a sigmoid function is attached to each output dimension to predict the probabilities of the C action categories. The regression head is similar to the classification head and examines every moment $t$ of all $L$ pyramid layers;
the difference is that the regression head predicts the distances to the onset and offset of the action $(d^{s}_t, d^{e}_t)$, and only when the current time step $t$ lies inside an action; each pyramid level is pre-assigned an output regression range, and the regression head adopts the same design as the classification network, i.e. a 1D convolutional network, but adds a ReLU at the end for distance estimation. The model output for each time $t$, $\hat{o}_t = \big(p(c_t), d^{s}_t, d^{e}_t\big)$, includes the action category probability $p(c_t)$ and the distances $d^{s}_t$ and $d^{e}_t$. The loss function likewise follows a very simple design, with only two terms: (1) $\mathcal{L}_{cls}$, a focal loss for the C-way classification; (2) $\mathcal{L}_{reg}$, a GIoU loss for distance regression;
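As an illustration, a minimal sketch of the shared lightweight 1D-convolutional heads is given below; the channel width, the use of GroupNorm(1, C) as a layer-normalization stand-in for (B, C, T) tensors, and the toy pyramid are assumptions, not values taken from the patent.

```python
import torch
import torch.nn as nn

class Head1D(nn.Module):
    """Lightweight 1D-conv head shared across pyramid levels (sketch).

    Three conv layers with kernel size 3: two with normalization and ReLU,
    then a final projection to `out_dim`, optionally followed by `final_act`.
    """
    def __init__(self, dim, out_dim, final_act=None):
        super().__init__()
        self.convs = nn.ModuleList([nn.Conv1d(dim, dim, 3, padding=1) for _ in range(2)])
        # GroupNorm(1, C) normalizes over channels and time jointly, a common
        # stand-in for layer normalization on (B, C, T) tensors.
        self.norms = nn.ModuleList([nn.GroupNorm(1, dim) for _ in range(2)])
        self.out = nn.Conv1d(dim, out_dim, 3, padding=1)
        self.final_act = final_act

    def forward(self, x):               # x: (B, C, T_l), features of one pyramid level
        for conv, norm in zip(self.convs, self.norms):
            x = torch.relu(norm(conv(x)))
        x = self.out(x)
        return self.final_act(x) if self.final_act is not None else x

# Classification head: sigmoid over C action classes at every time step.
# Regression head: ReLU so the predicted onset/offset distances stay non-negative.
num_classes, dim = 20, 512
cls_head = Head1D(dim, num_classes, final_act=torch.sigmoid)
reg_head = Head1D(dim, 2, final_act=torch.relu)

pyramid = [torch.randn(2, dim, 128 // (2 ** l)) for l in range(6)]  # toy 6-level pyramid
cls_out = [cls_head(f) for f in pyramid]   # each: (B, num_classes, T_l)
reg_out = [reg_head(f) for f in pyramid]   # each: (B, 2, T_l) -> (d_t^s, d_t^e)
```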
step 60, temporal action localization results
On the THUMOS14 dataset, video features are extracted using two-stream I3D pre-trained on Kinetics: 16 consecutive frames are taken as the I3D input, and 1024-D features are extracted before the last fully connected layer using a sliding window with a stride of 4; the two-stream features are further concatenated (2048-D) as the model input. mAP@0.3:0.1:0.7 is used to evaluate the model of the invention. The model is trained for 50 epochs with a linear warm-up of 5 epochs; the initial learning rate is 1e-4 with cosine learning rate decay; the mini-batch size is 2 and the weight decay is 1e-4; based on ablations, the window size of the local self-attention is 19; external classification scores from UntrimmedNet are also incorporated. For the ActivityNet1.3 dataset, feature extraction is also performed with two-stream I3D, but the stride of the sliding window is increased to 16; the extracted features are downsampled to a fixed length of 128 by linear interpolation; for evaluation, mAP@0.5:0.05:0.95 is used and the average mAP is reported; the model is trained for 15 epochs with a linear warm-up of 5 epochs; the learning rate is 1e-3, the mini-batch size is 16, and the weight decay is 1e-4; the window size of the local self-attention is 25. Furthermore, external classification results are also incorporated; similarly, the invention considers the pre-training approach from TSP and compares the model against the same set of baselines, including the closest competing single-stage models.
During testing, at inference time the complete sequence is fed into the model, since no position embeddings are used in the model. The model takes an input video $X$ and outputs $\big\{\big(p(c_t), d^{s}_t, d^{e}_t\big)\big\}$ at each time step $t$ of all pyramid layers. Each time step $t$ is further decoded into an action instance $\big(s_t, e_t, p(c_t)\big)$, where $s_t$ and $e_t$ are the onset and offset of the action and $p(c_t)$ is the action confidence score. The resulting action candidates are further processed with Soft-NMS to remove highly overlapping instances, yielding the final action outputs.
A comparison of the experimental results of the present invention with other methods on the THUMOS14 dataset and the ActivityNet1.3 dataset is given in the following table:
[Table: comparison of average mAP with other methods on THUMOS14 and ActivityNet1.3]
the invention achieves the best effect on the thumb 14 dataset, 68.3 when calculating the average mAP of tIoU from 0.3 to 0.7, and 36.18 when calculating the average mAP of tIoU from 0.5 to 0.95 is still a good effect on the ActivityNet1.3 dataset, although the invention does not achieve the best effect, but achieves more than the vast majority of the effects.
In this embodiment, the extracted features are operated on by the Channel-only branch and the Spatial-only branch of Polarized Self-Attention. The Channel-only branch is defined as follows:
$A^{ch}(X) = F_{SG}\big[ W_{z}\big( \sigma_1(W_v(X)) \times F_{SM}(\sigma_2(W_q(X))) \big) \big]$
where $W_q$, $W_v$ and $W_z$ are 1×1 convolutional layers, $\sigma_1$ and $\sigma_2$ are reshape operators (the feature dimension is changed from C/2×H×W to C/2×HW), $F_{SM}$ is the Softmax operator, × is a matrix dot-product operation, $F_{SG}$ is the Sigmoid operator, and the number of internal channels between $W_q$, $W_v$ and $W_z$ is C/2; the output of the channel branch is $Z^{ch} = A^{ch}(X) \odot^{ch} X$, where $\odot^{ch}$ is a channel multiplication operator. The principle is as follows: the input features X are first transformed into Q and V with one-dimensional convolutions with kernel size 1, where the channel dimension of Q is fully compressed, while the channel dimension of V is kept at a relatively high level (i.e. C/2); because the channel dimension of Q is compressed, as described above, the information needs to be enhanced by HDR, so the information of Q is enhanced with Softmax; the two are then matrix-multiplied, followed by a one-dimensional convolution with kernel size 1 and LN to raise the channel dimension from C/2 back to C; finally, the Sigmoid function keeps all parameters between 0 and 1.
The Spatial-only branch is defined as follows:
$A^{sp}(X) = F_{SG}\big[ \sigma_3\big( F_{SM}(\sigma_1(F_{GP}(W_q(X)))) \times \sigma_2(W_v(X)) \big) \big]$
where $W_q$ and $W_v$ are standard 1×1 convolutions, $\sigma_1$, $\sigma_2$ and $\sigma_3$ are three reshape operators, $F_{SM}$ is the Softmax operator, $F_{GP}$ is a global pooling operator, and $F_{SG}$ is the Sigmoid operator; the output of the spatial branch is $Z^{sp} = A^{sp}(X) \odot^{sp} X$, where $\odot^{sp}$ is a spatial multiplication operator. It can be seen that, similarly to the Channel-only branch, the input features are first converted to Q and V using a one-dimensional convolution with kernel size 1; for the Q features, global pooling is additionally used to compress the time dimension to a size of 1, while the time dimension of the V features is kept at a larger size; since the time dimension of Q is compressed, Softmax is used to enhance the Q information; the two are then matrix-multiplied, followed by a reshape and a Sigmoid so that all parameters remain between 0 and 1.
The outputs of the channel branch and the spatial branch are composed in parallel layout:
$\mathrm{PSA}_p(X) = Z^{ch} + Z^{sp}$
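As an illustration, a minimal sketch of a 1D adaptation of the two polarized attention branches composed in parallel is given below; the layer sizes and the exact placement of the reshapes are assumptions for temporal (B, C, T) features, not the patent's reference implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PolarizedSelfAttention1D(nn.Module):
    """Channel-only + spatial(temporal)-only polarized attention for (B, C, T) features (sketch)."""
    def __init__(self, dim):
        super().__init__()
        inner = dim // 2
        # channel-only branch
        self.ch_q = nn.Conv1d(dim, 1, 1)
        self.ch_v = nn.Conv1d(dim, inner, 1)
        self.ch_z = nn.Conv1d(inner, dim, 1)
        self.ch_ln = nn.LayerNorm(dim)
        # spatial(temporal)-only branch
        self.sp_q = nn.Conv1d(dim, inner, 1)
        self.sp_v = nn.Conv1d(dim, inner, 1)

    def forward(self, x):                       # x: (B, C, T)
        # ---- channel-only branch ----
        q = F.softmax(self.ch_q(x), dim=-1)     # (B, 1, T), softmax over time
        v = self.ch_v(x)                        # (B, C/2, T)
        z = torch.bmm(v, q.transpose(1, 2))     # (B, C/2, 1)
        z = self.ch_z(z)                        # (B, C, 1), channel dim raised back to C
        ch_attn = torch.sigmoid(self.ch_ln(z.transpose(1, 2)).transpose(1, 2))
        z_ch = ch_attn * x                      # channel-wise re-weighting
        # ---- spatial(temporal)-only branch ----
        q = self.sp_q(x).mean(dim=-1, keepdim=True)   # global pooling over time: (B, C/2, 1)
        q = F.softmax(q, dim=1)                       # softmax over channels
        v = self.sp_v(x)                              # (B, C/2, T)
        attn = torch.bmm(q.transpose(1, 2), v)        # (B, 1, T)
        z_sp = torch.sigmoid(attn) * x                # temporal re-weighting
        # parallel composition
        return z_ch + z_sp

psa = PolarizedSelfAttention1D(512)
print(psa(torch.randn(2, 512, 96)).shape)  # torch.Size([2, 512, 96])
```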
Passing the enhanced features containing global information through a shallow convolutional neural network helps to better merge local context information and stabilize the training of the vision Transformer on time-series data.
In this embodiment, the formulas in step 30 are as follows:
Original video feature vector: $X = \{x_1, x_2, \dots, x_T\}$;
$X$ is divided into $T/k$ segments, $X = \{X_1, X_2, \dots, X_{T/k}\}$, where each $X_i$ contains $k$ frames;
a frame is randomly taken from each segment and replicated $k$ times: $X' = \{\mathrm{repeat}(\mathrm{rand}(X_1), k), \dots, \mathrm{repeat}(\mathrm{rand}(X_{T/k}), k)\}$, where $\mathrm{rand}(\cdot)$ represents taking a random frame and $\mathrm{repeat}(\cdot, k)$ represents copying it $k$ times;
$\mathcal{L}_{mse} = \mathrm{MSE}\big(F(X), F(X')\big)$, where $F(X)$ and $F(X')$ are the new feature vectors obtained after $X$ and $X'$ pass through the backbone network, and $\mathrm{MSE}$ is the mean square loss function.
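As an illustration, a minimal sketch of the random frame complement augmentation and its MSE constraint is given below; the segment length k, the (T, D) feature layout, and the identity backbone in the toy example are assumptions.

```python
import torch
import torch.nn.functional as F

def random_frame_complement(x, k):
    """Random frame complement over a (T, D) feature sequence (sketch).

    The sequence is split into T//k segments; in each segment one frame is
    picked at random and repeated k times, so the clip "speeds up" while the
    temporal positions of the actions stay unchanged.
    """
    t, d = x.shape
    assert t % k == 0, "for simplicity this sketch assumes T is divisible by k"
    segments = x.view(t // k, k, d)                  # (T/k, k, D)
    idx = torch.randint(0, k, (t // k,))             # one random frame index per segment
    picked = segments[torch.arange(t // k), idx]     # (T/k, D)
    return picked.repeat_interleave(k, dim=0)        # (T, D), each picked frame copied k times

def frame_complement_mse(backbone, x, k=4):
    """MSE between backbone features of the original and the augmented sequence."""
    x_aug = random_frame_complement(x, k)
    return F.mse_loss(backbone(x_aug), backbone(x))

# Toy usage with an identity backbone standing in for the real network.
x = torch.randn(128, 2048)
print(frame_complement_mse(lambda f: f, x).item())
```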
In this embodiment, each video loss is defined as follows:
$\mathcal{L} = \frac{1}{T_{+}} \sum_{t=1}^{T} \big( \mathcal{L}_{cls} + \lambda_{reg}\, \mathbb{1}_{c_t}\, \mathcal{L}_{reg} \big)$;
where $T$ is the length of the input sequence, $\mathbb{1}_{c_t}$ is an indicator function indicating whether time step $t$ lies within an action, i.e. is a positive sample, and $T_{+}$ is the total number of positive samples; the loss is applied to all levels of the output pyramid and averaged over all video samples during training; $\lambda_{reg}$ is a coefficient balancing the classification loss and the regression loss, $\mathcal{L}_{reg}$ is a GIoU loss for distance regression, and $\mathcal{L}_{cls}$ is a focal loss for classification.
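As an illustration, a minimal sketch of this per-video loss is given below; the sigmoid focal loss from torchvision and the 1D GIoU form are used as commonly published formulations, and λ_reg, the hyperparameters, and the tensor layouts are assumptions.

```python
import torch
from torchvision.ops import sigmoid_focal_loss

def giou_1d(pred, target, eps=1e-8):
    """1D GIoU loss between predicted and target (d_start, d_end) offsets (both >= 0)."""
    inter = torch.min(pred[:, 0], target[:, 0]) + torch.min(pred[:, 1], target[:, 1])
    union = pred.sum(dim=1) + target.sum(dim=1) - inter
    enclose = torch.max(pred[:, 0], target[:, 0]) + torch.max(pred[:, 1], target[:, 1])
    giou = inter / (union + eps) - (enclose - union) / (enclose + eps)
    return 1.0 - giou

def video_loss(cls_logits, reg_pred, cls_target, reg_target, pos_mask, lambda_reg=1.0):
    """Per-video loss: focal classification + GIoU regression on positives, normalized by T+.

    cls_logits, cls_target: (T, C); reg_pred, reg_target: (T, 2) distances
    (d_t^s, d_t^e); pos_mask: (T,) bool, the indicator 1_{c_t}.
    """
    t_plus = pos_mask.sum().clamp(min=1).float()        # total number of positive samples T+
    loss_cls = sigmoid_focal_loss(cls_logits, cls_target, reduction="sum")
    loss_reg = giou_1d(reg_pred[pos_mask], reg_target[pos_mask]).sum()
    return (loss_cls + lambda_reg * loss_reg) / t_plus

# Toy usage
T, C = 64, 20
pos = torch.zeros(T, dtype=torch.bool); pos[10:20] = True
loss = video_loss(torch.randn(T, C), torch.rand(T, 2), torch.zeros(T, C),
                  torch.rand(T, 2), pos)
print(loss.item())
```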
In this embodiment, the pyramid features are obtained with 6 Transformer layers; each layer consists of alternating LSTM, local multi-head self-attention (MSA) and MLP blocks, LayerNorm (LN) is applied before each MSA or MLP, and a residual connection is added after each block; the channel MLP has two linear layers with a GELU activation in between, and the downsampling operation is realized with a strided depth-separable 1D convolution, with a downsampling ratio of 2 for the model; the specific formulas are as follows:
$\bar{Z}^{\ell} = \alpha^{\ell}\,\mathrm{MSA}\big(\mathrm{LN}(Z^{\ell-1})\big) + Z^{\ell-1}$,
$Z^{\ell} = {\downarrow}\big(\bar{\alpha}^{\ell}\,\mathrm{MLP}\big(\mathrm{LN}(\bar{Z}^{\ell})\big) + \bar{Z}^{\ell}\big)$,
where $\alpha^{\ell}$ and $\bar{\alpha}^{\ell}$ are learnable per-channel scaling factors initialized to 0, and ${\downarrow}$ is the downsampling operator with ratio 2.
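As an illustration, a minimal sketch of one such pyramid layer is given below; full self-attention is used as a stand-in for the windowed local self-attention, and the LSTM placement, channel width, and head count are assumptions.

```python
import torch
import torch.nn as nn

class PyramidLayer(nn.Module):
    """LSTM + pre-norm multi-head self-attention + MLP, then 2x downsampling (sketch)."""
    def __init__(self, dim, heads=4, mlp_ratio=4):
        super().__init__()
        self.lstm = nn.LSTM(dim, dim, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)  # stand-in for local MSA
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, dim * mlp_ratio), nn.GELU(),
                                 nn.Linear(dim * mlp_ratio, dim))
        # learnable per-channel scaling factors initialized to 0
        self.alpha1 = nn.Parameter(torch.zeros(dim))
        self.alpha2 = nn.Parameter(torch.zeros(dim))
        # 2x downsampling via a strided depthwise (depth-separable) 1D convolution
        self.down = nn.Conv1d(dim, dim, kernel_size=3, stride=2, padding=1, groups=dim)

    def forward(self, z):                      # z: (B, T, D)
        z = self.lstm(z)[0] + z                # supplementary history information from the LSTM
        h = self.norm1(z)
        z = self.alpha1 * self.attn(h, h, h)[0] + z
        z = self.alpha2 * self.mlp(self.norm2(z)) + z
        return self.down(z.transpose(1, 2)).transpose(1, 2)   # (B, T/2, D)

# Build a 6-level feature pyramid.
layers = nn.ModuleList([PyramidLayer(256) for _ in range(6)])
z, pyramid = torch.randn(2, 128, 256), []
for layer in layers:
    z = layer(z)
    pyramid.append(z)
print([p.shape[1] for p in pyramid])  # [64, 32, 16, 8, 4, 2]
```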
Embodiment 2: an embodiment of the present invention further provides a video interaction action detection system based on random frame complement and attention, which comprises: a feature extraction module for extracting global temporal information; a temporal self-attention module for modeling the global temporal information to obtain features containing multi-scale local information; a random frame complement data enhancement module for making the actions and boundaries of the original video clear; a pyramid feature generation module for encoding the features containing multi-scale local information into a 6-layer feature pyramid through a multi-scale Transformer and combining the LSTM with the Transformer; and a classification module for inputting the pyramid features of each scale into different 1D convolutions to obtain localization and classification features.
Embodiment 3 in an embodiment of the present invention, there is further provided a computer-readable storage medium storing a computer program, wherein the video interaction detection method is implemented when the computer program is executed by a processor.
Embodiment 4 in an embodiment of the present invention, there is further provided a computing device including: at least one processor; at least one memory storing a computer program which, when executed by the at least one processor, implements the video interaction detection method.
While the foregoing describes embodiments of the present invention with reference to the drawings, it does not limit the scope of the invention; it is apparent that various modifications or variations can be made by those skilled in the art on the basis of the technical solutions of the present invention without inventive effort.

Claims (8)

1. A video interaction action detection method based on random frame complement and attention, characterized in that the method comprises the following steps:
step 10, selection of a feature extraction network
selecting an I3D network pre-trained on the Kinetics dataset to extract features;
step 20, self-attention global information modeling
modeling global temporal information on the output of the I3D network selected in step 10; using Polarized Self-Attention to find the relations between frames and weight them;
adding a 1D convolution before the Transformer network;
step 30, random frame complement data enhancement
on the output of the step 10 feature network, dividing a video into several segments, randomly taking one frame from each segment, and making the other frames identical to the taken frame, thereby forming a new feature vector with larger variation;
computing an MSE loss between the new feature vector after the backbone and the original video feature vector;
step 40, pyramid feature generation
on the basis of the step 20 network, encoding the features output by the multi-scale information aggregation module into a 6-layer feature pyramid through a multi-scale Transformer, and combining the LSTM with the Transformer;
step 50, boundary localization and classification
after the pyramid features of 6 scales are obtained, inputting the pyramid features of each scale into separate 1D convolutions to obtain localization and classification features, then performing classification with the classification features and boundary regression with the localization features; the classification is constrained with a focal loss during training, and the regression is constrained with a GIoU loss $\mathcal{L}_{reg}$ during training.
2. The video interaction action detection method based on random frame complement and attention according to claim 1, characterized in that the formulas in step 30 are as follows:
Original video feature vector: $X = \{x_1, x_2, \dots, x_T\}$;
$X$ is divided into $T/k$ segments, $X = \{X_1, X_2, \dots, X_{T/k}\}$, where each $X_i$ contains $k$ frames;
a frame is randomly taken from each segment and replicated $k$ times: $X' = \{\mathrm{repeat}(\mathrm{rand}(X_1), k), \dots, \mathrm{repeat}(\mathrm{rand}(X_{T/k}), k)\}$, where $\mathrm{rand}(\cdot)$ represents taking a random frame and $\mathrm{repeat}(\cdot, k)$ represents copying it $k$ times;
$\mathcal{L}_{mse} = \mathrm{MSE}\big(F(X), F(X')\big)$, where $F(X)$ and $F(X')$ are the new feature vectors obtained after $X$ and $X'$ pass through the backbone network, and $\mathrm{MSE}$ is the mean square loss function.
3. The video interaction action detection method based on random frame complement and attention according to claim 1, characterized in that the extracted features are operated on by the Channel-only branch and the Spatial-only branch of Polarized Self-Attention, the Channel-only branch being defined as follows:
$A^{ch}(X) = F_{SG}\big[ W_{z}\big( \sigma_1(W_v(X)) \times F_{SM}(\sigma_2(W_q(X))) \big) \big]$
where $W_q$, $W_v$ and $W_z$ are 1×1 convolutional layers, $\sigma_1$ and $\sigma_2$ are reshape operators (the feature dimension is changed from C/2×H×W to C/2×HW), $F_{SM}$ is the Softmax operator, × is a matrix dot-product operation, $F_{SG}$ is the Sigmoid operator, and the number of internal channels between $W_q$, $W_v$ and $W_z$ is C/2; the output of the channel branch is $Z^{ch} = A^{ch}(X) \odot^{ch} X$, where $\odot^{ch}$ is a channel multiplication operator;
the Spatial-only branch is defined as follows:
$A^{sp}(X) = F_{SG}\big[ \sigma_3\big( F_{SM}(\sigma_1(F_{GP}(W_q(X)))) \times \sigma_2(W_v(X)) \big) \big]$
where $W_q$ and $W_v$ are standard 1×1 convolutions, $\sigma_1$, $\sigma_2$ and $\sigma_3$ are three reshape operators, $F_{SM}$ is the Softmax operator, $F_{GP}$ is a global pooling operator, and $F_{SG}$ is the Sigmoid operator; the output of the spatial branch is $Z^{sp} = A^{sp}(X) \odot^{sp} X$, where $\odot^{sp}$ is a spatial multiplication operator;
the outputs of the channel branch and the spatial branch are composed in parallel layout:
$\mathrm{PSA}_p(X) = Z^{ch} + Z^{sp}$
4. The video interaction action detection method based on random frame complement and attention according to claim 1, characterized in that each video loss is defined as follows:
$\mathcal{L} = \frac{1}{T_{+}} \sum_{t=1}^{T} \big( \mathcal{L}_{cls} + \lambda_{reg}\, \mathbb{1}_{c_t}\, \mathcal{L}_{reg} \big)$
where $T$ is the length of the input sequence, $\mathbb{1}_{c_t}$ is an indicator function indicating whether time step $t$ lies within an action, i.e. is a positive sample, and $T_{+}$ is the total number of positive samples; the loss is applied to all levels of the output pyramid and averaged over all video samples during training; $\lambda_{reg}$ is a coefficient balancing the classification loss and the regression loss, $\mathcal{L}_{reg}$ is a GIoU loss for distance regression, and $\mathcal{L}_{cls}$ is a focal loss for classification.
5. The video interaction action detection method based on random frame complement and attention according to claim 1, characterized in that the pyramid features are obtained with 6 Transformer layers; each layer consists of alternating LSTM, local multi-head self-attention (MSA) and MLP blocks, LayerNorm (LN) is applied before each MSA or MLP, a residual connection is added after each block, the channel MLP has two linear layers with a GELU activation in between, and the downsampling operation is realized with a strided depth-separable 1D convolution with a downsampling ratio of 2 for the model; the specific formulas are as follows:
$\bar{Z}^{\ell} = \alpha^{\ell}\,\mathrm{MSA}\big(\mathrm{LN}(Z^{\ell-1})\big) + Z^{\ell-1}$,
$Z^{\ell} = {\downarrow}\big(\bar{\alpha}^{\ell}\,\mathrm{MLP}\big(\mathrm{LN}(\bar{Z}^{\ell})\big) + \bar{Z}^{\ell}\big)$,
where $\alpha^{\ell}$ and $\bar{\alpha}^{\ell}$ are learnable per-channel scaling factors initialized to 0, and ${\downarrow}$ is the downsampling operator with ratio 2.
6. A video interaction action detection system based on random frame complement and attention, characterized in that it comprises: a feature extraction module for extracting global temporal information; a temporal self-attention module for modeling the global temporal information to obtain features containing multi-scale local information; a random frame complement data enhancement module for making the actions and boundaries of the original video clear; a pyramid feature generation module for encoding the features containing multi-scale local information into a 6-layer feature pyramid through a multi-scale Transformer and combining the LSTM with the Transformer; and a classification module for inputting the pyramid features of each scale into different 1D convolutions to obtain localization and classification features.
7. A computer readable storage medium storing a computer program, wherein the video interaction detection method of any of claims 1 to 5 is implemented when the computer program is executed by a processor.
8. A computing device, comprising: at least one processor; at least one memory storing a computer program which, when executed by the at least one processor, implements the video interaction detection method of any of claims 1 to 5.
CN202310657865.4A 2023-06-06 2023-06-06 Video interaction action detection method and system based on random frame complement and attention Active CN116385945B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310657865.4A CN116385945B (en) 2023-06-06 2023-06-06 Video interaction action detection method and system based on random frame complement and attention

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310657865.4A CN116385945B (en) 2023-06-06 2023-06-06 Video interaction action detection method and system based on random frame complement and attention

Publications (2)

Publication Number Publication Date
CN116385945A true CN116385945A (en) 2023-07-04
CN116385945B CN116385945B (en) 2023-08-25

Family

ID=86981016

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310657865.4A Active CN116385945B (en) 2023-06-06 2023-06-06 Video interaction action detection method and system based on random frame complement and attention

Country Status (1)

Country Link
CN (1) CN116385945B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117138455A (en) * 2023-10-31 2023-12-01 克拉玛依曜诚石油科技有限公司 Automatic liquid filtering system and method
CN117354525A (en) * 2023-12-05 2024-01-05 深圳市旭景数字技术有限公司 Video coding method and system for realizing efficient storage and transmission of digital media

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109784269A (en) * 2019-01-11 2019-05-21 中国石油大学(华东) One kind is based on the united human action detection of space-time and localization method
CN111259795A (en) * 2020-01-16 2020-06-09 河南职业技术学院 Human behavior recognition method based on multi-stream deep learning
WO2021051545A1 (en) * 2019-09-16 2021-03-25 平安科技(深圳)有限公司 Behavior identification model-based fall-down action determining method and apparatus, computer device, and storage medium
CN114973416A (en) * 2022-06-07 2022-08-30 哈尔滨理工大学 Sign language recognition algorithm based on three-dimensional convolution network
US20220296334A1 (en) * 2021-03-22 2022-09-22 Verb Surgical Inc. Deep-learning-based real-time remaining surgery duration (rsd) estimation

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109784269A (en) * 2019-01-11 2019-05-21 中国石油大学(华东) One kind is based on the united human action detection of space-time and localization method
WO2021051545A1 (en) * 2019-09-16 2021-03-25 平安科技(深圳)有限公司 Behavior identification model-based fall-down action determining method and apparatus, computer device, and storage medium
CN111259795A (en) * 2020-01-16 2020-06-09 河南职业技术学院 Human behavior recognition method based on multi-stream deep learning
US20220296334A1 (en) * 2021-03-22 2022-09-22 Verb Surgical Inc. Deep-learning-based real-time remaining surgery duration (rsd) estimation
CN114973416A (en) * 2022-06-07 2022-08-30 哈尔滨理工大学 Sign language recognition algorithm based on three-dimensional convolution network

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
MINGYANG QIAO et al.: "Action Recognition based on Video Spatio-Temporal Transformer", 2022 IEEE International Conference on Artificial Intelligence and Computer Applications (ICAICA), pages 1-4
LIU Wenlong et al.: "Human fall action detection algorithm based on multi-feature fusion and Transformer" (in Chinese), Applied Science and Technology, pages 49-62

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117138455A (en) * 2023-10-31 2023-12-01 克拉玛依曜诚石油科技有限公司 Automatic liquid filtering system and method
CN117138455B (en) * 2023-10-31 2024-02-27 克拉玛依曜诚石油科技有限公司 Automatic liquid filtering system and method
CN117354525A (en) * 2023-12-05 2024-01-05 深圳市旭景数字技术有限公司 Video coding method and system for realizing efficient storage and transmission of digital media
CN117354525B (en) * 2023-12-05 2024-03-15 深圳市旭景数字技术有限公司 Video coding method and system for realizing efficient storage and transmission of digital media

Also Published As

Publication number Publication date
CN116385945B (en) 2023-08-25

Similar Documents

Publication Publication Date Title
CN116385945B (en) Video interaction action detection method and system based on random frame complement and attention
CN110263912B (en) Image question-answering method based on multi-target association depth reasoning
CN109685819B (en) Three-dimensional medical image segmentation method based on feature enhancement
CN111950455B (en) Motion imagery electroencephalogram characteristic identification method based on LFFCNN-GRU algorithm model
CN115896817A (en) Production method and system of fluorine-nitrogen mixed gas
CN110175551B (en) Sign language recognition method
CN110276784B (en) Correlation filtering moving target tracking method based on memory mechanism and convolution characteristics
CN115222998B (en) Image classification method
CN113204675B (en) Cross-modal video time retrieval method based on cross-modal object inference network
CN117407772B (en) Method and system for classifying training multi-element time sequence data by supervising and comparing learning network model
CN113837229B (en) Knowledge-driven text-to-image generation method
CN112149645A (en) Human body posture key point identification method based on generation of confrontation learning and graph neural network
CN112668543B (en) Isolated word sign language recognition method based on hand model perception
CN113850182A (en) Action identification method based on DAMR-3 DNet
Yanmin et al. Research on ear recognition based on SSD_MobileNet_v1 network
Dastbaravardeh et al. Channel Attention‐Based Approach with Autoencoder Network for Human Action Recognition in Low‐Resolution Frames
Ni et al. High-order generalized orderless pooling networks for synthetic-aperture radar scene classification
CN115858799A (en) Knowledge representation learning method integrating ordered relationship path and entity description information
Ma et al. Convolutional transformer network for fine-grained action recognition
Zeng et al. Expression Recognition Based on Multi-Scale Adaptive Parallel Integration Network
CN117688974B (en) Knowledge graph-based generation type large model modeling method, system and equipment
WO2024093466A1 (en) Person image re-identification method based on autonomous model structure evolution
CN116012388B (en) Three-dimensional medical image segmentation method and imaging method for acute ischemic cerebral apoplexy
Wang et al. Sparsenet: Deep convolutional network with sparse connections between blocks
Chan et al. Human Action Recognition Based on Spatial-Temporal Attention

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
TR01 Transfer of patent right
TR01 Transfer of patent right

Effective date of registration: 20231225

Address after: Building A6-211, Hanyu Jingu, No. 7000 Jingshi Road, Jinan Area, China (Shandong) Pilot Free Trade Zone, Jinan City, Shandong Province, 250000

Patentee after: Shandong Zhonglian Audio-Visual Information Technology Co.,Ltd.

Address before: No.19 Keyuan Road, Lixia District, Jinan City, Shandong Province

Patentee before: Shandong Institute of artificial intelligence

Patentee before: TIANJIN University OF TECHNOLOGY

Patentee before: Shandong Zhonglian Audio-Visual Information Technology Co.,Ltd.