CN116385945B - Video interaction action detection method and system based on random frame complement and attention - Google Patents

Video interaction action detection method and system based on random frame complement and attention

Info

Publication number
CN116385945B
CN116385945B CN202310657865.4A
Authority
CN
China
Prior art keywords
attention
feature
video
features
pyramid
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310657865.4A
Other languages
Chinese (zh)
Other versions
CN116385945A (en)
Inventor
高文杰
高赞
周冕
赵一博
卓涛
李志慧
程志勇
李传森
刘冬冬
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong Zhonglian Audio Visual Information Technology Co ltd
Original Assignee
Shandong Zhonglian Audio Visual Information Technology Co ltd
Tianjin University of Technology
Shandong Institute of Artificial Intelligence
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong Zhonglian Audio Visual Information Technology Co ltd, Tianjin University of Technology, Shandong Institute of Artificial Intelligence filed Critical Shandong Zhonglian Audio Visual Information Technology Co ltd
Priority to CN202310657865.4A priority Critical patent/CN116385945B/en
Publication of CN116385945A publication Critical patent/CN116385945A/en
Application granted granted Critical
Publication of CN116385945B publication Critical patent/CN116385945B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G06N3/0442 Recurrent networks, e.g. Hopfield networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/0464 Convolutional networks [CNN, ConvNet]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774 Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The invention belongs to the technical field of computer vision and pattern recognition, and particularly relates to a video interaction action detection method and system based on random frame complement and attention. The method comprises the following steps: (1) selection of a feature extraction network; (2) self-attention global information modeling; (3) random frame complement data enhancement; (4) generation of pyramid features; (5) boundary localization and classification. The invention can simultaneously aggregate global temporal information and multi-scale local temporal information, and performs efficient action localization through the generated pyramid features. The method uses random-frame-based frame complement for data enhancement and, through the combination of LSTM and Transformer, solves the problem that a single model performs differently on datasets of different sizes, so as to obtain more accurate action localization and classification results.

Description

Video interaction action detection method and system based on random frame complement and attention
Technical Field
The invention belongs to the technical field of computer vision and pattern recognition, and particularly relates to a video interaction action detection method and system based on random frame complement and attention.
Background
In recent years, with the rapid development of deep learning, many scholars have proposed temporal action localization methods based on deep learning techniques. Identifying action instances in time and recognizing their categories, i.e., temporal action localization (TAL), remains a challenging problem in video understanding. Significant progress has been made in the development of deep models for TAL. Most previous work relies on action proposals [BMN] or anchor windows [GTAN], and develops convolutional neural networks [CDC, SSN], recurrent neural networks [SS-TAD] and graph neural networks [BC-GNN, G-TAD] for TAL. Despite steady progress on the main benchmarks, the accuracy of existing methods usually comes at the cost of modeling complexity, including increasingly complex proposal generation, anchor design, loss functions, network architectures and output decoding processes. Meanwhile, because action boundaries in videos are not clear, existing methods often suffer from inaccurate boundary prediction.
Some solutions to the temporal action localization problem have been given by previously proposed methods, but problems remain. Anchor-based methods require strong prior knowledge, and the number of anchors defined for each dataset also differs, which affects the final result. Although action-guided methods can achieve good results, they are too computationally intensive. An anchor-free approach can therefore be a good solution.
Disclosure of Invention
The invention aims to solve the problem of temporal action localization: previous temporal action localization methods either require strong prior knowledge about the dataset or involve a large amount of computation. The invention provides a video interaction action detection method and system based on random frame complement and attention to address the problem that temporal action localization methods need strong prior knowledge or are computationally expensive.
The technical scheme for solving the technical problems is as follows:
a video interaction detection method based on random frame complement and attention comprises the following steps:
step 10, selection of a feature extraction network
Selecting an I3D network pre-trained on the Kinetics dataset to extract features: 16 consecutive frames are taken as the I3D input, 1024-D features are extracted before the last fully connected layer using a sliding window with a step size of 4, and the two-stream features are further concatenated (2048-D) as the model input;
step 20, self-attention global information modeling
Modeling global temporal information on the output of the I3D network selected as the base network in step 10; Polarized Self-Attention is used to find the relations between frames and weight them, and through this self-attention-based weighting strategy more important frames can be found and given higher weight;
the 1D volume is added before the transducer network, so that the training of the local context information and the stable visual transducer can be better combined, and the modeling of the global information is realized;
step 30 random frame complement data enhancement
At the output of the feature network of step 10, a video is divided into T/k segments; one frame is randomly taken from each segment, and the remaining k-1 frames are replaced by copies of the taken frame, forming a new feature vector with larger variation, which is equivalent to accelerating the video while the actual position of the action remains unchanged;
calculating an mse loss by using the new feature vector passing through the backup and the original video feature vector, restraining the new feature vector and the original video feature vector, and making the new feature vector and the original video feature vector be pulled close to learn some information mutually so as to achieve the aim of data enhancement;
step 40, pyramid feature generation
On the basis of the network of step 20, the features passing through the multi-scale information aggregation module are encoded into a 6-layer feature pyramid by a multi-scale Transformer, and LSTM and Transformer are combined so that the complementary history information provided by the LSTM and the attention-based representation provided by the Transformer module are fused; this improves the model capacity and solves the problem that a single model performs differently on datasets of different sizes, since LSTM performs better than the Transformer on small datasets while the Transformer performs very prominently after pre-training;
step 50. Boundary locating and classifying
After obtaining the pyramid features of 6 scales, the pyramid features of each scale are respectively input into different 1D convolutions to obtain localization and classification features; classification is then performed with the classification features and boundary regression is performed with the localization features; a focal loss is used as the constraint during classification training, and a GIoU loss is used as the constraint during regression training.
Based on the above video interaction detection method based on random frame complement and attention, the formulas in step 30 are as follows:

Original video feature vector: $X = \{x_1, x_2, \ldots, x_T\} \in \mathbb{R}^{T \times D}$;

X is divided into T/k segments: $X = \{S_1, S_2, \ldots, S_{T/k}\}$, where each $S_i$ contains k frames;

A frame is randomly taken from each segment and replicated k times: $S_i' = \mathrm{copy}_k(\mathrm{rand}(S_i))$, where $\mathrm{rand}(\cdot)$ denotes taking a random frame and $\mathrm{copy}_k(\cdot)$ denotes copying it k times;

$L_{mse} = \alpha \cdot \mathrm{MSE}\big(f(X), f(X')\big)$, where $f(X)$ and $f(X')$ denote the new feature vectors obtained after the vectors X and X' pass through the backbone network, and MSE is the mean square loss function.
Based on the video interaction detection method based on random frame complement and attention, the extracted features are operated on by a Channel-only branch and a Spatial-only branch in Polarized Self-Attention, wherein the Channel-only branch is defined as follows:

$A^{ch}(X) = F_{SG}\big[ W_z\big( \sigma_1(W_v(X)) \times F_{SM}(\sigma_2(W_q(X))) \big) \big]$

where $W_q$, $W_v$ and $W_z$ are 1×1 convolutional layers, $\sigma_1$ and $\sigma_2$ are reshape operators, that is, the feature dimension is changed from C/2×H×W to C/2×HW, $F_{SM}$ is the Softmax operator, × is the matrix dot-product operation, $F_{SG}$ is the Sigmoid function, and the number of internal channels between $W_q$, $W_v$ and $W_z$ is C/2; the output of the channel branch is $Z^{ch} = A^{ch}(X) \odot^{ch} X$, where $\odot^{ch}$ is the channel multiplication operator;

the Spatial-only branch is defined as follows:

$A^{sp}(X) = F_{SG}\big[ \sigma_3\big( F_{SM}\big(\sigma_1(F_{GP}(W_q(X)))\big) \times \sigma_2(W_v(X)) \big) \big]$

where $W_q$ and $W_v$ are standard 1×1 convolutions, $\sigma_1$, $\sigma_2$ and $\sigma_3$ are three reshape operators, $F_{SM}$ is the Softmax operator, $F_{GP}$ is the global pooling operator, and $F_{SG}$ is the Sigmoid function; the output of the spatial branch is $Z^{sp} = A^{sp}(X) \odot^{sp} X$, where $\odot^{sp}$ is the spatial multiplication operator;

the outputs of the channel branch and the spatial branch are composed in parallel layout: $\mathrm{PSA}_p(X) = Z^{ch} + Z^{sp}$.

Based on the above video interaction action detection method based on random frame complement and attention, each video loss is defined as follows:

$L = \sum_t \frac{ L_{cls} + \lambda_{reg}\, \mathbb{1}_{c_t}\, L_{reg} }{ T_+ }$

where T is the length of the input sequence, $\mathbb{1}_{c_t}$ is an indicator function indicating whether time step t is within an action, i.e., a positive sample, $T_+$ is the total number of positive samples, the loss is applied to all levels of the output pyramid and averaged over all video samples during training, $\lambda_{reg}$ is a coefficient balancing the classification loss and the regression loss, $L_{cls}$ is the focal loss for classification, and $L_{reg}$ is the GIoU loss for distance regression.
Based on the video interaction detection method based on random frame complement and attention, the pyramid features are obtained using 6 Transformer layers, each consisting of alternating LSTM, local multi-head self-attention (MSA) and MLP blocks; LayerNorm (LN) is applied before each MSA or MLP, and a residual connection is added after each block; the channel MLP has two linear layers with a GELU activation in between; the downsampling operation is implemented with a strided depth-separable 1D convolution, with the model using a downsampling ratio of 2. The specific formulas are as follows:

$\bar{Z}^{\ell} = \alpha^{\ell}\,\mathrm{MSA}\big(\mathrm{LN}(\mathrm{LSTM}(Z^{\ell-1}))\big)\!\downarrow +\ Z^{\ell-1}\!\downarrow$
$Z^{\ell} = \bar{\alpha}^{\ell}\,\mathrm{MLP}\big(\mathrm{LN}(\bar{Z}^{\ell})\big) + \bar{Z}^{\ell}$

where $\alpha^{\ell}$ and $\bar{\alpha}^{\ell}$ are learnable per-channel scaling factors initialized to 0, and $\downarrow$ denotes the downsampling operation whose downsampling ratio is 2.
The embodiment of the invention also provides a video interaction detection system based on random frame complement and attention, which comprises a feature extraction module for extracting global temporal information; a temporal self-attention module for modeling the global temporal information to obtain features containing multi-scale local information; a random frame complement data enhancement module for making the actions and boundaries of the original video clear; a pyramid feature generation module for encoding the features of the multi-scale local information into a 6-layer feature pyramid through a multi-scale Transformer and combining LSTM with the Transformer; and a classification module for inputting the pyramid features of each scale into different 1D convolutions to obtain localization and classification features.
In an embodiment of the present invention, there is also provided a computer-readable storage medium storing a computer program, where the video interaction detection method is implemented when the computer program is executed by a processor.
In an embodiment of the present invention, there is also provided a computing device including: at least one processor; at least one memory storing a computer program which, when executed by the at least one processor, implements the video interaction detection method.
The effects provided in the summary of the invention are merely effects of embodiments, not all effects of the invention, and the above technical solution has the following advantages or beneficial effects:
1) Modeling of global information can be achieved by looking for more important frames and giving higher weight through a self-attention mechanism.
2) The original video features are subjected to random frame complement, so that the video exhibits larger variation, thereby achieving data enhancement.
3) By combining LSTM and Transformer, the model capability is improved and the problem that a single model performs differently on datasets of different sizes is solved.
Drawings
The accompanying drawings are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification; they illustrate the invention and, together with the embodiments of the invention, serve to explain the invention.
Fig. 1 is a structural diagram of the present invention.
Detailed Description
In order to clearly illustrate the technical features of the present solution, the present invention will be described in detail below with reference to the following detailed description and the accompanying drawings.
Embodiment 1: as shown in Fig. 1, the operation flow chart of the video interaction detection method based on random frame complement and attention according to the present invention includes the following steps:
step 10, selection of a feature extraction network
In the temporal action localization task, an excellent feature extractor needs to be selected first to obtain robust features; due to the characteristics of this task, a feature extractor capable of capturing temporal information is needed. A two-stream I3D network is therefore adopted here for feature extraction. The input of the RGB stream is consecutive video frames, so temporal and spatial characteristics can be extracted at the same time; for the Flow stream, the input is consecutive optical-flow frames, so temporal information can be further extracted and modeled. An I3D network pre-trained on the Kinetics dataset is selected to extract features: 16 consecutive frames are taken as the I3D input, 1024-D features are extracted before the last fully connected layer using a sliding window with a step size of 4, and the two-stream features are further concatenated (2048-D) as the model input;
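For illustration, the following minimal PyTorch sketch shows the sliding-window extraction and two-stream concatenation described above; the callables `i3d_rgb` and `i3d_flow` are hypothetical stand-ins for the Kinetics-pretrained I3D trunks, and the tensor layout is an assumption.

```python
import torch

def extract_clip_features(rgb_frames, flow_frames, i3d_rgb, i3d_flow,
                          clip_len=16, stride=4):
    """Slide a 16-frame window with stride 4 over the video and concatenate
    the 1024-D RGB and 1024-D Flow features into 2048-D clip features.

    rgb_frames / flow_frames: tensors of shape (T, C, H, W).
    i3d_rgb / i3d_flow: hypothetical pretrained I3D trunks that map a
    (1, 16, C, H, W) clip to a (1, 1024) feature vector.
    """
    feats = []
    total = rgb_frames.shape[0]
    for start in range(0, total - clip_len + 1, stride):
        rgb_clip = rgb_frames[start:start + clip_len]     # (16, C, H, W)
        flow_clip = flow_frames[start:start + clip_len]
        with torch.no_grad():
            f_rgb = i3d_rgb(rgb_clip.unsqueeze(0))         # (1, 1024)
            f_flow = i3d_flow(flow_clip.unsqueeze(0))      # (1, 1024)
        feats.append(torch.cat([f_rgb, f_flow], dim=-1))   # (1, 2048)
    return torch.cat(feats, dim=0)                         # (T', 2048)
```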
step 20, self-attention global information modeling
Modeling global temporal information on the output of the I3D network selected as the base network in step 10; Polarized Self-Attention is used to find the relations between frames and weight them, and through this self-attention-based weighting strategy more important frames can be found and given higher weight;
the 1D volume is added before the transducer network, so that the training of the local context information and the stable visual transducer can be better combined, and the modeling of the global information is realized;
step 30 random frame complement data enhancement
Untrimmed videos contain background of irrelevant activities, which makes the action boundaries unclear; in order to expand the variation of the video and make the boundaries more obvious, random frame complement is proposed for data enhancement;

At the output of the feature network of step 10, a video is divided into T/k segments; one frame is randomly taken from each segment, and the remaining k-1 frames are replaced by copies of the taken frame, forming a new feature vector with larger variation, which is equivalent to accelerating the video while the actual position of the action remains unchanged;

Calculating an MSE loss between the new feature vector after passing through the backbone and the original video feature vector, constraining the two so that they are pulled close and learn some information from each other, thereby achieving the goal of data enhancement;
step 40, pyramid feature generation
On the basis of the network of step 20, the features passing through the multi-scale information aggregation module are encoded into a 6-layer feature pyramid by a multi-scale Transformer, and LSTM and Transformer are combined so that the complementary history information provided by the LSTM and the attention-based representation provided by the Transformer module are fused; this improves the model capacity and solves the problem that a single model performs differently on datasets of different sizes, since LSTM performs better than the Transformer on small datasets while the Transformer performs very prominently after pre-training;
step 50. Boundary locating and classifying
After obtaining the pyramid features in step 40, the classification head examines each time step t of every pyramid layer and predicts the action probability p(c_t) at each time t. This head is implemented using a lightweight 1D convolutional network attached to each pyramid layer, with parameters shared across all levels; the classification network uses three 1D convolution layers with kernel size 3, layer normalization (for the first two layers) and ReLU activation; a sigmoid function is attached to each output dimension to predict the probability of C action categories. The regression head is similar to the classification head and examines each time step t of all L pyramid layers.

The difference is that the regression head predicts the distances to the onset and offset of the action, $(d_t^s, d_t^e)$, only when the current time step t lies within an action; each pyramid level is pre-assigned an output regression range. The regression head adopts the same design as the classification network, using a 1D convolutional network, but adds a ReLU at the end for distance estimation. For each time t, the model output includes the action category probability p(c_t) and the boundary distances $(d_t^s, d_t^e)$. The loss function likewise follows a very simple design, with only two terms: (1) $L_{cls}$, a focal loss for C-way classification; (2) $L_{reg}$, a GIoU loss for distance regression;
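A minimal sketch of such a shared prediction head follows; GroupNorm with a single group stands in for the layer normalization of the first two convolutions, and the channel width and class count are assumptions.

```python
import torch.nn as nn

class PredictionHead(nn.Module):
    """Shared-weight 1D conv head applied to every pyramid level: three conv
    layers of kernel size 3 with normalization on the first two; the
    classification variant ends in a sigmoid over the class dimension, the
    regression variant in a ReLU over the two boundary distances."""
    def __init__(self, dim=512, out_dim=20, regression=False):
        super().__init__()
        layers = []
        for _ in range(2):
            layers += [nn.Conv1d(dim, dim, 3, padding=1),
                       nn.GroupNorm(1, dim),       # layer-norm-like over channels
                       nn.ReLU(inplace=True)]
        layers += [nn.Conv1d(dim, out_dim, 3, padding=1)]
        self.net = nn.Sequential(*layers)
        self.out_act = nn.ReLU(inplace=True) if regression else nn.Sigmoid()

    def forward(self, pyramid):                    # list of (B, C, T_l) tensors
        return [self.out_act(self.net(z)) for z in pyramid]
```

Under these assumptions, a classification head would be instantiated as `PredictionHead(out_dim=C)` and a regression head as `PredictionHead(out_dim=2, regression=True)`, both applied to every pyramid level with shared weights.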
step 60 time sequence action positioning effect
On the THUMOS14 dataset, video features are extracted using the two-stream I3D pre-trained on Kinetics; 16 consecutive frames are taken as the I3D input, and 1024-D features are extracted before the last fully connected layer using a sliding window with a step size of 4; the two-stream features are further concatenated (2048-D) as the model input; mAP@0.3:0.1:0.7 is used to evaluate the model of the invention. 50 epochs are trained, with a linear warm-up of 5 epochs; the initial learning rate is 1e-4, with cosine learning-rate decay; the mini-batch size is 2 and the weight decay is 1e-4; based on ablation, the window size of local self-attention is 19; external classification scores from UntrimmedNet are also incorporated. For the ActivityNet1.3 dataset, feature extraction is likewise performed with the two-stream I3D, but the step size of the sliding window is increased to 16; the extracted features are downsampled to a fixed length of 128 by linear interpolation; for evaluation, mAP@0.5:0.05:0.95 is used and the average mAP is reported; the model is trained for 15 epochs, with a linear warm-up of 5 epochs; the learning rate is 1e-3, the mini-batch size is 16, and the weight decay is 1e-4; the window size of local self-attention is 25; furthermore, external classification results are combined; similarly, the invention adopts the pre-training approach from TSP and compares the model with the same set of baselines, including the closest competing single-stage models.
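As an illustration of the optimization schedule quoted above (50 epochs, 5 warm-up epochs, initial learning rate 1e-4, cosine decay, weight decay 1e-4 for THUMOS14), a sketch follows; the choice of AdamW as the optimizer is an assumption not stated in the text.

```python
import math
import torch

def make_optimizer_and_scheduler(model, epochs=50, warmup_epochs=5,
                                 base_lr=1e-4, weight_decay=1e-4):
    """Linear warm-up for the first `warmup_epochs`, then cosine decay to 0,
    stepping once per epoch (AdamW is an assumed optimizer choice)."""
    opt = torch.optim.AdamW(model.parameters(), lr=base_lr,
                            weight_decay=weight_decay)

    def lr_lambda(epoch):
        if epoch < warmup_epochs:
            return (epoch + 1) / warmup_epochs
        progress = (epoch - warmup_epochs) / max(1, epochs - warmup_epochs)
        return 0.5 * (1.0 + math.cos(math.pi * progress))

    sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda)
    return opt, sched
```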
In the test procedure, at inference time the complete sequence is input into the model, since no position embeddings are used in the model. The model takes the input video X and outputs $\{(p(c_t), d_t^s, d_t^e)\}$ at every time step t of all pyramid layers. Each time step t is further decoded into an action instance $(s_t, e_t, p(c_t))$, where $s_t = t - d_t^s$ and $e_t = t + d_t^e$ are the start and end of the action and $p(c_t)$ is the action confidence score. The resulting action candidates are further processed with Soft-NMS to remove highly overlapping instances, yielding the final action output.
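A minimal Gaussian Soft-NMS sketch over decoded (start, end) candidates is given below; the decay parameter sigma and the score threshold are assumed values, since the text only states that Soft-NMS is used to suppress highly overlapping instances.

```python
import math

def soft_nms(segments, scores, sigma=0.5, score_thresh=0.001):
    """Gaussian Soft-NMS: repeatedly keep the highest-scoring segment and
    decay the scores of the remaining segments by exp(-iou^2 / sigma)
    instead of discarding them outright."""
    segments = [list(s) for s in segments]
    scores = list(scores)
    keep = []
    while segments:
        best = max(range(len(scores)), key=scores.__getitem__)
        s, e = segments.pop(best)
        sc = scores.pop(best)
        if sc < score_thresh:
            break
        keep.append((s, e, sc))
        for i, (s2, e2) in enumerate(segments):
            inter = max(0.0, min(e, e2) - max(s, s2))
            union = (e - s) + (e2 - s2) - inter
            iou = inter / union if union > 0 else 0.0
            scores[i] *= math.exp(-(iou ** 2) / sigma)
    return keep
```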
The experimental results of the present invention are compared with other methods on the THUMOS14 and ActivityNet1.3 datasets as follows:

The invention achieves the best result on the THUMOS14 dataset, with an average mAP of 68.3 when averaging over tIoU thresholds from 0.3 to 0.7. On the ActivityNet1.3 dataset, the average mAP of 36.18 over tIoU thresholds from 0.5 to 0.95 is still a good result; although the invention does not achieve the best result there, it exceeds the vast majority of existing methods.
In this embodiment, the extracted features are operated on by the Channel-only branch and the Spatial-only branch in Polarized Self-Attention. The Channel-only branch is defined as follows:

$A^{ch}(X) = F_{SG}\big[ W_z\big( \sigma_1(W_v(X)) \times F_{SM}(\sigma_2(W_q(X))) \big) \big]$

where $W_q$, $W_v$ and $W_z$ are 1×1 convolutional layers, $\sigma_1$ and $\sigma_2$ are reshape operators (the feature dimension is changed from C/2×H×W to C/2×HW), $F_{SM}$ is the Softmax operator, × is the matrix dot-product operation, $F_{SG}$ is the Sigmoid function, and the number of internal channels between $W_q$, $W_v$ and $W_z$ is C/2; the output of the channel branch is $Z^{ch} = A^{ch}(X) \odot^{ch} X$, where $\odot^{ch}$ is the channel multiplication operator. The principle is as follows: the input features X are first transformed into Q and V with a one-dimensional convolution with kernel size 1, where the channel dimension of Q is fully compressed while the channel dimension of V remains at a relatively high level (i.e., C/2); because the channel dimension of Q is compressed, its information needs to be enhanced by HDR, so the information of Q is enhanced with Softmax; Q and V are then matrix multiplied, a one-dimensional convolution with kernel size 1 followed by LayerNorm raises the channel dimension from C/2 to C, and finally the Sigmoid function keeps all parameters between 0 and 1.

The Spatial-only branch is defined as follows:

$A^{sp}(X) = F_{SG}\big[ \sigma_3\big( F_{SM}\big(\sigma_1(F_{GP}(W_q(X)))\big) \times \sigma_2(W_v(X)) \big) \big]$

where $W_q$ and $W_v$ are standard 1×1 convolutions, $\sigma_1$, $\sigma_2$ and $\sigma_3$ are three reshape operators, $F_{SM}$ is the Softmax operator, $F_{GP}$ is the global pooling operator, and $F_{SG}$ is the Sigmoid function; the output of the spatial branch is $Z^{sp} = A^{sp}(X) \odot^{sp} X$, where $\odot^{sp}$ is the spatial multiplication operator. Similar to the Channel-only branch, the input features are first converted to Q and V using a one-dimensional convolution with kernel size 1; for the Q features, global pooling is also used to compress the time dimension to a size of 1, while the time dimension of the V features is maintained at a larger level; since the time dimension of Q is compressed, Softmax is used to enhance the information of Q; Q and V are then matrix multiplied, followed by reshape and Sigmoid so that all parameters remain between 0 and 1.

The outputs of the channel branch and the spatial branch are composed in parallel layout: $\mathrm{PSA}_p(X) = Z^{ch} + Z^{sp}$.
passing the enhanced features containing global information through a shallow convolutional neural network is helpful for better merging local context information and training of stable visual transducers for time series data.
In this embodiment, the formulas in step 30 are as follows:

Original video feature vector: $X = \{x_1, x_2, \ldots, x_T\} \in \mathbb{R}^{T \times D}$;

X is divided into T/k segments: $X = \{S_1, S_2, \ldots, S_{T/k}\}$, where each $S_i$ contains k frames;

A frame is randomly taken from each segment and replicated k times: $S_i' = \mathrm{copy}_k(\mathrm{rand}(S_i))$, where $\mathrm{rand}(\cdot)$ denotes taking a random frame and $\mathrm{copy}_k(\cdot)$ denotes copying it k times;

$L_{mse} = \alpha \cdot \mathrm{MSE}\big(f(X), f(X')\big)$, where $f(X)$ and $f(X')$ denote the new feature vectors obtained after the vectors X and X' pass through the backbone network, and MSE is the mean square loss function.
In this embodiment, each video loss is defined as follows:

$L = \sum_t \frac{ L_{cls} + \lambda_{reg}\, \mathbb{1}_{c_t}\, L_{reg} }{ T_+ }$

where T is the length of the input sequence, $\mathbb{1}_{c_t}$ is an indicator function indicating whether time step t is within the action range, i.e., a positive sample, $T_+$ is the total number of positive samples, the loss is applied to all levels of the output pyramid and averaged over all video samples during training, $\lambda_{reg}$ is a coefficient balancing the classification loss and the regression loss, $L_{cls}$ is the focal loss for classification, and $L_{reg}$ is the GIoU loss for distance regression.
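A hedged sketch of this per-video loss is given below: a sigmoid focal loss over C classes at every time step plus a 1D GIoU term on the (start, end) distances of positive time steps, normalized by the number of positives; the focal-loss hyper-parameters and the positive-sample test are assumptions.

```python
import torch
import torch.nn.functional as F

def video_loss(cls_logits, reg_pred, labels, gt_dist,
               lambda_reg=1.0, alpha=0.25, gamma=2.0):
    """cls_logits: (T, C) raw scores; labels: (T, C) multi-hot targets;
    reg_pred, gt_dist: (T, 2) distances to action start/end. A time step is
    treated as positive when both ground-truth distances are > 0."""
    # sigmoid focal loss for classification
    ce = F.binary_cross_entropy_with_logits(cls_logits, labels, reduction="none")
    p_t = torch.exp(-ce)
    alpha_t = alpha * labels + (1 - alpha) * (1 - labels)
    l_cls = (alpha_t * (1 - p_t) ** gamma * ce).sum()

    # 1D GIoU on positive time steps: both segments contain t, so the
    # enclosing interval equals the union and GIoU reduces to IoU
    pos = (gt_dist > 0).all(dim=-1)
    t_pos = max(int(pos.sum()), 1)
    if pos.any():
        p, g = reg_pred[pos], gt_dist[pos]
        inter = torch.min(p, g).sum(-1)
        union = torch.max(p, g).sum(-1)
        giou = inter / union.clamp(min=1e-6)
        l_reg = (1.0 - giou).sum()
    else:
        l_reg = cls_logits.sum() * 0.0
    return (l_cls + lambda_reg * l_reg) / t_pos
```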
In this embodiment, the pyramid features are obtained using 6 Transformer layers, each consisting of alternating LSTM, local multi-head self-attention (MSA) and MLP blocks; LayerNorm (LN) is applied before each MSA or MLP, and a residual connection is added after each block; the channel MLP has two linear layers with a GELU activation in between; the downsampling operation uses a strided depth-separable 1D convolution, with the model using a downsampling ratio of 2. The specific formulas are as follows:

$\bar{Z}^{\ell} = \alpha^{\ell}\,\mathrm{MSA}\big(\mathrm{LN}(\mathrm{LSTM}(Z^{\ell-1}))\big)\!\downarrow +\ Z^{\ell-1}\!\downarrow$
$Z^{\ell} = \bar{\alpha}^{\ell}\,\mathrm{MLP}\big(\mathrm{LN}(\bar{Z}^{\ell})\big) + \bar{Z}^{\ell}$

where $\alpha^{\ell}$ and $\bar{\alpha}^{\ell}$ are learnable per-channel scaling factors initialized to 0, and $\downarrow$ denotes the downsampling operation whose downsampling ratio is 2.
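A minimal sketch of one such pyramid level is shown below; full multi-head attention stands in for the windowed local self-attention, the LSTM placement inside the attention branch follows the formula above, and the channel width, head count and MLP ratio are assumptions.

```python
import torch
import torch.nn as nn

class PyramidTransformerBlock(nn.Module):
    """One pyramid level: LSTM -> self-attention -> MLP, with LayerNorm
    before attention/MLP, residual connections, per-channel scales
    initialized to 0, and a stride-2 depthwise 1D conv as downsampling."""
    def __init__(self, dim=512, heads=4, mlp_ratio=4):
        super().__init__()
        self.lstm = nn.LSTM(dim, dim, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio), nn.GELU(),
            nn.Linear(dim * mlp_ratio, dim))
        self.alpha1 = nn.Parameter(torch.zeros(dim))   # per-channel scale, init 0
        self.alpha2 = nn.Parameter(torch.zeros(dim))
        self.down = nn.Conv1d(dim, dim, kernel_size=3, stride=2,
                              padding=1, groups=dim)   # depthwise, ratio 2

    def downsample(self, x):                           # x: (B, T, C)
        return self.down(x.transpose(1, 2)).transpose(1, 2)

    def forward(self, x):                              # x: (B, T, C)
        h, _ = self.lstm(x)
        a, _ = self.attn(self.norm1(h), self.norm1(h), self.norm1(h))
        x = self.downsample(x) + self.alpha1 * self.downsample(a)
        return x + self.alpha2 * self.mlp(self.norm2(x))
```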
Embodiment 2: the present invention further provides a video interaction detection system based on random frame complement and attention, which includes a feature extraction module for extracting global temporal information; a temporal self-attention module for modeling the global temporal information to obtain features containing multi-scale local information; a random frame complement data enhancement module for making the actions and boundaries of the original video clear; a pyramid feature generation module for encoding the features of the multi-scale local information into a 6-layer feature pyramid through a multi-scale Transformer and combining LSTM with the Transformer; and a classification module for inputting the pyramid features of each scale into different 1D convolutions to obtain localization and classification features.
Embodiment 3: in an embodiment of the present invention, there is further provided a computer-readable storage medium storing a computer program, wherein the video interaction detection method is implemented when the computer program is executed by a processor.
Embodiment 4: in an embodiment of the present invention, there is further provided a computing device comprising: at least one processor; and at least one memory storing a computer program which, when executed by the at least one processor, implements the video interaction detection method.
While the foregoing describes specific embodiments of the present invention with reference to the drawings, this is not intended to limit the scope of protection of the invention; it is apparent that various modifications or variations can be made by those skilled in the art on the basis of the technical solutions of the present invention without inventive effort.

Claims (7)

1. A video interaction detection method based on random frame complement and attention is characterized in that: the method comprises the following steps:
step 10, selection of a feature extraction network
Selecting an I3D network pre-trained based on a Kinetics data set to extract features;
step 20, self-attention global information modeling
Modeling global temporal information on the output of the I3D network selected as the base network in step 10; searching for the relations between frames using Polarized Self-Attention and weighting them;
Adding a 1D convolution before the Transformer network;
step 30 random frame complement data enhancement
On the output of the feature network of step 10, a video is divided into a plurality of segments, a frame is randomly taken from each segment, and the other frames are replaced by the taken frame, so that a new feature vector with larger variation is formed;
Calculating an MSE loss between the new feature vector after passing through the backbone and the original video feature vector;
the formulas in step 30 are as follows:

original video feature vector: $X = \{x_1, x_2, \ldots, x_T\} \in \mathbb{R}^{T \times D}$, where T represents the length of the video feature sequence and D represents the feature dimension;

X is divided into T/k segments: $X = \{S_1, S_2, \ldots, S_{T/k}\}$, where each $S_i$ contains k frames and i represents the i-th video feature segment;

a frame is randomly taken from each segment and replicated k times: $S_i' = \mathrm{copy}_k(\mathrm{rand}(S_i))$, where $\mathrm{rand}(\cdot)$ represents taking a random frame and $\mathrm{copy}_k(\cdot)$ represents copying it k times;

$L_{mse} = \alpha \cdot \mathrm{MSE}\big(f(X), f(X')\big)$, where $f(X)$ and $f(X')$ represent the new feature vectors obtained after the vectors X and X' pass through the backbone network, MSE is the mean square loss function, and α represents an adjustment factor, typically 1;
step 40, pyramid feature generation
On the basis of the network of step 20, the features after passing through the multi-scale information aggregation module are encoded into a 6-layer feature pyramid through a multi-scale Transformer, and LSTM is combined with the Transformer;
step 50. Boundary locating and classifying
After obtaining the pyramid features of 6 scales, the pyramid features of each scale are respectively input into different 1D convolutions to obtain localization and classification features; classification is then performed with the classification features and boundary regression is performed with the localization features; a focal loss is used as the constraint during classification training, and a GIoU loss is used as the constraint during regression training.
2. The method for detecting video interaction based on random frame complement and attention according to claim 1, characterized in that: the extracted features are operated on by a Channel-only branch and a Spatial-only branch in Polarized Self-Attention, the Channel-only branch being defined as follows:

$A^{ch}(X) = F_{SG}\big[ W_z\big( \sigma_1(W_v(X)) \times F_{SM}(\sigma_2(W_q(X))) \big) \big]$

wherein $W_q$, $W_v$ and $W_z$ are 1×1 convolutional layers, $\sigma_1$ and $\sigma_2$ are reshape operators, that is, the feature dimension is changed from C/2×H×W to C/2×HW, C denotes the channel dimension, H denotes the height of the picture, W denotes the width of the picture, $F_{SM}$ is the Softmax operator, × is the matrix dot-product operation, the number of internal channels between $W_q$, $W_v$ and $W_z$ is C/2, and the output of the channel branch is $Z^{ch} = A^{ch}(X) \odot^{ch} X$, wherein $\odot^{ch}$ is the channel multiplication operator;

the Spatial-only branch is defined as follows:

$A^{sp}(X) = F_{SG}\big[ \sigma_3\big( F_{SM}\big(\sigma_1(F_{GP}(W_q(X)))\big) \times \sigma_2(W_v(X)) \big) \big]$

wherein $F_{SG}$ denotes the Sigmoid function, $W_q$ and $W_v$ are standard 1×1 convolutions, $\sigma_1$, $\sigma_2$ and $\sigma_3$ are three reshape operators, $F_{SM}$ is the Softmax operator, $F_{GP}$ is the global pooling operator, and the output of the spatial branch is $Z^{sp} = A^{sp}(X) \odot^{sp} X$, wherein $\odot^{sp}$ is the spatial multiplication operator;

the outputs of the channel branch and the spatial branch are composed in parallel layout: $\mathrm{PSA}_p(X) = Z^{ch} + Z^{sp}$.
3. the method for detecting video interaction based on random frame complement and attention according to claim 1, wherein the method comprises the following steps: each video loss is defined as follows:
wherein the method comprises the steps ofIs the length of the input sequence, < > is->Is an indication function indicating whether the time step t is within the motion range, i.e. positive samples,/->Is the total number of positive samples, +.>Applied to all levels on the output pyramid and averaged over all video samples during training, +.>Is a coefficient of balance classification loss and regression loss, < ->One for distance regression>Lcls is denoted as classification loss.
4. The method for detecting video interaction based on random frame complement and attention according to claim 1, characterized in that: the pyramid features are obtained using 6 Transformer layers, each consisting of alternating LSTM, local multi-head self-attention (MSA) and MLP blocks, with LayerNorm (LN) applied before each MSA or MLP and a residual connection added after each block; the channel MLP has two linear layers with a GELU activation in between; the downsampling operation uses a strided depth-separable 1D convolution with a downsampling ratio of 2; the specific formulas are as follows:

$\bar{Z}^{\ell} = \alpha^{\ell}\,\mathrm{MSA}\big(\mathrm{LN}(\mathrm{LSTM}(Z^{\ell-1}))\big)\!\downarrow +\ Z^{\ell-1}\!\downarrow$
$Z^{\ell} = \bar{\alpha}^{\ell}\,\mathrm{MLP}\big(\mathrm{LN}(\bar{Z}^{\ell})\big) + \bar{Z}^{\ell}$

wherein $\alpha^{\ell}$ and $\bar{\alpha}^{\ell}$ are learnable per-channel scaling factors initialized to 0, $\downarrow$ denotes the downsampling operation with ratio 2, $T^{\ell}$ denotes the time series length of layer $\ell$, and $T^{\ell-1}$ denotes the time series length of layer $\ell-1$, with $T^{\ell} = T^{\ell-1}/2$.
5. A video interaction detection system based on random frame complement and attention is characterized in that:
the device comprises a feature extraction module, a feature extraction module and a feature extraction module, wherein the feature extraction module is used for extracting global time sequence information;
the time sequence self-attention module is used for modeling global time sequence information to obtain characteristics containing multi-scale local information;
the random frame supplementing data enhancement module is used for enabling the actions and boundaries of the original video to be clear;
the pyramid feature generation module is used for encoding the features of the multi-scale local information into a 6-layer feature pyramid through a multi-scale transducer and combining LSTM with the transducer;
the classification module is used for inputting pyramid features of each scale into different 1D convolutions to obtain positioning and classifying features;
the formula used in the random frame complement data enhancement module is as follows:
original video feature vector:
t represents the length of the video feature sequence, and D represents the feature dimension;
handleXDivided into t/k segments:
each of which isContaining k frames, i representing the ith video feature; a frame is randomly taken from each segment, and replicated k times,
representing a random frame fetch,/->Representative copykPerforming secondary operation;
representative vector->And->New eigenvectors after passing back bone network,/->Mean square loss function>Represents an adjustment factor, typically 1.
6. A computer readable storage medium storing a computer program, wherein the video interaction detection method of any of claims 1 to 4 is implemented when the computer program is executed by a processor.
7. A computing device, comprising: at least one processor; at least one memory storing a computer program which, when executed by the at least one processor, implements the video interaction detection method of any of claims 1 to 4.
CN202310657865.4A 2023-06-06 2023-06-06 Video interaction action detection method and system based on random frame complement and attention Active CN116385945B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310657865.4A CN116385945B (en) 2023-06-06 2023-06-06 Video interaction action detection method and system based on random frame complement and attention

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310657865.4A CN116385945B (en) 2023-06-06 2023-06-06 Video interaction action detection method and system based on random frame complement and attention

Publications (2)

Publication Number Publication Date
CN116385945A CN116385945A (en) 2023-07-04
CN116385945B true CN116385945B (en) 2023-08-25

Family

ID=86981016

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310657865.4A Active CN116385945B (en) 2023-06-06 2023-06-06 Video interaction action detection method and system based on random frame complement and attention

Country Status (1)

Country Link
CN (1) CN116385945B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117138455B (en) * 2023-10-31 2024-02-27 克拉玛依曜诚石油科技有限公司 Automatic liquid filtering system and method
CN117354525B (en) * 2023-12-05 2024-03-15 深圳市旭景数字技术有限公司 Video coding method and system for realizing efficient storage and transmission of digital media

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109784269A (en) * 2019-01-11 2019-05-21 中国石油大学(华东) One kind is based on the united human action detection of space-time and localization method
CN111259795A (en) * 2020-01-16 2020-06-09 河南职业技术学院 Human behavior recognition method based on multi-stream deep learning
WO2021051545A1 (en) * 2019-09-16 2021-03-25 平安科技(深圳)有限公司 Behavior identification model-based fall-down action determining method and apparatus, computer device, and storage medium
CN114973416A (en) * 2022-06-07 2022-08-30 哈尔滨理工大学 Sign language recognition algorithm based on three-dimensional convolution network

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11883245B2 (en) * 2021-03-22 2024-01-30 Verb Surgical Inc. Deep-learning-based real-time remaining surgery duration (RSD) estimation

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109784269A (en) * 2019-01-11 2019-05-21 中国石油大学(华东) One kind is based on the united human action detection of space-time and localization method
WO2021051545A1 (en) * 2019-09-16 2021-03-25 平安科技(深圳)有限公司 Behavior identification model-based fall-down action determining method and apparatus, computer device, and storage medium
CN111259795A (en) * 2020-01-16 2020-06-09 河南职业技术学院 Human behavior recognition method based on multi-stream deep learning
CN114973416A (en) * 2022-06-07 2022-08-30 哈尔滨理工大学 Sign language recognition algorithm based on three-dimensional convolution network

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
"基于多特征融合及 Transformer 的人体跌倒动作检测算法";刘文龙等;《应用科技》;第49-62页 *

Also Published As

Publication number Publication date
CN116385945A (en) 2023-07-04

Similar Documents

Publication Publication Date Title
CN116385945B (en) Video interaction action detection method and system based on random frame complement and attention
CN110263912B (en) Image question-answering method based on multi-target association depth reasoning
CN109685819B (en) Three-dimensional medical image segmentation method based on feature enhancement
CN111950455B (en) Motion imagery electroencephalogram characteristic identification method based on LFFCNN-GRU algorithm model
CN108133188A (en) A kind of Activity recognition method based on motion history image and convolutional neural networks
CN109443382A (en) Vision SLAM closed loop detection method based on feature extraction Yu dimensionality reduction neural network
CN110175551B (en) Sign language recognition method
CN115896817A (en) Production method and system of fluorine-nitrogen mixed gas
CN111462191B (en) Non-local filter unsupervised optical flow estimation method based on deep learning
CN110321805B (en) Dynamic expression recognition method based on time sequence relation reasoning
CN110276784B (en) Correlation filtering moving target tracking method based on memory mechanism and convolution characteristics
CN115222998B (en) Image classification method
CN113569805A (en) Action recognition method and device, electronic equipment and storage medium
CN112149645A (en) Human body posture key point identification method based on generation of confrontation learning and graph neural network
CN113837229B (en) Knowledge-driven text-to-image generation method
CN112668543B (en) Isolated word sign language recognition method based on hand model perception
CN113850182A (en) Action identification method based on DAMR-3 DNet
Dastbaravardeh et al. Channel Attention‐Based Approach with Autoencoder Network for Human Action Recognition in Low‐Resolution Frames
CN113537240B (en) Deformation zone intelligent extraction method and system based on radar sequence image
Ma et al. Convolutional transformer network for fine-grained action recognition
CN114463614A (en) Significance target detection method using hierarchical significance modeling of generative parameters
Shi et al. Attention-YOLOX: Improvement in On-Road Object Detection by Introducing Attention Mechanisms to YOLOX
Zeng et al. Expression Recognition Based on Multi-Scale Adaptive Parallel Integration Network
WO2024093466A1 (en) Person image re-identification method based on autonomous model structure evolution
CN116012388B (en) Three-dimensional medical image segmentation method and imaging method for acute ischemic cerebral apoplexy

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
TR01 Transfer of patent right
TR01 Transfer of patent right

Effective date of registration: 20231225

Address after: Building A6-211, Hanyu Jingu, No. 7000 Jingshi Road, Jinan Area, China (Shandong) Pilot Free Trade Zone, Jinan City, Shandong Province, 250000

Patentee after: Shandong Zhonglian Audio-Visual Information Technology Co.,Ltd.

Address before: No.19 Keyuan Road, Lixia District, Jinan City, Shandong Province

Patentee before: Shandong Institute of artificial intelligence

Patentee before: TIANJIN University OF TECHNOLOGY

Patentee before: Shandong Zhonglian Audio-Visual Information Technology Co.,Ltd.