CN116543339A - Short video event detection method and device based on multi-scale attention fusion - Google Patents

Short video event detection method and device based on multi-scale attention fusion

Info

Publication number
CN116543339A
CN116543339A
Authority
CN
China
Prior art keywords
patch
blocks
short video
attention
scale
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310578964.3A
Other languages
Chinese (zh)
Inventor
苏育挺
马潇
井佩光
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin University
Original Assignee
Tianjin University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin University
Priority to CN202310578964.3A
Publication of CN116543339A
Legal status: Pending (Current)

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/40 - Scenes; Scene-specific elements in video content
    • G06V20/41 - Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G06N3/0455 - Auto-encoder networks; Encoder-decoder networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/048 - Activation functions
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/40 - Extraction of image or video features
    • G06V10/44 - Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 - Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80 - Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a short video event detection method and device based on multi-scale attention fusion. The method comprises the following steps: generating short video subsequences at three scales with sliding windows of different sizes and using them as input; uniformly dividing the input at each scale into local blocks, further dividing the local blocks into smaller patch blocks using an inner Transformer and an outer Transformer, computing the attention of each patch block, and feeding the attention of each patch block, together with the overall attention of the local block, into the outer Transformer for computation; the inner Transformer processes visual information within the patch blocks, while the outer Transformer processes visual information that aggregates local-block and patch-block features, which is used for short video event detection. The device comprises a processor and a memory. The invention can learn the relationships and associations between different visual elements in an image, thereby improving the accuracy of the classification task.

Description

Short video event detection method and device based on multi-scale attention fusion
Technical Field
The invention relates to the field of short video event detection, in particular to a short video event detection method and device based on multi-scale attention fusion.
Background
With the rapid development of the Internet and short video platforms, short videos have become popular with young people thanks to their short duration, rich content, simple production and strong shareability. These short video data in turn contain rich information and value, such as user interests, social hotspots and cultural trends, and their research and analysis has become a very important topic. As an emerging form of multimedia, short video has become an integral part of people's lives. Research and analysis of massive short video data not only helps people better understand and grasp current cultural and social trends, but also provides useful references for short video content production, user profiling, personalized recommendation and the like.
Currently, research on short video content mainly focuses on popularity prediction, scene classification, short video recommendation and other aspects. For example, a low-rank multi-view learning framework can improve popularity prediction for short videos; a Neural Multimodal Cooperative Learning (NMCL) network computes attention scores through a perceptual attention mechanism to measure the correlation between different modalities, thereby achieving scene classification of short videos. In addition, against the background that a short video attracts many different types of audience groups, a user-video co-attention network can be established to support short video recommendation tasks and the like.
Short video event detection is more challenging and practical than tasks such as popularity prediction, scene classification, and short video recommendation. Short video event detection aims at automatically discovering and identifying events from short videos, such as: sports games, concerts, exhibitions, celebrations, etc., thereby providing more accurate video content understanding and management. The short video event detection can provide more accurate, rich and timely information service for users and more efficient management and recommendation functions for short video platforms.
Short video event detection technology refers to technology that detects the features of the subjects and objects involved in a particular scenario. In video, simple gestures and motions such as clapping, running or smiling are regarded as actions, while a collection of actions occurring in a specific situation, such as a birthday, a parade or a wedding, is referred to as an event. At present, event detection technologies mainly fall into abnormal event detection for surveillance video and sports event detection for traditional video. Mainstream techniques include: hybrid auto-encoding models combining long short-term memory networks and convolutional neural networks for abnormal event detection; hybrid modulation methods for abnormal event detection based on feature-expectation subgraph calibration and classification in video surveillance scenes; and semi-supervised learning schemes that address the different distributions of normal and abnormal events in surveillance video.
In order to better exploit multimodal fusion, it is necessary to build a reliable model that can extract visual information of different scales and fine granularity. In addition, the development of short video event detection techniques is also limited by the lack of mainstream data sets.
Disclosure of Invention
The invention provides a short video event detection method and device based on multi-scale attention fusion, together with a novel video feature extraction scheme: by generating three video subsequences at different scales and then feature-encoding them, the complexity and richness of a short video can be better understood. By applying Transformers of different granularities (known to those skilled in the art) to the local blocks and patch blocks in each frame of the subsequences, the relationships and associations between different visual elements in an image can be learned, thereby improving the accuracy of the classification task, as described in detail below:
A short video event detection method based on multi-scale attention fusion, the method comprising:
generating short video subsequences at three scales with sliding windows of different sizes and using them as input;
uniformly dividing the input at each scale into local blocks, further dividing the local blocks into smaller patch blocks using an inner Transformer and an outer Transformer, computing the attention of each patch block, and feeding the attention of each patch block, together with the overall attention of the local block, into the outer Transformer for computation;
wherein the inner Transformer processes visual information within the patch blocks, and the outer Transformer processes visual information that aggregates local-block and patch-block features, which is used for short video event detection.
Wherein the three-scale short video subsequences are expressed as:
where i ∈ {1, 2, 3} and j ∈ {4, 8, 16}.
Wherein the input at each scale is uniformly divided into local blocks:
X = [X_1, X_2, ..., X_n]
each local block is further divided into m patch blocks, i.e. a local block consists of m patch blocks:
X_i → [x_{i,1}, x_{i,2}, ..., x_{i,m}].
Further, feeding the attention of each patch block, together with the overall attention of the local block, into the outer Transformer for computation comprises:
converting the sequence of patch block embeddings by linear projection and embedding it into the local block:
where the sequence of patch blocks is passed through a fully connected layer FC so that the two dimensions match for addition;
the embedding is then processed by the outer Transformer module:
which models the relationships between the local block encodings.
A short video event detection device based on multi-scale attention fusion, the device comprising: a processor and a memory, the memory having program instructions stored therein, the processor invoking the program instructions stored in the memory to cause the device to perform the method steps of any of the above.
The technical scheme provided by the invention has the beneficial effects that:
1. because of the richness and complexity of video content, it is often not sufficient to extract information from only a single time scale, and to better process these short video data, the present invention uses three different scale sliding windows to generate three different scale video sub-sequences; the sliding windows with different scales can capture specific characteristics and information in different time periods in the video; for example, a smaller scale sliding window may better capture fast motion and detail information in a video, while a larger scale sliding window may better capture overall features and background information in a video; the complexity and richness of the short video can be better understood by feature coding each sub-sequence;
2. By combining the inner and outer Transformers, better visual representation capability can be obtained and short video data can be handled more effectively; the inner Transformer focuses on processing visual information within patch blocks, so local detail information is captured better, improving the precision and effect of the model; it is very effective for processing complex visual information and can learn the relationships and associations between different visual elements in an image; features of local blocks and patch blocks are aggregated to enhance the representation capability, so global features and background information of the image are captured better, thereby improving the robustness and generalization ability of the model.
3. Short video event detection is a very challenging task with very high requirements on the accuracy and robustness of the algorithm; however, since this research direction lacks a mainstream data set, the present invention constructs a new short video event detection data set; the data set comprises short videos from different fields, different time scales and different angles, and covers rich and diverse scenes and events; a large amount of short video data was collected from multiple data sources and carefully labeled and categorized to ensure the accuracy and validity of the data set.
The invention makes full use of the multi-modal information of short videos for event detection, and also provides a novel key frame extraction method and an innovative network architecture, which help improve the accuracy of the short video event detection task.
Drawings
FIG. 1 is a schematic diagram of key frame extraction;
fig. 2 is a flow chart of a network model framework.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention will be described in further detail below.
Example 1
The embodiment of the invention provides a short video event detection method based on multi-scale attention fusion, which fully utilizes visual information of multiple scales and fine granularity of short videos to solve the detection problem of short video events, and is shown in fig. 1 and 2, and the method comprises the following steps:
101: generating short video subsequences at three scales with sliding windows of different sizes and using them as input;
In order to better process complex and rich short video data and make full use of its visual information, sliding windows of different scales are used to generate short video subsequences at three scales as input, so that specific features and information in different time periods of the video can be better captured.
102: uniformly dividing the input at each scale into local blocks, exploring a new architecture with an inner Transformer (patch-block Transformer) and an outer Transformer (local-block Transformer), dividing the local blocks into smaller patch blocks, computing the attention of each patch block, and feeding the attention of each patch block, together with the overall attention of the local block, into the outer Transformer for computation;
in practice, transformers help people to solve some complex tasks by encoding input data and extracting strong feature representations. In natural image processing, it becomes particularly important to extract object features of different scales and positions for complex details and color information. For processing short video data, the short video data is firstly converted into a frame sequence suitable for a transducer, and then a novel transducer network architecture is provided for processing visual information. The architecture utilizes three sliding windows (step sizes of 1,2 and 4) to generate three video subsequences with different scales, and calculates a correlation matrix under the three scales, so that three video subsequences with different scales from an original video are obtained as input of a network model. The method can well treat the multi-scale problem in the natural image and is helpful for mining object features with different scales and positions.
First, the embodiment of the present invention uniformly divides the input at each scale into local blocks. Since the attention representation of the information within each local block is considered essential and delivers high performance, an inner Transformer and an outer Transformer are used to explore a new architecture. Each local block is further divided into smaller patch blocks, for example 4 (2 x 2) or 9 (3 x 3) patch blocks, and the attention of each patch block is computed together with the other patches in the given local block at negligible computational cost. In specific implementations, the embodiment of the invention does not limit the number or size of the blocks, which are selected as needed in practical applications.
103: the inner Transformer is used to process visual information within the patch blocks, and the outer Transformer is used to process visual information that aggregates local-block and patch-block features, which can then be used for a range of downstream tasks.
For the input at each scale, the embodiment of the present invention uniformly divides it into a series of local blocks and further divides each local block into smaller patch blocks. In order to better process this information, the embodiment of the present invention proposes a completely new architecture comprising an inner Transformer and an outer Transformer. The inner Transformer is used to process visual information within the patch blocks, while the outer Transformer is used to process visual information of the local blocks and to add the feature information of the patch blocks to the local blocks to enhance the representation capability of the local blocks. When computing the attention representation, the attention of each patch block is computed with the other patches in the given local block at negligible computational cost. The method can extract visual information at different granularities while mitigating the loss of feature information between granularities.
The inner Transformer is used to process visual information within the patch blocks, and the outer Transformer is used to process visual information of the local blocks while aggregating local-block and patch-block features to enhance the representation capability. Finally, the model can be used for downstream tasks such as event detection. Through this model, the embodiment of the invention can extract visual information at different scales and fine granularities and provide more detailed features.
Finally, the embodiment of the invention uses the obtained visual information with different scales and different fine granularity for the event detection task. Experimental results on the Flickr short video event detection data set show that the classification accuracy of the method is superior to that of the existing mainstream method. This approach may help to better understand the complexity of natural images and provide more detailed features.
The network is finally applied to the short video event detection task. This method has wide application in short video and other fields of image processing, and can be applied to more downstream tasks.
The performance of the scheme is evaluated with three metrics, Accuracy (ACC), Average Precision (AP) and Average Recall (AR), to ensure the objectivity and accuracy of the experimental results.
In summary, the embodiment of the present invention fully utilizes the visual information of multiple scales and fine granularity of the short video through the steps 101 to 103, thereby solving the detection problem of the short video event.
Example 2
The scheme of example 1 is further described in conjunction with the calculation formulas and examples below, and is described in detail below:
201: uniform sampling is a commonly used short video feature extraction method, which captures key information of a video by selecting frames at equal intervals in the video;
the method comprises the following implementation steps: video decomposition, selection of sampling intervals, and frame decimation. During frame extraction, it is necessary to select the appropriate sampling interval to determine which frames can be used as input features based on the task requirements and the dataset characteristics. In addition, the method can improve the diversity and uniformity of the characteristics and reduce the redundancy degree, thereby improving the model performance.
The three-scale short video subsequences extracted by the embodiment of the invention are expressed as follows:
where i ∈ {1, 2, 3} and j ∈ {4, 8, 16}.
202: First, the basic components of the Transformer, namely MSA (multi-head self-attention), MLP (multi-layer perceptron) and LN (layer normalization), are briefly described below:
(1) MSA: it establishes associations between different positions and aggregates the information of each position through a weighted computation of attention weights. Multi-head self-attention is introduced on top of the self-attention mechanism; each head performs a different attention computation on the input, so richer and more diverse feature representations can be extracted.
In MSA, Q, K and V denote the query, key and value vectors, respectively, which are obtained from the input vectors by linear transformation. When computing the attention weights, for a given query vector Q, the model computes a similarity score between the query vector and each input vector; a higher similarity score indicates that the vector is more relevant to the query. When computing the similarity scores, the model uses the key vectors K and the value vectors V:
Finally, the output values of the heads are concatenated and linearly projected by a linear layer to form the final output.
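A minimal PyTorch sketch of this multi-head self-attention is given below: Q, K and V are obtained from the input by a linear projection, scaled dot-product similarity scores weight the value vectors, and the heads' outputs are concatenated and linearly projected. This is the standard formulation under assumed dimensions, not necessarily the exact configuration of the patented model.

```python
import torch
import torch.nn as nn

class MSA(nn.Module):
    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        assert dim % num_heads == 0
        self.h, self.d = num_heads, dim // num_heads
        self.qkv = nn.Linear(dim, dim * 3)   # joint projection to Q, K, V
        self.proj = nn.Linear(dim, dim)      # final linear layer after head concatenation

    def forward(self, x):                    # x: (batch, tokens, dim)
        B, N, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # reshape to (batch, heads, tokens, head_dim)
        q, k, v = (t.view(B, N, self.h, self.d).transpose(1, 2) for t in (q, k, v))
        attn = (q @ k.transpose(-2, -1)) / self.d ** 0.5   # similarity scores from Q and K
        attn = attn.softmax(dim=-1)                        # attention weights
        out = (attn @ v).transpose(1, 2).reshape(B, N, -1) # weighted values, heads concatenated
        return self.proj(out)
```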
(2) MLP: the MLP is applied between self-attention layers for feature transformation and nonlinear transformation:
MLP(X) = FC(σ(FC(X)))
FC(X) = XW + b    (3)
where W and b are the weight and bias terms of the fully connected layer, respectively, and σ(·) is the activation function.
(3) LN: layer normalization is a key component of the transducer that enables stable training and rapid convergence:
wherein, mu, delta respectively refer to the mean value and standard deviation of the characteristic, and gamma, beta refer to the convertible parameters.
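The two components above map directly onto standard building blocks. The sketch below is a direct PyTorch reading of equation (3) and of the layer normalization just described; the hidden width (4 * dim) and the choice of GELU are common conventions assumed here, not stated in the text.

```python
import torch.nn as nn

def mlp(dim, hidden=None):
    """MLP(X) = FC(sigma(FC(X))), with FC(X) = XW + b."""
    hidden = hidden or 4 * dim
    return nn.Sequential(
        nn.Linear(dim, hidden),   # first FC layer
        nn.GELU(),                # sigma(.), the activation function
        nn.Linear(hidden, dim),   # second FC layer
    )

layer_norm = nn.LayerNorm(64)     # normalizes with mean/std, then applies learnable gamma, beta
```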
203: Any two-dimensional image in the short video subsequences of different scales is uniformly divided into n local blocks:
X = [X_1, X_2, ..., X_n]    (5)
global and local information in an image is learned according to a proposed network model architecture. Therefore, the embodiment of the present invention further divides each local block into m patch blocks, that is, the local block is composed of m patch blocks:
X i →[x i,1 ,x i,2 ,…,x i,m ] (6)
wherein x is i,j Refers to the j-th patch block of the i-th partial block.
By linear projection, the embodiment of the present invention converts it into an embedding sequence:
Y_i = [y_{i,1}, y_{i,2}, ..., y_{i,m}]
y_{i,j} = FC(Vec(x_{i,j}))    (7)
where y_{i,j} denotes the embedding of the j-th patch block and Vec(·) denotes the vectorization (flattening) operation.
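The sketch below illustrates equations (5)-(7) under assumed sizes: a frame is cut into n local blocks, each local block into m patch blocks, and every patch is flattened (Vec) and linearly projected (FC) into an embedding y_{i,j}. The block size, patch size and embedding dimension are illustrative assumptions only.

```python
import torch
import torch.nn as nn

def split_blocks(img, block=16, patch=4):
    """img: (C, H, W) -> (n local blocks, m patch blocks, C * patch * patch)."""
    C, H, W = img.shape
    blocks = img.unfold(1, block, block).unfold(2, block, block)         # (C, nH, nW, block, block)
    blocks = blocks.permute(1, 2, 0, 3, 4).reshape(-1, C, block, block)  # (n, C, block, block)
    patches = blocks.unfold(2, patch, patch).unfold(3, patch, patch)     # (n, C, pH, pW, patch, patch)
    patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(blocks.size(0), -1, C * patch * patch)
    return patches                                                       # each row is Vec(x_{i,j})

proj = nn.Linear(3 * 4 * 4, 192)           # the FC projection in eq. (7)
img = torch.randn(3, 224, 224)
y = proj(split_blocks(img))                # y_{i,j}: (n, m, embedding dim) = (196, 16, 192)
```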
In the embodiment of the invention, two processing operations are mainly performed: processing the local blocks and processing the patch blocks, where a Transformer is used to explore the relationships between patch blocks. This process can be expressed as:
Y^l_i = Y^{l-1}_i + MLP(LN(Y^{l-1}_i))    (8)
where l = 1, 2, ..., L is the index of the stacked block and L is the total number of stacked blocks. This part can be seen as the inner Transformer module, which establishes relationships between patch blocks by computing the interactions between any two patch blocks within a local block. For example, in a local block of a face, the visual words (patches) corresponding to the eyes are more strongly related to the other words of the eyes and interact less with the mouth region.
For the processing of local blocks, the sequence of patch block embeddings is first converted by linear projection and embedded into the local block:
where the sequence of patch blocks is passed through a fully connected layer FC so that the two dimensions match for addition.
Through the above operation, the patch block embeddings enrich the representation of the image features, and the embedding is then processed with a standard Transformer, namely the outer Transformer module:
which models the relationships between the local block encodings.
Thus, the inputs and outputs of the embodiment of the present invention include the above-described local block encodings and patch block encodings. Finally, the method is applied to short video event detection.
204: Three data sets were used for this study: UCF-101, HMDB51 and a newly established Flickr short video event detection dataset. The newly established dataset is divided into a training set and a test set at a ratio of 10:2. UCF-101 and HMDB51 are two public action recognition datasets: UCF-101 contains 13320 short videos from 101 different action categories, and HMDB51 contains 6766 short videos covering 51 different categories. Throughout training and testing, the performance of the model is evaluated with three metrics: Accuracy (ACC), Average Precision (AP) and Average Recall (AR). The three metrics are defined as follows:
(1) Accuracy (ACC) is a metric for evaluating the accuracy of classification algorithms. It represents the proportion of samples that the classifier classifies correctly among all samples. The higher the accuracy, the better the performance of the classifier. The accuracy is calculated as:
ACC=(TP+TN)/(TP+FP+FN+TN) (11)
where TP is the number of true positives (samples correctly judged as positive), TN the number of true negatives (samples correctly judged as negative), FP the number of false positives (samples incorrectly judged as positive) and FN the number of false negatives (samples incorrectly judged as negative).
The accuracy is very simple to compute: it is the ratio of the number of correctly classified samples to the total number of samples. In a classification problem, a higher accuracy generally indicates better classifier performance, because more positive and negative samples are judged correctly. However, accuracy cannot always fully reflect the performance of a classifier. In particular, when the samples are imbalanced, a classifier that always predicts the negative class can still achieve high accuracy while performing poorly. Therefore, when evaluating model performance, other metrics need to be considered in addition to accuracy.
(2) Average Precision (AP) is a metric for evaluating information retrieval algorithms and can also be used to evaluate the performance of classification algorithms. It represents the average precision over all recall levels, where recall is the proportion of samples correctly classified as positive among all positive samples, and precision is the proportion of samples correctly classified as positive among all samples classified as positive. The higher the average precision, the better the classifier performance. The average precision is calculated as:
AP=∑(P(i)×ΔR(i)) (12)
where P(i) is the precision at the i-th position and ΔR(i) is the difference between the recall at the i-th position and the recall at the (i-1)-th position. Computing the average precision is relatively involved: it requires the precision at all recall levels and the sum of the products of the precision and the change in recall.
(3) Average Recall (AR) is a metric used to evaluate the performance of classification algorithms. It represents the average recall over all precision levels, where precision is the proportion of samples correctly classified as positive among all samples classified as positive, and recall is the proportion of samples correctly classified as positive among all positive samples. The higher the average recall, the better the classifier performance. The average recall is calculated as:
AR=∑(R(i)×ΔP(i)) (13)
where R(i) is the recall at the i-th position and ΔP(i) is the difference between the precision at the i-th position and the precision at the (i-1)-th position. Similar to the average precision, the corresponding recall or precision must be computed at different operating points and the products of the two summed. The average recall is mainly used to evaluate the classifier at different precision levels and can reflect its performance more comprehensively. In applications of the classifier, the embodiment of the invention typically uses multiple thresholds to generate different classification results, and the average recall helps to comprehensively evaluate the classifier under these thresholds and to select the optimal threshold.
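As an illustration of how these metrics can be evaluated, the sketch below computes ACC directly from the confusion counts of equation (11) and AP per equation (12) by sweeping a ranked list of prediction scores and summing precision times the change in recall; the AR sum of equation (13) is analogous with the roles of precision and recall swapped. The toy counts and scores are illustrative only.

```python
import numpy as np

def accuracy(tp, tn, fp, fn):
    """Eq. (11): proportion of correctly classified samples."""
    return (tp + tn) / (tp + tn + fp + fn)

def average_precision(labels, scores):
    """Eq. (12): sum of P(i) * delta R(i) over a score-ranked list of samples."""
    order = np.argsort(scores)[::-1]                     # rank samples by predicted score
    labels = np.asarray(labels)[order]
    tp = np.cumsum(labels)
    precision = tp / np.arange(1, len(labels) + 1)       # P(i)
    recall = tp / labels.sum()                           # R(i)
    delta_r = np.diff(np.concatenate(([0.0], recall)))   # delta R(i)
    return float((precision * delta_r).sum())

print(accuracy(tp=40, tn=45, fp=5, fn=10))                               # 0.85
print(average_precision([1, 0, 1, 1, 0], [0.9, 0.8, 0.7, 0.4, 0.2]))     # approx. 0.81
```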
In summary, a conventional Transformer extracts visual information only at the granularity of the local block division, whereas natural images are highly complex and rich in detail and color information, so it is very important to mine the features of objects at different scales and positions. In order to better process short videos, the embodiment of the invention first converts a short video into a frame-sequence form suitable for Transformer input and proposes a novel Transformer network architecture for visual information processing. Three video subsequences of the original video at different scales are first obtained as input. To further extract features, the embodiment of the invention uniformly divides the input at each scale into local blocks and then further divides the local blocks into smaller patch blocks. The embodiment of the invention explores a new architecture, namely an inner Transformer and an outer Transformer, for processing information within local blocks and feature aggregation between local blocks. The visual information within the local blocks and the visual information within the patch blocks are aggregated to enhance the representation capability by computing the attention of each patch block. Finally, the embodiment of the invention applies the model to the downstream task of short video event detection.
Example 3
A short video event detection device based on multi-scale attention fusion, the device comprising: a processor and a memory, the memory having program instructions stored therein, and the processor invoking the program instructions stored in the memory to cause the device to perform the following method steps:
generating short video subsequences at three scales with sliding windows of different sizes and using them as input;
uniformly dividing the input at each scale into local blocks, further dividing the local blocks into smaller patch blocks using an inner Transformer and an outer Transformer, computing the attention of each patch block, and feeding the attention of each patch block, together with the overall attention of the local block, into the outer Transformer for computation;
wherein the inner Transformer processes visual information within the patch blocks, and the outer Transformer processes visual information that aggregates local-block and patch-block features, which is used for short video event detection.
Wherein the three-scale short video subsequences are expressed as:
where i ∈ {1, 2, 3} and j ∈ {4, 8, 16}.
The input at each scale is uniformly divided into local blocks:
X = [X_1, X_2, ..., X_n]
each local block is further divided into m patch blocks, i.e. a local block consists of m patch blocks:
X_i → [x_{i,1}, x_{i,2}, ..., x_{i,m}].
Feeding the attention of each patch block, together with the overall attention of the local block, into the outer Transformer for computation comprises:
converting the sequence of patch block embeddings by linear projection and embedding it into the local block:
where the sequence of patch blocks is passed through a fully connected layer FC so that the two dimensions match for addition;
the embedding is then processed by the outer Transformer module:
which models the relationships between the local block encodings.
It should be noted that, the device descriptions in the above embodiments correspond to the method descriptions in the embodiments, and the embodiments of the present invention are not described herein in detail.
The processor and the memory may be carried by any device with computing capability, such as a computer, a single-chip microcomputer or a microcontroller; the embodiment of the invention does not limit the type of device, which is selected as needed in practical applications.
The data signals are transmitted between the memory and the processor through the bus, and the embodiments of the present invention will not be described in detail.
The embodiment of the invention does not limit the types of other devices except the types of the devices, so long as the devices can complete the functions.
Those skilled in the art will appreciate that the drawings are schematic representations of only one preferred embodiment, and that the above-described embodiment numbers are merely for illustration purposes and do not represent advantages or disadvantages of the embodiments.
The foregoing description of the preferred embodiments of the invention is not intended to limit the invention to the precise form disclosed, and any such modifications, equivalents, and alternatives falling within the spirit and scope of the invention are intended to be included within the scope of the invention.

Claims (5)

1. A short video event detection method based on multi-scale attention fusion, the method comprising:
generating short video subsequences at three scales with sliding windows of different sizes and using them as input;
uniformly dividing the input at each scale into local blocks, further dividing the local blocks into smaller patch blocks using an inner Transformer and an outer Transformer, computing the attention of each patch block, and feeding the attention of each patch block, together with the overall attention of the local block, into the outer Transformer for computation;
wherein the inner Transformer processes visual information within the patch blocks, and the outer Transformer processes visual information that aggregates local-block and patch-block features, which is used for short video event detection.
2. The short video event detection method based on multi-scale attention fusion according to claim 1, wherein the three-scale short video subsequences are expressed as:
where i ∈ {1, 2, 3} and j ∈ {4, 8, 16}.
3. The short video event detection method based on multi-scale attention fusion according to claim 1, wherein the uniformly dividing the input at each scale into local blocks is:
X = [X_1, X_2, ..., X_n]
each local block is further divided into m patch blocks, i.e. a local block consists of m patch blocks:
X_i → [x_{i,1}, x_{i,2}, ..., x_{i,m}].
4. The short video event detection method based on multi-scale attention fusion according to claim 1, wherein the feeding the attention of each patch block, together with the overall attention of the local block, into the outer Transformer for computation is:
converting the sequence of patch block embeddings by linear projection and embedding it into the local block:
where the sequence of patch blocks is passed through a fully connected layer FC so that the two dimensions match for addition;
the embedding is then processed by the outer Transformer module:
which models the relationships between the local block encodings.
5. A short video event detection device based on multi-scale attention fusion, the device comprising: a processor and a memory, the memory having program instructions stored therein, the processor invoking the program instructions stored in the memory to cause the device to perform the method steps of any one of claims 1-4.
CN202310578964.3A 2023-05-22 2023-05-22 Short video event detection method and device based on multi-scale attention fusion Pending CN116543339A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310578964.3A CN116543339A (en) 2023-05-22 2023-05-22 Short video event detection method and device based on multi-scale attention fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310578964.3A CN116543339A (en) 2023-05-22 2023-05-22 Short video event detection method and device based on multi-scale attention fusion

Publications (1)

Publication Number Publication Date
CN116543339A true CN116543339A (en) 2023-08-04

Family

ID=87454099

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310578964.3A Pending CN116543339A (en) 2023-05-22 2023-05-22 Short video event detection method and device based on multi-scale attention fusion

Country Status (1)

Country Link
CN (1) CN116543339A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117115443A (en) * 2023-08-18 2023-11-24 中南大学 Segmentation method for identifying infrared small targets


Similar Documents

Publication Publication Date Title
Cheng et al. Look, listen, and attend: Co-attention network for self-supervised audio-visual representation learning
CN113936339B (en) Fighting identification method and device based on double-channel cross attention mechanism
Pan et al. Deepfake detection through deep learning
CN109063565B (en) Low-resolution face recognition method and device
WO2020177673A1 (en) Video sequence selection method, computer device and storage medium
CN111428073B (en) Image retrieval method of depth supervision quantization hash
CN111464881B (en) Full-convolution video description generation method based on self-optimization mechanism
CN113642604B (en) Audio-video auxiliary touch signal reconstruction method based on cloud edge cooperation
CN116543339A (en) Short video event detection method and device based on multi-scale attention fusion
CN114048351A (en) Cross-modal text-video retrieval method based on space-time relationship enhancement
Baddar et al. On-the-fly facial expression prediction using lstm encoded appearance-suppressed dynamics
CN115203409A (en) Video emotion classification method based on gating fusion and multitask learning
CN114926742A (en) Loop detection and optimization method based on second-order attention mechanism
Wu et al. Interactive two-stream network across modalities for deepfake detection
CN113689527B (en) Training method of face conversion model and face image conversion method
CN113420179A (en) Semantic reconstruction video description method based on time sequence Gaussian mixture hole convolution
Liu et al. A multimodal approach for multiple-relation extraction in videos
Zou et al. 360° Image Saliency Prediction by Embedding Self-Supervised Proxy Task
CN116933051A (en) Multi-mode emotion recognition method and system for modal missing scene
CN113792167B (en) Cross-media cross-retrieval method based on attention mechanism and modal dependence
CN115346132A (en) Method and device for detecting abnormal events of remote sensing images by multi-modal representation learning
Sun et al. Multimodal micro-video classification based on 3D convolutional neural network
Hammad et al. Characterizing the impact of using features extracted from pre-trained models on the quality of video captioning sequence-to-sequence models
CN117540007B (en) Multi-mode emotion analysis method, system and equipment based on similar mode completion
Yin et al. GAMC: An Unsupervised Method for Fake News Detection using Graph Autoencoder with Masking

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination