CN116543339A - Short video event detection method and device based on multi-scale attention fusion - Google Patents

Short video event detection method and device based on multi-scale attention fusion

Info

Publication number
CN116543339A
CN116543339A
Authority
CN
China
Prior art keywords
patch
blocks
short video
attention
scale
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310578964.3A
Other languages
Chinese (zh)
Inventor
苏育挺
马潇
井佩光
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin University
Original Assignee
Tianjin University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin University
Priority to CN202310578964.3A
Publication of CN116543339A
Legal status: Pending (Current)

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/40 - Scenes; Scene-specific elements in video content
    • G06V20/41 - Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G06N3/0455 - Auto-encoder networks; Encoder-decoder networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/048 - Activation functions
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/40 - Extraction of image or video features
    • G06V10/44 - Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 - Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80 - Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a short video event detection method and device based on multi-scale attention fusion. The method comprises the following steps: generating short video subsequences at three scales with sliding windows of different sizes and using them as input; uniformly dividing the input at each scale into local blocks, further dividing the local blocks into smaller patch blocks using an inner Transformer and an outer Transformer, computing the attention of each patch block, and feeding the attention of each patch block, together with the overall attention of the local block, into the outer Transformer for computation; the inner Transformer processes visual information within the patch blocks, while the outer Transformer processes visual information that aggregates local-block and patch-block features, which is used for short video event detection. The device comprises a processor and a memory. The invention can learn the relationships and associations between different visual elements in an image, thereby improving the accuracy of the classification task.

Description

Short video event detection method and device based on multi-scale attention fusion
Technical Field
The invention relates to the field of short video event detection, in particular to a short video event detection method and device based on multi-scale attention fusion.
Background
With the rapid development of the Internet and short video platforms, short videos have become popular with young people thanks to their short duration, rich content, simple production and strong shareability. These short video data in turn contain rich information and value, such as user interests, social hotspots and cultural trends, and their research and analysis has become a very important topic. As an emerging form of multimedia, short video has become an integral part of people's lives. Research and analysis of massive short video data not only helps people better understand and grasp current cultural and social trends, but also provides useful references for short video content production, user profiling, personalized recommendation and the like.
Currently, research on short video content mainly focuses on popularity prediction, scene classification, short video recommendation and other aspects. For example, a low-rank multi-view learning framework can improve popularity prediction for short videos; a Neural Multimodal Cooperative Learning (NMCL) network computes attention scores through a perceptual attention mechanism to measure the correlation between different modalities, thereby achieving scene classification of short videos. In addition, against the background that a short video attracts many different types of audience groups, a user-video co-attention network can be established to support short video recommendation tasks and the like.
Short video event detection is more challenging and practical than tasks such as popularity prediction, scene classification, and short video recommendation. Short video event detection aims at automatically discovering and identifying events from short videos, such as: sports games, concerts, exhibitions, celebrations, etc., thereby providing more accurate video content understanding and management. The short video event detection can provide more accurate, rich and timely information service for users and more efficient management and recommendation functions for short video platforms.
Short video event detection technology refers to technology that detects the features of the subjects and objects involved in a particular scenario. In video, simple gestures and motions such as clapping, running or smiling are regarded as actions, while a collection of actions occurring in a specific situation, such as a birthday, a parade or a wedding, is referred to as an event. At present, event detection technologies mainly fall into abnormal event detection for surveillance video and sports event detection for traditional video. Mainstream techniques include: hybrid auto-encoding models combining long short-term memory networks and convolutional neural networks for abnormal event detection; hybrid modulation methods for abnormal event detection based on feature-expectation subgraph calibration and classification in video surveillance scenes; and semi-supervised learning schemes that address the different distributions of normal and abnormal events in surveillance video.
In order to better exploit multimodal fusion, it is necessary to build a reliable model that can extract visual information of different scales and fine granularity. In addition, the development of short video event detection techniques is also limited by the lack of mainstream data sets.
Disclosure of Invention
The invention provides a short video event detection method and device based on multi-scale attention fusion, together with a novel video feature extraction scheme: by generating three video subsequences at different scales and then feature-encoding them, the complexity and richness of a short video can be better understood. By applying Transformers of different granularities (known to those skilled in the art) to the local blocks and patch blocks in each frame of the subsequences, the relationships and associations between different visual elements in an image can be learned, thereby improving the accuracy of the classification task, as described in detail below:
A short video event detection method based on multi-scale attention fusion, the method comprising:
generating short video subsequences at three scales with sliding windows of different sizes and using them as input;
uniformly dividing the input at each scale into local blocks, further dividing the local blocks into smaller patch blocks using an inner Transformer and an outer Transformer, computing the attention of each patch block, and feeding the attention of each patch block, together with the overall attention of the local block, into the outer Transformer for computation;
wherein the inner Transformer processes visual information within the patch blocks, and the outer Transformer processes visual information that aggregates local-block and patch-block features, which is used for short video event detection.
Wherein the three-scale short video subsequences are expressed as:
where i ∈ {1, 2, 3} and j ∈ {4, 8, 16}.
Wherein the input at each scale is uniformly divided into local blocks:
X = [X_1, X_2, ..., X_n]
each local block is further divided into m patch blocks, i.e. a local block consists of m patch blocks:
X_i → [x_{i,1}, x_{i,2}, ..., x_{i,m}].
Further, feeding the attention of each patch block, together with the overall attention of the local block, into the outer Transformer for computation comprises:
converting the sequence of patch block embeddings by linear projection and embedding it into the local block:
where the sequence of patch blocks is passed through a fully connected layer FC so that the two dimensions match for addition;
the embedding is then processed by the outer Transformer module:
which models the relationships between the local block encodings.
A short video event detection device based on multi-scale attention fusion, the device comprising: a processor and a memory, the memory having program instructions stored therein, the processor invoking the program instructions stored in the memory to cause the device to perform the method steps of any of the above.
The technical scheme provided by the invention has the beneficial effects that:
1. because of the richness and complexity of video content, it is often not sufficient to extract information from only a single time scale, and to better process these short video data, the present invention uses three different scale sliding windows to generate three different scale video sub-sequences; the sliding windows with different scales can capture specific characteristics and information in different time periods in the video; for example, a smaller scale sliding window may better capture fast motion and detail information in a video, while a larger scale sliding window may better capture overall features and background information in a video; the complexity and richness of the short video can be better understood by feature coding each sub-sequence;
2. By combining the inner and outer Transformers, better visual representation capability can be obtained and short video data can be handled more effectively; the inner Transformer focuses on processing visual information within patch blocks, so local detail information is captured better, improving the precision and effect of the model; it is very effective for processing complex visual information and can learn the relationships and associations between different visual elements in an image; features of local blocks and patch blocks are aggregated to enhance the representation capability, so global features and background information of the image are captured better, thereby improving the robustness and generalization ability of the model.
3. Short video event detection is a very challenging task with very high requirements on the accuracy and robustness of the algorithm; however, since this research direction lacks a mainstream data set, the present invention constructs a new short video event detection data set; the data set comprises short videos from different fields, different time scales and different angles, and covers rich and diverse scenes and events; a large amount of short video data was collected from multiple data sources and carefully labeled and categorized to ensure the accuracy and validity of the data set.
The invention makes full use of the multi-modal information of short videos for event detection, and also provides a novel key frame extraction method and an innovative network architecture, which help improve the accuracy of the short video event detection task.
Drawings
FIG. 1 is a schematic diagram of key frame extraction;
fig. 2 is a flow chart of a network model framework.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention will be described in further detail below.
Example 1
The embodiment of the invention provides a short video event detection method based on multi-scale attention fusion, which fully utilizes visual information of multiple scales and fine granularity of short videos to solve the detection problem of short video events, and is shown in fig. 1 and 2, and the method comprises the following steps:
101: generating short video subsequences at three scales with sliding windows of different sizes and using them as input;
In order to better process complex and rich short video data and make full use of its visual information, sliding windows of different scales are used to generate short video subsequences at three scales as input, so that specific features and information in different time periods of the video can be better captured.
102: uniformly dividing the input at each scale into local blocks, exploring a new architecture with an inner Transformer (patch-block Transformer) and an outer Transformer (local-block Transformer), dividing the local blocks into smaller patch blocks, computing the attention of each patch block, and feeding the attention of each patch block, together with the overall attention of the local block, into the outer Transformer for computation;
in practice, transformers help people to solve some complex tasks by encoding input data and extracting strong feature representations. In natural image processing, it becomes particularly important to extract object features of different scales and positions for complex details and color information. For processing short video data, the short video data is firstly converted into a frame sequence suitable for a transducer, and then a novel transducer network architecture is provided for processing visual information. The architecture utilizes three sliding windows (step sizes of 1,2 and 4) to generate three video subsequences with different scales, and calculates a correlation matrix under the three scales, so that three video subsequences with different scales from an original video are obtained as input of a network model. The method can well treat the multi-scale problem in the natural image and is helpful for mining object features with different scales and positions.
First, the embodiment of the present invention uniformly divides the input at each scale into local blocks. Since the attention representation of the information within each local block is considered essential and delivers high performance, an inner Transformer and an outer Transformer are used to explore a new architecture. Each local block is further divided into smaller patch blocks, for example 4 (2 x 2) or 9 (3 x 3) patch blocks, and the attention of each patch block is computed together with the other patches in the given local block at negligible computational cost. In specific implementations, the embodiment of the invention does not limit the number or size of the blocks, which are selected as needed in practical applications.
103: the inner Transformer is used to process visual information within the patch blocks, and the outer Transformer is used to process visual information that aggregates local-block and patch-block features, which can then be used for a range of downstream tasks.
For the input at each scale, the embodiment of the present invention uniformly divides it into a series of local blocks and further divides each local block into smaller patch blocks. In order to better process this information, the embodiment of the present invention proposes a completely new architecture comprising an inner Transformer and an outer Transformer. The inner Transformer is used to process visual information within the patch blocks, while the outer Transformer is used to process visual information of the local blocks and to add the feature information of the patch blocks to the local blocks to enhance the representation capability of the local blocks. When computing the attention representation, the attention of each patch block is computed with the other patches in the given local block at negligible computational cost. The method can extract visual information at different granularities while mitigating the loss of feature information between granularities.
The inner Transformer is used to process visual information within the patch blocks, and the outer Transformer is used to process visual information of the local blocks while aggregating local-block and patch-block features to enhance the representation capability. Finally, the model can be used for downstream tasks such as event detection. Through this model, the embodiment of the invention can extract visual information at different scales and fine granularities and provide more detailed features.
Finally, the embodiment of the invention uses the obtained visual information with different scales and different fine granularity for the event detection task. Experimental results on the Flickr short video event detection data set show that the classification accuracy of the method is superior to that of the existing mainstream method. This approach may help to better understand the complexity of natural images and provide more detailed features.
The network is finally applied to the short video event detection task. This method has wide application in short video and other fields of image processing, and can be applied to more downstream tasks.
The performance of the scheme is evaluated with three metrics, Accuracy (ACC), Average Precision (AP) and Average Recall (AR), to ensure the objectivity and accuracy of the experimental results.
In summary, the embodiment of the present invention fully utilizes the visual information of multiple scales and fine granularity of the short video through the steps 101 to 103, thereby solving the detection problem of the short video event.
Example 2
The scheme of example 1 is further described in conjunction with the calculation formulas and examples below, and is described in detail below:
201: uniform sampling is a commonly used short video feature extraction method, which captures key information of a video by selecting frames at equal intervals in the video;
the method comprises the following implementation steps: video decomposition, selection of sampling intervals, and frame decimation. During frame extraction, it is necessary to select the appropriate sampling interval to determine which frames can be used as input features based on the task requirements and the dataset characteristics. In addition, the method can improve the diversity and uniformity of the characteristics and reduce the redundancy degree, thereby improving the model performance.
The three-scale short video subsequences extracted by the embodiment of the invention are expressed as follows:
where i ∈ {1, 2, 3} and j ∈ {4, 8, 16}.
202: First, the basic components of the Transformer, namely MSA (multi-head self-attention), MLP (multi-layer perceptron) and LN (layer normalization), are briefly described below:
(1) MSA: it establishes associations between different positions and aggregates the information of each position through a weighted computation of attention weights. Multi-head self-attention is introduced on top of the self-attention mechanism; each head performs a different attention computation on the input, so richer and more diverse feature representations can be extracted.
In MSA, Q, K and V denote the query, key and value vectors, respectively, which are obtained from the input vectors by linear transformation. When computing the attention weights, for a given query vector Q, the model computes a similarity score between the query vector and each input vector; a higher similarity score indicates that the vector is more relevant to the query. When computing the similarity scores, the model uses the key vectors K and the value vectors V:
Finally, the output values of the heads are concatenated and linearly projected by a linear layer to form the final output.
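A minimal PyTorch sketch of this multi-head self-attention is given below: Q, K and V are obtained from the input by a linear projection, scaled dot-product similarity scores weight the value vectors, and the heads' outputs are concatenated and linearly projected. This is the standard formulation under assumed dimensions, not necessarily the exact configuration of the patented model.

```python
import torch
import torch.nn as nn

class MSA(nn.Module):
    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        assert dim % num_heads == 0
        self.h, self.d = num_heads, dim // num_heads
        self.qkv = nn.Linear(dim, dim * 3)   # joint projection to Q, K, V
        self.proj = nn.Linear(dim, dim)      # final linear layer after head concatenation

    def forward(self, x):                    # x: (batch, tokens, dim)
        B, N, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # reshape to (batch, heads, tokens, head_dim)
        q, k, v = (t.view(B, N, self.h, self.d).transpose(1, 2) for t in (q, k, v))
        attn = (q @ k.transpose(-2, -1)) / self.d ** 0.5   # similarity scores from Q and K
        attn = attn.softmax(dim=-1)                        # attention weights
        out = (attn @ v).transpose(1, 2).reshape(B, N, -1) # weighted values, heads concatenated
        return self.proj(out)
```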
(2) MLP: the MLP is applied between self-attention layers for feature transformation and nonlinear transformation:
MLP(X) = FC(σ(FC(X)))
FC(X) = XW + b    (3)
where W and b are the weight and bias terms of the fully connected layer, respectively, and σ(·) is the activation function.
(3) LN: layer normalization is a key component of the transducer that enables stable training and rapid convergence:
wherein, mu, delta respectively refer to the mean value and standard deviation of the characteristic, and gamma, beta refer to the convertible parameters.
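The two components above map directly onto standard building blocks. The sketch below is a direct PyTorch reading of equation (3) and of the layer normalization just described; the hidden width (4 * dim) and the choice of GELU are common conventions assumed here, not stated in the text.

```python
import torch.nn as nn

def mlp(dim, hidden=None):
    """MLP(X) = FC(sigma(FC(X))), with FC(X) = XW + b."""
    hidden = hidden or 4 * dim
    return nn.Sequential(
        nn.Linear(dim, hidden),   # first FC layer
        nn.GELU(),                # sigma(.), the activation function
        nn.Linear(hidden, dim),   # second FC layer
    )

layer_norm = nn.LayerNorm(64)     # normalizes with mean/std, then applies learnable gamma, beta
```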
203: Any two-dimensional image in the short video subsequences of different scales is uniformly divided into n local blocks:
X = [X_1, X_2, ..., X_n]    (5)
global and local information in an image is learned according to a proposed network model architecture. Therefore, the embodiment of the present invention further divides each local block into m patch blocks, that is, the local block is composed of m patch blocks:
X i →[x i,1 ,x i,2 ,…,x i,m ] (6)
wherein x is i,j Refers to the j-th patch block of the i-th partial block.
By linear projection, the embodiment of the present invention converts it into an embedding sequence:
Y_i = [y_{i,1}, y_{i,2}, ..., y_{i,m}]
y_{i,j} = FC(Vec(x_{i,j}))    (7)
where y_{i,j} denotes the embedding of the j-th patch block and Vec(·) denotes the vectorization (flattening) operation.
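The sketch below illustrates equations (5)-(7) under assumed sizes: a frame is cut into n local blocks, each local block into m patch blocks, and every patch is flattened (Vec) and linearly projected (FC) into an embedding y_{i,j}. The block size, patch size and embedding dimension are illustrative assumptions only.

```python
import torch
import torch.nn as nn

def split_blocks(img, block=16, patch=4):
    """img: (C, H, W) -> (n local blocks, m patch blocks, C * patch * patch)."""
    C, H, W = img.shape
    blocks = img.unfold(1, block, block).unfold(2, block, block)         # (C, nH, nW, block, block)
    blocks = blocks.permute(1, 2, 0, 3, 4).reshape(-1, C, block, block)  # (n, C, block, block)
    patches = blocks.unfold(2, patch, patch).unfold(3, patch, patch)     # (n, C, pH, pW, patch, patch)
    patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(blocks.size(0), -1, C * patch * patch)
    return patches                                                       # each row is Vec(x_{i,j})

proj = nn.Linear(3 * 4 * 4, 192)           # the FC projection in eq. (7)
img = torch.randn(3, 224, 224)
y = proj(split_blocks(img))                # y_{i,j}: (n, m, embedding dim) = (196, 16, 192)
```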
In the embodiment of the invention, two processing operations are mainly performed: processing the local blocks and processing the patch blocks, where a Transformer is used to explore the relationships between patch blocks. This process can be expressed as:
Y^l_i = Y^{l-1}_i + MLP(LN(Y^{l-1}_i))    (8)
where l = 1, 2, ..., L is the index of the stacked block and L is the total number of stacked blocks. This part can be seen as the inner Transformer module, which establishes relationships between patch blocks by computing the interactions between any two patch blocks within a local block. For example, in a local block of a face, the visual words (patches) corresponding to the eyes are more strongly related to the other words of the eyes and interact less with the mouth region.
For the processing of local blocks, the sequence of patch block embeddings is first converted by linear projection and embedded into the local block:
where the sequence of patch blocks is passed through a fully connected layer FC so that the two dimensions match for addition.
Through the above operation, the patch block embeddings enrich the representation of the image features, and the embedding is then processed with a standard Transformer, namely the outer Transformer module:
which models the relationships between the local block encodings.
Thus, the inputs and outputs of the embodiment of the present invention include the above-described local block encodings and patch block encodings. Finally, the method is applied to short video event detection.
204: Three data sets were used for this study: UCF-101, HMDB51 and a newly established Flickr short video event detection dataset. The newly established dataset is divided into a training set and a test set at a ratio of 10:2. UCF-101 and HMDB51 are two public action recognition datasets: UCF-101 contains 13320 short videos from 101 different action categories, and HMDB51 contains 6766 short videos covering 51 different categories. Throughout training and testing, the performance of the model is evaluated with three metrics: Accuracy (ACC), Average Precision (AP) and Average Recall (AR). The three metrics are defined as follows:
(1) Accuracy (ACC) is a metric for evaluating the accuracy of classification algorithms. It represents the proportion of samples that the classifier classifies correctly among all samples. The higher the accuracy, the better the performance of the classifier. The accuracy is calculated as:
ACC=(TP+TN)/(TP+FP+FN+TN) (11)
where TP is the number of true positives (samples correctly judged as positive), TN the number of true negatives (samples correctly judged as negative), FP the number of false positives (samples incorrectly judged as positive) and FN the number of false negatives (samples incorrectly judged as negative).
The accuracy is very simple to compute: it is the ratio of the number of correctly classified samples to the total number of samples. In a classification problem, a higher accuracy generally indicates better classifier performance, because more positive and negative samples are judged correctly. However, accuracy cannot always fully reflect the performance of a classifier. In particular, when the samples are imbalanced, a classifier that always predicts the negative class can still achieve high accuracy while performing poorly. Therefore, when evaluating model performance, other metrics need to be considered in addition to accuracy.
(2) Average Precision (AP) is a metric for evaluating information retrieval algorithms and can also be used to evaluate the performance of classification algorithms. It represents the average precision over all recall levels, where recall is the proportion of samples correctly classified as positive among all positive samples, and precision is the proportion of samples correctly classified as positive among all samples classified as positive. The higher the average precision, the better the classifier performance. The average precision is calculated as:
AP=∑(P(i)×ΔR(i)) (12)
where P(i) is the precision at the i-th position and ΔR(i) is the difference between the recall at the i-th position and the recall at the (i-1)-th position. Computing the average precision is relatively involved: it requires the precision at all recall levels and the sum of the products of the precision and the change in recall.
(3) Average Recall (AR) is a metric used to evaluate the performance of classification algorithms. It represents the average recall over all precision levels, where precision is the proportion of samples correctly classified as positive among all samples classified as positive, and recall is the proportion of samples correctly classified as positive among all positive samples. The higher the average recall, the better the classifier performance. The average recall is calculated as:
AR=∑(R(i)×ΔP(i)) (13)
where R(i) is the recall at the i-th position and ΔP(i) is the difference between the precision at the i-th position and the precision at the (i-1)-th position. Similar to the average precision, the corresponding recall or precision must be computed at different operating points and the products of the two summed. The average recall is mainly used to evaluate the classifier at different precision levels and can reflect its performance more comprehensively. In applications of the classifier, the embodiment of the invention typically uses multiple thresholds to generate different classification results, and the average recall helps to comprehensively evaluate the classifier under these thresholds and to select the optimal threshold.
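As an illustration of how these metrics can be evaluated, the sketch below computes ACC directly from the confusion counts of equation (11) and AP per equation (12) by sweeping a ranked list of prediction scores and summing precision times the change in recall; the AR sum of equation (13) is analogous with the roles of precision and recall swapped. The toy counts and scores are illustrative only.

```python
import numpy as np

def accuracy(tp, tn, fp, fn):
    """Eq. (11): proportion of correctly classified samples."""
    return (tp + tn) / (tp + tn + fp + fn)

def average_precision(labels, scores):
    """Eq. (12): sum of P(i) * delta R(i) over a score-ranked list of samples."""
    order = np.argsort(scores)[::-1]                     # rank samples by predicted score
    labels = np.asarray(labels)[order]
    tp = np.cumsum(labels)
    precision = tp / np.arange(1, len(labels) + 1)       # P(i)
    recall = tp / labels.sum()                           # R(i)
    delta_r = np.diff(np.concatenate(([0.0], recall)))   # delta R(i)
    return float((precision * delta_r).sum())

print(accuracy(tp=40, tn=45, fp=5, fn=10))                               # 0.85
print(average_precision([1, 0, 1, 1, 0], [0.9, 0.8, 0.7, 0.4, 0.2]))     # approx. 0.81
```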
In summary, a conventional Transformer extracts visual information only at the granularity of the local block division, whereas natural images are highly complex and rich in detail and color information, so it is very important to mine the features of objects at different scales and positions. In order to better process short videos, the embodiment of the invention first converts a short video into a frame-sequence form suitable for Transformer input and proposes a novel Transformer network architecture for visual information processing. Three video subsequences of the original video at different scales are first obtained as input. To further extract features, the embodiment of the invention uniformly divides the input at each scale into local blocks and then further divides the local blocks into smaller patch blocks. The embodiment of the invention explores a new architecture, namely an inner Transformer and an outer Transformer, for processing information within local blocks and feature aggregation between local blocks. The visual information within the local blocks and the visual information within the patch blocks are aggregated to enhance the representation capability by computing the attention of each patch block. Finally, the embodiment of the invention applies the model to the downstream task of short video event detection.
Example 3
A short video event detection device based on multi-scale attention fusion, the device comprising: a processor and a memory, the memory having program instructions stored therein, and the processor invoking the program instructions stored in the memory to cause the device to perform the following method steps:
generating short video subsequences at three scales with sliding windows of different sizes and using them as input;
uniformly dividing the input at each scale into local blocks, further dividing the local blocks into smaller patch blocks using an inner Transformer and an outer Transformer, computing the attention of each patch block, and feeding the attention of each patch block, together with the overall attention of the local block, into the outer Transformer for computation;
wherein the inner Transformer processes visual information within the patch blocks, and the outer Transformer processes visual information that aggregates local-block and patch-block features, which is used for short video event detection.
Wherein the three-scale short video subsequences are expressed as:
where i ∈ {1, 2, 3} and j ∈ {4, 8, 16}.
The input at each scale is uniformly divided into local blocks:
X = [X_1, X_2, ..., X_n]
each local block is further divided into m patch blocks, i.e. a local block consists of m patch blocks:
X_i → [x_{i,1}, x_{i,2}, ..., x_{i,m}].
Feeding the attention of each patch block, together with the overall attention of the local block, into the outer Transformer for computation comprises:
converting the sequence of patch block embeddings by linear projection and embedding it into the local block:
where the sequence of patch blocks is passed through a fully connected layer FC so that the two dimensions match for addition;
the embedding is then processed by the outer Transformer module:
which models the relationships between the local block encodings.
It should be noted that, the device descriptions in the above embodiments correspond to the method descriptions in the embodiments, and the embodiments of the present invention are not described herein in detail.
The processor and the memory may be carried by any device with computing capability, such as a computer, a single-chip microcomputer or a microcontroller; the embodiment of the invention does not limit the type of device, which is selected as needed in practical applications.
The data signals are transmitted between the memory and the processor through the bus, and the embodiments of the present invention will not be described in detail.
The embodiment of the invention does not limit the types of other devices except the types of the devices, so long as the devices can complete the functions.
Those skilled in the art will appreciate that the drawings are schematic representations of only one preferred embodiment, and that the above-described embodiment numbers are merely for illustration purposes and do not represent advantages or disadvantages of the embodiments.
The foregoing description of the preferred embodiments of the invention is not intended to limit the invention to the precise form disclosed, and any such modifications, equivalents, and alternatives falling within the spirit and scope of the invention are intended to be included within the scope of the invention.

Claims (5)

1. A short video event detection method based on multi-scale attention fusion, the method comprising:
generating short video subsequences at three scales with sliding windows of different sizes and using them as input;
uniformly dividing the input at each scale into local blocks, further dividing the local blocks into smaller patch blocks using an inner Transformer and an outer Transformer, computing the attention of each patch block, and feeding the attention of each patch block, together with the overall attention of the local block, into the outer Transformer for computation;
wherein the inner Transformer processes visual information within the patch blocks, and the outer Transformer processes visual information that aggregates local-block and patch-block features, which is used for short video event detection.
2. The short video event detection method based on multi-scale attention fusion according to claim 1, wherein the three-scale short video subsequences are expressed as:
where i ∈ {1, 2, 3} and j ∈ {4, 8, 16}.
3. The short video event detection method based on multi-scale attention fusion according to claim 1, wherein the uniformly dividing the input at each scale into local blocks is:
X = [X_1, X_2, ..., X_n]
each local block is further divided into m patch blocks, i.e. a local block consists of m patch blocks:
X_i → [x_{i,1}, x_{i,2}, ..., x_{i,m}].
4. The short video event detection method based on multi-scale attention fusion according to claim 1, wherein the feeding the attention of each patch block, together with the overall attention of the local block, into the outer Transformer for computation is:
converting the sequence of patch block embeddings by linear projection and embedding it into the local block:
where the sequence of patch blocks is passed through a fully connected layer FC so that the two dimensions match for addition;
the embedding is then processed by the outer Transformer module:
which models the relationships between the local block encodings.
5. A short video event detection device based on multi-scale attention fusion, the device comprising: a processor and a memory, the memory having program instructions stored therein, the processor invoking the program instructions stored in the memory to cause the device to perform the method steps of any one of claims 1-4.
CN202310578964.3A 2023-05-22 2023-05-22 Short video event detection method and device based on multi-scale attention fusion Pending CN116543339A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310578964.3A CN116543339A (en) 2023-05-22 2023-05-22 Short video event detection method and device based on multi-scale attention fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310578964.3A CN116543339A (en) 2023-05-22 2023-05-22 Short video event detection method and device based on multi-scale attention fusion

Publications (1)

Publication Number Publication Date
CN116543339A true CN116543339A (en) 2023-08-04

Family

ID=87454099

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310578964.3A Pending CN116543339A (en) 2023-05-22 2023-05-22 Short video event detection method and device based on multi-scale attention fusion

Country Status (1)

Country Link
CN (1) CN116543339A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117115443A (en) * 2023-08-18 2023-11-24 中南大学 Segmentation method for identifying infrared small targets


Similar Documents

Publication Publication Date Title
Cheng et al. Look, listen, and attend: Co-attention network for self-supervised audio-visual representation learning
CN113936339B (en) Fighting identification method and device based on double-channel cross attention mechanism
Pan et al. Deepfake detection through deep learning
CN109063565B (en) Low-resolution face recognition method and device
WO2020177673A1 (en) Video sequence selection method, computer device and storage medium
CN111428073B (en) Image retrieval method of depth supervision quantization hash
CN111464881B (en) Full-convolution video description generation method based on self-optimization mechanism
CN113642604B (en) Audio-video auxiliary touch signal reconstruction method based on cloud edge cooperation
CN116543339A (en) Short video event detection method and device based on multi-scale attention fusion
CN114048351A (en) Cross-modal text-video retrieval method based on space-time relationship enhancement
Baddar et al. On-the-fly facial expression prediction using lstm encoded appearance-suppressed dynamics
CN115203409A (en) Video emotion classification method based on gating fusion and multitask learning
CN114926742A (en) Loop detection and optimization method based on second-order attention mechanism
Wu et al. Interactive two-stream network across modalities for deepfake detection
CN113689527B (en) Training method of face conversion model and face image conversion method
CN113420179A (en) Semantic reconstruction video description method based on time sequence Gaussian mixture hole convolution
Liu et al. A multimodal approach for multiple-relation extraction in videos
Zou et al. 360° Image Saliency Prediction by Embedding Self-Supervised Proxy Task
CN116933051A (en) Multi-mode emotion recognition method and system for modal missing scene
CN113792167B (en) Cross-media cross-retrieval method based on attention mechanism and modal dependence
CN115346132A (en) Method and device for detecting abnormal events of remote sensing images by multi-modal representation learning
Sun et al. Multimodal micro-video classification based on 3D convolutional neural network
Hammad et al. Characterizing the impact of using features extracted from pre-trained models on the quality of video captioning sequence-to-sequence models
CN117540007B (en) Multi-mode emotion analysis method, system and equipment based on similar mode completion
Yin et al. GAMC: An Unsupervised Method for Fake News Detection using Graph Autoencoder with Masking

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination