CN110807369A - Efficient short video content intelligent classification method based on deep learning and attention mechanism - Google Patents

Efficient short video content intelligent classification method based on deep learning and attention mechanism Download PDF

Info

Publication number
CN110807369A
CN110807369A
Authority
CN
China
Prior art keywords
attention mechanism
unit
dimensional
module
short video
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910952622.7A
Other languages
Chinese (zh)
Other versions
CN110807369B (en)
Inventor
包秀平
袁家斌
陈蓓
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Aeronautics and Astronautics
Original Assignee
Nanjing University of Aeronautics and Astronautics
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Aeronautics and Astronautics filed Critical Nanjing University of Aeronautics and Astronautics
Priority to CN201910952622.7A priority Critical patent/CN110807369B/en
Publication of CN110807369A publication Critical patent/CN110807369A/en
Application granted granted Critical
Publication of CN110807369B publication Critical patent/CN110807369B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/048 Activation functions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • Molecular Biology (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses an efficient intelligent short video content classification method based on deep learning and an attention mechanism, that is, an efficient method for intelligently classifying short videos by content. The core algorithm model consists of a two-dimensional convolutional neural network and a pseudo three-dimensional convolutional neural network connected in series, which extract shallow spatial information and high-dimensional spatial and temporal information respectively; the probability that a video belongs to each category is obtained through a normalized exponential function, and the final predicted classification is derived from these probabilities. The method balances time performance and prediction accuracy, can be used for real-time content supervision and classification of short videos, and its results can serve as a reference for short video recommendation.

Description

Efficient short video content intelligent classification method based on deep learning and attention mechanism
Technical Field
The invention belongs to the technical field of computer vision, and particularly relates to an intelligent short video content classification method based on deep learning and an attention mechanism.
Background
Short video is the fastest-spreading form of Internet content in recent years: content is diverse, the thresholds for production and publication are low, and the sheer volume of short videos makes supervision difficult, so illegal videos such as violent and pornographic content are easily mixed in. By automatically classifying short video content with deep learning, the invention can assist a short video platform in reviewing and supervising videos uploaded by users; the classification results can also serve as a reference factor for short video recommendation, so that related short videos are recommended according to a user's viewing history, improving the competitiveness of the short video platform.
Deep learning has become one of the main approaches to automatic video content classification, but its models are hard to train, have large parameter counts and high time cost, and classify inefficiently, which makes them difficult to apply in practical engineering. In particular, the three-dimensional convolutional neural network can integrate the temporal information of video, unlike the two-dimensional convolutional neural network, but is hard to train and extend because of its training difficulty and demanding hardware requirements. In the prior art, "Zolfaghari M, Singh K, Brox T. ECO: Efficient Convolutional Network for Online Video Understanding [J]. 2018" connects a two-dimensional convolutional neural network and a three-dimensional convolutional neural network in series for real-time classification of videos.
Disclosure of Invention
To solve the problems of automatic video content classification methods in the prior art, the invention provides an efficient intelligent short video content classification method based on deep learning and an attention mechanism.
To achieve this purpose, the invention adopts the following technical scheme:
An efficient intelligent short video content classification method based on deep learning and an attention mechanism comprises the following steps:
step 1, simply preprocessing the original short video and quickly converting the video into pictures using the FFmpeg tool;
step 2, determining the number N of pictures fed to the model according to accuracy and time requirements, uniformly extracting input images from all the pictures at equal intervals to form an ordered segment of input frames, and cropping the input frames to a size of 224 × 224 pixels;
step 3, inputting the pictures processed in step 2 into a two-dimensional convolutional neural network with an attention mechanism, outputting shallow feature maps, and stacking the shallow feature maps in time and channel order to form a feature map sequence X carrying temporal and spatial information;
step 4, inputting the output of step 3 into a pseudo three-dimensional convolutional neural network with an attention mechanism to learn temporal information and high-dimensional spatial information; the pseudo three-dimensional convolutional neural network comprises a plurality of unit modules arranged in sequence, each unit module comprising a plurality of convolutional layers arranged in sequence and an attention mechanism module placed after the last convolutional layer; the attention mechanism module recalibrates the temporal and spatial information to obtain the weight of every channel in the unit module, multiplies each channel weight by the output of the last convolutional layer to obtain the output of the attention mechanism module in the unit module, and passes this output to the next unit module; the attention mechanism module of the last unit module outputs the high-dimensional features;
and step 5, inputting the high-dimensional features obtained in step 4 into a fully connected layer to obtain the probability that the video belongs to each category, and deriving the final predicted classification from these probabilities.
The feature map sequence with temporal and spatial information output by step 3 is

X = [x_c^d]

wherein: X denotes the sequence of stacked feature maps output by step 3, x denotes a unit feature map in X, the subscript c indexes the channels, the superscript d indexes the time dimension, and x_c^d denotes the unit feature map of channel c with time dimension d.
Further, in step 4, the convolutional layer structure is obtained by taking a classical network and modifying its convolution kernels into pseudo three-dimensional convolution kernels. Available classical networks include the Inception network, the residual network (ResNet), and the like.
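For illustration, one common way to obtain such a pseudo three-dimensional kernel is to factorize a full 3 × 3 × 3 kernel into a 1 × 3 × 3 spatial convolution followed by a 3 × 1 × 1 temporal convolution; the patent does not spell out its exact factorization, so the following PyTorch sketch and its module names are assumptions:

```python
import torch
import torch.nn as nn

class PseudoConv3d(nn.Module):
    """Factorizes a full 3x3x3 convolution into a 1x3x3 spatial
    convolution followed by a 3x1x1 temporal convolution."""
    def __init__(self, in_channels, out_channels):
        super().__init__()
        # Spatial convolution: operates on H and W only.
        self.spatial = nn.Conv3d(in_channels, out_channels,
                                 kernel_size=(1, 3, 3), padding=(0, 1, 1))
        # Temporal convolution: operates on the time dimension D only.
        self.temporal = nn.Conv3d(out_channels, out_channels,
                                  kernel_size=(3, 1, 1), padding=(1, 0, 0))

    def forward(self, x):  # x: (batch, C, D, H, W)
        return self.temporal(self.spatial(x))
```

Replacing each two-dimensional kernel of a classical block with such a factorized pair keeps the parameter count well below that of a full three-dimensional kernel, which is the efficiency motivation stated above.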
Further, the output U of each attention mechanism module in step 4 is expressed as

U = [u_c^d]

wherein: u denotes a unit feature map in U, the subscript c indexes the channels, and the superscript d indexes the time dimension. The calculation process of each attention mechanism module is:

z_c = (1 / (D · W · H)) · Σ_{k=1..D} Σ_{i=1..W} Σ_{j=1..H} x_c(i, j, k)

Z = [z_1, z_2, ..., z_c]

s = σ(δ(Z, W_1), W_2)

s = [s_1, s_2, ..., s_c]

u_c^d = s_c · x_c^d

wherein D is the time dimension of the feature sequence; W and H are the width and height of each feature map; i, j, k index the spatial horizontal (width), spatial vertical (height), and time dimensions of the image; x_c(i, j, k) is the pixel with index i, j, k in the unit feature map of channel c; z_c is the global mean of the unit feature map x_c of channel c, and Z = [z_1, z_2, ..., z_c] collects the global means of all channels; s holds the weights of all channels in a unit module and s_c is the weight of channel c; W_1 and W_2 are the parameters of two fully connected layers; δ denotes the ReLU activation function; and σ denotes the Sigmoid activation function.
Further, in step 5, the category with the highest predicted probability is selected as the video classification label, or all predicted categories whose probability exceeds a threshold are selected as video labels.
Compared with the prior art, the invention has the following beneficial effects:
the method combines the characteristics of the two-dimensional convolutional neural network and the three-dimensional convolutional neural network, adopts a mode of connecting the two-dimensional convolutional neural network and the pseudo three-dimensional convolutional network in series, and can improve the prediction accuracy and the model robustness, increase the attention module, improve the model performance and give consideration to the prediction accuracy.
Drawings
FIG. 1 is a data transmission flow diagram of the present invention;
FIG. 2 is a diagram of the overall framework of the model of the present invention;
FIG. 3 is a detailed diagram of a two-dimensional convolutional neural network layer incorporating an attention mechanism in the present invention;
FIG. 4 is a detailed view of a pseudo three-dimensional convolutional neural network layer incorporating an attention mechanism in the present invention;
fig. 5 is a flow chart of the present invention.
Detailed Description
The invention will be further elucidated with reference to specific embodiments.
As shown in fig. 1, the overall network framework is a series connection of two convolutional neural networks, and finally the prediction probability of each category is output.
As shown in fig. 2, the specific process is as follows: first, frames are uniformly extracted from the short video, the number of extracted frame images being set to N: all video frames are divided into N parts, one frame is randomly extracted from each part, and the frames are arranged in time order and fed into the two-dimensional convolutional neural network. The attention mechanism adopts the Squeeze-and-Excitation module proposed in J. Hu, L. Shen, and G. Sun, "Squeeze-and-excitation networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 7132-7141.
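For illustration, a minimal sketch of the segment-wise random sampling just described (the function name is an assumption):

```python
import random

def sample_one_per_segment(frames, n):
    """Divide the list of video frames into n equal parts and randomly
    draw one frame from each part, preserving temporal order."""
    assert len(frames) >= n, "need at least n frames"
    seg = len(frames) / n
    return [frames[random.randrange(int(i * seg), int((i + 1) * seg))]
            for i in range(n)]
```

Drawing one frame at random per segment, rather than at fixed offsets, gives the model slightly different views of the same video across training epochs.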
As shown in fig. 3, the details of the two-dimensional convolutional neural network layer are given, including a detailed network structure diagram and convolution kernel parameters; the two-dimensional convolutional neural network here adopts an Inception network as an example. The two-dimensional convolutional neural network layer finally outputs shallow features. As shown in fig. 2, the sequence of shallow feature maps ordered by channel and time is input to the pseudo three-dimensional convolutional neural network layer.
As shown in fig. 4, in this embodiment the pseudo three-dimensional convolutional network adopts a residual network as an example, with its convolution kernels changed into pseudo three-dimensional kernels. The attention mechanism adopts a Squeeze-and-Excitation attention module optimized for the three-dimensional network. The resulting high-dimensional features are input into two consecutive fully connected layers, the prediction probability of each category is finally output with a Sigmoid function, and prediction ends.
As shown in fig. 5, in particular, the efficient intelligent short video content classification method based on deep learning and an attention mechanism comprises the following steps:
step 1, simply preprocessing the original short video and quickly converting the video into pictures using the FFmpeg tool;
step 2, determining the number N of pictures fed to the model according to accuracy and time requirements, uniformly extracting input images from all the pictures at equal intervals to form an ordered segment of input frames, and cropping the input frames to a size of 224 × 224 pixels;
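For illustration, a minimal sketch of steps 1 and 2, assuming the ffmpeg binary is on the PATH; the output naming pattern, helper names, and center-crop choice are assumptions, not taken from the patent:

```python
import subprocess
from pathlib import Path
from PIL import Image

def video_to_frames(video_path, out_dir):
    """Step 1: quickly decode a short video into numbered JPEG frames with FFmpeg."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(["ffmpeg", "-i", str(video_path),
                    str(Path(out_dir) / "frame_%05d.jpg")], check=True)

def pick_uniform(frame_paths, n):
    """Step 2: pick N frame paths at equal intervals, keeping temporal order."""
    step = len(frame_paths) / n
    return [frame_paths[int(i * step)] for i in range(n)]

def center_crop_224(path):
    """Crop a frame to 224 x 224 pixels (assumes the frame is at least that large)."""
    img = Image.open(path)
    w, h = img.size
    left, top = (w - 224) // 2, (h - 224) // 2
    return img.crop((left, top, left + 224, top + 224))
```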
step 3, inputting the pictures processed in step 2 into a two-dimensional convolutional neural network with an attention mechanism, outputting shallow feature maps, and stacking the shallow feature maps in time and channel order to form a feature map sequence X carrying temporal and spatial information;
specifically, the feature map sequence with temporal and spatial information output by step 3 is

X = [x_c^d]

wherein: X denotes the sequence of stacked feature maps output by step 3, x denotes a unit feature map in X, the subscript c indexes the channels, the superscript d indexes the time dimension, and x_c^d denotes the unit feature map of channel c with time dimension d.
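For illustration, a minimal PyTorch sketch of how the per-frame outputs of the two-dimensional backbone can be stacked in time and channel order to form the sequence X; the tiny backbone below is a stand-in assumption for the attention-equipped Inception network of step 3:

```python
import torch
import torch.nn as nn

# Placeholder 2D backbone standing in for the attention-equipped Inception
# network of step 3; it maps (B, 3, 224, 224) -> (B, C, H', W').
backbone2d = nn.Sequential(
    nn.Conv2d(3, 96, kernel_size=7, stride=2, padding=3),
    nn.ReLU(inplace=True),
    nn.MaxPool2d(3, stride=2, padding=1),
)

frames = torch.randn(8, 4, 3, 224, 224)        # (batch B, time D=4, RGB, H, W)
b, d = frames.shape[:2]
feats = backbone2d(frames.flatten(0, 1))       # run every frame: (B*D, C, H', W')
feats = feats.view(b, d, *feats.shape[1:])     # restore time axis: (B, D, C, H', W')
x = feats.permute(0, 2, 1, 3, 4).contiguous()  # sequence X: (B, C, D, H', W')
```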
Step 4, inputting the output of step 3 into a pseudo three-dimensional convolutional neural network with an attention mechanism to learn temporal information and high-dimensional spatial information; the pseudo three-dimensional convolutional neural network comprises a plurality of unit modules arranged in sequence, each unit module comprising a plurality of convolutional layers arranged in sequence and an attention mechanism module placed after the last convolutional layer; the attention mechanism module recalibrates the temporal and spatial information to obtain the weight of every channel in the unit module, multiplies each channel weight by the output of the last convolutional layer to obtain the output of the attention mechanism module in the unit module, and passes this output to the next unit module; the attention mechanism module of the last unit module outputs the high-dimensional features;
specifically, the output U of each attention mechanism module in step 4 is expressed as

U = [u_c^d]

wherein: u denotes a unit feature map in U, the subscript c indexes the channels, and the superscript d indexes the time dimension. The calculation process of each attention mechanism module is:

z_c = (1 / (D · W · H)) · Σ_{k=1..D} Σ_{i=1..W} Σ_{j=1..H} x_c(i, j, k)

Z = [z_1, z_2, ..., z_c]

s = σ(δ(Z, W_1), W_2)

s = [s_1, s_2, ..., s_c]

u_c^d = s_c · x_c^d

wherein D is the time dimension of the feature sequence; W and H are the width and height of each feature map; i, j, k index the spatial horizontal (width), spatial vertical (height), and time dimensions of the image; x_c(i, j, k) is the pixel with index i, j, k in the unit feature map of channel c; z_c is the global mean of the unit feature map x_c of channel c, and Z = [z_1, z_2, ..., z_c] collects the global means of all channels; s holds the weights of all channels in a unit module and s_c is the weight of channel c; W_1 and W_2 are the parameters of two fully connected layers; δ denotes the ReLU activation function; and σ denotes the Sigmoid activation function.
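The recalibration above is a Squeeze-and-Excitation block extended over the time dimension. A minimal PyTorch sketch follows, with the reduction ratio r as an assumed hyperparameter:

```python
import torch
import torch.nn as nn

class SqueezeExcite3d(nn.Module):
    """Channel attention over a (B, C, D, H, W) feature sequence:
    z_c = global mean over D, H, W; s = Sigmoid(W2 . ReLU(W1 . Z));
    u_c^d = s_c * x_c^d."""
    def __init__(self, channels, r=16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool3d(1)   # squeeze: per-channel global mean z_c
        self.fc = nn.Sequential(              # excitation: channel weights s
            nn.Linear(channels, channels // r),
            nn.ReLU(inplace=True),
            nn.Linear(channels // r, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):                     # x: (B, C, D, H, W)
        b, c = x.shape[:2]
        z = self.pool(x).view(b, c)           # Z = [z_1, ..., z_c]
        s = self.fc(z).view(b, c, 1, 1, 1)    # weights in (0, 1), one per channel
        return x * s                          # recalibrated output U
```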
And step 5, inputting the high-dimensional features obtained in step 4 into the fully connected layers to obtain the probability that the video belongs to each category, and deriving the final predicted classification from these probabilities; specifically, the category with the highest predicted probability is selected as the video classification label, or all predicted categories whose probability exceeds a threshold are selected as video labels.
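For illustration, a minimal sketch of the step 5 readout; the abstract obtains probabilities with a normalized exponential function (softmax) while this embodiment uses a Sigmoid, so both readouts are shown, and all layer sizes are illustrative assumptions:

```python
import torch
import torch.nn as nn

num_classes, feat_dim, hidden = 101, 2048, 512  # illustrative sizes
head = nn.Sequential(                           # two consecutive fully connected layers
    nn.Linear(feat_dim, hidden),
    nn.ReLU(inplace=True),
    nn.Linear(hidden, num_classes),
)

features = torch.randn(8, feat_dim)             # pooled high-dimensional features
logits = head(features)

probs = torch.softmax(logits, dim=1)            # normalized exponential function
top1 = probs.argmax(dim=1)                      # single label: highest probability
multi = torch.sigmoid(logits) > 0.5             # multi-label: all classes above a threshold
```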
The invention thus provides an efficient method for intelligently classifying short videos by content. The core algorithm model consists of a two-dimensional convolutional neural network and a pseudo three-dimensional convolutional neural network connected in series, which extract shallow spatial information and high-dimensional spatial and temporal information respectively; the probability that a video belongs to each category is obtained through a normalized exponential function, and the final predicted classification is derived from these probabilities. The method balances time performance and prediction accuracy, can be used for real-time content supervision and classification of short videos, and its results can serve as a reference for short video recommendation.
The above description covers only preferred embodiments of the present invention. It should be noted that various modifications and adaptations will be apparent to those skilled in the art without departing from the principles of the invention, and these are intended to fall within the scope of the invention.

Claims (4)

1. An efficient intelligent short video content classification method based on deep learning and an attention mechanism, characterized by comprising the following steps:
step 1, simply preprocessing the original short video and quickly converting the video into pictures using the FFmpeg tool;
step 2, determining the number N of pictures fed to the model according to accuracy and time requirements, uniformly extracting input images from all the pictures at equal intervals to form an ordered segment of input frames, and cropping the input frames to a size of 224 × 224 pixels;
step 3, inputting the pictures processed in step 2 into a two-dimensional convolutional neural network with an attention mechanism, outputting shallow feature maps, and stacking the shallow feature maps in time and channel order to form a feature map sequence X carrying temporal and spatial information;
step 4, inputting the output of step 3 into a pseudo three-dimensional convolutional neural network with an attention mechanism to learn temporal information and high-dimensional spatial information; the pseudo three-dimensional convolutional neural network comprises a plurality of unit modules arranged in sequence, each unit module comprising a plurality of convolutional layers arranged in sequence and an attention mechanism module placed after the last convolutional layer; the attention mechanism module recalibrates the temporal and spatial information to obtain the weight of every channel in the unit module, multiplies each channel weight by the output of the last convolutional layer to obtain the output of the attention mechanism module in the unit module, and passes this output to the next unit module; the attention mechanism module of the last unit module outputs the high-dimensional features;
and step 5, inputting the high-dimensional features obtained in step 4 into two consecutive fully connected layers to obtain the probability that the video belongs to each category, and deriving the final predicted classification from these probabilities.
2. The efficient intelligent short video content classification method based on deep learning and an attention mechanism according to claim 1, wherein the feature map sequence with temporal and spatial information output by step 3 is

X = [x_c^d]

wherein: X denotes the sequence of stacked feature maps output by step 3, x denotes a unit feature map in X, the subscript c indexes the channels, the superscript d indexes the time dimension, and x_c^d denotes the unit feature map of channel c with time dimension d.
3. The efficient intelligent short video content classification method based on deep learning and an attention mechanism according to claim 1, wherein the output U of each attention mechanism module in step 4 is expressed as

U = [u_c^d]

wherein: u denotes a unit feature map in U, the subscript c indexes the channels, and the superscript d indexes the time dimension; the calculation process of each attention mechanism module is:

z_c = (1 / (D · W · H)) · Σ_{k=1..D} Σ_{i=1..W} Σ_{j=1..H} x_c(i, j, k)

Z = [z_1, z_2, ..., z_c]

s = σ(δ(Z, W_1), W_2)

s = [s_1, s_2, ..., s_c]

u_c^d = s_c · x_c^d

wherein D is the time dimension of the feature sequence; W and H are the width and height of each feature map; i, j, k index the spatial horizontal, spatial vertical, and time dimensions of each feature map; x_c(i, j, k) is the pixel with index i, j, k in the unit feature map of channel c; z_c is the global mean of the unit feature map x_c of channel c, and Z = [z_1, z_2, ..., z_c] collects the global means of all channels; s holds the weights of all channels in a unit module and s_c is the weight of channel c; W_1 and W_2 are the parameters of the two consecutive fully connected layers; δ denotes the ReLU activation function; and σ denotes the Sigmoid activation function.
4. The efficient intelligent short video content classification method based on deep learning and an attention mechanism according to claim 1, wherein in step 5 the category with the highest predicted probability is selected as the video classification label, or all predicted categories whose probability exceeds a threshold are selected as video labels.
CN201910952622.7A 2019-10-09 2019-10-09 Short video content intelligent classification method based on deep learning and attention mechanism Active CN110807369B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910952622.7A CN110807369B (en) 2019-10-09 2019-10-09 Short video content intelligent classification method based on deep learning and attention mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910952622.7A CN110807369B (en) 2019-10-09 2019-10-09 Short video content intelligent classification method based on deep learning and attention mechanism

Publications (2)

Publication Number Publication Date
CN110807369A true CN110807369A (en) 2020-02-18
CN110807369B CN110807369B (en) 2024-02-20

Family

ID=69487993

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910952622.7A Active CN110807369B (en) 2019-10-09 2019-10-09 Short video content intelligent classification method based on deep learning and attention mechanism

Country Status (1)

Country Link
CN (1) CN110807369B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111259874A (en) * 2020-05-06 2020-06-09 成都派沃智通科技有限公司 Campus security video monitoring method based on deep learning
CN112948708A (en) * 2021-03-05 2021-06-11 清华大学深圳国际研究生院 Short video recommendation method

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109670453A (en) * 2018-12-20 2019-04-23 杭州东信北邮信息技术有限公司 A method of extracting short video subject
CN110020682A (en) * 2019-03-29 2019-07-16 北京工商大学 A kind of attention mechanism relationship comparison net model methodology based on small-sample learning
CN110032926A (en) * 2019-02-22 2019-07-19 哈尔滨工业大学(深圳) A kind of video classification methods and equipment based on deep learning
CN110175580A (en) * 2019-05-29 2019-08-27 复旦大学 A kind of video behavior recognition methods based on timing cause and effect convolutional network
US10402978B1 (en) * 2019-01-25 2019-09-03 StradVision, Inc. Method for detecting pseudo-3D bounding box based on CNN capable of converting modes according to poses of objects using instance segmentation and device using the same

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109670453A (en) * 2018-12-20 2019-04-23 杭州东信北邮信息技术有限公司 A method of extracting short video subject
US10402978B1 (en) * 2019-01-25 2019-09-03 StradVision, Inc. Method for detecting pseudo-3D bounding box based on CNN capable of converting modes according to poses of objects using instance segmentation and device using the same
CN110032926A (en) * 2019-02-22 2019-07-19 哈尔滨工业大学(深圳) A kind of video classification methods and equipment based on deep learning
CN110020682A (en) * 2019-03-29 2019-07-16 北京工商大学 A kind of attention mechanism relationship comparison net model methodology based on small-sample learning
CN110175580A (en) * 2019-05-29 2019-08-27 复旦大学 A kind of video behavior recognition methods based on timing cause and effect convolutional network

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
MOHAMMADREZA ZOLFAGHARI et al.: "Efficient Convolutional Network for Online Video Understanding", pages 1-24 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111259874A (en) * 2020-05-06 2020-06-09 成都派沃智通科技有限公司 Campus security video monitoring method based on deep learning
CN112948708A (en) * 2021-03-05 2021-06-11 清华大学深圳国际研究生院 Short video recommendation method

Also Published As

Publication number Publication date
CN110807369B (en) 2024-02-20

Similar Documents

Publication Publication Date Title
CN112001339B (en) Pedestrian social distance real-time monitoring method based on YOLO v4
Remez et al. Class-aware fully convolutional Gaussian and Poisson denoising
CN110363716B (en) High-quality reconstruction method for generating confrontation network composite degraded image based on conditions
WO2021042828A1 (en) Neural network model compression method and apparatus, and storage medium and chip
CN104113789B (en) On-line video abstraction generation method based on depth learning
CN111325165B (en) Urban remote sensing image scene classification method considering spatial relationship information
CN112653899B (en) Network live broadcast video feature extraction method based on joint attention ResNeSt under complex scene
CN111163338B (en) Video definition evaluation model training method, video recommendation method and related device
CN112836646B (en) Video pedestrian re-identification method based on channel attention mechanism and application
CN109218134B (en) Test case generation system based on neural style migration
CN105718932A (en) Colorful image classification method based on fruit fly optimization algorithm and smooth twinborn support vector machine and system thereof
CN110807369B (en) Short video content intelligent classification method based on deep learning and attention mechanism
CN111160356A (en) Image segmentation and classification method and device
CN109062811B (en) Test case generation method based on neural style migration
CN112183240A (en) Double-current convolution behavior identification method based on 3D time stream and parallel space stream
CN111476133A (en) Unmanned driving-oriented foreground and background codec network target extraction method
CN116580184A (en) YOLOv 7-based lightweight model
CN113420179B (en) Semantic reconstruction video description method based on time sequence Gaussian mixture hole convolution
CN107729821B (en) Video summarization method based on one-dimensional sequence learning
CN110852199A (en) Foreground extraction method based on double-frame coding and decoding model
CN110490053B (en) Human face attribute identification method based on trinocular camera depth estimation
CN112132207A (en) Target detection neural network construction method based on multi-branch feature mapping
CN116229323A (en) Human body behavior recognition method based on improved depth residual error network
CN111652238A (en) Multi-model integration method and system
EP4164221A1 (en) Processing image data

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant