CN110807369A - Efficient short video content intelligent classification method based on deep learning and attention mechanism - Google Patents

Efficient short video content intelligent classification method based on deep learning and attention mechanism Download PDF

Info

Publication number
CN110807369A
CN110807369A
Authority
CN
China
Prior art keywords
attention mechanism
unit
dimensional
module
short video
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910952622.7A
Other languages
Chinese (zh)
Other versions
CN110807369B (en)
Inventor
包秀平
袁家斌
陈蓓
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Aeronautics and Astronautics
Original Assignee
Nanjing University of Aeronautics and Astronautics
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Aeronautics and Astronautics filed Critical Nanjing University of Aeronautics and Astronautics
Priority to CN201910952622.7A priority Critical patent/CN110807369B/en
Publication of CN110807369A publication Critical patent/CN110807369A/en
Application granted granted Critical
Publication of CN110807369B publication Critical patent/CN110807369B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/048 Activation functions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • Molecular Biology (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses an efficient intelligent short video content classification method based on deep learning and an attention mechanism, that is, an efficient method for intelligently classifying short videos by content. The core algorithm model consists of a two-dimensional convolutional neural network and a pseudo three-dimensional convolutional neural network connected in series, which extract shallow spatial information and high-dimensional spatial and temporal information respectively; the probability that a video belongs to each category is obtained through a normalized exponential function, and the final predicted classification is derived from these probabilities. The method balances time performance and prediction accuracy, can be used for real-time content supervision and classification of short videos, and its results can serve as a reference for short video recommendation.

Description

Efficient short video content intelligent classification method based on deep learning and attention mechanism
Technical Field
The invention belongs to the technical field of computer vision, and particularly relates to an intelligent short video content classification method based on deep learning and an attention mechanism.
Background
Short video is the fastest-spreading form of Internet content in recent years: content is diverse, the thresholds for production and publication are low, and the sheer volume of short videos makes supervision difficult, so illegal videos such as violent and pornographic content are easily mixed in. By automatically classifying short video content with deep learning, the invention can assist a short video platform in reviewing and supervising videos uploaded by users; the classification results can also serve as a reference factor for short video recommendation, so that related short videos are recommended according to a user's viewing history, improving the competitiveness of the short video platform.
Deep learning has become one of the main approaches to automatic video content classification, but its models are hard to train, have large parameter counts and high time cost, and classify inefficiently, which makes them difficult to apply in practical engineering. In particular, the three-dimensional convolutional neural network can integrate the temporal information of video, unlike the two-dimensional convolutional neural network, but is hard to train and extend because of its training difficulty and demanding hardware requirements. In the prior art, "Zolfaghari M, Singh K, Brox T. ECO: Efficient Convolutional Network for Online Video Understanding [J]. 2018" connects a two-dimensional convolutional neural network and a three-dimensional convolutional neural network in series for real-time classification of videos.
Disclosure of Invention
To solve the problems of automatic video content classification methods in the prior art, the invention provides an efficient intelligent short video content classification method based on deep learning and an attention mechanism.
To achieve this purpose, the invention adopts the following technical scheme:
An efficient intelligent short video content classification method based on deep learning and an attention mechanism comprises the following steps:
step 1, simply preprocessing the original short video and quickly converting the video into pictures using the FFmpeg tool;
step 2, determining the number N of pictures fed to the model according to accuracy and time requirements, uniformly extracting input images from all the pictures at equal intervals to form an ordered segment of input frames, and cropping the input frames to a size of 224 × 224 pixels;
step 3, inputting the pictures processed in step 2 into a two-dimensional convolutional neural network with an attention mechanism, outputting shallow feature maps, and stacking the shallow feature maps in time and channel order to form a feature map sequence X carrying temporal and spatial information;
step 4, inputting the output of step 3 into a pseudo three-dimensional convolutional neural network with an attention mechanism to learn temporal information and high-dimensional spatial information; the pseudo three-dimensional convolutional neural network comprises a plurality of unit modules arranged in sequence, each unit module comprising a plurality of convolutional layers arranged in sequence and an attention mechanism module placed after the last convolutional layer; the attention mechanism module recalibrates the temporal and spatial information to obtain the weight of every channel in the unit module, multiplies each channel weight by the output of the last convolutional layer to obtain the output of the attention mechanism module in the unit module, and passes this output to the next unit module; the attention mechanism module of the last unit module outputs the high-dimensional features;
and step 5, inputting the high-dimensional features obtained in step 4 into a fully connected layer to obtain the probability that the video belongs to each category, and deriving the final predicted classification from these probabilities.
The feature map sequence with temporal and spatial information output by step 3 is

X = [x_c^d]

wherein: X denotes the sequence of stacked feature maps output by step 3, x denotes a unit feature map in X, the subscript c indexes the channels, the superscript d indexes the time dimension, and x_c^d denotes the unit feature map of channel c with time dimension d.
Further, in step 4, the convolutional layer structure is obtained by taking a classical network and modifying its convolution kernels into pseudo three-dimensional convolution kernels. Available classical networks include the Inception network, the residual network (ResNet), and the like.
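For illustration, one common way to obtain such a pseudo three-dimensional kernel is to factorize a full 3 × 3 × 3 kernel into a 1 × 3 × 3 spatial convolution followed by a 3 × 1 × 1 temporal convolution; the patent does not spell out its exact factorization, so the following PyTorch sketch and its module names are assumptions:

```python
import torch
import torch.nn as nn

class PseudoConv3d(nn.Module):
    """Factorizes a full 3x3x3 convolution into a 1x3x3 spatial
    convolution followed by a 3x1x1 temporal convolution."""
    def __init__(self, in_channels, out_channels):
        super().__init__()
        # Spatial convolution: operates on H and W only.
        self.spatial = nn.Conv3d(in_channels, out_channels,
                                 kernel_size=(1, 3, 3), padding=(0, 1, 1))
        # Temporal convolution: operates on the time dimension D only.
        self.temporal = nn.Conv3d(out_channels, out_channels,
                                  kernel_size=(3, 1, 1), padding=(1, 0, 0))

    def forward(self, x):  # x: (batch, C, D, H, W)
        return self.temporal(self.spatial(x))
```

Replacing each two-dimensional kernel of a classical block with such a factorized pair keeps the parameter count well below that of a full three-dimensional kernel, which is the efficiency motivation stated above.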
Further, the output U of each attention mechanism module in step 4 is expressed as

U = [u_c^d]

wherein: u denotes a unit feature map in U, the subscript c indexes the channels, and the superscript d indexes the time dimension. The calculation process of each attention mechanism module is:

z_c = (1 / (D · W · H)) · Σ_{k=1..D} Σ_{i=1..W} Σ_{j=1..H} x_c(i, j, k)

Z = [z_1, z_2, ..., z_c]

s = σ(δ(Z, W_1), W_2)

s = [s_1, s_2, ..., s_c]

u_c^d = s_c · x_c^d

wherein D is the time dimension of the feature sequence; W and H are the width and height of each feature map; i, j, k index the spatial horizontal (width), spatial vertical (height), and time dimensions of the image; x_c(i, j, k) is the pixel with index i, j, k in the unit feature map of channel c; z_c is the global mean of the unit feature map x_c of channel c, and Z = [z_1, z_2, ..., z_c] collects the global means of all channels; s holds the weights of all channels in a unit module and s_c is the weight of channel c; W_1 and W_2 are the parameters of two fully connected layers; δ denotes the ReLU activation function; and σ denotes the Sigmoid activation function.
Further, in step 5, the category with the highest predicted probability is selected as the video classification label, or all predicted categories whose probability exceeds a threshold are selected as video labels.
Compared with the prior art, the invention has the following beneficial effects:
the method combines the characteristics of the two-dimensional convolutional neural network and the three-dimensional convolutional neural network, adopts a mode of connecting the two-dimensional convolutional neural network and the pseudo three-dimensional convolutional network in series, and can improve the prediction accuracy and the model robustness, increase the attention module, improve the model performance and give consideration to the prediction accuracy.
Drawings
FIG. 1 is a data transmission flow diagram of the present invention;
FIG. 2 is a diagram of the overall framework of the model of the present invention;
FIG. 3 is a detailed diagram of a two-dimensional convolutional neural network layer incorporating an attention mechanism in the present invention;
FIG. 4 is a detailed view of a pseudo three-dimensional convolutional neural network layer incorporating an attention mechanism in the present invention;
fig. 5 is a flow chart of the present invention.
Detailed Description
The invention will be further elucidated with reference to specific embodiments.
As shown in fig. 1, the overall network framework is a series connection of two convolutional neural networks, and finally the prediction probability of each category is output.
As shown in fig. 2, the specific process is as follows: first, frames are uniformly extracted from the short video, the number of extracted frame images being set to N: all video frames are divided into N parts, one frame is randomly extracted from each part, and the frames are arranged in time order and fed into the two-dimensional convolutional neural network. The attention mechanism adopts the Squeeze-and-Excitation module proposed in J. Hu, L. Shen, and G. Sun, "Squeeze-and-excitation networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 7132-7141.
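For illustration, a minimal sketch of the segment-wise random sampling just described (the function name is an assumption):

```python
import random

def sample_one_per_segment(frames, n):
    """Divide the list of video frames into n equal parts and randomly
    draw one frame from each part, preserving temporal order."""
    assert len(frames) >= n, "need at least n frames"
    seg = len(frames) / n
    return [frames[random.randrange(int(i * seg), int((i + 1) * seg))]
            for i in range(n)]
```

Drawing one frame at random per segment, rather than at fixed offsets, gives the model slightly different views of the same video across training epochs.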
As shown in fig. 3, the details of the two-dimensional convolutional neural network layer are given, including a detailed network structure diagram and convolution kernel parameters; the two-dimensional convolutional neural network here adopts an Inception network as an example. The two-dimensional convolutional neural network layer finally outputs shallow features. As shown in fig. 2, the sequence of shallow feature maps ordered by channel and time is input to the pseudo three-dimensional convolutional neural network layer.
As shown in fig. 4, in this embodiment the pseudo three-dimensional convolutional network adopts a residual network as an example, with its convolution kernels changed into pseudo three-dimensional kernels. The attention mechanism adopts a Squeeze-and-Excitation attention module optimized for the three-dimensional network. The resulting high-dimensional features are input into two consecutive fully connected layers, the prediction probability of each category is finally output with a Sigmoid function, and prediction ends.
As shown in fig. 5, in particular, the efficient intelligent short video content classification method based on deep learning and an attention mechanism comprises the following steps:
step 1, simply preprocessing the original short video and quickly converting the video into pictures using the FFmpeg tool;
step 2, determining the number N of pictures fed to the model according to accuracy and time requirements, uniformly extracting input images from all the pictures at equal intervals to form an ordered segment of input frames, and cropping the input frames to a size of 224 × 224 pixels;
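For illustration, a minimal sketch of steps 1 and 2, assuming the ffmpeg binary is on the PATH; the output naming pattern, helper names, and center-crop choice are assumptions, not taken from the patent:

```python
import subprocess
from pathlib import Path
from PIL import Image

def video_to_frames(video_path, out_dir):
    """Step 1: quickly decode a short video into numbered JPEG frames with FFmpeg."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(["ffmpeg", "-i", str(video_path),
                    str(Path(out_dir) / "frame_%05d.jpg")], check=True)

def pick_uniform(frame_paths, n):
    """Step 2: pick N frame paths at equal intervals, keeping temporal order."""
    step = len(frame_paths) / n
    return [frame_paths[int(i * step)] for i in range(n)]

def center_crop_224(path):
    """Crop a frame to 224 x 224 pixels (assumes the frame is at least that large)."""
    img = Image.open(path)
    w, h = img.size
    left, top = (w - 224) // 2, (h - 224) // 2
    return img.crop((left, top, left + 224, top + 224))
```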
step 3, inputting the pictures processed in step 2 into a two-dimensional convolutional neural network with an attention mechanism, outputting shallow feature maps, and stacking the shallow feature maps in time and channel order to form a feature map sequence X carrying temporal and spatial information;
specifically, the feature map sequence with temporal and spatial information output by step 3 is

X = [x_c^d]

wherein: X denotes the sequence of stacked feature maps output by step 3, x denotes a unit feature map in X, the subscript c indexes the channels, the superscript d indexes the time dimension, and x_c^d denotes the unit feature map of channel c with time dimension d.
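For illustration, a minimal PyTorch sketch of how the per-frame outputs of the two-dimensional backbone can be stacked in time and channel order to form the sequence X; the tiny backbone below is a stand-in assumption for the attention-equipped Inception network of step 3:

```python
import torch
import torch.nn as nn

# Placeholder 2D backbone standing in for the attention-equipped Inception
# network of step 3; it maps (B, 3, 224, 224) -> (B, C, H', W').
backbone2d = nn.Sequential(
    nn.Conv2d(3, 96, kernel_size=7, stride=2, padding=3),
    nn.ReLU(inplace=True),
    nn.MaxPool2d(3, stride=2, padding=1),
)

frames = torch.randn(8, 4, 3, 224, 224)        # (batch B, time D=4, RGB, H, W)
b, d = frames.shape[:2]
feats = backbone2d(frames.flatten(0, 1))       # run every frame: (B*D, C, H', W')
feats = feats.view(b, d, *feats.shape[1:])     # restore time axis: (B, D, C, H', W')
x = feats.permute(0, 2, 1, 3, 4).contiguous()  # sequence X: (B, C, D, H', W')
```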
Step 4, inputting the output of step 3 into a pseudo three-dimensional convolutional neural network with an attention mechanism to learn temporal information and high-dimensional spatial information; the pseudo three-dimensional convolutional neural network comprises a plurality of unit modules arranged in sequence, each unit module comprising a plurality of convolutional layers arranged in sequence and an attention mechanism module placed after the last convolutional layer; the attention mechanism module recalibrates the temporal and spatial information to obtain the weight of every channel in the unit module, multiplies each channel weight by the output of the last convolutional layer to obtain the output of the attention mechanism module in the unit module, and passes this output to the next unit module; the attention mechanism module of the last unit module outputs the high-dimensional features;
specifically, the output U of each attention mechanism module in step 4 is expressed as

U = [u_c^d]

wherein: u denotes a unit feature map in U, the subscript c indexes the channels, and the superscript d indexes the time dimension. The calculation process of each attention mechanism module is:

z_c = (1 / (D · W · H)) · Σ_{k=1..D} Σ_{i=1..W} Σ_{j=1..H} x_c(i, j, k)

Z = [z_1, z_2, ..., z_c]

s = σ(δ(Z, W_1), W_2)

s = [s_1, s_2, ..., s_c]

u_c^d = s_c · x_c^d

wherein D is the time dimension of the feature sequence; W and H are the width and height of each feature map; i, j, k index the spatial horizontal (width), spatial vertical (height), and time dimensions of the image; x_c(i, j, k) is the pixel with index i, j, k in the unit feature map of channel c; z_c is the global mean of the unit feature map x_c of channel c, and Z = [z_1, z_2, ..., z_c] collects the global means of all channels; s holds the weights of all channels in a unit module and s_c is the weight of channel c; W_1 and W_2 are the parameters of two fully connected layers; δ denotes the ReLU activation function; and σ denotes the Sigmoid activation function.
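The recalibration above is a Squeeze-and-Excitation block extended over the time dimension. A minimal PyTorch sketch follows, with the reduction ratio r as an assumed hyperparameter:

```python
import torch
import torch.nn as nn

class SqueezeExcite3d(nn.Module):
    """Channel attention over a (B, C, D, H, W) feature sequence:
    z_c = global mean over D, H, W; s = Sigmoid(W2 . ReLU(W1 . Z));
    u_c^d = s_c * x_c^d."""
    def __init__(self, channels, r=16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool3d(1)   # squeeze: per-channel global mean z_c
        self.fc = nn.Sequential(              # excitation: channel weights s
            nn.Linear(channels, channels // r),
            nn.ReLU(inplace=True),
            nn.Linear(channels // r, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):                     # x: (B, C, D, H, W)
        b, c = x.shape[:2]
        z = self.pool(x).view(b, c)           # Z = [z_1, ..., z_c]
        s = self.fc(z).view(b, c, 1, 1, 1)    # weights in (0, 1), one per channel
        return x * s                          # recalibrated output U
```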
And step 5, inputting the high-dimensional features obtained in step 4 into the fully connected layers to obtain the probability that the video belongs to each category, and deriving the final predicted classification from these probabilities; specifically, the category with the highest predicted probability is selected as the video classification label, or all predicted categories whose probability exceeds a threshold are selected as video labels.
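For illustration, a minimal sketch of the step 5 readout; the abstract obtains probabilities with a normalized exponential function (softmax) while this embodiment uses a Sigmoid, so both readouts are shown, and all layer sizes are illustrative assumptions:

```python
import torch
import torch.nn as nn

num_classes, feat_dim, hidden = 101, 2048, 512  # illustrative sizes
head = nn.Sequential(                           # two consecutive fully connected layers
    nn.Linear(feat_dim, hidden),
    nn.ReLU(inplace=True),
    nn.Linear(hidden, num_classes),
)

features = torch.randn(8, feat_dim)             # pooled high-dimensional features
logits = head(features)

probs = torch.softmax(logits, dim=1)            # normalized exponential function
top1 = probs.argmax(dim=1)                      # single label: highest probability
multi = torch.sigmoid(logits) > 0.5             # multi-label: all classes above a threshold
```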
The invention thus provides an efficient method for intelligently classifying short videos by content. The core algorithm model consists of a two-dimensional convolutional neural network and a pseudo three-dimensional convolutional neural network connected in series, which extract shallow spatial information and high-dimensional spatial and temporal information respectively; the probability that a video belongs to each category is obtained through a normalized exponential function, and the final predicted classification is derived from these probabilities. The method balances time performance and prediction accuracy, can be used for real-time content supervision and classification of short videos, and its results can serve as a reference for short video recommendation.
The above description covers only preferred embodiments of the present invention. It should be noted that various modifications and adaptations will be apparent to those skilled in the art without departing from the principles of the invention, and these are intended to fall within the scope of the invention.

Claims (4)

1. An efficient intelligent short video content classification method based on deep learning and an attention mechanism, characterized by comprising the following steps:
step 1, simply preprocessing the original short video and quickly converting the video into pictures using the FFmpeg tool;
step 2, determining the number N of pictures fed to the model according to accuracy and time requirements, uniformly extracting input images from all the pictures at equal intervals to form an ordered segment of input frames, and cropping the input frames to a size of 224 × 224 pixels;
step 3, inputting the pictures processed in step 2 into a two-dimensional convolutional neural network with an attention mechanism, outputting shallow feature maps, and stacking the shallow feature maps in time and channel order to form a feature map sequence X carrying temporal and spatial information;
step 4, inputting the output of step 3 into a pseudo three-dimensional convolutional neural network with an attention mechanism to learn temporal information and high-dimensional spatial information; the pseudo three-dimensional convolutional neural network comprises a plurality of unit modules arranged in sequence, each unit module comprising a plurality of convolutional layers arranged in sequence and an attention mechanism module placed after the last convolutional layer; the attention mechanism module recalibrates the temporal and spatial information to obtain the weight of every channel in the unit module, multiplies each channel weight by the output of the last convolutional layer to obtain the output of the attention mechanism module in the unit module, and passes this output to the next unit module; the attention mechanism module of the last unit module outputs the high-dimensional features;
and step 5, inputting the high-dimensional features obtained in step 4 into two consecutive fully connected layers to obtain the probability that the video belongs to each category, and deriving the final predicted classification from these probabilities.
2. The efficient intelligent short video content classification method based on deep learning and an attention mechanism according to claim 1, wherein the feature map sequence with temporal and spatial information output by step 3 is

X = [x_c^d]

wherein: X denotes the sequence of stacked feature maps output by step 3, x denotes a unit feature map in X, the subscript c indexes the channels, the superscript d indexes the time dimension, and x_c^d denotes the unit feature map of channel c with time dimension d.
3. The efficient intelligent short video content classification method based on deep learning and an attention mechanism according to claim 1, wherein the output U of each attention mechanism module in step 4 is expressed as

U = [u_c^d]

wherein: u denotes a unit feature map in U, the subscript c indexes the channels, and the superscript d indexes the time dimension; the calculation process of each attention mechanism module is:

z_c = (1 / (D · W · H)) · Σ_{k=1..D} Σ_{i=1..W} Σ_{j=1..H} x_c(i, j, k)

Z = [z_1, z_2, ..., z_c]

s = σ(δ(Z, W_1), W_2)

s = [s_1, s_2, ..., s_c]

u_c^d = s_c · x_c^d

wherein D is the time dimension of the feature sequence; W and H are the width and height of each feature map; i, j, k index the spatial horizontal, spatial vertical, and time dimensions of each feature map; x_c(i, j, k) is the pixel with index i, j, k in the unit feature map of channel c; z_c is the global mean of the unit feature map x_c of channel c, and Z = [z_1, z_2, ..., z_c] collects the global means of all channels; s holds the weights of all channels in a unit module and s_c is the weight of channel c; W_1 and W_2 are the parameters of the two consecutive fully connected layers; δ denotes the ReLU activation function; and σ denotes the Sigmoid activation function.
4. The efficient intelligent short video content classification method based on deep learning and an attention mechanism according to claim 1, wherein in step 5 the category with the highest predicted probability is selected as the video classification label, or all predicted categories whose probability exceeds a threshold are selected as video labels.
CN201910952622.7A 2019-10-09 2019-10-09 Short video content intelligent classification method based on deep learning and attention mechanism Active CN110807369B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910952622.7A CN110807369B (en) 2019-10-09 2019-10-09 Short video content intelligent classification method based on deep learning and attention mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910952622.7A CN110807369B (en) 2019-10-09 2019-10-09 Short video content intelligent classification method based on deep learning and attention mechanism

Publications (2)

Publication Number Publication Date
CN110807369A true CN110807369A (en) 2020-02-18
CN110807369B CN110807369B (en) 2024-02-20

Family

ID=69487993

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910952622.7A Active CN110807369B (en) 2019-10-09 2019-10-09 Short video content intelligent classification method based on deep learning and attention mechanism

Country Status (1)

Country Link
CN (1) CN110807369B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111259874A (en) * 2020-05-06 2020-06-09 成都派沃智通科技有限公司 Campus security video monitoring method based on deep learning
CN112948708A (en) * 2021-03-05 2021-06-11 清华大学深圳国际研究生院 Short video recommendation method

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109670453A (en) * 2018-12-20 2019-04-23 杭州东信北邮信息技术有限公司 A method of extracting short video subject
CN110020682A (en) * 2019-03-29 2019-07-16 北京工商大学 A kind of attention mechanism relationship comparison net model methodology based on small-sample learning
CN110032926A (en) * 2019-02-22 2019-07-19 哈尔滨工业大学(深圳) A kind of video classification methods and equipment based on deep learning
CN110175580A (en) * 2019-05-29 2019-08-27 复旦大学 A kind of video behavior recognition methods based on timing cause and effect convolutional network
US10402978B1 (en) * 2019-01-25 2019-09-03 StradVision, Inc. Method for detecting pseudo-3D bounding box based on CNN capable of converting modes according to poses of objects using instance segmentation and device using the same

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109670453A (en) * 2018-12-20 2019-04-23 杭州东信北邮信息技术有限公司 A method of extracting short video subject
US10402978B1 (en) * 2019-01-25 2019-09-03 StradVision, Inc. Method for detecting pseudo-3D bounding box based on CNN capable of converting modes according to poses of objects using instance segmentation and device using the same
CN110032926A (en) * 2019-02-22 2019-07-19 哈尔滨工业大学(深圳) A kind of video classification methods and equipment based on deep learning
CN110020682A (en) * 2019-03-29 2019-07-16 北京工商大学 A kind of attention mechanism relationship comparison net model methodology based on small-sample learning
CN110175580A (en) * 2019-05-29 2019-08-27 复旦大学 A kind of video behavior recognition methods based on timing cause and effect convolutional network

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
MOHAMMADREZA ZOLFAGHARI et al.: "Efficient Convolutional Network for Online Video Understanding", pages 1-24 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111259874A (en) * 2020-05-06 2020-06-09 成都派沃智通科技有限公司 Campus security video monitoring method based on deep learning
CN112948708A (en) * 2021-03-05 2021-06-11 清华大学深圳国际研究生院 Short video recommendation method

Also Published As

Publication number Publication date
CN110807369B (en) 2024-02-20

Similar Documents

Publication Publication Date Title
CN112001339B (en) Pedestrian social distance real-time monitoring method based on YOLO v4
Remez et al. Class-aware fully convolutional Gaussian and Poisson denoising
CN110363716B (en) High-quality reconstruction method for generating confrontation network composite degraded image based on conditions
WO2021042828A1 (en) Neural network model compression method and apparatus, and storage medium and chip
CN104113789B (en) On-line video abstraction generation method based on depth learning
CN111325165B (en) Urban remote sensing image scene classification method considering spatial relationship information
CN112653899B (en) Network live broadcast video feature extraction method based on joint attention ResNeSt under complex scene
CN111163338B (en) Video definition evaluation model training method, video recommendation method and related device
CN112836646B (en) Video pedestrian re-identification method based on channel attention mechanism and application
CN109218134B (en) Test case generation system based on neural style migration
CN105718932A (en) Colorful image classification method based on fruit fly optimization algorithm and smooth twinborn support vector machine and system thereof
CN110807369B (en) Short video content intelligent classification method based on deep learning and attention mechanism
CN111160356A (en) Image segmentation and classification method and device
CN109062811B (en) Test case generation method based on neural style migration
CN112183240A (en) Double-current convolution behavior identification method based on 3D time stream and parallel space stream
CN111476133A (en) Unmanned driving-oriented foreground and background codec network target extraction method
CN116580184A (en) YOLOv 7-based lightweight model
CN113420179B (en) Semantic reconstruction video description method based on time sequence Gaussian mixture hole convolution
CN107729821B (en) Video summarization method based on one-dimensional sequence learning
CN110852199A (en) Foreground extraction method based on double-frame coding and decoding model
CN110490053B (en) Human face attribute identification method based on trinocular camera depth estimation
CN112132207A (en) Target detection neural network construction method based on multi-branch feature mapping
CN116229323A (en) Human body behavior recognition method based on improved depth residual error network
CN111652238A (en) Multi-model integration method and system
EP4164221A1 (en) Processing image data

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant