CN109670453B - Method for extracting short video theme - Google Patents
Method for extracting short video theme
- Publication number
- CN109670453B (application CN201811567121.9A)
- Authority
- CN
- China
- Prior art keywords
- video
- space
- time characteristic
- video space
- time
- Prior art date
- 2018-12-20
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/41—Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- Software Systems (AREA)
- General Physics & Mathematics (AREA)
- General Health & Medical Sciences (AREA)
- General Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Evolutionary Computation (AREA)
- Biomedical Technology (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Biophysics (AREA)
- Artificial Intelligence (AREA)
- Mathematical Physics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Health & Medical Sciences (AREA)
- Multimedia (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
A method of extracting short video topics, comprising: cutting the short video into M video frame-cut pictures; obtaining the video spatial feature vector set of the frame-cut pictures with a convolutional neural network in a transfer learning mode; forming a feature-vector time sequence from the video spatial feature vectors according to the playing order and inputting it into a bidirectional recurrent neural network, so as to output a video spatio-temporal feature sequence set H; adjusting each video spatio-temporal feature sequence in H with an attention mechanism, so as to obtain a new video spatio-temporal feature sequence set Q; and expanding Q into a video spatio-temporal feature vector Z, performing a linear transformation on Z, and then computing with a normalized exponential function the probability that the short video belongs to each topic, so as to extract the topic of the short video. The invention belongs to the field of information technology; it can automatically extract topic information from short videos and effectively reduces the amount of computation.
Description
Technical Field
The invention relates to a method for extracting short video topics and belongs to the field of information technology.
Background
Short videos are increasingly becoming a way for people to learn about the world. Automatically labeling large numbers of short videos can greatly reduce the laborious process of manual annotation, and it also lays the groundwork for subsequent short video classification and for pushing short videos that match users' interests.
Patent application CN201810496579.3 (title: a new unsupervised video semantic extraction method; filing date: 2018-05-22; applicant: University of Electronic Science and Technology of China) discloses an unsupervised video semantic extraction method comprising: constructing a three-dimensional convolutional neural network model and training it with the labeled video data set in a video database; processing the unlabeled video data in the video database with a sliding window into data conforming to the input of the three-dimensional convolutional neural network; feeding the generated data into the model and taking the output of its fully connected layer as the semantic features of a video segment; and using the generated sequence of video-segment semantic features as the input of a video semantic autoencoder, which integrates them into the overall semantic features of the video. Because this scheme extracts video-segment semantic features directly with a three-dimensional convolutional neural network, it incurs an extremely large amount of computation and the system efficiency is low.
Therefore, how to automatically extract topic information from short videos while effectively reducing the amount of computation has become a technical problem that urgently needs to be solved.
Disclosure of Invention
In view of the above, the present invention provides a method for extracting short video topics, which can automatically extract topic information from short videos and effectively reduce the amount of computation.
In order to achieve the above object, the present invention provides a method for extracting short video topics, comprising:
step one, cutting the short video into M video frame-cut pictures at a certain frame-length interval;
step two, in a transfer learning mode, obtaining with a convolutional neural network the video spatial feature vector set Y = [y_1, y_2, ..., y_M] of the M video frame-cut pictures, wherein y_1, y_2, ..., y_M are the video spatial feature vectors obtained by passing each video frame-cut picture through the convolutional neural network;
step three, forming the video spatial feature vectors of the M video frame-cut pictures into a feature-vector time sequence according to the playing order of the short video, inputting it into a bidirectional recurrent neural network, and outputting a video spatio-temporal feature sequence set H = [h_1, h_2, ..., h_M], wherein h_1, h_2, ..., h_M are the video spatio-temporal feature sequences output in the set H;
step four, calculating with an attention mechanism the attention of each video spatio-temporal feature sequence in the set H to the other video spatio-temporal feature sequences, and adjusting each sequence in H according to the attention, thereby obtaining a new video spatio-temporal feature sequence set Q = [q_1, q_2, ..., q_M], wherein q_1, q_2, ..., q_M are the video spatio-temporal feature sequences adjusted according to attention;
and step five, expanding the new video spatio-temporal feature sequence set Q into a video spatio-temporal feature vector Z, performing a linear transformation on Z, and then computing with a normalized exponential function the probability that the short video belongs to each topic, so as to extract the topic of the short video.
Compared with the prior art, the beneficial effects of the invention are as follows: a certain number of pictures are cut from the short video at a fixed frame-length interval and the spatial features of the short video are extracted from each picture; the features are then fed into a Bidirectional LSTM network in time order and combined, so that both the spatial and the temporal feature information of the short video are extracted; and an attention mechanism is introduced to focus on mining the key frames most relevant to the video category.
Drawings
Fig. 1 is a flow chart of a method for extracting short video topics according to the present invention.
Fig. 2 is a flowchart illustrating the detailed steps of step two in fig. 1.
Fig. 3 is a flowchart showing the detailed steps of step four in fig. 1.
Fig. 4 is a flowchart showing the detailed steps of step five in fig. 1.
Detailed Description
To make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in further detail with reference to the accompanying drawings.
As shown in fig. 1, the method for extracting short video topics of the present invention includes:
step one, cutting the short video into M video frame-cut pictures at a certain frame-length interval, wherein the value of M can be set according to actual service requirements (a frame-sampling sketch is given after this list);
step two, in a transfer learning mode, obtaining with a convolutional neural network the video spatial feature vector set Y = [y_1, y_2, ..., y_M] of the M video frame-cut pictures, wherein y_1, y_2, ..., y_M are the video spatial feature vectors obtained by passing each video frame-cut picture through the convolutional neural network; in this way, content feature information can be extracted from each video frame-cut picture;
step three, forming the video spatial feature vectors of the M video frame-cut pictures into a feature-vector time sequence according to the playing order of the short video, inputting it into a bidirectional recurrent neural network, and outputting a video spatio-temporal feature sequence set H = [h_1, h_2, ..., h_M], wherein h_1, h_2, ..., h_M are the video spatio-temporal feature sequences output in the set H; through the bidirectional recurrent neural network, content and temporal feature information can be extracted from all the video frame-cut pictures;
step four, calculating with an attention mechanism the attention of each video spatio-temporal feature sequence in the set H to the other video spatio-temporal feature sequences, and adjusting each sequence in H according to the attention, thereby obtaining a new video spatio-temporal feature sequence set Q = [q_1, q_2, ..., q_M], wherein q_1, q_2, ..., q_M are the video spatio-temporal feature sequences adjusted according to attention;
and step five, expanding the new video spatio-temporal feature sequence set Q into a video spatio-temporal feature vector Z, performing a linear transformation on Z, and then computing with a normalized exponential function the probability that the short video belongs to each topic, so as to extract the topic of the short video.
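The patent does not name a tool for the frame cutting of step one. The following minimal sketch assumes OpenCV is available and that evenly spaced sampling satisfies the "certain interval" requirement; the function name and parameters are illustrative.

```python
# Frame-sampling sketch for step one (assumes OpenCV; M is service-dependent).
import cv2

def sample_frames(video_path: str, m: int):
    """Cut a short video into M evenly spaced frame-cut pictures."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    interval = max(total // m, 1)          # fixed frame-length interval
    frames = []
    for i in range(m):
        cap.set(cv2.CAP_PROP_POS_FRAMES, i * interval)
        ok, frame = cap.read()             # frame: H x W x 3 BGR array
        if not ok:
            break
        frames.append(frame)
    cap.release()
    return frames
```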
In step two, considering that labeled sample pictures in a specific field are scarce and insufficient for full training, a transfer learning method may be adopted, using a convolutional neural network to extract the content feature information of each video frame-cut picture. As shown in fig. 2, step two may further include:

step 21, constructing and training, in a transfer learning mode, an Inception-v3 pre-trained convolutional neural network model based on the public ImageNet data set, the input of the model being a video frame-cut picture and the output being the probability that the picture belongs to different topics;

step 22, inputting the M video frame-cut pictures into the model trained in step 21, extracting the output of its penultimate layer as the video spatial feature vector of each picture, and forming the video spatial feature vector set Y = [y_1, y_2, ..., y_M] as the content features of the M video frame-cut pictures. A feature-extraction sketch follows.
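As an illustration of steps 21 and 22, the sketch below uses torchvision's pretrained Inception-v3 as a stand-in for the patent's ImageNet-based model (the patent names Inception-v3 but not a framework); replacing the final classification layer with an identity exposes the 2048-dimensional penultimate-layer output used as the spatial feature vector y_i. It assumes a recent torchvision (the weights API) and the standard ImageNet preprocessing values.

```python
# Spatial-feature extraction sketch for step two (assumes PyTorch/torchvision).
import torch
import torchvision.models as models
import torchvision.transforms as T

model = models.inception_v3(weights=models.Inception_V3_Weights.IMAGENET1K_V1)
model.fc = torch.nn.Identity()     # expose the penultimate (2048-d) layer
model.eval()

preprocess = T.Compose([
    T.ToPILImage(), T.Resize(299), T.CenterCrop(299),  # Inception-v3 input size
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def spatial_features(frames):
    """frames: list of M RGB H x W x 3 arrays (convert BGR first if from OpenCV)."""
    batch = torch.stack([preprocess(f) for f in frames])
    with torch.no_grad():
        return model(batch)        # Y: (M, 2048), one row per frame-cut picture
```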
In step three, a Bidirectional LSTM algorithm may be adopted: the video spatial feature vectors of all the video frame-cut pictures extracted in step two are input into the model in time order, and the video spatio-temporal feature sequence set H = [h_1, h_2, ..., h_M] is output. Through this time-sequence layer, the invention can combine the features of the different video frame-cut pictures, thereby obtaining the video spatio-temporal feature sequences; a sketch follows.
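A sketch of the time-sequence layer, assuming PyTorch's nn.LSTM in bidirectional mode; the hidden size is illustrative, not taken from the patent.

```python
# Bidirectional LSTM sketch for step three (assumes PyTorch).
import torch.nn as nn

bilstm = nn.LSTM(input_size=2048, hidden_size=256,
                 batch_first=True, bidirectional=True)

def spatio_temporal_sequences(y):  # y: (M, 2048) spatial features in play order
    h, _ = bilstm(y.unsqueeze(0))  # add a batch dimension
    return h.squeeze(0)            # H: (M, 512), forward and backward states
```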
In the method of the invention, an attention mechanism is added after the time-sequence layer of step three, so that the relations between the different video spatio-temporal feature sequences in the set H can be captured and the different degrees of attention among them fully mined. As shown in fig. 3, step four may further include:

step 41, calculating the relation value between every two video spatio-temporal feature sequences in the set H: f(h_i, h_j) = W_θ(h_i)^T · W_φ(h_j), wherein W_θ(h_i) and W_φ(h_j) are the values of h_i and h_j after nonlinear transformation and W_θ(h_i)^T is the transpose of W_θ(h_i);

step 42, calculating the attention of each sequence to the others: a_ij = exp(s_ij) / Σ_{t=1..M} exp(s_it), with s_ij = f(h_i, h_j);

step 43, adjusting each sequence according to the attention: q_i = Σ_{j=1..M} a_ij · h_j, thereby forming the new set Q (an attention sketch follows).
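A sketch of the step-four adjustment, following the claim-4 formulas as reconstructed above; the two learned linear maps below stand in for the patent's transforms W_θ and W_φ, whose exact form the text does not give.

```python
# Attention-adjustment sketch for step four (assumes PyTorch).
import torch
import torch.nn as nn

d = 512                            # dimension of each h_i from the BiLSTM
w_theta, w_phi = nn.Linear(d, d), nn.Linear(d, d)

def attend(h):                     # h: (M, d) spatio-temporal sequences
    s = w_theta(h) @ w_phi(h).T    # s[i, j] = f(h_i, h_j), pairwise relations
    a = torch.softmax(s, dim=1)    # a[i, j]: attention of sequence i to j
    return a @ h                   # q_i = sum_j a_ij * h_j  ->  Q: (M, d)
```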
As shown in fig. 4, step five may further include:

step 51, fully expanding the new video spatio-temporal feature sequence set Q to obtain the video spatio-temporal feature vector Z = [z_1, z_2, ..., z_M]; if the dimension of each sequence in Q is N, the dimension of Z is M × N;

step 52, performing a linear transformation on Z through a fully connected layer and then computing with a normalized exponential function the probability p_k = exp(f_w(Z)_k) / Σ_t exp(f_w(Z)_t) that the short video belongs to the k-th topic;

step 53, counting the number L of topics whose probability is greater than the probability threshold, and judging whether L is 0; if so, sorting the topic probabilities in descending order, outputting the P topics with the largest probabilities as the topics of the short video, and ending the process; if not, continuing to the next step;

step 54, judging whether L is less than or equal to P; if so, outputting the topics whose probability is greater than the probability threshold as the topics of the short video, and ending the process; if not, continuing to the next step;

and step 55, arranging the L probabilities greater than the probability threshold in descending order, and then outputting the P topics with the largest probabilities as the topics of the short video (a classification sketch follows).
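A sketch of step five together with the selection rule of steps 53 to 55; the topic count K, the probability threshold and P are illustrative parameters, not values from the patent.

```python
# Topic-probability and selection sketch for step five (assumes PyTorch).
import torch
import torch.nn as nn

d, m, k = 512, 32, 100             # feature dim N, frame count M, topic count K
fc = nn.Linear(m * d, k)           # the linear transform f_w applied to Z

def short_video_topics(q, threshold=0.2, p=3):
    z = q.reshape(-1)                        # expand Q into the vector Z (M x N)
    probs = torch.softmax(fc(z), dim=0)      # p_k for each of the K topics
    above = (probs > threshold).nonzero().flatten()
    if above.numel() == 0:         # L == 0: fall back to the P most probable
        return probs.topk(p).indices
    if above.numel() <= p:         # 0 < L <= P: all topics above the threshold
        return above
    order = probs[above].argsort(descending=True)
    return above[order[:p]]        # L > P: top P among those above threshold
```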
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and should not be taken as limiting the scope of the present invention, and any modifications, equivalents, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.
Claims (6)
1. A method for extracting short video topics is characterized by comprising the following steps:
step one, cutting the short video into M video frame-cut pictures at a certain frame-length interval;
step two, in a transfer learning mode, obtaining with a convolutional neural network the video spatial feature vector set Y = [y_1, y_2, ..., y_M] of the M video frame-cut pictures, wherein y_1, y_2, ..., y_M are the video spatial feature vectors obtained by passing each video frame-cut picture through the convolutional neural network;
step three, forming the video spatial feature vectors of the M video frame-cut pictures into a feature-vector time sequence according to the playing order of the short video, inputting it into a bidirectional recurrent neural network, and outputting a video spatio-temporal feature sequence set H = [h_1, h_2, ..., h_M], wherein h_1, h_2, ..., h_M are the video spatio-temporal feature sequences output in the set H;
step four, calculating with an attention mechanism the attention of each video spatio-temporal feature sequence in the set H to the other video spatio-temporal feature sequences, and adjusting each sequence in H according to the attention, thereby obtaining a new video spatio-temporal feature sequence set Q = [q_1, q_2, ..., q_M], wherein q_1, q_2, ..., q_M are the video spatio-temporal feature sequences adjusted according to attention;
and step five, expanding the new video spatio-temporal feature sequence set Q into a video spatio-temporal feature vector Z, performing a linear transformation on Z, and then computing with a normalized exponential function the probability that the short video belongs to each topic, so as to extract the topic of the short video.
2. The method of claim 1, wherein step two further comprises:
step 21, constructing and training, in a transfer learning mode, an Inception-v3 pre-trained convolutional neural network model based on the public ImageNet data set, wherein the input of the model is a video frame-cut picture and the output of the model is the probability that the video frame-cut picture belongs to different topics;
step 22, inputting the M video frame-cut pictures into the convolutional neural network model trained in step 21, extracting the output of the penultimate layer of the model as the video spatial feature vector of each video frame-cut picture, and forming the video spatial feature vectors of the M video frame-cut pictures into the video spatial feature vector set Y = [y_1, y_2, ..., y_M], which serves as the content features of the M video frame-cut pictures.
3. The method of claim 1, wherein step three employs a Bidirectional LSTM algorithm.
4. The method of claim 1, wherein step four further comprises:
step 41, calculating a relation value between every two video spatio-temporal feature sequences in the set H: f(h_i, h_j) = W_θ(h_i)^T · W_φ(h_j), wherein f(h_i, h_j) is the relation value between the i-th and the j-th video spatio-temporal feature sequence in H, W_θ(h_i) and W_φ(h_j) are the values of h_i and h_j after nonlinear transformation, respectively, and W_θ(h_i)^T is the transpose of W_θ(h_i);
step 42, respectively calculating the attention of each video spatio-temporal feature sequence in the set H to the other video spatio-temporal feature sequences: a_ij = exp(s_ij) / Σ_{t=1..M} exp(s_it), wherein a_ij is the attention of the i-th video spatio-temporal feature sequence to the j-th one and s_ij = f(h_i, h_j);
step 43, adjusting each video spatio-temporal feature sequence in the set H according to the attention, by the formula q_i = Σ_{j=1..M} a_ij · h_j, wherein q_i is the i-th video spatio-temporal feature sequence adjusted according to attention and h_j is the j-th video spatio-temporal feature sequence in H, thereby forming the new video spatio-temporal feature sequence set.
5. The method of claim 1, wherein step five further comprises:
step 51, fully expanding the new video spatio-temporal feature sequence set Q to obtain the video spatio-temporal feature vector Z = [z_1, z_2, ..., z_M], wherein, if the dimension of each video spatio-temporal feature sequence in Q is N, the dimension of Z is M × N;
and step 52, performing a linear transformation on the video spatio-temporal feature vector Z through a fully connected layer, and then computing with a normalized exponential function the probability that the short video belongs to each topic: p_k = exp(f_w(Z)_k) / Σ_t exp(f_w(Z)_t), wherein p_k is the probability that the short video belongs to the k-th topic, f_w(Z)_k and f_w(Z)_t are the values of the k-th and the t-th class after the linear transformation of Z, and w is the linear-function parameter.
6. The method of claim 1, wherein step five further comprises:
step A1, counting the number L of topics whose probability is greater than the probability threshold, and judging whether L is 0; if so, sorting the topic probabilities in descending order, outputting the P topics with the largest probabilities as the topics of the short video, and ending the process; if not, continuing to the next step;
step A2, judging whether L is less than or equal to P; if so, outputting the topics whose probability is greater than the probability threshold as the topics of the short video, and ending the process; if not, continuing to the next step;
and step A3, arranging the L probabilities greater than the probability threshold in descending order, and then outputting the P topics with the largest probabilities as the topics of the short video.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811567121.9A CN109670453B (en) | 2018-12-20 | 2018-12-20 | Method for extracting short video theme |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811567121.9A CN109670453B (en) | 2018-12-20 | 2018-12-20 | Method for extracting short video theme |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109670453A (en) | 2019-04-23
CN109670453B (en) | 2023-04-07
Family
ID=66144078
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811567121.9A Active CN109670453B (en) | 2018-12-20 | 2018-12-20 | Method for extracting short video theme |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109670453B (en) |
Families Citing this family (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110096617B (en) * | 2019-04-29 | 2021-08-10 | 北京百度网讯科技有限公司 | Video classification method and device, electronic equipment and computer-readable storage medium |
CN110070511B (en) * | 2019-04-30 | 2022-01-28 | 北京市商汤科技开发有限公司 | Image processing method and device, electronic device and storage medium |
CN110807369B (en) * | 2019-10-09 | 2024-02-20 | 南京航空航天大学 | Short video content intelligent classification method based on deep learning and attention mechanism |
CN113762571A (en) * | 2020-10-27 | 2021-12-07 | 北京京东尚科信息技术有限公司 | Short video category prediction method, system, electronic device and storage medium |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108921032A (en) * | 2018-06-04 | 2018-11-30 | 四川创意信息技术股份有限公司 | A kind of new video semanteme extracting method based on deep learning model |
Family Cites Families (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9830709B2 (en) * | 2016-03-11 | 2017-11-28 | Qualcomm Incorporated | Video analysis with convolutional attention recurrent neural networks |
CN107038221B (en) * | 2017-03-22 | 2020-11-17 | 杭州电子科技大学 | Video content description method based on semantic information guidance |
CN107341462A (en) * | 2017-06-28 | 2017-11-10 | 电子科技大学 | A kind of video classification methods based on notice mechanism |
CN108024158A (en) * | 2017-11-30 | 2018-05-11 | 天津大学 | There is supervision video abstraction extraction method using visual attention mechanism |
- 2018-12-20: application CN201811567121.9A filed in China; granted and active as CN109670453B
Patent Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108921032A (en) * | 2018-06-04 | 2018-11-30 | 四川创意信息技术股份有限公司 | A kind of new video semanteme extracting method based on deep learning model |
Also Published As
Publication number | Publication date |
---|---|
CN109670453A (en) | 2019-04-23 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109670453B (en) | Method for extracting short video theme | |
WO2021088510A1 (en) | Video classification method and apparatus, computer, and readable storage medium | |
CN112163594B (en) | Network encryption traffic identification method and device | |
CN108537134B (en) | Video semantic scene segmentation and labeling method | |
Wang et al. | Retweet wars: Tweet popularity prediction via dynamic multimodal regression | |
CN111144448A (en) | Video barrage emotion analysis method based on multi-scale attention convolutional coding network | |
CN111444367B (en) | Image title generation method based on global and local attention mechanism | |
CN101739428B (en) | Method for establishing index for multimedia | |
CN114666663A (en) | Method and apparatus for generating video | |
CN113434716B (en) | Cross-modal information retrieval method and device | |
CN112784929B (en) | Small sample image classification method and device based on double-element group expansion | |
CN111723239B (en) | Video annotation method based on multiple modes | |
CN111401063B (en) | Text processing method and device based on multi-pool network and related equipment | |
CN111597983B (en) | Method for realizing identification of generated false face image based on deep convolutional neural network | |
CN111061837A (en) | Topic identification method, device, equipment and medium | |
CN113064995A (en) | Text multi-label classification method and system based on deep learning of images | |
CN109635647B (en) | Multi-picture multi-face clustering method based on constraint condition | |
CN115203471A (en) | Attention mechanism-based multimode fusion video recommendation method | |
CN111488813A (en) | Video emotion marking method and device, electronic equipment and storage medium | |
US11580979B2 (en) | Methods and systems for pushing audiovisual playlist based on text-attentional convolutional neural network | |
CN111310516A (en) | Behavior identification method and device | |
CN112052869B (en) | User psychological state identification method and system | |
CN107656760A (en) | Data processing method and device, electronic equipment | |
CN113536952A (en) | Video question-answering method based on attention network of motion capture | |
WO2021081741A1 (en) | Image classification method and system employing multi-relationship social network |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |
CP01 | Change in the name or title of a patent holder | |
Address after: 4th floor, No. 398 Wensan Road, Xihu District, Hangzhou City, Zhejiang Province, 310013
Patentee after: Xinxun Digital Technology (Hangzhou) Co.,Ltd.
Address before: 4th floor, No. 398 Wensan Road, Xihu District, Hangzhou City, Zhejiang Province, 310013
Patentee before: EB Information Technology Ltd.