CN113609918A - Short video classification method based on zero-shot learning - Google Patents

Short video classification method based on zero-shot learning

Info

Publication number
CN113609918A
CN113609918A (application CN202110785398.4A)
Authority
CN
China
Prior art keywords
video
category
zero
distance
class
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110785398.4A
Other languages
Chinese (zh)
Other versions
CN113609918B (en)
Inventor
陶珺
韩立新
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hohai University HHU
Original Assignee
Hohai University HHU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hohai University HHU filed Critical Hohai University HHU
Priority to CN202110785398.4A priority Critical patent/CN113609918B/en
Publication of CN113609918A publication Critical patent/CN113609918A/en
Application granted granted Critical
Publication of CN113609918B publication Critical patent/CN113609918B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00Geometric image transformation in the plane of the image
    • G06T3/40Scaling the whole image or part thereof
    • G06T3/4007Interpolation-based scaling, e.g. bilinear interpolation
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Abstract

The invention provides a short video classification method based on zero-shot learning, which comprises the following steps: a) constructing a training data set and extracting clip segments from the original short videos; b) introducing the classic Ken Burns effect for data enhancement; c) extracting visual features with a deep neural network; d) constructing a semantic space and building a class description A for the label classes Y, each class being represented as a semantic vector in which every dimension encodes a high-level attribute; e) calculating the class similarity of the target video and eliminating target classes whose distance to the video training set is too small; f) making the target-video classification decision, with a Triplet Ranking Loss function pulling samples of the same class together and pushing samples of different classes apart. The method makes full use of both the video features and the label features of short videos, effectively addresses their label-classification problem, and improves classification accuracy on unseen short videos.

Description

Short video classification method based on zero-shot learning
Technical Field
The invention relates to the technical fields of computer vision and transfer learning, and in particular to a short video classification method based on zero-shot learning.
Background
At present there are two main video classification approaches. One is the two-stream network, in which two 2D convolutional neural networks (a spatial-stream ConvNet and a temporal-stream ConvNet) separately extract the static features and the motion features of objects in a video, i.e. its spatial and temporal features. The other is the 3D convolutional neural network, which performs convolutions with 3D kernels and can therefore capture temporal and spatial feature information in the video simultaneously. Trained on large datasets, both methods can accurately classify videos into hundreds of different categories.
However, both of the above methods require enough samples to train a sufficiently good model, and annotating video data is very expensive. The volume of video on the internet keeps growing (YouTube, for example, receives hundreds of hours of new video every minute), so a model trained only on an annotated video dataset is difficult to keep effective.
Zero-shot learning (ZSL) is a good solution to this problem. A ZSL model is trained only once and can then generalize to new tasks whose classes do not appear in the training data set. In ZSL the model is trained on training-set data and must classify objects in the test set, yet the training-set classes and the test-set classes do not intersect; during training, a connection between the two must therefore be established through descriptions of the classes for the model to be effective.
Disclosure of Invention
The purpose of the invention is as follows: aiming at the problems that current video classification techniques make limited use of the extracted video features and struggle to classify videos of unseen classes, the invention provides a video classification method based on zero-shot learning.
The technical scheme is as follows: a video classification method based on zero-shot learning comprises the following steps:
1. Construct a zero-shot-learning training data set: sparsely sample the training-set video data and reshape the shortest edge of each frame to 128 pixels. Randomly crop a 112x112 patch on the training data set, and crop a center patch at inference. Perform data enhancement on the training set by combining images into videos with the Ken Burns effect: a series of crop windows ("objects") moves across the image, simulating video-like motion.
2. Extract latent visual features: starting from the video content, mine the deep features closely tied to the video, namely its spatial and temporal features, to ensure that the features are effective. The deepest representation vector is extracted from the video by a 3D convolutional neural network and serves as the video's feature vector; the two pathways of the network process video frames at a ratio of 1:4 so that the spatial and temporal features of the video are fully extracted.
3. Construct the category semantic space of the video: for the training-set and test-set classes Y_tr and Y_te, construct their class descriptions A_tr and A_te. Every class y_i ∈ Y is expressed as a semantic vector a_i ∈ A, and each dimension of the semantic vector represents a high-level attribute such as "black and white", "has a tail" or "has feathers"; when a class possesses such an attribute, the corresponding dimension is set to a non-zero value.
4. Calculate the class similarity of the target video: compute the similarity distance between the test-data classes and the training-set classes, set a similarity threshold t, and eliminate test classes that overlap with the training classes or whose distance to them is less than the threshold.
5. Make the video classification decision: reduce the dimensionality of the video feature vector through a fully connected layer so that it matches the dimensionality of the semantic space. Then use a Triplet Ranking Loss function to shorten the distance to positive samples belonging to the same class as the training sample and enlarge the distance to negative samples belonging to different classes.
The beneficial effects of the invention are specifically expressed as follows:
1) Unlike prior methods that extract video data with an inter-frame difference method, the invention extracts video frames by sparse sampling. The extracted frames are sparse and global, so temporal dependencies between frames that lie far apart can be modeled and video-level information obtained.
2) Scene-understanding images are combined into videos through the Ken Burns effect, providing the classification model with more scene information and enriching the categories of the training data set.
3) When extracting latent visual features, mimicking the roughly 4:1 ratio between the retinal ganglion cells that perceive fine spatial and color information and those that perceive motion in the primate visual system, the spatial and temporal features of the video are extracted simultaneously through two pathways of a 3D convolutional neural network, so the video features are captured more comprehensively.
4) The test data set is processed: the class similarity distance between the test set and the training set is computed from the cosine similarity of the label embeddings, and only the test-set classes whose class distance exceeds the set similarity threshold t are kept, so that the model's test results are genuinely valid.
5) The Triplet Ranking Loss is used as the loss function: the loss value is obtained from the relation among the anchor-positive distance d(r_a, r_p), the anchor-negative distance d(r_a, r_n) and the margin m, and the network parameters are adjusted accordingly.
Drawings
FIG. 1 is a flowchart of an algorithm according to an embodiment of the present invention.
Detailed Description
The present invention is further illustrated by the following figures and specific examples, which are to be understood as illustrative only and not as limiting the scope of the invention, which is to be given the full breadth of the appended claims and any and all equivalent modifications thereof which may occur to those skilled in the art upon reading the present specification.
The invention relates to a video classification method based on zero-shot learning, which, as shown in FIG. 1, comprises the following steps:
101. Training-set data preprocessing, extracting video clip segments by sparse sampling: the video is first divided into a fixed number of segments, one snippet is randomly drawn from each segment, and the shortest edge of each frame is reshaped to 128 pixels. A 112x112 patch is randomly cropped on the training data set, and a center patch is cropped at inference, yielding the training data set D_s = {(x_1, c_1), (x_2, c_2), ..., (x_N, c_N)}, in which each video x is paired with its class label c.
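The following is a minimal sketch of this preprocessing step, assuming OpenCV is used for frame decoding and resizing; the function name sample_clip, the segment count of 16 and the other parameter defaults are illustrative choices, not values taken from the patent.

```python
# Hypothetical sketch of sparse sampling + cropping (step 101); names are illustrative.
import random
import numpy as np
import cv2  # assumed available for decoding and resizing

def sample_clip(video_path, num_segments=16, short_side=128, crop=112, train=True):
    """Sparsely sample one frame per segment, resize the short side to 128 px,
    then take a random 112x112 crop (training) or a center crop (inference)."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    seg_len = max(total // num_segments, 1)
    frames = []
    for s in range(num_segments):
        # pick one random frame inside each of the equal-length segments
        idx = s * seg_len + (random.randrange(seg_len) if train else seg_len // 2)
        cap.set(cv2.CAP_PROP_POS_FRAMES, min(idx, total - 1))
        ok, frame = cap.read()
        if not ok:
            break
        h, w = frame.shape[:2]
        scale = short_side / min(h, w)
        frame = cv2.resize(frame, (int(w * scale), int(h * scale)))
        frames.append(frame)
    cap.release()
    clip = np.stack(frames)                       # (T, H, W, 3)
    t, h, w, _ = clip.shape
    if train:                                     # random crop during training
        y, x = random.randint(0, h - crop), random.randint(0, w - crop)
    else:                                         # center crop at inference
        y, x = (h - crop) // 2, (w - crop) // 2
    return clip[:, y:y + crop, x:x + crop, :]
```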
102. Training-set data enhancement, converting images to videos with the Ken Burns effect: for example, to create a 16-frame video from one image, a "start" and an "end" object position (and object size) are randomly selected in the image and linearly interpolated to obtain 16 objects, which are then resized to 112x112. The resulting training data set is denoted D_p.
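A sketch of how such a Ken Burns clip could be synthesized from a single image is shown below; the helper ken_burns_clip and its sampling of crop windows are assumptions consistent with the description (random start and end windows, linear interpolation to 16 frames, resize to 112x112), not the patent's exact implementation.

```python
# Hypothetical Ken Burns-style augmentation (step 102): one still image -> 16-frame clip.
import random
import numpy as np
import cv2  # assumed available for resizing

def ken_burns_clip(image, num_frames=16, out_size=112):
    """Turn one still image into a short clip by panning/zooming a crop window."""
    h, w = image.shape[:2]
    def random_window():
        size = random.randint(out_size, min(h, w))           # random "object" size
        y = random.randint(0, h - size)
        x = random.randint(0, w - size)
        return np.array([y, x, size], dtype=np.float32)
    start, end = random_window(), random_window()
    frames = []
    for i in range(num_frames):
        alpha = i / (num_frames - 1)
        y, x, size = (1 - alpha) * start + alpha * end        # linear interpolation
        y, x, size = int(y), int(x), int(size)
        patch = image[y:y + size, x:x + size]
        frames.append(cv2.resize(patch, (out_size, out_size)))
    return np.stack(frames)                                   # (16, 112, 112, 3)
```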
103. Visual feature extraction: the deepest representation vector is extracted from the training set preprocessed in steps 101 and 102 by a 3D convolutional neural network and serves as the feature vector of the video. The two pathways of the network process clip segments at a frame-number ratio of 1:4 so that the spatial and temporal features of the video are fully extracted, and the two feature streams are joined by lateral connections to form the complete visual feature of the video;
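The two-pathway idea can be illustrated with the following minimal PyTorch sketch. It is not the patent's network, only a toy model that keeps the stated ingredients: a slow path that sees one quarter of the frames and focuses on spatial detail, a fast path that sees all frames for temporal detail, and a lateral fusion into a single clip-level feature vector.

```python
# Toy two-pathway 3D CNN in the spirit of step 103 (assumed architecture, not the patent's).
import torch
import torch.nn as nn

class TwoPathway3DCNN(nn.Module):
    def __init__(self, feat_dim=512):
        super().__init__()
        self.slow = nn.Sequential(  # wider channels, fewer frames
            nn.Conv3d(3, 64, kernel_size=(1, 7, 7), stride=(1, 2, 2), padding=(0, 3, 3)),
            nn.BatchNorm3d(64), nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool3d(1),
        )
        self.fast = nn.Sequential(  # fewer channels, all frames
            nn.Conv3d(3, 8, kernel_size=(5, 7, 7), stride=(1, 2, 2), padding=(2, 3, 3)),
            nn.BatchNorm3d(8), nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool3d(1),
        )
        self.fc = nn.Linear(64 + 8, feat_dim)      # lateral fusion by concatenation

    def forward(self, clip):                       # clip: (B, 3, T, 112, 112)
        slow_in = clip[:, :, ::4]                  # 1:4 frame-rate ratio between paths
        s = self.slow(slow_in).flatten(1)          # (B, 64)
        f = self.fast(clip).flatten(1)             # (B, 8)
        return self.fc(torch.cat([s, f], dim=1))   # complete visual feature
```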
104. Category word-vector construction: the word2vec tool maps the training-set and test-set classes Y_tr and Y_te to word vectors, and the similarity between class words is measured on those vectors. Each dimension of a word vector represents a high-level attribute that takes a non-zero value when the class word carries that attribute; here the class labels are mapped to 300-dimensional word vectors, i.e. c → R^300.
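A sketch of this label-embedding step, assuming a pretrained 300-dimensional word2vec model is available through gensim; the file name and the averaging of multi-word labels are assumptions, not requirements stated in the patent.

```python
# Hypothetical class-label embedding for step 104 (c -> R^300).
import numpy as np
from gensim.models import KeyedVectors

# Assumed path to pretrained 300-d word2vec vectors.
w2v = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)

def embed_label(label):
    """Map a class label such as 'playing guitar' to a 300-d semantic vector."""
    words = [w for w in label.lower().split() if w in w2v]
    if not words:
        return np.zeros(w2v.vector_size, dtype=np.float32)
    return np.mean([w2v[w] for w in words], axis=0)   # average multi-word labels
```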
105. Test-set data processing: the class distance is calculated from the cosine similarity of the class-label embeddings, and every class whose distance to any training-set class preprocessed in step 101 is less than the similarity threshold t is removed from the test set, yielding a test-set subset with fewer classes;
106. Video classification decision: the dimensionality of the video feature vector is reduced through a fully connected layer so that it matches the word-vector dimensionality. A Triplet Ranking Loss function is then used to shorten the distance to positive samples of the same class as the training sample and enlarge the distance to negative samples of different classes. Training on the training data set finally yields a video classifier that can classify videos of unseen classes.
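The decision rule in step 106 can be pictured as follows. The projection layer and the nearest-class lookup in this sketch are assumptions consistent with the embedding-based setup; the patent does not spell out the exact inference-time rule.

```python
# Hypothetical inference sketch: project the visual feature into the 300-d semantic
# space and assign the unseen class whose word vector is closest by cosine similarity.
import torch
import torch.nn.functional as F

project = torch.nn.Linear(512, 300)    # visual feature dim -> semantic space dim

def classify(video_feat, class_vectors, class_names):
    """video_feat: (512,) tensor; class_vectors: (K, 300) word vectors of unseen classes."""
    r = F.normalize(project(video_feat), dim=0)
    a = F.normalize(class_vectors, dim=1)
    scores = a @ r                       # cosine similarity to each class embedding
    return class_names[int(scores.argmax())]
```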
In step 105, the class distance is calculated from the cosine similarity of the label embeddings as follows:
(1) Let d: C × C → R denote a distance measure on the space of all possible class names C, let t ∈ R be the similarity threshold, let D_s and D_p be the training data sets, let c_s denote a class of the training data set and c_t a class of the test data set. If
d(c_s, c_t) ≥ t for every training class c_s and every test class c_t,
then the video classification task fully honors the zero-shot constraint.
(2) The approach defined herein for d is a semantic embedding of the class names. The distance between two classes is calculated by the cosine distance cos, with the specific formula:
d(c_1, c_2) = cos(W2V(c_1), W2V(c_2))
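A sketch of this screening step is given below. Because the text removes test classes that are too close to the training classes, d is implemented here as a cosine distance (1 minus the cosine similarity); whether the patent uses the raw cosine or its complement is our assumption.

```python
# Hypothetical class-similarity screening for step 105.
import numpy as np

def cosine_distance(a, b):
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def filter_test_classes(test_classes, train_classes, embed, t=0.2):
    """Keep only test classes whose distance to every training class is >= t.
    `embed` maps a class name to its word vector; t is the similarity threshold."""
    kept = []
    for ct in test_classes:
        dists = [cosine_distance(embed(ct), embed(cs)) for cs in train_classes]
        if min(dists) >= t:          # zero-shot constraint: no near-duplicate class
            kept.append(ct)
    return kept
```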
The Triplet Ranking Loss function of step 106 is implemented as follows:
(1) The loss function uses triplets of training data samples to achieve better experimental results. A triplet consists of an anchor sample x_a, a positive sample x_p and a negative sample x_n. Its purpose is to make the distance between the anchor and the negative sample, denoted d(r_a, r_n), larger than the distance between the anchor and the positive sample, denoted d(r_a, r_p), by more than the margin m.
(2) The Triplet Ranking Loss formula is as follows:
L(r_a, r_p, r_n) = max(0, m + d(r_a, r_p) - d(r_a, r_n))
The Triplet Ranking Loss gives rise to the following three cases:
① Easy triplets: d(r_a, r_n) > d(r_a, r_p) + m. In the embedding space the negative sample is already far enough from the anchor relative to the positive sample; the loss is 0 and the network parameters are not updated.
② Hard triplets: d(r_a, r_n) < d(r_a, r_p). The negative sample is closer to the anchor than the positive sample; the loss is positive and greater than m.
③ Semi-hard triplets: d(r_a, r_p) < d(r_a, r_n) < d(r_a, r_p) + m. The negative sample is farther from the anchor than the positive sample, but by less than the margin m; the loss is still positive but smaller than m.
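For reference, a minimal PyTorch rendering of L(r_a, r_p, r_n) = max(0, m + d(r_a, r_p) - d(r_a, r_n)) is sketched below; the Euclidean choice for d and the margin value are assumptions, and PyTorch's built-in nn.TripletMarginLoss computes the same quantity for a Euclidean d.

```python
# Sketch of the Triplet Ranking Loss above; d is taken as Euclidean distance (assumption).
import torch
import torch.nn.functional as F

def triplet_ranking_loss(r_a, r_p, r_n, m=0.5):
    d_ap = F.pairwise_distance(r_a, r_p)           # anchor-positive distance
    d_an = F.pairwise_distance(r_a, r_n)           # anchor-negative distance
    return torch.clamp(m + d_ap - d_an, min=0).mean()
```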

Claims (5)

1. A short video classification method based on zero-shot learning, comprising the following steps:
a. constructing a zero-shot-learning training data set: extracting video clip segments and reshaping and randomly cropping each frame to obtain a training data set; enhancing the training-set data by editing images into videos with the Ken Burns effect and then adjusting the spatial size of the videos to obtain an enhanced training data set;
b. extracting latent visual features: obtaining a deep-level feature vector of the video, comprising its spatial and temporal features, with a deep neural network;
c. constructing a category semantic space of the video and building a class description A for the label classes Y; each class is represented as a semantic vector, each dimension of which represents a high-level attribute that is set to a non-zero value when the class possesses that attribute;
d. calculating the class similarity of the target video: computing the similarity distance between the test-data classes and the training-set classes, setting a similarity threshold t, and eliminating test classes that overlap with the training classes or whose distance to them is less than the threshold;
e. making the video classification decision: reducing the dimensionality of the video feature vector through a fully connected layer so that it matches the dimensionality of the semantic space; then using a Triplet Ranking Loss function to shorten the distance to positive samples of the same class as the training sample and enlarge the distance to negative samples of different classes.
2. The short video classification method based on zero-shot learning according to claim 1, wherein in step b the training set is modeled with a deep neural network to extract the important spatial and temporal features of the video.
3. The short video classification method based on zero-shot learning according to claim 1, wherein in step c the class description A is constructed for the label classes Y, and each class y_i ∈ Y is expressed as a semantic vector a_i ∈ A, wherein the choice of attributes for the semantic vectors ultimately has a great influence on the effectiveness of the zero-shot-learning-based video classification algorithm.
4. The short video classification method based on zero-shot learning according to claim 1, wherein eliminating, in step d, the test classes that overlap with or lie closer than the similarity threshold to the training classes achieves better generalization of the model.
5. The short video classification method based on zero-shot learning according to claim 1, wherein in step e the video feature vector is reduced in dimensionality and the Triplet Ranking Loss function is then used to update the network parameters; in the Triplet Ranking Loss, the magnitude of the loss value is determined by the relation among the anchor-positive distance, the anchor-negative distance and the margin, and the network parameters are updated accordingly.
CN202110785398.4A 2021-07-12 2021-07-12 Short video classification method based on zero-shot learning Active CN113609918B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110785398.4A CN113609918B (en) 2021-07-12 2021-07-12 Short video classification method based on zero-shot learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110785398.4A CN113609918B (en) 2021-07-12 2021-07-12 Short video classification method based on zero-shot learning

Publications (2)

Publication Number Publication Date
CN113609918A true CN113609918A (en) 2021-11-05
CN113609918B CN113609918B (en) 2023-10-13

Family

ID=78304418

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110785398.4A Active CN113609918B (en) 2021-07-12 2021-07-12 Short video classification method based on zero-shot learning

Country Status (1)

Country Link
CN (1) CN113609918B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3171297A1 (en) * 2015-11-18 2017-05-24 CentraleSupélec Joint boundary detection image segmentation and object recognition using deep learning
WO2018171109A1 (en) * 2017-03-23 2018-09-27 北京大学深圳研究生院 Video action detection method based on convolutional neural network
CN109961089A (en) * 2019-02-26 2019-07-02 中山大学 Small sample and zero sample image classification method based on metric learning and meta learning
CN111510233A (en) * 2013-12-03 2020-08-07 LG Electronics Inc. Method of synchronizing supplemental content with uncompressed audio or video and apparatus therefor

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111510233A (en) * 2013-12-03 2020-08-07 LG Electronics Inc. Method of synchronizing supplemental content with uncompressed audio or video and apparatus therefor
EP3171297A1 (en) * 2015-11-18 2017-05-24 CentraleSupélec Joint boundary detection image segmentation and object recognition using deep learning
WO2018171109A1 (en) * 2017-03-23 2018-09-27 北京大学深圳研究生院 Video action detection method based on convolutional neural network
CN109961089A (en) * 2019-02-26 2019-07-02 中山大学 Small sample and zero sample image classification method based on metric learning and meta learning

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
朝乐门; 邢春晓; 张勇: "Current Status and Trends of Data Science Research" (数据科学研究的现状与趋势), Computer Science (计算机科学), no. 01 *
陈畅怀; 韩立新; 曾晓勤; 王敏: "Image Retrieval Re-ranking Based on Visual Features" (基于视觉特征的图像检索重排序), Information Technology (信息技术), no. 12 *

Also Published As

Publication number Publication date
CN113609918B (en) 2023-10-13

Similar Documents

Publication Publication Date Title
WO2018023734A1 (en) Significance testing method for 3d image
CN109614979B (en) Data augmentation method and image classification method based on selection and generation
AU2014368997B2 (en) System and method for identifying faces in unconstrained media
CN103593464B (en) Video fingerprint detecting and video sequence matching method and system based on visual features
US11854244B2 (en) Labeling techniques for a modified panoptic labeling neural network
CN102750385B (en) Correlation-quality sequencing image retrieval method based on tag retrieval
CN106610969A (en) Multimodal information-based video content auditing system and method
CN103020647A (en) Image classification method based on hierarchical SIFT (scale-invariant feature transform) features and sparse coding
CN110827312B (en) Learning method based on cooperative visual attention neural network
CN108021869A (en) A kind of convolutional neural networks tracking of combination gaussian kernel function
CN102034267A (en) Three-dimensional reconstruction method of target based on attention
CN103853724A (en) Multimedia data sorting method and device
Lodh et al. Flower recognition system based on color and GIST features
CN111783521A (en) Pedestrian re-identification method based on low-rank prior guidance and based on domain invariant information separation
Wan et al. A new technique for summarizing video sequences through histogram evolution
CN106203448A (en) A kind of scene classification method based on Nonlinear Scale Space Theory
CN105718934A (en) Method for pest image feature learning and identification based on low-rank sparse coding technology
CN104680189A (en) Pornographic image detection method based on improved bag-of-words model
Guo et al. Scale region recognition network for object counting in intelligent transportation system
CN103336974B (en) A kind of flowers classification discrimination method based on local restriction sparse representation
CN109165551B (en) Expression recognition method for adaptively weighting and fusing significance structure tensor and LBP characteristics
Özyurt et al. A new method for classification of images using convolutional neural network based on Dwt-Svd perceptual hash function
CN105205161A (en) Simultaneous target searching and dividing method based on Internet images
CN113609918A (en) Short video classification method based on zero-shot learning
CN114329050A (en) Visual media data deduplication processing method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant