CN113609918B - Short video classification method based on zero-order learning - Google Patents

Short video classification method based on zero-order learning

Info

Publication number
CN113609918B
CN113609918B (application number CN202110785398.4A)
Authority
CN
China
Prior art keywords
video
category
zero
dimension
short video
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110785398.4A
Other languages
Chinese (zh)
Other versions
CN113609918A (en)
Inventor
陶珺 (Tao Jun)
韩立新 (Han Lixin)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hohai University HHU
Original Assignee
Hohai University HHU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hohai University HHU filed Critical Hohai University HHU
Priority to CN202110785398.4A priority Critical patent/CN113609918B/en
Publication of CN113609918A publication Critical patent/CN113609918A/en
Application granted granted Critical
Publication of CN113609918B publication Critical patent/CN113609918B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00 Geometric image transformation in the plane of the image
    • G06T3/40 Scaling the whole image or part thereof
    • G06T3/4007 Interpolation-based scaling, e.g. bilinear interpolation
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems

Abstract

The application provides a short video classification method based on zero-shot learning, which comprises the following steps: a) constructing a training data set by extracting clips from the original short videos; b) introducing the classic Ken Burns effect for data augmentation; c) extracting visual features with a deep neural network; d) constructing a semantic space by building a category description A for the label categories Y, where each category is expressed as a semantic vector and each dimension of the vector represents a high-level attribute; e) computing category similarity for the target videos and eliminating target categories whose distance to the training-set categories is too small; f) making the classification decision for the target video, using the Triplet Ranking Loss function to separate classes in the embedding space. The method makes full use of both the video features and the label features of short videos, effectively addresses the label classification problem for short videos, and improves classification accuracy on unseen short-video categories.

Description

Short video classification method based on zero-shot learning
Technical Field
The application relates to the technical fields of computer vision and transfer learning, in particular to a short video classification method based on zero-shot learning.
Background
Currently, two main video classification approaches are used. One is the two-stream network, which extracts the static and motion characteristics of objects in the video, i.e. the spatial and temporal features, through two 2D convolutional neural networks (a spatial-stream ConvNet and a temporal-stream ConvNet). The other is the 3D convolutional neural network, which captures temporal and spatial feature information in the video simultaneously by convolving with 3D kernels. Through training on large data sets, both methods can accurately classify videos into hundreds of different categories.
However, both of the above methods require a large number of samples to train a sufficiently good model, and annotating video data is very expensive. The volume of video on the internet keeps growing, for example hundreds of hours of video are uploaded to YouTube every minute, so models that rely only on fully annotated video data sets struggle to keep up with newly emerging content.
Zero-shot learning (ZSL) can better address these problems. A ZSL model is trained once and can then generalize to new tasks whose classes do not appear in the training data set. In ZSL, the model is trained on the training set so that it can classify the objects of the test set even though the training classes and the test classes do not intersect; during training, an association between the training and test classes must be established through class descriptions for the model to be effective.
Disclosure of Invention
The application aims to address the problems that current video classification technology makes limited use of the extracted video features and struggles to classify videos of unknown categories, and therefore provides a video classification method based on zero-shot learning.
The technical scheme is as follows: a video classification method based on zero-shot learning comprises the following steps:
1. Construct the training data set for zero-shot learning: sparse sampling is performed on the training-set videos and the shortest side of each frame is resized to 128 pixels. A 112×112 patch is randomly cropped during training, while a central patch is cropped at inference time. Data augmentation is then performed on the training set by turning still images into video with the Ken Burns effect: a sequence of crops moves across the image to simulate video-like motion.
2. Latent visual feature extraction: deep features closely related to the video content, namely spatial and temporal features, are mined from the video itself to ensure their effectiveness. The deepest representation vector is extracted from the video by a 3D convolutional neural network and used as the video feature vector; the two pathways of the network sample frames at a ratio of 1:4 so as to fully extract the spatial and temporal characteristics of the video.
3. Construct the category semantic space of the video: for the training-set and test-set categories Y_tr and Y_te, build their class descriptions A_tr and A_te. Each category y_i ∈ Y is expressed as a semantic vector a_i ∈ A, and each dimension of this semantic vector represents a high-level attribute such as "black-and-white", "has tail" or "has feathers"; the dimension is set to a non-zero value when the class possesses that attribute.
4. Compute the category similarity for the target video: calculate the similarity distance between each test-data category and the training-set categories, set a similarity threshold t, and eliminate test categories that overlap with the training set or whose distance is smaller than the threshold.
5. Make the video classification decision: the video feature vector is reduced in dimension through a fully connected layer so that it matches the dimension of the semantic space. The Triplet Ranking Loss function is then used to shorten the distance to positive samples of the same class as the training sample and to enlarge the distance to negative samples of different classes.
The beneficial effects of the application are specifically expressed as follows:
1) Unlike prior approaches that use inter-frame difference methods to extract video data, this application uses sparse sampling to extract video frames. The extracted frames are sparse yet cover the whole video, so the temporal dependency between frames that are far apart can be modeled and video-level information is guaranteed to be captured.
2) Scene-understanding images are converted into videos through the Ken Burns effect, providing the classification model with more scene information and enriching the variety of the training data set.
3) When extracting the latent visual features, the 3D convolutional neural network uses two pathways that mimic the roughly 4:1 ratio of retinal ganglion cells devoted to fine spatial and colour information versus motion information in the primate visual system, extracting the spatial and temporal characteristics of the video simultaneously and thus capturing video features more comprehensively.
4) When processing the test data set, the class similarity distance between test-set and training-set classes is computed from the cosine similarity of the label embeddings, and only test classes whose distance exceeds the similarity threshold t are retained, so that the test results of the model remain valid.
5) Using Triplet Ranking Loss as the loss function, the loss value is determined by the relation between the anchor-to-positive distance d(r_a, r_p), the anchor-to-negative distance d(r_a, r_n) and the margin m, and the network parameters are adjusted accordingly.
Drawings
FIG. 1 is a flowchart of an algorithm according to an embodiment of the present application.
Detailed Description
The present application is further illustrated by the accompanying drawings and the following detailed description, which are to be understood as merely illustrative and not limiting of its scope; equivalent modifications that occur to those skilled in the art upon reading the application fall within the scope defined by the appended claims.
The application discloses a video classification method based on zero-shot learning, which, as shown in FIG. 1, comprises the following steps:
101. Preprocess the training-set data and extract video clips through sparse sampling: each video is first split into a fixed number of segments, one snippet is randomly extracted from each segment, and the shortest side of each frame is resized to 128 pixels. A 112×112 patch is randomly cropped during training and a central patch is cropped at inference time, yielding a training data set D_s = {(x_1, c_1), (x_2, c_2), ..., (x_N, c_N)} consisting of pairs of a video x and its class label c.
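A minimal Python sketch of step 101 is given below; the use of OpenCV and the helper names sparse_sample and preprocess_clip are assumptions made for illustration, not part of the patent.

    # Sketch of step 101: sparse sampling, shortest-side resize to 128 px, 112x112 crop.
    import random
    import cv2
    import numpy as np

    def sparse_sample(video_path, num_segments=8):
        """Split the video into num_segments parts and pick one random frame per part."""
        cap = cv2.VideoCapture(video_path)
        total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
        bounds = np.linspace(0, total, num_segments + 1, dtype=int)
        frames = []
        for lo, hi in zip(bounds[:-1], bounds[1:]):
            idx = random.randint(lo, max(lo, hi - 1))
            cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
            ok, frame = cap.read()
            if ok:
                frames.append(frame)
        cap.release()
        return frames

    def preprocess_clip(frames, train=True, short_side=128, crop=112):
        """Resize the shortest side to 128 px, then random-crop (train) or center-crop (inference)."""
        out = []
        for f in frames:
            h, w = f.shape[:2]
            scale = short_side / min(h, w)
            f = cv2.resize(f, (int(round(w * scale)), int(round(h * scale))))
            h, w = f.shape[:2]
            if train:
                y, x = random.randint(0, h - crop), random.randint(0, w - crop)
            else:
                y, x = (h - crop) // 2, (w - crop) // 2
            out.append(f[y:y + crop, x:x + crop])
        return np.stack(out)  # (num_segments, 112, 112, 3)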
102. The training-set data are augmented by converting still images to video with the Ken Burns effect. For example, to create a 16-frame video from an image, the "start" and "end" crop positions (and crop sizes) in the image are chosen at random and 16 crops are obtained by linear interpolation between them. The crops are then resized to 112×112; the resulting training data set is denoted D_p.
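A minimal sketch of the Ken Burns augmentation of step 102, assuming NumPy and OpenCV; interpolating a rectangular crop box between random start and end positions is one plausible reading of the description, not the patent's exact procedure.

    # Sketch of step 102: turn a still image into a 16-frame "Ken Burns" clip.
    import random
    import cv2
    import numpy as np

    def ken_burns_clip(image, num_frames=16, out_size=112):
        h, w = image.shape[:2]

        def random_box():
            size = random.uniform(0.5, 0.9) * min(h, w)   # random crop size
            y = random.uniform(0, h - size)
            x = random.uniform(0, w - size)
            return np.array([y, x, size])

        start, end = random_box(), random_box()
        frames = []
        for t in np.linspace(0.0, 1.0, num_frames):
            y, x, size = (1 - t) * start + t * end        # linear interpolation of the crop box
            crop = image[int(y):int(y + size), int(x):int(x + size)]
            frames.append(cv2.resize(crop, (out_size, out_size)))
        return np.stack(frames)                           # (16, 112, 112, 3)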
103. Visual feature extraction: the highest-level representation vector is extracted by a 3D convolutional neural network from the training set preprocessed in steps 101 and 102 and used as the feature vector of the video. The network has two pathways whose input frame counts are in a 1:4 ratio, so that the spatial and temporal features of the video are fully extracted; the two sets of features are fused through lateral connections to form the complete visual feature of the video;
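The two-pathway extractor of step 103 follows the SlowFast idea (a slow pathway with few frames and a fast pathway with four times as many). The sketch below uses torchvision's r3d_18 as a stand-in 3D backbone and plain feature concatenation instead of the lateral connections described in the text; these substitutions are simplifying assumptions, not the patent's architecture.

    # Sketch of step 103: two 3D-CNN pathways with a 1:4 frame ratio.
    import torch
    import torch.nn as nn
    from torchvision.models.video import r3d_18

    class TwoPathwayExtractor(nn.Module):
        def __init__(self):
            super().__init__()
            self.slow = r3d_18(weights=None)
            self.fast = r3d_18(weights=None)
            self.slow.fc = nn.Identity()   # expose the 512-d backbone features
            self.fast.fc = nn.Identity()

        def forward(self, clip):           # clip: (B, 3, T, 112, 112)
            slow_in = clip[:, :, ::4]      # keep 1 frame out of 4 for the slow pathway
            feat = torch.cat([self.slow(slow_in), self.fast(clip)], dim=1)
            return feat                    # (B, 1024) complete visual feature

    # Example: features = TwoPathwayExtractor()(torch.randn(2, 3, 16, 112, 112))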
104. Category word-vector construction: the word2vec tool is used to map the training-set and test-set categories Y_tr and Y_te into word vectors, so that the similarity between category words can be measured. Each dimension of the word vector represents a high-level attribute and is set to a non-zero value when the category word carries that attribute; here the category label is mapped to a 300-dimensional word vector, i.e. c → R^300;
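Step 104 can be realised, for example, with gensim and a pretrained 300-dimensional word2vec model; the model file name and the averaging of multi-word labels below are assumptions for illustration.

    # Sketch of step 104: map each category label to a 300-d word2vec embedding.
    import numpy as np
    from gensim.models import KeyedVectors

    # "GoogleNews-vectors-negative300.bin" is a placeholder for any pretrained word2vec model.
    w2v = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)

    def label_embedding(label):
        """Average the word vectors of a (possibly multi-word) category label."""
        words = [w for w in label.lower().split() if w in w2v]
        return np.mean([w2v[w] for w in words], axis=0)  # shape (300,)

    class_vectors = {c: label_embedding(c) for c in ["playing guitar", "surfing", "archery"]}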
105. Process the test-set data: class distances are computed from the cosine similarity of the class-label embeddings, and every test class whose distance to any training-set class preprocessed in step 101 is smaller than the similarity threshold t is removed from the test set, yielding a test subset with fewer classes;
106. Make the video classification decision: the video feature vector is reduced in dimension through a fully connected layer so that it matches the dimension of the word vectors. The Triplet Ranking Loss function is then used to shorten the distance to positive samples of the same class as the training sample and to enlarge the distance to negative samples of different classes. The model is trained on the training data set, and the resulting video classifier is used to classify videos of unknown classes.
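A compact sketch of the inference side of step 106: a fully connected layer projects the visual feature (assumed here to be the 1024-d output of the two-pathway extractor above) down to the 300-d word-vector space, and an unseen video is assigned to the nearest remaining test class by cosine similarity. The dimensions and helper names are illustrative assumptions.

    # Sketch of step 106: project video features to the semantic space and
    # classify an unseen video by nearest class embedding (cosine similarity).
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    projection = nn.Linear(1024, 300)     # FC layer matching visual features to word vectors

    def classify(video_feature, test_class_vectors):
        """video_feature: (1024,) tensor; test_class_vectors: dict name -> (300,) tensor."""
        z = F.normalize(projection(video_feature), dim=0)
        scores = {name: torch.dot(z, F.normalize(vec, dim=0)).item()
                  for name, vec in test_class_vectors.items()}
        return max(scores, key=scores.get)  # predicted class name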
The calculation of the category distance from the cosine similarity of the label embeddings in step 105 comprises the following steps:
(1) Let d: C × C → R denote a distance measure over the space of all possible class names C, let t ∈ R be the similarity threshold, let D_s and D_p be the training data sets, let c_s be a class in the training data set and c_t a class in the test data set. If d(c_s, c_t) ≥ t for every training class c_s and every retained test class c_t, the video classification task fully honors the zero-shot constraint.
(2) Here d is defined through the semantic embedding of the class names. The distance between two classes is computed with the cosine measure cos, using the following formula:
d(c_1, c_2) = cos(W2V(c_1), W2V(c_2))
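A hedged sketch of this class-distance computation and the zero-shot filtering of step 105: the sketch treats d as the cosine distance 1 − cosine similarity, so that "distance smaller than t" means "too similar to a training class". This is one consistent reading of the formula above, not a statement of the patent's exact convention.

    # Sketch of the zero-shot constraint check: drop every test class whose
    # cosine distance to any training class is below the threshold t.
    import numpy as np

    def cosine_distance(u, v):
        # interpreted as 1 - cosine similarity of the two word vectors
        return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

    def filter_test_classes(train_vecs, test_vecs, t=0.3):
        """train_vecs/test_vecs: dict class name -> word vector; t: similarity threshold."""
        kept = {}
        for c_t, v_t in test_vecs.items():
            if all(cosine_distance(v_s, v_t) >= t for v_s in train_vecs.values()):
                kept[c_t] = v_t   # far enough from every training class
        return kept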
The Triplet Ranking Loss function described in step 106 is used as follows:
(1) The loss function uses triplets of training samples to obtain better experimental results; each triplet consists of an anchor sample x_a, a positive sample x_p and a negative sample x_n. The goal is to make the anchor-to-negative distance d(r_a, r_n) larger than the anchor-to-positive distance d(r_a, r_p) by at least the margin m.
(2) The Triplet Ranking Loss formula is specifically as follows:
L(r_a, r_p, r_n) = max(0, m + d(r_a, r_p) − d(r_a, r_n))
Triplet Ranking Loss has the following three cases:
① Easy triplets: d(r_a, r_n) > d(r_a, r_p) + m. In the embedding space the negative sample is already far enough from the anchor relative to the positive sample; the loss is 0 and the network parameters are not updated.
② Hard triplets: d(r_a, r_n) < d(r_a, r_p). The negative sample is closer to the anchor than the positive sample; the loss is positive and greater than m.
③ Semi-hard triplets: d(r_a, r_p) < d(r_a, r_n) < d(r_a, r_p) + m. The negative sample is farther from the anchor than the positive sample, but by less than the margin m; the loss is still positive and smaller than m.
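The three cases follow directly from the formula L = max(0, m + d(r_a, r_p) − d(r_a, r_n)). A minimal PyTorch sketch is given below; Euclidean distance is an assumed choice of d, and any other metric could be substituted.

    # Sketch of Triplet Ranking Loss: L = max(0, m + d(r_a, r_p) - d(r_a, r_n)).
    import torch

    def triplet_ranking_loss(r_a, r_p, r_n, m=0.2):
        d_ap = torch.norm(r_a - r_p, dim=-1)          # anchor-to-positive distance
        d_an = torch.norm(r_a - r_n, dim=-1)          # anchor-to-negative distance
        return torch.clamp(m + d_ap - d_an, min=0).mean()

    # Easy triplet (d_an > d_ap + m)        -> loss 0, no parameter update
    # Hard triplet (d_an < d_ap)            -> loss > m
    # Semi-hard    (d_ap < d_an < d_ap + m) -> 0 < loss < m

PyTorch's built-in nn.TripletMarginLoss implements the same margin-based formulation and could be used instead of this hand-written version.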

Claims (5)

1. A short video classification method based on zero-shot learning, comprising the following steps:
a. constructing training data for zero-shot learning: extracting video clips, and resizing and randomly cropping each frame to obtain a training data set; augmenting the training-set data by turning images into videos with the Ken Burns effect and then resizing the video frames spatially to obtain an augmented training data set;
b. extracting latent visual features, namely obtaining deep feature vectors of the video, comprising its spatial features and temporal features, by using a deep neural network;
c. constructing a category semantic space of the video, and constructing a category description A of the tag category Y; each category is represented in the form of a semantic vector, and each dimension of the semantic vector represents a high-level attribute, which is set to a non-zero value in its dimension when the category contains such an attribute;
d. performing category similarity calculation for the target video: calculating the similarity distance between the test-data categories and the training-set categories, setting a similarity threshold t, and eliminating test categories that overlap with the training set or whose distance is smaller than the similarity threshold;
e. making the video classification decision: reducing the dimension of the video feature vector through a fully connected layer so that it is consistent with the dimension of the semantic space; the Triplet Ranking Loss function is then used to shorten the distance to positive samples of the same class as the training sample and to enlarge the distance to negative samples of different classes.
2. The short video classification method based on zero-shot learning of claim 1, wherein in step b the training set is modeled by a deep neural network and the important spatial and temporal features of the video are extracted.
3. The zero-shot learning-based short video classification method according to claim 1, wherein in step c a category description A is constructed for the label categories Y, and each category y_i ∈ Y is expressed as a semantic vector a_i ∈ A, wherein the choice of attributes for the semantic vectors has a great impact on the final zero-shot video classification algorithm.
4. The short video classification method based on zero-shot learning according to claim 1, wherein in step d test classes that overlap with the training set or whose distance is smaller than the similarity threshold are eliminated, so that the model generalizes better.
5. The short video classification method based on zero-shot learning of claim 1, wherein in step e the video feature vector is first reduced in dimension and the network parameters are then updated using the Triplet Ranking Loss function; in Triplet Ranking Loss, the magnitude of the loss value is determined by the relationship between the anchor-to-positive distance, the anchor-to-negative distance and the margin, and the network parameters are updated accordingly.
CN202110785398.4A 2021-07-12 2021-07-12 Short video classification method based on zero-order learning Active CN113609918B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110785398.4A CN113609918B (en) 2021-07-12 2021-07-12 Short video classification method based on zero-order learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110785398.4A CN113609918B (en) 2021-07-12 2021-07-12 Short video classification method based on zero-order learning

Publications (2)

Publication Number Publication Date
CN113609918A CN113609918A (en) 2021-11-05
CN113609918B (en) 2023-10-13

Family

ID=78304418

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110785398.4A Active CN113609918B (en) 2021-07-12 2021-07-12 Short video classification method based on zero-order learning

Country Status (1)

Country Link
CN (1) CN113609918B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3171297A1 (en) * 2015-11-18 2017-05-24 CentraleSupélec Joint boundary detection image segmentation and object recognition using deep learning
WO2018171109A1 (en) * 2017-03-23 2018-09-27 北京大学深圳研究生院 Video action detection method based on convolutional neural network
CN109961089A (en) * 2019-02-26 2019-07-02 中山大学 Small sample and zero sample image classification method based on metric learning and meta learning
CN111510233A (en) * 2013-12-03 2020-08-07 LG Electronics Inc. Method of synchronizing supplemental content with uncompressed audio or video and apparatus therefor


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Chao Lemen; Xing Chunxiao; Zhang Yong. Current status and trends of data science research. Computer Science, 2018, (No. 01), full text. *
Chen Changhuai; Han Lixin; Zeng Xiaoqin; Wang Min. Re-ranking for image retrieval based on visual features. Information Technology, 2012, (No. 12), full text. *

Also Published As

Publication number Publication date
CN113609918A (en) 2021-11-05

Similar Documents

Publication Publication Date Title
CN109614979B (en) Data augmentation method and image classification method based on selection and generation
CN109726657B (en) Deep learning scene text sequence recognition method
Gosselin et al. Revisiting the fisher vector for fine-grained classification
AU2014368997B2 (en) System and method for identifying faces in unconstrained media
Le et al. Openforensics: Large-scale challenging dataset for multi-face forgery detection and segmentation in-the-wild
WO2018023734A1 (en) Significance testing method for 3d image
Ju et al. Fusing global and local features for generalized ai-synthesized image detection
CN111738054A (en) Behavior anomaly detection method based on space-time self-encoder network and space-time CNN
Wan et al. A new technique for summarizing video sequences through histogram evolution
WO2022199225A1 (en) Decoding method and apparatus, and computer-readable storage medium
CN110851627B (en) Method for describing sun black subgroup in full-sun image
Salem et al. Semantic image inpainting using self-learning encoder-decoder and adversarial loss
Liu et al. Asflow: Unsupervised optical flow learning with adaptive pyramid sampling
US20170200258A1 (en) Super-resolution image reconstruction method and apparatus based on classified dictionary database
Liu et al. Component semantic prior guided generative adversarial network for face super-resolution
Guo et al. Attribute-controlled face photo synthesis from simple line drawing
Liu et al. Exploring simple and transferable recognition-aware image processing
Guo et al. Scale region recognition network for object counting in intelligent transportation system
CN112132145B (en) Image classification method and system based on model extended convolutional neural network
CN110555406B (en) Video moving target identification method based on Haar-like characteristics and CNN matching
CN113609918B (en) Short video classification method based on zero-order learning
CN109165551B (en) Expression recognition method for adaptively weighting and fusing significance structure tensor and LBP characteristics
Ilin et al. Creating training datasets for ocr in mobile device video stream.
CN111209886A (en) Rapid pedestrian re-identification method based on deep neural network
Özyurt et al. A new method for classification of images using convolutional neural network based on Dwt-Svd perceptual hash function

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant