CN113609918B - Short video classification method based on zero-order learning - Google Patents

Short video classification method based on zero-order learning

Info

Publication number
CN113609918B
CN113609918B (application number CN202110785398.4A)
Authority
CN
China
Prior art keywords
video
category
zero
dimension
short video
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110785398.4A
Other languages
Chinese (zh)
Other versions
CN113609918A (en)
Inventor
陶珺 (Tao Jun)
韩立新 (Han Lixin)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hohai University HHU
Original Assignee
Hohai University HHU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hohai University HHU filed Critical Hohai University HHU
Priority to CN202110785398.4A priority Critical patent/CN113609918B/en
Publication of CN113609918A publication Critical patent/CN113609918A/en
Application granted granted Critical
Publication of CN113609918B publication Critical patent/CN113609918B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00 Geometric image transformation in the plane of the image
    • G06T3/40 Scaling the whole image or part thereof
    • G06T3/4007 Interpolation-based scaling, e.g. bilinear interpolation
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems

Abstract

The application provides a short video classification method based on zero-shot learning, which comprises the following steps: a) constructing a training data set by extracting clips from the original short videos; b) introducing the classic Ken Burns effect for data augmentation; c) extracting visual features with a deep neural network; d) constructing a semantic space by building a category description A for the label categories Y, where each category is expressed as a semantic vector and each dimension of the vector represents a high-level attribute; e) computing category similarity for the target videos and eliminating target categories whose distance to the training-set categories is too small; f) making the classification decision for the target video, using the Triplet Ranking Loss function to separate classes in the embedding space. The method makes full use of both the video features and the label features of short videos, effectively addresses the label classification problem for short videos, and improves classification accuracy on unseen short-video categories.

Description

Short video classification method based on zero-shot learning
Technical Field
The application relates to the technical fields of computer vision and transfer learning, in particular to a short video classification method based on zero-shot learning.
Background
Currently, two main video classification approaches are used. One is the two-stream network, which extracts the static and motion characteristics of objects in the video, i.e. the spatial and temporal features, through two 2D convolutional neural networks (a spatial-stream ConvNet and a temporal-stream ConvNet). The other is the 3D convolutional neural network, which captures temporal and spatial feature information in the video simultaneously by convolving with 3D kernels. Through training on large data sets, both methods can accurately classify videos into hundreds of different categories.
However, both of the above methods require a large number of samples to train a sufficiently good model, and annotating video data is very expensive. The volume of video on the internet keeps growing, for example hundreds of hours of video are uploaded to YouTube every minute, so models that rely only on fully annotated video data sets struggle to keep up with newly emerging content.
Zero-shot learning (ZSL) can better address these problems. A ZSL model is trained once and can then generalize to new tasks whose classes do not appear in the training data set. In ZSL, the model is trained on the training set so that it can classify the objects of the test set even though the training classes and the test classes do not intersect; during training, an association between the training and test classes must be established through class descriptions for the model to be effective.
Disclosure of Invention
The application aims to address the problems that current video classification technology makes limited use of the extracted video features and struggles to classify videos of unknown categories, and therefore provides a video classification method based on zero-shot learning.
The technical scheme is as follows: a video classification method based on zero-shot learning comprises the following steps:
1. Construct the training data set for zero-shot learning: sparse sampling is performed on the training-set videos and the shortest side of each frame is resized to 128 pixels. A 112×112 patch is randomly cropped during training, while a central patch is cropped at inference time. Data augmentation is then performed on the training set by turning still images into video with the Ken Burns effect: a sequence of crops moves across the image to simulate video-like motion.
2. Latent visual feature extraction: deep features closely related to the video content, namely spatial and temporal features, are mined from the video itself to ensure their effectiveness. The deepest representation vector is extracted from the video by a 3D convolutional neural network and used as the video feature vector; the two pathways of the network sample frames at a ratio of 1:4 so as to fully extract the spatial and temporal characteristics of the video.
3. Construct the category semantic space of the video: for the training-set and test-set categories Y_tr and Y_te, build their class descriptions A_tr and A_te. Each category y_i ∈ Y is expressed as a semantic vector a_i ∈ A, and each dimension of this semantic vector represents a high-level attribute such as "black-and-white", "has tail" or "has feathers"; the dimension is set to a non-zero value when the class possesses that attribute.
4. Compute the category similarity for the target video: calculate the similarity distance between each test-data category and the training-set categories, set a similarity threshold t, and eliminate test categories that overlap with the training set or whose distance is smaller than the threshold.
5. Make the video classification decision: the video feature vector is reduced in dimension through a fully connected layer so that it matches the dimension of the semantic space. The Triplet Ranking Loss function is then used to shorten the distance to positive samples of the same class as the training sample and to enlarge the distance to negative samples of different classes.
The beneficial effects of the application are specifically expressed as follows:
1) Unlike prior approaches that use inter-frame difference methods to extract video data, this application uses sparse sampling to extract video frames. The extracted frames are sparse yet cover the whole video, so the temporal dependency between frames that are far apart can be modeled and video-level information is guaranteed to be captured.
2) Scene-understanding images are converted into videos through the Ken Burns effect, providing the classification model with more scene information and enriching the variety of the training data set.
3) When extracting the latent visual features, the 3D convolutional neural network uses two pathways that mimic the roughly 4:1 ratio of retinal ganglion cells devoted to fine spatial and colour information versus motion information in the primate visual system, extracting the spatial and temporal characteristics of the video simultaneously and thus capturing video features more comprehensively.
4) When processing the test data set, the class similarity distance between test-set and training-set classes is computed from the cosine similarity of the label embeddings, and only test classes whose distance exceeds the similarity threshold t are retained, so that the test results of the model remain valid.
5) Using Triplet Ranking Loss as the loss function, the loss value is determined by the relation between the anchor-to-positive distance d(r_a, r_p), the anchor-to-negative distance d(r_a, r_n) and the margin m, and the network parameters are adjusted accordingly.
Drawings
FIG. 1 is a flowchart of an algorithm according to an embodiment of the present application.
Detailed Description
The present application is further illustrated by the accompanying drawings and the following detailed description, which are to be understood as merely illustrative and not limiting of its scope; equivalent modifications that occur to those skilled in the art upon reading the application fall within the scope defined by the appended claims.
The application discloses a video classification method based on zero-shot learning, which, as shown in FIG. 1, comprises the following steps:
101. Preprocess the training-set data and extract video clips through sparse sampling: each video is first split into a fixed number of segments, one snippet is randomly extracted from each segment, and the shortest side of each frame is resized to 128 pixels. A 112×112 patch is randomly cropped during training and a central patch is cropped at inference time, yielding a training data set D_s = {(x_1, c_1), (x_2, c_2), ..., (x_N, c_N)} consisting of pairs of a video x and its class label c.
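A minimal Python sketch of step 101 is given below; the use of OpenCV and the helper names sparse_sample and preprocess_clip are assumptions made for illustration, not part of the patent.

    # Sketch of step 101: sparse sampling, shortest-side resize to 128 px, 112x112 crop.
    import random
    import cv2
    import numpy as np

    def sparse_sample(video_path, num_segments=8):
        """Split the video into num_segments parts and pick one random frame per part."""
        cap = cv2.VideoCapture(video_path)
        total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
        bounds = np.linspace(0, total, num_segments + 1, dtype=int)
        frames = []
        for lo, hi in zip(bounds[:-1], bounds[1:]):
            idx = random.randint(lo, max(lo, hi - 1))
            cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
            ok, frame = cap.read()
            if ok:
                frames.append(frame)
        cap.release()
        return frames

    def preprocess_clip(frames, train=True, short_side=128, crop=112):
        """Resize the shortest side to 128 px, then random-crop (train) or center-crop (inference)."""
        out = []
        for f in frames:
            h, w = f.shape[:2]
            scale = short_side / min(h, w)
            f = cv2.resize(f, (int(round(w * scale)), int(round(h * scale))))
            h, w = f.shape[:2]
            if train:
                y, x = random.randint(0, h - crop), random.randint(0, w - crop)
            else:
                y, x = (h - crop) // 2, (w - crop) // 2
            out.append(f[y:y + crop, x:x + crop])
        return np.stack(out)  # (num_segments, 112, 112, 3)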
102. The training-set data are augmented by converting still images to video with the Ken Burns effect. For example, to create a 16-frame video from an image, the "start" and "end" crop positions (and crop sizes) in the image are chosen at random and 16 crops are obtained by linear interpolation between them. The crops are then resized to 112×112; the resulting training data set is denoted D_p.
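A minimal sketch of the Ken Burns augmentation of step 102, assuming NumPy and OpenCV; interpolating a rectangular crop box between random start and end positions is one plausible reading of the description, not the patent's exact procedure.

    # Sketch of step 102: turn a still image into a 16-frame "Ken Burns" clip.
    import random
    import cv2
    import numpy as np

    def ken_burns_clip(image, num_frames=16, out_size=112):
        h, w = image.shape[:2]

        def random_box():
            size = random.uniform(0.5, 0.9) * min(h, w)   # random crop size
            y = random.uniform(0, h - size)
            x = random.uniform(0, w - size)
            return np.array([y, x, size])

        start, end = random_box(), random_box()
        frames = []
        for t in np.linspace(0.0, 1.0, num_frames):
            y, x, size = (1 - t) * start + t * end        # linear interpolation of the crop box
            crop = image[int(y):int(y + size), int(x):int(x + size)]
            frames.append(cv2.resize(crop, (out_size, out_size)))
        return np.stack(frames)                           # (16, 112, 112, 3)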
103. Visual feature extraction: the highest-level representation vector is extracted by a 3D convolutional neural network from the training set preprocessed in steps 101 and 102 and used as the feature vector of the video. The network has two pathways whose input frame counts are in a 1:4 ratio, so that the spatial and temporal features of the video are fully extracted; the two sets of features are fused through lateral connections to form the complete visual feature of the video;
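The two-pathway extractor of step 103 follows the SlowFast idea (a slow pathway with few frames and a fast pathway with four times as many). The sketch below uses torchvision's r3d_18 as a stand-in 3D backbone and plain feature concatenation instead of the lateral connections described in the text; these substitutions are simplifying assumptions, not the patent's architecture.

    # Sketch of step 103: two 3D-CNN pathways with a 1:4 frame ratio.
    import torch
    import torch.nn as nn
    from torchvision.models.video import r3d_18

    class TwoPathwayExtractor(nn.Module):
        def __init__(self):
            super().__init__()
            self.slow = r3d_18(weights=None)
            self.fast = r3d_18(weights=None)
            self.slow.fc = nn.Identity()   # expose the 512-d backbone features
            self.fast.fc = nn.Identity()

        def forward(self, clip):           # clip: (B, 3, T, 112, 112)
            slow_in = clip[:, :, ::4]      # keep 1 frame out of 4 for the slow pathway
            feat = torch.cat([self.slow(slow_in), self.fast(clip)], dim=1)
            return feat                    # (B, 1024) complete visual feature

    # Example: features = TwoPathwayExtractor()(torch.randn(2, 3, 16, 112, 112))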
104. Category word-vector construction: the word2vec tool is used to map the training-set and test-set categories Y_tr and Y_te into word vectors, so that the similarity between category words can be measured. Each dimension of the word vector represents a high-level attribute and is set to a non-zero value when the category word carries that attribute; here the category label is mapped to a 300-dimensional word vector, i.e. c → R^300;
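Step 104 can be realised, for example, with gensim and a pretrained 300-dimensional word2vec model; the model file name and the averaging of multi-word labels below are assumptions for illustration.

    # Sketch of step 104: map each category label to a 300-d word2vec embedding.
    import numpy as np
    from gensim.models import KeyedVectors

    # "GoogleNews-vectors-negative300.bin" is a placeholder for any pretrained word2vec model.
    w2v = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)

    def label_embedding(label):
        """Average the word vectors of a (possibly multi-word) category label."""
        words = [w for w in label.lower().split() if w in w2v]
        return np.mean([w2v[w] for w in words], axis=0)  # shape (300,)

    class_vectors = {c: label_embedding(c) for c in ["playing guitar", "surfing", "archery"]}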
105. Process the test-set data: class distances are computed from the cosine similarity of the class-label embeddings, and every test class whose distance to any training-set class preprocessed in step 101 is smaller than the similarity threshold t is removed from the test set, yielding a test subset with fewer classes;
106. Make the video classification decision: the video feature vector is reduced in dimension through a fully connected layer so that it matches the dimension of the word vectors. The Triplet Ranking Loss function is then used to shorten the distance to positive samples of the same class as the training sample and to enlarge the distance to negative samples of different classes. The model is trained on the training data set, and the resulting video classifier is used to classify videos of unknown classes.
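A compact sketch of the inference side of step 106: a fully connected layer projects the visual feature (assumed here to be the 1024-d output of the two-pathway extractor above) down to the 300-d word-vector space, and an unseen video is assigned to the nearest remaining test class by cosine similarity. The dimensions and helper names are illustrative assumptions.

    # Sketch of step 106: project video features to the semantic space and
    # classify an unseen video by nearest class embedding (cosine similarity).
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    projection = nn.Linear(1024, 300)     # FC layer matching visual features to word vectors

    def classify(video_feature, test_class_vectors):
        """video_feature: (1024,) tensor; test_class_vectors: dict name -> (300,) tensor."""
        z = F.normalize(projection(video_feature), dim=0)
        scores = {name: torch.dot(z, F.normalize(vec, dim=0)).item()
                  for name, vec in test_class_vectors.items()}
        return max(scores, key=scores.get)  # predicted class name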
The calculation of the category distance from the cosine similarity of the label embeddings in step 105 comprises the following steps:
(1) Let d: C × C → R denote a distance measure over the space of all possible class names C, let t ∈ R be the similarity threshold, let D_s and D_p be the training data sets, let c_s be a class in the training data set and c_t a class in the test data set. If d(c_s, c_t) ≥ t for every training class c_s and every retained test class c_t, the video classification task fully honors the zero-shot constraint.
(2) Here d is defined through the semantic embedding of the class names. The distance between two classes is computed with the cosine measure cos, using the following formula:
d(c_1, c_2) = cos(W2V(c_1), W2V(c_2))
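A hedged sketch of this class-distance computation and the zero-shot filtering of step 105: the sketch treats d as the cosine distance 1 − cosine similarity, so that "distance smaller than t" means "too similar to a training class". This is one consistent reading of the formula above, not a statement of the patent's exact convention.

    # Sketch of the zero-shot constraint check: drop every test class whose
    # cosine distance to any training class is below the threshold t.
    import numpy as np

    def cosine_distance(u, v):
        # interpreted as 1 - cosine similarity of the two word vectors
        return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

    def filter_test_classes(train_vecs, test_vecs, t=0.3):
        """train_vecs/test_vecs: dict class name -> word vector; t: similarity threshold."""
        kept = {}
        for c_t, v_t in test_vecs.items():
            if all(cosine_distance(v_s, v_t) >= t for v_s in train_vecs.values()):
                kept[c_t] = v_t   # far enough from every training class
        return kept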
The Triplet Ranking Loss function described in step 106 is used as follows:
(1) The loss function uses triplets of training samples to obtain better experimental results; each triplet consists of an anchor sample x_a, a positive sample x_p and a negative sample x_n. The goal is to make the anchor-to-negative distance d(r_a, r_n) larger than the anchor-to-positive distance d(r_a, r_p) by at least the margin m.
(2) The Triplet Ranking Loss formula is specifically as follows:
L(r_a, r_p, r_n) = max(0, m + d(r_a, r_p) − d(r_a, r_n))
Triplet Ranking Loss has the following three cases:
① Easy triplets: d(r_a, r_n) > d(r_a, r_p) + m. In the embedding space the negative sample is already far enough from the anchor relative to the positive sample; the loss is 0 and the network parameters are not updated.
② Hard triplets: d(r_a, r_n) < d(r_a, r_p). The negative sample is closer to the anchor than the positive sample; the loss is positive and greater than m.
③ Semi-hard triplets: d(r_a, r_p) < d(r_a, r_n) < d(r_a, r_p) + m. The negative sample is farther from the anchor than the positive sample, but by less than the margin m; the loss is still positive and smaller than m.
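The three cases follow directly from the formula L = max(0, m + d(r_a, r_p) − d(r_a, r_n)). A minimal PyTorch sketch is given below; Euclidean distance is an assumed choice of d, and any other metric could be substituted.

    # Sketch of Triplet Ranking Loss: L = max(0, m + d(r_a, r_p) - d(r_a, r_n)).
    import torch

    def triplet_ranking_loss(r_a, r_p, r_n, m=0.2):
        d_ap = torch.norm(r_a - r_p, dim=-1)          # anchor-to-positive distance
        d_an = torch.norm(r_a - r_n, dim=-1)          # anchor-to-negative distance
        return torch.clamp(m + d_ap - d_an, min=0).mean()

    # Easy triplet (d_an > d_ap + m)        -> loss 0, no parameter update
    # Hard triplet (d_an < d_ap)            -> loss > m
    # Semi-hard    (d_ap < d_an < d_ap + m) -> 0 < loss < m

PyTorch's built-in nn.TripletMarginLoss implements the same margin-based formulation and could be used instead of this hand-written version.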

Claims (5)

1. A short video classification method based on zero-shot learning, comprising the following steps:
a. constructing training data for zero-shot learning: extracting video clips, and resizing and randomly cropping each frame to obtain a training data set; augmenting the training-set data by turning images into videos with the Ken Burns effect and then resizing the video frames spatially to obtain an augmented training data set;
b. extracting latent visual features, namely obtaining deep feature vectors of the video, comprising its spatial features and temporal features, by using a deep neural network;
c. constructing a category semantic space of the video, and constructing a category description A of the tag category Y; each category is represented in the form of a semantic vector, and each dimension of the semantic vector represents a high-level attribute, which is set to a non-zero value in its dimension when the category contains such an attribute;
d. performing category similarity calculation for the target video: calculating the similarity distance between the test-data categories and the training-set categories, setting a similarity threshold t, and eliminating test categories that overlap with the training set or whose distance is smaller than the similarity threshold;
e. making the video classification decision: reducing the dimension of the video feature vector through a fully connected layer so that it is consistent with the dimension of the semantic space; the Triplet Ranking Loss function is then used to shorten the distance to positive samples of the same class as the training sample and to enlarge the distance to negative samples of different classes.
2. The short video classification method based on zero-shot learning of claim 1, wherein in step b the training set is modeled by a deep neural network and the important spatial and temporal features of the video are extracted.
3. The zero-shot learning-based short video classification method according to claim 1, wherein in step c a category description A is constructed for the label categories Y, and each category y_i ∈ Y is expressed as a semantic vector a_i ∈ A, wherein the choice of attributes for the semantic vectors has a great impact on the final zero-shot video classification algorithm.
4. The short video classification method based on zero-shot learning according to claim 1, wherein in step d test classes that overlap with the training set or whose distance is smaller than the similarity threshold are eliminated, so that the model generalizes better.
5. The short video classification method based on zero-shot learning of claim 1, wherein in step e the video feature vector is first reduced in dimension and the network parameters are then updated using the Triplet Ranking Loss function; in Triplet Ranking Loss, the magnitude of the loss value is determined by the relationship between the anchor-to-positive distance, the anchor-to-negative distance and the margin, and the network parameters are updated accordingly.
CN202110785398.4A 2021-07-12 2021-07-12 Short video classification method based on zero-order learning Active CN113609918B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110785398.4A CN113609918B (en) 2021-07-12 2021-07-12 Short video classification method based on zero-order learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110785398.4A CN113609918B (en) 2021-07-12 2021-07-12 Short video classification method based on zero-order learning

Publications (2)

Publication Number Publication Date
CN113609918A CN113609918A (en) 2021-11-05
CN113609918B (en) 2023-10-13

Family

ID=78304418

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110785398.4A Active CN113609918B (en) 2021-07-12 2021-07-12 Short video classification method based on zero-order learning

Country Status (1)

Country Link
CN (1) CN113609918B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3171297A1 (en) * 2015-11-18 2017-05-24 CentraleSupélec Joint boundary detection image segmentation and object recognition using deep learning
WO2018171109A1 (en) * 2017-03-23 2018-09-27 北京大学深圳研究生院 Video action detection method based on convolutional neural network
CN109961089A (en) * 2019-02-26 2019-07-02 中山大学 Small sample and zero sample image classification method based on metric learning and meta learning
CN111510233A (en) * 2013-12-03 2020-08-07 LG Electronics Inc. Method of synchronizing supplemental content with uncompressed audio or video and apparatus therefor


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Chao Lemen; Xing Chunxiao; Zhang Yong. Current status and trends of data science research. Computer Science, 2018, (No. 01), full text. *
Chen Changhuai; Han Lixin; Zeng Xiaoqin; Wang Min. Re-ranking for image retrieval based on visual features. Information Technology, 2012, (No. 12), full text. *

Also Published As

Publication number Publication date
CN113609918A (en) 2021-11-05

Similar Documents

Publication Publication Date Title
CN109614979B (en) Data augmentation method and image classification method based on selection and generation
CN109726657B (en) Deep learning scene text sequence recognition method
Gosselin et al. Revisiting the fisher vector for fine-grained classification
AU2014368997B2 (en) System and method for identifying faces in unconstrained media
Le et al. Openforensics: Large-scale challenging dataset for multi-face forgery detection and segmentation in-the-wild
WO2018023734A1 (en) Significance testing method for 3d image
Ju et al. Fusing global and local features for generalized ai-synthesized image detection
CN111738054A (en) Behavior anomaly detection method based on space-time self-encoder network and space-time CNN
Wan et al. A new technique for summarizing video sequences through histogram evolution
WO2022199225A1 (en) Decoding method and apparatus, and computer-readable storage medium
CN110851627B (en) Method for describing sun black subgroup in full-sun image
Salem et al. Semantic image inpainting using self-learning encoder-decoder and adversarial loss
Liu et al. Asflow: Unsupervised optical flow learning with adaptive pyramid sampling
US20170200258A1 (en) Super-resolution image reconstruction method and apparatus based on classified dictionary database
Liu et al. Component semantic prior guided generative adversarial network for face super-resolution
Guo et al. Attribute-controlled face photo synthesis from simple line drawing
Liu et al. Exploring simple and transferable recognition-aware image processing
Guo et al. Scale region recognition network for object counting in intelligent transportation system
CN112132145B (en) Image classification method and system based on model extended convolutional neural network
CN110555406B (en) Video moving target identification method based on Haar-like characteristics and CNN matching
CN113609918B (en) Short video classification method based on zero-order learning
CN109165551B (en) Expression recognition method for adaptively weighting and fusing significance structure tensor and LBP characteristics
Ilin et al. Creating training datasets for ocr in mobile device video stream.
CN111209886A (en) Rapid pedestrian re-identification method based on deep neural network
Özyurt et al. A new method for classification of images using convolutional neural network based on Dwt-Svd perceptual hash function

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant