CN113609918A - Short video classification method based on zero-shot learning - Google Patents

Short video classification method based on zero-shot learning

Info

Publication number
CN113609918A
CN113609918A (application CN202110785398.4A)
Authority
CN
China
Prior art keywords
video
category
zero
distance
class
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110785398.4A
Other languages
Chinese (zh)
Other versions
CN113609918B (en)
Inventor
陶珺
韩立新
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hohai University HHU
Original Assignee
Hohai University HHU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hohai University HHU filed Critical Hohai University HHU
Priority to CN202110785398.4A priority Critical patent/CN113609918B/en
Publication of CN113609918A publication Critical patent/CN113609918A/en
Application granted granted Critical
Publication of CN113609918B publication Critical patent/CN113609918B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00Geometric image transformation in the plane of the image
    • G06T3/40Scaling the whole image or part thereof
    • G06T3/4007Interpolation-based scaling, e.g. bilinear interpolation
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Abstract

The invention provides a short video classification method based on zero-shot learning, which comprises the following steps: a) constructing a training data set and extracting clip segments from the original short videos; b) introducing the classic Ken Burns effect for data enhancement; c) extracting visual features with a deep neural network; d) constructing a semantic space and building a class description A for the label classes Y, each class being represented as a semantic vector in which every dimension encodes a high-level attribute; e) calculating the class similarity of the target video and eliminating target classes whose distance to the video training set is too small; f) making the target-video classification decision, with a Triplet Ranking Loss function pulling samples of the same class together and pushing samples of different classes apart. The method makes full use of both the video features and the label features of short videos, effectively addresses their label-classification problem, and improves classification accuracy on unseen short videos.

Description

Short video classification method based on zero-shot learning
Technical Field
The invention relates to the technical fields of computer vision and transfer learning, and in particular to a short video classification method based on zero-shot learning.
Background
At present there are two main video classification approaches. One is the two-stream network, in which two 2D convolutional neural networks (a spatial-stream ConvNet and a temporal-stream ConvNet) separately extract the static features and the motion features of objects in a video, i.e. its spatial and temporal features. The other is the 3D convolutional neural network, which performs convolutions with 3D kernels and can therefore capture temporal and spatial feature information in the video simultaneously. Trained on large datasets, both methods can accurately classify videos into hundreds of different categories.
However, both of the above methods require enough samples to train a sufficiently good model, and annotating video data is very expensive. The volume of video on the internet keeps growing (YouTube, for example, receives hundreds of hours of new video every minute), so a model trained only on an annotated video dataset is difficult to keep effective.
Zero-shot learning (ZSL) is a good solution to this problem. A ZSL model is trained only once and can then generalize to new tasks whose classes do not appear in the training data set. In ZSL the model is trained on training-set data and must classify objects in the test set, yet the training-set classes and the test-set classes do not intersect; during training, a connection between the two must therefore be established through descriptions of the classes for the model to be effective.
Disclosure of Invention
The purpose of the invention is as follows: aiming at the problems that current video classification techniques make limited use of the extracted video features and struggle to classify videos of unseen classes, the invention provides a video classification method based on zero-shot learning.
The technical scheme is as follows: a video classification method based on zero-shot learning comprises the following steps:
1. Construct a zero-shot-learning training data set: sparsely sample the training-set video data and reshape the shortest edge of each frame to 128 pixels. Randomly crop a 112x112 patch on the training data set, and crop a center patch at inference. Perform data enhancement on the training set by combining images into videos with the Ken Burns effect: a series of crop windows ("objects") moves across the image, simulating video-like motion.
2. Extract latent visual features: starting from the video content, mine the deep features closely tied to the video, namely its spatial and temporal features, to ensure that the features are effective. The deepest representation vector is extracted from the video by a 3D convolutional neural network and serves as the video's feature vector; the two pathways of the network process video frames at a ratio of 1:4 so that the spatial and temporal features of the video are fully extracted.
3. Construct the category semantic space of the video: for the training-set and test-set classes Y_tr and Y_te, construct their class descriptions A_tr and A_te. Every class y_i ∈ Y is expressed as a semantic vector a_i ∈ A, and each dimension of the semantic vector represents a high-level attribute such as "black and white", "has a tail" or "has feathers"; when a class possesses such an attribute, the corresponding dimension is set to a non-zero value.
4. Calculate the class similarity of the target video: compute the similarity distance between the test-data classes and the training-set classes, set a similarity threshold t, and eliminate test classes that overlap with the training classes or whose distance to them is less than the threshold.
5. Make the video classification decision: reduce the dimensionality of the video feature vector through a fully connected layer so that it matches the dimensionality of the semantic space. Then use a Triplet Ranking Loss function to shorten the distance to positive samples belonging to the same class as the training sample and enlarge the distance to negative samples belonging to different classes.
The beneficial effects of the invention are specifically expressed as follows:
1) Unlike prior methods that extract video data with an inter-frame difference method, the invention extracts video frames by sparse sampling. The extracted frames are sparse and global, so temporal dependencies between frames that lie far apart can be modeled and video-level information obtained.
2) Scene-understanding images are combined into videos through the Ken Burns effect, providing the classification model with more scene information and enriching the categories of the training data set.
3) When extracting latent visual features, mimicking the roughly 4:1 ratio between the retinal ganglion cells that perceive fine spatial and color information and those that perceive motion in the primate visual system, the spatial and temporal features of the video are extracted simultaneously through two pathways of a 3D convolutional neural network, so the video features are captured more comprehensively.
4) The test data set is processed: the class similarity distance between the test set and the training set is computed from the cosine similarity of the label embeddings, and only the test-set classes whose class distance exceeds the set similarity threshold t are kept, so that the model's test results are genuinely valid.
5) The Triplet Ranking Loss is used as the loss function: the loss value is obtained from the relation among the anchor-positive distance d(r_a, r_p), the anchor-negative distance d(r_a, r_n) and the margin m, and the network parameters are adjusted accordingly.
Drawings
FIG. 1 is a flowchart of an algorithm according to an embodiment of the present invention.
Detailed Description
The present invention is further illustrated by the following figures and specific examples, which are to be understood as illustrative only and not as limiting the scope of the invention, which is to be given the full breadth of the appended claims and any and all equivalent modifications thereof which may occur to those skilled in the art upon reading the present specification.
The invention relates to a video classification method based on zero-shot learning, which, as shown in FIG. 1, comprises the following steps:
101. Training-set data preprocessing, extracting video clip segments by sparse sampling: the video is first divided into a fixed number of segments, one snippet is randomly drawn from each segment, and the shortest edge of each frame is reshaped to 128 pixels. A 112x112 patch is randomly cropped on the training data set, and a center patch is cropped at inference, yielding the training data set D_s = {(x_1, c_1), (x_2, c_2), ..., (x_N, c_N)}, in which each video x is paired with its class label c.
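The following is a minimal sketch of this preprocessing step, assuming OpenCV is used for frame decoding and resizing; the function name sample_clip, the segment count of 16 and the other parameter defaults are illustrative choices, not values taken from the patent.

```python
# Hypothetical sketch of sparse sampling + cropping (step 101); names are illustrative.
import random
import numpy as np
import cv2  # assumed available for decoding and resizing

def sample_clip(video_path, num_segments=16, short_side=128, crop=112, train=True):
    """Sparsely sample one frame per segment, resize the short side to 128 px,
    then take a random 112x112 crop (training) or a center crop (inference)."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    seg_len = max(total // num_segments, 1)
    frames = []
    for s in range(num_segments):
        # pick one random frame inside each of the equal-length segments
        idx = s * seg_len + (random.randrange(seg_len) if train else seg_len // 2)
        cap.set(cv2.CAP_PROP_POS_FRAMES, min(idx, total - 1))
        ok, frame = cap.read()
        if not ok:
            break
        h, w = frame.shape[:2]
        scale = short_side / min(h, w)
        frame = cv2.resize(frame, (int(w * scale), int(h * scale)))
        frames.append(frame)
    cap.release()
    clip = np.stack(frames)                       # (T, H, W, 3)
    t, h, w, _ = clip.shape
    if train:                                     # random crop during training
        y, x = random.randint(0, h - crop), random.randint(0, w - crop)
    else:                                         # center crop at inference
        y, x = (h - crop) // 2, (w - crop) // 2
    return clip[:, y:y + crop, x:x + crop, :]
```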
102. Training-set data enhancement, converting images to videos with the Ken Burns effect: for example, to create a 16-frame video from one image, a "start" and an "end" object position (and object size) are randomly selected in the image and linearly interpolated to obtain 16 objects, which are then resized to 112x112. The resulting training data set is denoted D_p.
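A sketch of how such a Ken Burns clip could be synthesized from a single image is shown below; the helper ken_burns_clip and its sampling of crop windows are assumptions consistent with the description (random start and end windows, linear interpolation to 16 frames, resize to 112x112), not the patent's exact implementation.

```python
# Hypothetical Ken Burns-style augmentation (step 102): one still image -> 16-frame clip.
import random
import numpy as np
import cv2  # assumed available for resizing

def ken_burns_clip(image, num_frames=16, out_size=112):
    """Turn one still image into a short clip by panning/zooming a crop window."""
    h, w = image.shape[:2]
    def random_window():
        size = random.randint(out_size, min(h, w))           # random "object" size
        y = random.randint(0, h - size)
        x = random.randint(0, w - size)
        return np.array([y, x, size], dtype=np.float32)
    start, end = random_window(), random_window()
    frames = []
    for i in range(num_frames):
        alpha = i / (num_frames - 1)
        y, x, size = (1 - alpha) * start + alpha * end        # linear interpolation
        y, x, size = int(y), int(x), int(size)
        patch = image[y:y + size, x:x + size]
        frames.append(cv2.resize(patch, (out_size, out_size)))
    return np.stack(frames)                                   # (16, 112, 112, 3)
```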
103. Visual feature extraction: the deepest representation vector is extracted from the training set preprocessed in steps 101 and 102 by a 3D convolutional neural network and serves as the feature vector of the video. The two pathways of the network process clip segments at a frame-number ratio of 1:4 so that the spatial and temporal features of the video are fully extracted, and the two feature streams are joined by lateral connections to form the complete visual feature of the video;
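The two-pathway idea can be illustrated with the following minimal PyTorch sketch. It is not the patent's network, only a toy model that keeps the stated ingredients: a slow path that sees one quarter of the frames and focuses on spatial detail, a fast path that sees all frames for temporal detail, and a lateral fusion into a single clip-level feature vector.

```python
# Toy two-pathway 3D CNN in the spirit of step 103 (assumed architecture, not the patent's).
import torch
import torch.nn as nn

class TwoPathway3DCNN(nn.Module):
    def __init__(self, feat_dim=512):
        super().__init__()
        self.slow = nn.Sequential(  # wider channels, fewer frames
            nn.Conv3d(3, 64, kernel_size=(1, 7, 7), stride=(1, 2, 2), padding=(0, 3, 3)),
            nn.BatchNorm3d(64), nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool3d(1),
        )
        self.fast = nn.Sequential(  # fewer channels, all frames
            nn.Conv3d(3, 8, kernel_size=(5, 7, 7), stride=(1, 2, 2), padding=(2, 3, 3)),
            nn.BatchNorm3d(8), nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool3d(1),
        )
        self.fc = nn.Linear(64 + 8, feat_dim)      # lateral fusion by concatenation

    def forward(self, clip):                       # clip: (B, 3, T, 112, 112)
        slow_in = clip[:, :, ::4]                  # 1:4 frame-rate ratio between paths
        s = self.slow(slow_in).flatten(1)          # (B, 64)
        f = self.fast(clip).flatten(1)             # (B, 8)
        return self.fc(torch.cat([s, f], dim=1))   # complete visual feature
```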
104. Category word-vector construction: the word2vec tool maps the training-set and test-set classes Y_tr and Y_te to word vectors, and the similarity between class words is measured on those vectors. Each dimension of a word vector represents a high-level attribute that takes a non-zero value when the class word carries that attribute; here the class labels are mapped to 300-dimensional word vectors, i.e. c → R^300.
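A sketch of this label-embedding step, assuming a pretrained 300-dimensional word2vec model is available through gensim; the file name and the averaging of multi-word labels are assumptions, not requirements stated in the patent.

```python
# Hypothetical class-label embedding for step 104 (c -> R^300).
import numpy as np
from gensim.models import KeyedVectors

# Assumed path to pretrained 300-d word2vec vectors.
w2v = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)

def embed_label(label):
    """Map a class label such as 'playing guitar' to a 300-d semantic vector."""
    words = [w for w in label.lower().split() if w in w2v]
    if not words:
        return np.zeros(w2v.vector_size, dtype=np.float32)
    return np.mean([w2v[w] for w in words], axis=0)   # average multi-word labels
```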
105. Test-set data processing: the class distance is calculated from the cosine similarity of the class-label embeddings, and every class whose distance to any training-set class preprocessed in step 101 is less than the similarity threshold t is removed from the test set, yielding a test-set subset with fewer classes;
106. Video classification decision: the dimensionality of the video feature vector is reduced through a fully connected layer so that it matches the word-vector dimensionality. A Triplet Ranking Loss function is then used to shorten the distance to positive samples of the same class as the training sample and enlarge the distance to negative samples of different classes. Training on the training data set finally yields a video classifier that can classify videos of unseen classes.
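The decision rule in step 106 can be pictured as follows. The projection layer and the nearest-class lookup in this sketch are assumptions consistent with the embedding-based setup; the patent does not spell out the exact inference-time rule.

```python
# Hypothetical inference sketch: project the visual feature into the 300-d semantic
# space and assign the unseen class whose word vector is closest by cosine similarity.
import torch
import torch.nn.functional as F

project = torch.nn.Linear(512, 300)    # visual feature dim -> semantic space dim

def classify(video_feat, class_vectors, class_names):
    """video_feat: (512,) tensor; class_vectors: (K, 300) word vectors of unseen classes."""
    r = F.normalize(project(video_feat), dim=0)
    a = F.normalize(class_vectors, dim=1)
    scores = a @ r                       # cosine similarity to each class embedding
    return class_names[int(scores.argmax())]
```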
In step 105, the class distance is calculated from the cosine similarity of the label embeddings as follows:
(1) Let d: C × C → R denote a distance measure on the space of all possible class names C, let t ∈ R be the similarity threshold, let D_s and D_p be the training data sets, let c_s denote a class of the training data set and c_t a class of the test data set. If
d(c_s, c_t) ≥ t for every training class c_s and every test class c_t,
then the video classification task fully honors the zero-shot constraint.
(2) The approach defined herein for d is a semantic embedding of the class names. The distance between two classes is calculated by the cosine distance cos, with the specific formula:
d(c_1, c_2) = cos(W2V(c_1), W2V(c_2))
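A sketch of this screening step is given below. Because the text removes test classes that are too close to the training classes, d is implemented here as a cosine distance (1 minus the cosine similarity); whether the patent uses the raw cosine or its complement is our assumption.

```python
# Hypothetical class-similarity screening for step 105.
import numpy as np

def cosine_distance(a, b):
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def filter_test_classes(test_classes, train_classes, embed, t=0.2):
    """Keep only test classes whose distance to every training class is >= t.
    `embed` maps a class name to its word vector; t is the similarity threshold."""
    kept = []
    for ct in test_classes:
        dists = [cosine_distance(embed(ct), embed(cs)) for cs in train_classes]
        if min(dists) >= t:          # zero-shot constraint: no near-duplicate class
            kept.append(ct)
    return kept
```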
The Triplet Ranking Loss function of step 106 is implemented as follows:
(1) The loss function uses triplets of training data samples to achieve better experimental results. A triplet consists of an anchor sample x_a, a positive sample x_p and a negative sample x_n. Its purpose is to make the distance between the anchor and the negative sample, denoted d(r_a, r_n), larger than the distance between the anchor and the positive sample, denoted d(r_a, r_p), by more than the margin m.
(2) The Triplet Ranking Loss formula is as follows:
L(r_a, r_p, r_n) = max(0, m + d(r_a, r_p) - d(r_a, r_n))
The Triplet Ranking Loss gives rise to the following three cases:
① Easy triplets: d(r_a, r_n) > d(r_a, r_p) + m. In the embedding space the negative sample is already far enough from the anchor relative to the positive sample; the loss is 0 and the network parameters are not updated.
② Hard triplets: d(r_a, r_n) < d(r_a, r_p). The negative sample is closer to the anchor than the positive sample; the loss is positive and greater than m.
③ Semi-hard triplets: d(r_a, r_p) < d(r_a, r_n) < d(r_a, r_p) + m. The negative sample is farther from the anchor than the positive sample, but by less than the margin m; the loss is still positive but smaller than m.
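For reference, a minimal PyTorch rendering of L(r_a, r_p, r_n) = max(0, m + d(r_a, r_p) - d(r_a, r_n)) is sketched below; the Euclidean choice for d and the margin value are assumptions, and PyTorch's built-in nn.TripletMarginLoss computes the same quantity for a Euclidean d.

```python
# Sketch of the Triplet Ranking Loss above; d is taken as Euclidean distance (assumption).
import torch
import torch.nn.functional as F

def triplet_ranking_loss(r_a, r_p, r_n, m=0.5):
    d_ap = F.pairwise_distance(r_a, r_p)           # anchor-positive distance
    d_an = F.pairwise_distance(r_a, r_n)           # anchor-negative distance
    return torch.clamp(m + d_ap - d_an, min=0).mean()
```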

Claims (5)

1. A short video classification method based on zero-shot learning, comprising the following steps:
a. constructing a zero-shot-learning training data set: extracting video clip segments and reshaping and randomly cropping each frame to obtain a training data set; enhancing the training-set data by editing images into videos with the Ken Burns effect and then adjusting the spatial size of the videos to obtain an enhanced training data set;
b. extracting latent visual features: obtaining a deep-level feature vector of the video, comprising its spatial and temporal features, with a deep neural network;
c. constructing a category semantic space of the video and building a class description A for the label classes Y; each class is represented as a semantic vector, each dimension of which represents a high-level attribute that is set to a non-zero value when the class possesses that attribute;
d. calculating the class similarity of the target video: computing the similarity distance between the test-data classes and the training-set classes, setting a similarity threshold t, and eliminating test classes that overlap with the training classes or whose distance to them is less than the threshold;
e. making the video classification decision: reducing the dimensionality of the video feature vector through a fully connected layer so that it matches the dimensionality of the semantic space; then using a Triplet Ranking Loss function to shorten the distance to positive samples of the same class as the training sample and enlarge the distance to negative samples of different classes.
2. The short video classification method based on zero-shot learning according to claim 1, wherein in step b the training set is modeled with a deep neural network to extract the important spatial and temporal features of the video.
3. The short video classification method based on zero-shot learning according to claim 1, wherein in step c the class description A is constructed for the label classes Y, and each class y_i ∈ Y is expressed as a semantic vector a_i ∈ A, wherein the choice of attributes for the semantic vectors ultimately has a great influence on the effectiveness of the zero-shot-learning-based video classification algorithm.
4. The short video classification method based on zero-shot learning according to claim 1, wherein eliminating, in step d, the test classes that overlap with or lie closer than the similarity threshold to the training classes achieves better generalization of the model.
5. The short video classification method based on zero-shot learning according to claim 1, wherein in step e the video feature vector is reduced in dimensionality and the Triplet Ranking Loss function is then used to update the network parameters; in the Triplet Ranking Loss, the magnitude of the loss value is determined by the relation among the anchor-positive distance, the anchor-negative distance and the margin, and the network parameters are updated accordingly.
CN202110785398.4A 2021-07-12 2021-07-12 Short video classification method based on zero-shot learning Active CN113609918B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110785398.4A CN113609918B (en) 2021-07-12 2021-07-12 Short video classification method based on zero-shot learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110785398.4A CN113609918B (en) 2021-07-12 2021-07-12 Short video classification method based on zero-shot learning

Publications (2)

Publication Number Publication Date
CN113609918A true CN113609918A (en) 2021-11-05
CN113609918B CN113609918B (en) 2023-10-13

Family

ID=78304418

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110785398.4A Active CN113609918B (en) 2021-07-12 2021-07-12 Short video classification method based on zero-shot learning

Country Status (1)

Country Link
CN (1) CN113609918B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3171297A1 (en) * 2015-11-18 2017-05-24 CentraleSupélec Joint boundary detection image segmentation and object recognition using deep learning
WO2018171109A1 (en) * 2017-03-23 2018-09-27 北京大学深圳研究生院 Video action detection method based on convolutional neural network
CN109961089A (en) * 2019-02-26 2019-07-02 中山大学 Small sample and zero sample image classification method based on metric learning and meta learning
CN111510233A (en) * 2013-12-03 2020-08-07 LG Electronics Inc. Method of synchronizing supplemental content with uncompressed audio or video and apparatus therefor

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111510233A (en) * 2013-12-03 2020-08-07 LG Electronics Inc. Method of synchronizing supplemental content with uncompressed audio or video and apparatus therefor
EP3171297A1 (en) * 2015-11-18 2017-05-24 CentraleSupélec Joint boundary detection image segmentation and object recognition using deep learning
WO2018171109A1 (en) * 2017-03-23 2018-09-27 北京大学深圳研究生院 Video action detection method based on convolutional neural network
CN109961089A (en) * 2019-02-26 2019-07-02 中山大学 Small sample and zero sample image classification method based on metric learning and meta learning

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
朝乐门; 邢春晓; 张勇: "Current Status and Trends of Data Science Research" (数据科学研究的现状与趋势), Computer Science (计算机科学), no. 01 *
陈畅怀; 韩立新; 曾晓勤; 王敏: "Image Retrieval Re-ranking Based on Visual Features" (基于视觉特征的图像检索重排序), Information Technology (信息技术), no. 12 *

Also Published As

Publication number Publication date
CN113609918B (en) 2023-10-13

Similar Documents

Publication Publication Date Title
WO2018023734A1 (en) Significance testing method for 3d image
CN109614979B (en) Data augmentation method and image classification method based on selection and generation
AU2014368997B2 (en) System and method for identifying faces in unconstrained media
CN103593464B (en) Video fingerprint detecting and video sequence matching method and system based on visual features
US11854244B2 (en) Labeling techniques for a modified panoptic labeling neural network
CN102750385B (en) Correlation-quality sequencing image retrieval method based on tag retrieval
CN106610969A (en) Multimodal information-based video content auditing system and method
CN103020647A (en) Image classification method based on hierarchical SIFT (scale-invariant feature transform) features and sparse coding
CN110827312B (en) Learning method based on cooperative visual attention neural network
CN108021869A (en) A kind of convolutional neural networks tracking of combination gaussian kernel function
CN102034267A (en) Three-dimensional reconstruction method of target based on attention
CN103853724A (en) Multimedia data sorting method and device
Lodh et al. Flower recognition system based on color and GIST features
CN111783521A (en) Pedestrian re-identification method based on low-rank prior guidance and based on domain invariant information separation
Wan et al. A new technique for summarizing video sequences through histogram evolution
CN106203448A (en) A kind of scene classification method based on Nonlinear Scale Space Theory
CN105718934A (en) Method for pest image feature learning and identification based on low-rank sparse coding technology
CN104680189A (en) Pornographic image detection method based on improved bag-of-words model
Guo et al. Scale region recognition network for object counting in intelligent transportation system
CN103336974B (en) A kind of flowers classification discrimination method based on local restriction sparse representation
CN109165551B (en) Expression recognition method for adaptively weighting and fusing significance structure tensor and LBP characteristics
Özyurt et al. A new method for classification of images using convolutional neural network based on Dwt-Svd perceptual hash function
CN105205161A (en) Simultaneous target searching and dividing method based on Internet images
CN113609918A (en) Short video classification method based on zero-shot learning
CN114329050A (en) Visual media data deduplication processing method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant