CN116597348A - Training method and device for video classification model - Google Patents
Training method and device for video classification model
- Publication number
- CN116597348A (application CN202310507774.2A)
- Authority
- CN
- China
- Prior art keywords
- video
- sample
- training
- classification
- test
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/41—Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/774—Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/80—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
- G06V10/803—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of input or preprocessed data
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
Abstract
An embodiment of this specification provides a training method and apparatus for a video classification model. The training method includes: performing label conversion on the pairing labels of a test video sample and each training video sample; fusing the test video sample with each training video sample according to the category labels of the test video sample and of each training video sample obtained through the label conversion, to obtain virtual video samples; and performing, based on the virtual video samples, model training on a video classification model constructed from the model parameters obtained in the label conversion process, to obtain a trained video classification model.
Description
Technical Field
The present document relates to the field of data processing technologies, and in particular, to a training method and apparatus for a video classification model.
Background
With the development of network technology, information networks have become an important part of everyday life, and it has become mainstream for users to submit service requests and interact with services online; more and more services and capabilities are provided to users in an online manner. This places higher demands on the service capability of the service provider: how the provider effectively processes data in the course of providing a service, so as to obtain more reliable and accurate service results, is a point of growing concern for both service providers and users.
Disclosure of Invention
One or more embodiments of the present specification provide a method of training a video classification model. The training method of the video classification model includes: sampling a video set to obtain a test video sample and at least one training video sample, and generating a pairing label of the test video sample with each training video sample; performing label conversion based on the pairing labels of the test video sample and each training video sample, to obtain model parameters and category labels of the test video sample and the at least one training video sample; performing fusion processing on the test video sample and each training video sample according to fusion parameters and the category labels, to obtain at least one virtual video sample; and inputting the at least one virtual video sample into a video classification model constructed based on the model parameters for video classification, and adjusting parameters of the video classification model based on the classification result of each virtual video sample, to obtain a trained video classification model.
One or more embodiments of the present disclosure provide a video classification processing method, including: sampling a target video set to obtain a test video and at least one training video; performing fusion processing on the test video and each training video according to fusion parameters, to obtain at least one virtual video corresponding to the test video; inputting the at least one virtual video into a video classification model for video classification, to obtain a classification result for each virtual video; and calculating the video classification result of the test video based on the classification result of each virtual video. The video classification model is obtained by performing, based on at least one virtual video sample, model training on a video classification model constructed based on model parameters; the model parameters are obtained through label conversion based on the pairing labels of test video samples and training video samples.
One or more embodiments of the present specification provide a training apparatus for a video classification model, including: a pairing label generation module, configured to sample a video set to obtain a test video sample and at least one training video sample, and generate a pairing label of the test video sample with each training video sample; a label conversion module, configured to perform label conversion based on the pairing labels of the test video sample and each training video sample, to obtain model parameters and category labels of the test video sample and the at least one training video sample; a fusion processing module, configured to fuse the test video sample with each training video sample according to fusion parameters and the category labels, to obtain at least one virtual video sample; and a parameter adjustment module, configured to input the at least one virtual video sample into a video classification model constructed based on the model parameters for video classification, and adjust parameters of the video classification model based on the classification result of each virtual video sample, to obtain a trained video classification model.
One or more embodiments of the present specification provide a video classification processing apparatus, including: a sampling module, configured to sample a target video set to obtain a test video and at least one training video; a fusion processing module, configured to fuse the test video with each training video according to fusion parameters, to obtain at least one virtual video corresponding to the test video; a video classification module, configured to input the at least one virtual video into a video classification model for video classification, to obtain a classification result for each virtual video; and a result calculation module, configured to calculate the video classification result of the test video based on the classification result of each virtual video. The video classification model is obtained by performing, based on at least one virtual video sample, model training on a video classification model constructed based on model parameters; the model parameters are obtained through label conversion based on the pairing labels of test video samples and training video samples.
One or more embodiments of the present specification provide a training device for a video classification model, including: a processor; and a memory configured to store computer-executable instructions that, when executed, cause the processor to: sample a video set to obtain a test video sample and at least one training video sample, and generate a pairing label of the test video sample with each training video sample; perform label conversion based on the pairing labels of the test video sample and each training video sample, to obtain model parameters and category labels of the test video sample and the at least one training video sample; perform fusion processing on the test video sample and each training video sample according to fusion parameters and the category labels, to obtain at least one virtual video sample; and input the at least one virtual video sample into a video classification model constructed based on the model parameters for video classification, and adjust parameters of the video classification model based on the classification result of each virtual video sample, to obtain a trained video classification model.
One or more embodiments of the present specification provide a video classification processing device, including: a processor; and a memory configured to store computer-executable instructions that, when executed, cause the processor to: sample a target video set to obtain a test video and at least one training video; perform fusion processing on the test video and each training video according to fusion parameters, to obtain at least one virtual video corresponding to the test video; input the at least one virtual video into a video classification model for video classification, to obtain a classification result for each virtual video; and calculate the video classification result of the test video based on the classification result of each virtual video. The video classification model is obtained by performing, based on at least one virtual video sample, model training on a video classification model constructed based on model parameters; the model parameters are obtained through label conversion based on the pairing labels of test video samples and training video samples.
One or more embodiments of the present specification provide a storage medium storing computer-executable instructions that, when executed by a processor, implement the following procedure: sampling a video set to obtain a test video sample and at least one training video sample, and generating a pairing label of the test video sample with each training video sample; performing label conversion based on the pairing labels of the test video sample and each training video sample, to obtain model parameters and category labels of the test video sample and the at least one training video sample; performing fusion processing on the test video sample and each training video sample according to fusion parameters and the category labels, to obtain at least one virtual video sample; and inputting the at least one virtual video sample into a video classification model constructed based on the model parameters for video classification, and adjusting parameters of the video classification model based on the classification result of each virtual video sample, to obtain a trained video classification model.
One or more embodiments of the present specification provide another storage medium storing computer-executable instructions that, when executed by a processor, implement the following procedure: sampling a target video set to obtain a test video and at least one training video; performing fusion processing on the test video and each training video according to fusion parameters, to obtain at least one virtual video corresponding to the test video; inputting the at least one virtual video into a video classification model for video classification, to obtain a classification result for each virtual video; and calculating the video classification result of the test video based on the classification result of each virtual video. The video classification model is obtained by performing, based on at least one virtual video sample, model training on a video classification model constructed based on model parameters; the model parameters are obtained through label conversion based on the pairing labels of test video samples and training video samples.
Drawings
For a clearer description of the technical solutions in one or more embodiments of this specification or in the prior art, the drawings needed in the description of the embodiments or of the prior art are briefly introduced below. It is obvious that the drawings described below are only some of the embodiments described in this specification, and that a person skilled in the art can obtain other drawings from them without inventive effort;
FIG. 1 is a schematic illustration of one implementation environment provided by one or more embodiments of the present disclosure;
FIG. 2 is a process flow diagram of a training method for a video classification model according to one or more embodiments of the present disclosure;
FIG. 3 is a process flow diagram of a training method for a video classification model applied to a scenario of training and testing a video classification model according to one or more embodiments of the present disclosure;
FIG. 4 is a process flow diagram of a video classification processing method according to one or more embodiments of the present disclosure;
FIG. 5 is a schematic diagram of an embodiment of a training apparatus for a video classification model according to one or more embodiments of the present disclosure;
FIG. 6 is a schematic diagram of an embodiment of a video classification processing device according to one or more embodiments of the present disclosure;
FIG. 7 is a schematic diagram of a training apparatus for video classification models according to one or more embodiments of the present disclosure;
FIG. 8 is a schematic structural diagram of a video classification processing device according to one or more embodiments of the present disclosure.
Detailed Description
In order to enable a person skilled in the art to better understand the technical solutions in one or more embodiments of this specification, the technical solutions in one or more embodiments of this specification will be described clearly and completely below with reference to the accompanying drawings. It is obvious that the described embodiments are only some, not all, of the embodiments of this specification. All other embodiments obtained by a person skilled in the art based on one or more embodiments of this specification without inventive effort shall fall within the scope of protection of this document.
As shown in FIG. 1, in one or more embodiments of the present disclosure, the implementation environment includes the PyTorch deep learning framework.
In this implementation environment, two stages are executed: meta-training and meta-testing. In the meta-training stage, a query video sample set and a support video sample set are constructed based on a base-class video sample set, and the support video samples are used to perform fusion processing on the query video samples, the number of virtual video samples obtained corresponding to the number of support video samples; data enhancement of the query video samples is thereby achieved, and more virtual video samples are obtained for training the video classification model to be trained.
In the meta-test stage, a query video and a support video set are constructed based on a new-class video sample set, data enhancement is performed on the query video based on the support video set to obtain a plurality of virtual videos, and the classification result of the query video is calculated from the classification results produced by the video classification model for the virtual videos, thereby improving the accuracy of the classification result obtained for the query video.
As shown in FIG. 1, in the meta-training stage, a query video sample set Q_n and a support video sample set S_n are obtained from a base-class video sample set with a large amount of annotated data. Taking one query video sample and two support video samples as an example, the query video sample is fused with support video sample 1 to obtain virtual video sample 1, and the query video sample is fused with support video sample 2 to obtain virtual video sample 2; model training is performed on the video classification model to be trained based on virtual video sample 1 and virtual video sample 2, so as to obtain a trained video classification model;
in the meta-test stage, a query video set Q_n and a support video set S_n are obtained from a new-class video set with a small amount of annotated data. Taking one query video and two support videos as an example, the query video is fused with support video 1 to obtain virtual video 1, and the query video is fused with support video 2 to obtain virtual video 2; the similarities between virtual video 1 and support video 1, between virtual video 1 and support video 2, between virtual video 2 and support video 1, and between virtual video 2 and support video 2 are then calculated, the numerical values above the arrows in FIG. 1 representing these similarities; the classification result of the query video is calculated based on these similarities.
One or more embodiments of a training method for a video classification model provided in the present specification are as follows:
According to the training method for a video classification model provided in this specification, label conversion is performed starting from the pairing labels of a test video sample and the training video samples, and model parameters are obtained in the label conversion process; the test video sample is fused with each training video sample according to the fusion parameters and the category labels of the test video sample and of each training video sample obtained through label conversion, to obtain virtual video samples; and the parameters of a video classification model constructed based on the model parameters are adjusted according to its classification results, so as to obtain a trained video classification model. In this way, data enhancement is performed on the test video sample in the small-sample video classification process, richer samples, namely virtual video samples, are obtained, and the generalization capability and robustness of the trained video classification model are improved on the basis of a limited number of samples.
Referring to fig. 2, the training method of the video classification model provided in the present embodiment specifically includes steps S202 to S208.
Step S202, sampling a video set to obtain a test video sample and at least one training video sample, and generating a pairing label of the test video sample and each training video sample.
In practical applications, in the process of small-sample video classification, a large number of labeled base-class samples and a small number of labeled new-class samples are available, and the goal of training is to enable a model trained on the base-class samples to obtain more accurate classification results when tested on the new-class samples. For a task with a given number of classes and k support samples per class, a support video sample set, a query video sample set, and pairing labels between each query video sample in the query video sample set and each support video sample in the support video sample set are constructed; data enhancement of the query video samples is then performed starting from the pairing labels, so as to obtain more samples for model training.
In this embodiment, the video set includes a base class video sample set; the base class video sample set is a sample which can be accessed during training; the test video samples include query video samples; the training video samples include support video samples; the pairing tag is determined based on whether the paired query video sample and training video sample belong to the same category. Specifically, in the process of model training, sampling is carried out from a base class video sample set to obtain a test video sample set and a training video sample set;
It should be noted that, in this embodiment, the base-class video sample set includes a set obtained from a third-party channel that contains a large number of labeled samples, for example the Kinetics dataset or the Something-Something V2 dataset; the new-class video sample set in this embodiment includes a set with a small number of labeled video samples, for example, in a user self-certification scenario, videos of cultivated farmland submitted by users, or, in a resource lending scenario, videos taken by users of farmland or houses they own.
Optionally, the test video sample set includes test video samples of a classification number of video categories, and the number of test video samples under each video category is a preset threshold; the training video sample set includes training video samples of the same classification number of preset categories, and the number of training video samples under each video category is greater than the preset threshold. Optionally, an n-way classification is performed, where the classification number is n; for example, for binary classification the classification number is 2, and for 5-way classification the classification number is 5. In this embodiment, the video category is the category of each sample in the small-sample learning dimension.
In the process of constructing the test video sample set and the training video sample set, they are constructed based on the classification number; for example, for an n-way, k-shot task (n classes, k support samples per class), the constructed test video sample set (query video sample set) contains n classes with one video sample per class, and the constructed training video sample set (support video sample set) contains n classes with k video samples per class.
In a specific implementation, in order to improve the effectiveness of the test video sample and the training video sample obtained by sampling, in an alternative implementation provided in this embodiment, in a process of sampling a video set to obtain the test video sample and at least one training video sample, and generating a pairing label of the test video sample and each training video sample, the following operations are performed:
sampling the video set according to the classification number to obtain a test video sample set consisting of a plurality of test video samples and a training video sample set consisting of at least one training video sample;
and generating pairing labels of each test video sample and each training video sample in the test video sample set.
Specifically, a test video sample set consisting of a first number of test video samples and a training video sample set consisting of a second number of training video samples are obtained by sampling the video set. Optionally, the first number is equal to the classification number, and the test video samples in the test video sample set all belong to different video categories; the second number is equal to the product of the classification number and the number of support samples per class; the video categories of the training video samples contained in the training video sample set are the same as the video categories of the test video samples contained in the test video sample set, and the number of training videos under each video category in the training video sample set is the same and equal to the number of support samples per class.
After the test video sample set and the training video sample set are obtained, the test video samples in the test video sample set are paired with the training video samples in the training video sample set, and pairing labels are generated. It should be noted that, after the test video sample set and the training video sample set are obtained, the processing procedure is similar for each test video sample in the test video sample set; therefore, in this embodiment, the process of data enhancement and model training is described by taking one test video sample as an example.
In a specific implementation process, in order to improve accuracy of paired labels, in an optional implementation manner provided in this embodiment, paired labels of a test video sample and each training video sample are generated in the following manner:
determining a pairing tag of the test video sample and a training video sample belonging to the same video category as the test video sample as a first tag;
and determining the paired label of the test video sample and the training video sample belonging to different video categories from the test video sample as a second label.
Specifically, according to whether the test video sample and the training video sample belong to the same video category, the pairing label of the test video sample and the training video sample is determined.
For example, the constructed query video sample set contains 5 query video samples whose video categories are category 1, category 2, ..., category 5, and the constructed support video sample set contains 25 support video samples, with 5 support video samples under each of category 1 to category 5; that is, among the support video samples, 5 belong to category 1, 5 belong to category 2, ..., and 5 belong to category 5. In this case, in the process of generating the pairing labels, for each query video sample in the query video sample set, the pairing label with a support video sample under the same video category in the support video sample set is 1, and the pairing label with a support video sample under a different video category is 0; in other words, each query video sample has a pairing label of 1 with 5 support video samples and a pairing label of 0 with 20 support video samples.
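To make the episode construction and pairing-label generation concrete, the following is a minimal sketch in Python; the function name, the dict-of-lists layout of the video set, and the random sampling strategy are illustrative assumptions rather than part of the claimed method.

```python
import random

def sample_episode(video_set, n_way=5, k_shot=5):
    """Sample an n-way episode: one query (test) video and k support (training) videos
    per class from a labeled video set.

    video_set: dict mapping class id -> list of video ids (assumed layout).
    Returns the query samples, the support samples, and the 0/1 pairing labels.
    """
    classes = random.sample(list(video_set.keys()), n_way)
    queries, supports = [], []
    for c in classes:
        vids = random.sample(video_set[c], k_shot + 1)
        queries.append((vids[0], c))               # one test/query video per class
        supports += [(v, c) for v in vids[1:]]     # k support/training videos per class

    # Pairing label is 1 when query and support share a video category, otherwise 0.
    pairing = {(q, s): int(qc == sc) for q, qc in queries for s, sc in supports}
    return queries, supports, pairing

# In a 5-way 5-shot episode each query video ends up with 5 pairing labels of 1
# and 20 pairing labels of 0, matching the example above.
```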
In addition, step S202 may be replaced by sampling in the video set according to the number of classifications, to obtain a test video sample set and a training video sample set, generating pairing labels of each test video sample in the test video sample set and each training video sample in the training video sample set, and forming a new implementation manner with other processing steps provided in the present embodiment.
Step S204, performing label conversion based on the paired labels of the test video sample and the training video samples, to obtain model parameters and class labels of the test video sample and the at least one training video sample.
In order to realize data enhancement of a test video sample based on a training video sample, more virtual video samples are obtained by fusing the test video sample and the training video sample, model training is performed based on the virtual video sample, and generalization capability and robustness of a video classification model obtained by training are improved.
In the process of fusing the test video sample with the training video sample, a difficulty arises: small-sample learning follows a meta-learning paradigm, whereas the fusion is designed for the traditional classification paradigm; that is, in the small-sample meta-learning setting the fusion cannot be performed directly using the pairing labels. Therefore, label conversion is performed to convert the pairing labels into category labels, and the test video sample is then fused with the training video samples based on the category labels. In other words, in this embodiment, in order to generate more virtual video samples in the meta-training stage, the fusion used in traditional classification is introduced into meta-training; however, this fusion presupposes that the network contains a classifier structure, and since meta-learning training is performed by maximizing the similarity between query video samples and support video samples of the same class, the model usually has no classifier structure, so label conversion is required before the fusion is performed.
In this embodiment, the category labels include labels of categories of the test video samples and the training video samples in the conventional classification dimension.
In this embodiment, in order to train the video classification model, a classifier needs to be built; the model parameters include the classification weights of the classifier generated during the label conversion process. After the classification weights are obtained, the classifier is constructed.
In a specific label conversion process, that is, in a process of converting a pairing label into a category label, in order to improve accuracy and effectiveness of the converted category label, in an optional implementation manner provided in this embodiment, a label conversion process based on the pairing labels of the test video sample and each training video sample is implemented in the following manner, so as to obtain a model parameter and a category label of the test video sample and the at least one training video sample:
reading a first similarity algorithm for performing positive sample similarity calculation under the dimension of the paired labels and a second similarity algorithm for performing negative sample similarity calculation;
reading a third similarity algorithm for performing similarity calculation of video samples of the same category under the dimension of the category label and a fourth similarity algorithm for performing similarity calculation of video samples of different categories;
Based on the test video sample and the training video sample set, calculating a first classification weight according to the first similarity algorithm and the third similarity algorithm, calculating a second classification weight according to the second similarity algorithm and the fourth similarity algorithm, and determining the first classification weight and the second classification weight as the model parameters.
Optionally, the first classification weight includes a classification weight corresponding to a category to which the test video sample belongs;
the second classification weight comprises classification weights corresponding to the categories except the category to which the target number belongs; the target number includes the number of classifications of the classifier minus 1.
Specifically, for a test video sample, positive and negative samples are obtained; the positive samples of the test video samples are training video samples belonging to the same video category as the test video samples in the training video sample set; namely training video samples with the pairing label of 1 with the test video samples; the negative samples of the test video samples are training video samples in the training video sample set, and the training video samples belong to different video categories from the test video samples; i.e. training video samples with a pairing tag of 0 with the test video samples.
For the test video sample, the similarity of the test video sample to the positive sample and the negative sample can be calculated;
for example, the similarity of a test video sample to a positive sample may be calculated as follows:
the similarity of the test video sample to the negative sample can be calculated as follows:
where x represents the test video sample, x i Positive samples, x representing test video samples j Representing a negative of the test video sample,representing the similarity of the test video sample to the positive sample, < >>Representing the similarity of the test video to the negative samples.
The above is the computation of the similarity between the test video sample and the positive and negative samples on the basis of the pairing labels; the following performs the computation of inter-class similarity and intra-class similarity on the basis of the category labels. The inter-class similarity refers to the similarity between the test video sample and training video samples under different category labels; the intra-class similarity refers to the similarity between the test video sample and training video samples under the same category label.
for example, the similarity of test video samples may be calculated as follows:
the similarity of the test video samples can be calculated as follows:
Wherein w is y Representing the classification weight corresponding to the category to which the test video sample belongs; w (w) j And representing the classification weight corresponding to the category outside the category to which the test video sample belongs.
That is, the pairing labels are converted into category labels by constructing a classifier. Specifically, provided the constraints are satisfied, label conversion can be completed by keeping the classification probabilities of the positive and negative samples under the constructed classifier unchanged; each weight of the classifier can be calculated one by one, so that the inner product of the query video sample with each weight equals the corresponding similarity calculated using the pairing labels.
In the specific execution process, the classifier is trained under the constraints that the intra-class similarity equals the similarity between the test video sample and the positive sample, and that the inter-class similarity equals the similarity between the test video sample and the negative sample, so as to obtain the classification weights of the classifier. That is, a classifier is constructed such that the similarities of the query video sample after passing through the classifier are consistent with the similarity of the query video sample to the positive sample and the similarity of the query video sample to the negative sample;
in addition, in this embodiment, the model parameters may also be generated as follows:
Determining a first training video sample which is a positive sample with the test video sample and a second training video sample which is a negative sample with the test video sample based on the pairing labels of the test video sample and the training videos;
calculating positive sample similarity based on the test video sample and the first training video sample, and calculating negative sample similarity based on the test video sample and the second training video sample;
and constructing the classifier by taking the positive sample similarity and the negative sample similarity as constraints, and obtaining the classification weight of the classifier.
On the basis of obtaining the classification weight of the classifier, each training video sample and each test video sample are respectively input into the classifier to carry out classification, so that a class label is obtained.
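As a rough illustration of this label conversion, the sketch below derives per-episode classifier weights from the support features so that the inner product of a query feature with each weight plays the role of the pairing-based similarities; treating each class weight as the mean support feature of that class is a plausible, simplified realization assumed for this sketch, not necessarily the exact weight-by-weight construction of the embodiment.

```python
import torch
import torch.nn.functional as F

def build_classifier_weights(support_feats, support_classes, n_way):
    """Derive one classification weight per episode class from support features.

    support_feats: tensor of shape (n_way * k_shot, d), L2-normalized features.
    support_classes: tensor of shape (n_way * k_shot,), values in [0, n_way).
    Returns an (n_way, d) weight matrix; after conversion, the category label of a
    sample is simply its episode class index 0..n_way-1.
    """
    d = support_feats.shape[1]
    weights = torch.zeros(n_way, d)
    for c in range(n_way):
        # Assumed realization: the weight of class c is the mean support feature of
        # class c, so query·weight approximates the pairing-based similarities.
        weights[c] = support_feats[support_classes == c].mean(dim=0)
    return F.normalize(weights, dim=1)
```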
And S206, carrying out fusion processing on the test video sample and each training video sample according to the fusion parameters and the category labels to obtain at least one virtual video sample.
The fusion parameters include the fusion ratio used in the process of fusing the test video sample with the training video sample; the virtual video samples include the video samples obtained after the test video sample is fused with the training video samples. Optionally, the fusion process includes interpolating, pixel by pixel, two images of the same size (an image frame of the test video sample and an image frame of the training video sample).
In the above step, after label conversion is performed on the paired labels based on the test video sample and each training video sample, category labels of the test video sample and the training video sample are obtained. Optionally, the category labels include labels corresponding to sample categories of the test video samples and the training video samples. Note that, the class labels in the present embodiment include labels of linear classes of samples of the conventional learning dimension.
In this step, in order to obtain more samples, after obtaining the category labels of the test video sample and each training video sample, the fusion processing is performed on the test video sample and each training video sample from the category labels, so as to obtain at least one virtual video sample.
In order to improve the effectiveness of the virtual video samples obtained after fusion, the fusion parameters vary along the time sequence, and the speed of this variation is not constrained, so that more diverse samples are generated. In an optional implementation provided in this embodiment, the following operations are performed in the process of fusing the test video sample with each training video sample according to the fusion parameters and the category labels to obtain at least one virtual video sample:
According to the fusion proportion, carrying out fusion processing on each image frame of the test video sample and each training video sample to obtain at least one virtual video sample;
and according to the fusion proportion, carrying out fusion processing on the category labels of each image frame of the test video sample and the training video sample subjected to fusion processing, and obtaining the virtual category labels of each virtual video sample.
Specifically, according to the fusion proportion corresponding to each image frame, carrying out fusion processing on each image frame of the test video sample and each training video sample, and carrying out fusion processing on the category labels of the test video sample and each training video sample, so as to obtain at least one virtual video sample and the virtual category label of each virtual video sample. Optionally, the virtual category label refers to a category label of the virtual video sample in a linear dimension.
In the process of fusing the test video sample with a training video sample, the k-th image frame may be fused as x̃_k = λ_k · x_{i,k} + (1 − λ_k) · x_{j,k}, and the category labels may be fused with the frame-wise fusion ratios averaged over the image frames, i.e. ỹ = (Σ_k λ_k / T) · y_i + (1 − Σ_k λ_k / T) · y_j, where x̃ represents the virtual video sample obtained by fusion, k denotes the k-th image frame, x_i represents the test video sample, x_j represents the training video sample, y_i represents the category label of the test video sample, y_j represents the category label of the training video sample, λ_k represents the fusion ratio of the k-th image frame, and T represents the total number of image frames of the test video sample.
λ_k can be obtained by calculation from a predetermined constant and time-domain parameters of the k-th frames of the test video sample and the training video sample; for example, the ratio of the time-domain parameter of the k-th frame of the test video sample to that of the k-th frame of the training video sample is computed, and the product of the predetermined constant and this ratio is taken as λ_k. The above description of the calculation of the fusion ratio is merely exemplary; λ_k may also be calculated from other relevant parameters, and this embodiment is not limited in this respect.
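A minimal sketch of this frame-wise fusion is given below; it assumes that both videos are tensors of shape (T, C, H, W) with the same number of frames, that category labels are one-hot vectors, and that the per-frame fusion ratios λ_k are supplied by the caller (for example computed from a constant and per-frame time-domain parameters as described above). These layout choices are illustrative assumptions.

```python
import torch

def fuse_videos(test_video, train_video, test_label, train_label, lambdas):
    """Frame-wise fusion of a test (query) video with a training (support) video.

    test_video, train_video: tensors of shape (T, C, H, W) with the same size (assumed);
    test_label, train_label: one-hot category-label tensors of shape (n_classes,);
    lambdas: tensor of shape (T,), the fusion ratio for each image frame.
    """
    lam = lambdas.view(-1, 1, 1, 1)                        # broadcast over C, H, W
    virtual_video = lam * test_video + (1 - lam) * train_video

    # Fuse the category labels with the frame-wise ratios averaged over the T frames
    # (one plausible reading of the label-fusion formula referenced above).
    lam_mean = lambdas.mean()
    virtual_label = lam_mean * test_label + (1 - lam_mean) * train_label
    return virtual_video, virtual_label
```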
In addition, step S206 may be replaced by performing fusion processing on the test video sample and each training video sample according to the fusion parameters to obtain at least one virtual video sample, performing fusion processing on the test video sample and the class label of each training video sample to obtain the class label of each virtual video sample, and forming a new implementation manner with other processing steps provided in this embodiment.
It should be noted that, the above-mentioned fusion processing procedure is an explanation of the fusion processing procedure of one test video sample and one training video sample; in a specific execution process, the method is adopted to carry out fusion processing on each test video and each training video; for example, after 1 test video sample is fused with 5 training video samples, 5 virtual video samples are obtained.
It should be noted that, in this embodiment, only a manner of obtaining a virtual video sample by fusing a test video sample with each training video sample is provided, and in addition, the test video sample and the training video sample may be partially replaced to obtain the virtual video sample; for example, the 6 th to 10 th frames in the test video samples containing 10 frames are replaced by the 6 th to 10 th frames in the training video samples, so as to obtain virtual video samples; in addition, discontinuous image frames can be replaced; specifically, the virtual video sample can be obtained by performing data enhancement on the test video sample and the training video sample in other modes.
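As a complement to pixel-wise fusion, the frame-replacement variant mentioned above can be sketched as follows; the contiguous frame range simply mirrors the 10-frame example in the text, and the tensor layout is an assumption.

```python
import torch

def replace_frames(test_video, train_video, start=5, end=10):
    """Build a virtual sample by replacing frames [start, end) of the test video
    with the corresponding frames of the training video, e.g. frames 6-10 of a
    10-frame video (0-based indices 5-9). Both videos: tensors of shape (T, C, H, W)."""
    virtual = test_video.clone()
    virtual[start:end] = train_video[start:end]
    return virtual
```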
In addition, step S206 may be replaced by performing data enhancement on the test video samples based on the training video samples and the class labels to obtain at least one virtual video sample, and forming a new implementation manner with the other processing steps provided in the present embodiment.
Step S208, inputting the at least one virtual video sample into a video classification model constructed based on the model parameters to perform video classification, and performing parameter adjustment on the video classification model based on the classification result of each virtual video sample to obtain a video classification model.
In the above step, at least one virtual video sample is obtained. After the at least one virtual video sample is obtained, model training is performed on the video classification model to be trained based on the virtual video samples; that is, the at least one virtual video sample is input into the video classification model constructed based on the model parameters for video classification, and the parameters of the video classification model are adjusted based on the classification result of each virtual video sample, so as to obtain the trained video classification model.
Optionally, the video classification model comprises a sampling layer, a feature extraction layer and a classification sub-model constructed based on the model parameters. In this embodiment, in order to implement model training of a video classification model to be trained, at least one virtual video sample is used as a sample for performing model training, specifically, each virtual video sample is input into the video classification model to be trained, that is, the video classification model constructed by the model parameters performs video classification, and in an optional implementation manner provided in this embodiment, video classification of any virtual video sample is implemented in the following manner:
Image sampling is carried out on any virtual video sample, image frames with the sampling number are obtained, and the image frames are input into a feature extraction layer for feature extraction, so that image frame features are obtained;
and inputting the image frame characteristics into a classification sub-model constructed by the model parameters to carry out video classification on any virtual video sample, so as to obtain a classification result of any virtual video sample.
Optionally, the classifying sub-model performs video classification on the arbitrary virtual video sample, including:
performing similarity calculation on the image frame characteristics and at least one training video characteristic under each class to obtain similarity of the image frame characteristics and each training video characteristic under each class;
and calculating the category matching probability of any virtual video sample and each category based on the similarity, and taking the calculated category matching probability as the classification result.
Specifically, for any virtual video sample, inputting the virtual video sample into a video classification model to be trained, firstly, performing image sampling on the virtual video sample by the video classification model based on a sampling layer to obtain image frames with the sampling number, then inputting the image frames obtained by sampling into a feature extraction layer to perform feature extraction to obtain image frame features, and finally inputting the image frame features into a classification sub-model constructed based on model parameters to perform video classification to obtain a classification result.
For example, the video classification model includes three parts: a TSN (Temporal Segment Network) sampling layer, a ResNet-50 (residual neural network) feature extraction layer, and a classifier constructed from the classification weights.
The classification sub-model, i.e. the classifier, calculates during video classification the similarity between the image frame features of the virtual video sample and each training video feature, so as to obtain the similarity between the image frame features and each training video feature; the category matching probability between the virtual video sample and each category is then calculated based on these similarities.
When there is only one training video feature under each category, the similarity between the image frame features and that training video feature is calculated, and the category matching probability between the virtual video sample and each category is calculated based on this similarity; when each category includes a plurality of training video features, the average of the similarities between the image frame features and the plurality of training video features is calculated, and the category matching probability between the virtual video sample and each category is calculated based on this average similarity. In the process of calculating the category matching probabilities, the similarities between the virtual video sample and the categories may be normalized into proportions to obtain the category matching probability of each category; other processing methods may also be used to calculate the category matching probabilities, and this embodiment is not limited in this respect.
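The sketch below shows one way the described forward pass could look: TSN-style uniform frame sampling, a shared feature extractor, and similarity-based classification against per-class support features; the module interfaces, the cosine similarity, and the softmax over averaged similarities are assumptions for illustration rather than the exact implementation.

```python
import torch
import torch.nn.functional as F

def classify_virtual_sample(video, backbone, class_feats, num_frames=8):
    """video: tensor of shape (T, C, H, W); backbone: frame feature extractor
    (e.g. a ResNet-50 returning (N, d) features); class_feats: list of (k_c, d)
    tensors holding the support/training features of each class (assumed layout)."""
    # Sampling layer: pick num_frames evenly spaced frames (TSN-style, assumed).
    idx = torch.linspace(0, video.shape[0] - 1, num_frames).long()
    frames = video[idx]

    # Feature extraction layer, followed by averaging the frame features (assumed).
    feats = F.normalize(backbone(frames), dim=1)           # (num_frames, d)
    video_feat = feats.mean(dim=0)                         # (d,)

    # Classification sub-model: similarity to each class's training-video features,
    # averaged when a class has several support features, then turned into probabilities.
    sims = torch.stack([
        F.cosine_similarity(video_feat.unsqueeze(0), cf).mean() for cf in class_feats
    ])
    return F.softmax(sims, dim=0)                          # category matching probabilities
```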
In the implementation, in order to realize the training of the video classification model, after the classification result of each virtual video sample is obtained, parameter adjustment is carried out on the video classification model based on the classification result of each virtual video sample; in order to improve the robustness and generalization capability of the video classification model obtained by training, in an optional implementation manner provided in this embodiment, in a process of performing parameter adjustment on the video classification model based on the classification result of each virtual video sample, the following operations are performed:
and calculating training loss based on the virtual video samples and the classification results of the virtual video samples, and carrying out parameter adjustment on the video classification model constructed by the model parameters based on the training loss.
In an optional implementation provided in this embodiment, calculating the training loss based on the virtual video samples and the classification results of the virtual video samples includes:
carrying out parameter correction on classification result parameters in the classification result of the first virtual video sample; the first virtual video sample is obtained by fusing test video samples and training video samples of different video categories;
Calculating training loss based on the classification result after parameter correction, the classification result of the second virtual sample and the class labels of the virtual video samples; the second virtual video sample is obtained by fusing a test video sample and a support video sample of the same video category.
Optionally, performing parameter correction on a classification result parameter in the classification result of the first virtual video sample includes:
calculating a correction coefficient for parameter correction according to the fusion parameters;
and carrying out parameter correction on the classification prediction result based on the class label, the classification number and the correction coefficient corresponding to the first virtual video sample.
In a specific implementation, introducing the virtual video samples is essentially intended to reduce the phenomenon of the video classification model being over-confident; uncertainty is introduced into the training process through the category labels, which encourages the video classification model to generalize better. It is desirable to further increase this uncertainty after a test video sample and a training video sample of different categories are mixed. For example, if a query video sample and a support video sample belong to the same category, the two samples share more similar characteristics, and the video classification model should make a more confident decision on the virtual video sample obtained by fusing them; but if the support video sample belongs to another category, fewer characteristics are shared, the category of the virtual video sample obtained by fusing the query video sample with the support video sample of another category is harder to distinguish, and the video classification model should give a less confident decision. Therefore, when the fused test video sample and training video sample belong to different categories, the correction coefficient is calculated first:
where ε is the correction coefficient, τ is a constant, and λ is the fusion ratio.
After calculating the correction coefficient, the predicted classification result is corrected in the following manner:
where n is the number of classifications.
After the corrected classification result of the first virtual video sample is obtained, the training loss is calculated based on the parameter-corrected classification result of the first virtual video sample, the classification result of the second virtual video sample, and the category labels of the first and second virtual video samples, and the parameters of the video classification model to be trained are adjusted based on the training loss, so as to obtain the trained video classification model.
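Because the correction formulas above are only referenced and not reproduced here, the sketch below merely illustrates the general shape of the described loss: virtual samples fused from different categories have their predicted distribution softened before the cross-entropy with the fused label is computed. The specific smoothing rule (ε derived from τ and the fusion ratio, spread uniformly over the n classes) is an assumption in the style of label smoothing, not the verbatim formula of the embodiment.

```python
import torch

def training_loss(probs, soft_labels, cross_class_mask, lam, tau=0.1, n_classes=5):
    """probs: (B, n_classes) predicted probabilities for the virtual samples;
    soft_labels: (B, n_classes) fused (virtual) category labels;
    cross_class_mask: (B,) bool, True for samples fused from different categories;
    lam: (B,) fusion ratios. tau and the correction rule are illustrative assumptions."""
    corrected = probs.clone()
    eps = tau * (1 - lam[cross_class_mask])          # assumed correction coefficient
    # Soften only the cross-category ("first") virtual samples toward the uniform distribution.
    corrected[cross_class_mask] = (
        (1 - eps).unsqueeze(1) * probs[cross_class_mask] + (eps / n_classes).unsqueeze(1)
    )
    # Cross-entropy of all virtual samples against their fused labels.
    return -(soft_labels * torch.log(corrected + 1e-8)).sum(dim=1).mean()
```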
The above process describes the data enhancement of the query video samples during meta-training; besides the meta-training stage, there is also a meta-test stage. The data enhancement process for the query video in the meta-test stage is described in detail below.
In the meta-test stage, the same backbone network as in the meta-training stage is retained, so that the backbone network (video classification model) can correctly predict the virtual videos obtained by fusion. Given one test video, L virtual videos can be generated using the training videos, where L is a hyper-parameter of the meta-learning; by analyzing the classification results of all the virtual videos, a better classification prediction is made for the test video.
In order to promote accurate classification of the test video, in an optional implementation manner provided in this embodiment, a meta-test process is implemented in the following manner:
sampling the target video set to obtain a test video and at least one training video;
according to the fusion parameters, carrying out fusion processing on the test video and each training video to obtain at least one virtual video corresponding to the test video;
inputting the at least one virtual video into the video classification model obtained through training to perform video classification, and obtaining the classification results of the virtual videos;
and calculating the video classification result of the test video based on the classification result of each virtual video.
Optionally, the target video set includes a new video set.
In practical application, the single classification result of the test sample is calculated in the following manner:
where y_pred represents the video classification result of the test video; the portion of the rise in the confidence of the corresponding category caused by introducing the support video samples is culled.
In practical application, if the same pixel-wise random fusion as in training is carried out without constraint, the classification prediction becomes more sensitive to noise when the proportion of the query video sample in the mixture is relatively low; therefore, a fixed fusion proportion larger than 0.5 is used, and the same number of support videos is selected for each category for fusion, so that the above formula can finally be simplified into the following form:
Where m represents the number of samples selected for fusion for each class.
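For illustration, a minimal sketch of this simplified aggregation is given below; the way the support-induced confidence is culled (subtracting a term proportional to 1 − λ for the class of the fused support video) is an assumption about one possible realization rather than the exact formula of this embodiment.

```python
import torch

def aggregate_virtual_predictions(virtual_logits, support_class_ids, n_classes, lam=0.6):
    """virtual_logits:    (L, n_classes) classification results of the L virtual videos
    support_class_ids: (L,) class index of the support video fused into each virtual video
    lam:               fixed fusion proportion (> 0.5) given to the test video (assumed)
    """
    probs = torch.softmax(virtual_logits, dim=-1)

    # Remove the part of the confidence of the support video's own class that is
    # only due to mixing that support video in (assumed proportional to 1 - lam).
    one_hot = torch.nn.functional.one_hot(support_class_ids, n_classes).float()
    culled = probs - (1.0 - lam) * one_hot

    # Average over all virtual videos and take the most likely class for the test video.
    scores = culled.mean(dim=0)
    return scores.argmax().item()
```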
In addition, in this embodiment, fusion processing of multiple dimensions or multiple modes may be performed: the fusion processing is not limited to the data layer and may also be applied to feature maps; for small sample video classification, since different frames of different videos all undergo feature extraction through the same backbone network, the fusion processing may also be implemented by adding it into the backbone network;
alternatively, in addition to fusing two samples pixel by pixel to generate a virtual video sample, further data enhancement methods may be introduced, such as cropping, swapping portions of one sample with another sample, or other more complex operations.
In this embodiment, a data enhancement method for meta-learning is provided from the sample perspective, and the performance of the video classification model is improved by expanding the samples. The method can be conveniently applied to any meta-learning based small sample video classification method and stably improves the performance of the trained video classification model while introducing only a small calculation overhead; stable improvement is obtained under each evaluation index on two different small sample video classification methods and three different small sample video data sets, as shown in the following table:
Wherein the values in brackets are the amount of improvement in the performance of the video classification model trained based on the above manner relative to the performance of the video classification model trained in the conventional manner.
In summary, in the training method of the video classification model provided in this embodiment, pairing learning is converted into category learning, so that the fusion commonly used in traditional classification is introduced into meta-learning; for the specific fusion, time-domain enhanced fusion and asymmetric fusion are designed for the characteristics of video samples and of small sample learning, preserving the diversity of the time sequence and applying asymmetric fusion processing to the query video sample and the support video sample; a data enhancement method for the meta-test stage is further provided, and a remarkable performance improvement is finally obtained. It should be noted that this embodiment may be implemented using the PyTorch deep learning framework.
The following further describes the training method of the video classification model provided in this embodiment by taking its application to a training and test scene of the video classification model as an example. Referring to fig. 3, the training method of the video classification model applied to the training and test scene of the video classification model specifically includes the following steps.
Step S302, sampling a Kinetics data set to obtain a query video sample and at least one support video sample.
Step S304, generating a query video sample and a pairing label of each support video sample.
Step S306, calculating the classification parameters of the classifier according to the similarity algorithms of the query video sample and each support video sample under the pairing labels and the similarity algorithms of the query video sample and each support video sample under the category labels, and obtaining the category labels of the query video sample and each support video sample.
Optionally, in the process of calculating the classification parameters of the classifier, the classification weights of the classifier are calculated under the constraint that the similarity between the query video sample and a positive sample among the support video samples equals the similarity between the query video sample and the classification weight corresponding to the category to which the query video sample belongs in the classifier, and/or under the constraint that the similarity between the query video sample and a negative sample among the support video samples equals the similarity between the query video sample and the classification weights corresponding to categories other than the category to which the query video sample belongs in the classifier. The classification parameters of the classifier comprise the classification weights corresponding to the respective categories. The category label is the label of the category corresponding to a video sample under the classifier.
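As a rough sketch only, the following shows one simple way classification weights satisfying such constraints could be constructed, assuming each class weight is taken as the normalized mean of that class's support video features (a prototype); this is an assumption for illustration rather than the exact construction of this step.

```python
import torch

def classifier_weights_from_support(support_feats, support_class_ids, n_classes):
    """support_feats:     (S, D) features of the support video samples
    support_class_ids: (S,) category index of each support video sample
    Returns (n_classes, D) classification weights, one per category.
    """
    weights = []
    for c in range(n_classes):
        feats_c = support_feats[support_class_ids == c]
        # Assumed construction: the class weight is the normalized mean support feature,
        # so query/positive-support similarity matches query/class-weight similarity.
        proto = feats_c.mean(dim=0)
        weights.append(proto / proto.norm().clamp_min(1e-8))
    return torch.stack(weights)
```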
Step S308, carrying out fusion processing on the query video sample and each support video sample according to the fusion proportion to obtain at least one virtual video sample, and carrying out fusion processing on the query video sample and the category labels of each support video sample to obtain the category labels of each virtual video sample.
Step S310, inputting each virtual video sample into a video classification model to be trained comprising a classifier to perform video classification, and performing parameter adjustment on the video classification model to be trained based on the classification result and the classification label of each virtual video sample to obtain a trained video classification model.
Step S312, sampling the user self-certification video set of the user self-certification service to obtain a query video and at least one support video.
And step S314, carrying out fusion processing on the query video and each support video according to the fusion proportion to obtain at least one virtual video corresponding to the query video.
Step S316, inputting the virtual videos into the video classification model to perform video classification, and obtaining classification results of the virtual videos.
Step S318, calculating the video classification result of the query video based on the classification result of each virtual video.
One or more embodiments of a video classification processing method provided in the present specification are as follows:
For the content of the video classification processing method provided in this embodiment that is similar to the related content of the training method of the video classification model provided in the above embodiment, reference may be made to the related content of the above embodiment, or the related content of the above embodiment may be adaptively modified and then referred to; details are not repeated here.
Referring to fig. 4, the video classification processing method provided in the present embodiment specifically includes steps S402 to S408.
Step S402, sampling the target video set to obtain a test video and at least one training video.
In this embodiment, the target video set includes a set containing a small number of labeled videos; for example, in a user self-certification scenario, videos of planted farmland submitted by users; or, in a resource lending scenario, videos shot by the user of owned farmland or houses. In other words, the target video set in this embodiment includes video data submitted by users for participating in a service, that is, step S402 may be replaced by: sampling the service video set of the target service to obtain the test video and the at least one training video.
The test video comprises video for video classification. The at least one training video includes a video that assists in video classification of the test video; optionally, the embodiment may also be applied to a process of testing the video classification model obtained by training; in the process of testing the video classification model obtained through training, the test video refers to a query video or a query video sample, and the training video refers to a support video or a support video sample.
And step S404, carrying out fusion processing on the test video and each training video according to fusion parameters to obtain at least one virtual video corresponding to the test video.
The fusion parameters refer to fusion proportion in the process of carrying out fusion processing on the test video and each training video.
In the step, fusion processing is carried out on the test video and each training video according to the fusion proportion, at least one virtual video corresponding to the test video is obtained, and the number of the obtained virtual videos is equal to that of the training videos.
Specifically, according to the corresponding fusion proportion of each image frame, carrying out fusion processing on each image frame of the test video and each training video; in the process of fusing the test video and the training video, the fusion process can be performed as follows:
wherein x̃^(k) represents the k-th image frame of the virtual video sample obtained by fusion, k represents the k-th image frame, x_i represents the test video sample, and x_j represents the training video sample; that is, each frame may be fused as x̃^(k) = λ_k · x_i^(k) + (1 − λ_k) · x_j^(k).
Here λ_k can be obtained by calculation from a predetermined constant, the time-domain parameters of the k-th frame of the test video sample and of the training video sample, and the like; for example, the ratio of the time-domain parameter of the k-th frame of the test video sample to the time-domain parameter of the k-th frame of the training video sample is calculated, and the product of the predetermined constant and this ratio is taken as λ_k. The above description of the calculation of the mixing ratio is only exemplary; λ_k may also be calculated according to other relevant parameters, and the embodiment is not limited herein.
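For illustration, a minimal PyTorch-style sketch of this per-frame fusion is given below; how the per-frame fusion proportion λ_k is derived from a predetermined constant and the time-domain parameters is an assumption that follows only the exemplary description above.

```python
import torch

def fuse_videos(test_frames, train_frames, base_ratio=0.6, time_params=None):
    """test_frames, train_frames: (T, C, H, W) image frames of the two videos.
    base_ratio:  predetermined constant used when computing the per-frame ratio.
    time_params: optional (T, 2) tensor of per-frame time-domain parameters of the
                 test video and the training video (how these are obtained is assumed).
    """
    T = test_frames.shape[0]
    if time_params is not None:
        # lambda_k = constant * (time-domain parameter of test frame / that of train frame)
        lam = (base_ratio * time_params[:, 0] / time_params[:, 1].clamp_min(1e-8)).clamp(0.0, 1.0)
    else:
        lam = torch.full((T,), base_ratio)
    lam = lam.view(T, 1, 1, 1)
    # Per-frame convex combination of the test video and the training video.
    return lam * test_frames + (1.0 - lam) * train_frames
```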
In addition to performing fusion processing on the test video and each training video according to the fusion parameters to obtain at least one virtual video corresponding to the test video, other modes may be adopted to obtain the virtual video, for example, replacing a part of image frames in the test video and any training video to obtain the virtual video; for example, the 6 th to 10 th frames in the test video containing 10 frames are replaced by the 6 th to 10 th frames in the training video, so that a virtual video is obtained; in addition, discontinuous image frames can be replaced; specifically, the virtual video can be obtained by performing data enhancement on the test video and the training video in other modes. That is, step S404 may also be replaced by performing data enhancement on the test video based on each training video, to obtain at least one virtual video, and to form a new implementation with the other processing steps provided in the present embodiment. It should be noted that, in addition to the above method, other methods may be used for performing data enhancement, for example, splicing, replacement, etc., which is not limited herein.
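The frame-replacement alternative mentioned above can be sketched as follows; the choice of which frames to replace is only an example.

```python
import torch

def replace_frames(test_frames, train_frames, replace_idx=range(5, 10)):
    """Generate a virtual video by replacing selected frames of the test video
    (e.g. frames 6 to 10 of a 10-frame video, i.e. indices 5..9) with the
    corresponding frames of a training video."""
    virtual = test_frames.clone()
    idx = torch.tensor(list(replace_idx))
    virtual[idx] = train_frames[idx]
    return virtual
```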
Step S406, performing video classification on the at least one virtual video input video classification model, to obtain classification results of each virtual video.
Optionally, the video classification model is obtained by performing model training on a video classification model constructed based on model parameters based on at least one virtual video sample; the model parameters are obtained after label conversion based on pairing labels of the test video samples and the training video samples. Optionally, the video classification model comprises a sampling layer, a feature extraction layer and a classification sub-model constructed based on the model parameters. Wherein the classification sub-model may be a classifier; correspondingly, the classification weight of the classifier is the model parameter.
The following describes the process of model training for a video classification model in detail.
In this embodiment, in order to implement model training of a video classification model to be trained, at least one virtual video sample is used as a sample for performing model training, specifically, each virtual video sample is input into the video classification model to be trained, that is, the video classification model constructed by the model parameters performs video classification, and in an optional implementation manner provided in this embodiment, video classification of any virtual video sample is implemented in the following manner:
Image sampling is carried out on any virtual video sample, image frames with the sampling number are obtained, and the image frames are input into a feature extraction layer for feature extraction, so that image frame features are obtained;
and inputting the image frame characteristics into a classification sub-model constructed by the model parameters to carry out video classification on any virtual video sample, so as to obtain a classification result of any virtual video sample.
Optionally, the classifying sub-model performs video classification on the arbitrary virtual video sample, including:
performing similarity calculation on the image frame characteristics and at least one training video characteristic under each class to obtain similarity of the image frame characteristics and each training video characteristic under each class;
and calculating the category matching probability of any virtual video sample and each category based on the similarity, and taking the calculated category matching probability as the classification result.
Specifically, for any virtual video sample, inputting the virtual video sample into a video classification model to be trained, firstly, performing image sampling on the virtual video sample by the video classification model based on a sampling layer to obtain image frames with the sampling number, then inputting the image frames obtained by sampling into a feature extraction layer to perform feature extraction to obtain image frame features, and finally inputting the image frame features into a classification sub-model constructed based on model parameters to perform video classification to obtain a classification result.
For example, the video classification model includes three parts, namely a sampling layer of TSN (temporal segment network), a feature extraction layer of ResNet50 (residual neural network), and a classifier constructed by the classification weights.
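A hedged sketch of such a three-part model is shown below; the TSN-style uniform frame sampling, the torchvision ResNet50 backbone, and the cosine-similarity classifier head are assumptions used only to make the structure concrete.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

class VideoClassifier(nn.Module):
    """Illustrative three-part model: frame sampling, ResNet50 features, similarity classifier."""
    def __init__(self, class_weights, n_segments=8):
        super().__init__()
        self.n_segments = n_segments
        backbone = resnet50(weights=None)
        backbone.fc = nn.Identity()                 # keep the 2048-d pooled features
        self.backbone = backbone
        # Classification weights obtained from label conversion (one row per category).
        self.class_weights = nn.Parameter(class_weights)

    def sample_frames(self, video):
        # TSN-style uniform sampling: one frame from each of n_segments temporal segments.
        T = video.shape[0]
        idx = torch.linspace(0, T - 1, self.n_segments).long()
        return video[idx]                           # (n_segments, C, H, W)

    def forward(self, video):
        frames = self.sample_frames(video)
        feats = self.backbone(frames).mean(dim=0)   # average frame features -> (2048,)
        feats = feats / feats.norm().clamp_min(1e-8)
        w = self.class_weights / self.class_weights.norm(dim=-1, keepdim=True).clamp_min(1e-8)
        return feats @ w.t()                        # cosine similarity to each class weight
```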
The classifying sub-model is used for calculating the similarity between the image frame characteristics of the virtual video sample and each training video characteristic in the video classifying process of the classifier, so as to obtain the similarity between the image frame characteristics and each training video characteristic; and calculating the category matching probability of the virtual video sample and each category based on the similarity.
Under the condition that there is only one training video feature under each category, the similarity between the image frame features and that training video feature is calculated, and the category matching probability between the virtual video sample and each category is calculated based on the similarity; if each category includes a plurality of training video features, the average value of the similarities between the image frame features and the plurality of training video features is calculated, and the category matching probability between the virtual video sample and each category is calculated based on the average similarity. In the process of calculating the category matching probability, the similarities between the virtual video sample and the respective categories may be normalized into proportions to obtain the category matching probability of each category; other processing methods may also be used to calculate the class matching probability, and the embodiment is not limited herein.
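For illustration, the similarity averaging and category matching probability described above could be computed roughly as follows; the use of cosine similarity and of a softmax for normalization are assumptions, since the embodiment does not fix these choices.

```python
import torch

def class_matching_probability(frame_feature, class_train_features):
    """frame_feature:        (D,) feature of the virtual video sample
    class_train_features: list of (N_c, D) tensors, training video features per category
    Returns a tensor of category matching probabilities."""
    sims = []
    f = frame_feature / frame_feature.norm().clamp_min(1e-8)
    for feats in class_train_features:
        # Cosine similarity to every training video feature of this category, then the mean.
        t = feats / feats.norm(dim=-1, keepdim=True).clamp_min(1e-8)
        sims.append((t @ f).mean())
    sims = torch.stack(sims)
    # Normalize the per-category similarities into matching probabilities (assumed: softmax).
    return torch.softmax(sims, dim=0)
```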
In the implementation, in order to realize the training of the video classification model, after the classification result of each virtual video sample is obtained, parameter adjustment is carried out on the video classification model based on the classification result of each virtual video sample; in order to improve the robustness and generalization capability of the video classification model obtained by training, in an optional implementation manner provided in this embodiment, in a process of performing parameter adjustment on the video classification model based on the classification result of each virtual video sample, the following operations are performed:
and calculating training loss based on the virtual video samples and the classification results of the virtual video samples, and carrying out parameter adjustment on the video classification model constructed by the model parameters based on the training loss.
In an optional implementation manner provided in this embodiment, calculating the training loss based on the virtual video samples and the classification results of the virtual video samples includes:
carrying out parameter correction on classification result parameters in the classification result of the first virtual video sample; the first virtual video sample is obtained by fusing test video samples and training video samples of different video categories;
calculating the training loss based on the classification result after parameter correction, the classification result of the second virtual video sample, and the class labels of the virtual video samples; the second virtual video sample is obtained by fusing a test video sample and a support video sample of the same video category.
Optionally, performing parameter correction on a classification result parameter in the classification result of the first virtual video sample includes:
calculating a correction coefficient for parameter correction according to the fusion parameters;
and carrying out parameter correction on the classification prediction result based on the class label, the classification number and the correction coefficient corresponding to the first virtual video sample.
In a specific implementation process, the virtual video samples are introduced essentially in the hope of reducing over-confidence of the video classification model: uncertainty is introduced into the training process through the class labels, so that the video classification model is encouraged to have better generalization performance. It is desirable to further promote this uncertainty when test and training video samples of different categories are mixed. For example, if the support video sample belongs to the same category as the query video sample, the two samples share more similar characteristics, so the video classification model should make a more confident determination for the virtual video sample obtained by fusing the query video sample and the support video sample; but if the support video sample belongs to another category, the similar characteristics are fewer, the category of the virtual video sample obtained by fusing the query video sample with that support video sample is more difficult to distinguish, and the video classification model should give a less confident determination at this time. Therefore, when the fused test video sample and training video sample belong to different categories, the correction coefficient is calculated first:
where ε is the correction coefficient, τ is a constant, and λ is the mixing ratio.
After calculating the correction coefficient, the predicted classification result is corrected in the following manner:
where n is the number of classifications.
After the corrected classification result of the first virtual video sample is obtained through calculation, the training loss is calculated based on the classification result of the first virtual video sample after parameter correction, the classification result of the second virtual video sample and the class labels of the first virtual video sample and the second virtual video sample, and parameter adjustment is carried out on the video classification model to be trained based on the training loss, so that the trained video classification model is obtained.
In a specific execution process, the video classification model retains the same backbone network as in the meta-training stage, so that the backbone network (video classification model) can correctly predict the virtual videos obtained after fusion processing; given one test video, L virtual videos can be generated by using the training videos, where L is a hyper-parameter in meta-learning; by synthesizing the classification results of all the virtual videos, a better classification prediction is made for the test video.
Step S408, calculating a video classification result of the test video based on the classification result of each virtual video.
In practical application, the single classification result of the test sample is calculated in the following manner:
where y_pred represents the video classification result of the test video; the portion of the rise in the confidence of the corresponding category caused by introducing the support video samples is culled.
In practical application, if the same pixel-wise random fusion as in training is carried out without constraint, the classification prediction becomes more sensitive to noise when the proportion of the query video in the mixture is relatively low; therefore, a fixed fusion proportion larger than 0.5 is used, and the same number of support videos is selected for each category for fusion, so that the above formula can finally be simplified into the following form:
where m represents the number of samples selected for fusion for each class, and f(·) represents the feature of the corresponding video.
Specifically, a video classification result of the query video is calculated based on the formula.
In addition, in this embodiment, fusion processing of multiple dimensions or multiple modes may be performed: the fusion processing is not limited to the data layer and may also be applied to feature maps; for small sample video classification, since different frames of different videos all undergo feature extraction through the same backbone network, the fusion processing may also be implemented by adding it into the backbone network;
alternatively, in addition to fusing two samples pixel by pixel to generate a virtual video sample, further data enhancement methods may be introduced, such as cropping, swapping portions of one sample with another sample, or other more complex operations.
In this embodiment, a data enhancement method for meta-learning is provided from the sample perspective, and the performance of the video classification model is improved by expanding the samples. The method can be conveniently applied to any meta-learning based small sample video classification method and stably improves the performance of the trained video classification model while introducing only a small calculation overhead; stable improvement is obtained under each evaluation index on two different small sample video classification methods and three different small sample video data sets.
One or more embodiments of a training apparatus for a video classification model provided in the present specification are as follows:
in the foregoing embodiments, a training method for a video classification model is provided, and a training device for a video classification model is provided correspondingly, which is described below with reference to the accompanying drawings.
Referring to fig. 5, a schematic diagram of an embodiment of a training apparatus for a video classification model according to the present embodiment is shown.
Since the apparatus embodiments correspond to the method embodiments, the description is relatively simple, and the relevant portions should be referred to the corresponding descriptions of the method embodiments provided above. The device embodiments described below are merely illustrative.
The embodiment provides a training device for a video classification model, which comprises:
the pairing tag generation module 502 is configured to sample the video set to obtain a test video sample and at least one training video sample, and generate pairing tags of the test video sample and each training video sample;
a tag conversion module 504 configured to obtain model parameters and category tags of the test video sample and the at least one training video sample based on tag conversion of the test video sample and a pairing tag of each training video sample;
the fusion processing module 506 is configured to perform fusion processing on the test video sample and each training video sample according to the fusion parameters and the category labels, so as to obtain at least one virtual video sample;
A parameter adjustment module 508, configured to input the at least one virtual video sample into a video classification model constructed based on the model parameters for video classification, and perform parameter adjustment on the video classification model based on classification results of each virtual video sample to obtain a video classification model.
One or more embodiments of a video classification processing apparatus provided in the present specification are as follows:
in the above-described embodiments, a video classification processing method is provided, and a video classification processing apparatus is provided corresponding to the video classification processing method, and is described below with reference to the accompanying drawings.
Referring to fig. 6, a schematic diagram of an embodiment of a video classification processing apparatus according to the present embodiment is shown.
Since the apparatus embodiments correspond to the method embodiments, the description is relatively simple, and the relevant portions should be referred to the corresponding descriptions of the method embodiments provided above. The device embodiments described below are merely illustrative.
The embodiment provides a video classification processing device, including:
a sampling module 602 configured to sample a set of target videos to obtain a test video and at least one training video;
the fusion processing module 604 is configured to perform fusion processing on the test video and each training video according to fusion parameters, so as to obtain at least one virtual video corresponding to the test video;
The video classification module 606 is configured to perform video classification on the at least one virtual video input video classification model to obtain classification results of each virtual video;
a result calculation module 608 configured to calculate a video classification result of the test video based on the classification result of the respective virtual videos;
the video classification model is obtained by carrying out model training on a video classification model constructed based on model parameters based on at least one virtual video sample; the model parameters are obtained after label conversion based on pairing labels of the test video samples and the training video samples.
One or more embodiments of a training apparatus for a video classification model provided herein are as follows:
in response to the above-described training method for a video classification model, one or more embodiments of the present disclosure further provide a training device for a video classification model, where the training device for a video classification model is used to perform the above-described provided training method for a video classification model, and fig. 7 is a schematic structural diagram of the training device for a video classification model provided by one or more embodiments of the present disclosure.
The training device for a video classification model provided in this embodiment includes:
as shown in fig. 7, the training device of the video classification model may have a relatively large difference due to different configurations or performances, and may include one or more processors 701 and a memory 702, where the memory 702 may store one or more application programs or data. The memory 702 may be transient storage or persistent storage. The application program stored in the memory 702 may include one or more modules (not shown in the figure), and each module may include a series of computer-executable instructions in the training device of the video classification model. Still further, the processor 701 may be configured to communicate with the memory 702 and execute the series of computer-executable instructions in the memory 702 on the training device of the video classification model. The training device of the video classification model may also include one or more power sources 703, one or more wired or wireless network interfaces 704, one or more input/output interfaces 705, one or more keyboards 706, and the like.
In a particular embodiment, a training device for a video classification model includes a memory, and one or more programs, wherein the one or more programs are stored in the memory, and the one or more programs may include one or more modules, and each module may include a series of computer-executable instructions in the training device for a video classification model, and configured to be executed by the one or more processors, the one or more programs including computer-executable instructions for:
Sampling a video set to obtain a test video sample and at least one training video sample, and generating a pairing label of the test video sample and each training video sample;
performing label conversion based on the pairing labels of the test video sample and each training video sample to obtain model parameters and class labels of the test video sample and the at least one training video sample;
according to the fusion parameters and the category labels, carrying out fusion processing on the test video sample and each training video sample to obtain at least one virtual video sample;
inputting the at least one virtual video sample into a video classification model constructed based on the model parameters to classify the video, and carrying out parameter adjustment on the video classification model based on classification results of each virtual video sample to obtain a video classification model.
One or more embodiments of a video classification processing device provided in the present specification are as follows:
in correspondence to the above-described video classification processing method, one or more embodiments of the present disclosure further provide a video classification processing apparatus, based on the same technical concept, for performing the above-provided video classification processing method, and fig. 8 is a schematic structural diagram of a video classification processing apparatus provided by one or more embodiments of the present disclosure.
The video classification processing device provided in this embodiment includes:
as shown in fig. 8, the video classification processing device may have a relatively large difference due to different configurations or performances, and may include one or more processors 801 and a memory 802, where the memory 802 may store one or more application programs or data. The memory 802 may be transient storage or persistent storage. The application programs stored in the memory 802 may include one or more modules (not shown in the figure), and each module may include a series of computer-executable instructions in the video classification processing device. Still further, the processor 801 may be configured to communicate with the memory 802 and execute the series of computer-executable instructions in the memory 802 on the video classification processing device. The video classification processing device may also include one or more power sources 803, one or more wired or wireless network interfaces 804, one or more input/output interfaces 805, one or more keyboards 806, and the like.
In one particular embodiment, a video classification processing device includes a memory, and one or more programs, wherein the one or more programs are stored in the memory, and the one or more programs may include one or more modules, and each module may include a series of computer-executable instructions for the video classification processing device, and configured to be executed by one or more processors, the one or more programs comprising computer-executable instructions for:
Sampling the target video set to obtain a test video and at least one training video;
according to the fusion parameters, carrying out fusion processing on the test video and each training video to obtain at least one virtual video corresponding to the test video;
performing video classification on the at least one virtual video input video classification model to obtain classification results of each virtual video;
calculating a video classification result of the test video based on the classification result of each virtual video;
the video classification model is obtained by carrying out model training on a video classification model constructed based on model parameters based on at least one virtual video sample; the model parameters are obtained after label conversion based on pairing labels of the test video samples and the training video samples.
One or more embodiments of a storage medium provided in the present specification are as follows:
one or more embodiments of the present disclosure further provide a storage medium, based on the same technical concept, corresponding to the training method of a video classification model described above.
The storage medium provided in this embodiment is configured to store computer executable instructions that, when executed by a processor, implement the following flow:
Sampling a video set to obtain a test video sample and at least one training video sample, and generating a pairing label of the test video sample and each training video sample;
performing label conversion based on the pairing labels of the test video sample and each training video sample to obtain model parameters and class labels of the test video sample and the at least one training video sample;
according to the fusion parameters and the category labels, carrying out fusion processing on the test video sample and each training video sample to obtain at least one virtual video sample;
inputting the at least one virtual video sample into a video classification model constructed based on the model parameters to classify the video, and carrying out parameter adjustment on the video classification model based on classification results of each virtual video sample to obtain a video classification model.
It should be noted that, in the present specification, the embodiment about the storage medium and the embodiment about the training method of the video classification model in the present specification are based on the same inventive concept, so that the specific implementation of this embodiment may refer to the implementation of the foregoing corresponding method, and the repetition is omitted.
One or more embodiments of another storage medium provided in the present specification are as follows:
in correspondence to the video classification processing method described above, one or more embodiments of the present disclosure further provide a storage medium based on the same technical concept.
The storage medium provided in this embodiment is configured to store computer executable instructions that, when executed by a processor, implement the following flow:
sampling the target video set to obtain a test video and at least one training video;
according to the fusion parameters, carrying out fusion processing on the test video and each training video to obtain at least one virtual video corresponding to the test video;
performing video classification on the at least one virtual video input video classification model to obtain classification results of each virtual video;
calculating a video classification result of the test video based on the classification result of each virtual video;
the video classification model is obtained by carrying out model training on a video classification model constructed based on model parameters based on at least one virtual video sample; the model parameters are obtained after label conversion based on pairing labels of the test video samples and the training video samples.
It should be noted that, the embodiments related to the storage medium in the present specification and the embodiments related to the video classification processing method in the present specification are based on the same inventive concept, so that the specific implementation of the embodiments may refer to the implementation of the corresponding method, and the repetition is omitted.
In this specification, each embodiment is described in a progressive manner, and the same or similar parts of each embodiment are referred to each other, and each embodiment focuses on the differences from other embodiments, for example, an apparatus embodiment, and a storage medium embodiment, which are all similar to a method embodiment, so that description is relatively simple, and relevant content in reading apparatus embodiments, and storage medium embodiments is referred to the part description of the method embodiment.
The foregoing describes specific embodiments of the present disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims can be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.
In the 1990s, improvements to a technology could clearly be distinguished as improvements in hardware (e.g., improvements to circuit structures such as diodes, transistors, switches, etc.) or improvements in software (improvements to the method flow). However, with the development of technology, many improvements of method flows nowadays can be regarded as direct improvements of hardware circuit structures. Designers almost always obtain corresponding hardware circuit structures by programming improved method flows into hardware circuits. Therefore, it cannot be said that an improvement of a method flow cannot be realized by a hardware entity module. For example, a programmable logic device (Programmable Logic Device, PLD) (e.g., a field programmable gate array (Field Programmable Gate Array, FPGA)) is an integrated circuit whose logic function is determined by the programming of the device by a user. A designer programs to "integrate" a digital system onto a PLD without requiring the chip manufacturer to design and fabricate application-specific integrated circuit chips. Moreover, nowadays, instead of manually manufacturing integrated circuit chips, such programming is mostly implemented by using "logic compiler" software, which is similar to the software compiler used in program development and writing, and the original code before compiling also has to be written in a specific programming language, which is called a hardware description language (Hardware Description Language, HDL); there is not just one HDL, but a plurality of kinds, such as ABEL (Advanced Boolean Expression Language), AHDL (Altera Hardware Description Language), Confluence, CUPL (Cornell University Programming Language), HDCal, JHDL (Java Hardware Description Language), Lava, Lola, MyHDL, PALASM, RHDL (Ruby Hardware Description Language), etc.; VHDL (Very-High-Speed Integrated Circuit Hardware Description Language) and Verilog are currently most commonly used. It will also be apparent to those skilled in the art that a hardware circuit implementing the logic method flow can be readily obtained by merely slightly logically programming the method flow into an integrated circuit using several of the hardware description languages described above.
The controller may be implemented in any suitable manner; for example, the controller may take the form of a microprocessor or processor and a computer readable medium storing computer readable program code (e.g., software or firmware) executable by the (micro)processor, logic gates, switches, an application specific integrated circuit (Application Specific Integrated Circuit, ASIC), a programmable logic controller, and an embedded microcontroller, examples of which include, but are not limited to, the following microcontrollers: ARC 625D, Atmel AT91SAM, Microchip PIC18F26K20, and Silicon Labs C8051F320; the memory controller may also be implemented as part of the control logic of the memory. Those skilled in the art will also appreciate that, in addition to implementing the controller in purely computer readable program code, it is entirely possible to implement the same functionality by logically programming the method steps such that the controller takes the form of logic gates, switches, application specific integrated circuits, programmable logic controllers, embedded microcontrollers, etc. Such a controller may thus be regarded as a kind of hardware component, and the means included therein for performing various functions may also be regarded as structures within the hardware component. Or even the means for achieving the various functions may be regarded as both software modules implementing the method and structures within the hardware component.
The system, apparatus, module or unit set forth in the above embodiments may be implemented in particular by a computer chip or entity, or by a product having a certain function. One typical implementation is a computer. In particular, the computer may be, for example, a personal computer, a laptop computer, a cellular telephone, a camera phone, a smart phone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
For convenience of description, the above devices are described as being functionally divided into various units, respectively. Of course, the functions of each unit may be implemented in the same piece or pieces of software and/or hardware when implementing the embodiments of the present specification.
One skilled in the relevant art will recognize that one or more embodiments of the present description may be provided as a method, system, or computer program product. Accordingly, one or more embodiments of the present description may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present description can take the form of a computer program product on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.
The present description is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the specification. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In one typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer-readable medium, random Access Memory (RAM) and/or nonvolatile memory, such as Read Only Memory (ROM) or flash memory (flash RAM). Memory is an example of computer-readable media.
Computer readable media include persistent and non-persistent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of storage media for a computer include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read only memory (ROM), electrically erasable programmable read only memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape magnetic disk storage or other magnetic storage devices, or any other non-transmission medium, which can be used to store information that can be accessed by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transmission media), such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article or apparatus that comprises the element.
One or more embodiments of the present specification may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. One or more embodiments of the specification may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
In this specification, each embodiment is described in a progressive manner, and identical and similar parts of each embodiment are all referred to each other, and each embodiment mainly describes differences from other embodiments. In particular, for system embodiments, since they are substantially similar to method embodiments, the description is relatively simple, as relevant to see a section of the description of method embodiments.
The foregoing description is by way of example only and is not intended to limit the present disclosure. Various modifications and changes may occur to those skilled in the art. Any modifications, equivalent substitutions, improvements, etc. that fall within the spirit and principles of the present document are intended to be included within the scope of the claims of the present document.
Claims (25)
1. A method of training a video classification model, comprising:
sampling a video set to obtain a test video sample and at least one training video sample, and generating a pairing label of the test video sample and each training video sample;
performing label conversion based on the pairing labels of the test video sample and each training video sample to obtain model parameters and class labels of the test video sample and the at least one training video sample;
According to the fusion parameters and the category labels, carrying out fusion processing on the test video sample and each training video sample to obtain at least one virtual video sample;
inputting the at least one virtual video sample into a video classification model constructed based on the model parameters to classify the video, and carrying out parameter adjustment on the video classification model based on classification results of each virtual video sample to obtain a video classification model.
2. The training method of a video classification model according to claim 1, the model parameters being obtained by:
reading a first similarity algorithm for performing positive sample similarity calculation under the dimension of the paired labels and a second similarity algorithm for performing negative sample similarity calculation;
reading a third similarity algorithm for performing similarity calculation of video samples of the same category under the dimension of the category label and a fourth similarity algorithm for performing similarity calculation of video samples of different categories;
based on the test video sample and the training video sample set, calculating a first classification weight according to the first similarity algorithm and the third similarity algorithm, calculating a second classification weight according to the second similarity algorithm and the fourth similarity algorithm, and determining the first classification weight and the second classification weight as the model parameters.
3. The method for training a video classification model according to claim 2, wherein the first classification weight includes a classification weight corresponding to a category to which the test video sample belongs;
the second classification weight comprises classification weights corresponding to a target number of categories other than the category to which the test video sample belongs; the target number is the classification number of the classifier minus a constant.
4. The method for training a video classification model according to claim 1, wherein the fusing the test video sample and each training video sample according to the fusion parameters and the class labels to obtain at least one virtual video sample comprises:
according to the fusion proportion, carrying out fusion processing on each image frame of the test video sample and each training video sample to obtain at least one virtual video sample;
and according to the fusion proportion, carrying out fusion processing on the category labels of the test video sample and of each training video sample whose image frames were subjected to fusion processing, to obtain the virtual category labels of the virtual video samples.
5. The method for training a video classification model according to claim 1, wherein the sampling the video set to obtain a test video sample and at least one training video sample, and generating a pairing tag of the test video sample and each training video sample, includes:
Sampling the video set according to the classification number to obtain a test video sample set consisting of a plurality of test video samples and a training video sample set consisting of at least one training video sample;
and generating pairing labels of each test video sample and each training video sample in the test video sample set.
6. The training method of a video classification model according to claim 5, wherein the test video sample set includes test video samples of the classification number of video categories, and the number of the test video samples under each video category is a preset threshold;
the training video sample set comprises training video samples of the classification number of preset categories, and the number of the training video samples under each video category is larger than the preset threshold.
7. The method for training a video classification model according to claim 1, the generating a pairing tag of the test video sample and each training video sample, comprising:
determining a pairing tag of the test video sample and a training video sample belonging to the same video category as the test video sample as a first tag;
and determining the paired label of the test video sample and the training video sample belonging to different video categories from the test video sample as a second label.
8. The method of training a video classification model according to claim 1, the video classification of any one of the at least one virtual video sample comprising:
image sampling is carried out on any virtual video sample, image frames with the sampling number are obtained, and the image frames are input into a feature extraction layer for feature extraction, so that image frame features are obtained;
and inputting the image frame characteristics into a classification sub-model constructed by the model parameters to carry out video classification on any virtual video sample, so as to obtain a classification result of any virtual video sample.
9. The method for training a video classification model according to claim 8, the classification sub-model performing video classification on the arbitrary virtual video sample, comprising:
performing similarity calculation on the image frame characteristics and at least one training video characteristic under each class to obtain similarity of the image frame characteristics and each training video characteristic under each class;
and calculating the category matching probability of any virtual video sample and each category based on the similarity, and taking the calculated category matching probability as the classification result.
10. The method for training a video classification model according to claim 1, wherein the parameter adjustment of the video classification model based on the classification result of each virtual video sample comprises:
And calculating training loss based on the virtual video samples and the classification results of the virtual video samples, and carrying out parameter adjustment on the video classification model constructed by the model parameters based on the training loss.
11. The training method of a video classification model according to claim 10, wherein calculating a training loss based on each virtual video sample and the classification result of each virtual video sample comprises:
performing parameter correction on the classification result parameters in the classification result of a first virtual video sample, the first virtual video sample being obtained by fusing a test video sample and a training video sample of different video categories; and
calculating the training loss based on the classification result after parameter correction, the classification result of a second virtual video sample, and the class labels of the virtual video samples, the second virtual video sample being obtained by fusing a test video sample and a training video sample of the same video category.
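A minimal reading of claim 11, assuming cross entropy as the loss: cross-category (first) virtual samples contribute their parameter-corrected results, same-category (second) virtual samples their raw results, and the episode loss averages over both. The function and argument names are illustrative.

```python
import numpy as np

def cross_entropy(probs, label):
    return -np.log(probs[label] + 1e-12)

def episode_loss(corrected_results, same_class_results):
    """Mean cross entropy over one episode. Each entry of the two lists is a
    (probs, class_label) pair: `corrected_results` holds the parameter-corrected
    results of cross-category virtual samples, `same_class_results` the raw
    results of same-category virtual samples."""
    losses = [cross_entropy(p, y) for p, y in corrected_results]
    losses += [cross_entropy(p, y) for p, y in same_class_results]
    return float(np.mean(losses))
```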
12. The training method of a video classification model according to claim 11, wherein performing parameter correction on the classification result parameters in the classification result of the first virtual video sample comprises:
calculating a correction coefficient for parameter correction according to the fusion parameters; and
performing parameter correction on the classification result based on the class label corresponding to the first virtual video sample, the classification number, and the correction coefficient.
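Claim 12 does not spell out the correction formula. One plausible sketch: derive the correction coefficient from the fusion parameter, then use it together with the classification number to redistribute the probability mass that the fused-in training video contributes to its own category. Taking the class label to be the fused-in training video's category is itself an assumption; everything below is for illustration only, and the corrected result would feed the loss sketch above as an entry of `corrected_results`.

```python
import numpy as np

def correct_result(probs, train_label, num_classes, lam):
    """Illustrative parameter correction of one classification result.
    `lam` is the fusion parameter; the coefficient removes the share of
    probability attributed to the fused-in category and spreads it evenly
    over the `num_classes` categories."""
    coeff = 1.0 - lam                      # correction coefficient from the fusion parameter
    corrected = probs.copy()
    corrected[train_label] -= coeff * probs[train_label]
    corrected += coeff * probs[train_label] / num_classes
    return corrected / corrected.sum()     # renormalise to a distribution
```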
13. The training method of a video classification model according to claim 1, further comprising:
sampling the target video set to obtain a test video and at least one training video;
performing fusion processing on the test video and each training video according to the fusion parameters to obtain at least one virtual video corresponding to the test video;
inputting the at least one virtual video into the trained video classification model for video classification to obtain a classification result of each virtual video; and
calculating a video classification result of the test video based on the classification result of each virtual video.
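Claims 13 and 14 describe the same inference flow: fuse the test video with every training video, classify each resulting virtual video with the trained model, and derive the test video's result from the virtual results. The sketch below uses simple averaging as the final aggregation step, which the claims leave open; `fuse` and `model` are placeholders for the fusion step and the trained classification model.

```python
import numpy as np

def classify_test_video(test_video, training_videos, lam, fuse, model):
    """Fuse the test video with each training video, classify the fused
    (virtual) videos, and average their per-category results to obtain the
    test video's classification result."""
    virtual_videos = [fuse(test_video, train_video, lam)
                      for train_video in training_videos]
    results = np.stack([model(v) for v in virtual_videos])
    return results.mean(axis=0)            # per-category score for the test video
```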
14. A video classification processing method, comprising:
sampling the target video set to obtain a test video and at least one training video;
performing fusion processing on the test video and each training video according to fusion parameters to obtain at least one virtual video corresponding to the test video;
inputting the at least one virtual video into a video classification model for video classification to obtain a classification result of each virtual video; and
calculating a video classification result of the test video based on the classification result of each virtual video;
wherein the video classification model is obtained by performing model training, based on at least one virtual video sample, on a video classification model constructed based on model parameters, and the model parameters are obtained by label conversion based on pairing labels of a test video sample and training video samples.
15. The video classification processing method according to claim 14, wherein the at least one virtual video sample is obtained by fusing the test video sample with each training video sample according to the fusion parameters and the class labels of the training video samples.
16. The video classification processing method according to claim 14, wherein performing video classification on any virtual video of the at least one virtual video comprises:
performing image sampling on the any virtual video to obtain a sampling number of image frames, and inputting the image frames into a feature extraction layer for feature extraction to obtain image frame features; and
inputting the image frame features into a classification sub-model to perform video classification on the any virtual video, obtaining a classification result of the any virtual video.
17. The video classification processing method according to claim 14, wherein performing fusion processing on the test video and each training video according to the fusion parameters to obtain at least one virtual video corresponding to the test video comprises:
performing fusion processing on each image frame of the test video with each training video according to a fusion proportion to obtain the at least one virtual video.
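The frame-level fusion of claim 17 is a mixup-style convex combination of corresponding frames at the fusion proportion. A minimal sketch, assuming both videos are float arrays of shape (T, H, W, C) and that the shorter video bounds the number of fused frames:

```python
import numpy as np

def fuse_videos(test_video, train_video, lam):
    """Frame-by-frame fusion: each virtual frame is a convex combination of
    the corresponding test and training frames at proportion `lam`."""
    num_frames = min(len(test_video), len(train_video))
    return lam * test_video[:num_frames] + (1.0 - lam) * train_video[:num_frames]
```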
18. The video classification processing method according to claim 14, wherein the model parameters are obtained by:
reading a first similarity algorithm for positive-sample similarity calculation under the pairing label dimension and a second similarity algorithm for negative-sample similarity calculation;
reading a third similarity algorithm for same-category video sample similarity calculation under the class label dimension and a fourth similarity algorithm for different-category video sample similarity calculation; and
calculating, based on the test video sample and the training video sample set, a first classification weight according to the first similarity algorithm and the third similarity algorithm and a second classification weight according to the second similarity algorithm and the fourth similarity algorithm, and determining the first classification weight and the second classification weight as the model parameters.
19. The video classification processing method according to claim 18, wherein the first classification weight comprises the classification weight corresponding to the category to which the test video sample belongs; and
the second classification weight comprises classification weights corresponding to a target number of categories other than the category to which the test video sample belongs, the target number being the classification number of the classifier minus a constant.
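Claims 18 and 19 only constrain how many classification weights there are and which similarities feed them. One loose sketch: use the mean feature of the same-category (positive) training samples as the first classification weight and the mean feature of each remaining category as a second classification weight, so that the second weights cover the classification number minus one categories. Integer category labels in 0..num_classes-1, at least one training sample per category, and the mean-feature choice are all assumptions.

```python
import numpy as np

def classification_weights(train_feats, train_labels, test_label, num_classes):
    """Build one weight vector for the test sample's own category and one for
    each of the other categories from the training sample features."""
    feats = np.asarray(train_feats)
    labels = np.asarray(train_labels)

    first_weight = feats[labels == test_label].mean(axis=0)   # weight for the test sample's category
    second_weights = {c: feats[labels == c].mean(axis=0)      # weights for the remaining categories
                      for c in range(num_classes) if c != test_label}
    return first_weight, second_weights
```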
20. A training apparatus for a video classification model, comprising:
a pairing label generation module configured to sample the video set to obtain a test video sample and at least one training video sample, and to generate a pairing label of the test video sample and each training video sample;
a label conversion module configured to perform label conversion based on the pairing labels of the test video sample and each training video sample to obtain model parameters and class labels of the test video sample and the at least one training video sample;
a fusion processing module configured to perform fusion processing on the test video sample and each training video sample according to fusion parameters and the class labels to obtain at least one virtual video sample; and
a parameter adjustment module configured to input the at least one virtual video sample into a video classification model constructed based on the model parameters for video classification, and to perform parameter adjustment on the video classification model based on the classification result of each virtual video sample to obtain a trained video classification model.
21. A video classification processing apparatus, comprising:
a sampling module configured to sample the target video set to obtain a test video and at least one training video;
a fusion processing module configured to perform fusion processing on the test video and each training video according to fusion parameters to obtain at least one virtual video corresponding to the test video;
a video classification module configured to input the at least one virtual video into a video classification model for video classification to obtain a classification result of each virtual video; and
a result calculation module configured to calculate a video classification result of the test video based on the classification result of each virtual video;
wherein the video classification model is obtained by performing model training, based on at least one virtual video sample, on a video classification model constructed based on model parameters, and the model parameters are obtained by label conversion based on pairing labels of a test video sample and training video samples.
22. A training apparatus for a video classification model, comprising:
a processor; and a memory configured to store computer-executable instructions that, when executed, cause the processor to:
sampling a video set to obtain a test video sample and at least one training video sample, and generating a pairing label of the test video sample and each training video sample;
performing label conversion based on the pairing labels of the test video sample and each training video sample to obtain model parameters and class labels of the test video sample and the at least one training video sample;
performing fusion processing on the test video sample and each training video sample according to the fusion parameters and the class labels to obtain at least one virtual video sample; and
inputting the at least one virtual video sample into a video classification model constructed based on the model parameters for video classification, and performing parameter adjustment on the video classification model based on the classification result of each virtual video sample to obtain a trained video classification model.
23. A video classification processing device, comprising:
a processor; and a memory configured to store computer-executable instructions that, when executed, cause the processor to:
sampling the target video set to obtain a test video and at least one training video;
performing fusion processing on the test video and each training video according to the fusion parameters to obtain at least one virtual video corresponding to the test video;
inputting the at least one virtual video into a video classification model for video classification to obtain a classification result of each virtual video; and
calculating a video classification result of the test video based on the classification result of each virtual video;
wherein the video classification model is obtained by performing model training, based on at least one virtual video sample, on a video classification model constructed based on model parameters, and the model parameters are obtained by label conversion based on pairing labels of a test video sample and training video samples.
24. A storage medium storing computer-executable instructions that when executed by a processor implement the following:
sampling a video set to obtain a test video sample and at least one training video sample, and generating a pairing label of the test video sample and each training video sample;
performing label conversion based on the pairing labels of the test video sample and each training video sample to obtain model parameters and class labels of the test video sample and the at least one training video sample;
performing fusion processing on the test video sample and each training video sample according to the fusion parameters and the class labels to obtain at least one virtual video sample; and
inputting the at least one virtual video sample into a video classification model constructed based on the model parameters for video classification, and performing parameter adjustment on the video classification model based on the classification result of each virtual video sample to obtain a trained video classification model.
25. A storage medium storing computer-executable instructions that when executed by a processor implement the following:
sampling the target video set to obtain a test video and at least one training video;
performing fusion processing on the test video and each training video according to the fusion parameters to obtain at least one virtual video corresponding to the test video;
inputting the at least one virtual video into a video classification model for video classification to obtain a classification result of each virtual video; and
calculating a video classification result of the test video based on the classification result of each virtual video;
wherein the video classification model is obtained by performing model training, based on at least one virtual video sample, on a video classification model constructed based on model parameters, and the model parameters are obtained by label conversion based on pairing labels of a test video sample and training video samples.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310507774.2A CN116597348A (en) | 2023-05-04 | 2023-05-04 | Training method and device for video classification model |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310507774.2A CN116597348A (en) | 2023-05-04 | 2023-05-04 | Training method and device for video classification model |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116597348A true CN116597348A (en) | 2023-08-15 |
Family
ID=87603774
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310507774.2A Pending CN116597348A (en) | 2023-05-04 | 2023-05-04 | Training method and device for video classification model |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116597348A (en) |
- 2023-05-04: CN application CN202310507774.2A filed; publication CN116597348A (en), status: Pending
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |