CN116416480A - Visual classification method and device based on multi-template prompt learning - Google Patents

Visual classification method and device based on multi-template prompt learning

Info

Publication number: CN116416480A (granted as CN116416480B)
Application number: CN202310680502.2A
Authority: CN (China)
Prior art keywords: prompt, template, sample, visual, templates
Legal status: Granted; Active
Other languages: Chinese (zh)
Inventors: Yang Shu (杨舒); Wang Shengjin (王生进)
Assignee (original and current): Tsinghua University
Events: application filed by Tsinghua University; priority to CN202310680502.2A; publication of CN116416480A; application granted; publication of CN116416480B

Classifications

    • G: Physics
    • G06: Computing; calculating or counting
    • G06N: Computing arrangements based on specific computational models
        • G06N 3/00: Computing arrangements based on biological models
        • G06N 3/02: Neural networks
        • G06N 3/08: Learning methods
        • G06N 3/0895: Weakly supervised learning, e.g. semi-supervised or self-supervised learning
        • G06N 3/09: Supervised learning
    • G06V: Image or video recognition or understanding
        • G06V 10/00: Arrangements for image or video recognition or understanding
        • G06V 10/70: Arrangements using pattern recognition or machine learning
        • G06V 10/764: Arrangements using classification, e.g. of video objects
        • G06V 10/77: Processing image or video features in feature spaces; data integration or data reduction, e.g. principal component analysis [PCA], independent component analysis [ICA] or self-organising maps [SOM]; blind source separation
        • G06V 10/774: Generating sets of training patterns; bootstrap methods, e.g. bagging or boosting
        • G06V 10/80: Fusion, i.e. combining data from various sources at the sensor, preprocessing, feature extraction or classification level
        • G06V 10/806: Fusion of extracted features
        • G06V 10/82: Arrangements using neural networks
    • Y: General tagging of new technological developments; general tagging of cross-sectional technologies spanning over several sections of the IPC; technical subjects covered by former USPC cross-reference art collections [XRACs] and digests
    • Y02: Technologies or applications for mitigation or adaptation against climate change
    • Y02T: Climate change mitigation technologies related to transportation
        • Y02T 10/00: Road transport of goods or passengers
        • Y02T 10/10: Internal combustion engine [ICE] based vehicles
        • Y02T 10/40: Engine management systems

Abstract

The invention provides a visual classification method and device based on multi-template prompt learning, relating to the technical field of machine learning, the method comprising: generating a candidate text set under each of a plurality of prompt templates by using the category name set of the visual classification task; inputting continuous video frames of the video to be classified and the candidate text set under each prompt template into a visual language coding model to obtain the category probability distribution of the video under each prompt template; and determining the visual classification result of the video by using the category probability distributions. According to the invention, three-stage training consisting of fully-supervised template parameter optimization, semi-supervised model optimization and fully-supervised template parameter fine-tuning is performed on a plurality of preset prompt templates and on a visual language pre-training model into which a frame fusion module is inserted, so as to obtain the plurality of prompt templates and the visual language coding model. This improves the utilization efficiency of training samples when the visual language pre-training model is generalized to a downstream visual understanding task, and accordingly improves accuracy when the plurality of prompt templates and the visual language coding model are applied to the downstream visual understanding task.

Description

Visual classification method and device based on multi-template prompt learning
Technical Field
The invention relates to the technical field of machine learning, and in particular to a visual classification method and device based on multi-template prompt learning.
Background
Visual-language pre-training (VLP) adopts a multi-modal self-supervised learning approach and utilizes large-scale "image/video-text pair" data to learn cross-modal semantic associations between vision and language. However, when applied to downstream image/video understanding tasks, existing visual language pre-training models typically attach a new classifier/regressor after the encoded features and then perform end-to-end parameter fine-tuning. This approach has two problems. First, since the downstream task is inconsistent with the pre-training task, end-to-end learning can cause the knowledge learned in the pre-training stage to be lost. Second, fine-tuning too many parameters can cause overfitting when the downstream task has few training samples.
Unlike parameter fine-tuning methods, prompt learning methods convert the downstream task by means of a prompt template and adapt the downstream task to the pre-trained model, so that the objective function of the downstream task is consistent with that of the pre-training task. For example, in an image classification task, given a predefined prompt template "a photo of [ CLASS ]", each candidate class name replaces [ CLASS ] at test time, and the resulting text is fed, together with the test image, into an image-text pre-training model for encoding and matching, thereby completing the image classification. Alternatively, template parameters can be trained on labeled classification samples using a learnable prompt template such as "X X X X X [ CLASS ]". However, existing prompt learning methods have few or no learnable template parameters and do not effectively utilize downstream task samples, so the generalization performance of the pre-trained model on downstream tasks is low.
Disclosure of Invention
Aiming at the problem that the utilization efficiency of training samples is low when a visual language pre-training model is generalized to a downstream task in the existing prompt learning method, the invention provides a visual classification method and device based on multi-template prompt learning.
In a first aspect, the present invention provides a visual classification method based on multi-template prompt learning, the method comprising:
acquiring videos to be classified;
for each of a plurality of prompt templates, generating a candidate text set under the prompt template based on a category name set of a visual classification task; wherein a category name is embedded into a prompt template to generate the candidate text associated with that category name under that prompt template;
inputting the continuous video frames of the video and the candidate text set into a visual language coding model to obtain the category probability distribution of the video under the prompt template;
determining a visual classification result of the video by using the category probability distribution of the video under a plurality of prompt templates;
the visual language coding model is obtained by performing three-stage training, namely fully-supervised template parameter optimization, semi-supervised model optimization and fully-supervised template parameter fine-tuning, on a plurality of preset prompt templates and an improved visual language pre-training model by utilizing a semi-labeled visual classification sample set;
the improved visual language pre-training model is obtained by accessing a frame fusion module behind an image encoder in the visual language pre-training model; the frame fusion module is used for carrying out feature fusion on visual features of the input continuous video frames.
According to the visual classification method based on multi-template prompt learning provided by the invention, the generation process of the plurality of preset prompt templates comprises the following steps:
generating a plurality of initial prompt templates based on a given prompt template format; the given prompt template format is that a prompt template consists of a plurality of prompt character positions and a category flag bit; the plurality of initial prompt templates differ in the number of prompt character positions and/or the position of the category flag bit;

embedding a word into each prompt character position in each initial prompt template to obtain the plurality of preset prompt templates;

wherein embedding a word into any prompt character position in any initial prompt template comprises:

initializing a word to be embedded;

determining the coding serial number of the word to be embedded by using a word list;

substituting the coding serial number into a language embedding model to obtain the coding feature of the word to be embedded;

embedding the coding feature into that prompt character position in that initial prompt template.
According to the visual classification method based on multi-template prompt learning, embedding a category name into a prompt template is equivalent to embedding the category name into the category flag bit of the prompt template;

inputting the continuous video frames of the video and the candidate text set into a visual language coding model to obtain the category probability distribution of the video under the prompt template comprises:
determining, with an image encoder of the visual language coding model, a fused visual feature of successive video frames of the video;
determining a text feature of each candidate text in the set of candidate texts using a text encoder of the visual language coding model;
respectively comparing the feature similarity between the fused visual feature and the text feature of each candidate text in the candidate text set to obtain comparison results;

recording each comparison result as the probability that the category name associated with the corresponding candidate text in the candidate text set is the category name of the video;

and obtaining the category probability distribution of the video under the prompt template based on these probabilities.
According to the visual classification method based on multi-template prompt learning provided by the invention, the plurality of prompt templates and the visual language coding model are obtained by performing three-stage training, namely fully-supervised template parameter optimization, semi-supervised model optimization and fully-supervised template parameter fine-tuning, on a plurality of preset prompt templates and an improved visual language pre-training model by utilizing a semi-labeled visual classification sample set, comprising:
based on the visual language pre-training model, performing full-supervision learning on a plurality of preset prompt templates by using a first sample set contained in the visual classification sample set so as to optimize the preset prompt templates to obtain a plurality of first prompt templates;
based on a plurality of the first prompt templates, performing semi-supervised learning on the improved visual language pre-training model by using a second sample set contained in the visual classification sample set so as to optimize the frame fusion module to obtain the visual language coding model;
Based on the visual language coding model, performing full-supervised learning on the plurality of first prompt templates by using a third sample set contained in the visual classification sample set so as to finely tune the plurality of first prompt templates to obtain a plurality of prompt templates;
the first sample set, the second sample set and the third sample set are all obtained by processing a pre-stored video set annotated over all categories;

the samples in the first sample set are intermediate video frames carrying category labels;

in the second sample set, some samples are continuous video frames carrying category labels and the remaining samples are continuous video frames without category labels;
the samples in the third sample set are continuous video frames carrying category labels.
According to the visual classification method based on multi-template prompt learning provided by the invention, based on the visual language pre-training model, the first sample set contained in the visual classification sample set is utilized to perform full-supervision learning on a plurality of preset prompt templates so as to optimize the preset prompt templates to obtain a plurality of first prompt templates, and the method comprises the following steps:
for each preset prompt template and each sample in the first sample set, generating a candidate text set under the preset prompt template based on the category name set;

inputting the sample and the candidate text set under the preset prompt template into the visual language pre-training model to obtain the class probability distribution of the sample under the preset prompt template;

determining the fully-supervised loss of the sample according to the category label of the sample and the category probability distributions of the sample under the plurality of preset prompt templates;

optimizing the plurality of preset prompt templates into the plurality of first prompt templates by utilizing the fully-supervised losses of the samples in the first sample set;
based on the visual language coding model, performing full-supervised learning on the plurality of first prompt templates by using the third sample set to fine tune the plurality of first prompt templates to obtain a plurality of prompt templates, including:
for each first prompt template and each sample in the third sample set, generating a candidate text set under the first prompt template based on the category name set;

inputting the sample and the candidate text set under the first prompt template into the visual language coding model to obtain the class probability distribution of the sample under the first prompt template;

determining the fully-supervised loss of the sample according to the category label of the sample and the category probability distributions of the sample under the plurality of first prompt templates;

and optimizing the plurality of first prompt templates into the plurality of prompt templates by using the fully-supervised losses of the samples in the third sample set.
According to the visual classification method based on multi-template prompt learning provided by the invention, the method for performing semi-supervised learning on the improved visual language pre-training model by using a second sample set contained in the visual classification sample set based on a plurality of first prompt templates so as to optimize the frame fusion module to obtain the visual language coding model comprises the following steps:
for each first prompt template and each sample in the second sample set, generating a candidate text set under the first prompt template based on the category name set;

inputting the sample and the candidate text set under the first prompt template into the visual language coding model to obtain the class probability distribution of the sample under the first prompt template;

when the sample does not carry a category label, performing uncertainty estimation on the class probability distribution of the sample under the first prompt template to obtain a pseudo-classification label and a weight of the sample under the first prompt template;

determining the unsupervised loss of the sample according to the pseudo-classification labels and weights of the sample under the plurality of first prompt templates;

when the sample carries a category label, determining the fully-supervised loss of the sample according to the category label of the sample and the category probability distributions of the sample under the plurality of first prompt templates;

and optimizing the frame fusion module to obtain the visual language coding model according to the unsupervised losses of the samples without category labels in the second sample set and the fully-supervised losses of the samples with category labels.
According to the visual classification method based on multi-template prompt learning provided by the invention, uncertainty estimation is carried out on the class probability distribution of the sample under the first prompt template to obtain the pseudo classification label of the sample under the first prompt template, and the visual classification method comprises the following steps:
searching a probability value meeting a first condition from the class probability distribution of the sample under the first prompt template;
if such a probability value exists, using the category name pointed to by the probability value as the pseudo-classification label of the sample under the first prompt template; otherwise, determining that no pseudo-classification label exists for the sample under the first prompt template;

wherein the first condition is: the probability value is the maximum value in the class probability distribution and is greater than or equal to a confidence threshold.
According to the visual classification method based on multi-template prompt learning provided by the invention, uncertainty estimation is carried out on the class probability distribution of the sample under the first prompt template to obtain the weight of the sample under the first prompt template, and the method comprises the following steps:
traversing the class probability distribution of the sample under other first prompt templates except the first prompt template to find a first probability value meeting a first condition;
if the first probability value exists and the class name indicated by the first probability value is inconsistent with the pseudo classification label, the uncertainty measurement of the sample under the first prompt template is positive infinity;
if the first probability value exists and the class name indicated by the first probability value is consistent with the pseudo classification label, or the first probability value does not exist, the uncertainty measure of the sample under the first prompt template is the standard deviation of class probability distribution of the sample under a plurality of first prompt templates;
calculating the weight of the sample under the first prompt template by using the uncertainty measure of the sample under the first prompt template;
wherein the weight of the sample under the first hint template is inversely related to the uncertainty measure of the sample under the first hint template.
According to the visual classification method based on multi-template prompt learning, the expression of the unsupervised loss of the $i$-th sample without category label in the second sample set is:

$$\mathcal{L}_{us}^{(i)} = -\frac{1}{M}\sum_{m=1}^{M}\sum_{k=1}^{K} w_{ik}\sum_{c=1}^{C}\bar{p}_{ik}(c)\,\log p_{i}^{(m)}(c)$$

in the above formula, $M$ is the number of prompt templates, $C$ is the number of category names in the category name set, $K$ is the number of pseudo-classification labels of the $i$-th sample without category label in the second sample set, $\sigma_{ik}$ is the uncertainty measure corresponding to the $k$-th pseudo-classification label of that sample, $w_{ik}$ is the weight corresponding to the $k$-th pseudo-classification label of that sample, $\bar{p}_{ik}$ is the class probability distribution converted from the $k$-th pseudo-classification label of that sample, and $p_{i}^{(m)}$ is the class probability distribution of that sample under the $m$-th prompt template;

wherein $\bar{p}_{ik}$ is determined by the following formula:

$$\bar{p}_{ik} = \mathrm{onehot}\left(a_{ik}\right)$$

in the above formula, $a_{ik}$ denotes the position, in the class probability distribution to which it belongs, of the probability value corresponding to the $k$-th pseudo-classification label of the $i$-th sample without category label in the second sample set.
In a second aspect, the present invention provides a visual classification device based on multi-template prompt learning, the device comprising:
The visual input module is used for acquiring videos to be classified; the candidate text generation module is used for generating, for each of a plurality of prompt templates, a candidate text set under the prompt template based on the category name set of the visual classification task; a category name is embedded into a prompt template to generate the candidate text associated with that category name under that prompt template;
the visual language coding module is used for inputting the continuous video frames of the video and the candidate text set into a visual language coding model to obtain the category probability distribution of the video under the prompt template; the visual classification result output module is used for determining a visual classification result of the video by using the category probability distribution of the video under the plurality of prompt templates;
the visual language coding model is obtained by performing three-stage training, namely fully-supervised template parameter optimization, semi-supervised model optimization and fully-supervised template parameter fine-tuning, on a plurality of preset prompt templates and an improved visual language pre-training model by utilizing a semi-labeled visual classification sample set;
the improved visual language pre-training model is obtained by accessing a frame fusion module behind an image encoder in the visual language pre-training model; the frame fusion module is used for carrying out feature fusion on visual features of the input continuous video frames.
The invention provides a visual classification method and device based on multi-template prompt learning, comprising: acquiring a video to be classified; for each of a plurality of prompt templates, generating a candidate text set under the prompt template based on the category name set of the visual classification task, wherein a category name is embedded into a prompt template to generate the candidate text associated with that category name under that prompt template; inputting the continuous video frames of the video and the candidate text set into a visual language coding model to obtain the category probability distribution of the video under the prompt template; and determining the visual classification result of the video by using the category probability distributions of the video under the plurality of prompt templates. According to the invention, three-stage training consisting of fully-supervised template parameter optimization, semi-supervised model optimization and fully-supervised template parameter fine-tuning is performed on a plurality of preset prompt templates and on a visual language pre-training model into which a frame fusion module is inserted, so as to obtain the plurality of prompt templates and the visual language coding model. This improves the utilization efficiency of training samples when the visual language pre-training model is generalized to a downstream visual understanding task, and accordingly improves accuracy when the plurality of prompt templates and the visual language coding model are applied to the downstream visual understanding task.
Drawings
In order to more clearly illustrate the technical solutions of the invention or of the prior art, the drawings used in the description of the embodiments or of the prior art are briefly introduced below. The drawings described below show some embodiments of the invention, and a person skilled in the art can obtain other drawings from them without inventive effort.
FIG. 1 is a schematic flow chart of a visual classification method based on multi-template prompt learning;
FIG. 2 is a schematic diagram of a fully supervised template parameter optimization framework provided by the present invention;
FIG. 3 is a schematic diagram of an application of the frame fusion module in a visual language coding model;
FIG. 4 is a schematic diagram of a semi-supervised model optimization framework provided by the present invention;
FIG. 5 is a schematic diagram of a visual classification device based on multi-template prompt learning according to the present invention;
fig. 6 is a schematic structural diagram of an electronic device provided by the present invention;
reference numerals:
610: a processor; 620: a communication interface; 630: a memory; 640: a communication bus.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The visual classification method and apparatus based on multi-template prompt learning of the present invention is described below in conjunction with fig. 1-6.
In a first aspect, the present invention provides a visual classification method based on multi-template prompt learning, as shown in fig. 1, the method includes:
s11: acquiring videos to be classified;
s12: for each of a plurality of prompt templates, generating a candidate text set under the prompt template based on a category name set of a visual classification task; wherein a category name is embedded into a prompt template to generate the candidate text associated with that category name under that prompt template;

s13: inputting the continuous video frames of the video and the candidate text set into a visual language coding model to obtain the category probability distribution of the video under the prompt template; s14: determining the visual classification result of the video by using the category probability distributions of the video under the plurality of prompt templates; the visual language coding model is obtained by performing three-stage training, namely fully-supervised template parameter optimization, semi-supervised model optimization and fully-supervised template parameter fine-tuning, on a plurality of preset prompt templates and an improved visual language pre-training model by utilizing a semi-labeled visual classification sample set;
The improved visual language pre-training model is obtained by accessing a frame fusion module behind an image encoder in the visual language pre-training model; the frame fusion module is used for carrying out feature fusion on visual features of the input continuous video frames.
The visual language pre-training model is a model which performs multi-mode self-supervision learning on a large-scale image/video-text pair data set in advance, and can perform feature extraction on an image/continuous video frame and a text respectively and map features of the two modes to the same semantic space.
In the improved visual language pre-training model, a frame fusion module is added after the image encoder of the visual language pre-training model; it performs feature fusion on the visual features of continuous video frames to obtain a single fused visual feature, laying the foundation for feature comparison.
The invention provides a visual classification method based on multi-template prompt learning, in which three-stage training consisting of fully-supervised template parameter optimization, semi-supervised model optimization and fully-supervised template parameter fine-tuning is performed on a plurality of preset prompt templates and on a visual language pre-training model into which a frame fusion module is inserted, so as to obtain the plurality of prompt templates and the visual language coding model. This improves the utilization efficiency of training samples when the visual language pre-training model is generalized to a downstream visual understanding task, and accordingly improves accuracy when the plurality of prompt templates and the visual language coding model are applied to the downstream visual understanding task.
Specifically, the continuous video frames of the video to be classified in S11 are obtained by:
s11.1: acquiring compressed videos to be classified;
s11.2: decoding the compressed video to obtain decoded continuous video frames;
s11.3: preprocessing the decoded continuous video frames to obtain continuous video frames of the video to be classified;
wherein the preprocessing operation includes, but is not limited to, normalization.
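As an illustration of S11.1 to S11.3 above, the following is a minimal sketch of decoding and preprocessing; the use of torchvision, the frame count, and the CLIP normalization constants are illustrative assumptions, not requirements of the method.

```python
# A minimal sketch of S11.1-S11.3 (illustrative assumptions: torchvision
# for decoding, 8 uniformly sampled frames, CLIP normalization constants).
import torch
from torchvision.io import read_video
import torchvision.transforms.functional as TF

def load_video_frames(path: str, num_frames: int = 8, size: int = 224) -> torch.Tensor:
    frames, _, _ = read_video(path, pts_unit="sec")           # (T, H, W, C), uint8
    idx = torch.linspace(0, frames.shape[0] - 1, num_frames).long()
    frames = frames[idx].permute(0, 3, 1, 2).float() / 255.0  # (N, C, H, W)
    frames = TF.resize(frames, [size, size], antialias=True)
    return TF.normalize(frames,                               # normalization step
                        mean=[0.48145466, 0.4578275, 0.40821073],
                        std=[0.26862954, 0.26130258, 0.27577711])
```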
Specifically, in step S12, the plurality of prompt templates are obtained through two stages of template parameter optimization on the basis of the plurality of preset prompt templates;
the generating process of the plurality of preset prompt templates comprises the following steps:
SA: generating a plurality of initial prompt templates based on a given prompt template format; the given prompt template format is that a prompt template consists of a plurality of prompt character positions and a category flag bit; the plurality of initial prompt templates differ in the number of prompt character positions and/or the position of the category flag bit;

assuming that a prompt character position is represented by X and the category flag bit is represented by [ CLASS ], an initial prompt template that includes 7 prompt character positions with the category flag bit located at the end may be represented as "X X X X X X X [ CLASS ]".
SB: embedding a word into each prompt character position in each initial prompt template to obtain a plurality of preset prompt templates;
in order to adapt to machine learning, the contents of the prompt character positions and the category flag bit in the initial prompt template are word coding features; generating a preset prompt template means initializing the contents of the prompt character positions of the corresponding initial prompt template (these contents are the template parameters to be learned), i.e., embedding words into the prompt character positions of the corresponding initial prompt template.
wherein embedding a word into a prompt character position in any initial prompt template comprises:

initializing a word to be embedded;

determining the coding serial number of the word to be embedded by using a word list;

substituting the coding serial number into a language embedding model to obtain the coding feature of the word to be embedded;

embedding the coding feature into that prompt character position in that initial prompt template.
An existing language embedding model is used here; its choice determines the dimension of the coding features. In the invention, the lengths of the plurality of preset prompt templates, the initial contents of the prompt character positions, the dimension of the coding features and the position distribution of the category flag bits all affect the performance of the final plurality of prompt templates.
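As a concrete illustration of the word embedding procedure above, the following sketch initializes the learnable contents of one template's prompt character positions from seed words; the use of the open-source CLIP package as the language embedding model, and the seed words themselves, are assumptions made for illustration.

```python
# A minimal sketch of initializing one learnable prompt template
# (assumes the open-source CLIP package as the language embedding model).
import torch
import clip

model, _ = clip.load("ViT-B/32", device="cpu")
token_embedding = model.token_embedding            # word list -> coding features

def init_template(seed_words, class_pos):
    """seed_words: one seed word per prompt character position;
    class_pos: index of the category flag bit [CLASS] in the template."""
    # coding serial number of (the first sub-token of) each seed word
    ids = torch.cat([clip.tokenize(w)[0, 1:2] for w in seed_words])
    ctx = token_embedding(ids).detach().clone()    # (L, D) coding features
    ctx.requires_grad_(True)                       # template parameters to be learned
    return ctx, class_pos
```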
Further, in S12, assuming that the total number of category names in the category name set is C and the total number of prompt templates is M, embedding the C category names into the m-th prompt template generates C candidate texts containing the category names; the set formed by these C candidate texts is referred to as the candidate text set under the m-th prompt template. Traversing all the prompt templates yields the candidate text sets under the M prompt templates.
It should be noted that embedding a category name into a prompt template is equivalent to embedding the category name into the category flag bit of the prompt template, and the embedding method is the same as the word embedding process for the prompt character positions in the generation of the preset prompt templates.
Specifically, the step S13 includes:
s13.1: determining, with an image encoder of the visual language coding model, a fused visual feature of successive video frames of the video;
s13.2: determining a text feature of each candidate text in the set of candidate texts using a text encoder of the visual language coding model;
s13.3: respectively comparing the feature similarity between the fusion visual features and the text features of each candidate text in the candidate text set to obtain a comparison result;
S13.4: recording each comparison result as the probability that the category name associated with the corresponding candidate text in the candidate text set is the category name of the video;

it can be seen that the higher the feature similarity in the comparison, the higher the probability (score); the class probability distribution is therefore actually a score distribution.
S13.5: and obtaining the category probability distribution of the video under the prompt template based on the probability.
It will be appreciated that the image encoder, text encoder and cross-modal feature similarity contrast structures are all structures in the visual language pre-training model, and thus corresponding structures also exist in the visual language encoding model.
The invention performs feature extraction and similarity comparison between the video and the candidate text sets under the M prompt templates to obtain M category probability distributions, each consisting of the probabilities corresponding to the C category names.
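The matching of S13.1 to S13.5 can be sketched as follows; the fused visual feature and candidate text features are assumed to come from the encoders described above, and the logit scale value is an illustrative assumption (CLIP-style models expose a learned scale).

```python
# A minimal sketch of S13: one class probability distribution per prompt
# template, via cosine similarity between the fused visual feature and the
# text features of the candidate texts (logit_scale=100.0 is an assumption).
import torch
import torch.nn.functional as F

def classify_under_templates(video_feat: torch.Tensor,    # (D,) fused visual feature
                             text_feats: torch.Tensor,    # (M, C, D) candidate text features
                             logit_scale: float = 100.0) -> torch.Tensor:
    v = F.normalize(video_feat, dim=-1)
    t = F.normalize(text_feats, dim=-1)
    sims = torch.einsum("d,mcd->mc", v, t)                # feature similarity comparison
    return (logit_scale * sims).softmax(dim=-1)           # (M, C) distributions
```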
Specifically, the step S14 includes:
a category probability distribution is determined from the category probability distributions of the video under the plurality of prompt templates by averaging or voting, and the category name corresponding to the maximum probability value of that distribution is taken as the visual classification result of the video.
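Continuing the sketch above, the aggregation of S14 could look like this; both averaging and voting are shown, and the choice between them is left open by the method.

```python
# A minimal sketch of S14, continuing classify_under_templates above.
probs = classify_under_templates(video_feat, text_feats)  # (M, C)
pred_by_mean = probs.mean(dim=0).argmax().item()          # averaging the M distributions
pred_by_vote = probs.argmax(dim=-1).mode().values.item()  # majority vote over templates
```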
From the above steps, it can be seen that the accuracy of the visual classification method of the present invention depends on the plurality of prompt templates and the visual language coding model, which are obtained by performing three-stage training, namely fully-supervised template parameter optimization, semi-supervised model optimization and fully-supervised template parameter fine-tuning, on a plurality of preset prompt templates and an improved visual language pre-training model by utilizing a semi-labeled visual classification sample set; specifically:
SI: based on the visual language pre-training model, performing full-supervision learning on a plurality of preset prompt templates by using a first sample set contained in the visual classification sample set so as to optimize the preset prompt templates to obtain a plurality of first prompt templates;
SII: based on a plurality of the first prompt templates, performing semi-supervised learning on the improved visual language pre-training model by using a second sample set contained in the visual classification sample set so as to optimize the frame fusion module to obtain the visual language coding model;
SIII: based on the visual language coding model, performing full-supervised learning on the plurality of first prompt templates by using a third sample set contained in the visual classification sample set so as to finely tune the plurality of first prompt templates to obtain a plurality of prompt templates;
the first sample set, the second sample set and the third sample set are all obtained by processing a pre-stored video set annotated over all categories;

the samples in the first sample set are intermediate video frames carrying category labels;

in the second sample set, some samples are continuous video frames carrying category labels and the remaining samples are continuous video frames without category labels;
the samples in the third sample set are continuous video frames carrying category labels.
Here, fig. 2 is a schematic diagram of the fully-supervised template parameter optimization framework, in which the learnable prompt templates are the plurality of preset prompt templates. As shown in fig. 2, SI is specifically:
SI-1: generating a candidate text set under the preset prompt template based on the category name set for each preset prompt template and each sample in the first sample set;
SI-2: inputting the sample and the candidate text set under the preset prompt template into the visual language pre-training model to obtain the class probability distribution of the sample under the preset prompt template;

SI-3: determining the fully-supervised loss of the sample according to the category label of the sample and the category probability distributions of the sample under the plurality of preset prompt templates;

SI-4: optimizing the plurality of preset prompt templates into the plurality of first prompt templates by utilizing the fully-supervised losses of the samples in the first sample set;
wherein the expression of the fully-supervised loss of a sample is:

$$\mathcal{L}_{sup}^{(j)} = -\frac{1}{M}\sum_{m=1}^{M}\sum_{c=1}^{C}\bar{y}_{j}(c)\,\log p_{j}^{(m)}(c)$$

in the above formula, $\mathcal{L}_{sup}^{(j)}$ is the fully-supervised loss of the $j$-th sample in the first training set, $\bar{y}_{j}$ is the 0-1 (one-hot) distribution converted from the true label of the $j$-th sample in the first training set, and $p_{j}^{(m)}$ is the class probability distribution of the $j$-th sample in the first training set under the $m$-th preset prompt template.
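The fully-supervised loss of one sample can then be sketched as follows; the (M, C) probability tensor is assumed to come from the per-template matching step above.

```python
# A minimal sketch of the fully-supervised loss: cross-entropy between the
# one-hot true label and the per-template distributions, averaged over M.
import torch

def supervised_loss(probs: torch.Tensor, label: int) -> torch.Tensor:
    """probs: (M, C) class probability distributions of one labeled sample."""
    log_p = probs.clamp_min(1e-8).log()
    return -log_p[:, label].mean()       # the one-hot label picks column `label`
```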
FIG. 3 is a schematic diagram of the application of the frame fusion module in a visual language coding model. The frame fusion module is composed of a frame re-encoding module and a self-attention pooling module. The frame re-encoding module takes the image features of N video frames as input, adds N learnable position embeddings to them, and feeds the result into a Transformer network to obtain N video frame features. The Transformer network comprises a multi-head self-attention layer and a fully-connected feed-forward network, each of which is wrapped with a residual connection and followed by normalization. The self-attention pooling module performs global average pooling on the N video frame features produced by the frame re-encoding module to obtain an average feature, concatenates the average feature with the N video frame features to obtain N+1 input features, feeds these into a multi-head self-attention layer and a fully-connected feed-forward network to obtain N+1 features, and outputs the first of them as the final video coding feature; a sketch is given below. FIG. 4 is a schematic diagram of the semi-supervised model optimization framework; the prompt templates referred to in FIGS. 3 and 4 are specifically the first prompt templates, and the text set in FIG. 4 is specifically the category name set.
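A minimal sketch of the frame fusion module as described above; the layer count, feed-forward width and head count are illustrative assumptions where the description does not fix them.

```python
# A minimal sketch of the frame fusion module: frame re-encoding (N learnable
# position embeddings + one Transformer encoder layer) followed by
# self-attention pooling over [average feature; N frame features].
import torch
import torch.nn as nn

class FrameFusion(nn.Module):
    def __init__(self, dim: int = 512, num_frames: int = 8, heads: int = 8):
        super().__init__()
        self.pos = nn.Parameter(torch.zeros(num_frames, dim))   # N learnable positions
        self.recode = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.pool = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        x = self.recode(frame_feats + self.pos)   # (B, N, D) re-encoded frame features
        mean = x.mean(dim=1, keepdim=True)        # global average pooling -> (B, 1, D)
        x = torch.cat([mean, x], dim=1)           # N + 1 input features
        return self.pool(x)[:, 0]                 # first feature = video coding feature
```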
As shown in figs. 3 and 4, SII is specifically:

SII-1: for each first prompt template and each sample in the second sample set, generating a candidate text set under the first prompt template based on the category name set;

SII-2: inputting the sample and the candidate text set under the first prompt template into the visual language coding model to obtain the class probability distribution of the sample under the first prompt template;

SII-3: when the sample does not carry a category label, performing uncertainty estimation on the class probability distribution of the sample under the first prompt template to obtain a pseudo-classification label and a weight of the sample under the first prompt template;

SII-4: determining the unsupervised loss of the sample according to the pseudo-classification labels and weights of the sample under the plurality of first prompt templates;

SII-5: when the sample carries a category label, determining the fully-supervised loss of the sample according to the category label of the sample and the category probability distributions of the sample under the plurality of first prompt templates;

SII-6: optimizing the frame fusion module to obtain the visual language coding model according to the unsupervised losses of the samples without category labels in the second sample set and the fully-supervised losses of the samples with category labels.
Further, SII-3 is divided into two stages: determining the pseudo-classification label of the sample under the first prompt template, and determining the weight of the sample under the first prompt template;

wherein determining the pseudo-classification label of the sample under the first prompt template comprises:
SII-3-a: searching a probability value meeting a first condition from the class probability distribution of the sample under the first prompt template;
SII-3-B: if such a probability value exists, using the category name pointed to by the probability value as the pseudo-classification label of the sample under the first prompt template; otherwise, determining that no pseudo-classification label exists for the sample under the first prompt template;

wherein the first condition is: the probability value is the maximum value in the class probability distribution and is greater than or equal to a confidence threshold.

If no probability value satisfying the first condition is found, no pseudo-classification label is generated for the sample under this template, and the sample does not participate in the calculation of the unsupervised loss under this template.
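A minimal sketch of SII-3-A/B; the confidence threshold value is an illustrative assumption.

```python
# A minimal sketch of pseudo-label selection under one template
# (the threshold tau=0.7 is an illustrative assumption).
import torch
from typing import Optional

def pseudo_label(probs: torch.Tensor, tau: float = 0.7) -> Optional[int]:
    """probs: (C,) class probability distribution under one first prompt template."""
    p, c = probs.max(dim=-1)                 # maximum value and its position
    return c.item() if p.item() >= tau else None
```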
Determining the weight of the sample under the first prompt template comprises:
SII-3-I: traversing the class probability distributions of the sample under the first prompt templates other than this first prompt template to find a first probability value meeting the first condition;
SII-3-II: if the first probability value exists and the class name indicated by the first probability value is inconsistent with the pseudo classification label, the uncertainty measurement of the sample under the first prompt template is positive infinity;
SII-3-III: if the first probability value exists and the class name indicated by the first probability value is consistent with the pseudo classification label, or the first probability value does not exist, the uncertainty measure of the sample under the first prompt template is the standard deviation of class probability distribution of the sample under a plurality of first prompt templates;
SII-3-IV: calculating the weight of the sample under the first prompt template by using the uncertainty measure of the sample under the first prompt template;
wherein the weight of the sample under the first hint template is inversely related to the uncertainty measure of the sample under the first hint template.
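A minimal sketch of SII-3-I to SII-3-IV follows. Two points are assumptions where the text leaves room: the standard deviation is taken over the pseudo class's probability across the M templates, and the inverse relation between weight and uncertainty is realized as exp(-sigma).

```python
# A minimal sketch of the uncertainty measure and weight for one pseudo label
# (std-over-templates reading and the exp(-sigma) weighting are assumptions).
import math
import torch

def uncertainty_and_weight(all_probs: torch.Tensor, m: int, label: int,
                           tau: float = 0.7):
    """all_probs: (M, C) distributions of one unlabeled sample; m: index of the
    template that produced the pseudo label; label: its pseudo class index."""
    for j in range(all_probs.shape[0]):
        if j == m:
            continue
        p, c = all_probs[j].max(dim=-1)
        if p.item() >= tau and c.item() != label:
            return float("inf"), 0.0         # conflicting confident pseudo label
    sigma = all_probs[:, label].std().item() # spread of the pseudo class across templates
    return sigma, math.exp(-sigma)           # weight inversely related to uncertainty
```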
In SII-4, the expression of the unsupervised loss of the $i$-th sample without category label in the second sample set is:

$$\mathcal{L}_{us}^{(i)} = -\frac{1}{M}\sum_{m=1}^{M}\sum_{k=1}^{K} w_{ik}\sum_{c=1}^{C}\bar{p}_{ik}(c)\,\log p_{i}^{(m)}(c)$$

in the above formula, $M$ is the number of prompt templates, $C$ is the number of category names in the category name set, $K$ is the number of pseudo-classification labels of the $i$-th sample without category label in the second sample set, $\sigma_{ik}$ is the uncertainty measure corresponding to the $k$-th pseudo-classification label of that sample, $w_{ik}$ is the weight corresponding to the $k$-th pseudo-classification label of that sample, $\bar{p}_{ik}$ is the class probability distribution converted from the $k$-th pseudo-classification label of that sample, and $p_{i}^{(m)}$ is the class probability distribution of that sample under the $m$-th prompt template;

wherein $\bar{p}_{ik}$ is determined by the following formula:

$$\bar{p}_{ik} = \mathrm{onehot}\left(a_{ik}\right)$$

in the above formula, $a_{ik}$ denotes the position, in the class probability distribution to which it belongs, of the probability value corresponding to the $k$-th pseudo-classification label of the $i$-th sample without category label in the second sample set.
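As a sketch, the per-sample unsupervised loss above can be computed as follows, reusing the pseudo labels and weights from the previous sketches:

```python
# A minimal sketch of the per-sample unsupervised loss: weighted cross-entropy
# between each pseudo label's one-hot distribution and the M predictions.
import torch

def unsupervised_loss(all_probs: torch.Tensor, pseudo_labels, weights) -> torch.Tensor:
    """all_probs: (M, C) distributions of one unlabeled sample; pseudo_labels:
    K pseudo class indices; weights: their K uncertainty-derived weights."""
    log_p = all_probs.clamp_min(1e-8).log()           # (M, C)
    return sum(w * (-log_p[:, c].mean())              # mean realizes the 1/M factor
               for c, w in zip(pseudo_labels, weights))
```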
In SII-5, the fully-supervised loss in the model optimization stage is calculated in the same way as in the template parameter optimization stage, and the description is not repeated.
Through SII, the plurality of first prompt templates are utilized to automatically generate pseudo labels for unlabeled samples, which alleviates the problem of limited labeled samples in downstream tasks; sample uncertainty is measured by the differences among the encodings under the plurality of first prompt templates, which alleviates the problem of pseudo-label noise; and semi-supervised model optimization is performed on the improved visual language pre-training model on the basis of the plurality of first prompt templates, which improves the efficiency of generalizing the improved visual language pre-training model to downstream tasks.
SIII, specifically:
SIII-1: for each first prompt template and each sample in the third sample set, generating a candidate text set under the first prompt template based on the category name set;

SIII-2: inputting the sample and the candidate text set under the first prompt template into the visual language coding model to obtain the class probability distribution of the sample under the first prompt template;

SIII-3: determining the fully-supervised loss of the sample according to the category label of the sample and the category probability distributions of the sample under the plurality of first prompt templates;

SIII-4: optimizing the plurality of first prompt templates into the plurality of prompt templates by using the fully-supervised losses of the samples in the third sample set.
Likewise, the fully-supervised loss in the template parameter fine-tuning stage is calculated in the same way as in the template parameter optimization stage, and is not described again.
In SIII, the plurality of first prompt templates obtained in SI are fine-tuned on the basis of the visual language coding model obtained in SII; the fine-tuning process is similar to SI, with a reduced learning rate.
The method is suitable for video understanding tasks including behavior recognition, and can obtain higher recognition accuracy under the condition of using the same pre-training visual language coding model and the same training sample.
For example, the invention can be applied to video retrieval tasks; compared with video classification, the category name set is replaced by a set of retrieval words.

In addition, the invention can also perform image understanding tasks, including image classification and image retrieval, by using the visual language pre-training model together with prompt templates that have undergone only the fully-supervised template parameter optimization.
In order to verify the effectiveness of the invention, a CLIP pre-training model is adopted to generalize to behavior recognition tasks for illustration.
The CLIP model refers to the contrastive language-image pre-training model (Contrastive Language-Image Pre-Training, CLIP) proposed by Alec Radford et al. in the 2021 paper "Learning Transferable Visual Models From Natural Language Supervision". The model contains one image encoder and one text encoder, was pre-trained on a data set of 400M image-text pairs, and the two resulting encoders are applicable to a variety of image understanding tasks.
The data set HMDB51 adopted by the behavior recognition task comprises 5100 video clips, 3570 training samples and 1530 test samples. These video clips are divided into 51 behavior categories. The present embodiment randomly extracts 60% (i.e., 2142) "video-category pairs" from the training samples as a labeled set, and the remaining 40% (i.e., 1428) training samples use only video, constituting a label-free set.
1. Fully-supervised template parameter optimization.
The parameters of M=3 prompt templates were optimized using the 2142 labeled samples, with the intermediate frame image of each video as the visual input.
2. Semi-supervised model optimization.
The prompt template parameters were fixed, and the parameters of the frame fusion module in the visual language coding model were adjusted in a semi-supervised manner using the 2142 labeled samples and the 1428 unlabeled samples.
3. Fully-supervised template parameter fine-tuning.
The visual language coding model parameters were fixed, the learning rate was reduced to 1/5 of that of the previous step, and the 2142 labeled samples were used to fine-tune the parameters of the prompt templates.
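Tying the three stages together, a sketch of which parameters are trainable in each stage might look like the following; the optimizer and base learning rate are illustrative assumptions, and init_template and FrameFusion refer to the earlier sketches.

```python
# A minimal sketch of the three-stage schedule (optimizer and base learning
# rate are assumptions; init_template and FrameFusion are the earlier sketches).
import torch

ctx, _ = init_template(["a", "photo", "of", "a"], class_pos=4)
fusion = FrameFusion(dim=512, num_frames=8)

# Stage 1: fully-supervised template parameter optimization
opt1 = torch.optim.SGD([ctx], lr=1e-3)
# Stage 2: semi-supervised model optimization (template parameters fixed)
ctx.requires_grad_(False)
opt2 = torch.optim.SGD(fusion.parameters(), lr=1e-3)
# Stage 3: fully-supervised template fine-tuning (model fixed, lr reduced to 1/5)
for p in fusion.parameters():
    p.requires_grad_(False)
ctx.requires_grad_(True)
opt3 = torch.optim.SGD([ctx], lr=1e-3 / 5)
```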
4. Testing.
Testing was performed with the 1530 test samples: the prediction scores were summed and normalized to obtain the predicted category, which was compared with the category name provided by the data set to calculate the test accuracy. Table 1 compares the accuracy of the existing method and the method of the present invention. As shown in Table 1, the invention realizes generalization of the visual language pre-training model to the downstream task, improving accuracy on the test set by 8.4% compared with the existing method.
TABLE 1: Accuracy comparison on the HMDB51 test set between the existing method and the method of the present invention (the present method improves accuracy by 8.4%).
In a second aspect, the visual classification device based on multi-template prompt learning provided by the invention is described, and the visual classification device based on multi-template prompt learning described below and the visual classification method based on multi-template prompt learning described above can be referred to correspondingly. Fig. 5 illustrates a schematic structural diagram of a visual classification device based on multi-template prompt learning, as shown in fig. 5, the device includes:
A visual input module 21 for acquiring video to be classified;
the candidate text generation module 22 is used for generating, for each of a plurality of prompt templates, a candidate text set under the prompt template based on the category name set of the visual classification task; a category name is embedded into a prompt template to generate the candidate text associated with that category name under that prompt template;
the visual language coding module 23 is configured to input the continuous video frames of the video and the candidate text set into a visual language coding model, so as to obtain a category probability distribution of the video under the prompt template;
a visual classification result output module 24, configured to determine a visual classification result of the video by using a category probability distribution of the video under a plurality of prompt templates;
the visual language coding model is obtained by performing three-stage training, namely fully-supervised template parameter optimization, semi-supervised model optimization and fully-supervised template parameter fine-tuning, on a plurality of preset prompt templates and an improved visual language pre-training model by utilizing a semi-labeled visual classification sample set;
the improved visual language pre-training model is obtained by accessing a frame fusion module behind an image encoder in the visual language pre-training model; the frame fusion module is used for carrying out feature fusion on visual features of the input continuous video frames.
The invention provides a visual classification device based on multi-template prompt learning, in which three-stage training consisting of fully-supervised template parameter optimization, semi-supervised model optimization and fully-supervised template parameter fine-tuning is performed on a plurality of preset prompt templates and on a visual language pre-training model into which a frame fusion module is inserted, so as to obtain the plurality of prompt templates and the visual language coding model. This improves the utilization efficiency of training samples when the visual language pre-training model is generalized to a downstream visual understanding task, and accordingly improves accuracy when the plurality of prompt templates and the visual language coding model are applied to the downstream visual understanding task.
On the basis of the foregoing embodiments, as an optional embodiment, the generating process of the plurality of preset alert templates includes:
generating a plurality of initial prompt templates based on a given prompt template format; the given prompt template format is that a prompt template consists of a plurality of prompt character positions and a category flag bit; the plurality of initial prompt templates differ in the number of prompt character positions and/or the position of the category flag bit;

embedding a word into each prompt character position in each initial prompt template to obtain the plurality of preset prompt templates;

wherein embedding a word into any prompt character position in any initial prompt template comprises:

initializing a word to be embedded;

determining the coding serial number of the word to be embedded by using a word list;

substituting the coding serial number into a language embedding model to obtain the coding feature of the word to be embedded;

embedding the coding feature into that prompt character position in that initial prompt template.
On the basis of the above embodiments, as an alternative embodiment, embedding a category name into a prompt template is equivalent to embedding a category name into a category flag of the prompt template;
the visual language coding module comprises:
a visual feature extraction unit, configured to determine the fused visual feature of the continuous video frames of the video by using the image encoder and frame fusion module of the visual language coding model;
a text feature extraction unit for determining a text feature of each candidate text in the set of candidate texts using a text encoder of the visual language coding model;
a cross-modal feature comparison unit, configured to compare the feature similarity between the fused visual feature and the text feature of each candidate text in the candidate text set, to obtain comparison results;
a probability determination unit, configured to take the comparison result as the probability that the category name associated with each candidate text in the candidate text set is the category name of the video;
and a category probability distribution determination unit, configured to obtain the category probability distribution of the video under the prompt template based on these probabilities.
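A minimal sketch of this cross-modal comparison follows, assuming a fused clip-level feature and candidate-text features shaped as produced by the encoders above; the fixed temperature stands in for CLIP's learned logit scale and is an illustrative value.

```python
import torch
import torch.nn.functional as F

def class_probabilities(fused_feat: torch.Tensor,
                        text_feats: torch.Tensor,
                        temperature: float = 0.01) -> torch.Tensor:
    """fused_feat: (batch, dim) fused visual features; text_feats: (C, dim)
    features of the C candidate texts under one prompt template."""
    v = F.normalize(fused_feat, dim=-1)
    t = F.normalize(text_feats, dim=-1)
    logits = v @ t.T / temperature      # cosine-similarity comparison results
    return logits.softmax(dim=-1)       # (batch, C) class probability distribution
```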
Based on the foregoing embodiments, as an optional embodiment, obtaining the plurality of prompt templates and the visual language coding model by performing the three-stage training (fully supervised template parameter optimization, semi-supervised model optimization, and fully supervised template parameter fine-tuning) on the plurality of preset prompt templates and the improved visual language pre-training model by using the partially labeled visual classification sample set includes:
based on the visual language pre-training model, performing fully supervised learning on the plurality of preset prompt templates by using a first sample set contained in the visual classification sample set, so as to optimize the preset prompt templates to obtain a plurality of first prompt templates;
based on the plurality of first prompt templates, performing semi-supervised learning on the improved visual language pre-training model by using a second sample set contained in the visual classification sample set, so as to optimize the frame fusion module to obtain the visual language coding model;
based on the visual language coding model, performing fully supervised learning on the plurality of first prompt templates by using a third sample set contained in the visual classification sample set, so as to fine-tune the plurality of first prompt templates to obtain the plurality of prompt templates;
the first sample set, the second sample set and the third sample set are all obtained by processing a pre-stored video set annotated over all categories;
the samples in the first sample set are video intermediate frames carrying category labels;
in the second sample set, some samples are continuous video frames carrying category labels and the others are continuous video frames without category labels;
the samples in the third sample set are continuous video frames carrying category labels.
On the basis of the foregoing embodiments, as an optional embodiment, the performing, based on the visual language pre-training model, fully supervised learning on the plurality of preset prompt templates by using the first sample set contained in the visual classification sample set, so as to optimize the plurality of preset prompt templates to obtain the plurality of first prompt templates, includes:
for each preset prompt template and each sample in the first sample set, generating a candidate text set under the preset prompt template based on the category name set;
inputting the sample and the candidate text set under the preset prompt template into the visual language pre-training model to obtain the class probability distribution of the sample under the preset prompt template;
determining the fully supervised loss of the sample according to the category label of the sample and the class probability distributions of the sample under the plurality of preset prompt templates;
optimizing the plurality of preset prompt templates by using the fully supervised losses of the samples in the first sample set, to obtain the plurality of first prompt templates;
the performing, based on the visual language coding model, fully supervised learning on the plurality of first prompt templates by using the third sample set, so as to fine-tune the plurality of first prompt templates to obtain the plurality of prompt templates, includes:
for each first prompt template and each sample in the third sample set, generating a candidate text set under the first prompt template based on the category name set;
inputting the sample and the candidate text set under the first prompt template into the visual language coding model to obtain the class probability distribution of the sample under the first prompt template;
determining the fully supervised loss of the sample according to the category label of the sample and the class probability distributions of the sample under the plurality of first prompt templates;
and optimizing the plurality of first prompt templates by using the fully supervised losses of the samples in the third sample set, to obtain the plurality of prompt templates.
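The fully supervised objective used in both stage 1 and stage 3 can be sketched as below, with the per-template distributions produced as in the encoding module above; averaging the per-template cross-entropies, rather than summing them, is an assumption.

```python
import torch
import torch.nn.functional as F

def fully_supervised_loss(probs_per_template: list,
                          labels: torch.Tensor) -> torch.Tensor:
    """probs_per_template: M tensors of shape (batch, C), the batch's class
    distributions under each prompt template; labels: (batch,) class ids."""
    losses = [F.nll_loss(p.clamp_min(1e-8).log(), labels)
              for p in probs_per_template]
    return torch.stack(losses).mean()

# In stages 1 and 3 only the template context vectors receive gradients, e.g.:
# optimizer = torch.optim.SGD([t["context"] for t in templates], lr=1e-3)
```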
Based on the foregoing embodiments, as an optional embodiment, the performing, based on the plurality of first prompt templates, semi-supervised learning on the improved visual language pre-training model by using the second sample set contained in the visual classification sample set, so as to optimize the frame fusion module to obtain the visual language coding model, includes:
for each first prompt template and each sample in the second sample set, generating a candidate text set under the first prompt template based on the category name set;
inputting the sample and the candidate text set under the first prompt template into the visual language coding model to obtain the class probability distribution of the sample under the first prompt template;
when the sample does not carry a category label, performing uncertainty estimation on the class probability distribution of the sample under the first prompt template to obtain a pseudo-classification label and a weight of the sample under the first prompt template;
determining the unsupervised loss of the sample according to the pseudo-classification labels and weights of the sample under the plurality of first prompt templates;
when the sample carries a category label, determining the fully supervised loss of the sample according to the category label of the sample and the class probability distributions of the sample under the plurality of first prompt templates;
and optimizing the frame fusion module according to the unsupervised losses of the samples without category labels in the second sample set and the fully supervised losses of the samples with category labels, to obtain the visual language coding model.
Based on the foregoing embodiments, as an optional embodiment, the performing uncertainty estimation on the class probability distribution of the sample under the first prompt template to obtain the pseudo-classification label of the sample under the first prompt template includes:
searching for a probability value meeting a first condition in the class probability distribution of the sample under the first prompt template;
if such a probability value exists, using the category name pointed to by the probability value as the pseudo-classification label of the sample under the first prompt template; otherwise, determining that the sample has no pseudo-classification label under the first prompt template;
wherein the first condition is: the probability value is the maximum value in the class probability distribution and is greater than or equal to a confidence threshold.
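This first-condition check can be sketched as follows; the 0.7 confidence threshold is an illustrative value, not one given by the patent.

```python
import torch

def pseudo_label(probs: torch.Tensor, threshold: float = 0.7):
    """probs: (C,) class probability distribution of one sample under one
    first prompt template. Returns the pseudo-classification label index,
    or None when no probability value meets the first condition."""
    conf, idx = probs.max(dim=-1)
    return int(idx) if conf >= threshold else None
```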
On the basis of the foregoing embodiments, as an optional embodiment, the performing uncertainty estimation on the class probability distribution of the sample under the first prompt template to obtain the weight of the sample under the first prompt template includes:
traversing the class probability distributions of the sample under the first prompt templates other than this first prompt template, to find a first probability value meeting the first condition;
if the first probability value exists and the category name it points to is inconsistent with the pseudo-classification label, the uncertainty measure of the sample under the first prompt template is positive infinity;
if the first probability value exists and the category name it points to is consistent with the pseudo-classification label, or if no such first probability value exists, the uncertainty measure of the sample under the first prompt template is the standard deviation of the class probability distributions of the sample under the plurality of first prompt templates;
calculating the weight of the sample under the first prompt template by using the uncertainty measure of the sample under the first prompt template;
wherein the weight of the sample under the first prompt template is inversely related to the uncertainty measure of the sample under the first prompt template.
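A minimal sketch of this uncertainty estimate for one sample under the m-th template follows. Two readings are assumptions: the standard deviation is taken over the probabilities that the templates assign to the pseudo-label class (the text says only "standard deviation of the class probability distributions"), and exp(-u) is one convenient weight mapping that is inversely related to the uncertainty measure, as required.

```python
import math
import torch

def uncertainty_and_weight(probs_all: torch.Tensor, m: int,
                           label: int, threshold: float = 0.7):
    """probs_all: (M, C) distributions of one unlabeled sample under all M
    first prompt templates; `label` is its pseudo label under template m."""
    for j in range(probs_all.size(0)):
        if j == m:
            continue
        conf, idx = probs_all[j].max(dim=-1)
        if conf >= threshold and int(idx) != label:
            return math.inf, 0.0          # a confident, conflicting template
    u = probs_all[:, label].std().item()  # spread of the pseudo-class probability
    return u, math.exp(-u)                # weight inversely related to uncertainty
```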
On the basis of the above embodiments, as an alternative embodiment, the expression of the unsupervised loss is as follows:

$$\mathcal{L}_{u}=\frac{1}{MK}\sum_{k=1}^{K}\sum_{m=1}^{M}\omega_{i}^{k}\,\mathrm{CE}\!\left(\hat{q}_{i}^{k},\;p_{i}^{m}\right)$$

In the above formula, $M$ is the number of prompt templates, $C$ is the number of category names in the category name set, $K$ is the number of pseudo-classification labels of the $i$-th sample without a category label in the second sample set, $u_{i}^{k}$ is the uncertainty measure corresponding to the $k$-th pseudo-classification label of that sample, $\omega_{i}^{k}$ is the weight corresponding to that pseudo-classification label, $\hat{q}_{i}^{k}$ is the class probability distribution converted from that pseudo-classification label, and $p_{i}^{m}$ is the class probability distribution of the sample under the $m$-th prompt template;

wherein $\hat{q}_{i}^{k}$ is determined by the following formula:

$$\hat{q}_{i}^{k}=\mathrm{onehot}\!\left(c_{i}^{k},\,C\right)$$

In the above formula, $c_{i}^{k}$ denotes the position, within the class probability distribution to which it belongs, of the probability value corresponding to the $k$-th pseudo-classification label of the $i$-th sample without a category label in the second sample set.
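Under the reconstruction above, the per-sample unsupervised loss can be sketched as follows; the cross-entropy against the one-hot pseudo-label distribution and the 1/(MK) normalization follow that reconstruction and are therefore assumptions rather than text fixed by the patent.

```python
import torch
import torch.nn.functional as F

def unsupervised_loss(probs_all: torch.Tensor,
                      pseudo: list,
                      num_classes: int) -> torch.Tensor:
    """probs_all: (M, C) distributions of one unlabeled sample under all M
    templates; pseudo: list of (class index, weight) pairs, one per
    surviving pseudo-classification label."""
    K = max(len(pseudo), 1)
    loss = probs_all.new_zeros(())
    for cls, w in pseudo:
        q = F.one_hot(torch.tensor(cls), num_classes).float()    # one-hot target
        ce = -(q * probs_all.clamp_min(1e-8).log()).sum(dim=-1)  # (M,) CE terms
        loss = loss + w * ce.mean()                              # average over M
    return loss / K                                              # average over K
```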
In a third aspect, fig. 6 illustrates a physical schematic diagram of an electronic device. As shown in fig. 6, the electronic device may include: a processor 610, a communication interface (Communications Interface) 620, a memory 630 and a communication bus 640, wherein the processor 610, the communication interface 620 and the memory 630 communicate with each other via the communication bus 640. The processor 610 may invoke logic instructions in the memory 630 to perform the visual classification method based on multi-template prompt learning, the method comprising: acquiring a video to be classified; for each of a plurality of prompt templates, generating a candidate text set under the prompt template based on a category name set of a visual classification task, wherein embedding a category name into a prompt template generates a candidate text associated with that category name under that prompt template; inputting the continuous video frames of the video and the candidate text set into a visual language coding model to obtain the category probability distribution of the video under the prompt template; and determining the visual classification result of the video by using the category probability distributions of the video under the plurality of prompt templates; wherein the visual language coding model is obtained by performing three-stage training, namely fully supervised template parameter optimization, semi-supervised model optimization, and fully supervised template parameter fine-tuning, on a plurality of preset prompt templates and an improved visual language pre-training model by using a partially labeled visual classification sample set; the improved visual language pre-training model is obtained by connecting a frame fusion module after the image encoder of the visual language pre-training model; and the frame fusion module is configured to perform feature fusion on the visual features of the input continuous video frames.
Further, the logic instructions in the memory 630 may be implemented in the form of software functional units and, when sold or used as a stand-alone product, stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, in essence or in the part contributing to the prior art, may be embodied in the form of a software product stored in a storage medium, the software product including several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the methods according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
In a fourth aspect, the present invention also provides a computer program product, the computer program product comprising a computer program storable on a non-transitory computer-readable storage medium, wherein the computer program, when executed by a processor, can perform the visual classification method based on multi-template prompt learning provided above, the method comprising: acquiring a video to be classified; for each of a plurality of prompt templates, generating a candidate text set under the prompt template based on a category name set of a visual classification task, wherein embedding a category name into a prompt template generates a candidate text associated with that category name under that prompt template; inputting the continuous video frames of the video and the candidate text set into a visual language coding model to obtain the category probability distribution of the video under the prompt template; and determining the visual classification result of the video by using the category probability distributions of the video under the plurality of prompt templates; wherein the visual language coding model is obtained by performing three-stage training, namely fully supervised template parameter optimization, semi-supervised model optimization, and fully supervised template parameter fine-tuning, on a plurality of preset prompt templates and an improved visual language pre-training model by using a partially labeled visual classification sample set; the improved visual language pre-training model is obtained by connecting a frame fusion module after the image encoder of the visual language pre-training model; and the frame fusion module is configured to perform feature fusion on the visual features of the input continuous video frames.
In a fifth aspect, the present invention also provides a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the visual classification method based on multi-template prompt learning provided above, the method comprising: acquiring a video to be classified; for each of a plurality of prompt templates, generating a candidate text set under the prompt template based on a category name set of a visual classification task, wherein embedding a category name into a prompt template generates a candidate text associated with that category name under that prompt template; inputting the continuous video frames of the video and the candidate text set into a visual language coding model to obtain the category probability distribution of the video under the prompt template; and determining the visual classification result of the video by using the category probability distributions of the video under the plurality of prompt templates; wherein the visual language coding model is obtained by performing three-stage training, namely fully supervised template parameter optimization, semi-supervised model optimization, and fully supervised template parameter fine-tuning, on a plurality of preset prompt templates and an improved visual language pre-training model by using a partially labeled visual classification sample set; the improved visual language pre-training model is obtained by connecting a frame fusion module after the image encoder of the visual language pre-training model; and the frame fusion module is configured to perform feature fusion on the visual features of the input continuous video frames.
The apparatus embodiments described above are merely illustrative, wherein the units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement the present invention without creative effort.
From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus a necessary general-purpose hardware platform, or of course by means of hardware. Based on this understanding, the foregoing technical solution, in essence or in the part contributing to the prior art, may be embodied in the form of a software product, which may be stored in a computer-readable storage medium, such as ROM/RAM, a magnetic disk, an optical disk, etc., including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the respective embodiments or in some parts of the embodiments.
Finally, it should be noted that the above embodiments are only for illustrating the technical solutions of the present invention and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments can still be modified, or some of their technical features can be replaced by equivalents, and such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (10)

1. A visual classification method based on multi-template prompt learning, the method comprising:
acquiring videos to be classified;
for each of a plurality of prompt templates, generating a candidate text set under the prompt template based on a category name set of a visual classification task; embedding a category name into a prompt template to generate a candidate text associated with the corresponding category name under the corresponding prompt template;
inputting the continuous video frames of the video and the candidate text set into a visual language coding model to obtain category probability distribution of the video under the prompt template;
Determining a visual classification result of the video by using the category probability distribution of the video under a plurality of prompt templates;
the visual language coding model is obtained by performing three-stage training, namely fully supervised template parameter optimization, semi-supervised model optimization, and fully supervised template parameter fine-tuning, on a plurality of preset prompt templates and an improved visual language pre-training model by using a partially labeled visual classification sample set;
the improved visual language pre-training model is obtained by connecting a frame fusion module after the image encoder of the visual language pre-training model; and the frame fusion module is configured to perform feature fusion on the visual features of the input continuous video frames.
2. The visual classification method based on multi-template prompt learning according to claim 1, wherein the generation process of the plurality of preset prompt templates includes:
generating a plurality of initial prompt templates based on a given prompt template format; in the given prompt template format, a prompt template consists of a plurality of prompt character bits and one category flag bit; the initial prompt templates differ in the number of prompt character bits and/or the position of the category flag bit;
embedding a word into each prompt character bit in each initial prompt template to obtain the plurality of preset prompt templates;
wherein embedding a word into any prompt character bit in any initial prompt template includes:
initializing a word to be embedded;
determining the encoding index of the word to be embedded by using a vocabulary;
substituting the encoding index into a language embedding model to obtain the encoding feature of the word to be embedded;
embedding the encoding feature into the prompt character bit in the initial prompt template.
3. The visual classification method based on multi-template prompt learning according to claim 2, wherein embedding a category name into a prompt template is equivalent to embedding the category name into the category flag bit of the prompt template;
the inputting the continuous video frames of the video and the candidate text set into a visual language coding model to obtain the category probability distribution of the video under the prompt template comprises:
determining, with an image encoder of the visual language coding model, a fused visual feature of successive video frames of the video;
determining a text feature of each candidate text in the set of candidate texts using a text encoder of the visual language coding model;
respectively comparing the feature similarity between the fused visual feature and the text feature of each candidate text in the candidate text set, to obtain comparison results;
taking the comparison result as the probability that the category name associated with each candidate text in the candidate text set is the category name of the video;
and obtaining the category probability distribution of the video under the prompt template based on these probabilities.
4. The visual classification method based on multi-template prompt learning according to any one of claims 1-3, wherein the plurality of prompt templates and the visual language coding model are obtained by performing the three-stage training, namely fully supervised template parameter optimization, semi-supervised model optimization, and fully supervised template parameter fine-tuning, on the plurality of preset prompt templates and the improved visual language pre-training model by using the partially labeled visual classification sample set, including:
based on the visual language pre-training model, performing fully supervised learning on the plurality of preset prompt templates by using a first sample set contained in the visual classification sample set, so as to optimize the preset prompt templates to obtain a plurality of first prompt templates;
based on the plurality of first prompt templates, performing semi-supervised learning on the improved visual language pre-training model by using a second sample set contained in the visual classification sample set, so as to optimize the frame fusion module to obtain the visual language coding model;
based on the visual language coding model, performing fully supervised learning on the plurality of first prompt templates by using a third sample set contained in the visual classification sample set, so as to fine-tune the plurality of first prompt templates to obtain the plurality of prompt templates;
the first sample set, the second sample set and the third sample set are all obtained by processing a pre-stored video set annotated over all categories;
the samples in the first sample set are video intermediate frames carrying category labels;
in the second sample set, some samples are continuous video frames carrying category labels and the others are continuous video frames without category labels;
the samples in the third sample set are continuous video frames carrying category labels.
5. The visual classification method based on multi-template prompt learning according to claim 4, wherein the performing, based on the visual language pre-training model, fully supervised learning on the plurality of preset prompt templates by using the first sample set contained in the visual classification sample set, so as to optimize the plurality of preset prompt templates to obtain the plurality of first prompt templates, includes:
for each preset prompt template and each sample in the first sample set, generating a candidate text set under the preset prompt template based on the category name set;
inputting the sample and the candidate text set under the preset prompt template into the visual language pre-training model to obtain the class probability distribution of the sample under the preset prompt template;
determining the fully supervised loss of the sample according to the category label of the sample and the class probability distributions of the sample under the plurality of preset prompt templates;
optimizing the plurality of preset prompt templates by using the fully supervised losses of the samples in the first sample set, to obtain the plurality of first prompt templates;
the performing, based on the visual language coding model, fully supervised learning on the plurality of first prompt templates by using the third sample set, so as to fine-tune the plurality of first prompt templates to obtain the plurality of prompt templates, includes:
for each first prompt template and each sample in the third sample set, generating a candidate text set under the first prompt template based on the category name set;
inputting the sample and the candidate text set under the first prompt template into the visual language coding model to obtain the class probability distribution of the sample under the first prompt template;
determining the fully supervised loss of the sample according to the category label of the sample and the class probability distributions of the sample under the plurality of first prompt templates;
and optimizing the plurality of first prompt templates by using the fully supervised losses of the samples in the third sample set, to obtain the plurality of prompt templates.
6. The visual classification method based on multi-template prompt learning according to claim 4, wherein the performing, based on the plurality of first prompt templates, semi-supervised learning on the improved visual language pre-training model by using the second sample set contained in the visual classification sample set, so as to optimize the frame fusion module to obtain the visual language coding model, comprises:
for each first prompt template and each sample in the second sample set, generating a candidate text set under the first prompt template based on the category name set;
inputting the sample and the candidate text set under the first prompt template into the visual language coding model to obtain the class probability distribution of the sample under the first prompt template;
when the sample does not carry a category label, performing uncertainty estimation on the class probability distribution of the sample under the first prompt template to obtain a pseudo-classification label and a weight of the sample under the first prompt template;
determining the unsupervised loss of the sample according to the pseudo-classification labels and weights of the sample under the plurality of first prompt templates;
when the sample carries a category label, determining the fully supervised loss of the sample according to the category label of the sample and the class probability distributions of the sample under the plurality of first prompt templates;
and optimizing the frame fusion module according to the unsupervised losses of the samples without category labels in the second sample set and the fully supervised losses of the samples with category labels, to obtain the visual language coding model.
7. The visual classification method based on multi-template prompt learning according to claim 6, wherein the performing uncertainty estimation on the class probability distribution of the sample under the first prompt template to obtain the pseudo-classification label of the sample under the first prompt template comprises:
searching for a probability value meeting a first condition in the class probability distribution of the sample under the first prompt template;
if such a probability value exists, using the category name pointed to by the probability value as the pseudo-classification label of the sample under the first prompt template; otherwise, determining that the sample has no pseudo-classification label under the first prompt template;
wherein the first condition is: the probability value is the maximum value in the class probability distribution and is greater than or equal to a confidence threshold.
8. The visual classification method based on multi-template prompt learning according to claim 7, wherein the performing uncertainty estimation on the class probability distribution of the sample under the first prompt template to obtain the weight of the sample under the first prompt template comprises:
traversing the class probability distributions of the sample under the first prompt templates other than this first prompt template, to find a first probability value meeting the first condition;
if the first probability value exists and the category name it points to is inconsistent with the pseudo-classification label, the uncertainty measure of the sample under the first prompt template is positive infinity;
if the first probability value exists and the category name it points to is consistent with the pseudo-classification label, or if no such first probability value exists, the uncertainty measure of the sample under the first prompt template is the standard deviation of the class probability distributions of the sample under the plurality of first prompt templates;
calculating the weight of the sample under the first prompt template by using the uncertainty measure of the sample under the first prompt template;
wherein the weight of the sample under the first prompt template is inversely related to the uncertainty measure of the sample under the first prompt template.
9. The visual classification method based on multi-template prompt learning according to claim 6, wherein the expression of the unsupervised loss is as follows:

$$\mathcal{L}_{u}=\frac{1}{MK}\sum_{k=1}^{K}\sum_{m=1}^{M}\omega_{i}^{k}\,\mathrm{CE}\!\left(\hat{q}_{i}^{k},\;p_{i}^{m}\right)$$

In the above formula, $M$ is the number of prompt templates, $C$ is the number of category names in the category name set, $K$ is the number of pseudo-classification labels of the $i$-th sample without a category label in the second sample set, $u_{i}^{k}$ is the uncertainty measure corresponding to the $k$-th pseudo-classification label of that sample, $\omega_{i}^{k}$ is the weight corresponding to that pseudo-classification label, $\hat{q}_{i}^{k}$ is the class probability distribution converted from that pseudo-classification label, and $p_{i}^{m}$ is the class probability distribution of the sample under the $m$-th prompt template;

wherein $\hat{q}_{i}^{k}$ is determined by the following formula:

$$\hat{q}_{i}^{k}=\mathrm{onehot}\!\left(c_{i}^{k},\,C\right)$$

In the above formula, $c_{i}^{k}$ denotes the position, within the class probability distribution to which it belongs, of the probability value corresponding to the $k$-th pseudo-classification label of the $i$-th sample without a category label in the second sample set.
10. A multi-template prompt learning-based visual classification device, the device comprising:
the visual input module is used for acquiring videos to be classified; the candidate text generation module is used for generating, for each prompt template in a plurality of prompt templates, a candidate text set under the prompt template based on the category name set of the visual classification task; embedding a category name into a prompt template to generate a candidate text associated with the corresponding category name under the corresponding prompt template;
the visual language coding module is used for inputting the continuous video frames of the video and the candidate text set into a visual language coding model to obtain the category probability distribution of the video under the prompt template; the visual classification result output module is used for determining a visual classification result of the video by using the category probability distribution of the video under the plurality of prompt templates;
the visual language coding model is obtained by performing three-stage training, namely fully supervised template parameter optimization, semi-supervised model optimization, and fully supervised template parameter fine-tuning, on a plurality of preset prompt templates and an improved visual language pre-training model by using a partially labeled visual classification sample set;
the improved visual language pre-training model is obtained by connecting a frame fusion module after the image encoder of the visual language pre-training model; and the frame fusion module is configured to perform feature fusion on the visual features of the input continuous video frames.
CN202310680502.2A 2023-06-09 2023-06-09 Visual classification method and device based on multi-template prompt learning Active CN116416480B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310680502.2A CN116416480B (en) 2023-06-09 2023-06-09 Visual classification method and device based on multi-template prompt learning


Publications (2)

Publication Number Publication Date
CN116416480A true CN116416480A (en) 2023-07-11
CN116416480B CN116416480B (en) 2023-08-25

Family

ID=87049584

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310680502.2A Active CN116416480B (en) 2023-06-09 2023-06-09 Visual classification method and device based on multi-template prompt learning

Country Status (1)

Country Link
CN (1) CN116416480B (en)



Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20230154146A1 (en) * 2021-11-16 2023-05-18 Salesforce.Com, Inc. Systems and methods for video and language pre-training
CN114996513A (en) * 2022-05-11 2022-09-02 湖南大学 Video question-answering method and system based on cross-modal prompt learning
CN115658954A (en) * 2022-10-28 2023-01-31 华东师范大学 Cross-modal retrieval confrontation defense method based on prompt learning
CN115761314A (en) * 2022-11-07 2023-03-07 重庆邮电大学 E-commerce image and text classification method and system based on prompt learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
MUHAMMAD UZAIR KHATTAK et al.: "MaPLe: Multi-modal Prompt Learning", Retrieved from the Internet <URL:https://arxiv.org/abs/2210.03117> *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116824278A (en) * 2023-08-29 2023-09-29 腾讯科技(深圳)有限公司 Image content analysis method, device, equipment and medium
CN116824278B (en) * 2023-08-29 2023-12-19 腾讯科技(深圳)有限公司 Image content analysis method, device, equipment and medium
CN116994188A (en) * 2023-09-22 2023-11-03 腾讯科技(深圳)有限公司 Action recognition method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN116416480B (en) 2023-08-25


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant