CN116416480A - Visual classification method and device based on multi-template prompt learning - Google Patents

Visual classification method and device based on multi-template prompt learning

Info

Publication number: CN116416480A (granted as CN116416480B)
Application number: CN202310680502.2A
Authority: CN (China)
Prior art keywords: prompt, template, sample, visual, templates
Legal status: Granted; Active
Other languages: Chinese (zh)
Inventors: Yang Shu (杨舒); Wang Shengjin (王生进)
Assignee (original and current): Tsinghua University
Events: application filed by Tsinghua University; priority to CN202310680502.2A; publication of CN116416480A; application granted; publication of CN116416480B

Classifications

    • G: Physics
    • G06: Computing; calculating or counting
    • G06N: Computing arrangements based on specific computational models
        • G06N 3/00: Computing arrangements based on biological models
        • G06N 3/02: Neural networks
        • G06N 3/08: Learning methods
        • G06N 3/0895: Weakly supervised learning, e.g. semi-supervised or self-supervised learning
        • G06N 3/09: Supervised learning
    • G06V: Image or video recognition or understanding
        • G06V 10/00: Arrangements for image or video recognition or understanding
        • G06V 10/70: Arrangements using pattern recognition or machine learning
        • G06V 10/764: Arrangements using classification, e.g. of video objects
        • G06V 10/77: Processing image or video features in feature spaces; data integration or data reduction, e.g. principal component analysis [PCA], independent component analysis [ICA] or self-organising maps [SOM]; blind source separation
        • G06V 10/774: Generating sets of training patterns; bootstrap methods, e.g. bagging or boosting
        • G06V 10/80: Fusion, i.e. combining data from various sources at the sensor, preprocessing, feature extraction or classification level
        • G06V 10/806: Fusion of extracted features
        • G06V 10/82: Arrangements using neural networks
    • Y: General tagging of new technological developments; general tagging of cross-sectional technologies spanning over several sections of the IPC; technical subjects covered by former USPC cross-reference art collections [XRACs] and digests
    • Y02: Technologies or applications for mitigation or adaptation against climate change
    • Y02T: Climate change mitigation technologies related to transportation
        • Y02T 10/00: Road transport of goods or passengers
        • Y02T 10/10: Internal combustion engine [ICE] based vehicles
        • Y02T 10/40: Engine management systems

Abstract

The invention provides a visual classification method and device based on multi-template prompt learning, relating to the technical field of machine learning, the method comprising: generating a candidate text set under each of a plurality of prompt templates by using the category name set of the visual classification task; inputting continuous video frames of the video to be classified and the candidate text set under each prompt template into a visual language coding model to obtain the category probability distribution of the video under each prompt template; and determining the visual classification result of the video by using the category probability distributions. According to the invention, three-stage training consisting of fully-supervised template parameter optimization, semi-supervised model optimization and fully-supervised template parameter fine-tuning is performed on a plurality of preset prompt templates and on a visual language pre-training model into which a frame fusion module is inserted, so as to obtain the plurality of prompt templates and the visual language coding model. This improves the utilization efficiency of training samples when the visual language pre-training model is generalized to a downstream visual understanding task, and accordingly improves accuracy when the plurality of prompt templates and the visual language coding model are applied to the downstream visual understanding task.

Description

Visual classification method and device based on multi-template prompt learning
Technical Field
The invention relates to the technical field of machine learning, and in particular to a visual classification method and device based on multi-template prompt learning.
Background
Visual-language pre-training (VLP) adopts a multi-modal self-supervised learning approach and utilizes large-scale "image/video-text pair" data to learn cross-modal semantic associations between vision and language. However, when applied to downstream image/video understanding tasks, existing visual language pre-training models typically attach a new classifier/regressor after the encoded features and then perform end-to-end parameter fine-tuning. This approach has two problems. First, since the downstream task is inconsistent with the pre-training task, end-to-end learning can cause the knowledge learned in the pre-training stage to be lost. Second, fine-tuning too many parameters can cause overfitting when the downstream task has few training samples.
Unlike parameter fine-tuning methods, prompt learning methods convert the downstream task by means of a prompt template and adapt the downstream task to the pre-trained model, so that the objective function of the downstream task is consistent with that of the pre-training task. For example, in an image classification task, given a predefined prompt template "a photo of [ CLASS ]", each candidate class name replaces [ CLASS ] at test time, and the resulting text is fed, together with the test image, into an image-text pre-training model for encoding and matching, thereby completing the image classification. Alternatively, template parameters can be trained on labeled classification samples using a learnable prompt template such as "X X X X X [ CLASS ]". However, existing prompt learning methods have few or no learnable template parameters and do not effectively utilize downstream task samples, so the generalization performance of the pre-trained model on downstream tasks is low.
Disclosure of Invention
Aiming at the problem that the utilization efficiency of training samples is low when a visual language pre-training model is generalized to a downstream task in the existing prompt learning method, the invention provides a visual classification method and device based on multi-template prompt learning.
In a first aspect, the present invention provides a visual classification method based on multi-template prompt learning, the method comprising:
acquiring videos to be classified;
for each of a plurality of prompt templates, generating a candidate text set under the prompt template based on a category name set of a visual classification task; wherein a category name is embedded into a prompt template to generate the candidate text associated with that category name under that prompt template;
inputting the continuous video frames of the video and the candidate text set into a visual language coding model to obtain the category probability distribution of the video under the prompt template;
determining a visual classification result of the video by using the category probability distribution of the video under a plurality of prompt templates;
the visual language coding model is obtained by performing three-stage training, namely fully-supervised template parameter optimization, semi-supervised model optimization and fully-supervised template parameter fine-tuning, on a plurality of preset prompt templates and an improved visual language pre-training model by utilizing a semi-labeled visual classification sample set;
the improved visual language pre-training model is obtained by accessing a frame fusion module behind an image encoder in the visual language pre-training model; the frame fusion module is used for carrying out feature fusion on visual features of the input continuous video frames.
According to the visual classification method based on multi-template prompt learning provided by the invention, the generation process of the plurality of preset prompt templates comprises the following steps:
generating a plurality of initial prompt templates based on a given prompt template format; the given prompt template format is that a prompt template consists of a plurality of prompt character positions and a category flag bit; the plurality of initial prompt templates differ in the number of prompt character positions and/or the position of the category flag bit;

embedding a word into each prompt character position in each initial prompt template to obtain the plurality of preset prompt templates;

wherein embedding a word into any prompt character position in any initial prompt template comprises:

initializing a word to be embedded;

determining the coding serial number of the word to be embedded by using a word list;

substituting the coding serial number into a language embedding model to obtain the coding feature of the word to be embedded;

embedding the coding feature into that prompt character position in that initial prompt template.
According to the visual classification method based on multi-template prompt learning, embedding a category name into a prompt template is equivalent to embedding the category name into the category flag bit of the prompt template;

inputting the continuous video frames of the video and the candidate text set into a visual language coding model to obtain the category probability distribution of the video under the prompt template comprises:
determining, with an image encoder of the visual language coding model, a fused visual feature of successive video frames of the video;
determining a text feature of each candidate text in the set of candidate texts using a text encoder of the visual language coding model;
respectively comparing the feature similarity between the fused visual feature and the text feature of each candidate text in the candidate text set to obtain comparison results;

recording each comparison result as the probability that the category name associated with the corresponding candidate text in the candidate text set is the category name of the video;

and obtaining the category probability distribution of the video under the prompt template based on these probabilities.
According to the visual classification method based on multi-template prompt learning provided by the invention, the plurality of prompt templates and the visual language coding model are obtained by performing three-stage training, namely fully-supervised template parameter optimization, semi-supervised model optimization and fully-supervised template parameter fine-tuning, on a plurality of preset prompt templates and an improved visual language pre-training model by utilizing a semi-labeled visual classification sample set, comprising:
based on the visual language pre-training model, performing full-supervision learning on a plurality of preset prompt templates by using a first sample set contained in the visual classification sample set so as to optimize the preset prompt templates to obtain a plurality of first prompt templates;
based on a plurality of the first prompt templates, performing semi-supervised learning on the improved visual language pre-training model by using a second sample set contained in the visual classification sample set so as to optimize the frame fusion module to obtain the visual language coding model;
Based on the visual language coding model, performing full-supervised learning on the plurality of first prompt templates by using a third sample set contained in the visual classification sample set so as to finely tune the plurality of first prompt templates to obtain a plurality of prompt templates;
the first sample set, the second sample set and the third sample set are all obtained by processing a pre-stored video set annotated over all categories;

the samples in the first sample set are intermediate video frames carrying category labels;

in the second sample set, some samples are continuous video frames carrying category labels and the remaining samples are continuous video frames without category labels;
the samples in the third sample set are continuous video frames carrying category labels.
According to the visual classification method based on multi-template prompt learning provided by the invention, based on the visual language pre-training model, the first sample set contained in the visual classification sample set is utilized to perform full-supervision learning on a plurality of preset prompt templates so as to optimize the preset prompt templates to obtain a plurality of first prompt templates, and the method comprises the following steps:
for each preset prompt template and each sample in the first sample set, generating a candidate text set under the preset prompt template based on the category name set;

inputting the sample and the candidate text set under the preset prompt template into the visual language pre-training model to obtain the class probability distribution of the sample under the preset prompt template;

determining the fully-supervised loss of the sample according to the category label of the sample and the category probability distributions of the sample under the plurality of preset prompt templates;

optimizing the plurality of preset prompt templates into the plurality of first prompt templates by utilizing the fully-supervised losses of the samples in the first sample set;
based on the visual language coding model, performing full-supervised learning on the plurality of first prompt templates by using the third sample set to fine tune the plurality of first prompt templates to obtain a plurality of prompt templates, including:
for each first prompt template and each sample in the third sample set, generating a candidate text set under the first prompt template based on the category name set;

inputting the sample and the candidate text set under the first prompt template into the visual language coding model to obtain the class probability distribution of the sample under the first prompt template;

determining the fully-supervised loss of the sample according to the category label of the sample and the category probability distributions of the sample under the plurality of first prompt templates;

and optimizing the plurality of first prompt templates into the plurality of prompt templates by using the fully-supervised losses of the samples in the third sample set.
According to the visual classification method based on multi-template prompt learning provided by the invention, the method for performing semi-supervised learning on the improved visual language pre-training model by using a second sample set contained in the visual classification sample set based on a plurality of first prompt templates so as to optimize the frame fusion module to obtain the visual language coding model comprises the following steps:
for each first prompt template and each sample in the second sample set, generating a candidate text set under the first prompt template based on the category name set;

inputting the sample and the candidate text set under the first prompt template into the visual language coding model to obtain the class probability distribution of the sample under the first prompt template;

when the sample does not carry a category label, performing uncertainty estimation on the class probability distribution of the sample under the first prompt template to obtain a pseudo-classification label and a weight of the sample under the first prompt template;

determining the unsupervised loss of the sample according to the pseudo-classification labels and weights of the sample under the plurality of first prompt templates;

when the sample carries a category label, determining the fully-supervised loss of the sample according to the category label of the sample and the category probability distributions of the sample under the plurality of first prompt templates;

and optimizing the frame fusion module to obtain the visual language coding model according to the unsupervised losses of the samples without category labels in the second sample set and the fully-supervised losses of the samples with category labels.
According to the visual classification method based on multi-template prompt learning provided by the invention, uncertainty estimation is carried out on the class probability distribution of the sample under the first prompt template to obtain the pseudo classification label of the sample under the first prompt template, and the visual classification method comprises the following steps:
searching a probability value meeting a first condition from the class probability distribution of the sample under the first prompt template;
if such a probability value exists, using the category name pointed to by the probability value as the pseudo-classification label of the sample under the first prompt template; otherwise, determining that no pseudo-classification label exists for the sample under the first prompt template;

wherein the first condition is: the probability value is the maximum value in the class probability distribution and is greater than or equal to a confidence threshold.
According to the visual classification method based on multi-template prompt learning provided by the invention, uncertainty estimation is carried out on the class probability distribution of the sample under the first prompt template to obtain the weight of the sample under the first prompt template, and the method comprises the following steps:
traversing the class probability distribution of the sample under other first prompt templates except the first prompt template to find a first probability value meeting a first condition;
if the first probability value exists and the class name indicated by the first probability value is inconsistent with the pseudo classification label, the uncertainty measurement of the sample under the first prompt template is positive infinity;
if the first probability value exists and the class name indicated by the first probability value is consistent with the pseudo classification label, or the first probability value does not exist, the uncertainty measure of the sample under the first prompt template is the standard deviation of class probability distribution of the sample under a plurality of first prompt templates;
calculating the weight of the sample under the first prompt template by using the uncertainty measure of the sample under the first prompt template;
wherein the weight of the sample under the first hint template is inversely related to the uncertainty measure of the sample under the first hint template.
According to the visual classification method based on multi-template prompt learning, the expression of the unsupervised loss of the $i$-th sample without category label in the second sample set is:

$$\mathcal{L}_{us}^{(i)} = -\frac{1}{M}\sum_{m=1}^{M}\sum_{k=1}^{K} w_{ik}\sum_{c=1}^{C}\bar{p}_{ik}(c)\,\log p_{i}^{(m)}(c)$$

in the above formula, $M$ is the number of prompt templates, $C$ is the number of category names in the category name set, $K$ is the number of pseudo-classification labels of the $i$-th sample without category label in the second sample set, $\sigma_{ik}$ is the uncertainty measure corresponding to the $k$-th pseudo-classification label of that sample, $w_{ik}$ is the weight corresponding to the $k$-th pseudo-classification label of that sample, $\bar{p}_{ik}$ is the class probability distribution converted from the $k$-th pseudo-classification label of that sample, and $p_{i}^{(m)}$ is the class probability distribution of that sample under the $m$-th prompt template;

wherein $\bar{p}_{ik}$ is determined by the following formula:

$$\bar{p}_{ik} = \mathrm{onehot}\left(a_{ik}\right)$$

in the above formula, $a_{ik}$ denotes the position, in the class probability distribution to which it belongs, of the probability value corresponding to the $k$-th pseudo-classification label of the $i$-th sample without category label in the second sample set.
In a second aspect, the present invention provides a visual classification device based on multi-template prompt learning, the device comprising:
The visual input module is used for acquiring videos to be classified; the candidate text generation module is used for generating, for each of a plurality of prompt templates, a candidate text set under the prompt template based on the category name set of the visual classification task; a category name is embedded into a prompt template to generate the candidate text associated with that category name under that prompt template;
the visual language coding module is used for inputting the continuous video frames of the video and the candidate text set into a visual language coding model to obtain the category probability distribution of the video under the prompt template; the visual classification result output module is used for determining a visual classification result of the video by using the category probability distribution of the video under the plurality of prompt templates;
the visual language coding model is obtained by performing three-stage training, namely fully-supervised template parameter optimization, semi-supervised model optimization and fully-supervised template parameter fine-tuning, on a plurality of preset prompt templates and an improved visual language pre-training model by utilizing a semi-labeled visual classification sample set;
the improved visual language pre-training model is obtained by accessing a frame fusion module behind an image encoder in the visual language pre-training model; the frame fusion module is used for carrying out feature fusion on visual features of the input continuous video frames.
The invention provides a visual classification method and device based on multi-template prompt learning, comprising: acquiring a video to be classified; for each of a plurality of prompt templates, generating a candidate text set under the prompt template based on the category name set of the visual classification task, wherein a category name is embedded into a prompt template to generate the candidate text associated with that category name under that prompt template; inputting the continuous video frames of the video and the candidate text set into a visual language coding model to obtain the category probability distribution of the video under the prompt template; and determining the visual classification result of the video by using the category probability distributions of the video under the plurality of prompt templates. According to the invention, three-stage training consisting of fully-supervised template parameter optimization, semi-supervised model optimization and fully-supervised template parameter fine-tuning is performed on a plurality of preset prompt templates and on a visual language pre-training model into which a frame fusion module is inserted, so as to obtain the plurality of prompt templates and the visual language coding model. This improves the utilization efficiency of training samples when the visual language pre-training model is generalized to a downstream visual understanding task, and accordingly improves accuracy when the plurality of prompt templates and the visual language coding model are applied to the downstream visual understanding task.
Drawings
In order to more clearly illustrate the technical solutions of the invention or of the prior art, the drawings used in the description of the embodiments or of the prior art are briefly introduced below. The drawings described below show some embodiments of the invention, and a person skilled in the art can obtain other drawings from them without inventive effort.
FIG. 1 is a schematic flow chart of a visual classification method based on multi-template prompt learning;
FIG. 2 is a schematic diagram of a fully supervised template parameter optimization framework provided by the present invention;
FIG. 3 is a schematic diagram of an application of the frame fusion module in a visual language coding model;
FIG. 4 is a schematic diagram of a semi-supervised model optimization framework provided by the present invention;
FIG. 5 is a schematic diagram of a visual classification device based on multi-template prompt learning according to the present invention;
fig. 6 is a schematic structural diagram of an electronic device provided by the present invention;
reference numerals:
610: a processor; 620: a communication interface; 630: a memory; 640: a communication bus.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The visual classification method and apparatus based on multi-template prompt learning of the present invention is described below in conjunction with fig. 1-6.
In a first aspect, the present invention provides a visual classification method based on multi-template prompt learning, as shown in fig. 1, the method includes:
s11: acquiring videos to be classified;
s12: for each of a plurality of prompt templates, generating a candidate text set under the prompt template based on a category name set of a visual classification task; wherein a category name is embedded into a prompt template to generate the candidate text associated with that category name under that prompt template;

s13: inputting the continuous video frames of the video and the candidate text set into a visual language coding model to obtain the category probability distribution of the video under the prompt template; s14: determining the visual classification result of the video by using the category probability distributions of the video under the plurality of prompt templates; the visual language coding model is obtained by performing three-stage training, namely fully-supervised template parameter optimization, semi-supervised model optimization and fully-supervised template parameter fine-tuning, on a plurality of preset prompt templates and an improved visual language pre-training model by utilizing a semi-labeled visual classification sample set;
The improved visual language pre-training model is obtained by accessing a frame fusion module behind an image encoder in the visual language pre-training model; the frame fusion module is used for carrying out feature fusion on visual features of the input continuous video frames.
The visual language pre-training model is a model which performs multi-mode self-supervision learning on a large-scale image/video-text pair data set in advance, and can perform feature extraction on an image/continuous video frame and a text respectively and map features of the two modes to the same semantic space.
In the improved visual language pre-training model, a frame fusion module is added after the image encoder of the visual language pre-training model; it performs feature fusion on the visual features of continuous video frames to obtain a single fused visual feature, laying the foundation for feature comparison.
The invention provides a visual classification method based on multi-template prompt learning, in which three-stage training consisting of fully-supervised template parameter optimization, semi-supervised model optimization and fully-supervised template parameter fine-tuning is performed on a plurality of preset prompt templates and on a visual language pre-training model into which a frame fusion module is inserted, so as to obtain the plurality of prompt templates and the visual language coding model. This improves the utilization efficiency of training samples when the visual language pre-training model is generalized to a downstream visual understanding task, and accordingly improves accuracy when the plurality of prompt templates and the visual language coding model are applied to the downstream visual understanding task.
Specifically, the continuous video frames of the video to be classified in S11 are obtained by:
s11.1: acquiring compressed videos to be classified;
s11.2: decoding the compressed video to obtain decoded continuous video frames;
s11.3: preprocessing the decoded continuous video frames to obtain continuous video frames of the video to be classified;
wherein the preprocessing operation includes, but is not limited to, normalization.
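As an illustration of S11.1 to S11.3 above, the following is a minimal sketch of decoding and preprocessing; the use of torchvision, the frame count, and the CLIP normalization constants are illustrative assumptions, not requirements of the method.

```python
# A minimal sketch of S11.1-S11.3 (illustrative assumptions: torchvision
# for decoding, 8 uniformly sampled frames, CLIP normalization constants).
import torch
from torchvision.io import read_video
import torchvision.transforms.functional as TF

def load_video_frames(path: str, num_frames: int = 8, size: int = 224) -> torch.Tensor:
    frames, _, _ = read_video(path, pts_unit="sec")           # (T, H, W, C), uint8
    idx = torch.linspace(0, frames.shape[0] - 1, num_frames).long()
    frames = frames[idx].permute(0, 3, 1, 2).float() / 255.0  # (N, C, H, W)
    frames = TF.resize(frames, [size, size], antialias=True)
    return TF.normalize(frames,                               # normalization step
                        mean=[0.48145466, 0.4578275, 0.40821073],
                        std=[0.26862954, 0.26130258, 0.27577711])
```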
Specifically, in step S12, the plurality of prompt templates are obtained through two stages of template parameter optimization on the basis of the plurality of preset prompt templates;
the generating process of the plurality of preset prompt templates comprises the following steps:
SA: generating a plurality of initial prompt templates based on a given prompt template format; the given prompt template format is that a prompt template consists of a plurality of prompt character positions and a category flag bit; the plurality of initial prompt templates differ in the number of prompt character positions and/or the position of the category flag bit;

assuming that a prompt character position is represented by X and the category flag bit is represented by [ CLASS ], an initial prompt template that includes 7 prompt character positions with the category flag bit located at the end may be represented as "X X X X X X X [ CLASS ]".
SB: embedding a word into each prompt character position in each initial prompt template to obtain a plurality of preset prompt templates;
in order to adapt to machine learning, the contents of the prompt character positions and the category flag bit in the initial prompt template are word coding features; generating a preset prompt template means initializing the contents of the prompt character positions of the corresponding initial prompt template (these contents are the template parameters to be learned), i.e., embedding words into the prompt character positions of the corresponding initial prompt template.
wherein embedding a word into a prompt character position in any initial prompt template comprises:

initializing a word to be embedded;

determining the coding serial number of the word to be embedded by using a word list;

substituting the coding serial number into a language embedding model to obtain the coding feature of the word to be embedded;

embedding the coding feature into that prompt character position in that initial prompt template.
An existing language embedding model is used here; its choice determines the dimension of the coding features. In the invention, the lengths of the plurality of preset prompt templates, the initial contents of the prompt character positions, the dimension of the coding features and the position distribution of the category flag bits all affect the performance of the final plurality of prompt templates.
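As a concrete illustration of the word embedding procedure above, the following sketch initializes the learnable contents of one template's prompt character positions from seed words; the use of the open-source CLIP package as the language embedding model, and the seed words themselves, are assumptions made for illustration.

```python
# A minimal sketch of initializing one learnable prompt template
# (assumes the open-source CLIP package as the language embedding model).
import torch
import clip

model, _ = clip.load("ViT-B/32", device="cpu")
token_embedding = model.token_embedding            # word list -> coding features

def init_template(seed_words, class_pos):
    """seed_words: one seed word per prompt character position;
    class_pos: index of the category flag bit [CLASS] in the template."""
    # coding serial number of (the first sub-token of) each seed word
    ids = torch.cat([clip.tokenize(w)[0, 1:2] for w in seed_words])
    ctx = token_embedding(ids).detach().clone()    # (L, D) coding features
    ctx.requires_grad_(True)                       # template parameters to be learned
    return ctx, class_pos
```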
Further, in S12, assuming that the total number of category names in the category name set is C and the total number of prompt templates is M, embedding the C category names into the m-th prompt template generates C candidate texts containing the category names; the set formed by these C candidate texts is referred to as the candidate text set under the m-th prompt template. Traversing all the prompt templates yields the candidate text sets under the M prompt templates.
It should be noted that embedding a category name into a prompt template is equivalent to embedding the category name into the category flag bit of the prompt template, and the embedding method is the same as the word embedding process for the prompt character positions in the generation of the preset prompt templates.
Specifically, the step S13 includes:
s13.1: determining, with an image encoder of the visual language coding model, a fused visual feature of successive video frames of the video;
s13.2: determining a text feature of each candidate text in the set of candidate texts using a text encoder of the visual language coding model;
s13.3: respectively comparing the feature similarity between the fusion visual features and the text features of each candidate text in the candidate text set to obtain a comparison result;
S13.4: recording each comparison result as the probability that the category name associated with the corresponding candidate text in the candidate text set is the category name of the video;

it can be seen that the higher the feature similarity in the comparison, the higher the probability (score); the class probability distribution is therefore actually a score distribution.
S13.5: and obtaining the category probability distribution of the video under the prompt template based on the probability.
It will be appreciated that the image encoder, text encoder and cross-modal feature similarity contrast structures are all structures in the visual language pre-training model, and thus corresponding structures also exist in the visual language encoding model.
The invention performs feature extraction and similarity comparison between the video and the candidate text sets under the M prompt templates to obtain M category probability distributions, each consisting of the probabilities corresponding to the C category names.
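The matching of S13.1 to S13.5 can be sketched as follows; the fused visual feature and candidate text features are assumed to come from the encoders described above, and the logit scale value is an illustrative assumption (CLIP-style models expose a learned scale).

```python
# A minimal sketch of S13: one class probability distribution per prompt
# template, via cosine similarity between the fused visual feature and the
# text features of the candidate texts (logit_scale=100.0 is an assumption).
import torch
import torch.nn.functional as F

def classify_under_templates(video_feat: torch.Tensor,    # (D,) fused visual feature
                             text_feats: torch.Tensor,    # (M, C, D) candidate text features
                             logit_scale: float = 100.0) -> torch.Tensor:
    v = F.normalize(video_feat, dim=-1)
    t = F.normalize(text_feats, dim=-1)
    sims = torch.einsum("d,mcd->mc", v, t)                # feature similarity comparison
    return (logit_scale * sims).softmax(dim=-1)           # (M, C) distributions
```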
Specifically, the step S14 includes:
a category probability distribution is determined from the category probability distributions of the video under the plurality of prompt templates by averaging or voting, and the category name corresponding to the maximum probability value of that distribution is taken as the visual classification result of the video.
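Continuing the sketch above, the aggregation of S14 could look like this; both averaging and voting are shown, and the choice between them is left open by the method.

```python
# A minimal sketch of S14, continuing classify_under_templates above.
probs = classify_under_templates(video_feat, text_feats)  # (M, C)
pred_by_mean = probs.mean(dim=0).argmax().item()          # averaging the M distributions
pred_by_vote = probs.argmax(dim=-1).mode().values.item()  # majority vote over templates
```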
From the above steps, it can be seen that the accuracy of the visual classification method of the present invention depends on the plurality of prompt templates and the visual language coding model, which are obtained by performing three-stage training, namely fully-supervised template parameter optimization, semi-supervised model optimization and fully-supervised template parameter fine-tuning, on a plurality of preset prompt templates and an improved visual language pre-training model by utilizing a semi-labeled visual classification sample set; specifically:
SI: based on the visual language pre-training model, performing full-supervision learning on a plurality of preset prompt templates by using a first sample set contained in the visual classification sample set so as to optimize the preset prompt templates to obtain a plurality of first prompt templates;
SII: based on a plurality of the first prompt templates, performing semi-supervised learning on the improved visual language pre-training model by using a second sample set contained in the visual classification sample set so as to optimize the frame fusion module to obtain the visual language coding model;
SIII: based on the visual language coding model, performing full-supervised learning on the plurality of first prompt templates by using a third sample set contained in the visual classification sample set so as to finely tune the plurality of first prompt templates to obtain a plurality of prompt templates;
the first sample set, the second sample set and the third sample set are all obtained by processing a pre-stored video set annotated over all categories;

the samples in the first sample set are intermediate video frames carrying category labels;

in the second sample set, some samples are continuous video frames carrying category labels and the remaining samples are continuous video frames without category labels;
the samples in the third sample set are continuous video frames carrying category labels.
Here, fig. 2 is a schematic diagram of the fully-supervised template parameter optimization framework, in which the learnable prompt templates are the plurality of preset prompt templates. As shown in fig. 2, SI is specifically:
SI-1: generating a candidate text set under the preset prompt template based on the category name set for each preset prompt template and each sample in the first sample set;
SI-2: inputting the sample and the candidate text set under the preset prompt template into the visual language pre-training model to obtain the class probability distribution of the sample under the preset prompt template;

SI-3: determining the fully-supervised loss of the sample according to the category label of the sample and the category probability distributions of the sample under the plurality of preset prompt templates;

SI-4: optimizing the plurality of preset prompt templates into the plurality of first prompt templates by utilizing the fully-supervised losses of the samples in the first sample set;
wherein the expression of the fully-supervised loss of a sample is:

$$\mathcal{L}_{sup}^{(j)} = -\frac{1}{M}\sum_{m=1}^{M}\sum_{c=1}^{C}\bar{y}_{j}(c)\,\log p_{j}^{(m)}(c)$$

in the above formula, $\mathcal{L}_{sup}^{(j)}$ is the fully-supervised loss of the $j$-th sample in the first training set, $\bar{y}_{j}$ is the 0-1 (one-hot) distribution converted from the true label of the $j$-th sample in the first training set, and $p_{j}^{(m)}$ is the class probability distribution of the $j$-th sample in the first training set under the $m$-th preset prompt template.
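The fully-supervised loss of one sample can then be sketched as follows; the (M, C) probability tensor is assumed to come from the per-template matching step above.

```python
# A minimal sketch of the fully-supervised loss: cross-entropy between the
# one-hot true label and the per-template distributions, averaged over M.
import torch

def supervised_loss(probs: torch.Tensor, label: int) -> torch.Tensor:
    """probs: (M, C) class probability distributions of one labeled sample."""
    log_p = probs.clamp_min(1e-8).log()
    return -log_p[:, label].mean()       # the one-hot label picks column `label`
```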
FIG. 3 is a schematic diagram of the application of the frame fusion module in a visual language coding model. The frame fusion module is composed of a frame re-encoding module and a self-attention pooling module. The frame re-encoding module takes the image features of N video frames as input, adds N learnable position embeddings to them, and feeds the result into a Transformer network to obtain N video frame features. The Transformer network comprises a multi-head self-attention layer and a fully-connected feed-forward network, each of which is wrapped with a residual connection and followed by normalization. The self-attention pooling module performs global average pooling on the N video frame features produced by the frame re-encoding module to obtain an average feature, concatenates the average feature with the N video frame features to obtain N+1 input features, feeds these into a multi-head self-attention layer and a fully-connected feed-forward network to obtain N+1 features, and outputs the first of them as the final video coding feature; a sketch is given below. FIG. 4 is a schematic diagram of the semi-supervised model optimization framework; the prompt templates referred to in FIGS. 3 and 4 are specifically the first prompt templates, and the text set in FIG. 4 is specifically the category name set.
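A minimal sketch of the frame fusion module as described above; the layer count, feed-forward width and head count are illustrative assumptions where the description does not fix them.

```python
# A minimal sketch of the frame fusion module: frame re-encoding (N learnable
# position embeddings + one Transformer encoder layer) followed by
# self-attention pooling over [average feature; N frame features].
import torch
import torch.nn as nn

class FrameFusion(nn.Module):
    def __init__(self, dim: int = 512, num_frames: int = 8, heads: int = 8):
        super().__init__()
        self.pos = nn.Parameter(torch.zeros(num_frames, dim))   # N learnable positions
        self.recode = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.pool = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        x = self.recode(frame_feats + self.pos)   # (B, N, D) re-encoded frame features
        mean = x.mean(dim=1, keepdim=True)        # global average pooling -> (B, 1, D)
        x = torch.cat([mean, x], dim=1)           # N + 1 input features
        return self.pool(x)[:, 0]                 # first feature = video coding feature
```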
As shown in figs. 3 and 4, SII is specifically:

SII-1: for each first prompt template and each sample in the second sample set, generating a candidate text set under the first prompt template based on the category name set;

SII-2: inputting the sample and the candidate text set under the first prompt template into the visual language coding model to obtain the class probability distribution of the sample under the first prompt template;

SII-3: when the sample does not carry a category label, performing uncertainty estimation on the class probability distribution of the sample under the first prompt template to obtain a pseudo-classification label and a weight of the sample under the first prompt template;

SII-4: determining the unsupervised loss of the sample according to the pseudo-classification labels and weights of the sample under the plurality of first prompt templates;

SII-5: when the sample carries a category label, determining the fully-supervised loss of the sample according to the category label of the sample and the category probability distributions of the sample under the plurality of first prompt templates;

SII-6: optimizing the frame fusion module to obtain the visual language coding model according to the unsupervised losses of the samples without category labels in the second sample set and the fully-supervised losses of the samples with category labels.
Further, SII-3 is divided into two stages: determining the pseudo-classification label of the sample under the first prompt template, and determining the weight of the sample under the first prompt template;

wherein determining the pseudo-classification label of the sample under the first prompt template comprises:
SII-3-a: searching a probability value meeting a first condition from the class probability distribution of the sample under the first prompt template;
SII-3-B: if such a probability value exists, using the category name pointed to by the probability value as the pseudo-classification label of the sample under the first prompt template; otherwise, determining that no pseudo-classification label exists for the sample under the first prompt template;

wherein the first condition is: the probability value is the maximum value in the class probability distribution and is greater than or equal to a confidence threshold.

If no probability value satisfying the first condition is found, no pseudo-classification label is generated for the sample under this template, and the sample does not participate in the calculation of the unsupervised loss under this template.
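A minimal sketch of SII-3-A/B; the confidence threshold value is an illustrative assumption.

```python
# A minimal sketch of pseudo-label selection under one template
# (the threshold tau=0.7 is an illustrative assumption).
import torch
from typing import Optional

def pseudo_label(probs: torch.Tensor, tau: float = 0.7) -> Optional[int]:
    """probs: (C,) class probability distribution under one first prompt template."""
    p, c = probs.max(dim=-1)                 # maximum value and its position
    return c.item() if p.item() >= tau else None
```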
Determining the weight of the sample under the first prompt template comprises:
SII-3-I: traversing the class probability distributions of the sample under the first prompt templates other than this first prompt template to find a first probability value meeting the first condition;
SII-3-II: if the first probability value exists and the class name indicated by the first probability value is inconsistent with the pseudo classification label, the uncertainty measurement of the sample under the first prompt template is positive infinity;
SII-3-III: if the first probability value exists and the class name indicated by the first probability value is consistent with the pseudo classification label, or the first probability value does not exist, the uncertainty measure of the sample under the first prompt template is the standard deviation of class probability distribution of the sample under a plurality of first prompt templates;
SII-3-IV: calculating the weight of the sample under the first prompt template by using the uncertainty measure of the sample under the first prompt template;
wherein the weight of the sample under the first hint template is inversely related to the uncertainty measure of the sample under the first hint template.
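A minimal sketch of SII-3-I to SII-3-IV follows. Two points are assumptions where the text leaves room: the standard deviation is taken over the pseudo class's probability across the M templates, and the inverse relation between weight and uncertainty is realized as exp(-sigma).

```python
# A minimal sketch of the uncertainty measure and weight for one pseudo label
# (std-over-templates reading and the exp(-sigma) weighting are assumptions).
import math
import torch

def uncertainty_and_weight(all_probs: torch.Tensor, m: int, label: int,
                           tau: float = 0.7):
    """all_probs: (M, C) distributions of one unlabeled sample; m: index of the
    template that produced the pseudo label; label: its pseudo class index."""
    for j in range(all_probs.shape[0]):
        if j == m:
            continue
        p, c = all_probs[j].max(dim=-1)
        if p.item() >= tau and c.item() != label:
            return float("inf"), 0.0         # conflicting confident pseudo label
    sigma = all_probs[:, label].std().item() # spread of the pseudo class across templates
    return sigma, math.exp(-sigma)           # weight inversely related to uncertainty
```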
In SII-4, the expression of the unsupervised loss of the $i$-th sample without category label in the second sample set is:

$$\mathcal{L}_{us}^{(i)} = -\frac{1}{M}\sum_{m=1}^{M}\sum_{k=1}^{K} w_{ik}\sum_{c=1}^{C}\bar{p}_{ik}(c)\,\log p_{i}^{(m)}(c)$$

in the above formula, $M$ is the number of prompt templates, $C$ is the number of category names in the category name set, $K$ is the number of pseudo-classification labels of the $i$-th sample without category label in the second sample set, $\sigma_{ik}$ is the uncertainty measure corresponding to the $k$-th pseudo-classification label of that sample, $w_{ik}$ is the weight corresponding to the $k$-th pseudo-classification label of that sample, $\bar{p}_{ik}$ is the class probability distribution converted from the $k$-th pseudo-classification label of that sample, and $p_{i}^{(m)}$ is the class probability distribution of that sample under the $m$-th prompt template;

wherein $\bar{p}_{ik}$ is determined by the following formula:

$$\bar{p}_{ik} = \mathrm{onehot}\left(a_{ik}\right)$$

in the above formula, $a_{ik}$ denotes the position, in the class probability distribution to which it belongs, of the probability value corresponding to the $k$-th pseudo-classification label of the $i$-th sample without category label in the second sample set.
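As a sketch, the per-sample unsupervised loss above can be computed as follows, reusing the pseudo labels and weights from the previous sketches:

```python
# A minimal sketch of the per-sample unsupervised loss: weighted cross-entropy
# between each pseudo label's one-hot distribution and the M predictions.
import torch

def unsupervised_loss(all_probs: torch.Tensor, pseudo_labels, weights) -> torch.Tensor:
    """all_probs: (M, C) distributions of one unlabeled sample; pseudo_labels:
    K pseudo class indices; weights: their K uncertainty-derived weights."""
    log_p = all_probs.clamp_min(1e-8).log()           # (M, C)
    return sum(w * (-log_p[:, c].mean())              # mean realizes the 1/M factor
               for c, w in zip(pseudo_labels, weights))
```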
In SII-5, the fully-supervised loss in the model optimization stage is calculated in the same way as in the template parameter optimization stage, and the description is not repeated.
Through SII, the plurality of first prompt templates are utilized to automatically generate pseudo labels for unlabeled samples, which alleviates the problem of limited labeled samples in downstream tasks; sample uncertainty is measured by the differences among the encodings under the plurality of first prompt templates, which alleviates the problem of pseudo-label noise; and semi-supervised model optimization is performed on the improved visual language pre-training model on the basis of the plurality of first prompt templates, which improves the efficiency of generalizing the improved visual language pre-training model to downstream tasks.
SIII, specifically:
SIII-1: for each first prompt template and each sample in the third sample set, generating a candidate text set under the first prompt template based on the category name set;

SIII-2: inputting the sample and the candidate text set under the first prompt template into the visual language coding model to obtain the class probability distribution of the sample under the first prompt template;

SIII-3: determining the fully-supervised loss of the sample according to the category label of the sample and the category probability distributions of the sample under the plurality of first prompt templates;

SIII-4: optimizing the plurality of first prompt templates into the plurality of prompt templates by using the fully-supervised losses of the samples in the third sample set.
Likewise, the fully-supervised loss in the template parameter fine-tuning stage is calculated in the same way as in the template parameter optimization stage, and is not described again.
In SIII, the plurality of first prompt templates obtained in SI are fine-tuned on the basis of the visual language coding model obtained in SII; the fine-tuning process is similar to SI, with a reduced learning rate.
The method is suitable for video understanding tasks including behavior recognition, and can obtain higher recognition accuracy under the condition of using the same pre-training visual language coding model and the same training sample.
For example, the invention can be applied to video retrieval tasks; compared with video classification, the category name set is replaced by a set of retrieval words.

In addition, the invention can also perform image understanding tasks, including image classification and image retrieval, by using the visual language pre-training model together with prompt templates that have undergone only the fully-supervised template parameter optimization.
In order to verify the effectiveness of the invention, a CLIP pre-training model is adopted to generalize to behavior recognition tasks for illustration.
The CLIP model refers to the contrastive language-image pre-training model (Contrastive Language-Image Pre-Training, CLIP) proposed by Alec Radford et al. in the 2021 paper "Learning Transferable Visual Models From Natural Language Supervision". The model contains one image encoder and one text encoder, was pre-trained on a data set of 400M image-text pairs, and the two resulting encoders are applicable to a variety of image understanding tasks.
The data set HMDB51 adopted by the behavior recognition task comprises 5100 video clips, 3570 training samples and 1530 test samples. These video clips are divided into 51 behavior categories. The present embodiment randomly extracts 60% (i.e., 2142) "video-category pairs" from the training samples as a labeled set, and the remaining 40% (i.e., 1428) training samples use only video, constituting a label-free set.
1. Fully-supervised template parameter optimization.
The parameters of M=3 prompt templates were optimized using the 2142 labeled samples, with the intermediate frame image of each video as the visual input.
2. Semi-supervised model optimization.
The prompt template parameters were fixed, and the parameters of the frame fusion module in the visual language coding model were adjusted in a semi-supervised manner using the 2142 labeled samples and the 1428 unlabeled samples.
3. Fully-supervised template parameter fine-tuning.
The visual language coding model parameters were fixed, the learning rate was reduced to 1/5 of that of the previous step, and the 2142 labeled samples were used to fine-tune the parameters of the prompt templates.
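Tying the three stages together, a sketch of which parameters are trainable in each stage might look like the following; the optimizer and base learning rate are illustrative assumptions, and init_template and FrameFusion refer to the earlier sketches.

```python
# A minimal sketch of the three-stage schedule (optimizer and base learning
# rate are assumptions; init_template and FrameFusion are the earlier sketches).
import torch

ctx, _ = init_template(["a", "photo", "of", "a"], class_pos=4)
fusion = FrameFusion(dim=512, num_frames=8)

# Stage 1: fully-supervised template parameter optimization
opt1 = torch.optim.SGD([ctx], lr=1e-3)
# Stage 2: semi-supervised model optimization (template parameters fixed)
ctx.requires_grad_(False)
opt2 = torch.optim.SGD(fusion.parameters(), lr=1e-3)
# Stage 3: fully-supervised template fine-tuning (model fixed, lr reduced to 1/5)
for p in fusion.parameters():
    p.requires_grad_(False)
ctx.requires_grad_(True)
opt3 = torch.optim.SGD([ctx], lr=1e-3 / 5)
```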
4. Testing.
Testing was performed with the 1530 test samples: the prediction scores were summed and normalized to obtain the predicted category, which was compared with the category name provided by the data set to calculate the test accuracy. Table 1 compares the accuracy of the existing method and the method of the present invention. As shown in Table 1, the invention realizes generalization of the visual language pre-training model to the downstream task, improving accuracy on the test set by 8.4% compared with the existing method.
TABLE 1: Accuracy comparison on the HMDB51 test set between the existing method and the method of the present invention (the present method improves accuracy by 8.4%).
In a second aspect, the visual classification device based on multi-template prompt learning provided by the invention is described, and the visual classification device based on multi-template prompt learning described below and the visual classification method based on multi-template prompt learning described above can be referred to correspondingly. Fig. 5 illustrates a schematic structural diagram of a visual classification device based on multi-template prompt learning, as shown in fig. 5, the device includes:
A visual input module 21 for acquiring video to be classified;
the candidate text generation module 22 is used for generating, for each of a plurality of prompt templates, a candidate text set under the prompt template based on the category name set of the visual classification task; a category name is embedded into a prompt template to generate the candidate text associated with that category name under that prompt template;
the visual language coding module 23 is configured to input the continuous video frames of the video and the candidate text set into a visual language coding model, so as to obtain a category probability distribution of the video under the prompt template;
a visual classification result output module 24, configured to determine a visual classification result of the video by using a category probability distribution of the video under a plurality of prompt templates;
the visual language coding model is obtained by performing three-stage training, namely fully-supervised template parameter optimization, semi-supervised model optimization and fully-supervised template parameter fine-tuning, on a plurality of preset prompt templates and an improved visual language pre-training model by utilizing a semi-labeled visual classification sample set;
the improved visual language pre-training model is obtained by accessing a frame fusion module behind an image encoder in the visual language pre-training model; the frame fusion module is used for carrying out feature fusion on visual features of the input continuous video frames.
The invention provides a visual classification device based on multi-template prompt learning, in which three-stage training consisting of fully-supervised template parameter optimization, semi-supervised model optimization and fully-supervised template parameter fine-tuning is performed on a plurality of preset prompt templates and on a visual language pre-training model into which a frame fusion module is inserted, so as to obtain the plurality of prompt templates and the visual language coding model. This improves the utilization efficiency of training samples when the visual language pre-training model is generalized to a downstream visual understanding task, and accordingly improves accuracy when the plurality of prompt templates and the visual language coding model are applied to the downstream visual understanding task.
On the basis of the foregoing embodiments, as an optional embodiment, the generating process of the plurality of preset alert templates includes:
generating a plurality of initial prompt templates based on a given prompt template format; the given prompt template format is that a prompt template consists of a plurality of prompt character positions and a category flag bit; the plurality of initial prompt templates differ in the number of prompt character positions and/or the position of the category flag bit;

embedding a word into each prompt character position in each initial prompt template to obtain the plurality of preset prompt templates;

wherein embedding a word into any prompt character position in any initial prompt template comprises:

initializing a word to be embedded;

determining the coding serial number of the word to be embedded by using a word list;

substituting the coding serial number into a language embedding model to obtain the coding feature of the word to be embedded;

embedding the coding feature into that prompt character position in that initial prompt template.
On the basis of the above embodiments, as an alternative embodiment, embedding a category name into a prompt template is equivalent to embedding a category name into a category flag of the prompt template;
the visual language coding module comprises:
a visual feature extraction unit, configured to determine the fused visual feature of the continuous video frames of the video by using the image encoder and frame fusion module of the visual language coding model;
a text feature extraction unit for determining a text feature of each candidate text in the set of candidate texts using a text encoder of the visual language coding model;
a cross-modal feature comparison unit, configured to compare the feature similarity between the fused visual feature and the text feature of each candidate text in the candidate text set, to obtain comparison results;
a probability determination unit, configured to take the comparison result as the probability that the category name associated with each candidate text in the candidate text set is the category name of the video;
and a category probability distribution determination unit, configured to obtain the category probability distribution of the video under the prompt template based on these probabilities.
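A minimal sketch of this cross-modal comparison follows, assuming a fused clip-level feature and candidate-text features shaped as produced by the encoders above; the fixed temperature stands in for CLIP's learned logit scale and is an illustrative value.

```python
import torch
import torch.nn.functional as F

def class_probabilities(fused_feat: torch.Tensor,
                        text_feats: torch.Tensor,
                        temperature: float = 0.01) -> torch.Tensor:
    """fused_feat: (batch, dim) fused visual features; text_feats: (C, dim)
    features of the C candidate texts under one prompt template."""
    v = F.normalize(fused_feat, dim=-1)
    t = F.normalize(text_feats, dim=-1)
    logits = v @ t.T / temperature      # cosine-similarity comparison results
    return logits.softmax(dim=-1)       # (batch, C) class probability distribution
```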
Based on the foregoing embodiments, as an optional embodiment, obtaining the plurality of prompt templates and the visual language coding model by performing the three-stage training (fully supervised template parameter optimization, semi-supervised model optimization, and fully supervised template parameter fine-tuning) on the plurality of preset prompt templates and the improved visual language pre-training model by using the partially labeled visual classification sample set includes:
based on the visual language pre-training model, performing fully supervised learning on the plurality of preset prompt templates by using a first sample set contained in the visual classification sample set, so as to optimize the preset prompt templates to obtain a plurality of first prompt templates;
based on the plurality of first prompt templates, performing semi-supervised learning on the improved visual language pre-training model by using a second sample set contained in the visual classification sample set, so as to optimize the frame fusion module to obtain the visual language coding model;
based on the visual language coding model, performing fully supervised learning on the plurality of first prompt templates by using a third sample set contained in the visual classification sample set, so as to fine-tune the plurality of first prompt templates to obtain the plurality of prompt templates;
the first sample set, the second sample set and the third sample set are all obtained by processing a pre-stored video set annotated over all categories;
the samples in the first sample set are video intermediate frames carrying category labels;
in the second sample set, some samples are continuous video frames carrying category labels and the others are continuous video frames without category labels;
the samples in the third sample set are continuous video frames carrying category labels.
On the basis of the foregoing embodiments, as an optional embodiment, the performing, based on the visual language pre-training model, fully supervised learning on the plurality of preset prompt templates by using the first sample set contained in the visual classification sample set, so as to optimize the plurality of preset prompt templates to obtain the plurality of first prompt templates, includes:
for each preset prompt template and each sample in the first sample set, generating a candidate text set under the preset prompt template based on the category name set;
inputting the sample and the candidate text set under the preset prompt template into the visual language pre-training model to obtain the class probability distribution of the sample under the preset prompt template;
determining the fully supervised loss of the sample according to the category label of the sample and the class probability distributions of the sample under the plurality of preset prompt templates;
optimizing the plurality of preset prompt templates by using the fully supervised losses of the samples in the first sample set, to obtain the plurality of first prompt templates;
the performing, based on the visual language coding model, fully supervised learning on the plurality of first prompt templates by using the third sample set, so as to fine-tune the plurality of first prompt templates to obtain the plurality of prompt templates, includes:
for each first prompt template and each sample in the third sample set, generating a candidate text set under the first prompt template based on the category name set;
inputting the sample and the candidate text set under the first prompt template into the visual language coding model to obtain the class probability distribution of the sample under the first prompt template;
determining the fully supervised loss of the sample according to the category label of the sample and the class probability distributions of the sample under the plurality of first prompt templates;
and optimizing the plurality of first prompt templates by using the fully supervised losses of the samples in the third sample set, to obtain the plurality of prompt templates.
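The fully supervised objective used in both stage 1 and stage 3 can be sketched as below, with the per-template distributions produced as in the encoding module above; averaging the per-template cross-entropies, rather than summing them, is an assumption.

```python
import torch
import torch.nn.functional as F

def fully_supervised_loss(probs_per_template: list,
                          labels: torch.Tensor) -> torch.Tensor:
    """probs_per_template: M tensors of shape (batch, C), the batch's class
    distributions under each prompt template; labels: (batch,) class ids."""
    losses = [F.nll_loss(p.clamp_min(1e-8).log(), labels)
              for p in probs_per_template]
    return torch.stack(losses).mean()

# In stages 1 and 3 only the template context vectors receive gradients, e.g.:
# optimizer = torch.optim.SGD([t["context"] for t in templates], lr=1e-3)
```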
Based on the foregoing embodiments, as an optional embodiment, the performing, based on the plurality of first prompt templates, semi-supervised learning on the improved visual language pre-training model by using the second sample set contained in the visual classification sample set, so as to optimize the frame fusion module to obtain the visual language coding model, includes:
for each first prompt template and each sample in the second sample set, generating a candidate text set under the first prompt template based on the category name set;
inputting the sample and the candidate text set under the first prompt template into the visual language coding model to obtain the class probability distribution of the sample under the first prompt template;
when the sample does not carry a category label, performing uncertainty estimation on the class probability distribution of the sample under the first prompt template to obtain a pseudo-classification label and a weight of the sample under the first prompt template;
determining the unsupervised loss of the sample according to the pseudo-classification labels and weights of the sample under the plurality of first prompt templates;
when the sample carries a category label, determining the fully supervised loss of the sample according to the category label of the sample and the class probability distributions of the sample under the plurality of first prompt templates;
and optimizing the frame fusion module according to the unsupervised losses of the samples without category labels in the second sample set and the fully supervised losses of the samples with category labels, to obtain the visual language coding model.
Based on the foregoing embodiments, as an optional embodiment, the performing uncertainty estimation on the class probability distribution of the sample under the first prompt template to obtain the pseudo-classification label of the sample under the first prompt template includes:
searching for a probability value meeting a first condition in the class probability distribution of the sample under the first prompt template;
if such a probability value exists, using the category name pointed to by the probability value as the pseudo-classification label of the sample under the first prompt template; otherwise, determining that the sample has no pseudo-classification label under the first prompt template;
wherein the first condition is: the probability value is the maximum value in the class probability distribution and is greater than or equal to a confidence threshold.
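This first-condition check can be sketched as follows; the 0.7 confidence threshold is an illustrative value, not one given by the patent.

```python
import torch

def pseudo_label(probs: torch.Tensor, threshold: float = 0.7):
    """probs: (C,) class probability distribution of one sample under one
    first prompt template. Returns the pseudo-classification label index,
    or None when no probability value meets the first condition."""
    conf, idx = probs.max(dim=-1)
    return int(idx) if conf >= threshold else None
```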
On the basis of the foregoing embodiments, as an optional embodiment, the performing uncertainty estimation on the class probability distribution of the sample under the first prompt template to obtain the weight of the sample under the first prompt template includes:
traversing the class probability distributions of the sample under the first prompt templates other than this first prompt template, to find a first probability value meeting the first condition;
if the first probability value exists and the category name it points to is inconsistent with the pseudo-classification label, the uncertainty measure of the sample under the first prompt template is positive infinity;
if the first probability value exists and the category name it points to is consistent with the pseudo-classification label, or if no such first probability value exists, the uncertainty measure of the sample under the first prompt template is the standard deviation of the class probability distributions of the sample under the plurality of first prompt templates;
calculating the weight of the sample under the first prompt template by using the uncertainty measure of the sample under the first prompt template;
wherein the weight of the sample under the first prompt template is inversely related to the uncertainty measure of the sample under the first prompt template.
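A minimal sketch of this uncertainty estimate for one sample under the m-th template follows. Two readings are assumptions: the standard deviation is taken over the probabilities that the templates assign to the pseudo-label class (the text says only "standard deviation of the class probability distributions"), and exp(-u) is one convenient weight mapping that is inversely related to the uncertainty measure, as required.

```python
import math
import torch

def uncertainty_and_weight(probs_all: torch.Tensor, m: int,
                           label: int, threshold: float = 0.7):
    """probs_all: (M, C) distributions of one unlabeled sample under all M
    first prompt templates; `label` is its pseudo label under template m."""
    for j in range(probs_all.size(0)):
        if j == m:
            continue
        conf, idx = probs_all[j].max(dim=-1)
        if conf >= threshold and int(idx) != label:
            return math.inf, 0.0          # a confident, conflicting template
    u = probs_all[:, label].std().item()  # spread of the pseudo-class probability
    return u, math.exp(-u)                # weight inversely related to uncertainty
```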
On the basis of the above embodiments, as an alternative embodiment, the expression of the unsupervised loss is as follows:

$$\mathcal{L}_{u}=\frac{1}{MK}\sum_{k=1}^{K}\sum_{m=1}^{M}\omega_{i}^{k}\,\mathrm{CE}\!\left(\hat{q}_{i}^{k},\;p_{i}^{m}\right)$$

In the above formula, $M$ is the number of prompt templates, $C$ is the number of category names in the category name set, $K$ is the number of pseudo-classification labels of the $i$-th sample without a category label in the second sample set, $u_{i}^{k}$ is the uncertainty measure corresponding to the $k$-th pseudo-classification label of that sample, $\omega_{i}^{k}$ is the weight corresponding to that pseudo-classification label, $\hat{q}_{i}^{k}$ is the class probability distribution converted from that pseudo-classification label, and $p_{i}^{m}$ is the class probability distribution of the sample under the $m$-th prompt template;

wherein $\hat{q}_{i}^{k}$ is determined by the following formula:

$$\hat{q}_{i}^{k}=\mathrm{onehot}\!\left(c_{i}^{k},\,C\right)$$

In the above formula, $c_{i}^{k}$ denotes the position, within the class probability distribution to which it belongs, of the probability value corresponding to the $k$-th pseudo-classification label of the $i$-th sample without a category label in the second sample set.
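Under the reconstruction above, the per-sample unsupervised loss can be sketched as follows; the cross-entropy against the one-hot pseudo-label distribution and the 1/(MK) normalization follow that reconstruction and are therefore assumptions rather than text fixed by the patent.

```python
import torch
import torch.nn.functional as F

def unsupervised_loss(probs_all: torch.Tensor,
                      pseudo: list,
                      num_classes: int) -> torch.Tensor:
    """probs_all: (M, C) distributions of one unlabeled sample under all M
    templates; pseudo: list of (class index, weight) pairs, one per
    surviving pseudo-classification label."""
    K = max(len(pseudo), 1)
    loss = probs_all.new_zeros(())
    for cls, w in pseudo:
        q = F.one_hot(torch.tensor(cls), num_classes).float()    # one-hot target
        ce = -(q * probs_all.clamp_min(1e-8).log()).sum(dim=-1)  # (M,) CE terms
        loss = loss + w * ce.mean()                              # average over M
    return loss / K                                              # average over K
```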
In a third aspect, fig. 6 illustrates a physical schematic diagram of an electronic device. As shown in fig. 6, the electronic device may include: a processor 610, a communication interface (Communications Interface) 620, a memory 630 and a communication bus 640, wherein the processor 610, the communication interface 620 and the memory 630 communicate with each other via the communication bus 640. The processor 610 may invoke logic instructions in the memory 630 to perform the visual classification method based on multi-template prompt learning, the method comprising: acquiring a video to be classified; for each of a plurality of prompt templates, generating a candidate text set under the prompt template based on a category name set of a visual classification task, wherein embedding a category name into a prompt template generates a candidate text associated with that category name under that prompt template; inputting the continuous video frames of the video and the candidate text set into a visual language coding model to obtain the category probability distribution of the video under the prompt template; and determining the visual classification result of the video by using the category probability distributions of the video under the plurality of prompt templates; wherein the visual language coding model is obtained by performing three-stage training, namely fully supervised template parameter optimization, semi-supervised model optimization, and fully supervised template parameter fine-tuning, on a plurality of preset prompt templates and an improved visual language pre-training model by using a partially labeled visual classification sample set; the improved visual language pre-training model is obtained by connecting a frame fusion module after the image encoder of the visual language pre-training model; and the frame fusion module is configured to perform feature fusion on the visual features of the input continuous video frames.
Further, the logic instructions in the memory 630 may be implemented in the form of software functional units and, when sold or used as a stand-alone product, stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, in essence or in the part contributing to the prior art, may be embodied in the form of a software product stored in a storage medium, the software product including several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the methods according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
In a fourth aspect, the present invention also provides a computer program product, the computer program product comprising a computer program storable on a non-transitory computer-readable storage medium, wherein the computer program, when executed by a processor, can perform the visual classification method based on multi-template prompt learning provided above, the method comprising: acquiring a video to be classified; for each of a plurality of prompt templates, generating a candidate text set under the prompt template based on a category name set of a visual classification task, wherein embedding a category name into a prompt template generates a candidate text associated with that category name under that prompt template; inputting the continuous video frames of the video and the candidate text set into a visual language coding model to obtain the category probability distribution of the video under the prompt template; and determining the visual classification result of the video by using the category probability distributions of the video under the plurality of prompt templates; wherein the visual language coding model is obtained by performing three-stage training, namely fully supervised template parameter optimization, semi-supervised model optimization, and fully supervised template parameter fine-tuning, on a plurality of preset prompt templates and an improved visual language pre-training model by using a partially labeled visual classification sample set; the improved visual language pre-training model is obtained by connecting a frame fusion module after the image encoder of the visual language pre-training model; and the frame fusion module is configured to perform feature fusion on the visual features of the input continuous video frames.
In a fifth aspect, the present invention also provides a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the visual classification method based on multi-template prompt learning provided above, the method comprising: acquiring a video to be classified; for each of a plurality of prompt templates, generating a candidate text set under the prompt template based on a category name set of a visual classification task, wherein embedding a category name into a prompt template generates a candidate text associated with that category name under that prompt template; inputting the continuous video frames of the video and the candidate text set into a visual language coding model to obtain the category probability distribution of the video under the prompt template; and determining the visual classification result of the video by using the category probability distributions of the video under the plurality of prompt templates; wherein the visual language coding model is obtained by performing three-stage training, namely fully supervised template parameter optimization, semi-supervised model optimization, and fully supervised template parameter fine-tuning, on a plurality of preset prompt templates and an improved visual language pre-training model by using a partially labeled visual classification sample set; the improved visual language pre-training model is obtained by connecting a frame fusion module after the image encoder of the visual language pre-training model; and the frame fusion module is configured to perform feature fusion on the visual features of the input continuous video frames.
The apparatus embodiments described above are merely illustrative, wherein the units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement the present invention without creative effort.
From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus a necessary general-purpose hardware platform, or of course by means of hardware. Based on this understanding, the foregoing technical solution, in essence or in the part contributing to the prior art, may be embodied in the form of a software product, which may be stored in a computer-readable storage medium, such as ROM/RAM, a magnetic disk, an optical disk, etc., including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the respective embodiments or in some parts of the embodiments.
Finally, it should be noted that the above embodiments are only for illustrating the technical solutions of the present invention and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments can still be modified, or some of their technical features can be replaced by equivalents, and such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (10)

1. A visual classification method based on multi-template prompt learning, the method comprising:
acquiring videos to be classified;
for each of a plurality of prompt templates, generating a candidate text set under the prompt template based on a category name set of a visual classification task; embedding a category name into a prompt template to generate a candidate text associated with the corresponding category name under the corresponding prompt template;
inputting the continuous video frames of the video and the candidate text set into a visual language coding model to obtain category probability distribution of the video under the prompt template;
Determining a visual classification result of the video by using the category probability distribution of the video under a plurality of prompt templates;
the visual language coding model is obtained by performing three-stage training, namely fully supervised template parameter optimization, semi-supervised model optimization, and fully supervised template parameter fine-tuning, on a plurality of preset prompt templates and an improved visual language pre-training model by using a partially labeled visual classification sample set;
the improved visual language pre-training model is obtained by connecting a frame fusion module after the image encoder of the visual language pre-training model; and the frame fusion module is configured to perform feature fusion on the visual features of the input continuous video frames.
2. The visual classification method based on multi-template prompt learning according to claim 1, wherein the generation process of the plurality of preset prompt templates includes:
generating a plurality of initial prompt templates based on a given prompt template format; in the given prompt template format, a prompt template consists of a plurality of prompt character bits and one category flag bit; the initial prompt templates differ in the number of prompt character bits and/or the position of the category flag bit;
embedding a word into each prompt character bit in each initial prompt template to obtain the plurality of preset prompt templates;
wherein embedding a word into any prompt character bit in any initial prompt template includes:
initializing a word to be embedded;
determining the encoding index of the word to be embedded by using a vocabulary;
substituting the encoding index into a language embedding model to obtain the encoding feature of the word to be embedded;
embedding the encoding feature into the prompt character bit in the initial prompt template.
3. The visual classification method based on multi-template prompt learning according to claim 2, wherein embedding a category name into a prompt template is equivalent to embedding the category name into the category flag bit of the prompt template;
the inputting the continuous video frames of the video and the candidate text set into a visual language coding model to obtain the category probability distribution of the video under the prompt template comprises:
determining, with an image encoder of the visual language coding model, a fused visual feature of successive video frames of the video;
determining a text feature of each candidate text in the set of candidate texts using a text encoder of the visual language coding model;
respectively comparing the feature similarity between the fused visual feature and the text feature of each candidate text in the candidate text set, to obtain comparison results;
taking the comparison result as the probability that the category name associated with each candidate text in the candidate text set is the category name of the video;
and obtaining the category probability distribution of the video under the prompt template based on these probabilities.
4. The visual classification method based on multi-template prompt learning according to any one of claims 1-3, wherein the plurality of prompt templates and the visual language coding model are obtained by performing the three-stage training, namely fully supervised template parameter optimization, semi-supervised model optimization, and fully supervised template parameter fine-tuning, on the plurality of preset prompt templates and the improved visual language pre-training model by using the partially labeled visual classification sample set, including:
based on the visual language pre-training model, performing fully supervised learning on the plurality of preset prompt templates by using a first sample set contained in the visual classification sample set, so as to optimize the preset prompt templates to obtain a plurality of first prompt templates;
based on the plurality of first prompt templates, performing semi-supervised learning on the improved visual language pre-training model by using a second sample set contained in the visual classification sample set, so as to optimize the frame fusion module to obtain the visual language coding model;
based on the visual language coding model, performing fully supervised learning on the plurality of first prompt templates by using a third sample set contained in the visual classification sample set, so as to fine-tune the plurality of first prompt templates to obtain the plurality of prompt templates;
the first sample set, the second sample set and the third sample set are all obtained by processing a pre-stored video set annotated over all categories;
the samples in the first sample set are video intermediate frames carrying category labels;
in the second sample set, some samples are continuous video frames carrying category labels and the others are continuous video frames without category labels;
the samples in the third sample set are continuous video frames carrying category labels.
5. The visual classification method based on multi-template prompt learning according to claim 4, wherein the performing, based on the visual language pre-training model, fully supervised learning on the plurality of preset prompt templates by using the first sample set contained in the visual classification sample set, so as to optimize the plurality of preset prompt templates to obtain the plurality of first prompt templates, includes:
for each preset prompt template and each sample in the first sample set, generating a candidate text set under the preset prompt template based on the category name set;
inputting the sample and the candidate text set under the preset prompt template into the visual language pre-training model to obtain the class probability distribution of the sample under the preset prompt template;
determining the fully supervised loss of the sample according to the category label of the sample and the class probability distributions of the sample under the plurality of preset prompt templates;
optimizing the plurality of preset prompt templates by using the fully supervised losses of the samples in the first sample set, to obtain the plurality of first prompt templates;
the performing, based on the visual language coding model, fully supervised learning on the plurality of first prompt templates by using the third sample set, so as to fine-tune the plurality of first prompt templates to obtain the plurality of prompt templates, includes:
for each first prompt template and each sample in the third sample set, generating a candidate text set under the first prompt template based on the category name set;
inputting the sample and the candidate text set under the first prompt template into the visual language coding model to obtain the class probability distribution of the sample under the first prompt template;
determining the fully supervised loss of the sample according to the category label of the sample and the class probability distributions of the sample under the plurality of first prompt templates;
and optimizing the plurality of first prompt templates by using the fully supervised losses of the samples in the third sample set, to obtain the plurality of prompt templates.
6. The visual classification method based on multi-template prompt learning according to claim 4, wherein the performing, based on the plurality of first prompt templates, semi-supervised learning on the improved visual language pre-training model by using the second sample set contained in the visual classification sample set, so as to optimize the frame fusion module to obtain the visual language coding model, comprises:
for each first prompt template and each sample in the second sample set, generating a candidate text set under the first prompt template based on the category name set;
inputting the sample and the candidate text set under the first prompt template into the visual language coding model to obtain the class probability distribution of the sample under the first prompt template;
when the sample does not carry a category label, performing uncertainty estimation on the class probability distribution of the sample under the first prompt template to obtain a pseudo-classification label and a weight of the sample under the first prompt template;
determining the unsupervised loss of the sample according to the pseudo-classification labels and weights of the sample under the plurality of first prompt templates;
when the sample carries a category label, determining the fully supervised loss of the sample according to the category label of the sample and the class probability distributions of the sample under the plurality of first prompt templates;
and optimizing the frame fusion module according to the unsupervised losses of the samples without category labels in the second sample set and the fully supervised losses of the samples with category labels, to obtain the visual language coding model.
7. The visual classification method based on multi-template prompt learning according to claim 6, wherein the performing uncertainty estimation on the class probability distribution of the sample under the first prompt template to obtain the pseudo-classification label of the sample under the first prompt template comprises:
searching for a probability value meeting a first condition in the class probability distribution of the sample under the first prompt template;
if such a probability value exists, using the category name pointed to by the probability value as the pseudo-classification label of the sample under the first prompt template; otherwise, determining that the sample has no pseudo-classification label under the first prompt template;
wherein the first condition is: the probability value is the maximum value in the class probability distribution and is greater than or equal to a confidence threshold.
8. The visual classification method based on multi-template prompt learning according to claim 7, wherein the performing uncertainty estimation on the class probability distribution of the sample under the first prompt template to obtain the weight of the sample under the first prompt template comprises:
traversing the class probability distributions of the sample under the first prompt templates other than this first prompt template, to find a first probability value meeting the first condition;
if the first probability value exists and the category name it points to is inconsistent with the pseudo-classification label, the uncertainty measure of the sample under the first prompt template is positive infinity;
if the first probability value exists and the category name it points to is consistent with the pseudo-classification label, or if no such first probability value exists, the uncertainty measure of the sample under the first prompt template is the standard deviation of the class probability distributions of the sample under the plurality of first prompt templates;
calculating the weight of the sample under the first prompt template by using the uncertainty measure of the sample under the first prompt template;
wherein the weight of the sample under the first prompt template is inversely related to the uncertainty measure of the sample under the first prompt template.
9. The visual classification method based on multi-template prompt learning according to claim 6, wherein the expression of the unsupervised loss is as follows:

$$\mathcal{L}_{u}=\frac{1}{MK}\sum_{k=1}^{K}\sum_{m=1}^{M}\omega_{i}^{k}\,\mathrm{CE}\!\left(\hat{q}_{i}^{k},\;p_{i}^{m}\right)$$

In the above formula, $M$ is the number of prompt templates, $C$ is the number of category names in the category name set, $K$ is the number of pseudo-classification labels of the $i$-th sample without a category label in the second sample set, $u_{i}^{k}$ is the uncertainty measure corresponding to the $k$-th pseudo-classification label of that sample, $\omega_{i}^{k}$ is the weight corresponding to that pseudo-classification label, $\hat{q}_{i}^{k}$ is the class probability distribution converted from that pseudo-classification label, and $p_{i}^{m}$ is the class probability distribution of the sample under the $m$-th prompt template;

wherein $\hat{q}_{i}^{k}$ is determined by the following formula:

$$\hat{q}_{i}^{k}=\mathrm{onehot}\!\left(c_{i}^{k},\,C\right)$$

In the above formula, $c_{i}^{k}$ denotes the position, within the class probability distribution to which it belongs, of the probability value corresponding to the $k$-th pseudo-classification label of the $i$-th sample without a category label in the second sample set.
10. A multi-template prompt learning-based visual classification device, the device comprising:
the visual input module is used for acquiring videos to be classified; the candidate text generation module is used for generating, for each prompt template in a plurality of prompt templates, a candidate text set under the prompt template based on the category name set of the visual classification task; embedding a category name into a prompt template to generate a candidate text associated with the corresponding category name under the corresponding prompt template;
the visual language coding module is used for inputting the continuous video frames of the video and the candidate text set into a visual language coding model to obtain the category probability distribution of the video under the prompt template; the visual classification result output module is used for determining a visual classification result of the video by using the category probability distribution of the video under the plurality of prompt templates;
the visual language coding model is obtained by performing three-stage training, namely fully supervised template parameter optimization, semi-supervised model optimization, and fully supervised template parameter fine-tuning, on a plurality of preset prompt templates and an improved visual language pre-training model by using a partially labeled visual classification sample set;
the improved visual language pre-training model is obtained by connecting a frame fusion module after the image encoder of the visual language pre-training model; and the frame fusion module is configured to perform feature fusion on the visual features of the input continuous video frames.
CN202310680502.2A 2023-06-09 2023-06-09 Visual classification method and device based on multi-template prompt learning Active CN116416480B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310680502.2A CN116416480B (en) 2023-06-09 2023-06-09 Visual classification method and device based on multi-template prompt learning


Publications (2)

Publication Number Publication Date
CN116416480A true CN116416480A (en) 2023-07-11
CN116416480B CN116416480B (en) 2023-08-25

Family

ID=87049584

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310680502.2A Active CN116416480B (en) 2023-06-09 2023-06-09 Visual classification method and device based on multi-template prompt learning

Country Status (1)

Country Link
CN (1) CN116416480B (en)



Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20230154146A1 (en) * 2021-11-16 2023-05-18 Salesforce.Com, Inc. Systems and methods for video and language pre-training
CN114996513A (en) * 2022-05-11 2022-09-02 湖南大学 Video question-answering method and system based on cross-modal prompt learning
CN115658954A (en) * 2022-10-28 2023-01-31 华东师范大学 Cross-modal retrieval confrontation defense method based on prompt learning
CN115761314A (en) * 2022-11-07 2023-03-07 重庆邮电大学 E-commerce image and text classification method and system based on prompt learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
MUHAMMAD UZAIR KHATTAK et al.: "MaPLe: Multi-modal Prompt Learning", Retrieved from the Internet <URL:https://arxiv.org/abs/2210.03117> *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116824278A (en) * 2023-08-29 2023-09-29 腾讯科技(深圳)有限公司 Image content analysis method, device, equipment and medium
CN116824278B (en) * 2023-08-29 2023-12-19 腾讯科技(深圳)有限公司 Image content analysis method, device, equipment and medium
CN116994188A (en) * 2023-09-22 2023-11-03 腾讯科技(深圳)有限公司 Action recognition method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN116416480B (en) 2023-08-25


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant