CN116994098A - Large model prompt learning method based on category attribute knowledge enhancement - Google Patents
Large model prompt learning method based on category attribute knowledge enhancement
- Publication number
- CN116994098A CN116994098A CN202311261605.1A CN202311261605A CN116994098A CN 116994098 A CN116994098 A CN 116994098A CN 202311261605 A CN202311261605 A CN 202311261605A CN 116994098 A CN116994098 A CN 116994098A
- Authority
- CN
- China
- Prior art keywords
- attribute
- category
- prompt
- image
- generating
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/774—Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/764—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Artificial Intelligence (AREA)
- Computer Vision & Pattern Recognition (AREA)
- General Physics & Mathematics (AREA)
- Physics & Mathematics (AREA)
- Evolutionary Computation (AREA)
- Databases & Information Systems (AREA)
- Data Mining & Analysis (AREA)
- Software Systems (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- Computing Systems (AREA)
- Multimedia (AREA)
- Medical Informatics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Evolutionary Biology (AREA)
- General Engineering & Computer Science (AREA)
- Image Analysis (AREA)
Abstract
The invention discloses a large model prompt learning method based on category attribute knowledge enhancement. The method comprises: acquiring an image recognition training data set and generating a visual attribute set for each category through manual annotation or by querying ChatGPT; generating an attribute-aware prompt background through an attribute integration module; feeding the training images and the corresponding prompt sentences carrying visual attribute information into an image encoder and a text encoder, respectively, to obtain image features and text classification weights; performing contrastive learning between the image features and the text classification weights, and updating the parameters of the attribute integration module through this contrastive learning to obtain the trained attribute integration module; generating the most obvious visual attribute set of each test category according to the category space of the test task; and loading the images to be tested and the visual attribute sets of the test categories into the model and computing the similarity between the text classification weights and the image features, where the text with the maximum similarity gives the prediction result. The invention has the advantages of strong zero-shot recognition capability and strong extensibility.
Description
Technical Field
The invention relates to a large model prompt learning method based on category attribute knowledge enhancement, belonging to the technical fields of computer vision and transfer learning.
Background
Contextual prompt learning originated in natural language processing. The contextual-prompt paradigm formalizes downstream natural language processing tasks as masked language modeling problems and introduces pre-trained language models (e.g., BERT and GPT) that generate results when given appropriate prompt context templates. Compared with the traditional fine-tuning paradigm, the prompt-based paradigm can bridge the gap between downstream tasks and pre-training tasks. The example of contextual prompts has inspired the computer vision field to adapt pre-trained visual models (e.g., Vision Transformer and Swin Transformer) to downstream tasks through prompt learning. Radford et al. showed that a contextual-prompt paradigm can enable zero-shot prediction with CLIP. However, finding and designing appropriate prompt templates for a target task often takes a significant amount of time; in practice, prompt engineering proceeds through manual trial and error and careful design. To this end, Zhou et al. proposed the CoOp method, which trains a learnable prompt on the downstream dataset: prompt learning converts the prompt context into a set of continuous vectors that are optimized end-to-end for the downstream task. Lu et al. further promoted the diversity of CoOp prompts from a modeling perspective. Khattak et al. proposed the MaPLe method on top of CLIP to improve the alignment between vision and language embeddings.
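The CoOp idea described above — replacing a hand-written prompt template with continuous, learnable context vectors prepended to the class-name embedding — can be sketched in a few lines. This is an illustrative NumPy sketch only; the dimensions, the random initialization, and the single-token class names are assumptions, and no optimization is shown:

```python
import numpy as np

rng = np.random.default_rng(0)

M, D = 4, 512            # number of learnable context tokens, embedding dimension
K = 10                   # number of classes

# Learnable context vectors shared by all classes (in CoOp these are the only
# trainable parameters, optimized end-to-end with both encoders frozen).
context = rng.normal(scale=0.02, size=(M, D))

# Frozen class-name token embeddings (one token per class name for simplicity).
class_embeddings = rng.normal(size=(K, 1, D))

# Each class prompt = [context tokens ; class-name tokens].
prompts = np.stack([np.concatenate([context, class_embeddings[i]], axis=0)
                    for i in range(K)])
print(prompts.shape)  # (10, 5, 512): K prompts of M+1 tokens each
```

Feeding each `(M+1, D)` prompt through the frozen text encoder would yield one classification weight per class.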
Existing prompt learning methods for large visual-language models generally cause the learned prompt background to overfit the training data, which degrades the performance of the large model on zero-shot recognition tasks; performance is particularly poor on fine-grained zero-shot image recognition tasks.
Disclosure of Invention
In order to overcome the defects in the prior art, the invention aims to provide a large model prompt learning method based on category attribute knowledge enhancement.
The technical scheme provided by the invention for solving the technical problems is as follows: a large model prompt learning method based on category attribute knowledge enhancement comprises the following steps:
s1, acquiring an image recognition training data set and generating a visual attribute set of each category;
s2, generating an attribute-aware prompt background from the visual attribute set of each category through an attribute integration module, and combining the attribute-aware prompt background with the corresponding category name c_i to form a prompt sentence carrying the visual attribute information of the related category;
s3, respectively placing the training images and the prompt sentences of the corresponding categories carrying visual attribute information into an image encoder and a text encoder to obtain image features and text classification weights;
s4, performing contrastive learning between the image features and the text classification weights, and updating the parameters of the attribute integration module through this contrastive learning to obtain the trained attribute integration module;
s5, generating a visual attribute set of each test category through manual annotation or using ChatGPT according to the category space of the test task;
s6, generating the attribute-aware prompt background of each test category by passing the visual attribute sets of the test categories through the trained attribute integration module, and then combining the attribute-aware prompt background of each test category with the name c_i of that test category to form the prompt sentence of each test category, carrying the visual attribute information of the related category;
s7, generating image features of the image to be tested by the image encoder;
s8, generating text classification weights of the test categories through a text encoder by using prompt sentences of the test categories, wherein the prompt sentences carry visual attribute information of the related categories;
and S9, calculating the similarity between the image features of the image to be tested and the text classification weights of all the test categories, wherein the test category with the maximum similarity is the prediction result.
The further technical scheme is that in the step S1, the visual attribute set is generated through manual annotation or ChatGPT.
The further technical scheme is that the visual attribute set is the most obvious visual attribute set of each category.
The specific process of generating the attribute-aware prompt background is as follows: the M most obvious visual attributes of each category are generated through manual annotation or ChatGPT, and the attribute set of category i is expressed as A_i = {a_i,1, a_i,2, …, a_i,M}; a learnable attribute integration module f_θ generates the class-specific prompt background v_i = f_θ(A_i); the attribute-aware prompt background v_i and the corresponding category name c_i together form the prompt sentence t_i = [v_i; c_i] carrying the visual attribute information of the related category.
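The composition of the prompt sentence from an attribute set can be illustrated as follows. This is a sketch under assumptions: attribute words are represented by random embeddings, and mean pooling stands in for the learnable attribute integration module (the patent uses a two-layer fully connected network instead):

```python
import numpy as np

rng = np.random.default_rng(1)
M, D = 16, 512  # M salient attributes per class, embedding dimension D (assumed)

# Embedded attribute set of one class (e.g. 16 attribute-word embeddings).
attribute_set = rng.normal(size=(M, D))

def integrate(attrs):
    """Stand-in for the learnable attribute integration module:
    mean pooling here, a trainable two-layer MLP in the patent."""
    return attrs.mean(axis=0)            # attribute-aware prompt background

class_name_embedding = rng.normal(size=(1, D))  # embedding of the class name

v_i = integrate(attribute_set)
# Prompt sentence: prompt background token(s) followed by the class name.
t_i = np.concatenate([v_i[None, :], class_name_embedding], axis=0)
print(t_i.shape)  # (2, 512)
```

The resulting token sequence would be fed to the frozen text encoder to produce the classification weight of the class.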
Further technical scheme is that the attribute integration module is a double-layer fully-connected neural network, and the dimension of the hidden layer is set to 512.
The further technical scheme is that in step S3 the image encoder is implemented by ResNet or ViT, and the text encoder is implemented by a Transformer.
The invention has the following beneficial effects:
1. Strong zero-shot recognition capability. By learning an attribute-aware prompt background, the invention can effectively improve the model's ability to recognize unknown classes. Experiments were performed on three zero-shot recognition benchmarks, AWA2, CUB and SUN, training on the known classes of each data set and testing on the unknown classes. Compared with CoOp, the performance of the method improves markedly on zero-shot image recognition. For example, on the conventional zero-shot recognition task the invention improves AWA2 by 0.27%, CUB by 13.47% and SUN by 8.50%; on the generalized zero-shot recognition task it improves AWA2 by 1.64%, CUB by 17.17% and SUN by 10.80%.
2. Strong extensibility. The category attribute knowledge enhancement mechanism of the invention can be applied to other prompt learning frameworks (e.g., MaPLe) without adjusting the hyper-parameters of the original prompt learning framework.
3. The traditional manual attribute labeling approach may be replaced with ChatGPT queries. Specifically, ChatGPT is first queried for the attributes of each category using the following template: "Give me 16 noun attributes for [class], each attribute just one word"; the queried category attributes are then used to generate an attribute-aware prompt background that performs comparably to manually labeled attributes.
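The query template above can be wrapped in a small helper. The template string is quoted from the description; the parsing logic and the canned reply are illustrative assumptions (no real ChatGPT call is made here):

```python
# Template quoted from the description; [class] is filled with the category name.
TEMPLATE = "Give me {m} noun attributes for [{cls}], each attribute just one word"

def build_query(class_name, m=16):
    """Build the attribute query for one category."""
    return TEMPLATE.format(m=m, cls=class_name)

def parse_attributes(reply):
    """Parse a comma- or newline-separated reply into attribute words,
    stripping list numbering and trailing punctuation (assumed format)."""
    items = reply.replace("\n", ",").split(",")
    junk = " .0123456789"
    return [w.strip(junk).lower() for w in items if w.strip(junk)]

query = build_query("zebra")
print(query)
canned_reply = "stripes, mane, hooves, tail"   # stand-in for a ChatGPT response
print(parse_attributes(canned_reply))  # ['stripes', 'mane', 'hooves', 'tail']
```

In a real pipeline the reply would come from the ChatGPT API and the parsed words would be embedded and passed to the attribute integration module.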
Drawings
FIG. 1 is a flow chart of the present invention.
Detailed Description
The following describes the embodiments of the invention clearly and completely with reference to the accompanying drawings, in which some, but not all, embodiments of the invention are shown. All other embodiments obtained by a person skilled in the art based on the embodiments of the invention without inventive effort fall within the scope of the invention.
As shown in fig. 1, the large model prompt learning method based on category attribute knowledge enhancement of the present invention includes:
step one, acquiring an image recognition training data set, and generating a most obvious visual attribute set of each category through manual annotation or using ChatGPT;
the visual attribute set refers to the most obvious visual attributes of each category; for example, the white-bellied sea eagle in a bird data set has the obvious visual features of a white abdomen and a white chest;
step two, generating attribute-aware prompt backgrounds from the visual attribute sets of all categories through an attribute integration module;
The attribute-aware prompt background is generated as follows: the M most obvious visual attributes of each category are produced through manual annotation or ChatGPT, and the attribute set of category i is expressed as A_i = {a_i,1, …, a_i,M}; a learnable attribute integration module f_θ generates the class-specific prompt background v_i = f_θ(A_i); the attribute-aware prompt background v_i and the corresponding category name c_i together form the prompt sentence t_i = [v_i; c_i] carrying the visual attribute information of the category;
The attribute integration module adopts a double-layer full-connection structure (Linear-ReLU-Linear), and the dimension of the hidden layer is set to 512;
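The double-layer module can be sketched in NumPy. The Linear-ReLU-Linear structure and the hidden dimension of 512 come from the description; the input/output dimensions, the random initialization, and the mean pooling over the M attribute embeddings are assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
D_in, H, D_out = 512, 512, 512   # hidden layer dimension 512 per the description

# Parameters of the two-layer fully connected module (Linear-ReLU-Linear).
W1 = rng.normal(scale=0.02, size=(D_in, H)); b1 = np.zeros(H)
W2 = rng.normal(scale=0.02, size=(H, D_out)); b2 = np.zeros(D_out)

def attribute_integration(attrs):
    """Map an (M, D_in) attribute set to one prompt-background vector.

    Mean-pooling the M attribute embeddings before the MLP is an assumption;
    the patent does not spell out how the attributes are aggregated."""
    pooled = attrs.mean(axis=0)
    hidden = np.maximum(0.0, pooled @ W1 + b1)   # ReLU activation
    return hidden @ W2 + b2

attrs = rng.normal(size=(16, D_in))   # 16 attribute embeddings for one class
v = attribute_integration(attrs)
print(v.shape)  # (512,)
```

Only `W1`, `b1`, `W2`, `b2` would be updated during training; everything else in the model stays frozen.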
prompting background in the inventionAccording to the dynamic change of different categories, the model can sense the visual attribute information of each category, the mobility among the categories is enhanced, and the recognition capability learned on the training category is migrated to the test category.
Step three, respectively putting the training images and the prompt sentences of the corresponding categories carrying visual attribute information into an image encoder and a text encoder to obtain image features and text classification weights, wherein the text encoder is implemented by a Transformer and the image encoder is implemented by ResNet or ViT;
in order to learn the parameters effectively and prevent catastrophic forgetting of the large model during training, the parameters of the text encoder and the image encoder are frozen, and only the parameters of the attribute integration module are updated; the model needs to collect only a small number of samples per class, which reduces both the computational load and the label collection and annotation work;
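Freezing the encoders while updating only the attribute integration module amounts to restricting the gradient step to one parameter group. A toy NumPy illustration (the parameter names, shapes, and gradients are invented for the example):

```python
import numpy as np

# Toy parameter store: encoder parameters are frozen, only the attribute
# integration module's parameters receive gradient steps.
params = {
    "text_encoder.W":  np.ones((2, 2)),
    "image_encoder.W": np.ones((2, 2)),
    "attr_module.W1":  np.ones((2, 2)),
}
trainable = {name for name in params if name.startswith("attr_module")}

def sgd_step(params, grads, lr=0.1):
    for name, g in grads.items():
        if name in trainable:            # frozen parameters are skipped
            params[name] = params[name] - lr * g

grads = {name: np.ones((2, 2)) for name in params}
sgd_step(params, grads)
print(params["text_encoder.W"][0, 0], params["attr_module.W1"][0, 0])  # 1.0 0.9
```

In a deep learning framework the same effect is achieved by disabling gradients on the encoder parameters and passing only the attribute module's parameters to the optimizer.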
the prompt background carrying the visual attribute information is helpful for the model to identify unknown test categories;
step four, performing contrastive learning between the image features and the text classification weights, and updating the parameters of the attribute integration module through this contrastive learning to obtain the trained attribute integration module;
the invention adopts a contrast learning strategy and calculates model loss by using cross entropykClassification weight of classImage featurexContrast learning;
the contrast learning formula is as follows:
wherein:p(y=i|x) Is shown inxWhen it is image characteristic, it predicts labelyEqual to the firstiProbability of individual category;sim() Representing cosine similarity;Krepresenting the number of categories;τrepresenting a temperature parameter;jrepresent the firstjA category;wirefers to the firstiClassification weight of the category;wjrefers to the firstjClassification weights for the individual categories;xrepresenting image features;
the above formula would be substituted into the cross entropy function calculation model penalty to update the attribute integration module parameters.
The aim of contrastive learning is to draw each image and the prompt sentence of its own class closer in the feature space while increasing the distance to the prompt sentences of other classes, thereby maximizing the true-label score of each image;
step five, generating a most obvious visual attribute set of each test category through manual annotation or using ChatGPT according to the category space of the test task;
step six, generating the attribute-aware prompt background of each test category by passing the visual attribute set of each test category through the trained attribute integration module; the attribute-aware prompt background of each test category and the name c_i of that test category then form the prompt sentence of the test category, carrying the visual attribute information of the related category;
step seven, generating image characteristics of an image to be tested by an image encoder;
step eight, generating text classification weights of all test categories through a text encoder by using prompt sentences of all test categories, wherein the prompt sentences carry visual attribute information of related categories;
and step nine, calculating the similarity of the image characteristics of the image to be tested and the text classification weights of all the test categories, wherein the test category corresponding to the maximum similarity is the prediction result.
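Step nine reduces to an argmax over cosine similarities. A minimal sketch with toy features (the class names and vectors are invented for illustration):

```python
import numpy as np

def predict(image_feature, text_weights, class_names):
    """Return the test class whose text classification weight is most
    similar (by cosine similarity) to the image feature."""
    sims = text_weights @ image_feature / (
        np.linalg.norm(text_weights, axis=1) * np.linalg.norm(image_feature))
    return class_names[int(np.argmax(sims))]

# Toy features: the "cat" text weight is aligned with the image feature.
image_feature = np.array([1.0, 0.0, 0.0])
text_weights = np.array([[0.9, 0.1, 0.0],   # cat
                         [0.0, 1.0, 0.0],   # dog
                         [0.0, 0.0, 1.0]])  # bird
print(predict(image_feature, text_weights, ["cat", "dog", "bird"]))  # cat
```

Because the text weights for unseen test classes are built from their attribute-aware prompt sentences, this same routine performs the zero-shot prediction.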
The data sets used in the invention comprise AWA2, CUB and SUN, where AWA2 is a coarse-grained data set and CUB and SUN are fine-grained data sets. For example, a coarse-grained data set contains image collections of all animal categories, while a fine-grained data set contains image collections of refined subcategories of a certain animal.
Attribute integration module: the module uses two linear layers (Linear-ReLU-Linear). The ReLU activation function plays an important role in the neural network: it activates the neurons so that they respond to input signals, making full use of their nonlinear transformation characteristics.
The invention is not limited to the above embodiments. Any person skilled in the art may, without departing from the technical solution of the invention, make changes or modifications to obtain equivalent embodiments; any simple modification or equivalent change made to the above embodiments according to the technical substance of the invention still falls within the scope of the technical solution of the invention.
Claims (6)
1. The large model prompt learning method based on category attribute knowledge enhancement is characterized by comprising the following steps:
s1, acquiring an image recognition training data set and generating a visual attribute set of each category;
s2, generating an attribute-aware prompt background from the visual attribute set of each category through an attribute integration module, and combining the attribute-aware prompt background with the corresponding category name c_i to form a prompt sentence carrying the visual attribute information of the related category;
s3, respectively placing the training images and the prompt sentences of the corresponding categories carrying visual attribute information into an image encoder and a text encoder to obtain image features and text classification weights;
s4, performing contrastive learning between the image features and the text classification weights, and updating the parameters of the attribute integration module through this contrastive learning to obtain the trained attribute integration module;
s5, generating a visual attribute set of each test category through manual annotation or using ChatGPT according to the category space of the test task;
s6, generating the attribute-aware prompt background of each test category by passing the visual attribute sets of the test categories through the trained attribute integration module, and then combining the attribute-aware prompt background of each test category with the name c_i of that test category to form the prompt sentence of each test category, carrying the visual attribute information of the related category;
s7, generating image features of the image to be tested by the image encoder;
s8, generating text classification weights of the test categories through a text encoder by using prompt sentences of the test categories, wherein the prompt sentences carry visual attribute information of the related categories;
and S9, calculating the similarity between the image features of the image to be tested and the text classification weights of all the test categories, wherein the test category with the maximum similarity is the prediction result.
2. The large model prompt learning method based on category attribute knowledge enhancement according to claim 1, wherein the visual attribute set in step S1 is generated through manual annotation or ChatGPT.
3. The large model prompt learning method based on category attribute knowledge enhancement according to claim 1, wherein the visual attribute set is the most obvious visual attribute set of each category.
4. The large model prompt learning method based on category attribute knowledge enhancement according to claim 3, wherein the specific process of generating the attribute-aware prompt background is: the M most obvious visual attributes of each category are generated through manual annotation or ChatGPT, the attribute set of category i is expressed as A_i = {a_i,1, …, a_i,M}, a learnable attribute integration module f_θ generates the category-specific attribute-aware prompt background v_i = f_θ(A_i), and the attribute-aware prompt background v_i and the corresponding category name c_i together form the prompt sentence t_i = [v_i; c_i] carrying the visual attribute information of the related category.
5. The large model prompt learning method based on category attribute knowledge enhancement according to claim 4, wherein the attribute integration module is a double-layer fully connected neural network, and a hidden layer dimension is set to 512.
6. The large model prompt learning method based on category attribute knowledge enhancement according to claim 1, wherein the image encoder in step S3 is implemented by ResNet or ViT, and the text encoder is implemented by a Transformer.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311261605.1A CN116994098B (en) | 2023-09-27 | 2023-09-27 | Large model prompt learning method based on category attribute knowledge enhancement |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311261605.1A CN116994098B (en) | 2023-09-27 | 2023-09-27 | Large model prompt learning method based on category attribute knowledge enhancement |
Publications (2)
Publication Number | Publication Date |
---|---|
CN116994098A true CN116994098A (en) | 2023-11-03 |
CN116994098B CN116994098B (en) | 2023-12-05 |
Family
ID=88527011
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202311261605.1A Active CN116994098B (en) | 2023-09-27 | 2023-09-27 | Large model prompt learning method based on category attribute knowledge enhancement |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116994098B (en) |
Citations (16)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20190244059A1 (en) * | 2018-02-06 | 2019-08-08 | Hrl Laboratories, Llc | Machine vision system for recognizing novel objects |
CN114972795A (en) * | 2021-12-30 | 2022-08-30 | 昆明理工大学 | National clothing image subtitle generation method combining attribute detection and visual perception |
CN115311389A (en) * | 2022-08-05 | 2022-11-08 | 西北大学 | Multi-mode visual prompting technology representation learning method based on pre-training model |
CN115631365A (en) * | 2022-09-29 | 2023-01-20 | 浙江大学 | Cross-modal contrast zero sample learning method fusing knowledge graph |
CN115758998A (en) * | 2022-11-24 | 2023-03-07 | 华润数字科技有限公司 | Metaphor recognition method, electronic device, and computer-readable storage medium |
US20230075862A1 (en) * | 2021-09-08 | 2023-03-09 | Samsung Electronics Co., Ltd. | Supervised contrastive learning for visual grounding |
US20230154146A1 (en) * | 2021-11-16 | 2023-05-18 | Salesforce.Com, Inc. | Systems and methods for video and language pre-training |
CN116259075A (en) * | 2023-01-16 | 2023-06-13 | 安徽大学 | Pedestrian attribute identification method based on prompt fine tuning pre-training large model |
US20230230198A1 (en) * | 2022-01-14 | 2023-07-20 | Adobe Inc. | Utilizing a generative neural network to interactively create and modify digital images based on natural language feedback |
CN116469110A (en) * | 2023-04-18 | 2023-07-21 | 平安科技(深圳)有限公司 | Image classification method, device, electronic equipment and computer readable storage medium |
CN116468725A (en) * | 2023-06-13 | 2023-07-21 | 北京航空航天大学杭州创新研究院 | Industrial defect detection method, device and storage medium based on pre-training model |
CN116503683A (en) * | 2023-06-06 | 2023-07-28 | 重庆师范大学 | Modal interaction enhanced prompt learning method of visual language model |
CN116628303A (en) * | 2023-04-26 | 2023-08-22 | 中国科学院信息工程研究所 | Semi-structured webpage attribute value extraction method and system based on prompt learning |
CN116645683A (en) * | 2023-05-31 | 2023-08-25 | 重庆西部笔迹大数据研究院 | Signature handwriting identification method, system and storage medium based on prompt learning |
CN116662565A (en) * | 2023-05-23 | 2023-08-29 | 中国人民解放军国防科技大学 | Heterogeneous information network keyword generation method based on contrast learning pre-training |
CN116702035A (en) * | 2023-06-02 | 2023-09-05 | 中国科学院合肥物质科学研究院 | Pest identification method based on multi-mode self-supervision transducer architecture |
-
2023
- 2023-09-27 CN CN202311261605.1A patent/CN116994098B/en active Active
Patent Citations (16)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20190244059A1 (en) * | 2018-02-06 | 2019-08-08 | Hrl Laboratories, Llc | Machine vision system for recognizing novel objects |
US20230075862A1 (en) * | 2021-09-08 | 2023-03-09 | Samsung Electronics Co., Ltd. | Supervised contrastive learning for visual grounding |
US20230154146A1 (en) * | 2021-11-16 | 2023-05-18 | Salesforce.Com, Inc. | Systems and methods for video and language pre-training |
CN114972795A (en) * | 2021-12-30 | 2022-08-30 | 昆明理工大学 | National clothing image subtitle generation method combining attribute detection and visual perception |
US20230230198A1 (en) * | 2022-01-14 | 2023-07-20 | Adobe Inc. | Utilizing a generative neural network to interactively create and modify digital images based on natural language feedback |
CN115311389A (en) * | 2022-08-05 | 2022-11-08 | 西北大学 | Multi-mode visual prompting technology representation learning method based on pre-training model |
CN115631365A (en) * | 2022-09-29 | 2023-01-20 | 浙江大学 | Cross-modal contrast zero sample learning method fusing knowledge graph |
CN115758998A (en) * | 2022-11-24 | 2023-03-07 | 华润数字科技有限公司 | Metaphor recognition method, electronic device, and computer-readable storage medium |
CN116259075A (en) * | 2023-01-16 | 2023-06-13 | 安徽大学 | Pedestrian attribute identification method based on prompt fine tuning pre-training large model |
CN116469110A (en) * | 2023-04-18 | 2023-07-21 | 平安科技(深圳)有限公司 | Image classification method, device, electronic equipment and computer readable storage medium |
CN116628303A (en) * | 2023-04-26 | 2023-08-22 | 中国科学院信息工程研究所 | Semi-structured webpage attribute value extraction method and system based on prompt learning |
CN116662565A (en) * | 2023-05-23 | 2023-08-29 | 中国人民解放军国防科技大学 | Heterogeneous information network keyword generation method based on contrast learning pre-training |
CN116645683A (en) * | 2023-05-31 | 2023-08-25 | 重庆西部笔迹大数据研究院 | Signature handwriting identification method, system and storage medium based on prompt learning |
CN116702035A (en) * | 2023-06-02 | 2023-09-05 | 中国科学院合肥物质科学研究院 | Pest identification method based on multi-mode self-supervision transducer architecture |
CN116503683A (en) * | 2023-06-06 | 2023-07-28 | 重庆师范大学 | Modal interaction enhanced prompt learning method of visual language model |
CN116468725A (en) * | 2023-06-13 | 2023-07-21 | 北京航空航天大学杭州创新研究院 | Industrial defect detection method, device and storage medium based on pre-training model |
Non-Patent Citations (1)
Title |
---|
- M. MANIPARAMBIL et al.: "Enhancing CLIP with GPT-4: harnessing visual descriptions prompts", arXiv (available online at arxiv.org/pdf/2307.11661.pdf), pages 1-15 *
Also Published As
Publication number | Publication date |
---|---|
CN116994098B (en) | 2023-12-05 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN113987209B (en) | Natural language processing method, device, computing equipment and storage medium based on knowledge-guided prefix fine adjustment | |
CN112100383B (en) | Meta-knowledge fine tuning method and platform for multitask language model | |
CN112183747B (en) | Neural network training method, neural network compression method and related equipment | |
CN111680484B (en) | Answer model generation method and system for visual general knowledge reasoning question and answer | |
CN112256866B (en) | Text fine-grained emotion analysis algorithm based on deep learning | |
CN107657313B (en) | System and method for transfer learning of natural language processing task based on field adaptation | |
CN117218498B (en) | Multi-modal large language model training method and system based on multi-modal encoder | |
CN110162789A (en) | A kind of vocabulary sign method and device based on the Chinese phonetic alphabet | |
CN116821287B (en) | Knowledge graph and large language model-based user psychological portrait system and method | |
CN116975776A (en) | Multi-mode data fusion method and device based on tensor and mutual information | |
CN111666752A (en) | Circuit teaching material entity relation extraction method based on keyword attention mechanism | |
CN115063119A (en) | Recruitment decision system and method based on adaptivity of recruitment behavior data | |
CN114781375A (en) | Military equipment relation extraction method based on BERT and attention mechanism | |
Ferlitsch | Deep Learning Patterns and Practices | |
CN113297374B (en) | Text classification method based on BERT and word feature fusion | |
CN113869005A (en) | Pre-training model method and system based on sentence similarity | |
CN112905750A (en) | Generation method and device of optimization model | |
CN116994098B (en) | Large model prompt learning method based on category attribute knowledge enhancement | |
CN112597770A (en) | Sensitive information query method based on deep learning | |
CN117093692A (en) | Multi-granularity image-text matching method and system based on depth fusion | |
CN116561272A (en) | Open domain visual language question-answering method and device, electronic equipment and storage medium | |
CN114239575B (en) | Statement analysis model construction method, statement analysis method, device, medium and computing equipment | |
CN115952360A (en) | Domain-adaptive cross-domain recommendation method and system based on user and article commonality modeling | |
Zeng | Intelligent test algorithm for English writing using English semantic and neural networks | |
CN112989068A (en) | Knowledge graph construction method for Tang poetry knowledge and Tang poetry knowledge question-answering system |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |