CN116958700A - Image classification method based on prompt engineering and contrast learning - Google Patents

Image classification method based on prompt engineering and contrast learning

Info

Publication number
CN116958700A
CN116958700A CN202310955757.5A CN202310955757A CN116958700A CN 116958700 A CN116958700 A CN 116958700A CN 202310955757 A CN202310955757 A CN 202310955757A CN 116958700 A CN116958700 A CN 116958700A
Authority
CN
China
Prior art keywords
image
text
prompt
encoder
vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310955757.5A
Other languages
Chinese (zh)
Inventor
李兵
高绍坤
卢宇晨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Posts and Telecommunications
Original Assignee
Nanjing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Posts and Telecommunications filed Critical Nanjing University of Posts and Telecommunications
Priority to CN202310955757.5A priority Critical patent/CN116958700A/en
Publication of CN116958700A publication Critical patent/CN116958700A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G06V10/765Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects using rules for classification or partitioning the feature space
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • G06N3/0455Auto-encoder networks; Encoder-decoder networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/0895Weakly supervised learning, e.g. semi-supervised or self-supervised learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/809Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of classification results, e.g. where the classifiers operate on the same input data
    • G06V10/811Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of classification results, e.g. where the classifiers operate on the same input data the classifiers operating on different input data, e.g. multi-modal recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Multimedia (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

An image classification method based on prompt engineering and contrast learning uses Soft Prompt Tuning in both the visual and text modalities to adapt a pre-trained model to downstream tasks, realizing a cross-modal dual-path multi-layer perceptron. Based on text and image features, the method achieves high image classification accuracy on 11 public data sets, with an average recognition accuracy above 82.5%. Because the method builds on a large-scale pre-trained model and a prompt fine-tuning technique, only a small number of model parameters need to be adjusted during training, saving computing resources. By combining the contrast-learning loss with few-shot learning during training of the multi-modal CLIP model, the method also improves model convergence speed over the prior art.

Description

Image classification method based on prompt engineering and contrast learning
Technical Field
The invention relates to the fields of natural language processing, computer vision and prompt engineering in artificial intelligence, in particular to an image classification method based on prompt engineering and contrast learning.
Background
Currently, large-scale pre-trained models for Natural Language Processing (NLP) and Computer Vision (CV) are mature, and cross-modal pre-trained models have emerged as well. An important feature of a cross-modal pre-trained model is that complete texts and pictures are mapped into a common vector space; taking classification as an example, in both NLP and CV a picture or text is usually associated with a single label. Combining features from text and images helps improve model performance, but task-centered full fine-tuning (Fine-Tuning) adapts a model to a downstream task by adjusting all of its parameters, which consumes a large amount of computing resources and generally yields only mediocre performance in few-shot learning.
Prompt engineering (Prompt) is widely applied in the two single-modal fields of Natural Language Processing (NLP) and Computer Vision (CV). Through prompt fine-tuning, a large-scale pre-trained model can be adapted to downstream tasks, although a suitable prompt must be designed for each downstream task. Compared with fine-tuning the whole pre-trained model, prompt engineering achieves higher accuracy with small amounts of data and does not require adjusting all model parameters, saving computing resources and thereby improving the efficiency of the model.
In summary, developing an image classification method based on Prompt engineering (Prompt) and contrast learning is still a critical problem to be solved in the field of cross-modal models.
Disclosure of Invention
In view of the defects of the prior art, the invention uses Soft Prompt Tuning in both the visual and text modalities to adapt a pre-trained model to downstream tasks, realizes a cross-modal dual-path multi-layer perceptron, and compares few-shot learning performance on 11 data sets to demonstrate the improvement in model accuracy.
In order to achieve the above purpose, the present invention provides the following technical solutions:
the invention provides an image classification method based on Prompt engineering (Prompt) and contrast learning, which comprises the following steps:
Step (1), text and image processing: the aim is to realize image classification with a cross-modal pre-trained model. Because the labels of images are often isolated words, such as "car" or "dog", all labels are embedded into the template "a photo of a [category]" to build text features, e.g. "a photo of a car", "a photo of a dog". The constructed text features are input into a multi-modal model such as the CLIP model, and text digitization and text feature extraction are completed in the text encoder of the multi-modal model. On the image side, the TorchVision development tool is used to reshape images to a pixel size of 224 x 224 so as to digitize the images;
Step (2), constructing Prompt Tuning based on soft prompt vectors (Soft Prompt Vector): the soft prompt vector structure includes Text Prompt Tuning adapted to text data and Image Prompt Tuning adapted to image data. Text Prompt Tuning starts from the large-scale pre-trained model itself: through a prompt (Prompt) template, the downstream task is converted into a fill-in-the-blank task closer to the training objective of the language model, mining the knowledge the model learned during pre-training. Soft Prompt replaces discrete vocabulary with trainable continuous vectors, a new idea for constructing Prompt Tuning; compared with manually designed templates composed of discrete words, whose structure is fixed, a soft template avoids poor performance on specialized corpora. Similar to the text side, Prompt Tuning methods also exist for visual problems; Image Prompt Tuning builds on them by using a multi-layer perceptron instead of simple learnable parameters;
Step (3), training the text- and picture-based cross-modal analysis model CLIP: steps (1)-(2) are performed on the multi-modal data set to obtain a cross-modal data set with the corresponding prompt (Prompt) inserted, and each sample label and the corresponding cross-modal data are input into the prompt-based cross-modal analysis model (CLIP) for training;
Step (4), pre-training cross-modal analysis based on prompt engineering: the text-image data set of the few-shot learning task to be analyzed is downloaded, formed into multi-modal data pairs, and input into the cross-modal analysis model based on prompt engineering (Prompt) to obtain the training result and model accuracy.
Further, in step (1), when constructing cross-modal data, the text and picture data in a group must correspond one-to-one and share the same classification label. When testing the model, images and texts are matched with the same setting as CLIP few-shot learning, i.e. one picture is matched against the text data corresponding to all categories, such as "a photo of a dog", "a photo of an automobile". When training the model, the images and texts in each batch are matched against each other to obtain a matrix of cosine similarities; the cosine similarity at diagonal positions (i.e. matched image-text pairs) is maximized and the cosine similarity at off-diagonal positions (i.e. unmatched image-text pairs) is minimized.
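As an illustration of the batch image-text matching described above, the following is a minimal sketch in PyTorch (the patent does not specify a framework; this is an assumption). It realizes the objective of maximizing diagonal and minimizing off-diagonal cosine similarities through the symmetric cross-entropy commonly used with CLIP; the function and variable names are illustrative only.

```python
import torch
import torch.nn.functional as F

def image_text_matching_loss(image_features, text_features, tau=0.07):
    """Contrastive loss over one batch of B matched image-text pairs.

    image_features, text_features: (B, D) tensors; the i-th image matches the i-th text.
    The (B, B) cosine-similarity matrix has matched pairs on its diagonal, so
    cross-entropy with diagonal targets maximizes matched similarities and
    minimizes unmatched ones.
    """
    image_features = F.normalize(image_features, dim=-1)  # dot product == cosine similarity
    text_features = F.normalize(text_features, dim=-1)
    logits = image_features @ text_features.t() / tau     # (B, B) similarity matrix / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)            # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)        # text -> image direction
    return (loss_i2t + loss_t2i) / 2
```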
Further, in step (2), the main structure of Text Prompt Tuning adapted to the text data is represented by trainable parameters instead of the original discrete vocabulary, as shown below:
t = [V]_1 [V]_2 … [V]_M [CLASS]
where V represents a Soft Prompt Vector (dimension set to 512), M represents the number of Soft Prompt Tokens, and the Soft Prompt Vector is initialized with "a photo of a [CLASS]", where [CLASS] can be replaced with cat, dog, car, etc.;
The text with the Soft Prompt applied is then input into the text encoder and the corresponding text features are extracted; only the parameters of the Soft Prompt Vector are updated during model training.
Further, image Prompt Tuning adapted to image data, on the image modality, uses a multi-layer perceptron instead of simple learnable parameters, a specific model structure is as follows:
soft Prompt Vector is first input into the multi-layer perceptron before it is input into the image encoder, only the multi-layer perceptron and Soft Prompt Vector are trainable during model training; considering that a simple Soft Prompt Vector possibly has a phenomenon of overfitting in the model training process, a Drop Out Layer is added in the multi-Layer perceptron; in order to increase the space of the parameter change of Soft Prompt Vector in the training process, the multi-layer perceptron can reduce the dimension of Soft Prompt Vector with high dimension so as to achieve the purpose that the dimension of the output of the multi-layer perceptron is the same as the dimension of the image embedding layer; therefore, for the D1 dimension Soft Prompt Vector input into the multi-Layer perceptron, the output D2 dimension vector is inserted into the Layer of the Image Encoder, based on the multi-Layer perceptron's promt being:
[x_1, Z_1, E_1] = L_1([x_0, P, E_0])
[x_i, Z_i, E_i] = L_i([x_{i-1}, Z_{i-1}, E_{i-1}]), i = 2, 3, …, N
y = Head(x_N).
Further, in step (3), the structure of the text- and picture-based cross-modal analysis model CLIP includes a Transformer encoder for processing text modal features and a ViT encoder for processing image modal features, and cross-modal fusion is performed by maximizing the cosine similarity of matched image-text data pairs and minimizing the cosine similarity of other unmatched image-text data pairs.
Further, the main structure of the Transformer encoder for processing the text modal features comprises an encoder and a decoder, and the processing flow of the text encoder is as follows: in the Transformer, the input text is represented as a sequence of vectors, referred to as embedded vectors; then, the Transformer processes the embedded vectors through a multi-layer neural network to extract text features; the first-layer neural network of the Transformer is called the encoder, whose purpose is to represent the text as embedded vectors; the encoder consists of a plurality of convolution layers and pooling layers and is used for extracting the features of the text; the output of the encoder is a sequence of embedded vectors, each representing a feature of the text; the second-layer neural network of the Transformer is called the decoder, whose purpose is to generate a text representation from the embedded vector sequence; the decoder also consists of a plurality of convolution layers and pooling layers for generating the final representation of the text; the input of the decoder is the output of the encoder, and its output is a representation of the text reflecting the semantics and structure of the text; finally, the output of the Transformer is an embedded vector sequence that represents the final feature representation of the text.
Further, the ViT encoder structure for processing image modal features consists of an image encoder and an object detector, and the processing flow of the ViT is as follows: in the ViT, the input image is represented as embedded vectors; the embedded vectors consist of the pixel values of the image, one vector representation per pixel; next, the ViT processes the embedded vectors through a multi-layer neural network to extract the features of the image; the first-layer neural network of the ViT is called the image encoder, whose purpose is to represent the image as embedded vectors; the image encoder consists of a plurality of convolution layers and pooling layers and is used for extracting the features of the image; the output of the image encoder is a sequence of embedded vectors, each representing a feature of the image; the second-layer neural network of the ViT is called the object detector, whose purpose is to extract features of objects from the embedded vector sequence; the object detector also consists of a plurality of convolution layers and pooling layers for identifying objects in the image; the output of the object detector is a sequence of embedded vectors, each representing a feature of an object; the output of the ViT is a sequence of embedded vectors representing the final feature representation of the image; when an image is input into the ViT, the ViT first represents it as embedded vectors; then, the ViT uses the embedded vectors to process the image and extract its features; finally, the output of the ViT is a sequence of embedded vectors representing the final feature representation of the image.
Further, the confidence calculation process for the text modality and image modality is as follows: a given image-text data pair (image, text) is input into the image encoder (Image Encoder) and the text encoder (Text Encoder) respectively to extract the corresponding encoded features; for the encoded image features and text features, CLIP is responsible for maximizing the cosine similarity of matched image-text data pairs and minimizing the cosine similarity of other unmatched image-text data pairs;
For the image classification problem, the label Y of the image is added to the artificial template "a photo of a [CLASS]" to form a new text description, after which the encoded features are extracted by the text encoder; the likelihood of the final predicted category is as follows:
p(y = i | x) = exp(cos(w_i, f)/τ) / Σ_{j=1}^{K} exp(cos(w_j, f)/τ)
where τ is a temperature coefficient that CLIP can learn during training and cos(w, f) is the cosine similarity.
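The prediction likelihood above can be computed directly from the encoded features. A minimal sketch, assuming PyTorch; the symbols w, f and τ follow the definitions above, while the function name and default τ value are illustrative:

```python
import torch
import torch.nn.functional as F

def predict_class_probabilities(f, w, tau=0.01):
    """Likelihood of each of the K categories for a single image.

    f: (D,) image feature extracted by the image encoder.
    w: (K, D) text features, one per "a photo of a [CLASS]" description.
    Returns softmax over cos(w_i, f) / tau, matching the formula above.
    """
    f = F.normalize(f, dim=-1)
    w = F.normalize(w, dim=-1)
    cosine_similarities = w @ f                       # (K,) values of cos(w_i, f)
    return torch.softmax(cosine_similarities / tau, dim=-1)
```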
Compared with the prior art, the technical solution provided by the invention has the following beneficial effects:
(1) Based on text and image features, the invention achieves high image classification accuracy on 11 public data sets, with an average recognition accuracy above 82.5%.
(2) The invention is based on a large-scale pre-trained model and a prompt fine-tuning technique; only a small number of model parameters need to be adjusted during training, saving computing resources.
(3) During training of the multi-modal CLIP model, the invention combines the contrast-learning loss with few-shot learning, improving model convergence speed over the prior art.
Drawings
Fig. 1 is a cross-modal model diagram based on prompt in an embodiment of the invention.
Fig. 2 is a diagram of a Soft Prompt (Soft Prompt) structure in an embodiment of the present invention.
Fig. 3 is a block diagram of a text encoder in an embodiment of the present invention.
Fig. 4 is a block diagram of an image encoder in an embodiment of the present invention.
Detailed Description
The technical scheme of the invention is further described in detail below with reference to the attached drawings.
Referring to fig. 1, an embodiment of the present invention provides an image classification method based on prompt engineering and contrast learning, which includes the following steps:
Step (1), text and picture data set processing: the open-source image classification data sets contain the image data, the image labels are collected manually from the official websites, and a text file of the labels is constructed. Since the labels of images are often isolated words, such as "car" or "dog", all labels are embedded into the template "a photo of a [category]" to build text features, e.g. "a photo of a car", "a photo of a dog". The constructed text features are input into a multi-modal model, such as the CLIP model, and text digitization and text feature extraction are completed in the text encoder of the multi-modal model. On the image side, the TorchVision development tool is used to reshape images to a pixel size of 224 x 224 so as to digitize the images.
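A minimal sketch of this preprocessing step, assuming PyTorch/TorchVision and Pillow; the label list, the file name and the normalization statistics are illustrative assumptions rather than values fixed by the patent:

```python
from PIL import Image
from torchvision import transforms

# Illustrative label set; in practice the labels come from the data set's official class list
class_names = ["cat", "dog", "car"]

# Text side: embed every label into the "a photo of a [category]" template
text_prompts = [f"a photo of a {name}" for name in class_names]

# Image side: reshape pixels to 224 x 224 and convert the image to a tensor
# (the normalization statistics below follow common CLIP preprocessing and are assumptions here)
preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=(0.4815, 0.4578, 0.4082),
                         std=(0.2686, 0.2613, 0.2758)),
])

image_tensor = preprocess(Image.open("example.jpg").convert("RGB"))  # shape (3, 224, 224)
```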
Step (2), constructing Prompt Tuning based on soft prompt vectors (Soft Prompt Vector): Soft Prompt replaces discrete vocabulary with trainable continuous vectors, a new idea for constructing Prompt Tuning. Compared with manually designed templates composed of discrete words, whose structure is fixed, a soft template avoids poor performance on specialized corpora. Text Prompt Tuning starts from the large-scale pre-trained model itself: by designing a prompt template, the downstream task is converted into a fill-in-the-blank task closer to the training objective of the language model, mining the knowledge the model learned during pre-training. Similar to the text side, Prompt Tuning methods also exist for visual problems; Image Prompt Tuning builds on them by using a multi-layer perceptron instead of simple learnable parameters.
As shown in fig. 2, the soft prompt vector (Soft Prompt Vector) structure includes Text Prompt Tuning adapted to text data and Image Prompt Tuning adapted to image data.
The main structure of Text Prompt Tuning adapted to text data is represented by trainable parameters instead of the original discrete vocabulary, as shown below:
t = [V]_1 [V]_2 … [V]_M [CLASS]
where V represents a Soft Prompt Vector (dimension set to 512), M represents the number of Soft Prompt Tokens, and the Soft Prompt Vector is initialized with "a photo of a [CLASS]", where [CLASS] can be replaced with cat, dog, car, etc.
The text with the Soft Prompt applied is then input into the text encoder and the corresponding text features are extracted; only the parameters of the Soft Prompt Vector are updated during model training.
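A minimal sketch of this text-side Soft Prompt, assuming PyTorch; the module name, the default number of tokens M and the way the class-token embeddings are supplied are illustrative assumptions:

```python
import torch
import torch.nn as nn

class TextSoftPrompt(nn.Module):
    """Learnable context vectors [V]_1 ... [V]_M prepended to each class token embedding."""

    def __init__(self, m_tokens=16, dim=512, init_ctx=None):
        super().__init__()
        if init_ctx is not None:
            ctx = init_ctx.clone()        # e.g. token embeddings of "a photo of a", shape (M, dim)
        else:
            ctx = torch.randn(m_tokens, dim) * 0.02
        self.ctx = nn.Parameter(ctx)      # the only text-side parameters updated during training

    def forward(self, class_embeddings):
        """class_embeddings: (K, L, dim) token embeddings of the K class names.

        Returns (K, M + L, dim) sequences t = [V]_1 ... [V]_M [CLASS], which are then
        fed to the frozen text encoder to extract text features.
        """
        k = class_embeddings.size(0)
        ctx = self.ctx.unsqueeze(0).expand(k, -1, -1)
        return torch.cat([ctx, class_embeddings], dim=1)
```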
For Image Prompt Tuning adapted to image data, a multi-layer perceptron is used on the image modality instead of simple learnable parameters; the specific model structure is as follows:
The Soft Prompt Vector is first input into the multi-layer perceptron before being input into the image encoder, and only the multi-layer perceptron and the Soft Prompt Vector are trainable during model training. Considering that a plain Soft Prompt Vector may overfit during model training, a Dropout layer (Drop Out Layer) is added to the multi-layer perceptron. To enlarge the space in which the Soft Prompt Vector parameters can vary during training, the multi-layer perceptron reduces the dimension of the high-dimensional Soft Prompt Vector so that its output dimension equals the dimension of the image embedding layer. Therefore, for a D1-dimensional Soft Prompt Vector input into the multi-layer perceptron, the output D2-dimensional vector is inserted into the layers of the Image Encoder, and the prompt based on the multi-layer perceptron is:
[x_1, Z_1, E_1] = L_1([x_0, P, E_0])
[x_i, Z_i, E_i] = L_i([x_{i-1}, Z_{i-1}, E_{i-1}]), i = 2, 3, …, N
y = Head(x_N)
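Following the formulas above, here is a minimal sketch of the image-side prompt branch, assuming PyTorch; the dimensions D1 and D2, the hidden width, the number of prompt tokens and the dropout rate are illustrative assumptions:

```python
import torch
import torch.nn as nn

class ImagePromptMLP(nn.Module):
    """Maps high-dimensional Soft Prompt Vectors (D1) down to the image embedding width (D2)."""

    def __init__(self, d1=1024, d2=768, n_prompts=8, p_drop=0.1):
        super().__init__()
        self.prompts = nn.Parameter(torch.randn(n_prompts, d1) * 0.02)  # Soft Prompt Vectors P
        self.mlp = nn.Sequential(
            nn.Linear(d1, d1 // 2),
            nn.GELU(),
            nn.Dropout(p_drop),       # Dropout layer against over-fitting of the prompts
            nn.Linear(d1 // 2, d2),   # output dimension equals the image embedding layer dimension
        )

    def forward(self, batch_size):
        # (B, n_prompts, D2) prompt tokens, to be concatenated with [x_0, E_0] before the
        # first layer L_1 of the image encoder; only self.prompts and self.mlp receive
        # gradients while the image encoder itself stays frozen.
        out = self.mlp(self.prompts)
        return out.unsqueeze(0).expand(batch_size, -1, -1)
```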
step (3), training a cross-modal analysis model (CLIP) based on the Prompt: and (3) respectively executing the operations of the step (1) -the step (2) on the multi-modal data set to obtain a cross-modal data set inserted with corresponding promt, and inputting each sample label and the corresponding cross-modal data set into a cross-modal analysis model (CLIP) based on the promt for training.
The prompt-based cross-modal analysis model CLIP comprises a Transformer encoder for processing text modal features and a ViT encoder for processing image modal features; the outputs of the text and image encoders are fused across modalities by maximizing the cosine similarity of matched image-text data pairs and minimizing the cosine similarity of other unmatched image-text data pairs.
As shown in fig. 3, the main structure of the Transformer encoder for processing the text modal features includes an encoder and a decoder, and the text encoder processing flow is as follows: in the Transformer, the input text is represented as a sequence of vectors, which are called embedded vectors. Next, the Transformer processes the embedded vectors through a multi-layer neural network to extract text features. The first-layer neural network of the Transformer is called the encoder, whose purpose is to represent the text as embedded vectors. The encoder consists of a plurality of convolution layers and pooling layers for extracting features of the text. The output of the encoder is a sequence of embedded vectors, each representing a feature of the text. The second-layer neural network of the Transformer is called the decoder, whose purpose is to generate a text representation from the embedded vector sequence. The decoder also consists of a plurality of convolution layers and pooling layers for generating the final representation of the text. The input of the decoder is the output of the encoder, and its output is a representation of the text reflecting the semantics and structure of the text. Finally, the output of the Transformer is an embedded vector sequence that represents the final feature representation of the text. These features can be used for various natural language processing tasks such as text classification, emotion analysis and machine translation.
As shown in fig. 4, the ViT encoder structure for processing the image modal features consists of an image encoder and an object detector, and the ViT processing flow is as follows: in the ViT, the input image is represented as embedded vectors. The embedded vectors consist of the pixel values of the image, one vector representation per pixel. Next, the ViT processes the embedded vectors through a multi-layer neural network to extract the features of the image. The first-layer neural network of the ViT is called the image encoder, whose purpose is to represent the image as embedded vectors. The image encoder consists of a plurality of convolution layers and pooling layers for extracting the features of the image. The output of the image encoder is a sequence of embedded vectors, each representing a feature of the image. The second-layer neural network of the ViT is called the object detector, whose purpose is to extract features of objects from the embedded vector sequence. The object detector also consists of a plurality of convolution layers and pooling layers for identifying objects in the image. The output of the object detector is a sequence of embedded vectors, each representing a feature of an object. The output of the ViT is a sequence of embedded vectors representing the final feature representation of the image. When an image is input into the ViT, the ViT first represents it as embedded vectors. The ViT then uses the embedded vectors to process the image and extract its features. Finally, the output of the ViT is an embedded vector sequence representing the final feature representation of the image, which can be used for image recognition and object detection tasks.
Step (4), pre-training cross-modal analysis based on prompt engineering: the text-image data set of the few-shot learning task to be analyzed is downloaded, formed into multi-modal data pairs, and input into the cross-modal analysis model based on prompt engineering (Prompt) to obtain the training result and model accuracy.
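For steps (3)-(4), a minimal sketch of one possible training routine, under the assumption that the pre-trained CLIP backbone is frozen and only the prompt modules are optimized; `encode_text_with_prompt` and `encode_image_with_prompt` are hypothetical helpers that splice the prompt tokens into the respective encoders (they are not an existing CLIP API), and `matching_loss` is the image-text matching sketch given earlier:

```python
import itertools
import torch

def tune_prompts(clip_model, text_prompt, image_prompt, dataloader, matching_loss,
                 lr=2e-3, epochs=10):
    """Optimize only the prompt modules; the pre-trained CLIP weights stay frozen."""
    for p in clip_model.parameters():
        p.requires_grad_(False)                      # backbone frozen, saving computing resources

    params = itertools.chain(text_prompt.parameters(), image_prompt.parameters())
    optimizer = torch.optim.AdamW(params, lr=lr)

    for _ in range(epochs):
        for images, label_ids in dataloader:         # matched image / label-index batches
            # hypothetical helpers: splice the prompts into the text / image encoders
            text_feat = clip_model.encode_text_with_prompt(text_prompt)             # (K, D)
            image_feat = clip_model.encode_image_with_prompt(images, image_prompt)  # (B, D)
            loss = matching_loss(image_feat, text_feat[label_ids])  # diagonal = matched pairs
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return text_prompt, image_prompt
```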
Based on text and image features, the method achieves high image classification accuracy on 11 public data sets (ImageNet, Caltech101, Oxford-IIIT Pet, Stanford Cars, Flowers102, Food101, FGVC-Aircraft, SUN397, UCF101, DTD and EuroSAT), with an average recognition accuracy above 82.5%. The method adopts a large-scale pre-trained model and a prompt fine-tuning technique, so only a small number of model parameters need to be adjusted during training, saving computing resources; it combines the contrast-learning loss with few-shot learning during training, improving model convergence speed over the prior art.
The above description is merely of preferred embodiments of the present invention, and the scope of the present invention is not limited to the above embodiments, but all equivalent modifications or variations according to the present disclosure will be within the scope of the claims.

Claims (8)

1. An image classification method based on prompt engineering and contrast learning is characterized in that: the method comprises the following steps:
step (1), text and image processing:
in order to construct text features, all text labels are constructed according to the categories of the images; the built text features are input into the multi-modal model, and the processes of text numeralization and text feature extraction are completed in a text encoder of the multi-modal model; in the aspect of an image, the remodeling of the pixel size of the image to 224 x 224 is realized by using a TorchVision development tool so as to realize the process of image numerical value;
step (2), constructing Prompt Tuning based on a soft prompt vector (Soft Prompt Vector): the soft prompt vector (Soft Prompt Vector) includes Text Prompt Tuning adapted to text data and Image Prompt Tuning adapted to image data;
the Text Prompt Tuning, from the perspective of the large-scale pre-trained model, converts the downstream task into a fill-in-the-blank task by designing a prompt template and mines the knowledge learned by the model during pre-training; the Image Prompt Tuning replaces the learnable parameters with a multi-layer perceptron;
step (3), training the text- and picture-based cross-modal analysis model CLIP: performing steps (1)-(2) on the multi-modal data set to obtain a cross-modal data set with the corresponding Prompt inserted, and inputting each sample label and the corresponding cross-modal data into the prompt-based cross-modal analysis model CLIP for training;
step (4), pre-training cross-modal analysis based on prompt engineering: downloading the text-image data set of the few-shot learning task to be analyzed, forming multi-modal data pairs, inputting them into the cross-modal analysis model CLIP based on prompt engineering (Prompt), and obtaining the training result and model accuracy.
2. The image classification method based on prompt engineering and contrast learning according to claim 1, wherein: in step (1), when constructing cross-modal data, the text and picture data in a group must correspond one-to-one and share the same classification label; when testing the model, images and texts are matched with the same setting as CLIP few-shot learning, i.e. one picture is matched against the text data corresponding to all categories; when training the model, the images and texts in each batch are matched against each other to obtain a matrix of cosine similarities, the cosine similarity at diagonal positions (where the image and text match) is maximized, and the cosine similarity at off-diagonal positions (where the image and text do not match) is minimized.
3. The image classification method based on prompt engineering and contrast learning according to claim 1, wherein: in step (2), the main structure of Text Prompt Tuning adapted to the text data is represented by trainable parameters instead of the original discrete vocabulary, as shown below:
t = [V]_1 [V]_2 … [V]_M [CLASS]
where V represents a Soft Prompt Vector whose dimension is set to 512, M represents the number of Soft Prompt Tokens, the Soft Prompt Vector is initialized with "a photo of a [CLASS]", and [CLASS] represents the category of the image;
the text with the Soft Prompt applied is then input into the text encoder and the corresponding text features are extracted; only the parameters of the Soft Prompt Vector are updated during model training.
4. The image classification method based on prompt engineering and contrast learning according to claim 1, wherein: in step (2), for Image Prompt Tuning adapted to the image data, the learnable parameters on the image modality are replaced by a multi-layer perceptron, the specific model structure being as follows:
the Soft Prompt Vector is first input into the multi-layer perceptron before being input into the image encoder, and only the multi-layer perceptron and the Soft Prompt Vector are trainable during model training; a Dropout layer (Drop Out Layer) is added to the multi-layer perceptron; to enlarge the space in which the Soft Prompt Vector parameters can vary during training, the multi-layer perceptron reduces the dimension of the high-dimensional Soft Prompt Vector so that its output dimension equals the dimension of the image embedding layer; for a D1-dimensional Soft Prompt Vector input into the multi-layer perceptron, the output D2-dimensional vector is inserted into the layers of the Image Encoder, and the prompt based on the multi-layer perceptron is as follows,
[x_1, Z_1, E_1] = L_1([x_0, P, E_0])
[x_i, Z_i, E_i] = L_i([x_{i-1}, Z_{i-1}, E_{i-1}]), i = 2, 3, …, N
y = Head(x_N)
wherein x_0 represents the initial image features, P represents the initial Soft Prompt parameters, Z_i represents the features computed by the i-th layer of the Transformer, and y represents the extracted image features.
5. The image classification method based on prompt engineering and contrast learning according to claim 1, wherein: in step (3), the structure of the text- and picture-based cross-modal analysis model CLIP includes a Transformer encoder for processing text modal features and a ViT encoder for processing image modal features, and cross-modal fusion is performed by maximizing the cosine similarity of matched image-text data pairs and minimizing the cosine similarity of other unmatched image-text data pairs.
6. The image classification method based on prompt engineering and contrast learning according to claim 5, wherein: the main structure of the Transformer encoder for processing the text modal features comprises an encoder and a decoder, and the processing flow of the text encoder is as follows: in the Transformer, the input text is represented as a sequence of vectors, referred to as embedded vectors; then, the Transformer processes the embedded vectors through a multi-layer neural network to extract text features; the first-layer neural network of the Transformer is called the encoder, whose purpose is to represent the text as embedded vectors; the encoder consists of a plurality of convolution layers and pooling layers and is used for extracting the features of the text; the output of the encoder is a sequence of embedded vectors, each representing a feature of the text; the second-layer neural network of the Transformer is called the decoder, whose purpose is to generate a text representation from the embedded vector sequence; the decoder also consists of a plurality of convolution layers and pooling layers for generating the final representation of the text; the input of the decoder is the output of the encoder, and its output is a representation of the text reflecting the semantics and structure of the text; finally, the output of the Transformer is an embedded vector sequence that represents the final feature representation of the text.
7. The image classification method based on prompt engineering and contrast learning according to claim 5, wherein: the ViT encoder structure for processing the image modal features consists of an image encoder and an object detector, and the ViT processing flow is as follows: in the ViT, the input image is represented as embedded vectors; the embedded vectors consist of the pixel values of the image, one vector representation per pixel; next, the ViT processes the embedded vectors through a multi-layer neural network to extract the features of the image; the first-layer neural network of the ViT is called the image encoder, whose purpose is to represent the image as embedded vectors; the image encoder consists of a plurality of convolution layers and pooling layers and is used for extracting the features of the image; the output of the image encoder is a sequence of embedded vectors, each representing a feature of the image; the second-layer neural network of the ViT is called the object detector, whose purpose is to extract features of objects from the embedded vector sequence; the object detector also consists of a plurality of convolution layers and pooling layers for identifying objects in the image; the output of the object detector is a sequence of embedded vectors, each representing a feature of an object; the output of the ViT is a sequence of embedded vectors representing the final feature representation of the image; when an image is input into the ViT, the ViT first represents it as embedded vectors; then, the ViT uses the embedded vectors to process the image and extract its features; finally, the output of the ViT is a sequence of embedded vectors representing the final feature representation of the image.
8. The image classification method based on prompt engineering and contrast learning according to claim 1, wherein: the calculation process of the image-text matching for contrast learning is as follows: a given image-text data pair (image, text) is input into the aforementioned Transformer encoder for processing the text modal features and the ViT encoder for processing the image modal features respectively to extract the corresponding encoded features; for the encoded image features and text features, CLIP is responsible for maximizing the cosine similarity of matched image-text data pairs and minimizing the cosine similarity of other unmatched image-text data pairs;
for the image classification problem, the label Y of the image is added to the artificial template "a photo of a [CLASS]" to constitute a new text description, after which the encoded features are extracted by the text encoder; the likelihood of the final predicted category is as follows,
p(y = i | x) = exp(cos(w_i, f)/τ) / Σ_{j=1}^{K} exp(cos(w_j, f)/τ)
where, for a given image x and a text set y consisting of K image categories, i and j denote the i-th and j-th categories, w_i represents the text features of the i-th category extracted by the text encoder, and f represents the image features of image x extracted by the image encoder; cos(w, f) calculates the cosine similarity between the text features and the image features; τ is a fixed temperature coefficient of CLIP during training.
CN202310955757.5A 2023-08-01 2023-08-01 Image classification method based on prompt engineering and contrast learning Pending CN116958700A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310955757.5A CN116958700A (en) 2023-08-01 2023-08-01 Image classification method based on prompt engineering and contrast learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310955757.5A CN116958700A (en) 2023-08-01 2023-08-01 Image classification method based on prompt engineering and contrast learning

Publications (1)

Publication Number Publication Date
CN116958700A true CN116958700A (en) 2023-10-27

Family

ID=88451094

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310955757.5A Pending CN116958700A (en) 2023-08-01 2023-08-01 Image classification method based on prompt engineering and contrast learning

Country Status (1)

Country Link
CN (1) CN116958700A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117709356A (en) * 2024-02-06 2024-03-15 浪潮软件科技有限公司 Multi-mode large model implementation method and system oriented to organization knowledge management


Similar Documents

Publication Publication Date Title
CN110490946B (en) Text image generation method based on cross-modal similarity and antagonism network generation
CN113656570B (en) Visual question-answering method and device based on deep learning model, medium and equipment
CN109670576B (en) Multi-scale visual attention image description method
CN109918671A (en) Electronic health record entity relation extraction method based on convolution loop neural network
CN114092707A (en) Image text visual question answering method, system and storage medium
CN112818646A (en) Method for editing pictures according to texts based on generation countermeasure network and dynamic editing module
CN114676234A (en) Model training method and related equipment
CN110516530A (en) A kind of Image Description Methods based on the enhancing of non-alignment multiple view feature
CN114090780B (en) Prompt learning-based rapid picture classification method
CN115145551A (en) Intelligent auxiliary system for machine learning application low-code development
CN113140020A (en) Method for generating image based on text of countermeasure network generated by accompanying supervision
CN112015902A (en) Least-order text classification method under metric-based meta-learning framework
CN111985525A (en) Text recognition method based on multi-mode information fusion processing
CN116958700A (en) Image classification method based on prompt engineering and contrast learning
CN113780059A (en) Continuous sign language identification method based on multiple feature points
CN115203409A (en) Video emotion classification method based on gating fusion and multitask learning
CN113140023A (en) Text-to-image generation method and system based on space attention
CN116912708A (en) Remote sensing image building extraction method based on deep learning
CN113806645A (en) Label classification system and training system of label classification model
CN113297374B (en) Text classification method based on BERT and word feature fusion
CN117634459A (en) Target content generation and model training method, device, system, equipment and medium
CN116662924A (en) Aspect-level multi-mode emotion analysis method based on dual-channel and attention mechanism
US20240037335A1 (en) Methods, systems, and media for bi-modal generation of natural languages and neural architectures
CN113626537B (en) Knowledge graph construction-oriented entity relation extraction method and system
CN114170460A (en) Multi-mode fusion-based artwork classification method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination