CN116129174A - Generalized zero sample image classification method based on feature refinement self-supervision learning - Google Patents

Generalized zero sample image classification method based on feature refinement self-supervision learning

Info

Publication number
CN116129174A
Authority
CN
China
Prior art keywords
image
visual
classification
semantic
zero sample
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211568665.3A
Other languages
Chinese (zh)
Inventor
郭迎春
张玉
朱叶
于洋
师硕
吕华
阎刚
刘依
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hebei University of Technology
Tianjin Agricultural University
Original Assignee
Hebei University of Technology
Tianjin Agricultural University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hebei University of Technology, Tianjin Agricultural University filed Critical Hebei University of Technology
Priority to CN202211568665.3A priority Critical patent/CN116129174A/en
Publication of CN116129174A publication Critical patent/CN116129174A/en
Priority to CN202310723100.6A priority patent/CN117095196A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/0895Weakly supervised learning, e.g. semi-supervised or self-supervised learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The invention discloses a generalized zero sample image classification method based on feature refinement self-supervised learning. A self-supervised learning task is introduced and two classification heads are added to the Swin Transformer network, mainly to address the problem of bias toward the visible categories. Through a rotation angle classification task and a contrastive learning task, visual feature localization is enhanced and the correlation between visual features and semantic information is strengthened. Meanwhile, to further alleviate the bias problem, pseudo labels are generated for the unseen classes when constructing the training samples, so that the GZSL task is set up as transductive learning.

Description

Generalized zero sample image classification method based on feature refinement self-supervision learning
Technical Field
The invention belongs to the field of computer vision and relates to a generalized zero sample image classification method, in particular to a generalized zero sample image classification method based on feature refinement self-supervised learning.
Background
Zero-sample (zero-shot) image classification (ZSL) is a technique for performing image classification when the categories of the training set and the test set do not intersect; its aim is to predict and recognize data of unseen categories from the data of visible categories together with relevant common-sense information or prior knowledge. The auxiliary information mainly refers to semantic information, which comprises manually defined attribute vectors, text information automatically extracted by machine learning methods, or a combination of the two; semantic information builds a bridge between the visible classes and the unseen classes.
In conventional ZSL, the test set contains only samples from unseen classes; in the real world this setting is unrealistic and hard to satisfy. In practical applications, data samples of visible classes are more common than those of unseen classes, so the test set should contain both unseen-class and visible-class samples, and recognizing samples of both kinds at the same time is even more important than recognizing only unseen-class samples. Therefore, to better fit the real world, researchers have proposed Generalized Zero-Shot Learning (GZSL), which can identify samples from both visible and unseen classes.
Most existing GZSL methods are based on embedding models or on generative models. Embedding methods focus on embedding visual features and semantic descriptions into a common space, for example mapping the visual features into the semantic space and measuring the similarity between the two modalities. Patent CN113139591A discloses a generalized zero sample image classification method based on enhanced multi-modal alignment, which also performs alignment with an embedding method; it uses a hypersphere encoder to construct a latent space for the visual and semantic features, thereby promoting modality alignment. Generative methods first train a generator, such as a generative adversarial network (Generative Adversarial Network, GAN) or a variational autoencoder (Variational Autoencoder, VAE), to synthesize visual features of the unseen classes, and then train a classifier to distinguish the different classes using samples of the visible classes together with the unseen-class samples synthesized by the generator. Chinese patent CN113177587A discloses a generalized zero sample target classification method based on active learning and a variational autoencoder, which significantly improves the classification accuracy of generalized zero samples. Recently, attention-based methods have become popular because they can directly identify the parts of an image related to semantic information and thereby capture both global features and local information of the image; however, attention-based methods still suffer from the unavoidable problem of bias toward the visible classes.
Although the label spaces of the visible and unseen classes do not intersect in the GZSL setting, the visible and unseen domains still overlap during model training, especially on fine-grained datasets. For example, killer whales and humpback whales may be visible classes that can be accessed during the training phase, while dolphins are an unseen class used for testing. These three species share a large amount of visual features and semantic information; because the GZSL model uses only visible-class data during training, a dolphin sample can easily be misclassified as one of the two whale classes, which reduces the classification accuracy on the unseen classes. Although attention-based models can focus very accurately on the semantically related parts, the model can still be biased toward the visible classes because it does not specifically handle the semantically unrelated parts of the image features.
Disclosure of Invention
In order to overcome the defects and shortcomings of the prior art, the invention provides a generalized zero sample image classification method based on feature refinement self-supervised learning, which uses a Shifted Windows (Swin) Transformer to extract the visual features of an image. Unlike the existing Swin Transformer encoder, the encoder of this patent introduces self-supervised learning (Self-Supervised Learning, SSL) tasks by adding two classification heads (tokens), corresponding respectively to (1) a rotation angle classification task and (2) a contrastive learning task. In addition, a visual feature refinement module and a semantic feature refinement module are constructed to further refine the features. The visual feature refinement module mainly uses a bilinear pooling algorithm to make the visual features better suited to fine-grained datasets and fine-grained image classification, while the semantic feature refinement module mainly enhances the correlation between visual features and semantic information through image adaptive semantics (Image Adaptive Semantics, IAS) and alleviates the bias problem. The core innovations of the invention are as follows: first, the Swin Transformer model is improved by adding two tokens to introduce the self-supervised learning tasks; second, a new contrastive learning task is proposed, which constrains the visual features obtained by passing the same image through different feature extractors (ResNet101 and Swin Transformer) to be sufficiently close in the feature space, and the visual features obtained by passing different images through the same feature extractor to be sufficiently far apart in the feature space.
In order to achieve the above purpose, the present invention adopts the following technical scheme:
a generalized zero sample image classification method based on feature refinement self-supervision learning comprises the following steps:
(1) Acquiring an image dataset and a semantic attribute dataset of a generalized zero sample classification model, and training the semantic attribute dataset by utilizing a ResNet101 network to acquire a semantic feature A;
(2) The image dataset and semantic attribute dataset of the visible classes in the image dataset are used for training to obtain a visible-class conditional visual classifier conditioned on semantic attributes; the image dataset and semantic attribute dataset of the unseen classes are then used to obtain the unseen-class conditional visual classifier conditioned on semantic attributes. The weight matrix $W_s$ of the visible classes is obtained through the visible-class conditional visual classifier, and with $W_s$ used as the classification weights of the unseen-class conditional visual classifier, the pseudo label $\tilde{y}_u$ of an unseen-class image $x_u$ is obtained.
The obtained pseudo label is used as the unseen-class label; the unseen-class images in the image dataset and the pseudo labels obtained through the unseen-class conditional visual classifier form a new image dataset together with the visible-class image dataset, and this new dataset is used as the training sample of the generalized zero sample classification model for the subsequent classification training. The test sample comprises the visible-class image dataset and the unseen-class image dataset with the real unseen-class labels.
(3) Constructing a generalized zero sample classification model;
the generalized zero sample classification model comprises a visual feature refinement module, a semantic feature refinement module, a Swin transform network added with two classification tokens and two constructed self-supervision learning tasks, wherein the two newly added classification tokens respectively correspond to the two self-supervision learning tasks: the rotation angle classification task needs to randomly rotate training samples by four different angles, and finally predicts rotation types; the contrast learning task is to input training samples into ResNet101 and Swin transducer networks respectively to obtain visual features extracted by different feature extractors, and to constrain the visual features by contrast loss functions.
And (3) constructing a self-supervision learning module: the self-supervision learning module comprises a rotation angle classification task and a contrast learning task, wherein the rotation angle classification task realizes self-supervision learning through a preposition task for predicting the rotation angle of an image, images in a training sample are randomly rotated by four different angles (0 DEG, 90 DEG, 180 DEG and 270 DEG) to obtain a rotation image dataset with rotation type labels, and the rotation image dataset is input into a Swin transform network to obtain visual characteristics with rotation information corresponding to the rotation image dataset and used for predicting the rotation types of the rotation image dataset; through a Swin transducer network The visual characteristics with the rotation angle obtained by encoding are only used for predicting the rotation angle of an image, and are not directly participated in the subsequent network model of the invention, but are required to correspond to a classification token in the Swin transform network; the contrast learning task obtains different visual characteristics by inputting training samples into ResNet101 and Swin transducer respectively, and the training samples pass through L NCE The loss function constrains the entire contrast learning task.
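For illustration only, the following is a minimal PyTorch-style sketch (not part of the patent) of how the rotated image dataset with rotation-type labels described above could be assembled; the function name and the batched layout are assumptions.

```python
import torch

def make_rotation_batch(images: torch.Tensor):
    """Rotate a batch of images by 0/90/180/270 degrees and return the
    rotated images together with their rotation-class labels {0, 1, 2, 3}.

    images: (B, C, H, W) float tensor of training images.
    """
    rotated, labels = [], []
    for a in range(4):  # a indexes the rotation angle 90 * a degrees
        rotated.append(torch.rot90(images, k=a, dims=(2, 3)))
        labels.append(torch.full((images.size(0),), a, dtype=torch.long))
    return torch.cat(rotated, dim=0), torch.cat(labels, dim=0)
```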
After passing through the Swin Transformer encoder, the training samples are input into the visual feature refinement module and the semantic feature refinement module; the visual feature refinement module adopts the idea of bilinear pooling, and the semantic feature refinement module introduces image-adaptive features.
Building the visual feature refinement module: the visual feature refinement module improves on the idea of bilinear pooling and is an improved feature-fusion method; the invention adopts homologous bilinear pooling. The input of the visual feature refinement module is the visual feature x obtained by passing a training sample through the Swin Transformer network, and the module makes the visual features better suited to fine-grained datasets. The visual feature refinement module comprises a Hadamard operation, a reshape operation, a fully connected layer and a normalization layer. Specifically, the visual feature x is copied for subsequent feature fusion. The initial dimension of x is q; after the reshape operation the two copies of x are transformed into dimensions 1×q and q×1 and named x1 and x2 respectively. x1 is decomposed into a parameter matrix $U_1$ and a feature vector $\tilde{x}_1$, and x2 is decomposed into a parameter matrix $U_2$ and a feature vector $\tilde{x}_2$, such that the product of $U_1$ and $\tilde{x}_1$ is x1 and the product of $\tilde{x}_2$ and $U_2$ is x2. $U_1$ and $U_2$ are then combined by a Hadamard operation and input into the global vector layer, and $\tilde{x}_1$ and $\tilde{x}_2$ are likewise combined by a Hadamard operation and input into the global vector layer, where they are aggregated into a global vector z; after a fully connected layer and a normalization operation, the output is the refined visual feature $\hat{x}$.
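As a rough illustration of the refinement step above, here is a simplified PyTorch-style sketch in which the parameter matrices U1 and U2 are modeled as learned linear projections and the Hadamard fusion is followed by a fully connected layer and normalization; the dimensions and the module name are assumptions, not values specified by the patent.

```python
import torch
import torch.nn as nn

class VisualFeatureRefinement(nn.Module):
    """Simplified homologous bilinear pooling: two projections of the same
    visual feature are fused by a Hadamard product into a global vector z,
    then a fully connected layer and normalization give the refined feature."""
    def __init__(self, q: int = 1024, d: int = 512):
        super().__init__()
        self.U1 = nn.Linear(q, d, bias=False)  # stands in for parameter matrix U1
        self.U2 = nn.Linear(q, d, bias=False)  # stands in for parameter matrix U2
        self.fc = nn.Linear(d, q)
        self.norm = nn.LayerNorm(q)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, q)
        z = self.U1(x) * self.U2(x)   # Hadamard fusion -> global vector z
        return self.norm(self.fc(z))  # refined visual feature
```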
Building the semantic feature refinement module: the semantic feature refinement module is called the image adaptive semantics (Image Adaptive Semantics, IAS) module. The IAS module combines the original semantic features, which distinguish between classes, with image-specific attention vectors that capture intra-class variation, and then maps the image-adaptive semantic features into the corresponding visual space, thereby improving the accuracy of GZSL image classification. The IAS module comprises a 1st fully connected layer FC, a 1st classification function softmax, a Hadamard operation, a 2nd fully connected layer FC, a 3rd fully connected layer FC, a 1st normalization layer, a 2nd normalization layer and a 2nd classification function softmax. The inputs of the IAS module are the refined visual feature $\hat{x}$ output by the visual feature refinement module and the semantic feature A obtained by training the semantic attribute dataset with the ResNet101 network. The result of processing $\hat{x}$ with the 1st fully connected layer FC and the 1st classification function softmax is combined with the semantic feature A by a Hadamard operation to obtain the improved semantic feature $\tilde{A}$; the improved semantic feature $\tilde{A}$ is processed by the 2nd fully connected layer FC, the 3rd fully connected layer FC and the 1st normalization layer, after which a Hadamard operation and the 2nd classification function softmax are applied together with the result of passing the visual feature through the 2nd normalization layer, mapping the semantic features into the visual space;
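A hedged PyTorch-style sketch of the image-adaptive semantics idea follows: an attention vector predicted from the refined visual feature modulates the class semantic matrix, the adapted semantics are mapped into the visual space by two fully connected layers, and a softmax over the visual-semantic compatibilities gives class probabilities. Layer sizes and the exact ordering of the normalizations are assumptions.

```python
import torch
import torch.nn as nn

class IAS(nn.Module):
    """Sketch of image-adaptive semantics: per-image attention over the class
    semantic features, followed by a mapping into the visual space."""
    def __init__(self, vis_dim: int, sem_dim: int):
        super().__init__()
        self.attn = nn.Linear(vis_dim, sem_dim)   # 1st FC -> semantic attention
        self.fc2 = nn.Linear(sem_dim, vis_dim)
        self.fc3 = nn.Linear(vis_dim, vis_dim)
        self.norm_sem = nn.LayerNorm(vis_dim)
        self.norm_vis = nn.LayerNorm(vis_dim)

    def forward(self, x_ref, A):
        # x_ref: (B, vis_dim) refined visual features; A: (C, sem_dim) class semantics
        att = torch.softmax(self.attn(x_ref), dim=-1)       # (B, sem_dim)
        A_img = att.unsqueeze(1) * A.unsqueeze(0)            # (B, C, sem_dim) adapted semantics
        V = self.norm_sem(self.fc3(self.fc2(A_img)))         # map to visual space (B, C, vis_dim)
        logits = torch.einsum('bcd,bd->bc', V, self.norm_vis(x_ref))
        return torch.softmax(logits, dim=-1)                 # class probabilities
```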
the training samples are input into a Swin transducer network, and the classification category of the generalized zero sample image classification task is output through a visual characteristic refinement module and an IAS module in sequence, a generalized zero sample classification model is trained, and the total loss function L of the generalized zero sample classification model is trained TOT The sum of the loss function for self-supervised learning (the loss function for rotation angle classification task and the loss function for contrast learning task) and the loss for generalized zero sample classification task is formulated as:
L TOT =L CE +L MSE +L NCE (1)
wherein ,LTOT The total loss function L of the generalized zero sample classification model CE Loss function for rotation angle classification task, L MSE Loss function for generalized zero sample classification task, L NCE A loss function for the contrast learning task;
the loss function of the generalized zero sample classification task is:
Figure BDA00039871284300000312
/>
Wherein M represents the number of training samples, y i And
Figure BDA00039871284300000313
predictive labels respectively representing real labels and generalized zero sample image classification tasks;
the loss function of the rotation angle classification task is:
Figure BDA0003987128430000041
wherein ,
Figure BDA0003987128430000042
a epsilon {0,1,2,3} represents 4 rotation angles;
the loss function of the contrast learning task is:
Figure BDA0003987128430000043
wherein M represents the number of training samples, x j ,
Figure BDA0003987128430000046
Respectively representing visual characteristics of the training sample after passing through the ResNet101 network and the Swin transducer network, wherein W represents a weight matrix of the Swin transducer network, & lt/EN & gt>
Figure BDA0003987128430000044
Represents x j ,/>
Figure BDA0003987128430000045
Similarity between them.
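To make the joint objective concrete, the sketch below (an assumption-laden illustration, not the patent's code) combines an MSE classification term, a rotation cross-entropy term and an InfoNCE-style contrastive term, assuming the GZSL head outputs score vectors compared against one-hot targets and that the two backbones' features have already been projected to a common dimension.

```python
import torch
import torch.nn.functional as F

def total_loss(pred, target, rot_logits, rot_labels, feat_resnet, feat_swin, tau=0.1):
    """L_TOT = L_CE + L_MSE + L_NCE (equation (1)), sketched with common choices.

    pred, target           : (M, C) GZSL score vectors and one-hot targets
    rot_logits, rot_labels : rotation-head logits (4M, 4) and labels (4M,)
    feat_resnet, feat_swin : (M, d) features of the same images from ResNet101
                             and the Swin Transformer (positive pairs, same dim d)
    tau                    : assumed temperature of the contrastive term
    """
    l_mse = F.mse_loss(pred, target)                     # GZSL classification loss (2)
    l_ce = F.cross_entropy(rot_logits, rot_labels)       # rotation pretext loss (3)
    za = F.normalize(feat_resnet, dim=-1)
    zb = F.normalize(feat_swin, dim=-1)
    sim = za @ zb.t() / tau                              # (M, M) cosine similarities
    labels = torch.arange(za.size(0), device=za.device)  # positives on the diagonal
    l_nce = F.cross_entropy(sim, labels)                 # InfoNCE-style loss (4)
    return l_ce + l_mse + l_nce
```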
So far, a trained generalized zero sample classification model is obtained;
(4) The generalized zero sample images are recognized with the trained generalized zero sample classification model to complete the generalized zero sample classification task: unseen-class images with known semantic features are input into the trained generalized zero sample classification model to obtain their predicted labels.
The step (1) acquires an image dataset and a semantic attribute dataset, and specifically comprises the following steps:
(1.1) Existing image classification and fine-grained image classification datasets are used as the image dataset, comprising: Animals with Attributes 2 (AWA2), Caltech-UCSD Birds-200-2011 (CUB) and the SUN Attribute Database (SUN). The CUB and SUN datasets are fine-grained datasets, and the AWA2 dataset is a coarse-grained dataset. The AWA2 dataset contains 37322 animal pictures of 50 classes, of which 40 classes are used as training classes and 10 classes as test classes; the CUB dataset is a fine-grained dataset of bird pictures containing 11788 pictures of 200 bird species in total, of which 150 species are used as training classes and 50 species as test classes; the SUN dataset is a fine-grained dataset covering various environmental scenes and interior images, containing 14340 pictures of 717 classes in total, of which 645 classes are used as training classes and 72 classes as test classes. To meet the input requirement of the backbone network, i.e. the Swin Transformer network, the image resolution is unified to 224×224. The visual features $X_s$ of the visible-class images and their labels $Y_s$ are denoted as $D_s=\{X_s,Y_s\}$ and form the visible-class image dataset; the visual features $X_u$ of the unseen-class images and their labels $Y_u$ are denoted as $D_u=\{X_u,Y_u\}$ and form the unseen-class image dataset.
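A small torchvision preprocessing sketch consistent with the 224×224 input requirement is shown below; the normalization statistics are the usual ImageNet values and are an assumption, not something the patent specifies.

```python
from torchvision import transforms

# Resize every image to the 224x224 input expected by the Swin Transformer
# backbone and convert it to a normalized tensor.
preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])
```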
(1.2) In general, zero sample classification requires the construction of a semantic space by means of auxiliary semantic information, which establishes the interaction between the visual modality and the semantic modality. Auxiliary semantic information can be divided into two main types: manually defined auxiliary information and learned auxiliary information. For the manually defined auxiliary information, semantic features of different dimensions are obtained through the ResNet101 network; the semantic feature A obtained by passing the semantic attribute dataset through the ResNet101 network is expressed as $A=A_s\cup A_u$, where the subscript s denotes the visible classes and u denotes the unseen classes. The AWA2 dataset uses 85-dimensional semantic features, the CUB dataset uses 312-dimensional semantic features, and the SUN dataset uses 102-dimensional semantic features.
In order to further alleviate the problem of bias toward the visible classes, the invention reformulates generalized zero sample image classification as a visual classification problem conditioned on semantic attributes. A visible-class conditional visual classifier conditioned on semantic attributes is trained from the image dataset and semantic attribute dataset of the visible classes, and the unseen-class conditional visual classifier is then obtained from the image dataset and semantic attribute dataset of the unseen classes; the classification weights $W_s$ of the visible-class conditional visual classifier are used as the weights of the unseen-class conditional visual classifier to obtain the pseudo labels. The pseudo labels and $X_u$ form new training data, which is input into the Swin Transformer network as training samples to participate in the training process, so that the generalized zero sample method is converted into a transductive setting.
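The following PyTorch-style sketch illustrates one plausible form of a semantic-attribute-conditioned visual classifier and its use for pseudo-labelling; the single linear mapping f, the cosine scoring and the scalar sigma are assumptions made for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConditionalVisualClassifier(nn.Module):
    """Classifier weights are generated from class attributes, W = f(A),
    and an image is scored by scaled cosine similarity against those weights."""
    def __init__(self, sem_dim: int, vis_dim: int):
        super().__init__()
        self.f = nn.Linear(sem_dim, vis_dim)           # W = f(A), cf. formula (5)
        self.sigma = nn.Parameter(torch.tensor(10.0))  # learnable scale

    def forward(self, x, A):            # x: (B, vis_dim), A: (C, sem_dim)
        W = F.normalize(self.f(A), dim=-1)             # per-class weight vectors
        x = F.normalize(x, dim=-1)
        return self.sigma * x @ W.t()                  # (B, C) class scores

# Transductive pseudo-labelling: the classifier trained on visible classes is
# applied with the unseen-class attributes A_u to unseen-class features x_u:
#   pseudo_y = clf(x_u, A_u).argmax(dim=-1)
```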
The step (3) is to construct a generalized zero sample classification model, and the generalized zero sample classification model is specifically as follows:
(3.1) Swin Transformer network with two added classification tokens: the invention creatively uses the Swin Transformer network as the backbone network, which is an important attempt for the generalized zero sample image classification problem; this network has achieved strong performance on image classification, object detection and semantic segmentation tasks. After being input into the Swin Transformer network, the images in the training samples become a series of flattened 2D patches $x_p$; the resolution of each patch is p×p, and the flattened dimension of each patch is set to 48 to match the dimension of the Transformer.
After 4 stages, the flattened patch blocks yield the visual feature x of the input image; each stage comprises a patch merging layer and a Swin Transformer block, the patch merging layer is similar to a pooling operation, and the Swin Transformer block comprises a window multi-head self-attention module and a moving (shifted) window multi-head self-attention module.
(3.2) two self-supervising task mechanisms:
Rotation angle classification task: a pretext task of predicting the rotation angle of an image is constructed to realize self-supervised learning. The images in the training samples are randomly rotated by four different angles (0°, 90°, 180° and 270°) to further process the training samples into a rotated image dataset; the four rotation angles are treated as four categories with labels (0, 1, 2, 3), and the rotated image dataset is input into the Swin Transformer network to obtain the visual features with rotation information corresponding to the rotated image dataset, which are used to predict its rotation categories. The process of solving this task does not involve category attributes or semantic information, so the semantically unrelated parts of the visual features can be removed.
Contrastive learning task: the training samples are input into the ResNet101 network and the Swin Transformer network respectively to obtain the visual features of the same image from different feature extractors; the two visual features processed by the different feature extractors form a positive sample pair, the similarity between the positive pair is computed with cosine similarity, and the contrastive loss constrains the distance between the positive pair to be sufficiently small.
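Below is a hedged sketch of how the positive pair could be produced with two off-the-shelf backbones; it assumes torchvision >= 0.13 (which provides swin_b) and uses an extra linear projection, an assumption of this sketch, to align the ResNet101 and Swin feature dimensions before the cosine similarity.

```python
import torch
import torch.nn.functional as F
from torchvision import models

# Two feature extractors for the positive pair; the classification heads are
# replaced by Identity so each network returns a feature vector, and a linear
# projection aligns the ResNet101 (2048-d) and Swin (1024-d) feature sizes.
resnet = models.resnet101(weights=None)
resnet.fc = torch.nn.Identity()
swin = models.swin_b(weights=None)
swin.head = torch.nn.Identity()
proj = torch.nn.Linear(2048, 1024)

def positive_pair(images: torch.Tensor):
    """Pass the same images through both backbones to form a positive pair
    whose cosine similarity the contrastive loss pulls toward 1."""
    f_res = F.normalize(proj(resnet(images)), dim=-1)
    f_swin = F.normalize(swin(images), dim=-1)
    cos = (f_res * f_swin).sum(dim=-1)   # per-image cosine similarity
    return f_res, f_swin, cos
```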
(3.3) Building the visual feature refinement module: in order to make the visual features obtained by the Swin Transformer network better suited to fine-grained datasets, a visual feature refinement module is constructed in the GZSL method. Feature-fused visual features are obtained with an improved homologous bilinear pooling method; the input of the module is the visual feature x processed in step (3.1), and the output is the refined visual feature $\hat{x}$, which is input into the IAS module.
(3.4) Building the IAS module: in order to enrich the diversity of semantic features and better distinguish fine-grained datasets, an IAS module is constructed in the GZSL method, so that different classes can be distinguished better while intra-class visual differences are still reflected. The inputs of the IAS module are the refined visual feature $\hat{x}$ processed in step (3.3) and the semantic feature A processed by the ResNet101 network, and the output is the mapping of the visual feature x and the semantic feature A in the visual space.
(3.5) The training samples and the rotated image dataset are input into the Swin Transformer network simultaneously for feature extraction, each obtaining its corresponding visual feature x; the obtained visual features are used for the generalized zero sample image classification task and the rotation angle classification task respectively, so that step (3.1) and step (3.2) are trained jointly and share the same Swin Transformer network. The generalized zero sample classification model is trained with the total loss function $L_{TOT}$, the sum of the self-supervised learning losses and the loss of the generalized zero sample classification task;
the Swin Transformer network, the visual feature refinement module and the IAS module are trained together; the iteration counter is initialized to k=1 and the maximum number of iterations is K, with K ≥ 30;
obtaining a trained generalized zero sample classification model.
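A minimal training-loop sketch follows; the optimizer, the learning rate and the hypothetical compute_total_loss method are assumptions (the text only states that the three modules are trained together for K ≥ 30 iterations).

```python
import torch

def train(model, loader, K=30, lr=1e-4):
    """Jointly train the Swin backbone, the visual feature refinement module
    and the IAS module for K iterations (K >= 30 per the description)."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for k in range(1, K + 1):
        for images, labels, attrs in loader:
            optimizer.zero_grad()
            # hypothetical helper returning L_TOT of equation (1)
            loss = model.compute_total_loss(images, labels, attrs)
            loss.backward()
            optimizer.step()
    return model
```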
(4) The classification model recognizes generalized zero sample images to complete the generalized zero sample classification task. The images in the test sample are fed into step (3.1) for feature extraction; the visual features obtained in step (3.1) and the semantic features are then input into the feature refinement modules to obtain the mapping of visual and semantic features in the visual space, where the probability of each class is finally computed to obtain the classification result.
The invention also protects a computer readable storage medium having stored thereon a computer program which when executed by a processor performs the steps of the generalized zero sample image classification method.
Compared with the prior art, the invention has the beneficial effects that:
(1) The invention provides a novel generalized zero sample image classification method, which introduces self-supervised learning tasks and adds two classification heads to the Swin Transformer network, mainly to address the problem of bias toward the visible categories. Through the rotation angle classification task and the contrastive learning task, visual feature localization is enhanced and the correlation between visual features and semantic information is strengthened; meanwhile, to further alleviate the bias problem, pseudo labels are generated for the unseen classes when constructing the training samples, so that the GZSL task is set up as transductive learning.
(2) The method realizes the refinement of both visual and semantic features. Visual feature refinement is mainly realized by applying a bilinear pooling method; semantic feature refinement generates a specific semantic feature for each image, introducing the idea of representing each class with image-adaptive features. These features better distinguish the semantic features of samples within a class, improve the accuracy of the final image classification, and avoid describing a class with a single semantic feature.
(3) The method of the invention has excellent performance on different data sets, greatly improves the accuracy of generalized zero sample identification, and shows that the neural network constructed by the method has stronger expression capability and generalization capability.
Drawings
FIG. 1 is a flow chart of the generalized zero sample image classification method and system of the present invention.
Fig. 2 is a network overview architecture diagram of the generalized zero sample classification model of the present invention.
FIG. 3 is a schematic diagram of the overall network structure of the Swin Transformer of the present invention.
FIG. 4 is a schematic diagram of the network structure of the Swin Transformer block of the present invention.
Fig. 5 is a schematic diagram of a network structure of the visual feature refinement module of the present invention.
Fig. 6 is a schematic diagram of a network structure of an IAS module according to the present invention.
Fig. 7 is a schematic structural diagram of the transductive setting of the present invention.
Detailed Description
The technical scheme of the present invention is described in further detail below with reference to specific embodiments and the accompanying drawings, but the scope of protection of the present application is not limited thereto.
The invention relates to a generalized zero sample image classification method based on feature refinement self-supervision learning, which is characterized by comprising the following steps:
(1) Acquiring an image dataset and a semantic attribute dataset of a generalized zero sample classification model, and training the semantic attribute dataset by utilizing a ResNet101 network to acquire a semantic feature A;
(2) A conditional visual classifier that takes the image dataset and the semantic attribute dataset as input is constructed to generate pseudo labels for the unseen-class images; the obtained pseudo labels are used as the unseen-class labels, the unseen-class images in the image dataset and the pseudo labels obtained through the conditional visual classifier form a new image dataset together with the visible-class image dataset, and this new dataset serves as the training sample of the generalized zero sample classification model for the subsequent classification training;
(3) Constructing a generalized zero sample classification model;
the generalized zero sample classification model comprises a Swin Transformer network, a visual feature refinement module and a semantic feature refinement module, wherein the visual feature refinement module is realized by introducing a bilinear pooling method, the semantic feature refinement module combines original semantic features for distinguishing between classes with image specific attention vectors for intra-class changes through an image self-adaption module, and then maps the image self-adaption semantic features into corresponding visual spaces to realize GZSL image classification;
training process of generalized zero sample classification model:
Building the self-supervision module: the self-supervision module comprises a rotation angle classification task and a contrastive learning task. The rotation angle classification task randomly rotates the images in the training samples by four different angles to obtain a rotated image dataset with rotation-type labels; the rotated image dataset is input into the Swin Transformer network (as a second input) to obtain the visual features with rotation information corresponding to the rotated image dataset, which are used to predict its rotation types;
The contrastive learning task updates the parameters of the Swin Transformer network by constraining, with a contrastive loss, the distance between the positive sample pairs obtained by passing the same training sample image through the ResNet101 network and the Swin Transformer network.
The training samples and the rotated image dataset are input into the Swin Transformer network; the visual feature x obtained from the training samples is input into the visual feature refinement module and the IAS module, the classification categories of the generalized zero sample image classification task are output, and the generalized zero sample classification model is trained. The total loss function $L_{TOT}$ of the generalized zero sample classification model is the sum of the self-supervised learning losses and the loss of the generalized zero sample classification task, formulated as:

$L_{TOT} = L_{CE} + L_{MSE} + L_{NCE}$    (1)

where $L_{TOT}$ is the total loss function of the generalized zero sample classification model, $L_{CE}$ is the loss function of the rotation angle classification task, $L_{MSE}$ is the loss function of the generalized zero sample classification task, and $L_{NCE}$ is the loss function of the contrastive learning task.

The loss function of the generalized zero sample classification task is:

$L_{MSE} = \frac{1}{M}\sum_{i=1}^{M}\left(y_i - \hat{y}_i\right)^2$    (2)

where M is the number of training samples, and $y_i$ and $\hat{y}_i$ denote the real label and the predicted label of the generalized zero sample image classification task, respectively;

The loss function of the rotation angle classification task (the self-supervised learning loss) is:

$L_{CE} = -\frac{1}{4M}\sum_{i=1}^{M}\sum_{a=0}^{3}\log p\!\left(a \mid g(x_i, a)\right)$    (3)

where $g(\cdot, a)$ rotates an image by 90°·a and $a \in \{0,1,2,3\}$;

The loss function of the contrastive learning task is:

$L_{NCE} = -\frac{1}{M}\sum_{j=1}^{M}\log\frac{\exp\!\left(\mathrm{sim}(x_j, \tilde{x}_j)\right)}{\sum_{k=1}^{M}\exp\!\left(\mathrm{sim}(x_j, \tilde{x}_k)\right)}$    (4)

where M is the number of training samples, $x_j$ and $\tilde{x}_j$ denote the visual features of the same training sample after the ResNet101 network and the Swin Transformer network respectively, W denotes the weight matrix of the Swin Transformer network, and $\mathrm{sim}(x_j, \tilde{x}_j)$ denotes the similarity between $x_j$ and $\tilde{x}_j$.
So far, a trained generalized zero sample classification model is obtained;
(4) And identifying the generalized zero sample image by using the trained generalized zero sample classification model to finish the classification task of the generalized zero sample.
The pseudo label is obtained as follows: the image dataset and semantic attribute dataset of the visible classes in the image dataset are used for training to obtain a visible-class conditional visual classifier conditioned on semantic attributes, and the image dataset and semantic attribute dataset of the unseen classes are then used to obtain the unseen-class conditional visual classifier conditioned on semantic attributes. The weight matrix $W_s$ of the visible classes is obtained through the visible-class conditional visual classifier, and with $W_s$ used as the classification weights of the unseen-class conditional visual classifier, the pseudo label $\tilde{y}_u$ of the unseen-class image $x_u$ is obtained.
Fig. 2 is an overview of the network architecture of the generalized zero sample classification model of the present invention. The generalized zero sample classification model consists of the Swin Transformer network and the feature refinement modules, and has three inputs: two image inputs and one semantic attribute input. The training samples of the GZSL dataset pass through the Swin Transformer network to obtain the visual feature x, and the generalized zero sample classification task prediction is supervised by the MSE loss. The rotated image dataset is obtained by randomly rotating the original image dataset by one of four angles; it is also input into the Swin Transformer network to obtain its corresponding visual features, realizing the pretext task of predicting the rotation angle, which is trained together with the generalized zero sample classification task prediction. Meanwhile, the contrastive learning task is introduced and the Swin Transformer network is trained jointly. Then, the visual feature x obtained from the original image dataset passes through the visual feature refinement module to obtain the refined visual feature, and the refined visual feature and the semantic feature A are input into the IAS module, which supplements the semantic features with image-adaptive attention, thereby expanding the semantic features of each class.
Fig. 3 shows the Swin Transformer network structure of the invention: the input is the resized original dataset, which is partitioned into patch blocks; the patches pass through a linear embedding layer and are then fed sequentially into 4 stages, each comprising a patch merging layer and Swin Transformer blocks, finally yielding the visual features corresponding to the training samples.
FIG. 4 is a schematic diagram of a network architecture of a Swin Transformer block according to the present invention, comprising a window multi-head self-attention module and a moving window multi-head self-attention module, the window multi-head self-attention module comprising a window multi-head self-attention W-MSA operation and an MLP operation, each operation being preceded by a Norm layer, a residual connection being present between the two operations, the moving window multi-head self-attention module being similar to the window multi-head self-attention module, comprising a moving window multi-head self-attention SW-MSA operation and an MLP operation.
Fig. 5 is a schematic diagram of the network structure of the visual feature refinement module of the present invention. The input of the visual feature refinement module is the visual feature x obtained by passing a training sample through the Swin Transformer network, and the module makes the visual features better suited to fine-grained datasets. The visual feature refinement module comprises a Hadamard operation, a reshape operation, a fully connected layer and a normalization layer. Specifically, the visual feature x is copied for subsequent feature fusion; the initial dimension of x is q, and after the reshape operation the two feature vectors are transformed into dimensions 1×q and q×1 and named x1 and x2 respectively. x1 is decomposed into a parameter matrix $U_1$ and a feature vector $\tilde{x}_1$, and x2 is decomposed into a parameter matrix $U_2$ and a feature vector $\tilde{x}_2$, such that the product of $U_1$ and $\tilde{x}_1$ is x1 and the product of $\tilde{x}_2$ and $U_2$ is x2. $U_1$ and $U_2$ are combined by a Hadamard operation and input into the global vector layer, and $\tilde{x}_1$ and $\tilde{x}_2$ are likewise combined by a Hadamard operation and input into the global vector layer, where they are aggregated into a global vector z; after a fully connected layer and a normalization operation, the output is the refined visual feature $\hat{x}$.
Fig. 6 is a schematic diagram of the network structure of the IAS module of the present invention. The inputs of the IAS module are the refined visual feature $\hat{x}$ obtained by the visual feature refinement module and the semantic feature A extracted by passing the manually annotated semantic attribute dataset through the ResNet101 network; a self-attention mechanism is used to obtain the specific semantic features corresponding to each image. The module comprises a 1st fully connected layer FC, a 1st classification function softmax, a Hadamard operation, a 2nd fully connected layer FC, a 3rd fully connected layer FC, a 1st normalization layer, a 2nd normalization layer and a 2nd classification function softmax. The result of processing the refined visual feature $\hat{x}$ with the 1st fully connected layer FC and the 1st classification function softmax is combined with the semantic feature by a Hadamard operation to obtain the improved semantic feature $\tilde{A}$; the improved semantic feature $\tilde{A}$ is processed by the 2nd fully connected layer FC, the 3rd fully connected layer FC and the 1st normalization layer, and a Hadamard operation and the 2nd classification function softmax are then applied together with the result of passing the visual feature x through the 2nd normalization layer, mapping the semantic features into the visual space.
As shown in fig. 1, the invention provides a generalized zero sample image classification method based on feature refinement self-supervision learning, which comprises the following steps:
step one: acquiring an image dataset and a semantic attribute dataset of a generalized zero sample classification model;
acquisition of image dataset D containing visible classes from AWA2, CUB, SUN dataset s ={X s ,Y s And (c) wherein the AWA2 dataset comprises 23547 visible class images, the CUB dataset comprises 7057 visible class images, and the SUN dataset comprises 10320 visible class images. At the same time, the image data set D which is not seen is selected from different data sets u ={X u ,Y u The AWA2, CUB, and SUN datasets contained 7913, 2967, and 1440 unseen class images, respectively. The visible and invisible classes corresponding to the different data sets are also different, the AWA2 data set includes 40 visible classes and 10 invisible classes, the CUB data set includes 150 visible classes and 50 invisible classes, and the SUN data set includes 645 visible classes and 72 invisible classes. After the visible class and the invisible class are divided, all images are preprocessed, and image samples are cut into uniform sizes, wherein the image resolution is 224 multiplied by 224. The semantic feature A dimensions are also different for each dataset, with the AWA2 dataset using 85-dimensional semantic features, the CUB dataset using 312-dimensional semantic features, and the SUN dataset using 102-dimensional semantic features.
Step two: training samples:
As shown in fig. 7, a conditional visual classifier that takes the image dataset and the semantic attribute dataset as input is constructed to convert the GZSL problem into a new conditional visual classification problem, generating pseudo labels for the unseen-class images and thereby converting generalized zero sample learning into a transductive zero sample learning method. The input consists of the visible-class image dataset $D_s=\{X_s,Y_s\}$ and the semantic attribute dataset $A=A_s\cup A_u$:
W=f(A) (5)
where W denotes a weight matrix and f denotes the mapping from the semantic attribute dataset to the weight matrix learned by the conditional visual classifier. The visible-class image dataset $D_s=\{X_s,Y_s\}$ and the semantic attribute dataset are used as input, and the visible-class weight matrix $W_s$ is obtained through formula (5); at the same time, the unseen-class data are input into the conditional visual classifier to obtain the unseen-class conditional visual classifier, and the visible-class weight matrix $W_s$ is substituted into the unseen-class conditional visual classifier to obtain the pseudo label $\tilde{y}_u$ of the unseen-class image $X_u$. The pseudo labels of the unseen-class images are calculated from $X_u$ according to formula (6):

$\rho\!\left(\tilde{y}_u = i \mid x_u\right) = \frac{\exp\!\left(\sigma \cos(W_i, x_u)\right)}{\sum_{j=1}^{num}\exp\!\left(\sigma \cos(W_j, x_u)\right)}$    (6)
where σ is a learnable scalar, $W_i$ is the conditional visual classifier weight vector of the i-th class, num is the total number of classes, and ρ is the probability. To avoid noise in the generated pseudo labels $\tilde{y}_u$, a filtering strategy is proposed to mitigate the effect of noise. Let $s_i^{1}$ and $s_i^{2}$ be the first (highest) and second (second-highest) classification scores of an unseen-class image $u_i \in X_u$. The pseudo label assigned to $u_i$ should satisfy formula (7):

$s_i^{1} - s_i^{2} > \gamma$    (7)
where γ is the threshold used to control the score gap; this constraint prevents incorrect label assignments from being used in the conditional visual classifier. Furthermore, a loss function is provided for the conditional visual classifier task:

$L_{T_u} = -\sum_{x_u \in X_u} \log \rho\!\left(\tilde{y}_u \mid x_u\right)$    (8)

where $\rho(\tilde{y}_u \mid x_u)$ is the probability that $x_u$ belongs to $\tilde{y}_u$; the threshold γ is a hyperparameter, and when the noise level is high a higher value is preferentially chosen. $T_u$ is the classification task, and this loss function updates the conditional visual classifier with the generated pseudo labels during back propagation.
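A short sketch of the pseudo-label filtering idea is given below; the top-2 score gap criterion is the assumed reading of constraint (7), and the function name and the default threshold are illustrative.

```python
import torch

def filter_pseudo_labels(scores: torch.Tensor, gamma: float = 0.1):
    """Keep a pseudo label only when the gap between the highest and the
    second-highest class scores exceeds the threshold gamma.

    scores: (N, num) conditional-classifier scores for unseen-class images.
    Returns the argmax pseudo labels and a boolean mask of retained samples.
    """
    top2 = scores.topk(2, dim=-1).values      # s1 (best) and s2 (runner-up)
    keep = (top2[:, 0] - top2[:, 1]) > gamma
    return scores.argmax(dim=-1), keep
```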
The pseudo labels and $X_u$ form new training data, which is input into the Swin Transformer network as training samples to participate in the training process, so that the generalized zero sample method is converted into a transductive setting.
As shown in fig. 2, the visible-class image dataset, the unseen-class images and the pseudo labels form a new image dataset that is used as the training sample of the generalized zero sample classification model, and the images in the training samples are input into the Swin Transformer network to obtain their corresponding visual features x.
Step three: constructing a generalized zero sample classification model;
step 3.1, as shown in FIG. 3, to conform to the structure of the transducer, the training sample isThe image in this becomes a series of flattened 2D patches
Figure BDA0003987128430000109
In the invention, c=3 and p=4, namely, splitting the image of the training sample into patch blocks, and the characteristic dimension of each patch block after flattening is 4×4×3=48. The image resolution in the training samples is (H, W), N is the number of patches and is the effective sequence length of the final incoming transducer. The Patch block is passed through a full link layer to obtain corresponding linear embeddings which are then input into a number of Swin transducer blocks with improved self-attention. In contrast to Vision Transformer (ViT), swin transducer is a hierarchical structure with a gradual decrease in resolution, 4-fold, 8-fold, 16-fold downsampling, respectively, while ViT keeps 16-fold downsampling.
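The patch arithmetic implied by these values can be checked with a few lines (the values p = 4, c = 3 and the 224×224 input come from the description; the variable names are just for illustration):

```python
# Each patch flattens to p*p*c features and the sequence length N fed to the
# first stage is (H/p) * (W/p).
H, W, c, p = 224, 224, 3, 4
patch_dim = p * p * c          # 4 * 4 * 3 = 48
N = (H // p) * (W // p)        # 56 * 56 = 3136 patches
print(patch_dim, N)
```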
In step 3.2, as shown in fig. 3, to implement the hierarchical structure, the Swin Transformer network architecture is divided into 4 stages; after each stage the resolution of the patch blocks is halved and the number of channels is doubled. Each stage comprises a patch merging layer and Swin Transformer blocks; the patch merging layer is similar to a pooling operation but loses no information, and the Swin Transformer block is similar to a Transformer block except that the original multi-head self-attention mechanism is replaced by a window multi-head self-attention mechanism and a moving (shifted) window multi-head self-attention mechanism.
In step 3.3, as shown in fig. 4, the Swin Transformer block comprises a window multi-head self-attention module and a moving window multi-head self-attention module. After the patch merging layer, the linear embedding of the patch blocks is input into the window multi-head self-attention module of the Swin Transformer block, which first applies a normalization layer, then a window multi-head self-attention operation, followed by a residual connection, and then a normalization layer and an MLP (multi-layer perceptron) operation. The result is then input into the moving window multi-head self-attention module, which likewise first applies a normalization layer, then a moving window multi-head self-attention operation, and, after a residual connection, a normalization layer and an MLP operation. Each stage contains Swin Transformer blocks as described above. The calculation is as follows:
$\hat{z}^{l} = \text{W-MSA}\!\left(\mathrm{LN}\!\left(z^{l-1}\right)\right) + z^{l-1}$    (9)

$z^{l} = \mathrm{MLP}\!\left(\mathrm{LN}\!\left(\hat{z}^{l}\right)\right) + \hat{z}^{l}$    (10)

$\hat{z}^{l+1} = \text{SW-MSA}\!\left(\mathrm{LN}\!\left(z^{l}\right)\right) + z^{l}$    (11)

$z^{l+1} = \mathrm{MLP}\!\left(\mathrm{LN}\!\left(\hat{z}^{l+1}\right)\right) + \hat{z}^{l+1}$    (12)

where W-MSA and SW-MSA denote the window multi-head self-attention operation and the moving (shifted) window multi-head self-attention operation respectively, LN denotes layer normalization, $z^{l-1}$ denotes the linear embedding of the patch blocks from step 3.1 after the patch merging layer in step 3.2, $\hat{z}^{l}$ denotes the visual features after the window multi-head self-attention operation, $z^{l}$ denotes the output features of the window multi-head self-attention module and, at the same time, the input features of the moving window multi-head self-attention module, $\hat{z}^{l+1}$ denotes the visual features after the moving window multi-head self-attention operation, and $z^{l+1}$ denotes the visual features after the whole Swin Transformer block.
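Equations (9)-(12) can be sketched in PyTorch as below; nn.MultiheadAttention is used as a stand-in for the windowed attention (real W-MSA / SW-MSA restrict attention to local, respectively shifted, windows), and the hidden sizes are assumptions.

```python
import torch.nn as nn

class SwinBlockPair(nn.Module):
    """Sketch of equations (9)-(12): a window-attention block followed by a
    shifted-window block, each with pre-norm, a residual connection and an MLP."""
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.n1, self.n2, self.n3, self.n4 = (nn.LayerNorm(dim) for _ in range(4))
        self.wmsa = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.swmsa = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp1 = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.mlp2 = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, z):                                        # z: (B, N, dim)
        h = self.n1(z)
        z_hat = self.wmsa(h, h, h)[0] + z                        # eq. (9)
        z_l = self.mlp1(self.n2(z_hat)) + z_hat                  # eq. (10)
        h = self.n3(z_l)
        z_hat2 = self.swmsa(h, h, h)[0] + z_l                    # eq. (11)
        return self.mlp2(self.n4(z_hat2)) + z_hat2               # eq. (12)
```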
In step 3.4, as shown in fig. 4, the invention designs two self-supervised learning tasks. The first obtains implicit visual features through the rotation angle classification task, which corresponds to one classification token of the Swin Transformer network. Random rotations of 0°, 90°, 180° and 270° are applied to the images and the rotation labels are recorded, and the Swin Transformer network is trained to correctly identify the rotated images. Let g(·, a) be the operator that rotates an image by 90°×a, where a ∈ {0,1,2,3}. The input of this task is the visual feature of the rotated image dataset obtained through the Swin Transformer network; the rotated image dataset is used only for the rotation angle classification task and does not directly participate in the generalized zero sample image classification task, but the two tasks are trained together so that they share the same Swin Transformer network, allowing this task to indirectly influence the final generalized zero sample image classification task. Furthermore, the process of solving this pretext task does not involve category attributes or semantic information, so a decorrelation of the visual features can be achieved.
The second is the contrastive learning task, which further trains the parameters of the Swin Transformer network: training samples are input into the ResNet101 network to form a positive sample pair with the visual features obtained from the Swin Transformer network, and the parameters of the Swin Transformer network are trained with a contrastive loss; this is realized by training to minimize the distance between the positive sample pair.
As shown in fig. 3, the training samples are input into the Swin Transformer network to obtain the corresponding visual features; the two newly added classification tokens in the Swin Transformer network are only used to solve the rotation angle classification task and the contrastive learning task, and solving these two self-supervised tasks updates the parameters of the Swin Transformer network but does not participate in the generalized zero sample image classification task.
Step 3.5, the visual features obtained from the training samples through the Swin Transformer network in step 3.4 are input into the visual feature refinement module. As shown in FIG. 5, feature refinement improves the method of homologous bilinear pooling, whose original formula is expressed as

$$z = x_s \otimes y_s \tag{13}$$

wherein z denotes the global vector after feature fusion, and $x_s$ and $y_s$ denote the two feature vectors participating in feature fusion; in homologous bilinear pooling the two feature vectors come from the same feature extractor. After conventional bilinear pooling fusion, the feature dimension equals the product of the dimensions of the two features, which risks an excessively high dimension. The present application therefore further decomposes each feature vector into a learnable parameter matrix U (of smaller dimension) and a feature vector $\tilde{x}$ (of larger dimension), and inputs the fused feature vector into the fully connected layer for dimension conversion, thereby relieving the problem of excessive dimensionality. Formula (13) can thus be rewritten as:
$$\tilde{x} = \mathrm{LN}\big((U_1 \circ U_2)(\tilde{x}_1 \circ \tilde{x}_2)\big) \tag{14}$$

wherein ∘ denotes the Hadamard operation; $\tilde{x}_1$ and $U_1$ are the feature vector and parameter matrix obtained by decomposing $x_1$, and $\tilde{x}_2$ and $U_2$ are those obtained by decomposing $x_2$; $x_1$ and $x_2$ denote the feature vectors obtained by applying different reshape operations to the visual feature x produced by the Swin Transformer network; LN denotes the fully connected layer and normalization operations; and $\tilde{x}$ denotes the refined visual feature output by the visual feature refinement module.
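The low-rank fusion described above can be sketched as follows, under the assumption that the two learnable projections play the roles of U1 and U2 and that the factor rank is a free hyperparameter; this is an illustrative reading of formula (14), not the patent's code.

```python
import torch.nn as nn

class FeatureRefinement(nn.Module):
    """Sketch of low-rank homologous bilinear pooling followed by FC + LayerNorm."""
    def __init__(self, dim, rank=512):
        super().__init__()
        self.proj1 = nn.Linear(dim, rank, bias=False)   # plays the role of U1
        self.proj2 = nn.Linear(dim, rank, bias=False)   # plays the role of U2
        self.fc = nn.Linear(rank, dim)                  # dimension conversion
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):
        # x: (B, dim) visual feature from the Swin backbone; the two branches
        # act on copies of the same (homologous) feature.
        z = self.proj1(x) * self.proj2(x)   # Hadamard product of the two low-rank factors
        return self.norm(self.fc(z))        # refined visual feature
```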
Step 3.6, as shown in FIG. 6, the refined visual features $\tilde{x}$ obtained by the visual feature refinement module are used to guide the attention over the semantic feature A, so that the IAS module adds different attention to the semantic features. First, the semantic feature A of each dataset is obtained through the ResNet101 network and taken as the original semantic feature; the original semantic feature alone learns only a single embedding vector, so the original semantic feature A is input into the IAS module to obtain the improved semantic feature $\tilde{A}$. The specific calculation process is as follows:
$$\tilde{A} = \mathrm{softmax}\big(g(\tilde{x})\big) \odot A$$

wherein $\mathrm{softmax}(g(\tilde{x}))$ denotes the semantic attention predicted by a linear layer g from the refined visual feature, and ⊙ denotes the Hadamard operation by which the attention and the semantic features are combined. Finally, the semantic features are mapped into the visual space through two fully connected layers with normalization; the whole process can be expressed as:

$$V(A) = \mathrm{LN}\Big(\mathrm{FC}\big(\mathrm{LN}(\mathrm{FC}(\tilde{A}))\big)\Big)$$

wherein A denotes the original semantic features, $\tilde{A}$ denotes the improved semantic features, and V(A) denotes the corresponding semantic features mapped into the visual space.
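A sketch of the IAS module under these formulas might look as follows; the module and parameter names are assumptions, and the two FC-plus-normalization layers follow the generic mapping form above rather than the patent's exact layer ordering.

```python
import torch
import torch.nn as nn

class IASModule(nn.Module):
    """Sketch of the image-adaptive semantics module: image-conditioned attention
    over the class attributes, followed by a mapping into the visual space."""
    def __init__(self, vis_dim, sem_dim):
        super().__init__()
        self.attn_fc = nn.Linear(vis_dim, sem_dim)   # linear layer g predicting semantic attention
        self.map = nn.Sequential(nn.Linear(sem_dim, vis_dim), nn.LayerNorm(vis_dim),
                                 nn.Linear(vis_dim, vis_dim), nn.LayerNorm(vis_dim))

    def forward(self, x_refined, A):
        # x_refined: (B, vis_dim) refined visual features; A: (num_classes, sem_dim) attributes
        attn = torch.softmax(self.attn_fc(x_refined), dim=-1)   # image-specific attention
        A_tilde = attn.unsqueeze(1) * A.unsqueeze(0)            # improved, image-adaptive semantics
        return self.map(A_tilde)                                # (B, num_classes, vis_dim)
```

The mapped semantics can then be compared with the refined visual feature, e.g. `scores = torch.einsum('bd,bcd->bc', x_refined, ias(x_refined, A))`, to produce per-class compatibility scores; this scoring step is likewise an assumption of the sketch.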
Step 3.7, the generalized zero sample classification model is trained and iteratively updated by gradient descent to determine the model parameters; the weight matrices are updated according to the mean square error loss function, the cross entropy loss function and the contrastive loss function, yielding the trained generalized zero sample classification model.
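A sketch of one training update with the combined objective might look like the following; the three head methods on `model` are hypothetical placeholders standing in for the classification, rotation and contrastive branches described above, not the patent's API.

```python
def training_step(model, batch, optimizer):
    """One gradient-descent update with the combined objective of the three losses."""
    images, labels, attributes = batch
    cls_loss = model.classification_loss(images, labels, attributes)  # generalized zero-shot task (MSE)
    rot_loss = model.rotation_loss(images)                            # rotation-angle self-supervision (CE)
    con_loss = model.contrastive_loss(images)                         # ResNet101 / Swin positive pairs
    loss = cls_loss + rot_loss + con_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```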
Step four: and identifying the generalized zero sample image through the trained generalized zero sample classification model, and completing the image classification task of the generalized zero sample.
The network of the present invention is pre-trained on ImageNet and fine-tuned on the datasets used in the experiments. ResNet-101 uses a 2048-dimensional final pooling layer; the patch size in the Swin Transformer network is set to 4×4, with 24 hidden layers of 1024 dimensions each, 16 heads, and 24 decoders in series per layer. Our model was trained with the Adam optimizer at a fixed learning rate of 0.0001, and the model eventually converges. All methods were implemented in PyTorch. The invention selects the class average accuracy Macc as the evaluation index for zero sample classification, calculated as
$$M_{acc} = \frac{1}{K_u}\sum_{i=1}^{K_u} acc_i$$

wherein $K_u$ denotes the total number of unseen classes, $y^{ts}$ denotes the unseen classes, $acc_i$ denotes the classification accuracy of the i-th unseen class (the fraction of its $N_u$ unseen-class test samples classified correctly). For the generalized zero sample classification task, the search space at test time is not limited to the unseen classes but also includes the visible classes. The unified evaluation criterion is therefore the harmonic mean accuracy, a zero sample performance index built on top of the class average accuracy, calculated as:
$$H = \frac{2 \times acc_s \times acc_u}{acc_s + acc_u}$$

wherein $acc_s$ and $acc_u$ denote the class average accuracy obtained on the visible-class image dataset and on the unseen-class image dataset, respectively. The results of the present invention were validated on the AWA2, CUB and SUN datasets; the network models Region Graph Embedding Network (RGEN), Over-Complete Distribution Conditional Variational Autoencoders (OCD-CVAE) and Cross Attribute-Guided Transformer for Zero-Shot Learning (TransZero++) and the method of the present invention were evaluated experimentally, giving the comparison results shown in Table 1.
Table 1 experimental comparison results of different methods
(Table 1 is presented as an image in the original publication; the per-dataset S, U and H values are not reproduced here.)
As shown in Table 1, S denotes the classification accuracy on the visible classes, U denotes the classification accuracy on the unseen classes, and H denotes the harmonic mean of S and U, which is the final evaluation index of the generalized zero sample image classification task. The experimental results show that the method of the invention achieves the best harmonic mean on all three datasets, i.e. the best classification effect, because it markedly improves the classification accuracy on the unseen classes and because the model effectively alleviates the bias problem. A sketch of how these metrics are computed is given below.
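Assuming NumPy arrays of ground-truth and predicted labels, a minimal sketch of the metric computation is:

```python
import numpy as np

def per_class_accuracy(y_true, y_pred):
    """Class-averaged accuracy Macc: mean of the per-class accuracies."""
    classes = np.unique(y_true)
    accs = [(y_pred[y_true == c] == c).mean() for c in classes]
    return float(np.mean(accs))

def harmonic_mean(acc_seen, acc_unseen):
    """Harmonic mean H of the seen-class (S) and unseen-class (U) accuracies."""
    return 2 * acc_seen * acc_unseen / (acc_seen + acc_unseen)

# Example: S = per_class_accuracy on seen-class test images,
#          U = per_class_accuracy on unseen-class test images,
#          H = harmonic_mean(S, U) is the final GZSL evaluation index.
```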
In summary, the invention provides a feature refinement self-supervised learning method for solving the GZSL task. Self-supervised learning is realized by adding two classification tokens to the Swin Transformer network, corresponding to the two self-supervised tasks; through visual feature refinement and semantic feature refinement, the visual features are further adapted to fine-grained image classification and a semantic decorrelation operation is applied to the visual features. A visual classifier conditioned on semantic attributes generates pseudo labels for the unseen-class samples, converting generalized zero sample learning into a transductive setting and further alleviating the bias problem toward the visible classes. Finally, experiments on three popular datasets demonstrate the superiority of the model of the present application.
The above example is one of applications of the present invention, but embodiments of the present invention are not limited thereto. Any other technical changes that do not depart from the principles and spirit of the present invention are intended to be included within the scope of the present invention.
Matters not described in detail herein are applicable to the prior art.

Claims (6)

1. A generalized zero sample image classification method based on feature refinement self-supervision learning comprises the following steps:
(1) Acquiring an image dataset and a semantic attribute dataset of a generalized zero sample classification model, and training the semantic attribute dataset by utilizing a ResNet101 network to acquire a semantic feature A;
(2) Constructing a conditional visual classifier that takes the image dataset and the semantic attribute dataset as input and generates pseudo labels for the unseen-class images; the unseen-class images in the image dataset together with the pseudo labels obtained from the conditional visual classifier form a new image dataset, and this new dataset together with the visible-class image dataset forms the training samples of the generalized zero sample classification model for subsequent classification training;
(3) Constructing a generalized zero sample classification model;
The generalized zero sample classification model comprises a visual feature refinement module, a semantic feature refinement module, a Swin Transformer network with two added classification tokens, and two constructed self-supervised learning tasks, wherein the two newly added classification tokens correspond respectively to the two self-supervised learning tasks: the rotation angle classification task randomly rotates the training samples by four different angles and finally predicts the rotation category; the contrastive learning task inputs the training samples into the ResNet101 network and the Swin Transformer network respectively to obtain visual features extracted by the different feature extractors, and constrains these visual features through a contrastive loss function;
building the visual feature refinement module: the visual feature refinement module is based on bilinear pooling and adopts homologous bilinear pooling; the input of the visual feature refinement module is the visual feature x obtained from the training samples through the Swin Transformer network, and the module makes the visual features better suited to fine-grained datasets; it comprises a Hadamard operation, reshape operations, a fully connected layer and a normalization layer. Specifically, the visual feature x is copied for subsequent feature fusion; the initial dimension of x is q, and after the reshape operations the dimensions of the two copies are transformed into 1×q and q×1 respectively, named x1 and x2; x1 is decomposed into a parameter matrix U1 and a feature vector $\tilde{x}_1$, and x2 is decomposed into a parameter matrix U2 and a feature vector $\tilde{x}_2$; the product of the parameter matrix U1 and the feature vector $\tilde{x}_1$ is x1, and the product of the parameter matrix U2 and the feature vector $\tilde{x}_2$ is x2; then U1 and U2 are combined by a Hadamard operation and input into the global vector layer, and $\tilde{x}_1$ and $\tilde{x}_2$ are likewise combined by a Hadamard operation and input into the global vector layer, where they are aggregated into a global vector z; after the fully connected layer and normalization operation, the output is the refined visual feature $\tilde{x}$;
Building the semantic feature refinement module: the semantic feature refinement module is called the image adaptive semantics (Image Adaptive Semantics, IAS) module; the IAS module combines the original semantic features, which distinguish between classes, with image-specific attention vectors accounting for intra-class variation, and then maps the image-adaptive semantic features into the corresponding visual space, thereby improving the accuracy of GZSL image classification. The IAS module comprises a 1st fully connected layer FC, a 1st classification function softmax, a Hadamard operation, a 2nd fully connected layer FC, a 3rd fully connected layer FC, a 1st normalization layer, a 2nd normalization layer and a 2nd classification function softmax. The inputs of the IAS module are the visual feature $\tilde{x}$ output by the visual feature refinement module and the semantic feature A obtained by training the semantic attribute dataset with the ResNet101 network. The refined visual feature $\tilde{x}$ is processed by the 1st fully connected layer FC and the 1st classification function softmax, and the result is combined with the semantic features through the Hadamard operation to obtain the improved semantic feature $\tilde{A}$; the improved semantic feature $\tilde{A}$ is processed by the 2nd fully connected layer FC, the 3rd fully connected layer FC and the 2nd normalization layer, after which the Hadamard operation and the 2nd classification function softmax are performed with the result of the visual feature x processed by the 2nd normalization layer, mapping the semantic features into the visual space;
the training samples are input into the Swin Transformer network and pass sequentially through the visual feature refinement module and the IAS module to output the classification category of the generalized zero sample image classification task; the generalized zero sample classification model is trained, and its total loss function $L_{TOT}$ is the sum of the loss functions of the self-supervised learning tasks (the loss function of the rotation angle classification task and the loss function of the contrastive learning task) and the loss function of the generalized zero sample classification task, given by:
L TOT =L CE +L MSE +L NCE (1)
wherein $L_{TOT}$ is the total loss function of the generalized zero sample classification model, $L_{CE}$ is the loss function of the rotation angle classification task, $L_{MSE}$ is the loss function of the generalized zero sample classification task, and $L_{NCE}$ is the loss function of the contrastive learning task;
the loss function of the generalized zero sample classification task is:
$$L_{MSE} = \frac{1}{M}\sum_{i=1}^{M}\left(y_i - \hat{y}_i\right)^2 \tag{2}$$

wherein M denotes the number of training samples, and $y_i$ and $\hat{y}_i$ denote the real label and the predicted label of the generalized zero sample image classification task, respectively;
the loss function of the rotation angle classification task is:
$$L_{CE} = -\frac{1}{M}\sum_{i=1}^{M}\log \hat{y}_i^{\,a_i} \tag{3}$$

wherein $\hat{y}_i^{\,a}$, with a ∈ {0, 1, 2, 3} denoting the 4 rotation angles, is the predicted probability that training sample i was rotated by angle a, and $a_i$ is the true rotation label of sample i;
the loss function of the contrast learning task is:
$$L_{NCE} = -\frac{1}{M}\sum_{j=1}^{M}\log \frac{\exp\!\big(\mathrm{sim}(x_j,\hat{x}_j)\big)}{\sum_{k=1}^{M}\exp\!\big(\mathrm{sim}(x_j,\hat{x}_k)\big)} \tag{4}$$

wherein M denotes the number of training samples, $x_j$ and $\hat{x}_j$ denote the visual features of training sample j obtained through the ResNet101 network and through the Swin Transformer network (with weight matrix W), respectively, and $\mathrm{sim}(x_j,\hat{x}_j)$ denotes the cosine similarity between $x_j$ and $\hat{x}_j$;
So far, a trained generalized zero sample classification model is obtained;
(4) Identifying the generalized zero sample images using the trained generalized zero sample classification model to complete the generalized zero sample classification task.
2. The generalized zero sample image classification method according to claim 1, wherein the pseudo label obtaining process is: the visible-class image dataset and the visible-class semantic attribute dataset are used for training to obtain a visible-class conditional visual classifier conditioned on the semantic attributes, and the unseen-class image dataset and semantic attribute dataset are then used to obtain an unseen-class conditional visual classifier conditioned on the semantic attributes; the weight matrix $W_s$ of the visible classes is obtained through the visible-class conditional visual classifier, and with the visible-class weight matrix $W_s$ serving as the classification weight of the unseen-class conditional visual classifier, the pseudo label $\hat{y}_u$ of an unseen-class image $x_u$ is obtained.
3. The generalized zero sample image classification method according to claim 1, wherein the step (1) obtains an image dataset and a semantic attribute dataset of a generalized zero sample classification model, specifically:
(1.1) Existing image classification and fine-grained image classification datasets are used as the image datasets, including: Animals with Attributes 2 (AWA2), Caltech-UCSD-Birds-200-2011 (CUB) and the SUN Attribute Database (SUN), wherein the CUB and SUN datasets are fine-grained datasets and the AWA2 dataset is a coarse-grained dataset. The AWA2 dataset contains 37322 animal pictures in 50 classes, of which 40 classes are used as training classes and 10 classes as testing classes; the CUB dataset is a fine-grained dataset of bird pictures containing 11788 pictures of 200 bird species in total, of which 150 classes are used as training classes and 50 classes as testing classes; the SUN dataset is a fine-grained dataset covering various environmental scenes and indoor images, containing 14340 pictures in 717 classes in total, of which 645 classes are used as training classes and 72 classes as testing classes. To meet the input requirement of the backbone network, i.e. the Swin Transformer network, the image resolution is unified to 224×224. The visual features $X_s$ of the visible-class images and their labels $Y_s$ are denoted as $D_s=\{X_s, Y_s\}$ and form the visible-class image dataset; the visual features $X_u$ of the unseen-class images and their labels $Y_u$ are denoted as $D_u=\{X_u, Y_u\}$ and form the unseen-class image dataset.
(1.2) In general, zero sample classification requires constructing a semantic space with the help of auxiliary semantic information and establishing the interaction between the visual modality and the semantic modality. The auxiliary semantic information can be divided into two main types: manually defined auxiliary information and learned auxiliary information. For the manually defined auxiliary information, semantic features of different dimensions are obtained through the ResNet101 network; the semantic feature A obtained from the semantic attribute dataset through the ResNet101 network is expressed as $A = A_s \cup A_u$, where the subscript s denotes the visible classes and u denotes the unseen classes. The AWA2 dataset uses 85-dimensional semantic features, the CUB dataset uses 312-dimensional semantic features, and the SUN dataset uses 102-dimensional semantic features.
4. The generalized zero sample image classification method according to claim 1, wherein the four different angles are 0°, 90°, 180° and 270°, respectively, and the rotation labels are set to 0, 1, 2, 3, respectively.
5. The generalized zero sample image classification method according to claim 1, wherein the step (3) constructs a generalized zero sample classification model, specifically:
(3.1) Swin Transformer network with two added classification tokens: the Swin Transformer network is taken as the backbone network, and the images in the training samples input into the Swin Transformer network become a series of flattened 2D patches $x_p$; the resolution of each patch is p×p, and the dimension of each patch is set to 48;
the flattened patch blocks pass through 4 stages to obtain the visual feature x of the input image; each stage comprises a patch merging layer and Swin Transformer blocks, and each Swin Transformer block comprises a window multi-head self-attention module and a moving window multi-head self-attention module;
(3.2) Two self-supervised task mechanisms:

Rotation angle classification task: a task of predicting the rotation angle of an image is constructed to realize self-supervised learning, i.e. the images in the training samples are randomly rotated by four different angles (0°, 90°, 180° and 270°) to obtain a rotated image dataset; the four rotation angles are set as four categories with labels (0, 1, 2, 3), and the rotated image dataset is input into the Swin Transformer network to obtain visual features carrying the rotation information of the rotated image dataset, which are used to predict the rotation category of the rotated image dataset;

Contrastive learning task: the training samples are input into the ResNet101 network and the Swin Transformer network respectively to obtain visual features of the same image from different feature extractors; the two visual features processed by the different feature extractors are taken as a positive sample pair, the similarity between the positive sample pair is computed through cosine similarity, and the contrastive loss constrains the distance between the positive sample pair to be sufficiently close;
(3.3) The input of the visual feature refinement module is the visual feature x processed in step (3.1), and its output is the refined visual feature $\tilde{x}$;

(3.4) The inputs of the IAS module are the refined visual feature $\tilde{x}$ obtained after the processing of step (3.3) and the semantic feature A processed by the ResNet101 network, and its output is the mapping of the visual feature x and the semantic feature A in the visual space;
(3.5) The training samples and the rotated image dataset are simultaneously input into the Swin Transformer network for feature extraction, obtaining their respective visual features x; the obtained visual features are used for the generalized zero sample image classification task and the rotation angle classification task respectively, and step (3.1) and step (3.2) are trained jointly so as to share the same Swin Transformer network. The generalized zero sample classification model is trained with the total loss function $L_{TOT}$, the sum of the loss functions of the self-supervised learning tasks and the loss of the generalized zero sample classification task;

the Swin Transformer network, the visual feature refinement module and the IAS module are trained together; the iteration count is initialized as k, the maximum number of iterations is K, where K ≥ 30, and k = 1;
obtaining a trained generalized zero sample classification model.
6. A computer readable storage medium having stored thereon a computer program, which when executed by a processor performs the steps of the generalized zero sample image classification method according to any of the claims 1-5.
CN202211568665.3A 2022-12-08 2022-12-08 Generalized zero sample image classification method based on feature refinement self-supervision learning Pending CN116129174A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202211568665.3A CN116129174A (en) 2022-12-08 2022-12-08 Generalized zero sample image classification method based on feature refinement self-supervision learning
CN202310723100.6A CN117095196A (en) 2022-12-08 2023-06-19 Generalized zero sample image classification method based on feature refinement self-supervision learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211568665.3A CN116129174A (en) 2022-12-08 2022-12-08 Generalized zero sample image classification method based on feature refinement self-supervision learning

Publications (1)

Publication Number Publication Date
CN116129174A true CN116129174A (en) 2023-05-16

Family

ID=86301893

Family Applications (2)

Application Number Title Priority Date Filing Date
CN202211568665.3A Pending CN116129174A (en) 2022-12-08 2022-12-08 Generalized zero sample image classification method based on feature refinement self-supervision learning
CN202310723100.6A Pending CN117095196A (en) 2022-12-08 2023-06-19 Generalized zero sample image classification method based on feature refinement self-supervision learning

Family Applications After (1)

Application Number Title Priority Date Filing Date
CN202310723100.6A Pending CN117095196A (en) 2022-12-08 2023-06-19 Generalized zero sample image classification method based on feature refinement self-supervision learning

Country Status (1)

Country Link
CN (2) CN116129174A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117274656A (en) * 2023-06-06 2023-12-22 天津大学 Multi-mode model countermeasure training method based on self-adaptive depth supervision module
CN117274656B (en) * 2023-06-06 2024-04-05 天津大学 Multi-mode model countermeasure training method based on self-adaptive depth supervision module
CN116912623A (en) * 2023-07-20 2023-10-20 东北大学 Contrast learning method and system for medical image dataset
CN116912623B (en) * 2023-07-20 2024-04-05 东北大学 Contrast learning method and system for medical image dataset
CN117610614A (en) * 2024-01-11 2024-02-27 四川大学 Attention-guided generation countermeasure network zero sample nuclear power seal detection method
CN117610614B (en) * 2024-01-11 2024-03-22 四川大学 Attention-guided generation countermeasure network zero sample nuclear power seal detection method

Also Published As

Publication number Publication date
CN117095196A (en) 2023-11-21

Similar Documents

Publication Publication Date Title
Wang et al. Deep visual domain adaptation: A survey
Alqahtani et al. Applications of generative adversarial networks (gans): An updated review
Shamsolmoali et al. Image synthesis with adversarial networks: A comprehensive survey and case studies
Han et al. A survey on visual transformer
CN116129174A (en) Generalized zero sample image classification method based on feature refinement self-supervision learning
Zhang A survey of unsupervised domain adaptation for visual recognition
CN110516095B (en) Semantic migration-based weak supervision deep hash social image retrieval method and system
CN112766158A (en) Multi-task cascading type face shielding expression recognition method
CN110349229B (en) Image description method and device
CN115205730A (en) Target tracking method combining feature enhancement and template updating
Nguyen et al. Boxer: Box-attention for 2d and 3d transformers
CN115222998B (en) Image classification method
CN114898159B (en) SAR image interpretability feature extraction method for generating countermeasure network based on decoupling characterization
Sahu et al. Dynamic routing using inter capsule routing protocol between capsules
CN115909036A (en) Local-global adaptive guide enhanced vehicle weight identification method and system
Wang et al. Dynamic super-pixel normalization for robust hyperspectral image classification
Silva et al. Improving transferability of domain adaptation networks through domain alignment layers
Zuo et al. Multimodal image-to-image translation via mutual information estimation and maximization
Xie et al. Design of painting art style rendering system based on convolutional neural network
Chen et al. Multiscale Salient Alignment Learning for Remote Sensing Image-Text Retrieval
Fazheng et al. Research on location of chinese handwritten signature based on efficientdet
Sarker et al. Enhanced visible–infrared person re-identification based on cross-attention multiscale residual vision transformer
Chen et al. SketchTrans: Disentangled Prototype Learning with Transformer for Sketch-Photo Recognition
Li et al. Improving feature’s capability of carrying category-specific information for adversarial domain adaptation
Yan et al. Score-guided face alignment network under occlusions

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20230516

WD01 Invention patent application deemed withdrawn after publication