CN116797821A - Generalized zero sample image classification method based on fusion visual information - Google Patents

Generalized zero sample image classification method based on fusion visual information

Info

Publication number
CN116797821A
Authority
CN
China
Prior art keywords
fusion
feature
image
sample image
semantic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310590814.4A
Other languages
Chinese (zh)
Inventor
潘杰
潘强烽
邹筱瑜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China University of Mining and Technology CUMT
Original Assignee
China University of Mining and Technology CUMT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China University of Mining and Technology CUMT filed Critical China University of Mining and Technology CUMT
Priority to CN202310590814.4A priority Critical patent/CN116797821A/en
Publication of CN116797821A publication Critical patent/CN116797821A/en
Pending legal-status Critical Current

Landscapes

  • Image Analysis (AREA)

Abstract

The invention discloses a generalized zero sample image classification method based on fusion visual information. An input image is first processed by an image preprocessing module to obtain locally encoded feature vectors. These feature vectors are then fed into a feature fusion embedding module, which processes visual information of different resolutions differentially through a multi-level architecture, obtains fused visual information by feature fusion, and predicts the semantic attributes contained in the image. Finally, the attribute regression loss and the attribute cross entropy loss are calculated from the predicted semantic attribute values, and the model parameters are updated and optimized to obtain the optimal model for testing. In the test stage, a test image is input into the model, the combination of semantic attributes in the image is obtained, and the image category is predicted according to the cosine similarity score. By fusing visual information, the invention improves the model's ability to process information of different resolutions and achieves better zero sample image classification performance.

Description

Generalized zero sample image classification method based on fusion visual information
Technical Field
The invention belongs to the field of deep learning, addresses the problem of image classification, and particularly relates to a generalized zero sample image classification method based on fusion visual information.
Background
Existing deep neural network models usually adopt a supervised learning training mode, in which model parameters are trained to an optimal state with a large amount of labeled data. Labeling all images is extremely costly, and for scenarios lacking training data, such as endangered species identification and medical image processing, traditional models struggle to recognize the corresponding categories well because the data are insufficient. To reduce the dependence of models on labeled data in such special tasks, researchers proposed a new research field: zero sample learning. The AWA data set was introduced in 2009, after which an algorithm for attribute-based category prediction was proposed; the concept of zero sample learning was first put forward in the same year, opening up this line of research.
Zero sample means no training sample: the goal is to enable a deep learning model to recognize new classes it has never been trained on. When the class sets of the training and test data do not intersect, or the test data contains classes beyond those in the training data, classification of unseen-class images is achieved through knowledge transfer. Zero sample learning relies on labeled visible classes and on semantic information shared between visible and invisible classes. By training on visible-class samples, the model learns the common semantic attributes corresponding to visual features in images; these attribute features are then transferred from visible to invisible classes, and a coupling between the two is established through class-associated semantic information, so that invisible classes can be classified even though no invisible-class labels are used as training samples. According to the test stage, the task divides into traditional zero sample learning, where only invisible classes are predicted, and generalized zero sample learning, where both visible and invisible classes are predicted. Generalized zero sample learning better matches real-world tasks, but it is also harder and places higher demands on the model.
The zero sample learning method has several advantages over traditional classification methods:
1. for certain specific classes for which no data set has been established (such as endangered species or newly designed industrial products), recognition and classification can still be achieved through zero sample learning;
2. the way zero sample learning is implemented has much in common with how humans learn, which can improve the interpretability of the model;
3. zero sample learning and deep learning methods can be organically combined and developed together, better meeting the requirements of object recognition tasks.
Research on zero sample learning has high theoretical value and strong application potential; it is a necessary trend in object classification technology and has become one of the research focuses in image recognition and classification tasks.
Disclosure of Invention
The invention aims at the following: in view of the prior art, the invention provides a generalized zero sample image classification method based on fusion visual information, which improves visual feature processing and multi-resolution information interaction and yields a zero sample image classification model with better performance.
In order to achieve the above purpose, the present invention provides the following technical solution: a generalized zero sample image classification method based on fusion visual information, in which a generalized zero sample image classification model is obtained through steps S1 to S4 and then applied according to steps i to j to classify the images to be classified:
step S1, obtaining all category labels, all semantic attributes and corresponding relations between each category label and each semantic attribute respectively based on a preset data set;
s2, constructing a generalized zero sample image classification model based on all semantic attributes of a preset data set, wherein the generalized zero sample image classification model takes an image as an input and outputs semantic attribute combinations of the image; the generalized zero sample image classification model comprises an image preprocessing module, a feature fusion embedding module and an attribute extraction module; the image preprocessing module obtains local coding feature vectors of the image according to the image; the feature fusion embedding module is used for carrying out feature fusion processing on the local coding feature vector to obtain fusion visual information; the attribute extraction module realizes the mapping from the fusion visual information to the preset semantic attribute;
s3, constructing a training set based on the visible class with training samples in the preset data set; each training sample in the training set comprises an image, a semantic attribute combination of the image and a category label to which the image belongs;
s4, training the generalized zero sample image classification model by utilizing the training set, and learning the mapping relation between the image and the semantic attribute;
step i, inputting an image to be classified into a trained generalized zero sample image classification model to obtain a semantic attribute combination of the image to be classified;
and j, respectively performing cosine similarity calculation on the semantic attribute combination of the image to be classified and the semantic attribute combination under all the category labels in the preset data set, and taking the category label corresponding to the semantic attribute combination with the highest similarity as the category label of the image to be classified to realize the generalized zero sample image classification task.
Further, in the step S2, the step of preprocessing the image by the image preprocessing module includes:
step S201, performing visual feature extraction on the image by using a feature extractor to obtain a visual feature map of the image;
step S202, dividing the visual feature map obtained in the step S201 into blocks and vectorizing each block; the specific method is as follows: first uniformly reshape the visual feature map of the image to 224×224 resolution, then partition the reshaped feature map into non-overlapping 4×4 region blocks, generating a 56×56 grid of blocks, and vectorize each block to obtain a 192-dimensional block feature vector;
step S203, adding relative position codes to the feature vectors of each block to obtain local coding feature vectors of the image.
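The block partition and vectorization of steps S201 to S203 can be sketched in PyTorch as follows (the channel count of the extractor's feature map and the use of a strided convolution for the per-block linear projection are illustrative assumptions, not the patented implementation):

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Splits a 224x224 feature map into 4x4 blocks and embeds each block
    as a 192-dimensional vector (56x56 = 3136 blocks in total)."""
    def __init__(self, in_channels=3, patch_size=4, embed_dim=192):
        super().__init__()
        # A strided convolution is equivalent to "partition into non-overlapping
        # 4x4 blocks, then apply a linear projection to each block".
        self.proj = nn.Conv2d(in_channels, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                      # x: (B, C, 224, 224)
        x = self.proj(x)                       # (B, 192, 56, 56)
        return x.flatten(2).transpose(1, 2)    # (B, 3136, 192)

# Example: one 3-channel map reshaped to 224x224
tokens = PatchEmbedding()(torch.randn(1, 3, 224, 224))
print(tokens.shape)  # torch.Size([1, 3136, 192])
```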
Further, in the aforementioned step S201, a convolutional neural network pre-trained on the ImageNet dataset is adopted as the feature extractor.
Further, in the aforementioned step S203, the step of adding relative position codes to the block vectors is specifically as follows: relative position vectors $a_{ij}^K$ and $a_{ij}^V$ are added to the block vectors, and the relative positional relationship between blocks is obtained by the following formulas:

$$e_{ij} = \frac{(x_i W^Q)(x_j W^K + a_{ij}^K)^\top}{\sqrt{d_k}}$$

$$\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{n} \exp(e_{ik})}$$

$$z_i = \sum_{j=1}^{n} \alpha_{ij}\,(x_j W^V + a_{ij}^V)$$

where $x_i$, $x_j$ are input vectors, $d_k$ is the embedding dimension, $e_{ij}$ is the dot-product similarity of $x_i$ and $x_j$, $\alpha_{ij}$ is the similarity weight between $x_i$ and $x_j$ obtained by applying the softmax function to $e_{ij}$, $W^Q$, $W^K$, $W^V$ are the query (Q), key (K) and value (V) parameter matrices optimized during training, and $z_i$ denotes the processed locally encoded feature vector.
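A minimal single-head PyTorch sketch of these formulas (the dense per-pair tables `a_k`, `a_v` and all shapes are illustrative assumptions; practical implementations index a small table of learned vectors by relative offset):

```python
import torch
import torch.nn.functional as F

def relative_self_attention(x, W_q, W_k, W_v, a_k, a_v):
    """Single-head self-attention with relative position vectors a_k[i, j]
    (added to the keys) and a_v[i, j] (added to the values).
    Shapes: x (n, d); W_q, W_k, W_v (d, d_k); a_k, a_v (n, n, d_k)."""
    d_k = W_q.shape[1]
    q, k, v = x @ W_q, x @ W_k, x @ W_v             # (n, d_k) each
    # e_ij = q_i . (k_j + a_k[i, j]) / sqrt(d_k)
    e = (q.unsqueeze(1) * (k.unsqueeze(0) + a_k)).sum(-1) / d_k ** 0.5
    alpha = F.softmax(e, dim=-1)                     # (n, n) weights
    # z_i = sum_j alpha_ij * (v_j + a_v[i, j])
    return (alpha.unsqueeze(-1) * (v.unsqueeze(0) + a_v)).sum(dim=1)
```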
Further, in the step S2, the feature fusion embedding module consists of DeiT blocks, block fusion and feature fusion. Block fusion realizes the conversion between different resolutions; feature fusion adopts a serial-parallel combined fusion strategy. The input locally encoded feature vectors are first processed by the DeiT of the first level; the processed feature vectors are passed to the next level and are simultaneously reconstructed by feature fusion into the first-level feature map. Block fusion then merges every 4 adjacent blocks into 1 block of 4 times the size, increasing the resolution level, after which the DeiT of that level processes the vectors; the results are again passed on to the next level while feature fusion reconstructs them into the second-level feature map. Through three such block fusion layers, the model can process visual information at four successively increasing resolution levels. Feature fusion reconstructs the feature vectors processed by the DeiT of each level into the corresponding feature maps, feeds the feature maps output by all levels into the feature fusion layer in parallel, and obtains by weighting the fused visual information that accounts for both local and global cues. Each DeiT block is composed of a multi-head self-attention layer MSA and an MLP, where layer normalization is applied before each residual connection; the network relationship can be represented by the following formulas:

$$z'_l = \mathrm{MSA}(\mathrm{LN}(z_{l-1})) + z_{l-1}$$

$$z_l = \mathrm{MLP}(\mathrm{LN}(z'_l)) + z'_l$$

where $z_{l-1}$ is the output feature of the preceding layer, $z'_l$ is the input feature of the following layer, and $z_l$ is the output feature of the following layer.
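A sketch of one such block in PyTorch (the head count and MLP expansion ratio are assumed DeiT-Tiny-style defaults, not values stated in the patent):

```python
import torch.nn as nn

class TransformerBlock(nn.Module):
    """One DeiT-style block: layer normalization before the multi-head
    self-attention and the MLP, each wrapped in a residual connection,
    matching the two formulas above."""
    def __init__(self, dim=192, num_heads=3, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio),
            nn.GELU(),
            nn.Linear(dim * mlp_ratio, dim),
        )

    def forward(self, z):                                   # z: (B, n, dim)
        h = self.norm1(z)
        z = z + self.attn(h, h, h, need_weights=False)[0]   # z'_l
        z = z + self.mlp(self.norm2(z))                     # z_l
        return z
```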
Further, in the step S2, the attribute extraction module includes two parts, namely a global average pooling layer GAP and a semantic attribute predictor. The semantic attribute predictor is constructed based on all semantic attributes in the preset data set; the fused visual information is processed through the global average pooling layer GAP to obtain the fused visual features, which are then input into the semantic attribute predictor to obtain the corresponding semantic attributes.
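A minimal sketch of this module, assuming the fused visual information arrives as a feature map with `feat_dim` channels and that the predictor is a single linear layer (the patent does not specify the predictor's internal structure; `num_attributes` would be 85 for AWA2, 102 for CUB, 312 for SUN):

```python
import torch
import torch.nn as nn

class AttributeExtractor(nn.Module):
    """Global average pooling over the fused feature map, followed by a
    linear semantic attribute predictor."""
    def __init__(self, feat_dim=192, num_attributes=85):
        super().__init__()
        self.gap = nn.AdaptiveAvgPool2d(1)              # GAP
        self.predictor = nn.Linear(feat_dim, num_attributes)

    def forward(self, fused):                           # fused: (B, C, H, W)
        pooled = self.gap(fused).flatten(1)             # (B, C)
        return self.predictor(pooled)                   # (B, M) attributes

# Example: fused feature map -> 85 predicted attribute values per image
attrs = AttributeExtractor()(torch.randn(2, 192, 7, 7))
print(attrs.shape)  # torch.Size([2, 85])
```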
Further, in the step S4, the step of training the generalized zero sample image classification model includes:
s401, inputting a sample image into a generalized zero sample image classification model to obtain a semantic attribute combination of the sample image, and marking the semantic attribute combination as a predicted semantic attribute combination of the sample image;
step S402, obtaining semantic attribute combinations of corresponding categories of sample images from a preset database based on category labels of the sample images;
step S403, calculating the overall loss of the model by using a loss function based on semantic attribute combination of the sample image, prediction semantic attribute combination and semantic attribute combination of the corresponding category, and optimizing parameters of the generalized zero sample image classification model according to the overall loss value of the model;
step S404, iteratively updating parameters of the generalized zero sample image classification model by using sample images in the training set until the parameters converge, to obtain the trained generalized zero sample image classification model.
further, in the step S403, calculating the model total loss using the loss function specifically includes:
calculating the attribute regression loss $L_{AR}$ based on each single attribute in the semantic attribute combination of the sample image and in the predicted semantic attribute combination, with the calculation formula:

$$L_{AR} = \frac{1}{M}\sum_{i=1}^{M}\left(\hat{a}_i - a_i\right)^2$$

where $M$ is the total number of semantic attributes contained in the preset data set, $\hat{a}_i$ is the value of each single attribute in the predicted semantic attribute combination, and $a_i$ is the value of each single attribute in the semantic attribute combination of the sample image;
calculating the attribute cross entropy loss $L_{ACE}$ based on the predicted semantic attribute combination of the sample image and the semantic attribute combinations of the corresponding categories, with the calculation formula:

$$L_{ACE} = -\log \frac{\exp(\hat{a}^{\top} a^{y})}{\sum_{\tilde{y} \in y^{s}} \exp(\hat{a}^{\top} a^{\tilde{y}})}$$

where $y^{s}$ is the set of visible training classes, $a^{\tilde{y}}$ is the attribute combination contained in visible class $\tilde{y}$, and $a^{y}$ is the attribute combination of the class to which the sample image belongs;
calculating the overall model loss $L$ based on the attribute regression loss $L_{AR}$ and the attribute cross entropy loss $L_{ACE}$ according to the following formula:

$$L = L_{AR} + \alpha L_{ACE}$$

where $\alpha$ is the weighting coefficient between the two losses.
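Under the reconstruction above (mean squared error for $L_{AR}$ and a softmax cross entropy over visible-class attribute compatibilities for $L_{ACE}$ — both forms are assumptions consistent with the variable definitions, not verbatim from the original), the overall loss could be computed as in this sketch:

```python
import torch
import torch.nn.functional as F

def total_loss(a_pred, a_true, class_attrs, labels, alpha=0.01):
    """L = L_AR + alpha * L_ACE.
    a_pred:      (B, M) predicted attribute values
    a_true:      (B, M) ground-truth attribute values of each sample
    class_attrs: (C_seen, M) attribute vectors of all visible classes
    labels:      (B,) visible-class index of each sample"""
    l_ar = F.mse_loss(a_pred, a_true)           # attribute regression loss
    # Compatibility score of the prediction with every visible class,
    # then cross entropy against the true class index.
    logits = a_pred @ class_attrs.t()           # (B, C_seen)
    l_ace = F.cross_entropy(logits, labels)     # attribute cross entropy loss
    return l_ar + alpha * l_ace
```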
Further, in the step j, the specific formula of the cosine similarity calculation is as follows:
$$S(\hat{a}, a^{c}) = \frac{\hat{a} \cdot a^{c}}{\lVert\hat{a}\rVert\,\lVert a^{c}\rVert}$$

where $\hat{a}$ is the predicted semantic attribute combination of the image to be classified and $a^{c}$ represents the attribute combination contained in comparison class $c$ (covering both visible and invisible classes).
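A sketch of this classification step, assuming the attribute combinations of all comparison classes are stacked into one matrix:

```python
import torch
import torch.nn.functional as F

def classify(a_pred, all_class_attrs):
    """Assign each image the label of the class whose attribute vector has
    the highest cosine similarity with the predicted attribute combination.
    a_pred: (B, M); all_class_attrs: (C, M) over seen + unseen classes."""
    scores = F.cosine_similarity(a_pred.unsqueeze(1),           # (B, 1, M)
                                 all_class_attrs.unsqueeze(0),  # (1, C, M)
                                 dim=-1)                        # -> (B, C)
    return scores.argmax(dim=1)                                 # (B,) labels
```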
Compared with the prior art, the generalized zero sample image classification method based on the fusion visual information has the following technical effects:
1. the method adopts effective visual information fusion to improve the model's processing capability at different resolution levels, giving the zero sample learning classification model better performance and higher precision;
2. through multi-resolution visual information fusion, the method combines global image analysis with local feature extraction, achieves a better visual-semantic embedding effect, and realizes migration from visible classes to invisible classes.
Drawings
FIG. 1 is a flow chart of steps of a generalized zero sample image classification method based on fusion visual information according to the present invention;
FIG. 2 is a schematic diagram of a framework of a feature fusion embedded module according to the present invention;
FIG. 3 is an image processing schematic diagram of the generalized zero sample image classification model based on fusion visual information in this embodiment.
Detailed Description
For a better understanding of the technical content of the present invention, specific examples are set forth below, along with the accompanying drawings.
Aspects of the invention are described herein with reference to the drawings, in which a number of illustrative embodiments are shown. The embodiments of the present invention are not limited to those shown in the drawings. It is to be understood that the invention can be carried out through any of the various concepts and embodiments described above and below in detail, since the disclosed concepts and embodiments are not limited to any particular implementation. In addition, some aspects of the disclosure may be used alone or in any suitable combination with other aspects of the disclosure.
As shown in fig. 1, the generalized zero sample image classification method based on fusion visual information provided by the invention comprises the following steps:
(1) Obtaining all category labels, semantic attributes and corresponding relations of the category labels and the semantic attributes of the data set;
(2) Constructing a generalized zero sample image classification model;
(3) Constructing a visible training sample set;
(4) Training a generalized zero sample image classification model by using a training sample set;
(5) Inputting the images to be classified into a trained model to obtain corresponding semantic attribute combinations;
(6) Comparing the obtained semantic attribute combination with semantic attribute combinations of all categories, and outputting a category label with highest similarity.
In this embodiment, three data sets commonly used for generalized zero sample image classification are adopted: AWA2, CUB and SUN. AWA2 is an animal data set containing 37322 pictures, 50 categories and 85 semantic attributes. CUB is a bird recognition data set containing 11788 pictures, 200 categories and 102 semantic attributes. SUN is a scene recognition data set containing 14340 pictures, 717 categories and 312 semantic attributes. Each sample consists of an image, the category label to which the image belongs and the corresponding semantic attribute combination. Specific information on the data sets is given in Table 1:
TABLE 1
Data set | Training samples | Visible/invisible test samples | Visible/invisible classes | Semantic attributes
AWA2 | 23527 | 5882/7913 | 40/10 | 85
CUB | 7057 | 1440/2580 | 150/50 | 102
SUN | 10320 | 7924/1483 | 645/72 | 312
The generalized zero sample image classification model takes an image as input, outputs the semantic attribute combination of the image, and consists of an image preprocessing module, a feature fusion embedding module and an attribute extraction module. In this embodiment, the model uses PyTorch as the deep learning framework; parameters are optimized with the Adam optimizer at a fixed learning rate of 0.0001, the batch size is set to 64, the loss weighting coefficient is set to 0.01, experiments are run on an NVIDIA RTX 3090 GPU (24 GB), and training is iterated for 100 epochs.
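With these hyperparameters, the training loop might look like the following sketch (`model`, `train_loader`, `seen_class_attrs` and the `total_loss` function are assumed to be defined as in the other sketches; this is illustrative, not the reference code):

```python
import torch

# Hyperparameters as stated in this embodiment.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)   # fixed learning rate
for epoch in range(100):                                    # 100 epochs
    for images, attrs, labels in train_loader:              # batch size 64
        a_pred = model(images)                              # predicted attributes
        loss = total_loss(a_pred, attrs, seen_class_attrs,
                          labels, alpha=0.01)               # loss weight 0.01
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```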
As shown in fig. 2, the feature fusion embedding module includes DeiT blocks, block fusion and feature fusion. Block fusion realizes the conversion between different resolutions: before the DeiT of every level except the first, the resolution level is increased in turn, so that the DeiT blocks of different levels perform self-attention computation on block vectors of different resolutions. The feature fusion part adopts a serial-parallel combined fusion strategy: while the serial structure between the levels is maintained, the feature maps output by each level are fused in parallel by the feature fusion layer. Each DeiT block is composed of a multi-head self-attention layer MSA and an MLP, where layer normalization is applied before each residual connection; the network relationship can be represented by the following formulas:

$$z'_l = \mathrm{MSA}(\mathrm{LN}(z_{l-1})) + z_{l-1}$$

$$z_l = \mathrm{MLP}(\mathrm{LN}(z'_l)) + z'_l$$

where $z_{l-1}$ is the output feature of the preceding layer, $z'_l$ is the input feature of the following layer, and $z_l$ is the output feature of the following layer.
The attribute extraction module comprises a global average pooling layer GAP and a semantic attribute predictor. The semantic attribute predictors are constructed based on all semantic attributes in the corresponding data sets; the fused visual information is processed through the global average pooling layer GAP to obtain the fused visual features, which are then input into the semantic attribute predictor to obtain the corresponding semantic attribute combination.
As shown in fig. 3, the process of training the generalized zero sample image classification model with the training sample set includes the following steps: a convolutional neural network pre-trained on the ImageNet data set is used as the feature extractor to extract visual features from the input image and obtain its visual feature map; the visual feature map is uniformly reshaped to 224×224 resolution and partitioned into non-overlapping 4×4 region blocks, generating a 56×56 grid of blocks, and each block is vectorized to obtain a 192-dimensional block feature vector; finally, relative position codes are added to the block feature vectors to obtain the locally encoded feature vectors of the image. The specific method of adding the relative position codes is as follows: relative position vectors $a_{ij}^K$ and $a_{ij}^V$ are added to the block vectors, and the relative positional relationship between blocks is obtained by the following formulas:

$$e_{ij} = \frac{(x_i W^Q)(x_j W^K + a_{ij}^K)^\top}{\sqrt{d_k}}$$

$$\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{n} \exp(e_{ik})}$$

$$z_i = \sum_{j=1}^{n} \alpha_{ij}\,(x_j W^V + a_{ij}^V)$$

where $x_i$, $x_j$ are input vectors, $d_k$ is the embedding dimension, $e_{ij}$ is the dot-product similarity of $x_i$ and $x_j$, $\alpha_{ij}$ is the similarity weight between $x_i$ and $x_j$ obtained by applying the softmax function to $e_{ij}$, $W^Q$, $W^K$, $W^V$ are the query (Q), key (K) and value (V) parameter matrices optimized during training, and $z_i$ denotes the processed locally encoded feature vector.
The locally encoded feature vectors of the image are input into the feature fusion embedding module. The DeiT of the first level processes the input vectors; the processed feature vectors are passed to the next level and are simultaneously reconstructed by feature fusion into the first-level feature map. At the next level, block fusion merges every 4 adjacent blocks into 1 block of 4 times the size, increasing the resolution level, and the DeiT of that level processes the result; the processed feature vectors continue to the next level while feature fusion reconstructs them into the second-level feature map. Through three such block fusion layers, visual information at four successively increasing resolution levels can be processed. Feature fusion reconstructs the feature vectors processed by the DeiT of each level into the corresponding feature maps, feeds the feature maps output by all levels into the feature fusion layer in parallel, and obtains by weighting the fused visual information that accounts for both local and global cues.
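The block fusion step, merging every 2×2 neighborhood of adjacent blocks into one block covering 4 times the area, might be realized as in this Swin-style sketch (the doubling of the channel dimension is an assumption; the patent does not state the merged block's dimensionality):

```python
import torch
import torch.nn as nn

class BlockFusion(nn.Module):
    """Merges each 2x2 neighborhood of blocks into a single block and
    projects the concatenated features, halving the spatial grid."""
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(4 * dim)
        self.reduction = nn.Linear(4 * dim, 2 * dim, bias=False)

    def forward(self, x, H, W):                     # x: (B, H*W, dim)
        B, _, C = x.shape
        x = x.view(B, H, W, C)
        # Gather the four blocks of every 2x2 neighborhood.
        x = torch.cat([x[:, 0::2, 0::2], x[:, 1::2, 0::2],
                       x[:, 0::2, 1::2], x[:, 1::2, 1::2]], dim=-1)
        x = x.view(B, (H // 2) * (W // 2), 4 * C)   # (B, H*W/4, 4*dim)
        return self.reduction(self.norm(x))         # (B, H*W/4, 2*dim)
```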
The fused visual information is input into the attribute extraction module for semantic attribute prediction to obtain predicted semantic attribute values, which are compared with the true semantic attribute values of the image, and the parameters of the generalized zero sample image classification model are optimized according to the overall model loss value. The overall model loss is calculated as follows:
The attribute regression loss $L_{AR}$ is calculated based on each single attribute in the semantic attribute combination of the sample image and in the predicted semantic attribute combination:

$$L_{AR} = \frac{1}{M}\sum_{i=1}^{M}\left(\hat{a}_i - a_i\right)^2$$

where $M$ is the total number of semantic attributes contained in the preset data set, $\hat{a}_i$ is the value of each single attribute in the predicted semantic attribute combination, and $a_i$ is the value of each single attribute in the semantic attribute combination of the sample image.

The attribute cross entropy loss $L_{ACE}$ is calculated based on the predicted semantic attribute combination of the sample image and the semantic attribute combinations of the corresponding categories:

$$L_{ACE} = -\log \frac{\exp(\hat{a}^{\top} a^{y})}{\sum_{\tilde{y} \in y^{s}} \exp(\hat{a}^{\top} a^{\tilde{y}})}$$

where $y^{s}$ is the set of visible training classes, $a^{\tilde{y}}$ is the attribute combination contained in visible class $\tilde{y}$, and $a^{y}$ is the attribute combination of the class to which the sample image belongs.

The overall model loss $L$ is calculated from the attribute regression loss $L_{AR}$ and the attribute cross entropy loss $L_{ACE}$ according to the following formula:

$$L = L_{AR} + \alpha L_{ACE}$$

where $\alpha$ is the weighting coefficient between the two losses.
Training the model by using training samples in AWA2, CUB and SUN data sets respectively, and iteratively updating parameters of the generalized zero sample image classification model until the parameters converge to obtain the trained generalized zero sample image classification model.
The image to be classified is input into the trained generalized zero sample image classification model, and the semantic attribute combination corresponding to the image is obtained from the mapping between images and semantic attributes learned during training. Cosine similarity is then calculated between the obtained semantic attribute combination and the semantic attribute combinations of all categories in the corresponding data set, and the category label whose semantic attribute combination has the highest similarity is output as the classification result of the generalized zero sample image classification model. The specific formula of the cosine similarity calculation is:

$$S(\hat{a}, a^{c}) = \frac{\hat{a} \cdot a^{c}}{\lVert\hat{a}\rVert\,\lVert a^{c}\rVert}$$

where $\hat{a}$ is the predicted semantic attribute combination of the image to be classified and $a^{c}$ represents the attribute combination contained in comparison class $c$ (covering both visible and invisible classes).
In this embodiment, experiments are carried out under the GZSL setting: both visible-class and invisible-class samples are classified in the test stage, and accuracy is compared with three recent mainstream classification methods, namely the generative-adversarial-based zero sample learning method GAZSL, the semantics-preserving adversarial embedding network SP-AEN, and the ViT-based zero sample learning method ViT-ZSL. The comparison results are detailed in Table 2, with the highest accuracy for each index on each data set shown in bold. Acc_S and Acc_U denote the top-1 accuracy of class prediction for visible-class and invisible-class samples respectively, and Acc_H denotes the harmonic mean of Acc_S and Acc_U, reflecting the model's combined discrimination performance over the two groups of classes. As Table 2 shows, the proposed method attains the highest classification accuracy on all three common zero sample learning data sets, demonstrating the effectiveness of the visual information fusion it adopts. On the comprehensive index Acc_H, it clearly surpasses the next best method by 1.2%, 3.5% and 6.3% on the three data sets, showing that the proposed method balances visible-class and invisible-class prediction better, keeping the model's overall accuracy at a leading level compared with other recent zero sample learning models.
TABLE 2
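For reference, Acc_H is the standard GZSL harmonic mean of the two top-1 accuracies, computable in one line (the inputs below are illustrative, not values from Table 2):

```python
def harmonic_mean(acc_s: float, acc_u: float) -> float:
    """Acc_H: harmonic mean of visible-class (Acc_S) and invisible-class
    (Acc_U) top-1 accuracy, the standard GZSL summary metric."""
    return 2 * acc_s * acc_u / (acc_s + acc_u)

print(harmonic_mean(0.70, 0.60))  # ~0.646 (illustrative inputs)
```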
While the invention has been described in terms of preferred embodiments, it is not intended to be limiting. Those skilled in the art will appreciate that various modifications and adaptations can be made without departing from the spirit and scope of the present invention. Accordingly, the scope of the invention is defined by the appended claims.

Claims (9)

1. A generalized zero sample image classification method based on fusion visual information, used for classifying invisible-class images on which the model has not been trained, characterized in that a generalized zero sample image classification model is obtained through steps S1 to S4 and applied according to steps i to j to classify the images to be classified:
step S1, obtaining all category labels, all semantic attributes and corresponding relations between each category label and each semantic attribute respectively based on a preset data set;
s2, constructing a generalized zero sample image classification model based on all semantic attributes of a preset data set, wherein the generalized zero sample image classification model takes an image as an input and outputs semantic attribute combinations of the image; the generalized zero sample image classification model comprises an image preprocessing module, a feature fusion embedding module and an attribute extraction module; the image preprocessing module obtains local coding feature vectors of the image according to the image; the feature fusion embedding module is used for carrying out feature fusion processing on the local coding feature vector to obtain fusion visual information; the attribute extraction module realizes the mapping from the fusion visual information to the preset semantic attribute;
s3, constructing a training set based on the visible class with training samples in the preset data set; each training sample in the training set comprises an image, a semantic attribute combination of the image and a category label to which the image belongs;
s4, training the generalized zero sample image classification model by utilizing the training set, and learning the mapping relation between the image and the semantic attribute;
step i, inputting an image to be classified into a trained generalized zero sample image classification model to obtain a corresponding semantic attribute combination; and j, respectively performing cosine similarity calculation on the semantic attribute combination of the image to be classified and the semantic attribute combination under all the category labels in the preset data set, and taking the category label corresponding to the semantic attribute combination with the highest similarity as the category label of the image to be classified to realize the generalized zero sample image classification task.
2. The generalized zero sample image classification method based on fusion visual information according to claim 1, wherein in the step S2, the step of preprocessing the image by the image preprocessing module includes:
step S201, performing visual feature extraction on the image by using a feature extractor to obtain a visual feature map of the image;
step S202, dividing the visual feature map obtained in the step S201 into blocks and vectorizing each block; the specific method is as follows: first uniformly reshape the visual feature map of the image to 224×224 resolution, then partition the reshaped feature map into non-overlapping 4×4 region blocks, generating a 56×56 grid of blocks, and vectorize each block to obtain a 192-dimensional block feature vector;
step S203, adding relative position codes to the feature vectors of each block to obtain local coding feature vectors of the image.
3. The method according to claim 2, wherein in step S201, a convolutional neural network pre-trained on an ImageNet dataset is used as the feature extractor.
4. The generalized zero sample image classification method based on fusion visual information according to claim 2, wherein in the step S203, the step of adding relative position codes to the block vectors is specifically as follows: relative position vectors $a_{ij}^K$ and $a_{ij}^V$ are added to the block vectors, and the relative positional relationship between blocks is obtained by the following formulas:

$$e_{ij} = \frac{(x_i W^Q)(x_j W^K + a_{ij}^K)^\top}{\sqrt{d_k}}$$

$$\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{n} \exp(e_{ik})}$$

$$z_i = \sum_{j=1}^{n} \alpha_{ij}\,(x_j W^V + a_{ij}^V)$$

where $x_i$, $x_j$ are input vectors, $d_k$ is the embedding dimension, $e_{ij}$ is the dot-product similarity of $x_i$ and $x_j$, $\alpha_{ij}$ is the similarity weight between $x_i$ and $x_j$ obtained by applying the softmax function to $e_{ij}$, $W^Q$, $W^K$, $W^V$ are the query (Q), key (K) and value (V) parameter matrices optimized during training, and $z_i$ denotes the processed locally encoded feature vector.
5. The generalized zero sample image classification method based on fusion visual information according to claim 1, wherein in the step S2, the feature fusion embedding module is composed of DeiT blocks, block fusion and feature fusion; the block fusion is used for realizing the conversion between different resolutions; the feature fusion adopts a serial-parallel combined fusion strategy;
each DeiT block is composed of a multi-head self-attention layer MSA and an MLP, where layer normalization is applied before each residual connection, and the network relationship can be represented by the following formulas:

$$z'_l = \mathrm{MSA}(\mathrm{LN}(z_{l-1})) + z_{l-1}$$

$$z_l = \mathrm{MLP}(\mathrm{LN}(z'_l)) + z'_l$$

where $z_{l-1}$ is the output feature of the preceding layer, $z'_l$ is the input feature of the following layer, and $z_l$ is the output feature of the following layer;
the input locally encoded feature vectors are processed by the DeiT of the first level; the processed feature vectors are passed to the next level and simultaneously reconstructed by feature fusion into the first-level feature map; block fusion merges every 4 adjacent blocks into 1 block of 4 times the size, increasing the resolution level, after which the DeiT of that level processes the vectors, the processed feature vectors continue to the next level, and feature fusion reconstructs them into the second-level feature map; through three such block fusion layers, the model can process visual information at four successively increasing resolution levels; feature fusion reconstructs the feature vectors processed by the DeiT of each level into the corresponding feature maps, feeds the feature maps output by all levels into the feature fusion layer in parallel, and obtains by weighting the fused visual information that accounts for both local and global cues.
6. The generalized zero sample image classification method based on fusion visual information according to claim 1, wherein in the step S2, the attribute extraction module includes two parts, namely a global average pooling layer GAP and a semantic attribute predictor; the semantic attribute predictor is constructed based on all semantic attributes in the preset data set, the fused visual information is processed through the global average pooling layer GAP to obtain the fused visual features, and the fused visual features are then input into the semantic attribute predictor to obtain the corresponding semantic attributes.
7. The method for classifying generalized zero-sample images based on fused visual information according to claim 1, wherein in the step S4, the step of training the generalized zero-sample image classification model comprises:
s401, inputting a sample image into a generalized zero sample image classification model to obtain a semantic attribute combination of the sample image, and marking the semantic attribute combination as a predicted semantic attribute combination of the sample image;
step S402, obtaining semantic attribute combinations of corresponding categories of sample images from a preset database based on category labels of the sample images;
step S403, calculating the overall loss of the model by using a loss function based on semantic attribute combination of the sample image, prediction semantic attribute combination and semantic attribute combination of the corresponding category, and optimizing parameters of the generalized zero sample image classification model according to the overall loss value of the model;
and step S404, iteratively updating parameters of the generalized zero sample image classification model by using sample images in the training set until the parameters converge to obtain the trained generalized zero sample image classification model.
8. The generalized zero sample image classification method according to claim 7, wherein in step S403, calculating the model overall loss using the loss function specifically includes:
calculating the attribute regression loss $L_{AR}$ based on each single attribute in the semantic attribute combination of the sample image and in the predicted semantic attribute combination, with the calculation formula:

$$L_{AR} = \frac{1}{M}\sum_{i=1}^{M}\left(\hat{a}_i - a_i\right)^2$$

where $M$ is the total number of semantic attributes contained in the preset data set, $\hat{a}_i$ is the value of each single attribute in the predicted semantic attribute combination, and $a_i$ is the value of each single attribute in the semantic attribute combination of the sample image;

calculating the attribute cross entropy loss $L_{ACE}$ based on the predicted semantic attribute combination of the sample image and the semantic attribute combinations of the corresponding categories, with the calculation formula:

$$L_{ACE} = -\log \frac{\exp(\hat{a}^{\top} a^{y})}{\sum_{\tilde{y} \in y^{s}} \exp(\hat{a}^{\top} a^{\tilde{y}})}$$

where $y^{s}$ is the set of visible training classes, $a^{\tilde{y}}$ is the attribute combination contained in visible class $\tilde{y}$, and $a^{y}$ is the attribute combination of the class to which the sample image belongs;

calculating the overall model loss $L$ based on the attribute regression loss $L_{AR}$ and the attribute cross entropy loss $L_{ACE}$ according to the following formula:

$$L = L_{AR} + \alpha L_{ACE}$$

where $\alpha$ is the weighting coefficient between the two losses.
9. The generalized zero sample image classification method based on fusion visual information according to claim 1, wherein in the step j, a specific formula of cosine similarity calculation is as follows:
$$S(\hat{a}, a^{c}) = \frac{\hat{a} \cdot a^{c}}{\lVert\hat{a}\rVert\,\lVert a^{c}\rVert}$$

where $\hat{a}$ is the predicted semantic attribute combination of the image to be classified and $a^{c}$ represents the attribute combination contained in comparison class $c$ (covering both visible and invisible classes).
CN202310590814.4A 2023-05-24 2023-05-24 Generalized zero sample image classification method based on fusion visual information Pending CN116797821A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310590814.4A CN116797821A (en) 2023-05-24 2023-05-24 Generalized zero sample image classification method based on fusion visual information

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310590814.4A CN116797821A (en) 2023-05-24 2023-05-24 Generalized zero sample image classification method based on fusion visual information

Publications (1)

Publication Number Publication Date
CN116797821A 2023-09-22

Family

ID=88043108

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310590814.4A Pending CN116797821A (en) 2023-05-24 2023-05-24 Generalized zero sample image classification method based on fusion visual information

Country Status (1)

Country Link
CN (1) CN116797821A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118097721A (en) * 2024-04-29 2024-05-28 江西师范大学 Wetland bird recognition method and system based on multi-source remote sensing observation and deep learning



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination