CN114119979A - Fine-grained image classification method based on segmentation mask and self-attention neural network - Google Patents

Fine-grained image classification method based on segmentation mask and self-attention neural network

Info

Publication number
CN114119979A
Authority
CN
China
Prior art keywords
layer
fine
node
image
grained
Prior art date
Legal status
Pending
Application number
CN202111480727.0A
Other languages
Chinese (zh)
Inventor
牛毅 (Niu Yi)
张玉婷 (Zhang Yuting)
马明明 (Ma Mingming)
李甫 (Li Fu)
张犁 (Zhang Li)
Current Assignee
Xidian University
Original Assignee
Xidian University
Priority date
Filing date
Publication date
Application filed by Xidian University filed Critical Xidian University
Priority to CN202111480727.0A priority Critical patent/CN114119979A/en
Publication of CN114119979A publication Critical patent/CN114119979A/en
Pending legal-status Critical Current


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/70 Determining position or orientation of objects or cameras
    • G06T 7/73 Determining position or orientation of objects or cameras using feature-based methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a fine-grained image classification method based on a segmentation mask and a self-attention neural network. It mainly addresses the problems of existing fine-grained classification methods: complicated training steps, high training difficulty, and failure to accurately locate and classify objects when the image background noise is too strong. The scheme is as follows: download the pre-divided training and test sample sets from a public fine-grained image dataset and obtain the class label of each image; construct a fine-grained classification model by cascading a loss network, a segmentation-mask generation network and a self-attention neural network; train the classification model on the training sample set with a gradient descent method; and input the test sample set into the trained fine-grained classification model to obtain the classification results of the fine-grained images. The method automatically generates a mask that enhances the image, locates the image foreground and weakens background noise; its training strategy is simple, and it further improves classification accuracy. The method can be used in intelligent security and unmanned retail business activities.

Description

Fine-grained image classification method based on segmentation mask and self-attention neural network
Technical Field
The invention belongs to the technical field of computer vision, and particularly relates to a fine-grained image classification method which can be used for intelligent security and unmanned retail business activities.
Background
With the continuous development of deep learning in recent years, ever more mature convolutional neural network models have been applied in computer vision, for example in autonomous driving, face recognition, medical image analysis and target tracking. Image classification is an important research topic in this field. Traditional image classification distinguishes coarse categories of objects in an image, such as cats versus dogs, so the labels are coarse-grained; fine-grained image classification instead identifies the sub-categories within a coarse category, so the classification labels are finer, e.g. the species of a bird, the model of a vehicle or the variety of a flower. Because the granularity is fine and all images belong to the same coarse category, the task is more challenging than traditional image classification. Since fine-grained classification is characterized by small differences between sub-classes and large differences within each sub-class, it centers on extracting detailed image features. For example, birds of different sub-classes look similar in the same pose and static state, while features extracted from birds of the same sub-class differ markedly under different poses, illumination and occlusion; features of parts such as the beak, head and wings therefore need to be extracted with particular emphasis.
Fine-grained image classification has extensive research demands and application scenarios in both academia and industry. Related research mainly involves vehicles, flowers, birds, aircraft and similar domains, and wide application scenarios also exist in daily life. For example, in security scenarios, road monitoring needs to identify the model and model year of passing vehicles; in emerging unmanned retail scenarios, goods must be identified accurately to meet consumers' shopping needs. Designing an accurate and efficient fine-grained image classification model is therefore of great significance.
In traditional computer vision, feature extraction for fine-grained images follows a strongly supervised approach with manual annotation: besides the classification labels of the images, bounding boxes are used to detect foreground objects and to localize local regions for part-feature extraction. Although this can effectively improve classification accuracy, it consumes a large amount of manual annotation effort; the manually annotated positions are not necessarily the regions with the most salient detail features, the accuracy depends heavily on the annotator, and datasets with many samples and many categories cannot be recognized quickly and effectively, so the practicality is low.
In recent years, with the development and application of deep learning, weakly supervised fine-grained recognition models have gradually become mainstream, mostly using convolutional neural networks for feature extraction. These deep-learning models can be roughly divided into two categories: part-localization methods and feature-encoding methods. Part-localization methods locate discriminative detail regions of the image through an attention module, use them as local information and fuse them with the global image information to output a more accurate classification result. The fine-grained image classification method based on feature fusion disclosed in patent application CN202110179265.2 and the fine-grained image classification method based on a multi-layer focused attention network disclosed in patent application CN202011588241.4 both belong to this category. Although such methods can automatically locate salient image regions and fuse global and detail features, they suffer from high training difficulty, high training complexity, and a cumbersome multi-stage feature extraction process. Feature-encoding models obtain bilinear features of an image by computing the outer product of features at different spatial positions followed by average pooling. For example, the high-order bilinear feature network (Bilinear-CNN, B-CNN) proposed by Lin et al. achieves high classification accuracy but cannot capture the nonlinear relationships between feature channels.
In recent years, the self-attention neural network, the Transformer, has made a significant breakthrough in the vision field. The original Transformer is a highly effective model for natural language processing and is mainly applied to tasks such as machine translation, sentiment analysis and information extraction. Dosovitskiy et al. first proposed using a self-attention neural network in computer vision, the Vision Transformer (ViT): the network divides the image into fixed-size image slices and extracts features with the self-attention network. Although the Transformer's feature extraction capability is strong, its image division scheme is too rigid; when the background proportion in an image is too large, it cannot effectively locate the target object and the feature extraction effect is poor.
Disclosure of Invention
The invention aims to provide a fine-grained image classification method based on a segmentation mask and a self-attention neural network that overcomes the above defects of the prior art: it simplifies the network training process and, while preserving fine-grained classification accuracy, effectively locates target objects in images with a large background proportion and noisy backgrounds, thereby improving the feature extraction effect.
The technical idea for realizing this aim is as follows: extract the contour features and approximate position information of the original image through a segmentation network to generate a segmentation mask, so that the target object can be located effectively even when the background proportion is large and the background noise is heavy; fuse the segmentation mask with the original image point-to-point and feed the fused image into a self-attention neural network to extract the global and local detail features of the image; and output the classification result directly from the last classification head of the self-attention neural network, which simplifies the training process of the network.
According to the above concept, the implementation scheme of the invention comprises the following steps:
(1) downloading a divided training sample set and a divided testing sample set from a public data set of the fine-grained images to obtain class labels corresponding to the images;
(2) constructing a fine-grained classification model:
(2a) establishing a loss network formed by sequentially cascading an input layer, a regularization layer and a central vector updating layer;
(2b) selecting a segmentation mask to generate a network and a self-attention neural network, and cascading the network and a loss network to form a fine-grained classification model;
(2c) setting the total loss function L of the fine-grained classification model as the sum of a cross-entropy loss function L1, a center vector loss function L2 and a contrast loss function L3, expressed as follows:
L = αL1(y, y′) + βL2 + γL3
where α, β and γ are the weights of L1, L2 and L3 respectively, and y and y′ respectively represent the real image label and the image label predicted by the network;
(3) training the classification model on the training sample set with a gradient descent method to obtain a trained fine-grained classification model;
(4) and inputting the test sample set into a trained fine-grained classification model to obtain a classification result of the fine-grained image.
Compared with the prior art, the invention has the following advantages:
1. The fine-grained classification model constructed by the invention automatically generates a mask for enhancing the image, locates the image foreground and weakens background noise, so that the network can find more discriminative global and detail features.
2. When classifying fine-grained images, the classification model requires no multi-stage complex training; it is optimized directly through the defined total loss. The whole process is optimized end to end, the training strategy is simple, and the classification accuracy is further improved.
Drawings
FIG. 1 is a flow chart of an implementation of the present invention;
FIG. 2 is a schematic structural diagram of the fine-grained image classification model constructed in the present invention.
Detailed Description
The present invention is described in further detail below with reference to the attached drawings and specific examples.
Referring to FIG. 1, the implementation steps of the invention are as follows:
Step 1, download the pre-divided training sample set and testing sample set from a public fine-grained image dataset, and obtain the category label of each image.
This example uses, but is not limited to, public datasets commonly used in fine-grained classification, including the bird dataset Caltech-UCSD Birds-200-2011 released by the California Institute of Technology in 2010, the vehicle dataset Stanford Cars released by the Stanford University artificial intelligence laboratory, and the dog dataset Stanford Dogs. These datasets have already been partitioned into training and test sample sets, and class labels are provided for the samples, where:
the CUB-200-2011 data set has 11788 bird images, which comprise 200 bird subclasses, the training data set has 5994 images, the test set has 5794 images, each image provides image category information, a bounding box of the bird in the image, key part information of the bird and attribute information of the bird. The experiment of the embodiment only needs the category information of the birds, and does not need other supervision information;
the Stanford Cars data set comprises 16185 images of vehicles, including 196 types of automobiles, the training data set comprises 8144 images, the testing data set comprises 8041 images, the number of each type is equivalent, the number of the images contained in each type is consistent, and the images are classified mainly according to the brand, the model and the year of the automobile;
the Stanford Dogs dataset had a total of 20580 dog images including 120 different dog species worldwide, a training set of 12000 images, and a test set of 8580 images, each of which provided class label and bounding box information, and as such, only the class information for the images is required for this example.
Step 2, construct the fine-grained classification model.
Referring to fig. 2, the specific implementation of this step is as follows:
(2.1) establishing a loss network which is formed by sequentially cascading an input layer, a regularization layer and a center vector updating layer:
the input layer is used for inputting the center feature vectors ci; it is a learnable matrix initialized to all zeros, whose rows correspond to the number of categories of the fine-grained public dataset and whose columns equal the product of the number of self-attention heads and the hidden dimension. When the training set is CUB-200-2011 the dimension of the center matrix is 200 × 9216; the dimension is adjusted for different datasets or model sizes;
the normalization layer is used for updating the central characteristic vector in a normalization mode of 2 norm so as to enable the central characteristic vector to meet normal distribution;
the center vector updating layer is used for updating the center feature vectors, with an update weight of 0.05, as sketched below;
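To make the three layers concrete, here is a minimal PyTorch sketch. The moving-average form of the update is an assumption (the text only gives the update weight 0.05), and all names are illustrative; 200 × 9216 matches CUB-200-2011 with 12 heads × 768 hidden dimensions.

```python
import torch
import torch.nn.functional as F

class CenterLossNetwork(torch.nn.Module):
    """Sketch of the loss network: a learnable all-zero center matrix
    (input layer), 2-norm normalization (regularization layer) and a
    moving-average update with weight 0.05 (center vector updating layer)."""
    def __init__(self, num_classes=200, feat_dim=9216, update_weight=0.05):
        super().__init__()
        # input layer: one center feature vector per class, initialized to 0
        self.centers = torch.nn.Parameter(torch.zeros(num_classes, feat_dim))
        self.update_weight = update_weight

    @torch.no_grad()
    def update(self, feats, labels):
        # regularization layer: 2-norm normalization of the centers
        centers = F.normalize(self.centers, p=2, dim=1)
        # updating layer: move each class center toward the batch mean of
        # that class with weight 0.05 (moving-average form is an assumption)
        for c in labels.unique():
            batch_mean = feats[labels == c].mean(dim=0)
            centers[c] = (1 - self.update_weight) * centers[c] \
                         + self.update_weight * batch_mean
        self.centers.copy_(centers)
```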
(2.2) selecting a segmentation-mask generation network for outputting a mask with the same size as the original image, comprising a convolutional layer, 6 hierarchical deep aggregation layers, a node fusion layer and a deconvolution layer, wherein:
the 6 hierarchical deep aggregation layers are used for down-sampling the image. Each hierarchical deep aggregation layer is a tree with a similar structure, with tree depths of 1, 2 and 1; each tree comprises a root node and 2-8 basic blocks, and each basic block comprises two convolution layers, a regularization layer and a ReLU activation layer. This corresponds to the first column of the left half of FIG. 2, where the numbers in the boxes denote the down-sampling factor; passing through the 6 hierarchical deep aggregation layers, the image successively yields features down-sampled by 4, 8, 16 and 32 times.
the node fusion layer is used for up-sampling and fusing the nodes output by the hierarchical deep aggregation layers and comprises a deep aggregation up-sampling operation and an iterative deep aggregation up-sampling operation. Both operations complete the up-sampling with two deconvolution layers, as shown by the dashed arrows in FIG. 2. Up-sampling and then fusing features of different sizes from different stages enhances the representation capability of the model (see the sketch after Table 1 below).
the deconvolution layer is used for up-sampling the output of the node fusion layer; the resolution of the final output feature map is 448 × 448, and after a sigmoid activation the feature map yields a mask with the same size as the original image.
The output dimensions of the above layers are shown in Table 1 ("output dimensionality of the layers of the network"), which is rendered as an image in the source document and is not reproduced here.
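As an illustration of one node fusion step, the following is a hedged PyTorch sketch of a single aggregation node. The text only states that two deconvolution layers complete the up-sampling; the fusion by concatenation plus a 1×1 convolution, and all channel sizes and names, are assumptions.

```python
import torch
import torch.nn as nn

class UpsampleFuseNode(nn.Module):
    """Hypothetical aggregation node: up-sample the coarser feature map
    with two deconvolution layers, then fuse it with the finer map."""
    def __init__(self, coarse_ch, fine_ch, out_ch):
        super().__init__()
        self.up = nn.Sequential(
            # first deconvolution doubles the spatial resolution
            nn.ConvTranspose2d(coarse_ch, out_ch, kernel_size=4,
                               stride=2, padding=1),
            nn.ReLU(inplace=True),
            # second deconvolution refines at the same resolution
            nn.ConvTranspose2d(out_ch, out_ch, kernel_size=3,
                               stride=1, padding=1),
        )
        self.fuse = nn.Sequential(
            nn.Conv2d(fine_ch + out_ch, out_ch, kernel_size=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, coarse, fine):
        up = self.up(coarse)                         # 2x up-sampling
        return self.fuse(torch.cat([up, fine], 1))   # fuse with finer map
```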
(2.3) selecting a self-attention neural network for extracting the global and local features of the image. It comprises an image slice embedding layer, a position coding layer, L self-attention feedforward coding layers and two fully connected layers, wherein:
the image slice embedding layer is used for dividing the image x after the mask enhancement into image slices with the size of D multiplied by D,
Figure BDA0003395163580000061
h, W, C indicate the length, width and channel number of the image, so the number of the divided image slices N is N HW/D2
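The following PyTorch sketch makes the mask enhancement and slicing concrete. It assumes that the "point-to-point fusion" of the mask with the original image is an element-wise multiplication of the sigmoid-activated mask with the image (the text does not spell out the fusion operator) and that slices do not overlap; function names are illustrative. With H = W = 448 and D = 14 as used later in this example, N = 448·448/14² = 1024 slices per image.

```python
import torch

def mask_enhance(image: torch.Tensor, mask_logits: torch.Tensor) -> torch.Tensor:
    """Point-to-point fusion of the segmentation mask with the original
    image, assumed here to be element-wise multiplication.
    image: (B, C, H, W); mask_logits: (B, 1, H, W) deconvolution output."""
    mask = torch.sigmoid(mask_logits)   # mask in [0, 1], same size as image
    return image * mask                 # suppress background pixels

def to_image_slices(x: torch.Tensor, D: int = 14) -> torch.Tensor:
    """Divide the mask-enhanced image into non-overlapping D x D slices;
    the slice count is N = H*W / D**2.
    x: (B, C, H, W) -> (B, N, C*D*D)"""
    B, C, H, W = x.shape
    s = x.unfold(2, D, D).unfold(3, D, D)   # (B, C, H/D, W/D, D, D)
    s = s.permute(0, 2, 3, 1, 4, 5)          # (B, H/D, W/D, C, D, D)
    return s.reshape(B, (H // D) * (W // D), C * D * D)
```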
the position coding layer is used for encoding the position information of the image slices; this example uses standard learnable one-dimensional position codes;
the L self-attention feedforward coding layers are used for extracting the features of the image. Each self-attention feedforward coding layer comprises a multi-head attention module MSA, a feedforward connection module MLP and two layer-normalization operations, where the number of heads of the multi-head attention module is K and the feedforward connection module comprises two fully connected layers.
The output Z′_p of the multi-head attention module and the output Z_p of the feedforward connection module in the p-th self-attention feedforward coding layer are respectively:

Z′_p = MSA(LN(Z_{p-1})) + Z_{p-1}
Z_p = MLP(LN(Z′_p)) + Z′_p

where LN denotes layer normalization, Z_{p-1} denotes the output of the (p−1)-th self-attention feedforward coding layer, p = 1, …, L, and L is the total number of self-attention feedforward coding layers.
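These two equations map directly onto a pre-norm Transformer block. A minimal PyTorch sketch follows, assuming the ViT-B sizes implied elsewhere in the text (hidden size 768, K = 12 heads, MLP size 3072); the class name is illustrative.

```python
import torch.nn as nn

class SelfAttentionFeedforwardLayer(nn.Module):
    """One self-attention feedforward coding layer implementing
    Z'_p = MSA(LN(Z_{p-1})) + Z_{p-1} and Z_p = MLP(LN(Z'_p)) + Z'_p."""
    def __init__(self, dim=768, heads=12, mlp_dim=3072):
        super().__init__()
        self.ln1 = nn.LayerNorm(dim)
        self.msa = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ln2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, mlp_dim), nn.GELU(), nn.Linear(mlp_dim, dim))

    def forward(self, z):
        h = self.ln1(z)
        z = self.msa(h, h, h, need_weights=False)[0] + z   # Z'_p
        return self.mlp(self.ln2(z)) + z                    # Z_p
```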
Since the input of the L-th self-attention feedforward coding layer consists of the K most discriminative image slices found in the first L−1 layers, these slices are selected using the attention maps generated by the self-attention feedforward coding layers, implemented as follows:

first, the output of the (L−1)-th self-attention feedforward coding layer is written as

Z_{L-1} = [z^0_{L-1}; z^1_{L-1}; …; z^N_{L-1}]

where Z_{L-1} is the output of the (L−1)-th self-attention feedforward coding layer, z^0_{L-1} is the classification head of the (L−1)-th layer, N is the number of divided image slices, and z^U_{L-1} denotes the U-th image slice output by the (L−1)-th layer, with U ranging from 1 to N;

similarly, the list A_p of attention maps output by the K heads of the multi-head attention module in the p-th self-attention feedforward coding layer is written as

A_p = [A^1_p, A^2_p, …, A^K_p]

where A^W_p denotes the attention map output by the W-th head of the multi-head attention module in the p-th layer, with W ranging from 1 to K;

then, the attention weights of all heads over the first L−1 layers are fused to obtain the final weights a_final:

a_final = ∏_{p=1}^{L-1} A_p

finally, for each head only the image slice corresponding to the position of the maximum response value in a_final is chosen, yielding the input Z_local of the L-th self-attention feedforward coding layer:

Z_local = [z^0_{L-1}; z^{a_1}_{L-1}; z^{a_2}_{L-1}; …; z^{a_K}_{L-1}]

where a_W denotes the index of the maximum response of the W-th head of the multi-head attention module in a_final, i.e. the most discriminative image slice selected from the W-th head over the first L−1 self-attention feedforward coding layers, with W ranging from 1 to K. In this example K = 12, L = 12 and D = 14; a sketch of this selection step is given below.
the fully connected layers are used for obtaining the final classification prediction y′.
and 2.4) sequentially cascading the segmentation mask generation network, the self-attention neural network and the loss network to form a fine-grained classification model, as shown in FIG. 2.
(2.5) setting the total loss function L of the fine-grained classification model as the sum of a cross-entropy loss function L1, a center vector loss function L2 and a contrast loss function L3, expressed as follows:

L = αL1(y, y′) + βL2 + γL3

where α, β and γ are the weights of L1, L2 and L3 respectively, and y and y′ respectively represent the real image label and the image label predicted by the network.
The cross-entropy loss L1, center vector loss L2 and contrast loss L3 are respectively:

L1 = −[y log y′ + (1 − y) log(1 − y′)]

L2 = Σ_{i=1}^{N} ‖z^0_{L,i} − c_{y_i}‖²_2

L3 = (1/N²) Σ_{i=1}^{N} Σ_{j=1}^{N} [ 1(y_i = y_j)(1 − sim(z_i, z_j)) + 1(y_i ≠ y_j) max(sim(z_i, z_j) − α, 0) ]

where N is the batch size, z_i and z_j denote the features of the i-th and j-th pictures in the batch, α is the loss boundary, y_i and y_j denote the classes of the i-th and j-th pictures, sim(z_i, z_j) denotes the cosine similarity between z_i and z_j, c_{y_i} denotes the feature center vector of class y_i, and z^0_{L,i} denotes the vector output by the 0-th image slice (classification head) of the i-th picture at the L-th layer of the self-attention neural network.
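A minimal PyTorch sketch of the total loss under the forms above follows. The margin value 0.4 (standing in for the loss boundary, which the text does not give) and the batched implementation details are assumptions, and all names are illustrative.

```python
import torch
import torch.nn.functional as F

def total_loss(logits, y, z, centers,
               alpha=1.0, beta=0.1, gamma=1.0, margin=0.4):
    """Sketch of L = alpha*L1 + beta*L2 + gamma*L3.
    logits: (B, num_classes); y: (B,) labels;
    z: (B, dim) class-token features; centers: (num_classes, dim)."""
    l1 = F.cross_entropy(logits, y)                      # cross-entropy loss
    l2 = (z - centers[y]).pow(2).sum(dim=1).mean()       # center vector loss
    # pairwise cosine similarities within the batch
    sim = F.cosine_similarity(z.unsqueeze(1), z.unsqueeze(0), dim=-1)
    same = (y.unsqueeze(0) == y.unsqueeze(1)).float()
    l3 = (same * (1 - sim)                               # pull same class
          + (1 - same) * (sim - margin).clamp(min=0)     # push others apart
          ).mean()
    return alpha * l1 + beta * l2 + gamma * l3
```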
Step 3, train the classification model on the training sample set with a gradient descent method to obtain the trained fine-grained classification model:
(3a) initialize the parameters of the segmentation-mask generation network, the self-attention neural network and the loss network:
use weights pre-trained on the ImageNet image dataset as the initial weights of the segmentation-mask generation network;
use weights pre-trained on the ImageNet image dataset as the initial weights of the self-attention neural network;
initialize the loss network randomly.
Set the maximum number of iterations of the fine-grained classification model to 10000, the initial learning rate to 0.03, the learning-rate decay strategy to fixed-step decay with a factor of 0.1 every 3000 steps, the optimizer to SGD, and the momentum to 0.9;
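These settings translate directly into standard PyTorch objects; the placeholder `model` below stands in for the fine-grained classification model and is an illustrative assumption.

```python
import torch

# SGD with momentum 0.9, initial learning rate 0.03,
# fixed-step decay by 0.1 every 3000 steps, 10000 iterations in total.
model = torch.nn.Linear(9216, 200)  # placeholder for the classification model
optimizer = torch.optim.SGD(model.parameters(), lr=0.03, momentum=0.9)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=3000, gamma=0.1)
max_iterations = 10000
```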
(3b) sequentially apply cropping, random horizontal flipping, random erasing and normalization to the pictures of the training sample set, then feed them into the fine-grained classification model;
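The preprocessing in (3b) corresponds to standard torchvision transforms. In the sketch below, the resize/crop sizes and the ImageNet normalization statistics are assumptions; the text names the operations but not their parameters.

```python
from torchvision import transforms

train_transform = transforms.Compose([
    transforms.Resize(512),                 # assumed pre-crop size
    transforms.RandomCrop(448),             # cropping (448 matches the mask size)
    transforms.RandomHorizontalFlip(),      # random horizontal flipping
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],   # assumed ImageNet stats
                         std=[0.229, 0.224, 0.225]),
    transforms.RandomErasing(),             # random erasing (on the tensor)
])
```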
(3c) set the weights α and γ of the cross-entropy and contrast loss functions to 1 and the weight β of the center vector loss function to 0.1; compute the cross-entropy loss L1 from the prediction y′ output by the fine-grained classification model and the real label y; compute the center vector loss L2 from the output categories y_i of the same batch and the class center vectors c_{y_i}; compute the contrast loss L3 from the cosine similarities between image features of different classes and of the same class;
(3d) substitute the results of step (3c) into the total loss function L of the fine-grained classification model and iteratively update the model parameters with a gradient descent algorithm;
(3e) record the accuracy and loss value of each training iteration, validate once every 100 iterations, and save the model parameters that perform best on validation. When the maximum of 10000 iterations is reached, the trained fine-grained classification model is obtained.
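The loop below is a hedged sketch of (3c)-(3e). `train_loader`, `val_loader`, `evaluate`, `centers`, and a `model` that returns both logits and class-token features are illustrative assumptions, not part of the original disclosure; `total_loss`, `optimizer`, `scheduler` and `max_iterations` refer to the sketches above.

```python
import itertools
import torch

# Train for at most `max_iterations` steps, validating every 100 iterations
# and keeping the weights that perform best on validation.
best_acc = 0.0
for step, (images, labels) in enumerate(itertools.cycle(train_loader)):
    if step >= max_iterations:
        break
    logits, feats = model(images)            # assumed model interface
    loss = total_loss(logits, labels, feats, centers)
    optimizer.zero_grad()
    loss.backward()                           # gradient descent update
    optimizer.step()
    scheduler.step()                          # fixed-step learning-rate decay
    if (step + 1) % 100 == 0:                 # validate every 100 iterations
        acc = evaluate(model, val_loader)     # assumed evaluation helper
        if acc > best_acc:                    # keep best validation weights
            best_acc = acc
            torch.save(model.state_dict(), "best_model.pt")
```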
Step 4, input the test sample set into the trained fine-grained classification model to obtain the classification results of the fine-grained images.
The effects of the present invention can be further illustrated by the following comparative data:
The 5794 test samples of the CUB-200-2011 dataset are used to verify the effectiveness of the method and compare it with 7 methods: DBTNet, S3N, FDL, API-Net, StackedLSTM, ViT and TransFG. The experimental results are shown in Table 2.
TABLE 2 Classification results of different methods on the CUB-200-2011 dataset

Method          Recognition accuracy
DBTNet          88.1%
S3N             88.5%
FDL             89.1%
API-Net         90.0%
StackedLSTM     90.4%
ViT             90.3%
TransFG         91.4%
The invention   91.7%
As can be seen from Table 2, the recognition accuracy of the invention is higher than that of all the mainstream methods listed, giving it a clear advantage over the other methods.
The above description is only a specific example of the present invention and does not constitute any limitation of it. After understanding the content and principle of the invention, those skilled in the art may make various modifications and changes in form and detail, such as replacing the segmentation-mask generation network or the self-attention neural network, without departing from its principle and structure; such modifications and changes based on the inventive concept still fall within the scope of the appended claims.

Claims (6)

1. A fine-grained image classification method based on a segmentation mask and a self-attention neural network is characterized by comprising the following steps:
(1) downloading a divided training sample set and a divided testing sample set from a public fine-grained image data set to obtain a category label corresponding to each image;
(2) constructing a fine-grained classification model:
(2a) establishing a loss network formed by sequentially cascading an input layer, a regularization layer and a central vector updating layer;
(2b) selecting a segmentation mask to generate a network and a self-attention neural network, and cascading the network and a loss network to form a fine-grained classification model;
(2c) setting the total loss function L of the fine-grained classification model as the sum of a cross-entropy loss function L1, a center vector loss function L2 and a contrast loss function L3, expressed as follows:
L = αL1(y, y′) + βL2 + γL3
where α, β and γ are the weights of L1, L2 and L3 respectively, and y and y′ respectively represent the real image label and the image label predicted by the network;
(3) training the classification model on the training sample set with a gradient descent method to obtain a trained fine-grained classification model;
(4) and inputting the test sample set into a trained fine-grained classification model to obtain a classification result of the fine-grained image.
2. The method of claim 1, wherein the loss network established in (2a) has the following layers:
the input layer is used for inputting the center feature vectors as a learnable matrix of dimension 200 × 9216 initialized to all zeros, whose rows correspond to the number of categories of the fine-grained public dataset and whose columns equal the product of the number of self-attention heads and the hidden dimension, the dimension being adjusted for different datasets or model sizes;
the regularization layer is used for normalizing the center feature vectors with the 2-norm so that they satisfy a normal distribution;
the center vector updating layer is used for updating the center feature vectors with an update weight of 0.05.
3. The method according to claim 1, wherein the segmentation-mask generation network selected in (2b) outputs a mask with the same size as the original image and comprises a convolutional layer, 6 hierarchical deep aggregation layers, a node fusion layer and a deconvolution layer, wherein:
the 6 hierarchical deep aggregation layers are used for down-sampling; each is a tree with a similar structure, with tree depths of 1, 2 and 1, and each tree comprises a root node and 2-8 basic blocks, each basic block comprising two convolution layers, a regularization layer and a ReLU activation layer;
the node fusion layer is used for up-sampling and fusing nodes and comprises a deep aggregation up-sampling operation and an iterative deep aggregation up-sampling operation;
the deep aggregation up-sampling operation fuses the output of the third deep aggregation layer with the output of the fourth deep aggregation layer to generate a first node O1, fuses the output of the fourth deep aggregation layer with the output of the fifth deep aggregation layer to generate a second node O2, fuses the output of the fifth deep aggregation layer with the output of the sixth deep aggregation layer to generate a third node O3, fuses the first node O1 with the second node O2 to generate a fourth node O4, fuses the second node O2 with the third node O3 to generate a fifth node O5, and fuses the fourth node O4 with the fifth node O5 to generate a sixth node O6;
the iterative deep aggregation up-sampling operation fuses the fifth node O5 with the sixth node O6 to generate a seventh node O7, fuses the seventh node O7 with the third node O3 to generate an eighth node O8, and fuses the eighth node O8 with the output of the third deep aggregation layer to generate a ninth node O9.
4. The method as claimed in claim 1, wherein the self-attention neural network in (2b) consists of an image slice embedding layer, a position encoding layer, 12 self-attention feedforward encoding layers and two fully connected layers connected in sequence; each self-attention feedforward encoding layer comprises a multi-head attention module MSA and a feedforward connection module MLP, the number of heads of the multi-head attention module is 12, and the dimension of the feedforward connection module is 3072.
5. The method of claim 1, wherein the cross-entropy loss function L1, center vector loss function L2 and contrast loss function L3 in (2c) are respectively:

L1 = −[y log y′ + (1 − y) log(1 − y′)]

L2 = Σ_{i=1}^{N} ‖z^0_{L,i} − c_{y_i}‖²_2

L3 = (1/N²) Σ_{i=1}^{N} Σ_{j=1}^{N} [ 1(y_i = y_j)(1 − sim(z_i, z_j)) + 1(y_i ≠ y_j) max(sim(z_i, z_j) − α, 0) ]

where y and y′ respectively represent the real image label and the image label predicted by the network, N is the batch size, z_i and z_j denote the features of the i-th and j-th pictures in the batch, α is the loss boundary, y_i and y_j denote the classes of the i-th and j-th pictures, sim(z_i, z_j) denotes the cosine similarity between z_i and z_j, c_{y_i} denotes the feature center vector of class y_i, and z^0_{L,i} denotes the vector output by the 0-th image slice of the i-th picture at the L-th layer of the self-attention neural network.
6. The method of claim 1, wherein training the fine-grained classification model in (3) with a gradient descent method on the training sample set is realized as follows:
(3a) initialize the parameters of the segmentation-mask generation network, the self-attention neural network and the loss network:
use weights pre-trained on the ImageNet image dataset as the initial weights of the segmentation-mask generation network;
use weights pre-trained on the ImageNet image dataset as the initial weights of the self-attention neural network;
initialize the loss network randomly.
Set the maximum number of iterations of the fine-grained classification model to 10000, the initial learning rate to 0.03, the learning-rate decay strategy to fixed-step decay with a factor of 0.1 every 3000 steps, the optimizer to SGD, and the momentum to 0.9;
(3b) sequentially apply cropping, random horizontal flipping, random erasing and normalization to the pictures of the training sample set, then feed them into the fine-grained classification model;
(3c) set the weights α and γ of the cross-entropy and contrast loss functions to 1 and the weight β of the center vector loss function to 0.1; compute the cross-entropy loss L1 from the prediction y′ output by the fine-grained classification model and the real label y; compute the center vector loss L2 from the output categories y_i of the same batch and the class center vectors c_{y_i}; compute the contrast loss L3 from the cosine similarities between image features of different classes and of the same class;
(3d) substitute the results of step (3c) into the total loss function L of the fine-grained classification model and iteratively update the model parameters with a gradient descent algorithm;
(3e) record the accuracy and loss value of each training iteration, validate once every 100 iterations, and save the model parameters that perform best on validation; when the maximum of 10000 iterations is reached, the trained fine-grained classification model is obtained.
CN202111480727.0A 2021-12-06 2021-12-06 Fine-grained image classification method based on segmentation mask and self-attention neural network Pending CN114119979A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111480727.0A CN114119979A (en) 2021-12-06 2021-12-06 Fine-grained image classification method based on segmentation mask and self-attention neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111480727.0A CN114119979A (en) 2021-12-06 2021-12-06 Fine-grained image classification method based on segmentation mask and self-attention neural network

Publications (1)

Publication Number Publication Date
CN114119979A true CN114119979A (en) 2022-03-01

Family

ID=80366994

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111480727.0A Pending CN114119979A (en) 2021-12-06 2021-12-06 Fine-grained image classification method based on segmentation mask and self-attention neural network

Country Status (1)

Country Link
CN (1) CN114119979A (en)


Cited By (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114332544A (en) * 2022-03-14 2022-04-12 之江实验室 Image block scoring-based fine-grained image classification method and device
CN114332544B (en) * 2022-03-14 2022-06-07 之江实验室 Image block scoring-based fine-grained image classification method and device
JP7373624B2 (en) 2022-03-14 2023-11-02 之江実験室 Method and apparatus for fine-grained image classification based on scores of image blocks
CN115100509A (en) * 2022-07-15 2022-09-23 山东建筑大学 Image identification method and system based on multi-branch block-level attention enhancement network
CN115035389A (en) * 2022-08-10 2022-09-09 华东交通大学 Fine-grained image identification method and device based on reliability evaluation and iterative learning
CN115035389B (en) * 2022-08-10 2022-10-25 华东交通大学 Fine-grained image identification method and device based on reliability evaluation and iterative learning
CN115294400B (en) * 2022-08-23 2023-03-31 北京医准智能科技有限公司 Training method and device for image classification model, electronic equipment and storage medium
CN115187819A (en) * 2022-08-23 2022-10-14 北京医准智能科技有限公司 Training method and device for image classification model, electronic equipment and storage medium
CN115294400A (en) * 2022-08-23 2022-11-04 北京医准智能科技有限公司 Training method and device for image classification model, electronic equipment and storage medium
CN115471724A (en) * 2022-11-02 2022-12-13 青岛杰瑞工控技术有限公司 Fine-grained fish epidemic disease identification fusion algorithm based on self-adaptive normalization
CN115794357A (en) * 2023-01-16 2023-03-14 山西清众科技股份有限公司 Device and method for automatically building multi-task network
CN115830402B (en) * 2023-02-21 2023-09-12 华东交通大学 Fine-granularity image recognition classification model training method, device and equipment
CN115830402A (en) * 2023-02-21 2023-03-21 华东交通大学 Fine-grained image recognition classification model training method, device and equipment
CN116109629B (en) * 2023-04-10 2023-07-25 厦门微图软件科技有限公司 Defect classification method based on fine granularity recognition and attention mechanism
CN116109629A (en) * 2023-04-10 2023-05-12 厦门微图软件科技有限公司 Defect classification method based on fine granularity recognition and attention mechanism
CN117593557A (en) * 2023-09-27 2024-02-23 北京邮电大学 Fine-grained biological image classification method based on transducer model
CN117593215A (en) * 2024-01-19 2024-02-23 哈尔滨工业大学(深圳)(哈尔滨工业大学深圳科技创新研究院) Large-scale vision pre-training method and system for generating model enhancement
CN117593215B (en) * 2024-01-19 2024-03-29 哈尔滨工业大学(深圳)(哈尔滨工业大学深圳科技创新研究院) Large-scale vision pre-training method and system for generating model enhancement
CN117853875A (en) * 2024-03-04 2024-04-09 华东交通大学 Fine-granularity image recognition method and system
CN117853875B (en) * 2024-03-04 2024-05-14 华东交通大学 Fine-granularity image recognition method and system

Similar Documents

Publication Publication Date Title
CN114119979A (en) Fine-grained image classification method based on segmentation mask and self-attention neural network
US10713563B2 (en) Object recognition using a convolutional neural network trained by principal component analysis and repeated spectral clustering
Thoma Analysis and optimization of convolutional neural network architectures
CN107066559B (en) Three-dimensional model retrieval method based on deep learning
Van Der Maaten Accelerating t-SNE using tree-based algorithms
Cheng et al. Image recognition technology based on deep learning
CN110929029A (en) Text classification method and system based on graph convolution neural network
CN109063666A (en) The lightweight face identification method and system of convolution are separated based on depth
CN113572742B (en) Network intrusion detection method based on deep learning
JP2018513507A (en) Relevance score assignment for artificial neural networks
CN109740686A (en) A kind of deep learning image multiple labeling classification method based on pool area and Fusion Features
CN105184298A (en) Image classification method through fast and locality-constrained low-rank coding process
CN114332544B (en) Image block scoring-based fine-grained image classification method and device
Jha et al. Extracting low‐dimensional psychological representations from convolutional neural networks
CN112418261B (en) Human body image multi-attribute classification method based on prior prototype attention mechanism
CN112766283A (en) Two-phase flow pattern identification method based on multi-scale convolution network
CN116363439A (en) Point cloud classification method, device and equipment based on multi-head self-attention
CN116310563A (en) Noble metal inventory management method and system
CN112749737A (en) Image classification method and device, electronic equipment and storage medium
Kuhn et al. Brcars: a dataset for fine-grained classification of car images
US20230154185A1 (en) Multi-source panoptic feature pyramid network
Garcia et al. A methodology for neural network architectural tuning using activation occurrence maps
CN111652079B (en) Expression recognition method and system applied to mobile crowd and storage medium
CN114170460A (en) Multi-mode fusion-based artwork classification method and system
CN110597983B (en) Hierarchical text classification calculation method based on category embedding

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination