CN114119979A - Fine-grained image classification method based on segmentation mask and self-attention neural network - Google Patents

Fine-grained image classification method based on segmentation mask and self-attention neural network

Info

Publication number
CN114119979A
Authority
CN
China
Prior art keywords
layer
fine
node
image
grained
Prior art date
Legal status
Pending
Application number
CN202111480727.0A
Other languages
Chinese (zh)
Inventor
牛毅 (Niu Yi)
张玉婷 (Zhang Yuting)
马明明 (Ma Mingming)
李甫 (Li Fu)
张犁 (Zhang Li)
Current Assignee
Xidian University
Original Assignee
Xidian University
Priority date
Filing date
Publication date
Application filed by Xidian University filed Critical Xidian University
Priority to CN202111480727.0A priority Critical patent/CN114119979A/en
Publication of CN114119979A publication Critical patent/CN114119979A/en
Pending legal-status Critical Current


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/70 Determining position or orientation of objects or cameras
    • G06T 7/73 Determining position or orientation of objects or cameras using feature-based methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a fine-grained image classification method based on a segmentation mask and a self-attention neural network. It mainly addresses the problems of existing fine-grained classification methods: complicated training steps, high training difficulty, and failure to accurately locate and classify objects when the image background noise is too strong. The scheme is as follows: download the pre-divided training and test sample sets from a public fine-grained image dataset and obtain the class label of each image; construct a fine-grained classification model by cascading a loss network, a segmentation-mask generation network and a self-attention neural network; train the classification model on the training sample set with a gradient descent method; and input the test sample set into the trained fine-grained classification model to obtain the classification results of the fine-grained images. The method automatically generates a mask that enhances the image, locates the image foreground and weakens background noise; its training strategy is simple, and it further improves classification accuracy. The method can be used in intelligent security and unmanned retail business activities.

Description

Fine-grained image classification method based on segmentation mask and self-attention neural network
Technical Field
The invention belongs to the technical field of computer vision, and particularly relates to a fine-grained image classification method which can be used for intelligent security and unmanned retail business activities.
Background
With the continuous development of deep learning in recent years, ever more mature convolutional neural network models have been applied in computer vision, for example in autonomous driving, face recognition, medical image analysis and target tracking. Image classification is an important research topic in this field. Traditional image classification distinguishes coarse categories of objects in an image, such as cats versus dogs, so the labels are coarse-grained; fine-grained image classification instead identifies the sub-categories within a coarse category, so the classification labels are finer, e.g. the species of a bird, the model of a vehicle or the variety of a flower. Because the granularity is fine and all images belong to the same coarse category, the task is more challenging than traditional image classification. Since fine-grained classification is characterized by small differences between sub-classes and large differences within each sub-class, it centers on extracting detailed image features. For example, birds of different sub-classes look similar in the same pose and static state, while features extracted from birds of the same sub-class differ markedly under different poses, illumination and occlusion; features of parts such as the beak, head and wings therefore need to be extracted with particular emphasis.
Fine-grained image classification has extensive research demands and application scenarios in both academia and industry. Related research mainly involves vehicles, flowers, birds, aircraft and similar domains, and wide application scenarios also exist in daily life. For example, in security scenarios, road monitoring needs to identify the model and model year of passing vehicles; in emerging unmanned retail scenarios, goods must be identified accurately to meet consumers' shopping needs. Designing an accurate and efficient fine-grained image classification model is therefore of great significance.
In traditional computer vision, feature extraction for fine-grained images follows a strongly supervised approach with manual annotation: besides the classification labels of the images, bounding boxes are used to detect foreground objects and to localize local regions for part-feature extraction. Although this can effectively improve classification accuracy, it consumes a large amount of manual annotation effort; the manually annotated positions are not necessarily the regions with the most salient detail features, the accuracy depends heavily on the annotator, and datasets with many samples and many categories cannot be recognized quickly and effectively, so the practicality is low.
In recent years, with the development and application of deep learning, weakly supervised fine-grained recognition models have gradually become mainstream, mostly using convolutional neural networks for feature extraction. These deep-learning models can be roughly divided into two categories: part-localization methods and feature-encoding methods. Part-localization methods locate discriminative detail regions of the image through an attention module, use them as local information and fuse them with the global image information to output a more accurate classification result. The fine-grained image classification method based on feature fusion disclosed in patent application CN202110179265.2 and the fine-grained image classification method based on a multi-layer focused attention network disclosed in patent application CN202011588241.4 both belong to this category. Although such methods can automatically locate salient image regions and fuse global and detail features, they suffer from high training difficulty, high training complexity, and a cumbersome multi-stage feature extraction process. Feature-encoding models obtain bilinear features of an image by computing the outer product of features at different spatial positions followed by average pooling. For example, the high-order bilinear feature network (Bilinear-CNN, B-CNN) proposed by Lin et al. achieves high classification accuracy but cannot capture the nonlinear relationships between feature channels.
In recent years, the self-attention neural network, the Transformer, has made a significant breakthrough in the vision field. The original Transformer is a highly effective model for natural language processing and is mainly applied to tasks such as machine translation, sentiment analysis and information extraction. Dosovitskiy et al. first proposed using a self-attention neural network in computer vision, the Vision Transformer (ViT): the network divides the image into fixed-size image slices and extracts features with the self-attention network. Although the Transformer's feature extraction capability is strong, its image division scheme is too rigid; when the background proportion in an image is too large, it cannot effectively locate the target object and the feature extraction effect is poor.
Disclosure of Invention
The invention aims to provide a fine-grained image classification method based on a segmentation mask and a self-attention neural network that overcomes the above defects of the prior art: it simplifies the network training process and, while preserving fine-grained classification accuracy, effectively locates target objects in images with a large background proportion and noisy backgrounds, thereby improving the feature extraction effect.
The technical idea for realizing this aim is as follows: extract the contour features and approximate position information of the original image through a segmentation network to generate a segmentation mask, so that the target object can be located effectively even when the background proportion is large and the background noise is heavy; fuse the segmentation mask with the original image point-to-point and feed the fused image into a self-attention neural network to extract the global and local detail features of the image; and output the classification result directly from the last classification head of the self-attention neural network, which simplifies the training process of the network.
According to the above concept, the implementation scheme of the invention comprises the following steps:
(1) downloading a divided training sample set and a divided testing sample set from a public data set of the fine-grained images to obtain class labels corresponding to the images;
(2) constructing a fine-grained classification model:
(2a) establishing a loss network formed by sequentially cascading an input layer, a regularization layer and a central vector updating layer;
(2b) selecting a segmentation mask to generate a network and a self-attention neural network, and cascading the network and a loss network to form a fine-grained classification model;
(2c) setting the total loss function L of the fine-grained classification model as the sum of a cross-entropy loss function L1, a center vector loss function L2 and a contrast loss function L3, expressed as follows:
L = αL1(y, y′) + βL2 + γL3
where α, β and γ are the weights of L1, L2 and L3 respectively, and y and y′ respectively represent the real image label and the image label predicted by the network;
(3) training the classification model on the training sample set with a gradient descent method to obtain a trained fine-grained classification model;
(4) and inputting the test sample set into a trained fine-grained classification model to obtain a classification result of the fine-grained image.
Compared with the prior art, the invention has the following advantages:
1. The fine-grained classification model constructed by the invention automatically generates a mask for enhancing the image, locates the image foreground and weakens background noise, so that the network can find more discriminative global and detail features.
2. When classifying fine-grained images, the classification model requires no multi-stage complex training; it is optimized directly through the defined total loss. The whole process is optimized end to end, the training strategy is simple, and the classification accuracy is further improved.
Drawings
FIG. 1 is a flow chart of an implementation of the present invention;
FIG. 2 is a schematic structural diagram of the fine-grained image classification model constructed in the present invention.
Detailed Description
The present invention is described in further detail below with reference to the attached drawings and specific examples.
Referring to FIG. 1, the implementation steps of the invention are as follows:
Step 1, download the pre-divided training sample set and testing sample set from a public fine-grained image dataset, and obtain the category label of each image.
This example uses, but is not limited to, public datasets commonly used in fine-grained classification, including the bird dataset Caltech-UCSD Birds-200-2011 released by the California Institute of Technology in 2010, the vehicle dataset Stanford Cars released by the Stanford University artificial intelligence laboratory, and the dog dataset Stanford Dogs. These datasets have already been partitioned into training and test sample sets, and class labels are provided for the samples, where:
the CUB-200-2011 data set has 11788 bird images, which comprise 200 bird subclasses, the training data set has 5994 images, the test set has 5794 images, each image provides image category information, a bounding box of the bird in the image, key part information of the bird and attribute information of the bird. The experiment of the embodiment only needs the category information of the birds, and does not need other supervision information;
the Stanford Cars data set comprises 16185 images of vehicles, including 196 types of automobiles, the training data set comprises 8144 images, the testing data set comprises 8041 images, the number of each type is equivalent, the number of the images contained in each type is consistent, and the images are classified mainly according to the brand, the model and the year of the automobile;
the Stanford Dogs dataset had a total of 20580 dog images including 120 different dog species worldwide, a training set of 12000 images, and a test set of 8580 images, each of which provided class label and bounding box information, and as such, only the class information for the images is required for this example.
Step 2, construct the fine-grained classification model.
Referring to fig. 2, the specific implementation of this step is as follows:
(2.1) establishing a loss network which is formed by sequentially cascading an input layer, a regularization layer and a center vector updating layer:
the input layer is used for inputting the center feature vectors ci; it is a learnable matrix initialized to all zeros, whose rows correspond to the number of categories of the fine-grained public dataset and whose columns equal the product of the number of self-attention heads and the hidden dimension. When the training set is CUB-200-2011 the dimension of the center matrix is 200 × 9216; the dimension is adjusted for different datasets or model sizes;
the normalization layer is used for updating the central characteristic vector in a normalization mode of 2 norm so as to enable the central characteristic vector to meet normal distribution;
the center vector updating layer is used for updating the center feature vectors, with an update weight of 0.05, as sketched below;
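To make the three layers concrete, here is a minimal PyTorch sketch. The moving-average form of the update is an assumption (the text only gives the update weight 0.05), and all names are illustrative; 200 × 9216 matches CUB-200-2011 with 12 heads × 768 hidden dimensions.

```python
import torch
import torch.nn.functional as F

class CenterLossNetwork(torch.nn.Module):
    """Sketch of the loss network: a learnable all-zero center matrix
    (input layer), 2-norm normalization (regularization layer) and a
    moving-average update with weight 0.05 (center vector updating layer)."""
    def __init__(self, num_classes=200, feat_dim=9216, update_weight=0.05):
        super().__init__()
        # input layer: one center feature vector per class, initialized to 0
        self.centers = torch.nn.Parameter(torch.zeros(num_classes, feat_dim))
        self.update_weight = update_weight

    @torch.no_grad()
    def update(self, feats, labels):
        # regularization layer: 2-norm normalization of the centers
        centers = F.normalize(self.centers, p=2, dim=1)
        # updating layer: move each class center toward the batch mean of
        # that class with weight 0.05 (moving-average form is an assumption)
        for c in labels.unique():
            batch_mean = feats[labels == c].mean(dim=0)
            centers[c] = (1 - self.update_weight) * centers[c] \
                         + self.update_weight * batch_mean
        self.centers.copy_(centers)
```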
(2.2) selecting a segmentation-mask generation network for outputting a mask with the same size as the original image, comprising a convolutional layer, 6 hierarchical deep aggregation layers, a node fusion layer and a deconvolution layer, wherein:
the 6 hierarchical deep aggregation layers are used for down-sampling the image. Each hierarchical deep aggregation layer is a tree with a similar structure, with tree depths of 1, 2 and 1; each tree comprises a root node and 2-8 basic blocks, and each basic block comprises two convolution layers, a regularization layer and a ReLU activation layer. This corresponds to the first column of the left half of FIG. 2, where the numbers in the boxes denote the down-sampling factor; passing through the 6 hierarchical deep aggregation layers, the image successively yields features down-sampled by 4, 8, 16 and 32 times.
the node fusion layer is used for up-sampling and fusing the nodes output by the hierarchical deep aggregation layers and comprises a deep aggregation up-sampling operation and an iterative deep aggregation up-sampling operation. Both operations complete the up-sampling with two deconvolution layers, as shown by the dashed arrows in FIG. 2. Up-sampling and then fusing features of different sizes from different stages enhances the representation capability of the model (see the sketch after Table 1 below).
the deconvolution layer is used for up-sampling the output of the node fusion layer; the resolution of the final output feature map is 448 × 448, and after a sigmoid activation the feature map yields a mask with the same size as the original image.
The output dimensions of the above layers are shown in Table 1 ("output dimensionality of the layers of the network"), which is rendered as an image in the source document and is not reproduced here.
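As an illustration of one node fusion step, the following is a hedged PyTorch sketch of a single aggregation node. The text only states that two deconvolution layers complete the up-sampling; the fusion by concatenation plus a 1×1 convolution, and all channel sizes and names, are assumptions.

```python
import torch
import torch.nn as nn

class UpsampleFuseNode(nn.Module):
    """Hypothetical aggregation node: up-sample the coarser feature map
    with two deconvolution layers, then fuse it with the finer map."""
    def __init__(self, coarse_ch, fine_ch, out_ch):
        super().__init__()
        self.up = nn.Sequential(
            # first deconvolution doubles the spatial resolution
            nn.ConvTranspose2d(coarse_ch, out_ch, kernel_size=4,
                               stride=2, padding=1),
            nn.ReLU(inplace=True),
            # second deconvolution refines at the same resolution
            nn.ConvTranspose2d(out_ch, out_ch, kernel_size=3,
                               stride=1, padding=1),
        )
        self.fuse = nn.Sequential(
            nn.Conv2d(fine_ch + out_ch, out_ch, kernel_size=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, coarse, fine):
        up = self.up(coarse)                         # 2x up-sampling
        return self.fuse(torch.cat([up, fine], 1))   # fuse with finer map
```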
(2.3) selecting a self-attention neural network for extracting the global and local features of the image. It comprises an image slice embedding layer, a position coding layer, L self-attention feedforward coding layers and two fully connected layers, wherein:
the image slice embedding layer is used for dividing the image x after the mask enhancement into image slices with the size of D multiplied by D,
Figure BDA0003395163580000061
h, W, C indicate the length, width and channel number of the image, so the number of the divided image slices N is N HW/D2
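The following PyTorch sketch makes the mask enhancement and slicing concrete. It assumes that the "point-to-point fusion" of the mask with the original image is an element-wise multiplication of the sigmoid-activated mask with the image (the text does not spell out the fusion operator) and that slices do not overlap; function names are illustrative. With H = W = 448 and D = 14 as used later in this example, N = 448·448/14² = 1024 slices per image.

```python
import torch

def mask_enhance(image: torch.Tensor, mask_logits: torch.Tensor) -> torch.Tensor:
    """Point-to-point fusion of the segmentation mask with the original
    image, assumed here to be element-wise multiplication.
    image: (B, C, H, W); mask_logits: (B, 1, H, W) deconvolution output."""
    mask = torch.sigmoid(mask_logits)   # mask in [0, 1], same size as image
    return image * mask                 # suppress background pixels

def to_image_slices(x: torch.Tensor, D: int = 14) -> torch.Tensor:
    """Divide the mask-enhanced image into non-overlapping D x D slices;
    the slice count is N = H*W / D**2.
    x: (B, C, H, W) -> (B, N, C*D*D)"""
    B, C, H, W = x.shape
    s = x.unfold(2, D, D).unfold(3, D, D)   # (B, C, H/D, W/D, D, D)
    s = s.permute(0, 2, 3, 1, 4, 5)          # (B, H/D, W/D, C, D, D)
    return s.reshape(B, (H // D) * (W // D), C * D * D)
```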
the position coding layer is used for encoding the position information of the image slices; this example uses standard learnable one-dimensional position codes;
the L self-attention feedforward coding layers are used for extracting the features of the image. Each self-attention feedforward coding layer comprises a multi-head attention module MSA, a feedforward connection module MLP and two layer-normalization operations, where the number of heads of the multi-head attention module is K and the feedforward connection module comprises two fully connected layers.
The output Z′_p of the multi-head attention module and the output Z_p of the feedforward connection module in the p-th self-attention feedforward coding layer are respectively:

Z′_p = MSA(LN(Z_{p-1})) + Z_{p-1}
Z_p = MLP(LN(Z′_p)) + Z′_p

where LN denotes layer normalization, Z_{p-1} denotes the output of the (p−1)-th self-attention feedforward coding layer, p = 1, …, L, and L is the total number of self-attention feedforward coding layers.
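These two equations map directly onto a pre-norm Transformer block. A minimal PyTorch sketch follows, assuming the ViT-B sizes implied elsewhere in the text (hidden size 768, K = 12 heads, MLP size 3072); the class name is illustrative.

```python
import torch.nn as nn

class SelfAttentionFeedforwardLayer(nn.Module):
    """One self-attention feedforward coding layer implementing
    Z'_p = MSA(LN(Z_{p-1})) + Z_{p-1} and Z_p = MLP(LN(Z'_p)) + Z'_p."""
    def __init__(self, dim=768, heads=12, mlp_dim=3072):
        super().__init__()
        self.ln1 = nn.LayerNorm(dim)
        self.msa = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ln2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, mlp_dim), nn.GELU(), nn.Linear(mlp_dim, dim))

    def forward(self, z):
        h = self.ln1(z)
        z = self.msa(h, h, h, need_weights=False)[0] + z   # Z'_p
        return self.mlp(self.ln2(z)) + z                    # Z_p
```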
Since the input of the L-th self-attention feedforward coding layer consists of the K most discriminative image slices found in the first L−1 layers, these slices are selected using the attention maps generated by the self-attention feedforward coding layers, implemented as follows:

first, the output of the (L−1)-th self-attention feedforward coding layer is written as

Z_{L-1} = [z^0_{L-1}; z^1_{L-1}; …; z^N_{L-1}]

where Z_{L-1} is the output of the (L−1)-th self-attention feedforward coding layer, z^0_{L-1} is the classification head of the (L−1)-th layer, N is the number of divided image slices, and z^U_{L-1} denotes the U-th image slice output by the (L−1)-th layer, with U ranging from 1 to N;

similarly, the list A_p of attention maps output by the K heads of the multi-head attention module in the p-th self-attention feedforward coding layer is written as

A_p = [A^1_p, A^2_p, …, A^K_p]

where A^W_p denotes the attention map output by the W-th head of the multi-head attention module in the p-th layer, with W ranging from 1 to K;

then, the attention weights of all heads over the first L−1 layers are fused to obtain the final weights a_final:

a_final = ∏_{p=1}^{L-1} A_p

finally, for each head only the image slice corresponding to the position of the maximum response value in a_final is chosen, yielding the input Z_local of the L-th self-attention feedforward coding layer:

Z_local = [z^0_{L-1}; z^{a_1}_{L-1}; z^{a_2}_{L-1}; …; z^{a_K}_{L-1}]

where a_W denotes the index of the maximum response of the W-th head of the multi-head attention module in a_final, i.e. the most discriminative image slice selected from the W-th head over the first L−1 self-attention feedforward coding layers, with W ranging from 1 to K. In this example K = 12, L = 12 and D = 14; a sketch of this selection step is given below.
the fully connected layers are used for obtaining the final classification prediction y′.
and 2.4) sequentially cascading the segmentation mask generation network, the self-attention neural network and the loss network to form a fine-grained classification model, as shown in FIG. 2.
(2.5) setting the total loss function L of the fine-grained classification model as the sum of a cross-entropy loss function L1, a center vector loss function L2 and a contrast loss function L3, expressed as follows:

L = αL1(y, y′) + βL2 + γL3

where α, β and γ are the weights of L1, L2 and L3 respectively, and y and y′ respectively represent the real image label and the image label predicted by the network.
The cross-entropy loss L1, center vector loss L2 and contrast loss L3 are respectively:

L1 = −[y log y′ + (1 − y) log(1 − y′)]

L2 = Σ_{i=1}^{N} ‖z^0_{L,i} − c_{y_i}‖²_2

L3 = (1/N²) Σ_{i=1}^{N} Σ_{j=1}^{N} [ 1(y_i = y_j)(1 − sim(z_i, z_j)) + 1(y_i ≠ y_j) max(sim(z_i, z_j) − α, 0) ]

where N is the batch size, z_i and z_j denote the features of the i-th and j-th pictures in the batch, α is the loss boundary, y_i and y_j denote the classes of the i-th and j-th pictures, sim(z_i, z_j) denotes the cosine similarity between z_i and z_j, c_{y_i} denotes the feature center vector of class y_i, and z^0_{L,i} denotes the vector output by the 0-th image slice (classification head) of the i-th picture at the L-th layer of the self-attention neural network.
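A minimal PyTorch sketch of the total loss under the forms above follows. The margin value 0.4 (standing in for the loss boundary, which the text does not give) and the batched implementation details are assumptions, and all names are illustrative.

```python
import torch
import torch.nn.functional as F

def total_loss(logits, y, z, centers,
               alpha=1.0, beta=0.1, gamma=1.0, margin=0.4):
    """Sketch of L = alpha*L1 + beta*L2 + gamma*L3.
    logits: (B, num_classes); y: (B,) labels;
    z: (B, dim) class-token features; centers: (num_classes, dim)."""
    l1 = F.cross_entropy(logits, y)                      # cross-entropy loss
    l2 = (z - centers[y]).pow(2).sum(dim=1).mean()       # center vector loss
    # pairwise cosine similarities within the batch
    sim = F.cosine_similarity(z.unsqueeze(1), z.unsqueeze(0), dim=-1)
    same = (y.unsqueeze(0) == y.unsqueeze(1)).float()
    l3 = (same * (1 - sim)                               # pull same class
          + (1 - same) * (sim - margin).clamp(min=0)     # push others apart
          ).mean()
    return alpha * l1 + beta * l2 + gamma * l3
```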
Step 3, train the classification model on the training sample set with a gradient descent method to obtain the trained fine-grained classification model:
(3a) initialize the parameters of the segmentation-mask generation network, the self-attention neural network and the loss network:
use weights pre-trained on the ImageNet image dataset as the initial weights of the segmentation-mask generation network;
use weights pre-trained on the ImageNet image dataset as the initial weights of the self-attention neural network;
initialize the loss network randomly.
Set the maximum number of iterations of the fine-grained classification model to 10000, the initial learning rate to 0.03, the learning-rate decay strategy to fixed-step decay with a factor of 0.1 every 3000 steps, the optimizer to SGD, and the momentum to 0.9;
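These settings translate directly into standard PyTorch objects; the placeholder `model` below stands in for the fine-grained classification model and is an illustrative assumption.

```python
import torch

# SGD with momentum 0.9, initial learning rate 0.03,
# fixed-step decay by 0.1 every 3000 steps, 10000 iterations in total.
model = torch.nn.Linear(9216, 200)  # placeholder for the classification model
optimizer = torch.optim.SGD(model.parameters(), lr=0.03, momentum=0.9)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=3000, gamma=0.1)
max_iterations = 10000
```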
(3b) sequentially apply cropping, random horizontal flipping, random erasing and normalization to the pictures of the training sample set, then feed them into the fine-grained classification model;
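The preprocessing in (3b) corresponds to standard torchvision transforms. In the sketch below, the resize/crop sizes and the ImageNet normalization statistics are assumptions; the text names the operations but not their parameters.

```python
from torchvision import transforms

train_transform = transforms.Compose([
    transforms.Resize(512),                 # assumed pre-crop size
    transforms.RandomCrop(448),             # cropping (448 matches the mask size)
    transforms.RandomHorizontalFlip(),      # random horizontal flipping
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],   # assumed ImageNet stats
                         std=[0.229, 0.224, 0.225]),
    transforms.RandomErasing(),             # random erasing (on the tensor)
])
```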
(3c) set the weights α and γ of the cross-entropy and contrast loss functions to 1 and the weight β of the center vector loss function to 0.1; compute the cross-entropy loss L1 from the prediction y′ output by the fine-grained classification model and the real label y; compute the center vector loss L2 from the output categories y_i of the same batch and the class center vectors c_{y_i}; compute the contrast loss L3 from the cosine similarities between image features of different classes and of the same class;
(3d) substitute the results of step (3c) into the total loss function L of the fine-grained classification model and iteratively update the model parameters with a gradient descent algorithm;
(3e) record the accuracy and loss value of each training iteration, validate once every 100 iterations, and save the model parameters that perform best on validation. When the maximum of 10000 iterations is reached, the trained fine-grained classification model is obtained.
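The loop below is a hedged sketch of (3c)-(3e). `train_loader`, `val_loader`, `evaluate`, `centers`, and a `model` that returns both logits and class-token features are illustrative assumptions, not part of the original disclosure; `total_loss`, `optimizer`, `scheduler` and `max_iterations` refer to the sketches above.

```python
import itertools
import torch

# Train for at most `max_iterations` steps, validating every 100 iterations
# and keeping the weights that perform best on validation.
best_acc = 0.0
for step, (images, labels) in enumerate(itertools.cycle(train_loader)):
    if step >= max_iterations:
        break
    logits, feats = model(images)            # assumed model interface
    loss = total_loss(logits, labels, feats, centers)
    optimizer.zero_grad()
    loss.backward()                           # gradient descent update
    optimizer.step()
    scheduler.step()                          # fixed-step learning-rate decay
    if (step + 1) % 100 == 0:                 # validate every 100 iterations
        acc = evaluate(model, val_loader)     # assumed evaluation helper
        if acc > best_acc:                    # keep best validation weights
            best_acc = acc
            torch.save(model.state_dict(), "best_model.pt")
```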
Step 4, input the test sample set into the trained fine-grained classification model to obtain the classification results of the fine-grained images.
The effects of the present invention can be further illustrated by the following comparative data:
The 5794 test samples of the CUB-200-2011 dataset are used to verify the effectiveness of the method and compare it with 7 methods: DBTNet, S3N, FDL, API-Net, StackedLSTM, ViT and TransFG. The experimental results are shown in Table 2.
TABLE 2 Classification results of different methods on the CUB-200-2011 dataset

Method          Recognition accuracy
DBTNet          88.1%
S3N             88.5%
FDL             89.1%
API-Net         90.0%
StackedLSTM     90.4%
ViT             90.3%
TransFG         91.4%
The invention   91.7%
As can be seen from Table 2, the recognition accuracy of the invention is higher than that of all the mainstream methods listed, giving it a clear advantage over the other methods.
The above description is only a specific example of the present invention and does not constitute any limitation of it. After understanding the content and principle of the invention, those skilled in the art may make various modifications and changes in form and detail, such as replacing the segmentation-mask generation network or the self-attention neural network, without departing from its principle and structure; such modifications and changes based on the inventive concept still fall within the scope of the appended claims.

Claims (6)

1. A fine-grained image classification method based on a segmentation mask and a self-attention neural network is characterized by comprising the following steps:
(1) downloading a divided training sample set and a divided testing sample set from a public fine-grained image data set to obtain a category label corresponding to each image;
(2) constructing a fine-grained classification model:
(2a) establishing a loss network formed by sequentially cascading an input layer, a regularization layer and a central vector updating layer;
(2b) selecting a segmentation mask to generate a network and a self-attention neural network, and cascading the network and a loss network to form a fine-grained classification model;
(2c) setting the total loss function L of the fine-grained classification model as the sum of a cross-entropy loss function L1, a center vector loss function L2 and a contrast loss function L3, expressed as follows:
L = αL1(y, y′) + βL2 + γL3
where α, β and γ are the weights of L1, L2 and L3 respectively, and y and y′ respectively represent the real image label and the image label predicted by the network;
(3) training the classification model on the training sample set with a gradient descent method to obtain a trained fine-grained classification model;
(4) and inputting the test sample set into a trained fine-grained classification model to obtain a classification result of the fine-grained image.
2. The method of claim 1, wherein the loss network established in (2a) has the following layers:
the input layer is used for inputting the center feature vectors as a learnable matrix of dimension 200 × 9216 initialized to all zeros, whose rows correspond to the number of categories of the fine-grained public dataset and whose columns equal the product of the number of self-attention heads and the hidden dimension, the dimension being adjusted for different datasets or model sizes;
the regularization layer is used for normalizing the center feature vectors with the 2-norm so that they satisfy a normal distribution;
the center vector updating layer is used for updating the center feature vectors with an update weight of 0.05.
3. The method according to claim 1, wherein the segmentation-mask generation network selected in (2b) outputs a mask with the same size as the original image and comprises a convolutional layer, 6 hierarchical deep aggregation layers, a node fusion layer and a deconvolution layer, wherein:
the 6 hierarchical deep aggregation layers are used for down-sampling; each is a tree with a similar structure, with tree depths of 1, 2 and 1, and each tree comprises a root node and 2-8 basic blocks, each basic block comprising two convolution layers, a regularization layer and a ReLU activation layer;
the node fusion layer is used for up-sampling and fusing nodes and comprises a deep aggregation up-sampling operation and an iterative deep aggregation up-sampling operation;
the deep aggregation up-sampling operation fuses the output of the third deep aggregation layer with the output of the fourth deep aggregation layer to generate a first node O1, fuses the output of the fourth deep aggregation layer with the output of the fifth deep aggregation layer to generate a second node O2, fuses the output of the fifth deep aggregation layer with the output of the sixth deep aggregation layer to generate a third node O3, fuses the first node O1 with the second node O2 to generate a fourth node O4, fuses the second node O2 with the third node O3 to generate a fifth node O5, and fuses the fourth node O4 with the fifth node O5 to generate a sixth node O6;
the iterative deep aggregation up-sampling operation fuses the fifth node O5 with the sixth node O6 to generate a seventh node O7, fuses the seventh node O7 with the third node O3 to generate an eighth node O8, and fuses the eighth node O8 with the output of the third deep aggregation layer to generate a ninth node O9.
4. The method as claimed in claim 1, wherein the self-attention neural network in (2b) consists of an image slice embedding layer, a position encoding layer, 12 self-attention feedforward encoding layers and two fully connected layers connected in sequence; each self-attention feedforward encoding layer comprises a multi-head attention module MSA and a feedforward connection module MLP, the number of heads of the multi-head attention module is 12, and the dimension of the feedforward connection module is 3072.
5. The method of claim 1, wherein the cross-entropy loss function L1, center vector loss function L2 and contrast loss function L3 in (2c) are respectively:

L1 = −[y log y′ + (1 − y) log(1 − y′)]

L2 = Σ_{i=1}^{N} ‖z^0_{L,i} − c_{y_i}‖²_2

L3 = (1/N²) Σ_{i=1}^{N} Σ_{j=1}^{N} [ 1(y_i = y_j)(1 − sim(z_i, z_j)) + 1(y_i ≠ y_j) max(sim(z_i, z_j) − α, 0) ]

where y and y′ respectively represent the real image label and the image label predicted by the network, N is the batch size, z_i and z_j denote the features of the i-th and j-th pictures in the batch, α is the loss boundary, y_i and y_j denote the classes of the i-th and j-th pictures, sim(z_i, z_j) denotes the cosine similarity between z_i and z_j, c_{y_i} denotes the feature center vector of class y_i, and z^0_{L,i} denotes the vector output by the 0-th image slice of the i-th picture at the L-th layer of the self-attention neural network.
6. The method of claim 1, wherein training the fine-grained classification model in (3) with a gradient descent method on the training sample set is realized as follows:
(3a) initialize the parameters of the segmentation-mask generation network, the self-attention neural network and the loss network:
use weights pre-trained on the ImageNet image dataset as the initial weights of the segmentation-mask generation network;
use weights pre-trained on the ImageNet image dataset as the initial weights of the self-attention neural network;
initialize the loss network randomly.
Set the maximum number of iterations of the fine-grained classification model to 10000, the initial learning rate to 0.03, the learning-rate decay strategy to fixed-step decay with a factor of 0.1 every 3000 steps, the optimizer to SGD, and the momentum to 0.9;
(3b) sequentially apply cropping, random horizontal flipping, random erasing and normalization to the pictures of the training sample set, then feed them into the fine-grained classification model;
(3c) set the weights α and γ of the cross-entropy and contrast loss functions to 1 and the weight β of the center vector loss function to 0.1; compute the cross-entropy loss L1 from the prediction y′ output by the fine-grained classification model and the real label y; compute the center vector loss L2 from the output categories y_i of the same batch and the class center vectors c_{y_i}; compute the contrast loss L3 from the cosine similarities between image features of different classes and of the same class;
(3d) substitute the results of step (3c) into the total loss function L of the fine-grained classification model and iteratively update the model parameters with a gradient descent algorithm;
(3e) record the accuracy and loss value of each training iteration, validate once every 100 iterations, and save the model parameters that perform best on validation; when the maximum of 10000 iterations is reached, the trained fine-grained classification model is obtained.
CN202111480727.0A 2021-12-06 2021-12-06 Fine-grained image classification method based on segmentation mask and self-attention neural network Pending CN114119979A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111480727.0A CN114119979A (en) 2021-12-06 2021-12-06 Fine-grained image classification method based on segmentation mask and self-attention neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111480727.0A CN114119979A (en) 2021-12-06 2021-12-06 Fine-grained image classification method based on segmentation mask and self-attention neural network

Publications (1)

Publication Number Publication Date
CN114119979A true CN114119979A (en) 2022-03-01

Family

ID=80366994

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111480727.0A Pending CN114119979A (en) 2021-12-06 2021-12-06 Fine-grained image classification method based on segmentation mask and self-attention neural network

Country Status (1)

Country Link
CN (1) CN114119979A (en)


Cited By (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114332544A (en) * 2022-03-14 2022-04-12 之江实验室 Image block scoring-based fine-grained image classification method and device
CN114332544B (en) * 2022-03-14 2022-06-07 之江实验室 Image block scoring-based fine-grained image classification method and device
JP7373624B2 (en) 2022-03-14 2023-11-02 之江実験室 Method and apparatus for fine-grained image classification based on scores of image blocks
CN115100509A (en) * 2022-07-15 2022-09-23 山东建筑大学 Image identification method and system based on multi-branch block-level attention enhancement network
CN115035389A (en) * 2022-08-10 2022-09-09 华东交通大学 Fine-grained image identification method and device based on reliability evaluation and iterative learning
CN115035389B (en) * 2022-08-10 2022-10-25 华东交通大学 Fine-grained image identification method and device based on reliability evaluation and iterative learning
CN115294400B (en) * 2022-08-23 2023-03-31 北京医准智能科技有限公司 Training method and device for image classification model, electronic equipment and storage medium
CN115187819A (en) * 2022-08-23 2022-10-14 北京医准智能科技有限公司 Training method and device for image classification model, electronic equipment and storage medium
CN115294400A (en) * 2022-08-23 2022-11-04 北京医准智能科技有限公司 Training method and device for image classification model, electronic equipment and storage medium
CN115471724A (en) * 2022-11-02 2022-12-13 青岛杰瑞工控技术有限公司 Fine-grained fish epidemic disease identification fusion algorithm based on self-adaptive normalization
CN115794357A (en) * 2023-01-16 2023-03-14 山西清众科技股份有限公司 Device and method for automatically building multi-task network
CN115830402B (en) * 2023-02-21 2023-09-12 华东交通大学 Fine-granularity image recognition classification model training method, device and equipment
CN115830402A (en) * 2023-02-21 2023-03-21 华东交通大学 Fine-grained image recognition classification model training method, device and equipment
CN116109629B (en) * 2023-04-10 2023-07-25 厦门微图软件科技有限公司 Defect classification method based on fine granularity recognition and attention mechanism
CN116109629A (en) * 2023-04-10 2023-05-12 厦门微图软件科技有限公司 Defect classification method based on fine granularity recognition and attention mechanism
CN117593557A (en) * 2023-09-27 2024-02-23 北京邮电大学 Fine-grained biological image classification method based on transducer model
CN117593215A (en) * 2024-01-19 2024-02-23 哈尔滨工业大学(深圳)(哈尔滨工业大学深圳科技创新研究院) Large-scale vision pre-training method and system for generating model enhancement
CN117593215B (en) * 2024-01-19 2024-03-29 哈尔滨工业大学(深圳)(哈尔滨工业大学深圳科技创新研究院) Large-scale vision pre-training method and system for generating model enhancement
CN117853875A (en) * 2024-03-04 2024-04-09 华东交通大学 Fine-granularity image recognition method and system
CN117853875B (en) * 2024-03-04 2024-05-14 华东交通大学 Fine-granularity image recognition method and system

Similar Documents

Publication Publication Date Title
CN114119979A (en) Fine-grained image classification method based on segmentation mask and self-attention neural network
US10713563B2 (en) Object recognition using a convolutional neural network trained by principal component analysis and repeated spectral clustering
Thoma Analysis and optimization of convolutional neural network architectures
CN107066559B (en) Three-dimensional model retrieval method based on deep learning
Van Der Maaten Accelerating t-SNE using tree-based algorithms
Cheng et al. Image recognition technology based on deep learning
CN110929029A (en) Text classification method and system based on graph convolution neural network
CN109063666A (en) The lightweight face identification method and system of convolution are separated based on depth
CN113572742B (en) Network intrusion detection method based on deep learning
JP2018513507A (en) Relevance score assignment for artificial neural networks
CN109740686A (en) A kind of deep learning image multiple labeling classification method based on pool area and Fusion Features
CN105184298A (en) Image classification method through fast and locality-constrained low-rank coding process
CN114332544B (en) Image block scoring-based fine-grained image classification method and device
Jha et al. Extracting low‐dimensional psychological representations from convolutional neural networks
CN112418261B (en) Human body image multi-attribute classification method based on prior prototype attention mechanism
CN112766283A (en) Two-phase flow pattern identification method based on multi-scale convolution network
CN116363439A (en) Point cloud classification method, device and equipment based on multi-head self-attention
CN116310563A (en) Noble metal inventory management method and system
CN112749737A (en) Image classification method and device, electronic equipment and storage medium
Kuhn et al. Brcars: a dataset for fine-grained classification of car images
US20230154185A1 (en) Multi-source panoptic feature pyramid network
Garcia et al. A methodology for neural network architectural tuning using activation occurrence maps
CN111652079B (en) Expression recognition method and system applied to mobile crowd and storage medium
CN114170460A (en) Multi-mode fusion-based artwork classification method and system
CN110597983B (en) Hierarchical text classification calculation method based on category embedding

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination