CN111985538A - Small sample picture classification model and method based on semantic auxiliary attention mechanism - Google Patents

Small sample picture classification model and method based on semantic auxiliary attention mechanism

Info

Publication number
CN111985538A
CN111985538A (application CN202010732273.0A)
Authority
CN
China
Prior art keywords
semantic
attention
small sample
picture classification
sample picture
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010732273.0A
Other languages
Chinese (zh)
Inventor
徐行
徐贤达
沈复民
贾可
申恒涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chengdu Koala Youran Technology Co ltd
Original Assignee
Chengdu Koala Youran Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chengdu Koala Youran Technology Co ltd filed Critical Chengdu Koala Youran Technology Co ltd
Priority to CN202010732273.0A priority Critical patent/CN111985538A/en
Publication of CN111985538A publication Critical patent/CN111985538A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a small sample picture classification model and method based on a semantic auxiliary attention mechanism, belonging to the field of small sample picture classification in computer vision. The model comprises a convolutional neural network, an extension model for zero sample picture classification, a spatial attention module and a semantic alignment module. The method comprises the following steps: selecting a training data set; constructing the network structure of the small sample picture classification model based on the semantic auxiliary attention mechanism; preprocessing the training data, dividing it into a training set, a validation set and a test set, and subdividing each sub-dataset into episodes each comprising a support set and a test set; training the small sample picture classification model; and validating the small sample picture classification model. The invention combines the principles of the attention mechanism and multi-modal learning and is divided into two sub-modules, a spatial attention module and a semantic alignment module, so that it can concentrate on local regions and achieve better classification of small sample pictures.

Description

Small sample picture classification model and method based on semantic auxiliary attention mechanism
Technical Field
The invention belongs to the field of small sample picture classification in computer vision, and particularly relates to a small sample picture classification model and method based on a semantic auxiliary attention mechanism.
Background
The small sample learning problem (few-shot learning) addresses how to perform effective machine learning when only a few samples are available. Small sample learning is closer to the way humans learn and has high academic and industrial value. First, it helps relieve the pressure of acquiring supervised data; second, it helps solve the learning problem for rare samples. Because small sample learning frees the model, to a certain extent, from dependence on large amounts of labeled data, it has become one of the research hotspots in the field of artificial intelligence in recent years.
The small sample picture classification problem (few-shot image classification) is a sub-problem of the small sample learning problem, and aims to solve classification when only a small number of picture samples is provided. The data set D for small sample picture classification is composed of three sub-datasets: a training data set, a validation data set and a test data set. The three sub-datasets have disjoint class spaces, that is, a class appearing in one sub-dataset does not appear in the other two.
In the conventional classification problem, the three sub-datasets share the same class space and each class has sufficient picture samples to learn from. This is why a classifier trained on the training data set can easily perform well on the test data set. In the small sample classification problem, however, the traditional way of training a classifier performs poorly, because the three sub-datasets now have separate class spaces and the samples for each class are limited. Therefore, researchers have introduced the concept of meta-learning to the small sample classification problem, learning how to transfer the knowledge acquired during training to the testing stage in order to solve the same problem there.
Most successful small sample classification algorithms follow the training framework of "episodic learning", which divides a task into many episodes; each episode contains a support set and a test set drawn from one sub-dataset, and the algorithm uses the information of the labeled pictures in the support set to classify the pictures in the test set. All training, validation and testing are performed in episodes, each randomly sampled from the corresponding sub-dataset. During training, the model is updated according to the result of each episode, and validation and testing likewise report feedback per episode.
The embedding-and-metric algorithm is one of the algorithms commonly used to solve the small sample classification problem and the zero sample classification problem. For the small sample classification problem, embedding-and-metric algorithms typically train a neural network that maps all labeled and unlabeled samples into an embedding space, and then match similar samples by some distance metric. Specifically, the method comprises the following three steps: 1. feature embedding: map the pictures of the support set and the test set into the embedding space with a feature embedding neural network, obtaining a feature embedding vector for each picture; 2. class-center representation: obtain an embedding vector for each class from the feature embedding vectors of the support set pictures to serve as the class center vector; 3. distance metric: classify according to the distance between the feature embedding vector of each test set picture and the class center vector of each class. A minimal sketch of these three steps is given below.
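For illustration only, the following PyTorch-style sketch classifies the test (query) pictures of one episode with these three steps; the embedding network `embed`, the tensor shapes and the Euclidean metric are assumptions, not the exact implementation of the invention.

```python
import torch

def classify_episode(embed, support_x, support_y, query_x, n_way):
    """Embedding-and-metric sketch: embed, build class centers, classify by distance."""
    z_support = embed(support_x)                 # step 1: feature embedding (N*K, D)
    z_query = embed(query_x)                     # (Q, D)
    centers = torch.stack([                      # step 2: class-center representation
        z_support[support_y == c].mean(dim=0) for c in range(n_way)
    ])                                           # (N, D)
    dists = torch.cdist(z_query, centers)        # step 3: distance metric (Q, N)
    return dists.argmin(dim=1)                   # predicted class per test picture
```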
Traditional embedding-and-metric algorithms suffer from an ambiguity problem in feature embedding, which biases the class-center representation and thereby harms the classification task. This ambiguity arises mainly because the regions of interest of a conventional convolutional neural network structure are easily disturbed by the external environment. Therefore, the conventional convolutional neural network structure needs to be improved so that it can focus on local regions.
Disclosure of Invention
The invention aims to provide a small sample picture classification model and method based on a semantic auxiliary attention mechanism.
The invention solves the technical problem, and adopts the technical scheme that: the small sample picture classification model based on the semantic auxiliary attention mechanism comprises a convolutional neural network, an expansion model for zero sample picture classification, a spatial attention module and a semantic alignment module;
the convolutional neural network is used for extracting picture features and measuring the features in an embedding space;
the extended model for zero sample picture classification is used for mapping the semantic vector to an embedding space to obtain a class center vector and mapping the picture to the embedding space to obtain a feature vector of each picture;
the spatial attention module is used for generating two corresponding visual maps by applying average pooling and maximum pooling operations along the channel dimension, merging the two visual maps, and determining where to attend and where to suppress by a convolution operation, to obtain a visual attention map;
the semantic alignment module is used for acquiring class label semantic embedding vectors, calculating to obtain a semantic attention map according to the feature map of the input picture and the class label semantic embedding vectors of the corresponding class, and activating the visual attention map and the semantic attention map to obtain a refined attention map.
Further, the convolutional neural network includes four convolution modules, each of which specifically comprises: a convolutional layer with 64 3 x 3 filters, a batch normalization layer, a spatial attention layer and a nonlinear activation function layer.
Further, in the fourth convolution module, the semantic-assistance-based attention module is used when training on support set samples, while the spatial attention layer is retained for test set samples.
Further, there is a 2 x 2 max pooling layer between every two convolution modules.
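The backbone described in the preceding paragraphs can be sketched roughly as follows in PyTorch; the attention argument is a placeholder for the spatial attention layer (or the semantic-assistance-based module in the fourth block, both detailed later), and the 3-channel input is an assumption. This is an illustrative sketch, not the patent's exact implementation.

```python
import torch.nn as nn

class ConvModule(nn.Module):
    """One convolution module: Conv(64, 3x3) -> BN -> attention -> ReLU."""
    def __init__(self, in_ch, attention=None):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, 64, kernel_size=3, padding=1)
        self.bn = nn.BatchNorm2d(64)
        self.attn = attention if attention is not None else nn.Identity()
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.attn(self.bn(self.conv(x))))

def build_backbone():
    layers, in_ch = [], 3                      # RGB input assumed
    for i in range(4):
        layers.append(ConvModule(in_ch))
        in_ch = 64
        if i < 3:                              # 2x2 max pooling between modules
            layers.append(nn.MaxPool2d(2))
    return nn.Sequential(*layers)
```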
Further, the extended model for zero-sample picture classification comprises a class embedding network and a feature embedding network;
the class embedded network comprises a full connection layer, a batch normalization layer and a nonlinear activation layer, and is used for mapping semantic vectors to an embedded space to obtain class center vectors;
the feature embedding network selects a pre-trained GoogLeNet as a feature embedding model and is used for mapping the pictures to an embedding space to obtain a feature vector of each picture.
Further, the calculation formula corresponding to the average pooling operation is as follows:
F_avg(i, j) = (1/C) · Σ_{c=1}^{C} F(c, i, j)
and the calculation formula corresponding to the maximum pooling operation is as follows:
F_max(i, j) = max_{1≤c≤C} F(c, i, j)
wherein: H denotes the height of the input picture, W denotes the width of the input picture, C denotes the number of channels of the input feature map F, and the resulting maps F_avg, F_max ∈ R^{1×H×W} have depth, height and width of 1, H and W, respectively.
Further, the semantic alignment module obtains the class label semantic embedding vector as follows: the corresponding class label vector is obtained by querying a pre-learned semantic model, and the class label semantic embedding vector is obtained after passing it through a multi-layer perceptron (MLP).
Further, the semantic attention map is calculated from the feature map of the input picture and the class label semantic embedding vector of the corresponding class as follows: the feature map of the input picture is measured against the class label semantic embedding vector, and the semantic attention map is obtained through Softmax.
Further, the visual attention map and the semantic attention map are activated together by using a Sigmoid function, so that a refined attention map is obtained.
Further, the small sample picture classification model incurs a loss during training, calculated by the formula: loss = loss_c + λ·loss_w
wherein loss_c is the loss of the model's picture classification task, loss_w is the loss of the semantic alignment module's multi-layer perceptron (MLP), and λ is a hyperparameter controlling the relative weight of the losses.
Further, the loss of the picture classification task, loss_c, is calculated as follows:
loss_c = −(1/(N·Q)) Σ_{c=1}^{N} Σ_{q} log( exp(−d(q, p_c)) / Σ_{c′=1}^{N} exp(−d(q, p_{c′})) )
wherein N represents the number of classes, Q represents the number of test pictures, p_c represents the center vector of class c, q represents the embedding of each test picture, and d is the distance metric.
Further, the loss of the semantic alignment module's multi-layer perceptron (MLP), loss_w, is calculated as follows:
loss_w = Σ_{i=1}^{N−1} | α − f_s(f(F), w^+) + f_s(f(F), w_i^−) |_+
wherein w represents the class label embedding vector (w^+ for the positive class and w_i^− for each negative class), F represents the input map, M_w represents the matching map, the function f represents the convolutional network mapping, the function f_s represents the distance metric function, |x|_+ = max(x, 0), and α is a manually set boundary value.
In addition, the invention also provides a construction method of the small sample picture classification model based on the semantic assistant attention mechanism, which is applied to the small sample picture classification model based on the semantic assistant attention mechanism and comprises the following steps:
step 1, selecting a training data set;
step 2, constructing a network structure of the small sample picture classification model based on the semantic auxiliary attention mechanism;
step 3, preprocessing the training data, dividing it into a training set, a validation set and a test set, and subdividing each sub-dataset into episodes each comprising a support set and a test set;
step 4, training a small sample picture classification model;
and 5, verifying the small sample picture classification model.
Further, in step 2, the network structure is a network structure of a convolutional neural network.
The small sample picture classification model and method based on the semantic auxiliary attention mechanism have the advantage that a semantic-assistance-based attention mechanism is introduced into the small sample picture classification model to optimize the feature extraction process, thereby improving the model's performance on the small sample picture classification task. The mechanism combines the principles of the attention mechanism and multi-modal learning and is divided into two sub-modules, a spatial attention module and a semantic alignment module: the spatial attention module focuses on local information of the picture and extracts representative local feature vectors, while the semantic alignment module assists the spatial attention module and refines the local features by joining local feature regions and class label vectors. In this way, the invention can concentrate on local regions and achieve better classification of small sample pictures.
Drawings
FIG. 1 is a flowchart of the above method for classifying small sample pictures based on the semantic assistant attention mechanism according to the present invention;
FIG. 2 is a schematic diagram of a semantic assistance-based attention mechanism according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a loss calculation of the semantic alignment module according to the embodiment of the present invention;
FIG. 4 is a diagram of a network structure of a small sample image classification model according to an embodiment of the present invention;
FIG. 5 is a network structure diagram of a zero-sample picture classification model according to an embodiment of the present invention;
FIG. 6 is a CAM visualization, in accordance with an embodiment of the present invention.
In the figures, "poodle" denotes the category of the picture; W is the pre-learned semantic model; MLP is the multi-layer perceptron; D is the distance function; × is the multiplication operation; S is the activation function; Softmax is the softmax function; w'_poodle and the like denote the class label vectors of each category; M is the difference loss function; WS is the weighted sum; Conv is the convolutional layer; BN is the batch normalization layer; SAM is the spatial attention layer; ReLU is the nonlinear activation function layer; MaxPool is the max pooling layer; SAAM is the semantic-assistance-based attention layer; A denotes the attribute model, from which each class obtains an attribute vector a; W denotes the semantic model, from which each class obtains a class label vector w; FC is the fully connected layer; GoogLeNet is the classical deep convolutional model; and "House Finch", "Arctic Fox" and the like denote the picture categories.
Detailed Description
The technical solution of the present invention is described in detail below with reference to the accompanying drawings and embodiments.
The invention firstly provides a small sample picture classification model based on a semantic auxiliary attention mechanism, which comprises a convolutional neural network, an expansion model for zero sample picture classification, a spatial attention module and a semantic alignment module.
In the model, the convolutional neural network is used for extracting picture features and measuring the features in an embedding space; the extension model for zero sample picture classification is used for mapping the semantic vector into the embedding space to obtain the class center vector, and for mapping the pictures into the embedding space to obtain the feature vector of each picture; the spatial attention module is used for generating two corresponding visual maps by applying average pooling and maximum pooling operations along the channel dimension, merging the two visual maps, and determining where to attend and where to suppress by a convolution operation, to obtain a visual attention map; and the semantic alignment module is used for acquiring class label semantic embedding vectors, calculating a semantic attention map from the feature map of the input picture and the class label semantic embedding vector of the corresponding class, and activating the visual attention map together with the semantic attention map to obtain a refined attention map.
In the above model, the convolutional neural network preferably includes four convolutional modules, and the convolutional modules specifically are: a convolutional layer containing 64 3 x 3 filters, a batch normalization layer, a spatial attention layer, and a nonlinear activation function layer. In the nonlinear activation function layer, an attention module based on semantic assistance is used in the training of the support set samples, and a spatial attention layer is used in the testing of the test set samples. In addition, there may be a 2 x 2 max pooling layer between every two convolution modules.
In practical application, the convolutional neural network serves as the base network of the small sample picture classification model, extracting picture features and measuring the features in an embedding space. The network consists of 4 convolution modules; the first three each include a convolutional layer (Conv) with 64 3 x 3 filters, a batch normalization layer (BN), a spatial attention layer (SAM) and a nonlinear activation function layer (ReLU). In the fourth convolution module, for the support set samples, the spatial attention layer is replaced with the semantic-assistance-based attention module (SAAM); for the test set samples, the spatial attention layer remains. There is a 2 x 2 max pooling layer (MaxPool) between every two convolution modules. Using 4 convolution modules is a compromise on fitting: too few convolution modules cause under-fitting, while too many cause over-fitting. In each convolution module, an attention layer assists feature extraction. The difference between support set and test set samples is that the label of each class is visible when training with support set samples; therefore, in the fourth convolution module, the semantic-assistance-based attention module is used when training with support set samples, while the spatial attention layer is still used when testing with test set samples.
In addition, the extended model for zero-sample picture classification can comprise a class embedded network and a feature embedded network; the class embedded network comprises a full connection layer, a batch normalization layer and a nonlinear activation layer and is used for mapping the semantic vector to an embedded space to obtain a class center vector; the feature embedding network selects a pre-trained GoogLeNet as a feature embedding model for mapping the picture to an embedding space to obtain a feature vector of each picture.
It should be noted that the calculation formula corresponding to the average pooling operation is preferably:
F_avg(i, j) = (1/C) · Σ_{c=1}^{C} F(c, i, j)
and the calculation formula corresponding to the maximum pooling operation is preferably:
F_max(i, j) = max_{1≤c≤C} F(c, i, j)
wherein: H denotes the height of the input picture, W denotes the width of the input picture, C denotes the number of channels of the input feature map F, and the resulting maps F_avg, F_max ∈ R^{1×H×W} have depth, height and width of 1, H and W, respectively.
In addition, the semantic alignment module preferably obtains the class label semantic embedding vector as follows: the corresponding class label vector is obtained by querying the pre-learned semantic model, and the class label semantic embedding vector is obtained after passing it through a multi-layer perceptron (MLP). The semantic attention map is preferably calculated from the feature map of the input picture and the class label semantic embedding vector of the corresponding class as follows: the feature map of the input picture is measured against the class label semantic embedding vector, and the semantic attention map is obtained through Softmax.
In the invention, an attention mechanism based on semantic assistance can be defined, and the attention mechanism is composed of a space attention module and a semantic alignment module.
The spatial attention module aims to mine the internal spatial correlation among the features of the input feature map, helping the model determine which regions of the input feature map to attend to, highlighting key region features, suppressing useless features, and forming attention. Specifically, the spatial attention mechanism first applies average pooling and maximum pooling operations along the channel dimension to generate two corresponding visual maps; it then merges the two visual maps and uses a convolution operation to determine where to focus and where to suppress, resulting in an attention map.
The semantic alignment module aims to join local feature regions and class label vectors to refine the feature attention map generated under the spatial attention mechanism. In particular, by learning relationships between visual information and semantic information in multiple modalities, the model can be helped to better locate critical features. Semantic information here refers to a semantic vector for each class label obtained from a pre-learned semantic model. The corresponding class label vector is obtained by inquiring on a GloVe semantic model, and the class label semantic embedded vector is obtained after the vector passes through a multi-layer perceptron MLP. The module measures the feature map of the input picture and the class label semantic embedded vector of the corresponding class, obtains a semantic attention map through Softmax, and activates the semantic attention map and the visual attention map obtained by the spatial attention module by using a Sigmoid function to obtain a refined attention map.
In general, the small sample picture classification model incurs a loss during training, calculated by the formula loss = loss_c + λ·loss_w, wherein loss_c is the loss of the model's picture classification task, loss_w is the loss of the semantic alignment module's multi-layer perceptron (MLP), and λ is a hyperparameter controlling the relative weight of the losses.
In particular, the loss of the picture classification task, loss_c, is calculated as follows:
loss_c = −(1/(N·Q)) Σ_{c=1}^{N} Σ_{q} log( exp(−d(q, p_c)) / Σ_{c′=1}^{N} exp(−d(q, p_{c′})) )
wherein N represents the number of classes, Q represents the number of test pictures, p_c represents the center vector of class c, q represents the embedding of each test picture, and d is the distance metric.
The loss of the semantic alignment module's multi-layer perceptron (MLP), loss_w, is calculated as follows:
loss_w = Σ_{i=1}^{N−1} | α − f_s(f(F), w^+) + f_s(f(F), w_i^−) |_+
wherein w represents the class label embedding vector (w^+ for the positive class and w_i^− for each negative class), F represents the input map, M_w represents the matching map, the function f represents the convolutional network mapping, the function f_s represents the distance metric function, |x|_+ = max(x, 0), and α is a manually set boundary value.
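For illustration, the classification loss and the weighted combination can be sketched as follows; the toy tensor shapes, the Euclidean distance d, and λ = 0.1 are assumptions, and loss_w stands in for the output of the semantic alignment module.

```python
import torch
import torch.nn.functional as F

def loss_c(query_z, centers, labels):
    """Cross-entropy over the softmax of negative distances to the N class centers."""
    logits = -torch.cdist(query_z, centers)      # (Q, N)
    return F.cross_entropy(logits, labels)

# Toy usage: N=5 classes, Q=10 test pictures, D=1600-dimensional embeddings.
qz, centers = torch.randn(10, 1600), torch.randn(5, 1600)
labels = torch.randint(0, 5, (10,))
loss_w = torch.tensor(0.0)        # placeholder for the semantic alignment loss
lam = 0.1                         # assumed value of the hyperparameter lambda
loss = loss_c(qz, centers, labels) + lam * loss_w
```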
Thus, the small sample picture classification model provided by the invention focuses on exploring the potential of feature extraction. The semantic-assistance-based attention mechanism optimizes feature extraction well: it searches the input feature map for refined local features, making the class center vector more representative and thereby improving classification performance. Specifically, the mechanism consists of two sub-modules, a spatial attention module and a semantic alignment module. The spatial attention module learns a mask that highlights critical areas in image space and suppresses irrelevant areas, so that the network model focuses on certain regions of the image and forms attention. The semantic alignment module draws on the idea of multi-modal learning: it introduces pre-learned class label semantic vectors, embeds the visual semantics through an alignment mechanism, and refines the attention map of the spatial attention module.
Therefore, referring to fig. 1, which is a flowchart of a method for constructing a small sample image classification model based on a semantic assistant attention mechanism, the steps of solving a small sample image classification task by using the small sample image classification model of the present invention are as follows:
step 1, selecting a training data set;
step 2, constructing a network structure of a small sample picture classification model based on a semantic auxiliary attention mechanism;
step 3, preprocessing the training data, dividing it into a training set, a validation set and a test set, and subdividing each sub-dataset into episodes each comprising a support set and a test set;
step 4, training a small sample picture classification model;
and 5, verifying the small sample picture classification model.
In step 2 of the above method, the network structure is preferably a network structure of a convolutional neural network.
Here, the small sample picture classification model of the present invention is extended to solve the zero sample picture classification task. The zero sample picture classification model is composed of a class embedding network and a feature embedding network. The class embedding network learns a linear network to construct a 1024-dimensional embedding space; each linear module consists of a fully connected layer, a batch normalization layer and a nonlinear activation layer. The default class attribute vector provided by the data set is combined with the class label vector obtained from GloVe to form a merged semantic vector for learning. The feature embedding network selects a pre-trained GoogLeNet as the main feature embedding model, with the last linear layer of GoogLeNet replaced by a new linear layer so that the dimension of the output features matches that of the embedding space.
Examples
The embodiment of the invention takes the small sample picture classification model based on the semantic auxiliary attention mechanism as an example for detailed description. The overall structure of the module is shown in fig. 2. The module consists of two branches: a spatial attention mechanism and a semantic alignment mechanism. Suppose the input feature map is F ∈ R^{C×H×W}. The module aims to generate a mask for it, i.e. a semantic-assistance-based attention map M ∈ R^{1×H×W}. The output feature map F′ ∈ R^{C×H×W} is then calculated according to the following formula:
F′ = M ⊗ F
where ⊗ is a bitwise multiplication operation that applies the attention values to the input feature map.
The spatial attention mechanism aims to mine the internal spatial correlation among the features of the input feature map. In other words, it helps the model determine which regions of the input feature map to focus on, highlighting key region features, suppressing unwanted features, and forming attention.
Specifically, the spatial attention mechanism first applies average pooling and maximum pooling operations along the channel dimension to generate the two corresponding visual maps: F_avg ∈ R^{1×H×W} from the average pooling operation and F_max ∈ R^{1×H×W} from the maximum pooling operation. After this, it merges the two visual maps and uses a convolution operation to determine where to focus and where to suppress, resulting in the attention map M_a. The overall calculation formula is as follows:
M_a = Conv([f_avg(F); f_max(F)]) = Conv([F_avg; F_max])
in the formula, f_avg represents the average pooling operation, f_max represents the maximum pooling operation, and Conv is a convolution operation with a 7 × 7 convolution kernel.
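Under the assumption that the layer follows the formulas above exactly (channel-wise pooling, concatenation, 7 × 7 convolution), a minimal PyTorch sketch of the spatial attention layer is:

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Sketch of SAM: channel-wise average/max pooling, concat, 7x7 conv -> M_a."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, f):                          # f: (B, C, H, W)
        f_avg = f.mean(dim=1, keepdim=True)        # F_avg: (B, 1, H, W)
        f_max = f.amax(dim=1, keepdim=True)        # F_max: (B, 1, H, W)
        return self.conv(torch.cat([f_avg, f_max], dim=1))   # M_a: (B, 1, H, W)

# For test set samples the map can be applied directly, e.g.
#   f_out = torch.sigmoid(sam(f)) * f
# For support set samples, M_a is instead fused with the semantic map M_w
# (see the refined-map formula later in this embodiment).
```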
The semantic alignment mechanism is used to join local feature regions and class label vectors to refine the feature attention map generated under the spatial attention mechanism. It should be noted that it is only applied to the samples of the support set, because the pictures of the test set do not touch the class labels during classification. In particular, by learning relationships between visual information and semantic information in multiple modalities, the model can be helped to better locate critical features. Semantic information here refers to a semantic vector for each class label obtained from a pre-learned semantic model.
The pre-learned semantic model used in this embodiment is GloVe (Global Vectors for Word Representation), a word representation tool based on global word-frequency statistics that converts a text word into a real-valued vector. These vectors capture semantic relations between words, such as similarity and analogy. By computing the similarity between two such vectors, the semantic similarity between the corresponding words can be obtained.
GloVe provides semantic models of different dimensions; this embodiment selects the one of length 100. As shown in FIG. 2, taking the Poodle class as an example, the class label "Poodle" is queried in the GloVe semantic model to obtain the corresponding real-valued vector w_poodle, which, after passing through a multi-layer perceptron (MLP), yields the semantic embedding vector w'_poodle.
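A minimal sketch of this lookup-then-MLP step is given below; the GloVe file name `glove.6B.100d.txt` and the MLP layer sizes are assumptions for illustration.

```python
import numpy as np
import torch
import torch.nn as nn

def load_glove(path="glove.6B.100d.txt"):        # assumed file of 100-d GloVe vectors
    vectors = {}
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            word, *vals = line.rstrip().split(" ")
            vectors[word] = np.asarray(vals, dtype=np.float32)
    return vectors

glove = load_glove()
w_poodle = torch.from_numpy(glove["poodle"])     # class label vector w_poodle
mlp = nn.Sequential(nn.Linear(100, 128), nn.ReLU(), nn.Linear(128, 64))
w_embed = mlp(w_poodle)                          # semantic embedding vector w'_poodle
```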
It is known that each episode contains pictures from N classes. So, for each sample of the support set, there is one positive label embedding vector w^+ and N−1 negative label embedding vectors {w_i^−}, i = 1, …, N−1. The positive label embedding vector is used to generate the visual-semantic matching map M_w, where M_{w,i} denotes the degree of correlation between w^+ and each region F_i of the input feature map. The calculation formula is as follows:
M_{w,i} = f_s(w^+, F_i)
wherein f_s is a similarity function that measures the similarity of two embedding vectors.
Thus, the refined feature map can be obtained according to the following formula: the two attention maps are multiplied bitwise and activated with a Sigmoid function:
M = Sigmoid(M_a ⊗ M_w)
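The matching map and the refined attention can be sketched as follows; cosine similarity stands in for the unspecified similarity function f_s, and the embedding dimensions are assumed to match the feature-map channels.

```python
import torch
import torch.nn.functional as F

def matching_map(feature_map, w_pos):
    """M_w: similarity between w+ and every region F_i, normalized by Softmax."""
    b, c, h, w = feature_map.shape
    regions = feature_map.flatten(2).transpose(1, 2)                  # (B, H*W, C)
    sim = F.cosine_similarity(regions, w_pos[None, None, :], dim=-1)  # (B, H*W)
    return sim.softmax(dim=-1).view(b, 1, h, w)                       # (B, 1, H, W)

def refined_features(feature_map, m_a, w_pos):
    """F' = Sigmoid(M_a (x) M_w) (x) F: fuse the two maps, apply bitwise."""
    m = torch.sigmoid(m_a * matching_map(feature_map, w_pos))
    return m * feature_map
```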
to join local feature regions and class label vectors, we introduce a loss herew
FIG. 3 shows a schematic diagram of the loss calculation in a 5-way scenario. Here, since the support set sample currently being processed is a picture of the "Poodle" class, the positive class is the "Poodle" class and the other 4 classes are negative classes. Specifically, the loss is calculated according to the following formula:
loss_w = Σ_{i=1}^{N−1} | α − f_s(f(F), w^+) + f_s(f(F), w_i^−) |_+
where α is a manually set hyperparameter threshold, and |x|_+ is the simple operation:
|x|_+ = max(x, 0)
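A sketch of this margin loss, with cosine similarity again standing in for f_s and α = 0.5 chosen arbitrarily:

```python
import torch
import torch.nn.functional as F

def loss_w(f_embed, w_pos, w_negs, alpha=0.5):
    """Hinge loss over the N-1 negative label embeddings: sum over i of
    |alpha - f_s(f(F), w+) + f_s(f(F), w_i-)|_+ with |x|_+ = max(x, 0)."""
    s_pos = F.cosine_similarity(f_embed, w_pos, dim=0)                 # scalar
    s_negs = F.cosine_similarity(f_embed.unsqueeze(0), w_negs, dim=1)  # (N-1,)
    return torch.clamp(alpha - s_pos + s_negs, min=0.0).sum()
```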
the network structure after adding the attention module based on semantic assistance is shown in fig. 4.
Likewise, the network structure consists of 4 convolution modules. The first three convolution modules include a convolution layer containing 64 3 x 3 filters, a batch normalization layer, a spatial attention layer and a nonlinear activation function layer. In the fourth convolution module, the spatial attention layer is replaced with a semantic assistance-based attention module for the support set samples, and the spatial attention layer remains for the test set samples. There is a 2 x 2 max pooling layer between every two convolution modules.
The network generates 1600-dimensional embedded feature vectors for all support set samples and test set pictures within the episode. These vectors are fed into the metric learning module, which follows the class-center representation and distance metric process: the embedded feature vectors of the support set samples are used to generate the class center representations, and the test set pictures are classified by measuring the distance between their embedded feature vectors and each class center representation.
For zero sample classification, this work extends the small sample classification method above and introduces additional semantic information, namely the class label vector, to optimize the class-center embedding process. Specifically, the class attribute vector and the class label vector are combined into a merged semantic vector. The class embedding network learns a mapping from the merged semantic vector into the embedding space, forming the class center vector. The network structure follows that of most embedding-and-metric algorithms for the zero sample classification problem, as shown in fig. 5:
class embedding networks learn a linear network to construct a 1024-dimensional embedding space. Each linear module consists of a full connection layer, a batch normalization layer and a nonlinear activation layer. The default class attribute vector provided by the data set and the class label vector obtained from the GloVe are selected to be combined to form a combined semantic vector for learning.
The feature embedding network selects a pre-trained GoogLeNet as the main feature embedding model. GoogLeNet won the ImageNet challenge in 2014. Before it appeared, most neural network structures obtained better training results by increasing network depth; GoogLeNet innovatively proposed the Inception structure and optimized from another angle, using computing resources more efficiently to extract more features under the same computational budget and thereby improving the training effect. To make the feature dimension extracted by the feature embedding network consistent with that of the embedding space, the last linear layer of GoogLeNet is replaced with a new linear layer whose output dimension equals that of the embedding space, namely 1024.
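A minimal sketch of that replacement using torchvision (the weights enum requires torchvision ≥ 0.13; the 1024-d output follows the text above):

```python
import torch.nn as nn
from torchvision import models

# Load GoogLeNet pre-trained on ImageNet and swap its last linear layer so the
# output feature dimension equals the 1024-d embedding space.
googlenet = models.googlenet(weights=models.GoogLeNet_Weights.IMAGENET1K_V1)
googlenet.fc = nn.Linear(googlenet.fc.in_features, 1024)
```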
Verification produced the CAM visualization results shown in fig. 6.

Claims (14)

1. The small sample picture classification model based on the semantic auxiliary attention mechanism is characterized by comprising a convolutional neural network, an expansion model for zero sample picture classification, a spatial attention module and a semantic alignment module;
the convolutional neural network is used for extracting picture features and measuring the features in an embedding space;
the extended model for zero sample picture classification is used for mapping the semantic vector to an embedding space to obtain a class center vector and mapping the picture to the embedding space to obtain a feature vector of each picture;
the spatial attention module is used for generating two corresponding visual maps by applying average pooling and maximum pooling operations along the channel dimension, merging the two visual maps, and determining where to attend and where to suppress by a convolution operation, to obtain a visual attention map;
the semantic alignment module is used for acquiring class label semantic embedding vectors, calculating to obtain a semantic attention map according to the feature map of the input picture and the class label semantic embedding vectors of the corresponding class, and activating the visual attention map and the semantic attention map to obtain a refined attention map.
2. The small sample picture classification model based on the semantic assistant attention mechanism as claimed in claim 1, wherein the convolutional neural network comprises four convolution modules, each of which specifically comprises: a convolutional layer with 64 3 x 3 filters, a batch normalization layer, a spatial attention layer and a nonlinear activation function layer.
3. The small sample picture classification model based on the semantic assistant attention mechanism as claimed in claim 2, wherein, in the fourth convolution module, the semantic-assistance-based attention module is used when training on support set samples, while the spatial attention layer is used for test set samples.
4. The small sample picture classification model based on the semantic assistant attention mechanism as claimed in claim 2 or 3, characterized in that there is a maximum pooling layer of 2 x 2 between every two convolution modules.
5. The small sample picture classification model based on the semantic assistant attention mechanism as claimed in claim 1, wherein the extended model for zero sample picture classification comprises a class embedding network and a feature embedding network;
the class embedded network comprises a full connection layer, a batch normalization layer and a nonlinear activation layer, and is used for mapping semantic vectors to an embedded space to obtain class center vectors;
the feature embedding network selects a pre-trained GoogLeNet as a feature embedding model and is used for mapping the pictures to an embedding space to obtain a feature vector of each picture.
6. The small sample picture classification model based on the semantic assistant attention mechanism as claimed in claim 1, wherein the average pooling operation corresponds to the calculation formula:
F_avg(i, j) = (1/C) · Σ_{c=1}^{C} F(c, i, j)
and the maximum pooling operation corresponds to the calculation formula:
F_max(i, j) = max_{1≤c≤C} F(c, i, j)
wherein: H denotes the height of the input picture, W denotes the width of the input picture, C denotes the number of channels of the input feature map F, and the resulting maps F_avg, F_max ∈ R^{1×H×W} have depth, height and width of 1, H and W, respectively.
7. The small sample picture classification model based on the semantic assistant attention mechanism as claimed in claim 1, wherein the semantic alignment module obtains the class label semantic embedding vector by the following method: the corresponding class label vector is obtained by querying a pre-learned semantic model, and the class label semantic embedding vector is obtained after passing it through a multi-layer perceptron (MLP).
8. The small sample picture classification model based on the semantic aided attention mechanism as claimed in claim 1, wherein the semantic attention map is calculated from the feature map of the input picture and the class label semantic embedding vector of the corresponding class by the following method: the feature map of the input picture is measured against the class label semantic embedding vector of the corresponding class, and the semantic attention map is obtained through Softmax.
9. The small sample picture classification model based on the semantic aided attention mechanism as claimed in claim 1, 7 or 8, wherein the visual attention map and the semantic attention map are activated together by a Sigmoid function to obtain a refined attention map.
10. The small sample picture classification model based on the semantic assistant attention mechanism as claimed in claim 1, wherein the small sample picture classification model incurs a loss during training, calculated by the formula: loss = loss_c + λ·loss_w
wherein loss_c is the loss of the model's picture classification task, loss_w is the loss of the semantic alignment module's multi-layer perceptron (MLP), and λ is a hyperparameter controlling the relative weight of the losses.
11. The small sample picture classification model based on the semantic aided attention mechanism as claimed in claim 10, wherein the loss of the picture classification task, loss_c, is calculated as follows:
loss_c = −(1/(N·Q)) Σ_{c=1}^{N} Σ_{q} log( exp(−d(q, p_c)) / Σ_{c′=1}^{N} exp(−d(q, p_{c′})) )
wherein N represents the number of classes, Q represents the number of test pictures, p_c represents the center vector of class c, q represents the embedding of each test picture, and d is the distance metric.
12. The small sample picture classification model based on the semantic assistant attention mechanism as claimed in claim 10 or 11, wherein the loss of the semantic alignment module's multi-layer perceptron (MLP), loss_w, is calculated as follows:
loss_w = Σ_{i=1}^{N−1} | α − f_s(f(F), w^+) + f_s(f(F), w_i^−) |_+
wherein w represents the class label embedding vector (w^+ for the positive class and w_i^− for each negative class), F represents the input map, M_w represents the matching map, the function f represents the convolutional network mapping, the function f_s represents the distance metric function, |x|_+ = max(x, 0), and α is a manually set boundary value.
13. A construction method of the small sample picture classification model based on the semantic assistant attention mechanism, applied to the small sample picture classification model based on the semantic assistant attention mechanism according to any one of claims 1 to 12, characterized by comprising the following steps:
step 1, selecting a training data set;
step 2, constructing a network structure of the small sample picture classification model based on the semantic auxiliary attention mechanism;
step 3, preprocessing the training data, dividing it into a training set, a validation set and a test set, and subdividing each sub-dataset into episodes each comprising a support set and a test set;
step 4, training a small sample picture classification model;
and 5, verifying the small sample picture classification model.
14. The method for constructing the small sample picture classification model based on the semantic assistant attention mechanism as claimed in claim 13, wherein in the step 2, the network structure is a network structure of a convolutional neural network.
CN202010732273.0A 2020-07-27 2020-07-27 Small sample picture classification model and method based on semantic auxiliary attention mechanism Pending CN111985538A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010732273.0A CN111985538A (en) 2020-07-27 2020-07-27 Small sample picture classification model and method based on semantic auxiliary attention mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010732273.0A CN111985538A (en) 2020-07-27 2020-07-27 Small sample picture classification model and method based on semantic auxiliary attention mechanism

Publications (1)

Publication Number Publication Date
CN111985538A true CN111985538A (en) 2020-11-24

Family

ID=73444272

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010732273.0A Pending CN111985538A (en) 2020-07-27 2020-07-27 Small sample picture classification model and method based on semantic auxiliary attention mechanism

Country Status (1)

Country Link
CN (1) CN111985538A (en)

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112990282A (en) * 2021-03-03 2021-06-18 华南理工大学 Method and device for classifying fine-grained small sample images
CN112990282B (en) * 2021-03-03 2023-07-18 华南理工大学 Classification method and device for fine-granularity small sample images
CN113221951A (en) * 2021-04-13 2021-08-06 天津大学 Time domain attention pooling network-based dynamic graph classification method and device
CN113343974A (en) * 2021-07-06 2021-09-03 国网天津市电力公司 Multi-modal fusion classification optimization method considering inter-modal semantic distance measurement
CN113435531B (en) * 2021-07-07 2022-06-21 中国人民解放军国防科技大学 Zero sample image classification method and system, electronic equipment and storage medium
CN113298096B (en) * 2021-07-07 2021-10-01 中国人民解放军国防科技大学 Method, system, electronic device and storage medium for training zero sample classification model
CN113435531A (en) * 2021-07-07 2021-09-24 中国人民解放军国防科技大学 Zero sample image classification method and system, electronic equipment and storage medium
CN113298096A (en) * 2021-07-07 2021-08-24 中国人民解放军国防科技大学 Method, system, electronic device and storage medium for training zero sample classification model
CN113610164A (en) * 2021-08-10 2021-11-05 北京邮电大学 Fine-grained image recognition method and system based on attention balance
CN113610164B (en) * 2021-08-10 2023-12-22 北京邮电大学 Fine granularity image recognition method and system based on attention balance
CN113869418A (en) * 2021-09-29 2021-12-31 哈尔滨工程大学 Small sample ship target identification method based on global attention relationship network
CN113989405A (en) * 2021-12-27 2022-01-28 浙江大学 Image generation method based on small sample continuous learning
CN116503674A (en) * 2023-06-27 2023-07-28 中国科学技术大学 Small sample image classification method, device and medium based on semantic guidance
CN116503674B (en) * 2023-06-27 2023-10-20 中国科学技术大学 Small sample image classification method, device and medium based on semantic guidance

Similar Documents

Publication Publication Date Title
CN111985538A (en) Small sample picture classification model and method based on semantic auxiliary attention mechanism
Huang et al. Instance-aware image and sentence matching with selective multimodal lstm
CN111476294B (en) Zero sample image identification method and system based on generation countermeasure network
CN111259786B (en) Pedestrian re-identification method based on synchronous enhancement of appearance and motion information of video
Tang et al. RGBT salient object detection: Benchmark and a novel cooperative ranking approach
CN114926746B (en) SAR image change detection method based on multiscale differential feature attention mechanism
CN114067160A (en) Small sample remote sensing image scene classification method based on embedded smooth graph neural network
Gao et al. Multi‐dimensional data modelling of video image action recognition and motion capture in deep learning framework
CN108765383B (en) Video description method based on deep migration learning
Jiang et al. Hyperspectral image classification with spatial consistence using fully convolutional spatial propagation network
Lei et al. Ultralightweight spatial–spectral feature cooperation network for change detection in remote sensing images
Liang et al. Comparison detector for cervical cell/clumps detection in the limited data scenario
CN112651940B (en) Collaborative visual saliency detection method based on dual-encoder generation type countermeasure network
JP2008123486A (en) Method, system and program for detecting one or plurality of concepts by digital media
Li et al. Transfer learning for toxoplasma gondii recognition
Zhang et al. Quantifying the knowledge in a DNN to explain knowledge distillation for classification
Zhao et al. PCA dimensionality reduction method for image classification
Wang et al. Deep multi-person kinship matching and recognition for family photos
Zheng et al. Fine-grained image recognition via weakly supervised click data guided bilinear CNN model
Fan et al. A hierarchical Dirichlet process mixture of generalized Dirichlet distributions for feature selection
Xiong et al. An interpretable fusion siamese network for multi-modality remote sensing ship image retrieval
Xu et al. Graphical modeling for multi-source domain adaptation
Li et al. Enhanced bird detection from low-resolution aerial image using deep neural networks
Lin et al. The design of error-correcting output codes based deep forest for the micro-expression recognition
Akbar et al. Face recognition using hybrid feature space in conjunction with support vector machine

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination