CN111985538A - Small sample picture classification model and method based on semantic auxiliary attention mechanism - Google Patents

Small sample picture classification model and method based on semantic auxiliary attention mechanism

Info

Publication number
CN111985538A
CN111985538A (application CN202010732273.0A)
Authority
CN
China
Prior art keywords
semantic
attention
small sample
picture classification
sample picture
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010732273.0A
Other languages
Chinese (zh)
Inventor
徐行
徐贤达
沈复民
贾可
申恒涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chengdu Koala Youran Technology Co ltd
Original Assignee
Chengdu Koala Youran Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chengdu Koala Youran Technology Co ltd filed Critical Chengdu Koala Youran Technology Co ltd
Priority to CN202010732273.0A priority Critical patent/CN111985538A/en
Publication of CN111985538A publication Critical patent/CN111985538A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a small sample picture classification model and method based on a semantic auxiliary attention mechanism, belonging to the field of small sample picture classification in computer vision. The model comprises a convolutional neural network, an extension model for zero sample picture classification, a spatial attention module and a semantic alignment module. The method comprises the following steps: selecting a training data set; constructing the network structure of the small sample picture classification model based on the semantic auxiliary attention mechanism; preprocessing the training data, dividing it into a training set, a validation set and a test set, and subdividing each sub-dataset into episodes each comprising a support set and a test set; training the small sample picture classification model; and validating the small sample picture classification model. The invention combines the principles of the attention mechanism and multi-modal learning and is divided into two sub-modules, a spatial attention module and a semantic alignment module, so that it can concentrate on local regions and achieve better classification of small sample pictures.

Description

Small sample picture classification model and method based on semantic auxiliary attention mechanism
Technical Field
The invention belongs to the field of small sample picture classification in computer vision, and particularly relates to a small sample picture classification model and method based on a semantic auxiliary attention mechanism.
Background
The small sample learning problem (few-shot learning) addresses how to perform effective machine learning when only a few samples are available. Small sample learning is closer to the way humans learn and has high academic and industrial value. First, it helps relieve the pressure of acquiring supervised data; second, it helps solve the learning problem for rare samples. Because small sample learning frees the model, to a certain extent, from dependence on large amounts of labeled data, it has become one of the research hotspots in the field of artificial intelligence in recent years.
The small sample picture classification problem (few-shot image classification) is a sub-problem of the small sample learning problem, and aims to solve classification when only a small number of picture samples is provided. The data set D for small sample picture classification is composed of three sub-datasets: a training data set, a validation data set and a test data set. The three sub-datasets have disjoint class spaces, that is, a class appearing in one sub-dataset does not appear in the other two.
In the conventional classification problem, the three sub-datasets share the same class space and each class has sufficient picture samples to learn from. This is why a classifier trained on the training data set can easily perform well on the test data set. In the small sample classification problem, however, the traditional way of training a classifier performs poorly, because the three sub-datasets now have separate class spaces and the samples for each class are limited. Therefore, researchers have introduced the concept of meta-learning to the small sample classification problem, learning how to transfer the knowledge acquired during training to the testing stage in order to solve the same problem there.
Most successful small sample classification algorithms follow the training framework of "episodic learning", which divides a task into many episodes; each episode contains a support set and a test set drawn from one sub-dataset, and the algorithm uses the information of the labeled pictures in the support set to classify the pictures in the test set. All training, validation and testing are performed in episodes, each randomly sampled from the corresponding sub-dataset. During training, the model is updated according to the result of each episode, and validation and testing likewise report feedback per episode.
The embedding-and-metric algorithm is one of the algorithms commonly used to solve the small sample classification problem and the zero sample classification problem. For the small sample classification problem, embedding-and-metric algorithms typically train a neural network that maps all labeled and unlabeled samples into an embedding space, and then match similar samples by some distance metric. Specifically, the method comprises the following three steps: 1. feature embedding: map the pictures of the support set and the test set into the embedding space with a feature embedding neural network, obtaining a feature embedding vector for each picture; 2. class-center representation: obtain an embedding vector for each class from the feature embedding vectors of the support set pictures to serve as the class center vector; 3. distance metric: classify according to the distance between the feature embedding vector of each test set picture and the class center vector of each class. A minimal sketch of these three steps is given below.
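For illustration only, the following PyTorch-style sketch classifies the test (query) pictures of one episode with these three steps; the embedding network `embed`, the tensor shapes and the Euclidean metric are assumptions, not the exact implementation of the invention.

```python
import torch

def classify_episode(embed, support_x, support_y, query_x, n_way):
    """Embedding-and-metric sketch: embed, build class centers, classify by distance."""
    z_support = embed(support_x)                 # step 1: feature embedding (N*K, D)
    z_query = embed(query_x)                     # (Q, D)
    centers = torch.stack([                      # step 2: class-center representation
        z_support[support_y == c].mean(dim=0) for c in range(n_way)
    ])                                           # (N, D)
    dists = torch.cdist(z_query, centers)        # step 3: distance metric (Q, N)
    return dists.argmin(dim=1)                   # predicted class per test picture
```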
Traditional embedding-and-metric algorithms suffer from an ambiguity problem in feature embedding, which biases the class-center representation and thereby harms the classification task. This ambiguity arises mainly because the regions of interest of a conventional convolutional neural network structure are easily disturbed by the external environment. Therefore, the conventional convolutional neural network structure needs to be improved so that it can focus on local regions.
Disclosure of Invention
The invention aims to provide a small sample picture classification model and method based on a semantic auxiliary attention mechanism.
The invention solves the technical problem, and adopts the technical scheme that: the small sample picture classification model based on the semantic auxiliary attention mechanism comprises a convolutional neural network, an expansion model for zero sample picture classification, a spatial attention module and a semantic alignment module;
the convolutional neural network is used for extracting picture features and measuring the features in an embedding space;
the extended model for zero sample picture classification is used for mapping the semantic vector to an embedding space to obtain a class center vector and mapping the picture to the embedding space to obtain a feature vector of each picture;
the spatial attention module is used for generating two corresponding visual maps by applying average pooling and maximum pooling operations along the channel dimension, merging the two visual maps, and determining where to attend and where to suppress by a convolution operation, to obtain a visual attention map;
the semantic alignment module is used for acquiring class label semantic embedding vectors, calculating to obtain a semantic attention map according to the feature map of the input picture and the class label semantic embedding vectors of the corresponding class, and activating the visual attention map and the semantic attention map to obtain a refined attention map.
Further, the convolutional neural network includes four convolution modules, each of which specifically comprises: a convolutional layer with 64 3 x 3 filters, a batch normalization layer, a spatial attention layer and a nonlinear activation function layer.
Further, in the fourth convolution module, the semantic-assistance-based attention module is used when training on support set samples, while the spatial attention layer is retained for test set samples.
Further, there is a 2 x 2 max pooling layer between every two convolution modules.
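The backbone described in the preceding paragraphs can be sketched roughly as follows in PyTorch; the attention argument is a placeholder for the spatial attention layer (or the semantic-assistance-based module in the fourth block, both detailed later), and the 3-channel input is an assumption. This is an illustrative sketch, not the patent's exact implementation.

```python
import torch.nn as nn

class ConvModule(nn.Module):
    """One convolution module: Conv(64, 3x3) -> BN -> attention -> ReLU."""
    def __init__(self, in_ch, attention=None):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, 64, kernel_size=3, padding=1)
        self.bn = nn.BatchNorm2d(64)
        self.attn = attention if attention is not None else nn.Identity()
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.attn(self.bn(self.conv(x))))

def build_backbone():
    layers, in_ch = [], 3                      # RGB input assumed
    for i in range(4):
        layers.append(ConvModule(in_ch))
        in_ch = 64
        if i < 3:                              # 2x2 max pooling between modules
            layers.append(nn.MaxPool2d(2))
    return nn.Sequential(*layers)
```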
Further, the extended model for zero-sample picture classification comprises a class embedding network and a feature embedding network;
the class embedded network comprises a full connection layer, a batch normalization layer and a nonlinear activation layer, and is used for mapping semantic vectors to an embedded space to obtain class center vectors;
the feature embedding network selects a pre-trained GoogLeNet as a feature embedding model and is used for mapping the pictures to an embedding space to obtain a feature vector of each picture.
Further, the calculation formula corresponding to the average pooling operation is as follows:
F_avg(i, j) = (1/C) · Σ_{c=1}^{C} F(c, i, j)
and the calculation formula corresponding to the maximum pooling operation is as follows:
F_max(i, j) = max_{1≤c≤C} F(c, i, j)
wherein: H denotes the height of the input picture, W denotes the width of the input picture, C denotes the number of channels of the input feature map F, and the resulting maps F_avg, F_max ∈ R^{1×H×W} have depth, height and width of 1, H and W, respectively.
Further, the semantic alignment module obtains the class label semantic embedding vector as follows: the corresponding class label vector is obtained by querying a pre-learned semantic model, and the class label semantic embedding vector is obtained after passing it through a multi-layer perceptron (MLP).
Further, the semantic attention map is calculated from the feature map of the input picture and the class label semantic embedding vector of the corresponding class as follows: the feature map of the input picture is measured against the class label semantic embedding vector, and the semantic attention map is obtained through Softmax.
Further, the visual attention map and the semantic attention map are activated together by using a Sigmoid function, so that a refined attention map is obtained.
Further, the small sample picture classification model incurs a loss during training, calculated by the formula: loss = loss_c + λ·loss_w
wherein loss_c is the loss of the model's picture classification task, loss_w is the loss of the semantic alignment module's multi-layer perceptron (MLP), and λ is a hyperparameter controlling the relative weight of the losses.
Further, the loss of the picture classification task, loss_c, is calculated as follows:
loss_c = −(1/(N·Q)) Σ_{c=1}^{N} Σ_{q} log( exp(−d(q, p_c)) / Σ_{c′=1}^{N} exp(−d(q, p_{c′})) )
wherein N represents the number of classes, Q represents the number of test pictures, p_c represents the center vector of class c, q represents the embedding of each test picture, and d is the distance metric.
Further, the loss of the semantic alignment module's multi-layer perceptron (MLP), loss_w, is calculated as follows:
loss_w = Σ_{i=1}^{N−1} | α − f_s(f(F), w^+) + f_s(f(F), w_i^−) |_+
wherein w represents the class label embedding vector (w^+ for the positive class and w_i^− for each negative class), F represents the input map, M_w represents the matching map, the function f represents the convolutional network mapping, the function f_s represents the distance metric function, |x|_+ = max(x, 0), and α is a manually set boundary value.
In addition, the invention also provides a construction method of the small sample picture classification model based on the semantic assistant attention mechanism, which is applied to the small sample picture classification model based on the semantic assistant attention mechanism and comprises the following steps:
step 1, selecting a training data set;
step 2, constructing a network structure of the small sample picture classification model based on the semantic auxiliary attention mechanism;
step 3, preprocessing the training data, dividing it into a training set, a validation set and a test set, and subdividing each sub-dataset into episodes each comprising a support set and a test set;
step 4, training a small sample picture classification model;
and 5, verifying the small sample picture classification model.
Further, in step 2, the network structure is a network structure of a convolutional neural network.
The small sample picture classification model and method based on the semantic auxiliary attention mechanism have the advantage that a semantic-assistance-based attention mechanism is introduced into the small sample picture classification model to optimize the feature extraction process, thereby improving the model's performance on the small sample picture classification task. The mechanism combines the principles of the attention mechanism and multi-modal learning and is divided into two sub-modules, a spatial attention module and a semantic alignment module: the spatial attention module focuses on local information of the picture and extracts representative local feature vectors, while the semantic alignment module assists the spatial attention module and refines the local features by joining local feature regions and class label vectors. In this way, the invention can concentrate on local regions and achieve better classification of small sample pictures.
Drawings
FIG. 1 is a flowchart of the above method for classifying small sample pictures based on the semantic assistant attention mechanism according to the present invention;
FIG. 2 is a schematic diagram of a semantic assistance-based attention mechanism according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a loss calculation of the semantic alignment module according to the embodiment of the present invention;
FIG. 4 is a diagram of a network structure of a small sample image classification model according to an embodiment of the present invention;
FIG. 5 is a network structure diagram of a zero-sample picture classification model according to an embodiment of the present invention;
FIG. 6 is a CAM visualization, in accordance with an embodiment of the present invention.
In the figures, "poodle" denotes the category of the picture; W is the pre-learned semantic model; MLP is the multi-layer perceptron; D is the distance function; × is the multiplication operation; S is the activation function; Softmax is the softmax function; w'_poodle and the like denote the class label vectors of each category; M is the difference loss function; WS is the weighted sum; Conv is the convolutional layer; BN is the batch normalization layer; SAM is the spatial attention layer; ReLU is the nonlinear activation function layer; MaxPool is the max pooling layer; SAAM is the semantic-assistance-based attention layer; A denotes the attribute model, from which each class obtains an attribute vector a; W denotes the semantic model, from which each class obtains a class label vector w; FC is the fully connected layer; GoogLeNet is the classical deep convolutional model; and "House Finch", "Arctic Fox" and the like denote the picture categories.
Detailed Description
The technical solution of the present invention is described in detail below with reference to the accompanying drawings and embodiments.
The invention firstly provides a small sample picture classification model based on a semantic auxiliary attention mechanism, which comprises a convolutional neural network, an expansion model for zero sample picture classification, a spatial attention module and a semantic alignment module.
In the model, the convolutional neural network is used for extracting picture features and measuring the features in an embedding space; the extension model for zero sample picture classification is used for mapping the semantic vector into the embedding space to obtain the class center vector, and for mapping the pictures into the embedding space to obtain the feature vector of each picture; the spatial attention module is used for generating two corresponding visual maps by applying average pooling and maximum pooling operations along the channel dimension, merging the two visual maps, and determining where to attend and where to suppress by a convolution operation, to obtain a visual attention map; and the semantic alignment module is used for acquiring class label semantic embedding vectors, calculating a semantic attention map from the feature map of the input picture and the class label semantic embedding vector of the corresponding class, and activating the visual attention map together with the semantic attention map to obtain a refined attention map.
In the above model, the convolutional neural network preferably includes four convolutional modules, and the convolutional modules specifically are: a convolutional layer containing 64 3 x 3 filters, a batch normalization layer, a spatial attention layer, and a nonlinear activation function layer. In the nonlinear activation function layer, an attention module based on semantic assistance is used in the training of the support set samples, and a spatial attention layer is used in the testing of the test set samples. In addition, there may be a 2 x 2 max pooling layer between every two convolution modules.
In practical application, the convolutional neural network serves as the base network of the small sample picture classification model, extracting picture features and measuring the features in an embedding space. The network consists of 4 convolution modules; the first three each include a convolutional layer (Conv) with 64 3 x 3 filters, a batch normalization layer (BN), a spatial attention layer (SAM) and a nonlinear activation function layer (ReLU). In the fourth convolution module, for the support set samples, the spatial attention layer is replaced with the semantic-assistance-based attention module (SAAM); for the test set samples, the spatial attention layer remains. There is a 2 x 2 max pooling layer (MaxPool) between every two convolution modules. Using 4 convolution modules is a compromise on fitting: too few convolution modules cause under-fitting, while too many cause over-fitting. In each convolution module, an attention layer assists feature extraction. The difference between support set and test set samples is that the label of each class is visible when training with support set samples; therefore, in the fourth convolution module, the semantic-assistance-based attention module is used when training with support set samples, while the spatial attention layer is still used when testing with test set samples.
In addition, the extended model for zero-sample picture classification can comprise a class embedded network and a feature embedded network; the class embedded network comprises a full connection layer, a batch normalization layer and a nonlinear activation layer and is used for mapping the semantic vector to an embedded space to obtain a class center vector; the feature embedding network selects a pre-trained GoogLeNet as a feature embedding model for mapping the picture to an embedding space to obtain a feature vector of each picture.
It should be noted that the calculation formula corresponding to the average pooling operation is preferably:
F_avg(i, j) = (1/C) · Σ_{c=1}^{C} F(c, i, j)
and the calculation formula corresponding to the maximum pooling operation is preferably:
F_max(i, j) = max_{1≤c≤C} F(c, i, j)
wherein: H denotes the height of the input picture, W denotes the width of the input picture, C denotes the number of channels of the input feature map F, and the resulting maps F_avg, F_max ∈ R^{1×H×W} have depth, height and width of 1, H and W, respectively.
In addition, the semantic alignment module preferably obtains the class label semantic embedding vector as follows: the corresponding class label vector is obtained by querying the pre-learned semantic model, and the class label semantic embedding vector is obtained after passing it through a multi-layer perceptron (MLP). The semantic attention map is preferably calculated from the feature map of the input picture and the class label semantic embedding vector of the corresponding class as follows: the feature map of the input picture is measured against the class label semantic embedding vector, and the semantic attention map is obtained through Softmax.
In the invention, an attention mechanism based on semantic assistance can be defined, and the attention mechanism is composed of a space attention module and a semantic alignment module.
The spatial attention module aims to mine the internal spatial correlation among the features of the input feature map, helping the model determine which regions of the input feature map to attend to, highlighting key region features, suppressing useless features, and forming attention. Specifically, the spatial attention mechanism first applies average pooling and maximum pooling operations along the channel dimension to generate two corresponding visual maps; it then merges the two visual maps and uses a convolution operation to determine where to focus and where to suppress, resulting in an attention map.
The semantic alignment module aims to join local feature regions and class label vectors to refine the feature attention map generated under the spatial attention mechanism. In particular, by learning relationships between visual information and semantic information in multiple modalities, the model can be helped to better locate critical features. Semantic information here refers to a semantic vector for each class label obtained from a pre-learned semantic model. The corresponding class label vector is obtained by inquiring on a GloVe semantic model, and the class label semantic embedded vector is obtained after the vector passes through a multi-layer perceptron MLP. The module measures the feature map of the input picture and the class label semantic embedded vector of the corresponding class, obtains a semantic attention map through Softmax, and activates the semantic attention map and the visual attention map obtained by the spatial attention module by using a Sigmoid function to obtain a refined attention map.
In general, the small sample picture classification model incurs a loss during training, calculated by the formula loss = loss_c + λ·loss_w, wherein loss_c is the loss of the model's picture classification task, loss_w is the loss of the semantic alignment module's multi-layer perceptron (MLP), and λ is a hyperparameter controlling the relative weight of the losses.
In particular, the loss of the picture classification task, loss_c, is calculated as follows:
loss_c = −(1/(N·Q)) Σ_{c=1}^{N} Σ_{q} log( exp(−d(q, p_c)) / Σ_{c′=1}^{N} exp(−d(q, p_{c′})) )
wherein N represents the number of classes, Q represents the number of test pictures, p_c represents the center vector of class c, q represents the embedding of each test picture, and d is the distance metric.
The loss of the semantic alignment module's multi-layer perceptron (MLP), loss_w, is calculated as follows:
loss_w = Σ_{i=1}^{N−1} | α − f_s(f(F), w^+) + f_s(f(F), w_i^−) |_+
wherein w represents the class label embedding vector (w^+ for the positive class and w_i^− for each negative class), F represents the input map, M_w represents the matching map, the function f represents the convolutional network mapping, the function f_s represents the distance metric function, |x|_+ = max(x, 0), and α is a manually set boundary value.
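For illustration, the classification loss and the weighted combination can be sketched as follows; the toy tensor shapes, the Euclidean distance d, and λ = 0.1 are assumptions, and loss_w stands in for the output of the semantic alignment module.

```python
import torch
import torch.nn.functional as F

def loss_c(query_z, centers, labels):
    """Cross-entropy over the softmax of negative distances to the N class centers."""
    logits = -torch.cdist(query_z, centers)      # (Q, N)
    return F.cross_entropy(logits, labels)

# Toy usage: N=5 classes, Q=10 test pictures, D=1600-dimensional embeddings.
qz, centers = torch.randn(10, 1600), torch.randn(5, 1600)
labels = torch.randint(0, 5, (10,))
loss_w = torch.tensor(0.0)        # placeholder for the semantic alignment loss
lam = 0.1                         # assumed value of the hyperparameter lambda
loss = loss_c(qz, centers, labels) + lam * loss_w
```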
Thus, the small sample picture classification model provided by the invention focuses on exploring the potential of feature extraction. The semantic-assistance-based attention mechanism optimizes feature extraction well: it searches the input feature map for refined local features, making the class center vector more representative and thereby improving classification performance. Specifically, the mechanism consists of two sub-modules, a spatial attention module and a semantic alignment module. The spatial attention module learns a mask that highlights critical areas in image space and suppresses irrelevant areas, so that the network model focuses on certain regions of the image and forms attention. The semantic alignment module draws on the idea of multi-modal learning: it introduces pre-learned class label semantic vectors, embeds the visual semantics through an alignment mechanism, and refines the attention map of the spatial attention module.
Therefore, referring to fig. 1, which is a flowchart of a method for constructing a small sample image classification model based on a semantic assistant attention mechanism, the steps of solving a small sample image classification task by using the small sample image classification model of the present invention are as follows:
step 1, selecting a training data set;
step 2, constructing a network structure of a small sample picture classification model based on a semantic auxiliary attention mechanism;
step 3, preprocessing the training data, dividing it into a training set, a validation set and a test set, and subdividing each sub-dataset into episodes each comprising a support set and a test set;
step 4, training a small sample picture classification model;
and 5, verifying the small sample picture classification model.
In step 2 of the above method, the network structure is preferably a network structure of a convolutional neural network.
Here, the small sample picture classification model of the present invention is extended to solve the zero sample picture classification task. The zero sample picture classification model is composed of a class embedding network and a feature embedding network. The class embedding network learns a linear network to construct a 1024-dimensional embedding space; each linear module consists of a fully connected layer, a batch normalization layer and a nonlinear activation layer. The default class attribute vector provided by the data set is combined with the class label vector obtained from GloVe to form a merged semantic vector for learning. The feature embedding network selects a pre-trained GoogLeNet as the main feature embedding model, with the last linear layer of GoogLeNet replaced by a new linear layer so that the dimension of the output features matches that of the embedding space.
Examples
The embodiment of the invention takes the small sample picture classification model based on the semantic auxiliary attention mechanism as an example for detailed description. The overall structure of the module is shown in fig. 2. The module consists of two branches: a spatial attention mechanism and a semantic alignment mechanism. Suppose the input feature map is F ∈ R^{C×H×W}. The module aims to generate a mask for it, i.e. a semantic-assistance-based attention map M ∈ R^{1×H×W}. The output feature map F′ ∈ R^{C×H×W} is then calculated according to the following formula:
F′ = M ⊗ F
where ⊗ is a bitwise multiplication operation that applies the attention values to the input feature map.
The spatial attention mechanism aims to mine the internal spatial correlation among the features of the input feature map. In other words, it helps the model determine which regions of the input feature map to focus on, highlighting key region features, suppressing unwanted features, and forming attention.
Specifically, the spatial attention mechanism first applies average pooling and maximum pooling operations along the channel dimension to generate the two corresponding visual maps: F_avg ∈ R^{1×H×W} from the average pooling operation and F_max ∈ R^{1×H×W} from the maximum pooling operation. After this, it merges the two visual maps and uses a convolution operation to determine where to focus and where to suppress, resulting in the attention map M_a. The overall calculation formula is as follows:
M_a = Conv([f_avg(F); f_max(F)]) = Conv([F_avg; F_max])
in the formula, f_avg represents the average pooling operation, f_max represents the maximum pooling operation, and Conv is a convolution operation with a 7 × 7 convolution kernel.
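Under the assumption that the layer follows the formulas above exactly (channel-wise pooling, concatenation, 7 × 7 convolution), a minimal PyTorch sketch of the spatial attention layer is:

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Sketch of SAM: channel-wise average/max pooling, concat, 7x7 conv -> M_a."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, f):                          # f: (B, C, H, W)
        f_avg = f.mean(dim=1, keepdim=True)        # F_avg: (B, 1, H, W)
        f_max = f.amax(dim=1, keepdim=True)        # F_max: (B, 1, H, W)
        return self.conv(torch.cat([f_avg, f_max], dim=1))   # M_a: (B, 1, H, W)

# For test set samples the map can be applied directly, e.g.
#   f_out = torch.sigmoid(sam(f)) * f
# For support set samples, M_a is instead fused with the semantic map M_w
# (see the refined-map formula later in this embodiment).
```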
The semantic alignment mechanism is used to join local feature regions and class label vectors to refine the feature attention map generated under the spatial attention mechanism. It should be noted that it is only applied to the samples of the support set, because the pictures of the test set do not touch the class labels during classification. In particular, by learning relationships between visual information and semantic information in multiple modalities, the model can be helped to better locate critical features. Semantic information here refers to a semantic vector for each class label obtained from a pre-learned semantic model.
The pre-learned semantic model used in this embodiment is GloVe (Global Vectors for Word Representation), a word representation tool based on global word-frequency statistics that converts a text word into a real-valued vector. These vectors capture semantic relations between words, such as similarity and analogy. By computing the similarity between two such vectors, the semantic similarity between the corresponding words can be obtained.
GloVe provides semantic models of different dimensions; this embodiment selects the one of length 100. As shown in FIG. 2, taking the Poodle class as an example, the class label "Poodle" is queried in the GloVe semantic model to obtain the corresponding real-valued vector w_poodle, which, after passing through a multi-layer perceptron (MLP), yields the semantic embedding vector w'_poodle.
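A minimal sketch of this lookup-then-MLP step is given below; the GloVe file name `glove.6B.100d.txt` and the MLP layer sizes are assumptions for illustration.

```python
import numpy as np
import torch
import torch.nn as nn

def load_glove(path="glove.6B.100d.txt"):        # assumed file of 100-d GloVe vectors
    vectors = {}
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            word, *vals = line.rstrip().split(" ")
            vectors[word] = np.asarray(vals, dtype=np.float32)
    return vectors

glove = load_glove()
w_poodle = torch.from_numpy(glove["poodle"])     # class label vector w_poodle
mlp = nn.Sequential(nn.Linear(100, 128), nn.ReLU(), nn.Linear(128, 64))
w_embed = mlp(w_poodle)                          # semantic embedding vector w'_poodle
```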
It is known that each episode contains pictures from N classes. So, for each sample of the support set, there is one positive label embedding vector w^+ and N−1 negative label embedding vectors {w_i^−}, i = 1, …, N−1. The positive label embedding vector is used to generate the visual-semantic matching map M_w, where M_{w,i} denotes the degree of correlation between w^+ and each region F_i of the input feature map. The calculation formula is as follows:
M_{w,i} = f_s(w^+, F_i)
wherein f_s is a similarity function that measures the similarity of two embedding vectors.
Thus, the refined feature map can be obtained according to the following formula: the two attention maps are multiplied bitwise and activated with a Sigmoid function:
M = Sigmoid(M_a ⊗ M_w)
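The matching map and the refined attention can be sketched as follows; cosine similarity stands in for the unspecified similarity function f_s, and the embedding dimensions are assumed to match the feature-map channels.

```python
import torch
import torch.nn.functional as F

def matching_map(feature_map, w_pos):
    """M_w: similarity between w+ and every region F_i, normalized by Softmax."""
    b, c, h, w = feature_map.shape
    regions = feature_map.flatten(2).transpose(1, 2)                  # (B, H*W, C)
    sim = F.cosine_similarity(regions, w_pos[None, None, :], dim=-1)  # (B, H*W)
    return sim.softmax(dim=-1).view(b, 1, h, w)                       # (B, 1, H, W)

def refined_features(feature_map, m_a, w_pos):
    """F' = Sigmoid(M_a (x) M_w) (x) F: fuse the two maps, apply bitwise."""
    m = torch.sigmoid(m_a * matching_map(feature_map, w_pos))
    return m * feature_map
```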
to join local feature regions and class label vectors, we introduce a loss herew
FIG. 3 shows a schematic diagram of the loss calculation in a 5-way scenario. Here, since the support set sample currently being processed is a picture of the "Poodle" class, the positive class is the "Poodle" class and the other 4 classes are negative classes. Specifically, the loss is calculated according to the following formula:
loss_w = Σ_{i=1}^{N−1} | α − f_s(f(F), w^+) + f_s(f(F), w_i^−) |_+
where α is a manually set hyperparameter threshold, and |x|_+ is the simple operation:
|x|_+ = max(x, 0)
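A sketch of this margin loss, with cosine similarity again standing in for f_s and α = 0.5 chosen arbitrarily:

```python
import torch
import torch.nn.functional as F

def loss_w(f_embed, w_pos, w_negs, alpha=0.5):
    """Hinge loss over the N-1 negative label embeddings: sum over i of
    |alpha - f_s(f(F), w+) + f_s(f(F), w_i-)|_+ with |x|_+ = max(x, 0)."""
    s_pos = F.cosine_similarity(f_embed, w_pos, dim=0)                 # scalar
    s_negs = F.cosine_similarity(f_embed.unsqueeze(0), w_negs, dim=1)  # (N-1,)
    return torch.clamp(alpha - s_pos + s_negs, min=0.0).sum()
```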
the network structure after adding the attention module based on semantic assistance is shown in fig. 4.
Likewise, the network structure consists of 4 convolution modules. The first three convolution modules include a convolution layer containing 64 3 x 3 filters, a batch normalization layer, a spatial attention layer and a nonlinear activation function layer. In the fourth convolution module, the spatial attention layer is replaced with a semantic assistance-based attention module for the support set samples, and the spatial attention layer remains for the test set samples. There is a 2 x 2 max pooling layer between every two convolution modules.
The network generates 1600-dimensional embedded feature vectors for all support set samples and test set pictures within the episode. These vectors are fed into the metric learning module, which follows the class-center representation and distance metric process: the embedded feature vectors of the support set samples are used to generate the class center representations, and the test set pictures are classified by measuring the distance between their embedded feature vectors and each class center representation.
For zero sample classification, this work extends the small sample classification method above and introduces additional semantic information, namely the class label vector, to optimize the class-center embedding process. Specifically, the class attribute vector and the class label vector are combined into a merged semantic vector. The class embedding network learns a mapping from the merged semantic vector into the embedding space, forming the class center vector. The network structure follows that of most embedding-and-metric algorithms for the zero sample classification problem, as shown in fig. 5:
class embedding networks learn a linear network to construct a 1024-dimensional embedding space. Each linear module consists of a full connection layer, a batch normalization layer and a nonlinear activation layer. The default class attribute vector provided by the data set and the class label vector obtained from the GloVe are selected to be combined to form a combined semantic vector for learning.
The feature embedding network selects a pre-trained GoogLeNet as the main feature embedding model. GoogLeNet won the ImageNet challenge in 2014. Before it appeared, most neural network structures obtained better training results by increasing network depth; GoogLeNet innovatively proposed the Inception structure and optimized from another angle, using computing resources more efficiently to extract more features under the same computational budget and thereby improving the training effect. To make the feature dimension extracted by the feature embedding network consistent with that of the embedding space, the last linear layer of GoogLeNet is replaced with a new linear layer whose output dimension equals that of the embedding space, namely 1024.
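A minimal sketch of that replacement using torchvision (the weights enum requires torchvision ≥ 0.13; the 1024-d output follows the text above):

```python
import torch.nn as nn
from torchvision import models

# Load GoogLeNet pre-trained on ImageNet and swap its last linear layer so the
# output feature dimension equals the 1024-d embedding space.
googlenet = models.googlenet(weights=models.GoogLeNet_Weights.IMAGENET1K_V1)
googlenet.fc = nn.Linear(googlenet.fc.in_features, 1024)
```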
Verification produced the CAM visualization results shown in fig. 6.

Claims (14)

1. The small sample picture classification model based on the semantic auxiliary attention mechanism is characterized by comprising a convolutional neural network, an expansion model for zero sample picture classification, a spatial attention module and a semantic alignment module;
the convolutional neural network is used for extracting picture features and measuring the features in an embedding space;
the extended model for zero sample picture classification is used for mapping the semantic vector to an embedding space to obtain a class center vector and mapping the picture to the embedding space to obtain a feature vector of each picture;
the spatial attention module is used for generating two corresponding visual maps by applying average pooling and maximum pooling operations along the channel dimension, merging the two visual maps, and determining where to attend and where to suppress by a convolution operation, to obtain a visual attention map;
the semantic alignment module is used for acquiring class label semantic embedding vectors, calculating to obtain a semantic attention map according to the feature map of the input picture and the class label semantic embedding vectors of the corresponding class, and activating the visual attention map and the semantic attention map to obtain a refined attention map.
2. The small sample picture classification model based on the semantic assistant attention mechanism as claimed in claim 1, wherein the convolutional neural network comprises four convolution modules, each of which specifically comprises: a convolutional layer with 64 3 x 3 filters, a batch normalization layer, a spatial attention layer and a nonlinear activation function layer.
3. The small sample picture classification model based on the semantic assistant attention mechanism as claimed in claim 2, wherein, in the fourth convolution module, the semantic-assistance-based attention module is used when training on support set samples, while the spatial attention layer is used for test set samples.
4. The small sample picture classification model based on the semantic assistant attention mechanism as claimed in claim 2 or 3, characterized in that there is a maximum pooling layer of 2 x 2 between every two convolution modules.
5. The small sample picture classification model based on the semantic assistant attention mechanism as claimed in claim 1, wherein the extended model for zero sample picture classification comprises a class embedding network and a feature embedding network;
the class embedded network comprises a full connection layer, a batch normalization layer and a nonlinear activation layer, and is used for mapping semantic vectors to an embedded space to obtain class center vectors;
the feature embedding network selects a pre-trained GoogLeNet as a feature embedding model and is used for mapping the pictures to an embedding space to obtain a feature vector of each picture.
6. The small sample picture classification model based on the semantic assistant attention mechanism as claimed in claim 1, wherein the average pooling operation corresponds to the calculation formula:
F_avg(i, j) = (1/C) · Σ_{c=1}^{C} F(c, i, j)
and the maximum pooling operation corresponds to the calculation formula:
F_max(i, j) = max_{1≤c≤C} F(c, i, j)
wherein: H denotes the height of the input picture, W denotes the width of the input picture, C denotes the number of channels of the input feature map F, and the resulting maps F_avg, F_max ∈ R^{1×H×W} have depth, height and width of 1, H and W, respectively.
7. The small sample picture classification model based on the semantic assistant attention mechanism as claimed in claim 1, wherein the semantic alignment module obtains the class label semantic embedding vector by the following method: the corresponding class label vector is obtained by querying a pre-learned semantic model, and the class label semantic embedding vector is obtained after passing it through a multi-layer perceptron (MLP).
8. The small sample picture classification model based on the semantic aided attention mechanism as claimed in claim 1, wherein the semantic attention map is calculated from the feature map of the input picture and the class label semantic embedding vector of the corresponding class by the following method: the feature map of the input picture is measured against the class label semantic embedding vector of the corresponding class, and the semantic attention map is obtained through Softmax.
9. The small sample picture classification model based on the semantic aided attention mechanism as claimed in claim 1, 7 or 8, wherein the visual attention map and the semantic attention map are activated together by a Sigmoid function to obtain a refined attention map.
10. The small sample picture classification model based on the semantic assistant attention mechanism as claimed in claim 1, wherein the small sample picture classification model incurs a loss during training, calculated by the formula: loss = loss_c + λ·loss_w
wherein loss_c is the loss of the model's picture classification task, loss_w is the loss of the semantic alignment module's multi-layer perceptron (MLP), and λ is a hyperparameter controlling the relative weight of the losses.
11. The small sample picture classification model based on the semantic aided attention mechanism as claimed in claim 10, wherein the loss of the picture classification task, loss_c, is calculated as follows:
loss_c = −(1/(N·Q)) Σ_{c=1}^{N} Σ_{q} log( exp(−d(q, p_c)) / Σ_{c′=1}^{N} exp(−d(q, p_{c′})) )
wherein N represents the number of classes, Q represents the number of test pictures, p_c represents the center vector of class c, q represents the embedding of each test picture, and d is the distance metric.
12. The small sample picture classification model based on the semantic assistant attention mechanism as claimed in claim 10 or 11, wherein the loss of the semantic alignment module's multi-layer perceptron (MLP), loss_w, is calculated as follows:
loss_w = Σ_{i=1}^{N−1} | α − f_s(f(F), w^+) + f_s(f(F), w_i^−) |_+
wherein w represents the class label embedding vector (w^+ for the positive class and w_i^− for each negative class), F represents the input map, M_w represents the matching map, the function f represents the convolutional network mapping, the function f_s represents the distance metric function, |x|_+ = max(x, 0), and α is a manually set boundary value.
13. A construction method of the small sample picture classification model based on the semantic assistant attention mechanism, applied to the small sample picture classification model based on the semantic assistant attention mechanism according to any one of claims 1 to 12, characterized by comprising the following steps:
step 1, selecting a training data set;
step 2, constructing a network structure of the small sample picture classification model based on the semantic auxiliary attention mechanism;
step 3, preprocessing the training data, dividing it into a training set, a validation set and a test set, and subdividing each sub-dataset into episodes each comprising a support set and a test set;
step 4, training a small sample picture classification model;
and 5, verifying the small sample picture classification model.
14. The method for constructing the small sample picture classification model based on the semantic assistant attention mechanism as claimed in claim 13, wherein in the step 2, the network structure is a network structure of a convolutional neural network.
CN202010732273.0A 2020-07-27 2020-07-27 Small sample picture classification model and method based on semantic auxiliary attention mechanism Pending CN111985538A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010732273.0A CN111985538A (en) 2020-07-27 2020-07-27 Small sample picture classification model and method based on semantic auxiliary attention mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010732273.0A CN111985538A (en) 2020-07-27 2020-07-27 Small sample picture classification model and method based on semantic auxiliary attention mechanism

Publications (1)

Publication Number Publication Date
CN111985538A true CN111985538A (en) 2020-11-24

Family

ID=73444272

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010732273.0A Pending CN111985538A (en) 2020-07-27 2020-07-27 Small sample picture classification model and method based on semantic auxiliary attention mechanism

Country Status (1)

Country Link
CN (1) CN111985538A (en)

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112990282A (en) * 2021-03-03 2021-06-18 华南理工大学 Method and device for classifying fine-grained small sample images
CN112990282B (en) * 2021-03-03 2023-07-18 华南理工大学 Classification method and device for fine-granularity small sample images
CN113221951A (en) * 2021-04-13 2021-08-06 天津大学 Time domain attention pooling network-based dynamic graph classification method and device
CN113343974A (en) * 2021-07-06 2021-09-03 国网天津市电力公司 Multi-modal fusion classification optimization method considering inter-modal semantic distance measurement
CN113435531B (en) * 2021-07-07 2022-06-21 中国人民解放军国防科技大学 Zero sample image classification method and system, electronic equipment and storage medium
CN113298096B (en) * 2021-07-07 2021-10-01 中国人民解放军国防科技大学 Method, system, electronic device and storage medium for training zero sample classification model
CN113435531A (en) * 2021-07-07 2021-09-24 中国人民解放军国防科技大学 Zero sample image classification method and system, electronic equipment and storage medium
CN113298096A (en) * 2021-07-07 2021-08-24 中国人民解放军国防科技大学 Method, system, electronic device and storage medium for training zero sample classification model
CN113610164A (en) * 2021-08-10 2021-11-05 北京邮电大学 Fine-grained image recognition method and system based on attention balance
CN113610164B (en) * 2021-08-10 2023-12-22 北京邮电大学 Fine granularity image recognition method and system based on attention balance
CN113869418A (en) * 2021-09-29 2021-12-31 哈尔滨工程大学 Small sample ship target identification method based on global attention relationship network
CN113989405A (en) * 2021-12-27 2022-01-28 浙江大学 Image generation method based on small sample continuous learning
CN116503674A (en) * 2023-06-27 2023-07-28 中国科学技术大学 Small sample image classification method, device and medium based on semantic guidance
CN116503674B (en) * 2023-06-27 2023-10-20 中国科学技术大学 Small sample image classification method, device and medium based on semantic guidance

Similar Documents

Publication Publication Date Title
CN111985538A (en) Small sample picture classification model and method based on semantic auxiliary attention mechanism
Huang et al. Instance-aware image and sentence matching with selective multimodal lstm
CN111476294B (en) Zero sample image identification method and system based on generation countermeasure network
CN111259786B (en) Pedestrian re-identification method based on synchronous enhancement of appearance and motion information of video
Tang et al. RGBT salient object detection: Benchmark and a novel cooperative ranking approach
CN114926746B (en) SAR image change detection method based on multiscale differential feature attention mechanism
CN114067160A (en) Small sample remote sensing image scene classification method based on embedded smooth graph neural network
Gao et al. Multi‐dimensional data modelling of video image action recognition and motion capture in deep learning framework
CN108765383B (en) Video description method based on deep migration learning
Jiang et al. Hyperspectral image classification with spatial consistence using fully convolutional spatial propagation network
Lei et al. Ultralightweight spatial–spectral feature cooperation network for change detection in remote sensing images
Liang et al. Comparison detector for cervical cell/clumps detection in the limited data scenario
CN112651940B (en) Collaborative visual saliency detection method based on dual-encoder generation type countermeasure network
JP2008123486A (en) Method, system and program for detecting one or plurality of concepts by digital media
Li et al. Transfer learning for toxoplasma gondii recognition
Zhang et al. Quantifying the knowledge in a DNN to explain knowledge distillation for classification
Zhao et al. PCA dimensionality reduction method for image classification
Wang et al. Deep multi-person kinship matching and recognition for family photos
Zheng et al. Fine-grained image recognition via weakly supervised click data guided bilinear CNN model
Fan et al. A hierarchical Dirichlet process mixture of generalized Dirichlet distributions for feature selection
Xiong et al. An interpretable fusion siamese network for multi-modality remote sensing ship image retrieval
Xu et al. Graphical modeling for multi-source domain adaptation
Li et al. Enhanced bird detection from low-resolution aerial image using deep neural networks
Lin et al. The design of error-correcting output codes based deep forest for the micro-expression recognition
Akbar et al. Face recognition using hybrid feature space in conjunction with support vector machine

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination