CN112784921A

CN112784921A - Task attention guided small sample image complementary learning classification algorithm

Info

Publication number: CN112784921A
Application number: CN202110150081.3A
Authority: CN
Inventors: 程塨; 李瑞敏; 郎春博; 韩军伟; 郭雷
Original assignee: Northwestern Polytechnical University
Current assignee: Northwestern Polytechnical University
Priority date: 2021-02-02
Filing date: 2021-02-02
Publication date: 2021-05-11

Abstract

The invention provides a task attention-guided small sample image complementary learning classification algorithm. First, a dual-branch multi-part complementary feature learning module is designed that fuses the discriminative features of several salient parts, so that the network explores and uses the entire spatial region of the feature map and obtains more discriminative information. Then, a task-related attention guidance module is introduced: by strengthening or suppressing part of the knowledge provided by the meta-learner and finding representative features related to the current task, it gives the neural network the ability to identify the most important features of the current input category. Combining the multi-part complementary feature learning module with the task-related attention module allows the complementary features most relevant to the current input category to be mined in depth and improves the discriminative ability of the network, so that high classification accuracy and good robustness are achieved with only a small number of training samples.

Description

Task attention guided small sample image complementary learning classification algorithm

Technical Field

The invention belongs to the technical field of image processing, and particularly relates to a small sample image complementary learning classification algorithm guided by task attention, which can realize rapid classification of new class images under the condition of small samples.

Background

Deep learning has achieved significant results in recent years in many data-intensive applications, such as object detection, image classification, and semantic segmentation. However, the performance of the deep learning technique depends heavily on the size of the labeled data volume, and lacks learning ability and generalization ability in a low data state. In real life, a certain difficulty exists in the collection of a large amount of labeled data, and the further development of deep learning is greatly limited. On the one hand, in certain fields, such as the military, it is difficult to obtain a large number of samples due to various limitations. On the other hand, mass data labeling requires a large amount of manpower and material resources. Particularly in some professional fields, the data labeling work needs experts in the industry, and great difficulty is brought to the labeling work of a large amount of data. The small sample learning utilizes prior knowledge, and has higher classification accuracy in the face of new classes with only a small amount of labeled data.

Existing small sample image classification methods can be broadly divided into four categories: model-based, metric-based, optimization-based, and data-augmentation-based algorithms. Model-based methods design the model structure so that parameters can be updated quickly from a small number of samples and a mapping function between input and prediction is established directly; however, the traditional gradient descent algorithm involves many parameters and cannot be optimized quickly. Metric-based methods mainly learn a mapping from images to an embedding space and make that space discriminative; however, because the mapping is task-independent, it is difficult to generalize quickly to new classes with limited training data. Optimization-based methods aim to obtain a better model initialization or gradient descent direction, so that the model still generalizes well to new classes with limited samples; however, they are easily trapped in local optima because of the limited amount of data. Data-augmentation-based methods generate synthetic data from a small number of labeled samples, but unreasonable generated data easily introduces noise into the network.

In addition, most small sample image classification methods are based on shallow feature extraction networks, and the performance gap between small sample classification algorithms on a data set shrinks markedly when the backbone network is deep. Specifically, when the feature extraction network is shallow, intra-class variation has a large influence on algorithm performance, but with a deeper backbone network this influence is significantly reduced. Using a deep backbone network to solve the small sample image classification problem is therefore a future trend. However, deep networks are prone to overfitting, so the feature expression capability of the network must be balanced against the overfitting caused by network depth. First, deep networks tend to recognize objects from local regions of the most discriminative parts rather than from the whole object, which yields incomplete feature representations. Furthermore, in small sample image classification algorithms, meta-learning is a learning problem over a set of tasks, and the meta-learner is typically shared among all tasks. To classify new classes correctly under different tasks, one basic learner must be learned for each task. It is therefore a challenge to make the basic learner more specialized, so that it responds to different inputs in a task-dependent manner.

Disclosure of Invention

To overcome the shortcomings of the prior art, the invention provides a task attention-guided small sample image complementary learning classification algorithm. First, a dual-branch multi-part complementary feature learning module is designed that fuses the discriminative features of several salient parts, so that the network explores and uses the entire spatial region of the feature map and obtains more discriminative information. Then, a task-related attention guidance module is introduced: by strengthening or suppressing part of the knowledge provided by the meta-learner and finding representative features related to the current task, it gives the neural network the ability to identify the most important features of the current input category. The invention addresses the overfitting caused by a deep backbone network in two ways. On the one hand, a GAP (global average pooling) layer is used in the backbone network in place of the FC (fully connected) layers of the VGG network, and the task-related attention guidance module captures task-related feature representations, which greatly reduces the number of network parameters and helps avoid overfitting in the small sample setting. On the other hand, the "erase" operation in the multi-part complementary feature learning module is a "Dropout" strategy with saliency learning capability: it generates a mask according to a given threshold and thereby deactivates some neurons of the feature map extracted by the backbone network, which improves the generalization of the network. The invention can quickly learn new categories from a small number of labeled samples and has high classification accuracy and good generalization.

A task attention-guided small sample image complementary learning classification algorithm is characterized by comprising the following steps:

step 1, data preprocessing: dividing an image data set C into two subsets, a base class C_base and a new class C_novel, wherein the images in C_base are training images with class labels, each category in the new class C_novel has only k labeled images, and k takes values in [1, 20]; preprocessing the images in the base class C_base to obtain preprocessed base class images; randomly extracting several groups of images from the new class C_novel to simulate small sample conditions, wherein each group of images is a task, each task comprises n categories, each category comprises k labeled images and m unlabeled images, the labeled images are denoted support images and the unlabeled images are denoted query images, and the support images and the query images are preprocessed respectively to obtain preprocessed images; the preprocessing operation is normalization using a mean and a standard deviation; n takes values in [1, 5], k takes values in [1, 20], and m is 15;

step 2, constructing a meta-learner network: the meta-learner network consists of a backbone network f_θ and a head network f_φ; the backbone network f_θ is the first w convolutional layers of the VGG network, where w is 5; the head network f_φ comprises several convolution operations with different kernel sizes, where the first p layers have a kernel size of 3 × 3 and the last q layers have a kernel size of 1 × 1, with p = 2 and q = 1;

training the meta-learner network with all the preprocessed images of the base class C_base to obtain a pre-trained meta-learner network, wherein the loss function of the network is the cross-entropy loss;

step 3, constructing a basic learner network: on the basis of the pre-trained meta-learner network, modifying the head network f_φ to obtain a basic learner network, wherein the modified head network f_φ mainly comprises a multi-part complementary feature learning module and a task-related attention module;

the multi-part complementary feature learning module consists of a branch A and a branch B connected in sequence, specifically: the output feature F_m of the backbone network f_θ is input to branch A and passed through two convolutional layers with a kernel size of 3 × 3 and one convolutional layer with a kernel size of 1 × 1 to obtain a feature representation F_ha with n channels; the feature dimension of F_ha with the largest response is the activation map of the target category, and a thresholding operation is performed on the obtained activation map to obtain the feature mask mask_A corresponding to the most significant part of the object, the threshold parameter of the thresholding operation being a predefined parameter with values in [0.5, 0.9]; then, the values in F_m corresponding to mask_A are set to zero to obtain a feature map F'_m that does not contain mask_A, and F'_m is input to branch B, which outputs the feature F_hb; branch B comprises two convolutional layers with a kernel size of 3 × 3 and one convolutional layer with a kernel size of 1 × 1;

the task-related attention module is implemented as follows: first, the output feature F_m of the backbone network f_θ is compressed by global average pooling to obtain the global representation feature s = [s₁, s₂, ..., s_C] of the C channels, where s_i denotes the average of the i-th channel feature, i = 1, 2, …, C, and C denotes the number of channels of the feature F_m; then, the feature s is transformed by two fully connected layers connected in series to obtain the channel weights u_a = W₂(W₁(s)), where W₁ is the parameter of the first fully connected layer, W₂ is the parameter of the second fully connected layer, and r is a predefined parameter of these layers; a ReLU activation function is added after the first fully connected layer, and the number of output channels of the second fully connected layer equals the number of categories n; a sigmoid function is applied to the channel weights, u'_a = σ(u_a), to obtain the normalized weights u'_a, where σ(·) denotes the sigmoid function; meanwhile, the feature map F'_m obtained in the multi-part complementary feature learning module is processed in the same way to obtain the normalized channel weights u'_b corresponding to F'_m; finally, the weights u'_a and u'_b are multiplied with the feature maps F_ha and F_hb obtained in the multi-part complementary feature learning module, respectively, to obtain the classification feature maps F'_ha and F'_hb of branch A and branch B:

F'_ha = u'_a ⊙ F_ha (1)

F'_hb = u'_b ⊙ F_hb (2)

step 4, training the basic learner network: first, each preprocessed support image is input to the basic learner network to obtain the classification feature map F'_ha of branch A and the classification feature map F'_hb of branch B; then, F'_ha and F'_hb are each input to a GAP layer followed by a softmax layer to obtain the prediction probabilities of the n categories for branch A and for branch B; the classification loss of the basic learner network is calculated from the prediction probabilities of the two branches, and the basic learner network is updated by gradient descent, wherein the overall classification loss function Loss of the network is:

Loss = Loss_A + λ·Loss_B (3)

Loss_A = L(f_α(F_m), y_i) (4)

Loss_B = L(f_β(F_m ⊙ mask_A), y_i) (5)

where Loss_A denotes the classification loss of branch A, Loss_B denotes the classification loss of branch B, and λ denotes the weight of branch B, with values in [0.1, 1]; L(·) denotes the cross-entropy loss; f_α(·) and f_β(·) denote feature extraction operations: f_α(·) comprises the two convolutional layers with a kernel size of 3 × 3 and the one convolutional layer with a kernel size of 1 × 1 of branch A together with the task-related attention module of step 3, and f_β(·) comprises the two convolutional layers with a kernel size of 3 × 3 and the one convolutional layer with a kernel size of 1 × 1 of branch B together with the task-related attention module of step 3; ⊙ denotes channel-by-channel multiplication; y_i denotes the label of the i-th input image, i = 1, 2, …, k;

step 5, verifying the classification effect: first, each preprocessed query image is input to the basic learner network trained in step 4 to obtain the classification feature maps F'_ha and F'_hb; then, the features F'_ha and F'_hb are fused, the fused feature F_h is input to a GAP layer, the prediction probabilities of the n categories of the query image are obtained through a softmax layer, and the category with the largest prediction probability is taken as the classification result; fusion means comparing the values of F'_ha and F'_hb at each position and taking the larger value as the value of the fused feature F_h at that position.

The beneficial effects of the invention are as follows. The prior knowledge learned by the meta-learner network on the base classes is retained, so that the basic learner can learn quickly by using this prior knowledge. In the basic learner network, the multi-part complementary feature learning module uses a dual-branch network to extract the discriminative features of complementary parts of the target, so that the network achieves better classification performance. The task-related attention module gives the network the ability to identify the most important features of the current input category. Combining the multi-part complementary feature learning module with the task-related attention module allows the complementary features most relevant to the current input category to be mined in depth and improves the discriminative ability of the network. The invention achieves high classification accuracy and good robustness with only a small number of training samples.

Drawings

FIG. 1 is a basic flow chart of the task attention-directed small sample image complementary learning classification algorithm of the present invention;

FIG. 2 is a basic framework diagram of the meta-learner network of the present invention;

FIG. 3 is a basic framework diagram of the basic learner network during a training phase in accordance with the present invention;

FIG. 4 is a basic framework diagram of the basic learner network during a verification phase of the present invention;

FIG. 5 is an example of a database image used by an embodiment of the present invention;

FIG. 6 is a visualization result image of classification performed by the method of the present invention.

Detailed Description

The present invention will be further described with reference to the following drawings and examples, which include, but are not limited to, the following examples.

The hardware environment of this embodiment is a computer with an Intel(R) Core(TM) i3-8100 CPU and 8.0 GB of memory; the software environment is Ubuntu 16.04.5 LTS and PyCharm 2017. The public databases miniImageNet and CUB-200 are used. miniImageNet consists of 100 classes with 600 samples each, 60,000 samples in total; each image is 84 × 84, and the 100 classes are divided into 64 base classes, 16 validation classes, and 20 new classes. CUB-200 contains 200 bird species with 11,788 images in total; 100, 50, and 50 of the 200 classes are randomly selected as the base classes, validation classes, and new classes, respectively. In this embodiment, the validation classes of the two data sets are neither processed nor used. To make the results reliable, 500 tasks are randomly drawn from the new classes under each of the 5-way 1-shot and 5-way 5-shot settings to verify the effect of the model. Under the 5-way 1-shot setting, each task includes 5 categories, and 1 support image and 15 query images are selected for each category. Under the 5-way 5-shot setting, each task includes 5 categories, and 5 support images and 15 query images are selected for each category.

As shown in fig. 1, the specific implementation process of the present invention is as follows:

1. Data preprocessing

The image data set C is divided into two subsets, a base class C_base and a new class C_novel. The images in C_base are training images with class labels, and each category in the new class C_novel has only k labeled images, where k takes values in [1, 20]. The images in the base class C_base are preprocessed to obtain preprocessed base class images. Several groups of images are randomly extracted from the new class C_novel to simulate small sample conditions. Each group of images is a task; each task comprises n categories, and each category comprises k labeled images and 15 unlabeled images. The labeled images are denoted support images and the unlabeled images are denoted query images, and the support images and query images are preprocessed respectively to obtain preprocessed images. n takes values in [1, 5] and k takes values in [1, 20]. In this embodiment, n is 5, and k is 1 and 5 in the two task settings, respectively.
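
The task construction described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `sample_task`, the dictionary layout `images_by_class`, and the toy image identifiers are all assumptions introduced here.

```python
import random

def sample_task(images_by_class, n=5, k=1, m=15, rng=None):
    """Draw one n-way k-shot task: k support and m query images per class."""
    rng = rng or random.Random()
    classes = rng.sample(sorted(images_by_class), n)
    support, query = [], []
    for label, cls in enumerate(classes):
        # Sample k + m distinct images so support and query never overlap.
        picked = rng.sample(images_by_class[cls], k + m)
        support += [(img, label) for img in picked[:k]]
        query += [(img, label) for img in picked[k:]]
    return support, query

# Toy stand-in for C_novel: 20 classes with 600 image ids each (hypothetical ids).
novel = {f"class_{c}": [f"img_{c}_{i}" for i in range(600)] for c in range(20)}
support, query = sample_task(novel, n=5, k=5, m=15, rng=random.Random(0))
print(len(support), len(query))  # 25 75
```

Repeating the call 500 times with k = 1 and with k = 5 would give task sets matching the 5-way 1-shot and 5-way 5-shot settings of this embodiment.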

The preprocessing operation is normalization processing by using a mean value and a standard deviation, and specifically comprises the following steps:

The three RGB channels of each image I are normalized according to the following formula:

I'_c = (I_c − Mean_c) / Std_c (6)

where I_c denotes the c-th channel of the image, I'_c denotes the normalized c-th channel, Mean_c denotes the mean of the c-th channel, and Std_c denotes the standard deviation of the c-th channel.
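
A minimal NumPy sketch of this per-channel normalization is given below. The source does not state where Mean_c and Std_c come from; computing them from the example image itself is an assumption made only so the snippet is self-contained.

```python
import numpy as np

def normalize(image, mean, std):
    """Apply I'_c = (I_c - Mean_c) / Std_c to an H x W x 3 RGB image."""
    image = np.asarray(image, dtype=np.float64)
    # mean and std have shape (3,), so they broadcast over the channel axis.
    return (image - np.asarray(mean)) / np.asarray(std)

rng = np.random.default_rng(0)
image = rng.random((84, 84, 3))      # random stand-in for one 84 x 84 RGB image
mean = image.mean(axis=(0, 1))       # assumption: statistics taken from this image
std = image.std(axis=(0, 1))
normalized = normalize(image, mean, std)
```

In practice the statistics would more likely be fixed per data set and reused for base class, support, and query images alike, but the patent text leaves this open.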

2. Building the meta-learner network

As shown in FIG. 2, the meta-learner network consists of a backbone network f_θ and a head network f_φ. The backbone network f_θ is the first w convolutional layers of the VGG network, where w is 5. The head network f_φ comprises several convolution operations with different kernel sizes: the first two layers have a kernel size of 3 × 3 and the last layer has a kernel size of 1 × 1.

The meta-learner network is trained with all the preprocessed images of the base class C_base to obtain a pre-trained meta-learner network; the loss function of the network is the cross-entropy loss.

3. Building a basic learner network

On the basis of the pre-trained meta-learner network, the parameters of the backbone network f_θ obtained on the C_base data in step 2 are fixed, and the head network f_φ is modified to obtain the basic learner network. The modified head network f_φ mainly comprises a multi-part complementary feature learning module and a task-related attention module.

The multi-part complementary feature learning module consists of a branch A and a branch B connected in sequence. Specifically, the output feature F_m ∈ R^(C×W×H) of the backbone network f_θ is input to branch A, where C is the number of channels of the feature F_m, W is the width of the feature map, and H is the height of the feature map; in this embodiment, C is 512. F_m passes through two convolutional layers with a kernel size of 3 × 3 and one convolutional layer with a kernel size of 1 × 1 to obtain a feature representation F_ha with 5 channels. The feature dimension of F_ha with the largest response is the activation map of the target category. In FIG. 3 and FIG. 4, the two convolutional layers with a kernel size of 3 × 3 are denoted "new layer". A thresholding operation is performed on the obtained activation map to obtain the feature mask mask_A corresponding to the most significant part of the object. The threshold parameter τ of the thresholding operation is a predefined parameter with values in [0.5, 0.9]; in this embodiment, τ is 0.4 when the data set is miniImageNet and 0.5 when the data set is CUB. Then, the values in F_m corresponding to mask_A are set to zero to obtain a feature map F'_m that does not contain mask_A, and F'_m is input to branch B, which outputs the feature F_hb. Branch B comprises two convolutional layers with a kernel size of 3 × 3 and one convolutional layer with a kernel size of 1 × 1.
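
The thresholding and erasing step can be sketched as follows. Two points are assumptions, since the text does not spell them out: the activation map is taken as the channel of F_ha with the largest mean response, and it is min-max scaled to [0, 1] before being compared with τ. The function names are hypothetical.

```python
import numpy as np

def select_activation(F_ha):
    """Pick the channel of F_ha (n x H x W) with the largest mean response."""
    return F_ha[F_ha.mean(axis=(1, 2)).argmax()]

def erase_salient_region(F_m, activation, tau):
    """Return mask_A and F'_m, i.e. F_m with the masked positions set to zero."""
    scaled = activation - activation.min()
    scaled = scaled / (scaled.max() + 1e-12)   # assumed min-max scaling to [0, 1]
    mask_A = scaled > tau                       # H x W mask of the most salient part
    F_m_erased = np.where(mask_A, 0.0, F_m)     # mask broadcasts over the C channels
    return mask_A, F_m_erased

F_ha = np.zeros((2, 3, 3))
F_ha[1] = [[0, 1, 2], [3, 10, 4], [5, 6, 9]]    # toy activation for the target class
F_m = np.ones((4, 3, 3))                        # toy backbone feature with C = 4
mask_A, F_m_erased = erase_salient_region(F_m, select_activation(F_ha), tau=0.5)
print(int(mask_A.sum()), F_m_erased.sum())  # 3 24.0
```

With τ = 0.5 only the three strongest positions (values 10, 9, and 6) are erased, so branch B is forced to look at the remaining region.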

The task-related attention module is implemented as follows. First, the output feature F_m of the backbone network f_θ is compressed by global average pooling to obtain the global representation feature s = [s₁, s₂, ..., s₅₁₂] of the C channels, where s_i denotes the average of the i-th channel feature, i = 1, 2, …, C, and C denotes the number of channels of the feature F_m; in this embodiment, C is 512. Then, the feature s is transformed by two fully connected layers connected in series to obtain the channel weights u_a = W₂(W₁(s)), where W₁ is the parameter of the first fully connected layer and W₂ is the parameter of the second fully connected layer; in this embodiment, r takes the value 32. A ReLU activation function is added after the first fully connected layer, and the number of output channels of the second fully connected layer equals the number of categories n; in this embodiment, n is 5. A sigmoid function is applied to the channel weights, u'_a = σ(u_a), to obtain the normalized weights u'_a, where σ(·) denotes the sigmoid function. Meanwhile, the feature map F'_m obtained in the multi-part complementary feature learning module is processed in the same way to obtain the normalized channel weights u'_b corresponding to F'_m. Finally, the weights u'_a and u'_b are multiplied with the feature maps F_ha and F_hb obtained in the multi-part complementary feature learning module, respectively, to obtain the classification feature maps F'_ha and F'_hb of branch A and branch B:

F'_ha = u'_a ⊙ F_ha (7)

F'_hb = u'_b ⊙ F_hb (8)
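
The attention computation can be illustrated with the NumPy sketch below. The weight shapes (C/r) × C for W₁ and n × (C/r) for W₂ are an assumption in the style of squeeze-and-excitation blocks, because the dimension expressions are not reproduced in this text; the random weights and the 6 × 6 spatial size are placeholders, and the function name `channel_weights` is hypothetical.

```python
import numpy as np

def channel_weights(F, W1, W2):
    """Global average pooling, FC, ReLU, FC, then sigmoid: one weight per class."""
    s = F.mean(axis=(1, 2))                    # squeeze C x H x W to (C,)
    u = W2 @ np.maximum(W1 @ s, 0.0)           # two FC layers with ReLU in between
    return 1.0 / (1.0 + np.exp(-u))            # sigmoid keeps weights in (0, 1)

C, n, r, H, W = 512, 5, 32, 6, 6
rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.01, size=(C // r, C))  # assumed shape (C/r) x C
W2 = rng.normal(scale=0.01, size=(n, C // r))  # assumed shape n x (C/r)
F_m = rng.random((C, H, W))                    # stand-in backbone feature
F_ha = rng.random((n, H, W))                   # stand-in branch A output

u_a = channel_weights(F_m, W1, W2)             # normalized weights u'_a
F_ha_cls = u_a[:, None, None] * F_ha           # F'_ha = u'_a ⊙ F_ha
```

The weights u'_b would be obtained by calling the same function on the erased feature map F'_m; whether branch B shares W₁ and W₂ with branch A is not stated in the text.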

4. Training the basic learner network

As shown in FIG. 3, each preprocessed support image is first input to the basic learner network to obtain the classification feature map F'_ha of branch A and the classification feature map F'_hb of branch B. Then, F'_ha and F'_hb are each input to a GAP layer followed by a softmax layer to obtain the prediction probabilities of the n categories (n = 5) for branch A and for branch B. The classification loss of the basic learner network is calculated from the prediction probabilities of the two branches, and the basic learner network is updated by gradient descent, where the overall classification loss function Loss of the network is:

Loss = Loss_A + λ·Loss_B (9)

Loss_A = L(f_α(F_m), y_i) (10)

Loss_B = L(f_β(F_m ⊙ mask_A), y_i) (11)

where Loss_A denotes the classification loss of branch A, Loss_B denotes the classification loss of branch B, and λ denotes the weight of branch B, with values in [0.1, 1]; in this embodiment, λ is 0.5 for the miniImageNet data set and 0.1 for CUB-200. L(·) denotes the cross-entropy loss. f_α(·) and f_β(·) denote feature extraction operations: f_α(·) comprises the two convolutional layers with a kernel size of 3 × 3 and the one convolutional layer with a kernel size of 1 × 1 of branch A together with the task-related attention module of step 3, and f_β(·) comprises the two convolutional layers with a kernel size of 3 × 3 and the one convolutional layer with a kernel size of 1 × 1 of branch B together with the task-related attention module of step 3. ⊙ denotes channel-by-channel multiplication, and y_i denotes the label of the i-th input image, i = 1, 2, …, k.
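
The way the two branch losses combine can be sketched for a single support image as follows. The inputs stand for the classification feature maps F'_ha and F'_hb of shape n × H × W; the helper names and the small epsilon inside the logarithm are assumptions for illustration, and the gradient descent update itself is not shown.

```python
import numpy as np

def gap_softmax(F):
    """Global average pooling over H x W followed by a softmax over the n classes."""
    logits = F.mean(axis=(1, 2))
    e = np.exp(logits - logits.max())          # subtract the max for stability
    return e / e.sum()

def cross_entropy(probs, label):
    """Cross-entropy of one prediction against an integer class label."""
    return -np.log(probs[label] + 1e-12)

def total_loss(F_ha_cls, F_hb_cls, label, lam):
    """Loss = Loss_A + lam * Loss_B for one support image."""
    loss_A = cross_entropy(gap_softmax(F_ha_cls), label)
    loss_B = cross_entropy(gap_softmax(F_hb_cls), label)
    return loss_A + lam * loss_B

# Uninformative feature maps give uniform probabilities 1/5 in both branches.
flat = np.zeros((5, 2, 2))
loss = total_loss(flat, flat, label=3, lam=0.5)   # equals 1.5 * ln(5)
```

A smaller λ, such as the 0.1 used for CUB-200 in this embodiment, reduces the influence of the erased branch B on the update.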

5. Classification effect verification

The verification process of the basic learner network is shown in FIG. 4. First, each preprocessed query image is input to the basic learner network trained in step 4 to obtain the classification feature maps F'_ha and F'_hb. Then, the features F'_ha and F'_hb are fused, the fused feature F_h is input to a GAP layer, and the prediction probabilities of the n categories (n = 5) of the query image are obtained through a softmax layer; the category with the largest prediction probability is taken as the classification result. Fusion means comparing the values of F'_ha and F'_hb at each position and taking the larger value as the value of the fused feature F_h at that position.
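
The element-wise maximum fusion and the final prediction can be sketched as below; the function name `classify_query` and the toy feature maps are assumptions for illustration.

```python
import numpy as np

def classify_query(F_ha_cls, F_hb_cls):
    """Fuse two n x H x W maps by element-wise max, then GAP, softmax, argmax."""
    F_h = np.maximum(F_ha_cls, F_hb_cls)       # larger value at every position
    logits = F_h.mean(axis=(1, 2))             # GAP layer
    e = np.exp(logits - logits.max())
    probs = e / e.sum()                        # softmax layer
    return int(probs.argmax()), probs

F_a = np.zeros((3, 2, 2)); F_a[0] = 1.0        # branch A responds to class 0
F_b = np.zeros((3, 2, 2)); F_b[2] = 2.0        # branch B responds more to class 2
predicted, probs = classify_query(F_a, F_b)
print(predicted)  # 2
```

Because the maximum is taken per position, a region that only one branch attends to still contributes to the fused feature.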

The classification accuracy is selected to evaluate the effectiveness of the method. The accuracy is the percentage of correctly classified samples among all samples; in general, the larger the accuracy, the better the algorithm performs. The accuracy is calculated as follows:

Accuracy = (TP + TN) / (TP + TN + FP + FN) (12)

where the relationship among TP, TN, FP and FN is shown in Table 1.

TABLE 1
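
For the multi-class tasks used here, the accuracy amounts to the fraction of query images whose predicted class equals the label. The sketch below computes it per task and averages over tasks; the averaging step and the toy predictions are assumptions, since the text only states that 500 tasks are drawn per setting.

```python
def accuracy(predicted, actual):
    """Fraction of samples whose predicted class equals the true class."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)

# Two toy tasks with made-up predictions and labels.
task_accuracies = [
    accuracy([0, 1, 2, 2], [0, 1, 2, 3]),   # 3 of 4 correct
    accuracy([1, 1], [1, 0]),               # 1 of 2 correct
]
mean_accuracy = sum(task_accuracies) / len(task_accuracies)
print(task_accuracies, mean_accuracy)  # [0.75, 0.5] 0.625
```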

The classification results obtained with the method of the invention are compared with the baseline method on the miniImageNet data set; the comparison is shown in Table 2, and the classification accuracy demonstrates the effectiveness of the method. Unlike the method of the invention, the baseline model contains neither the multi-part complementary feature learning module nor the task-related attention guidance module. Specifically, the baseline model consists of the first 5 convolution blocks of VGG16 followed by three convolutional layers: the first two have 512 convolution kernels of size 3 × 3 with a stride of 1, and the last has 5 convolution kernels of size 1 × 1 with a stride of 1.
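
As a rough check on the size of the three convolutional layers just described, their parameter count can be worked out as follows. The 512 input channels (the C = 512 backbone output of this embodiment) and the presence of one bias per kernel are assumptions; the text gives only kernel counts, kernel sizes, and strides.

```python
def conv_params(in_channels, out_channels, kernel):
    """Weights plus one bias per output channel of a square-kernel convolution."""
    return kernel * kernel * in_channels * out_channels + out_channels

# (in_channels, out_channels, kernel size) of the three layers after the backbone.
head_layers = [(512, 512, 3), (512, 512, 3), (512, 5, 1)]
per_layer = [conv_params(*layer) for layer in head_layers]
total = sum(per_layer)
print(per_layer, total)  # [2359808, 2359808, 2565] 4722181
```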

TABLE 2

Model	1-shot	5-shot
Baseline	56.75% ± 0.89%	77.22% ± 0.66%
Ours	59.31% ± 0.99%	79.21% ± 0.64%

The classification results obtained with the method of the invention are compared with the baseline method on the CUB data set; the comparison is shown in Table 3, and the classification accuracy demonstrates the effectiveness of the method. FIG. 5 shows example images from the CUB data set, and the visualization results on the CUB data set in FIG. 6 illustrate the classification effect of the method of the invention.

TABLE 3

Model	1-shot	5-shot
Baseline	74.81% ± 0.88%	92.61% ± 0.35%
Ours	77.30% ± 0.86%	94.20% ± 0.34%

Claims

1. A task attention-guided small sample image complementary learning classification algorithm is characterized by comprising the following steps:

step 1, data preprocessing: dividing an image data set C into two subsets, a base class C_base and a new class C_novel, wherein the images in C_base are training images with class labels, each category in the new class C_novel has only k labeled images, and k takes values in [1, 20]; preprocessing the images in the base class C_base to obtain preprocessed base class images; randomly extracting several groups of images from the new class C_novel to simulate small sample conditions, wherein each group of images is a task, each task comprises n categories, each category comprises k labeled images and m unlabeled images, the labeled images are denoted support images and the unlabeled images are denoted query images, and the support images and the query images are preprocessed respectively to obtain preprocessed images; the preprocessing operation is normalization using a mean and a standard deviation; n takes values in [1, 5], k takes values in [1, 20], and m is 15;

step 2, constructing a meta-learner network: the meta-learner network consists of a backbone network f_θ and a head network f_φ; the backbone network f_θ is the first w convolutional layers of the VGG network, where w is 5; the head network f_φ comprises several convolution operations with different kernel sizes, where the first p layers have a kernel size of 3 × 3 and the last q layers have a kernel size of 1 × 1, with p = 2 and q = 1;

training the meta-learner network with all the preprocessed images of the base class C_base to obtain a pre-trained meta-learner network, wherein the loss function of the network is the cross-entropy loss;

step 3, constructing a basic learner network: on the basis of the pre-trained meta-learner network, modifying the head network f_φ to obtain a basic learner network, wherein the modified head network f_φ mainly comprises a multi-part complementary feature learning module and a task-related attention module;

the multi-part complementary feature learning module consists of a branch A and a branch B connected in sequence, specifically: the output feature F_m of the backbone network f_θ is input to branch A and passed through two convolutional layers with a kernel size of 3 × 3 and one convolutional layer with a kernel size of 1 × 1 to obtain a feature representation F_ha with n channels; the feature dimension of F_ha with the largest response is the activation map of the target category, and a thresholding operation is performed on the obtained activation map to obtain the feature mask mask_A corresponding to the most significant part of the object, the threshold parameter of the thresholding operation being a predefined parameter with values in [0.5, 0.9]; then, the values in F_m corresponding to mask_A are set to zero to obtain a feature map F'_m that does not contain mask_A, and F'_m is input to branch B, which outputs the feature F_hb; branch B comprises two convolutional layers with a kernel size of 3 × 3 and one convolutional layer with a kernel size of 1 × 1;

the task-related attention module is implemented as follows: first, the output feature F_m of the backbone network f_θ is compressed by global average pooling to obtain the global representation feature s = [s₁, s₂, ..., s_C] of the C channels, where s_i denotes the average of the i-th channel feature, i = 1, 2, …, C, and C denotes the number of channels of the feature F_m; then, the feature s is transformed by two fully connected layers connected in series to obtain the channel weights u_a = W₂(W₁(s)), where W₁ is the parameter of the first fully connected layer, W₂ is the parameter of the second fully connected layer, and r is a predefined parameter of these layers; a ReLU activation function is added after the first fully connected layer, and the number of output channels of the second fully connected layer equals the number of categories n; a sigmoid function is applied to the channel weights, u'_a = σ(u_a), to obtain the normalized weights u'_a, where σ(·) denotes the sigmoid function; meanwhile, the feature map F'_m obtained in the multi-part complementary feature learning module is processed in the same way to obtain the normalized channel weights u'_b corresponding to F'_m; finally, the weights u'_a and u'_b are multiplied with the feature maps F_ha and F_hb obtained in the multi-part complementary feature learning module, respectively, to obtain the classification feature maps F'_ha and F'_hb of branch A and branch B:

F'_ha = u'_a ⊙ F_ha (1)

F'_hb = u'_b ⊙ F_hb (2)

step 4, training the basic learner network: first, each preprocessed support image is input to the basic learner network to obtain the classification feature map F'_ha of branch A and the classification feature map F'_hb of branch B; then, F'_ha and F'_hb are each input to a GAP layer followed by a softmax layer to obtain the prediction probabilities of the n categories for branch A and for branch B; the classification loss of the basic learner network is calculated from the prediction probabilities of the two branches, and the basic learner network is updated by gradient descent, wherein the overall classification loss function Loss of the network is:

Loss = Loss_A + λ·Loss_B (3)

Loss_A = L(f_α(F_m), y_i) (4)

Loss_B = L(f_β(F_m ⊙ mask_A), y_i) (5)

where Loss_A denotes the classification loss of branch A, Loss_B denotes the classification loss of branch B, and λ denotes the weight of branch B, with values in [0.1, 1]; L(·) denotes the cross-entropy loss; f_α(·) and f_β(·) denote feature extraction operations: f_α(·) comprises the two convolutional layers with a kernel size of 3 × 3 and the one convolutional layer with a kernel size of 1 × 1 of branch A together with the task-related attention module of step 3, and f_β(·) comprises the two convolutional layers with a kernel size of 3 × 3 and the one convolutional layer with a kernel size of 1 × 1 of branch B together with the task-related attention module of step 3; ⊙ denotes channel-by-channel multiplication; y_i denotes the label of the i-th input image, i = 1, 2, …, k;

step 5, verifying the classification effect: first, each preprocessed query image is input to the basic learner network trained in step 4 to obtain the classification feature maps F'_ha and F'_hb; then, the features F'_ha and F'_hb are fused, the fused feature F_h is input to a GAP layer, the prediction probabilities of the n categories of the query image are obtained through a softmax layer, and the category with the largest prediction probability is taken as the classification result; fusion means comparing the values of F'_ha and F'_hb at each position and taking the larger value as the value of the fused feature F_h at that position.