CN110796183A - Weak supervision fine-grained image classification algorithm based on relevance-guided discriminant learning - Google Patents
Legal status: Withdrawn
Classifications: G06F18/24 (classification techniques); G06F18/217 (validation; performance evaluation; active pattern learning techniques)
Abstract
The invention belongs to the technical field of computer vision, and provides a weakly supervised fine-grained image classification algorithm based on correlation-guided discriminative learning. An end-to-end correlation-guided discriminative learning model is proposed to fully mine and exploit the correlations in weakly supervised fine-grained image classification and thereby improve discriminability. First, a discriminative region grouping sub-network is proposed that establishes correlations between regions and then strengthens each region by a weighted combination of all correlations from the other regions, guiding the network to find more discriminative region groups. Second, a discriminative feature enhancement sub-network is proposed to mine and learn the internal spatial correlations between the elements of each patch's feature vector, improving its local discriminative power by jointly enhancing informative elements while suppressing useless ones. Extensive experiments demonstrate that the DRG and DFS sub-networks are effective and achieve state-of-the-art performance.
Description
Technical Field
The invention belongs to the technical field of computer vision, and provides a weakly supervised fine-grained image classification algorithm based on correlation-guided discriminative learning, with the aim of improving the accuracy and efficiency of fine-grained image classification.
Background
Unlike general image classification, weakly supervised fine-grained image classification (WFGIC) uses only image-level labels to identify objects at a more detailed class granularity. WFGIC has attracted a great deal of attention in academia and industry due to its many potential applications in image understanding and computer vision systems. WFGIC remains an open problem in computer vision for two reasons. First, images belonging to the same sub-category vary greatly in size, pose, color, and background, while images of different sub-categories may be very similar in these respects. Second, WFGIC provides only image-level labels, without object or part annotations, which makes it more difficult to extract valid discriminative features for distinguishing the subtle differences between sub-categories.
Because the key differences between fine-grained sub-category images are subtle and often localized to some specific part of the object, the best-performing WFGIC systems work on finding local discriminative patches using heuristic schemes or learning-based methods. In heuristic schemes, objects are first located using saliency extraction and co-segmentation, and two predefined spatial constraints are then applied to select distinguishable parts from a large number of candidate patches. The limitation of heuristic approaches is that they do not guarantee that the selected patches are sufficiently discriminative. Therefore, recent work has focused on designing end-to-end deep learning processes that guide the automatic discovery of discriminative patches through appropriate loss functions. However, all previous work attempts to find the discriminative regions/patches independently, using only regional features and ignoring the correlations between regions. We believe that exploiting this correlation is very helpful for distinguishing fine-grained images, since combinations of regions are more descriptive and discriminative than single regions. This prompted us to incorporate the correlation between regions into discriminative patch selection. To this end, we propose a discriminative region grouping (DRG) sub-network to model the correlation between regions and, by learning the correlation, implicitly find discriminative region groups that are more powerful for WFGIC. Figure 1 illustrates our motivation: from (b) we can see that the head and chest are more prominent when each region is considered independently. After taking the correlation into account (c), the discrimination scores for head and tail become large, since head-tail combinations may be more effective in distinguishing this species of bird from other sub-categories.
Feature representation is another key point of WFGIC. Recently, some work has encoded CNN feature vectors into higher-order information through end-to-end mechanisms to improve the discriminability of features. These methods are effective because of their invariance to the translation and pose of the object: since the feature vectors aggregate local image features in an orderless way, they are translation-invariant by design. However, these methods ignore the internal spatial correlations. In addition, there is some less discriminative or noisy context within a discriminative patch, such as the background regions in Fig. 1(d)(e). Such background or less discriminative information may be detrimental to fine-grained classification, because all sub-categories share similar background information (e.g., all birds typically live in trees or fly in the sky). Based on the above intuitive but important observations and analyses, we propose a discriminative feature enhancement sub-network to explore the internal spatial correlations between discriminative elements of the feature vector and obtain better discriminative power. We achieve this goal by jointly learning the interdependencies between feature vector elements, emphasizing informative elements while suppressing less discriminative ones.
Disclosure of Invention
The invention provides a weakly supervised fine-grained image classification algorithm based on correlation-guided discriminative learning, as shown in Fig. 2.
The technical scheme of the invention is as follows:
A weakly supervised fine-grained image classification algorithm based on correlation-guided discriminative learning comprises two sub-networks:
(1) Discriminative region grouping (DRG) sub-network
In this sub-network we propose a new method to establish the associations between regions. Given an input feature representation M_I ∈ R^{C×H×W}, we input it into the discriminative region grouping module F:
MR=f(MI), (1)
wherein F is composed of three region generation layers, a relation layer, and a fusion layer. M_R ∈ R^{C×H×W}, where W and H represent the width and height of the feature representation, and C represents the number of channels.
The region generation layer is calculated by a simple convolution operation and matrix transformation as follows:
M_T = f(W_T · M_I + b_T), (2)
wherein W_T ∈ R^{C×1×1×C} and b_T are the learned weight parameters and the bias vector of the convolutional layer, respectively, and 1×1 is the size of the convolution kernel. M_T ∈ R^{C×H×W} represents a new feature map. Specifically, we regard a 1×1 convolution filter as a small region detector: each vector V_T ∈ R^{C×1×1} across the channels of M_T at a fixed spatial position represents a small region at the corresponding position in the original image.
In order to obtain the correlation weight coefficients between regions, a relation layer is introduced to compare, via multiplicative interactions, the two feature maps M_T1 and M_T2 computed by the region generation layers.
Let us take the single correlation of two positions as an example. The correlation between position p_1 in the first feature map and position p_2 in the second feature map is defined as

c(p_1, p_2) = ⟨V_1(p_1), V_2(p_2)⟩, (3)

where V_1 and V_2 respectively denote the region feature vectors in the two feature maps. In practice, for each position p_1 in the first feature map, we compute its correlation with all positions in the second map.
For each combination of two positions we obtain a correlation value. Specifically, we organize the relative displacements in the channels and obtain an output correlation feature map M_C ∈ R^{K×H×W}, where K = W × H is the number of regions in the input feature map. Then M_C is passed through a softmax layer to generate a discriminative correlation weight map R ∈ R^{K×H×W}:

R_k(i, j) = exp(M_C^k(i, j)) / Σ_{k'=1}^{K} exp(M_C^{k'}(i, j)), (4)
In the forward propagation process, the more discriminative the regions, the greater the correlation between them. For back-propagation, we compute the derivatives with respect to each bottom blob accordingly. When the classification probability value is low, the penalty is propagated backwards to reduce the correlation weight of the two regions and, at the same time, update the feature representations computed by the region generation layers.
Next, we input the feature vectors V_k generated by the third region generation layer and the correlation weight map R into the fusion layer f:

M_F(i, j) = Σ_{k=1}^{K} R_{ijk} · V_k, (5)

where V_k is the vector at row w and column h of the third feature map M_T3, and R_{ijk} is the weight coefficient of the k-th channel at row i, column j. The vector M_F(i, j) at row i, column j of M_F is computed by combining all position vectors with their corresponding correlation coefficients, where the index mapping between the feature map M_T3 and the correlation weight coefficient map R is k = (w − 1) × W + h. In this way, the discriminative power of region aggregation is taken into account.
Inspired by ResNet, we propose residual learning:
M_R = α · M_F + M_I, (7)
where α is an adaptive weight parameter, gradually trained to assign more weight to discriminative correlation features; its range is [0, 1] and it is initialized to approximately 0. M_R includes both the adaptive discriminative correlation features and the raw input features, so as to pick out more discriminative patches. Integrating global semantic information and local detail information leads to more stable performance.
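The DRG forward pass above can be sketched in a few lines of NumPy. This is a minimal illustration under simplifying assumptions, not the patent's exact implementation: the activation f is taken as ReLU, the 1×1-convolution region generation layers are written as plain matrix products with biases omitted, the pairwise correlation is taken as a dot product between region vectors, and α is treated as a fixed scalar rather than a learned parameter.

```python
import numpy as np

def softmax(x, axis=0):
    # Numerically stable softmax along the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def drg_forward(M_I, W1, W2, W3, alpha=0.1):
    """Sketch of the DRG sub-network: three region-generation layers,
    a relation (correlation) layer, a fusion layer, and the residual
    combination of Eq. (7). M_I has shape (C, H, W); W1..W3 are the
    (C, C) weight matrices of the 1x1-conv region generators."""
    C, H, W = M_I.shape
    K = H * W                                   # number of regions
    V_I = M_I.reshape(C, K)
    # Region generation layers (ReLU assumed as the activation f)
    V1, V2, V3 = (np.maximum(0.0, Wi @ V_I) for Wi in (W1, W2, W3))
    # Relation layer: correlation of every position pair -> (K, K)
    M_C = V1.T @ V2
    # Softmax over the K correlation channels -> weight map R
    R = softmax(M_C, axis=1)
    # Fusion layer: each position becomes a correlation-weighted sum
    # of all region vectors from the third feature map
    M_F = (V3 @ R.T).reshape(C, H, W)
    # Eq. (7): residual combination with the raw input features
    return alpha * M_F + M_I
```

With alpha set to 0 the module reduces to the identity on M_I, which mirrors the initialization described above (α starts near 0 so the raw features dominate early in training).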
(2) Picking discriminative patches
In this work, inspired by heuristics from object detection, we generate default patches from feature maps at three different scales. The feature maps of different layers have different receptive fields (RF). We design the scale sizes, scale steps, and aspect ratios of the patches according to the respective RF of each feature map, so that different feature maps can account for discriminative regions of different sizes.
Let us take the feature map M_R as an example. We input the residual feature M_R into the scoring layer. Specifically, we add a 1×1×N convolutional layer and a sigmoid function σ to learn the discriminative probability map S ∈ R^{N×H×W}, which indicates the influence of each discriminative region on the final classification result.
S = σ(W_S · M_R + b_S), (8)
where W_S ∈ R^{C×1×1×N} is the parameter of the convolution kernel, N is the default number of patches at a given position of the feature map M_R, and b_S denotes the bias.
At the same time, we assign a discriminative probability value to each default patch as p_{i,j,k}. Each patch has its default coordinates (t_x, t_y, t_w, t_h) and a discriminative probability value s_{i,j,k}, where s_{i,j,k} denotes the value at row i, column j, channel k:
p_{i,j,k} = [t_x, t_y, t_w, t_h, s_{i,j,k}], (9)
Finally, the network selects the top M patches by discriminative probability value, where M is a hyperparameter.
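The scoring and top-M selection of Eqs. (8)-(9) can be sketched as follows. This is an illustrative NumPy stand-in: the 1×1×N convolution is written as a matrix product, the default boxes are passed in as a precomputed (N, H, W, 4) array, and all names are assumptions rather than the patent's actual identifiers.

```python
import numpy as np

def select_top_patches(M_R, W_s, b_s, default_boxes, M=2):
    """Eq. (8): a 1x1xN convolution plus sigmoid yields the
    discriminative probability map S of shape (N, H, W).
    Eq. (9): each default box is paired with its score, and the
    top-M patches by score are returned."""
    C, H, W = M_R.shape
    N = W_s.shape[0]                       # default patches per position
    # Eq. (8): 1x1 convolution as a matrix product, then sigmoid
    S = 1.0 / (1.0 + np.exp(-(W_s @ M_R.reshape(C, H * W) + b_s)))
    S = S.reshape(N, H, W)
    # Eq. (9): [t_x, t_y, t_w, t_h, s_ijk] for every default patch
    scored = [list(default_boxes[k, i, j]) + [S[k, i, j]]
              for k in range(N) for i in range(H) for j in range(W)]
    # Keep the M most discriminative patches
    scored.sort(key=lambda p: p[-1], reverse=True)
    return scored[:M]
```

In the full pipeline the selected patches would then be cropped from the image and fed to the feature enhancement sub-network; here only the scoring/selection step is shown.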
(3) Discriminative feature enhancement (DFS) sub-network
The selected patches typically contain noise, so the extracted features tend to contain non-discriminative information. At the same time, most current work forms the feature representation of a region directly from the output of a CNN, rarely considering the spatial correlations within the feature vector. To solve these problems, we propose a discriminative feature enhancement sub-network to mine and exploit the correlations between feature vector elements; it consists of a feature-aware filter layer and an enhancement layer. The feature-aware filter layer generates a global filter that removes useless information through a non-linear operation that zeroes out negative values in the feature vectors. The enhancement layer adaptively learns the interdependencies by using a weighted sum over the discriminative elements of the feature vector, improving its discriminative power.
We input the feature vector V'_P ∈ R^{C×1} into the feature-aware filter to filter out useless information as follows:

Ṽ_P = ReLU(BN(W_P · V'_P + b_P)), (10)

where W_P and b_P are the weight matrix and bias of the linear layer, and BN and ReLU denote batch normalization and the rectified linear unit function, respectively. Ṽ_P ∈ R^{C×1} represents the filtered discriminative feature vector.
Then we input Ṽ_P into the enhancement layer. Specifically, the interdependency score map S_E ∈ R^{C×C} of the discriminative elements is generated through matrix multiplication between Ṽ_P and its transpose, as follows:

S̃_E = σ(Ṽ_P · Ṽ_P^T), (11)

where σ is the softmax function used for normalization, S_E^{ij} is the interdependency between the i-th and j-th discriminative elements before normalization, and S̃_E^{ij} represents the same interdependency after normalization. The more discriminative any two elements are, the stronger their interdependency.
Next, we improve the discriminative power of the feature vector through matrix multiplication between the patch feature vector Ṽ_P and the interdependency score map S̃_E:

V = Ṽ_P ⊙ S̃_E, (12)
Taking into account the internal spatial dependencies between the discriminative elements of the feature vector, informative elements can be enhanced while less powerful elements are suppressed. We also introduce a residual learning mechanism to ensure the robustness of the network:

Ṽ = β · V + Ṽ_P, (13)

where β is a weight that is learned gradually from 0 and adjusted to an accurate value by back-propagation. Ṽ combines the enhanced feature vector V and the input feature vector Ṽ_P for the final classification.
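The DFS sub-network described by Eqs. (10)-(13) can be sketched as follows. This is a hedged NumPy illustration under stated assumptions: batch normalization is omitted for brevity, the product in Eq. (12) is interpreted as the dimension-consistent matrix product between the C×C score map and the C×1 filtered vector, and β is treated as a fixed scalar rather than a learned parameter.

```python
import numpy as np

def dfs_forward(V_raw, W_p, b_p, beta=0.1):
    """Sketch of the DFS sub-network: feature-aware filter (linear
    layer + ReLU, BN omitted), self-interdependency score map, and
    the residual combination of Eq. (13). V_raw has shape (C, 1)."""
    # Eq. (10): filter out useless information (ReLU zeroes negatives)
    V_p = np.maximum(0.0, W_p @ V_raw + b_p)        # (C, 1)
    # Interdependency between every pair of elements -> (C, C)
    E = V_p @ V_p.T
    # Softmax normalization per row (the sigma in the text)
    e = np.exp(E - E.max(axis=1, keepdims=True))
    S_E = e / e.sum(axis=1, keepdims=True)
    # Eq. (12): enhance informative elements via the score map
    V_enh = S_E @ V_p                               # (C, 1)
    # Eq. (13): residual learning with weight beta (starts near 0)
    return beta * V_enh + V_p
```

As with the DRG residual, setting beta to 0 returns the filtered vector unchanged, which matches the description of β being learned gradually from 0.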
(4) Loss function
The complete multi-task loss function L can be expressed as:

L = L_cls + λ_1 L_gui + λ_2 L_cor + λ_3 L_rank, (14)

where L_cls represents the fine-grained classification loss, and L_gui, L_cor, and L_rank denote the guiding loss, the correlation loss, and the rank loss, respectively. The balance between these losses is controlled by the hyperparameters λ_1, λ_2, λ_3. Based on repeated experimental validation, we set λ_1 = λ_2 = λ_3 = 1.
We denote the selected discriminative patches as P = {P_1, P_2, ..., P_N} and the corresponding discriminative probability scores as S = {S_1, S_2, ..., S_N}. Then the guiding loss, the correlation loss, and the rank loss are defined as follows:

L_gui = Σ_{i=1}^{N} max(0, C(X) − C(P_i)), (15)
L_cor = Σ_{i=1}^{N} max(0, C(P_i) − C(P_c)), (16)
L_rank = Σ_{C(P_i) < C(P_j)} max(0, S_i − S_j), (17)

where X is the original image, the function C is a confidence function reflecting the probability of classification into the correct class, and P_c is the concatenation of all selected patch features.
The purpose of the guiding loss is to steer the network to select more discriminative regions: when the prediction probability of a selected region is lower than that of the whole image, the network is penalized and its weights are adjusted by back-propagation. The correlation loss ensures that the prediction probability of the combined feature is greater than that of any single patch feature. The rank loss encourages the discriminative scores and the final classification probability values to be in the same order, trying to keep the two consistent for the selected patches.
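The behavior of the three auxiliary losses described above can be sketched with hinge-style stand-ins. The exact formulas are not reproduced in this text, so the expressions below are plausible illustrations only: the guiding term penalizes patches less confident than the full image, the correlation term keeps the concatenated feature above every single patch, and the rank term penalizes score orderings that disagree with the classification confidences.

```python
def cdl_loss(c_full, c_patches, c_concat, scores, lams=(1.0, 1.0, 1.0)):
    """Hinge-style sketch of the guiding / correlation / rank losses.
    c_full: confidence of the whole image; c_patches: confidences of
    the selected patches; c_concat: confidence of the concatenated
    patch feature; scores: discriminative scores of the patches."""
    l1, l2, l3 = lams
    # Guiding: each patch should be at least as confident as the image
    guide = sum(max(0.0, c_full - c) for c in c_patches)
    # Correlation: the combined feature should beat every single patch
    corr = sum(max(0.0, c - c_concat) for c in c_patches)
    # Rank: scores should follow the same order as the confidences
    n = len(scores)
    rank = sum(max(0.0, scores[i] - scores[j])
               for i in range(n) for j in range(n)
               if c_patches[i] < c_patches[j])
    return l1 * guide + l2 * corr + l3 * rank
```

With λ_1 = λ_2 = λ_3 = 1 as in the text, the total auxiliary loss is simply the sum of the three terms; the fine-grained classification loss would be added on top.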
The invention has the beneficial effects that:
(1) To our knowledge, ours is the first approach to explore and use the correlations between discriminative regions and between feature vector elements to improve the discriminative ability of regions and their representations for WFGIC.
(2) We propose an end-to-end correlation-guided discriminant learning (CDL) model that incorporates discriminant region grouping and discriminant feature enhancement into a unified framework, so that two levels of correlation can be efficiently and jointly learned.
(3) We evaluated the proposed method on the challenging Caltech-UCSD Birds-200-2011 and Stanford Cars-196 datasets. Experimental results show that the method achieves the best performance in both classification accuracy and efficiency. In particular, our method achieves an accuracy improvement of about 1.4% and runs 12 FPS faster than the previous best techniques.
Drawings
Fig. 1 illustrates the motivation of the correlation-guided discriminative learning method proposed by the present invention.
FIG. 2 is the network framework diagram of the correlation-guided discriminative learning (CDL) model proposed by the present invention.
Fig. 3 is an explanatory diagram of the discriminative region grouping proposed by the present invention.
Fig. 4 is an explanatory diagram of the discriminative feature enhancement proposed by the present invention.
Fig. 5 shows the visualization results of the region correlations of the present invention; (a) is the original image, and (b)(c)(d)(e) show the correlation between the region at a particular location and all other regions.
FIG. 6 shows visualized intermediate results of the discriminative region grouping of the present invention, where (a) is the original image, (b) is the correlation aggregation feature map, (c) is the residual feature map, and (d) is the localization result.
Fig. 7 shows the visualized localization results with and without region correlation, where (a) is the original image, (b) and (c) are the discriminative score maps from the scoring stage without and with correlation, respectively, and (d) and (e) are the localization results without and with correlation, respectively.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the following detailed description of the embodiments of the present invention is provided.
Data set: experimental evaluation was performed on two benchmark datasets, Caltech-UCSD Birds-200-2011 and Stanford Cars, which are widely used competition datasets for fine-grained image classification. The CUB-200-2011 dataset covers 200 bird species and contains 11,788 bird images, split into a training set of 5,994 images and a test set of 5,794 images. The Stanford Cars dataset contains 16,185 images of 196 categories, split roughly 50/50 into training and testing for each category.
Implementation details: in all our experiments, all images were resized to 448 × 448. We used the fully convolutional network ResNet-50 as the feature extractor and applied batch normalization as a regularizer. Our optimizer is SGD with momentum, with an initial learning rate of 0.001 multiplied by 0.1 after every 60 epochs. We set the weight decay to 1e-4. To reduce patch redundancy, we apply non-maximum suppression (NMS) to the default patches based on their discriminative scores, with an NMS threshold of 0.25.
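The NMS step mentioned in the implementation details can be sketched as the standard greedy procedure; the text specifies only the 0.25 threshold, so the exact variant used is an assumption here.

```python
def nms(boxes, scores, iou_thresh=0.25):
    """Greedy non-maximum suppression over default patches: keep the
    highest-scoring box, drop any remaining box whose IoU with a kept
    box exceeds the threshold (0.25 in the implementation details).
    Boxes are [x1, y1, x2, y2]; returns indices of kept boxes."""
    def iou(a, b):
        x1, y1 = max(a[0], b[0]), max(a[1], b[1])
        x2, y2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0, x2 - x1) * max(0, y2 - y1)
        area_a = (a[2] - a[0]) * (a[3] - a[1])
        area_b = (b[2] - b[0]) * (b[3] - b[1])
        return inter / float(area_a + area_b - inter)

    # Visit boxes in descending score order
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_thresh for j in keep):
            keep.append(i)
    return keep
```

Applying this to the scored default patches before the top-M selection removes heavily overlapping candidates, which is what keeps the final patch count in the single digits.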
Ablation experiment: the main advantage of our method is to select a more discriminative patch based on the correlation between regions and enhance the feature vector by mining the interdependencies of discriminative elements in the feature vector. We performed a few ablation experiments to illustrate the effectiveness of our proposed module, including the impact of discrimination region grouping and discrimination feature enhancement.
First, we extract features from the entire image through ResNet-50, performing fine-grained classification without any object or part annotations, and set this as the baseline. Then, we select default patches as local features to improve classification accuracy. However, a large number of redundant default patches results in a low classification speed. When we introduce a scoring mechanism to retain only highly discriminative patches and reduce the number of patches to single digits, the top-1 classification accuracy on the CUB-200-2011 dataset improves by 0.6%, and real-time classification is achieved at a speed of 50 fps. In addition, after the discriminative power of region aggregation is taken into account, the classification accuracy improves by a further 1.3%. Finally, when the feature-aware filter is introduced and the interdependencies of the feature vector values are mined, the classification accuracy reaches 88.4%, the latest state-of-the-art result. We also analyzed the feature-aware filter in DFS and demonstrated its effectiveness without additional computational cost. The results are reported in Table 2. The ablation experiments show that the proposed network really learns the discriminative regions, filters out useless information, and enhances the discriminative feature values, effectively improving accuracy.
Table 2: ablation results for different variants of the proposed method
Quantitative comparison. Accuracy comparison: our comparison focuses on weakly supervised approaches, since the proposed model uses only image-level annotations and does not use any object or part annotations. In Tables 3 and 4 we show the performance of the different methods on the CUB-200-2011 dataset and the Stanford Cars-196 dataset, respectively. From top to bottom of each table, the methods are divided into six groups: (1) supervised multi-stage methods, (2) weakly supervised multi-stage frameworks, (3) weakly supervised end-to-end feature encoding, (4) end-to-end localization-classification sub-networks, (5) other methods (e.g. reinforcement learning, knowledge representation), and (6) our CDL.
Table 3: comparison of different methods on CUB-200-2011
Table 4: comparison of different methods on Stanford Cars-196
Early multi-stage methods generally rely on object and even part annotations and can therefore achieve better results. However, using object or part annotations also limits performance, because the annotations give only coordinates rather than actual discriminative region information. Weakly supervised multi-stage frameworks gradually surpass strongly supervised approaches by picking out the discriminative regions. End-to-end feature encoding methods perform well by encoding CNN feature vectors into higher-order information, but at a higher computational cost. Although localization-classification sub-networks work well on a variety of datasets, they still ignore the correlations between discriminative regions. Other methods also achieve good performance thanks to additional information (e.g., semantic embeddings). Our end-to-end CDL approach achieves the best results without any additional annotations and performs consistently across datasets.
Our approach outperforms the strongly supervised approaches in the first group, which suggests that the proposed method can find discriminative patches without any supervised annotation. Compared with other weakly supervised methods, our method achieves the best performance. The proposed CDL outperforms KERL by 1.4% on CUB, because we build region representations at both the global image level and the local region level and thus encode richer information. DT-RAM selects accurate discriminative regions using reinforcement learning; by learning the correlations among regions and mining the interdependencies of the elements within the feature vector to emphasize informative elements and suppress useless ones, our method selects more discriminative patches and outperforms DT-RAM, improving accuracy by 2.4% on CUB and 1.1% on Cars.
Speed comparison: we measured speed on a Titan X GPU with a batch size of 8. Table 5 shows the comparison with other methods. WSDL also applies multi-scale features to generate patches and selects them by detection score. Even when we select only 2 discriminative patches based on the discriminative score map, we outperform the other methods in both speed and accuracy. When we increase the number of discriminative patches from 2 to 4, the proposed model achieves state-of-the-art classification accuracy while maintaining real-time performance at 40 fps.
Table 5: comparison of efficiency and effectiveness of different methods on CUB-200-2011
Qualitative analysis: to better illustrate the impact of the correlations between regions, we visualize the correlation weight coefficient maps in Fig. 5. A correlation coefficient map indicates the correlation between one fixed region and all regions. We can observe that the feature maps learned through correlation tend to attend to certain fixed (highlighted) regions. The more discriminative the regions, the greater their correlation, and the most discriminative regions occupy a higher proportion in the grouping process.
As shown in Fig. 6, we visualize the correlation aggregation feature map, obtained by the weighted-sum operation combining all regions, and the residual feature map. The residual feature map is obtained by fusing the original feature map with the correlation aggregation feature map. The original feature map responds to discriminative regions of a particular size and focuses on many local details, while the correlation aggregation feature map has a global view and attends to the most discriminative regions. The residual feature map contains both local detail information and global discriminative information, achieving stable performance.
To illustrate the effectiveness of the discriminative region grouping module, we visualize the discriminative score maps with and without the discriminative region grouping sub-network in Fig. 7. We can see that without the correlation stage the discriminative score map focuses on only one discriminative region, and the selected patches concentrate on its neighboring regions. In contrast, our discriminative region grouping sub-network notices multiple valid regions, as shown in Fig. 7(c). To present this more intuitively, we show the localization results on the original images: the selected patches are spread over several different regions, resulting in more discriminative region aggregation features.
In summary, we propose the CDL method for weakly supervised fine-grained image classification, which integrates a discriminative region grouping sub-network and a discriminative feature enhancement sub-network into a unified framework. The discriminative region grouping sub-network learns the correlation weight coefficients between regions to guide the discovery of discriminative patches, while the discriminative feature enhancement sub-network mines the interdependencies between the internal discriminative elements of the feature vector to enhance informative elements and suppress useless ones. Experiments show that our method yields consistent improvements on both fine-grained image datasets, achieving state-of-the-art accuracy at a real-time speed of 42 fps.
While the invention has been described in connection with specific embodiments thereof, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.
Claims (1)
1. A weakly supervised fine-grained image classification algorithm based on correlation-guided discriminative learning, characterized by comprising two sub-networks:
(1) Discriminative region grouping (DRG) sub-network
A new method is proposed in this sub-network to establish the connections between regions; given an input feature representation M_I ∈ R^{C×H×W}, the input feature representation is fed into the discriminative region grouping module F:
MR=f(MI), (1)
wherein F is composed of three region generation layers, a relation layer, and a fusion layer; M_R ∈ R^{C×H×W}, where W and H represent the width and height of the feature representation, and C represents the number of channels;
the area generation layer is calculated by convolution operation and matrix transformation as follows:
M_T = f(W_T · M_I + b_T), (2)
wherein W_T ∈ R^{C×1×1×C} and b_T are respectively the learned weight parameters and the bias vector of the convolutional layer; 1×1 is the size of the convolution kernel; M_T ∈ R^{C×H×W} represents a new feature map; specifically, a 1×1 convolution filter is regarded as a small region detector: each vector V_T ∈ R^{C×1×1} across the channels of M_T at a fixed spatial position represents a small region at the corresponding position in the original image;
in order to obtain the correlation weight coefficients between regions, a relation layer is introduced to compare, via multiplicative interactions, the two feature maps M_T1 and M_T2 computed by the region generation layers;
single correlation of two positions: the correlation between position p_1 in the first feature map and position p_2 in the second feature map is defined as

c(p_1, p_2) = ⟨V_1(p_1), V_2(p_2)⟩, (3)
wherein V_1 and V_2 respectively represent the region feature vectors in the two feature maps; in actual operation, for each position p_1 in the first feature map, its correlation with all positions in the second map is calculated;
for each combination of two positions, a correlation value is obtained; specifically, the relative displacements are organized in the channels to obtain an output correlation feature map M_C ∈ R^{K×H×W}, where K = W × H is the number of regions in the input feature map; then M_C is passed through a softmax layer to generate a discriminative correlation weight map R ∈ R^{K×H×W}:

R_k(i, j) = exp(M_C^k(i, j)) / Σ_{k'=1}^{K} exp(M_C^{k'}(i, j)), (4)
in the forward propagation process, the more discriminative the regions, the greater the correlation between them; for back-propagation, the derivative with respect to each bottom blob is computed accordingly; when the classification probability value is low, the penalty is propagated backwards to reduce the correlation weight of the two regions, and the feature representations computed by the region generation layers are updated at the same time;
next, the feature vectors V_k generated by the third region generation layer and the correlation weight map R are input into the fusion layer f:

M_F(i, j) = Σ_{k=1}^{K} R_{ijk} · V_k, (5)

wherein V_k is the vector at row w and column h of the third feature map M_T3, and R_{ijk} is the weight coefficient of the k-th channel at row i, column j; the vector M_F(i, j) at row i, column j of M_F is obtained by combining all position vectors with their corresponding correlation coefficients, where the index mapping between the feature map M_T3 and the correlation weight coefficient map R is k = (w − 1) × W + h;
residual learning is proposed:
M_R = α · M_F + M_I, (7)
wherein α is an adaptive weight parameter, gradually trained to assign more weight to discriminative correlation features; the range of α is [0, 1] and its initialization value is 0; M_R comprises the adaptive discriminative correlation features and the raw input features, so as to select more discriminative patches;
(2) pick discriminant patch
Generating a default patch from three feature maps with different scales according to target detection; the profiles of the different layers have different reception fields RF; scaling the step size and the aspect ratio according to the scale size of the corresponding RF design patch of each feature map so as to enable different feature maps to be responsible for different sized discrimination areas;
The residual features M_R are input into a scoring layer; specifically, a 1 × 1 × N convolutional layer and a sigmoid function σ are added to learn a discriminative probability map S ∈ R^(N×H×W), which reflects the influence of each discriminative region on the final classification result:

S = σ(W_S * M_R + b_S),  (8)

where W_S ∈ R^(C×1×1×N) are the parameters of the convolution kernel, N is the number of default patches at each position of the feature map M_R, and b_S denotes the bias;
Meanwhile, a discriminative probability value is assigned to each default patch p_{i,j,k}; each patch has its default coordinates (t_x, t_y, t_w, t_h) and a discriminative probability value s_{i,j,k}, where s_{i,j,k} denotes the value at row i, column j, channel k:

p_{i,j,k} = [t_x, t_y, t_w, t_h, s_{i,j,k}],  (9)
Finally, the network keeps the M patches with the highest discriminative probability values, where M is a hyper-parameter;
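A toy sketch of the scoring-and-selection step: sigmoid scores are flattened and the M highest-scoring default patches are kept, each reported as [t_x, t_y, t_w, t_h, s] per eq. (9). The anchor coordinates below are random placeholders, not the default-box design described above:

```python
import numpy as np

def select_top_patches(score_map, anchors, M=4):
    """Rank default patches by sigmoid discriminative probability and keep
    the top M (a sketch; `anchors` holds placeholder default coordinates
    (t_x, t_y, t_w, t_h) for each of the N x H x W positions)."""
    s = 1.0 / (1.0 + np.exp(-score_map))       # sigmoid -> probabilities
    flat = s.ravel()
    top = np.argsort(flat)[::-1][:M]           # indices of the M best patches
    coords = anchors.reshape(-1, 4)            # aligned with flat by C-order
    # p = [t_x, t_y, t_w, t_h, s] for each selected patch (eq. (9))
    return [list(coords[i]) + [flat[i]] for i in top]

rng = np.random.default_rng(0)
scores = rng.normal(size=(3, 4, 4))            # N=3 default patches per cell
anchors = rng.integers(0, 64, size=(3, 4, 4, 4))
patches = select_top_patches(scores, anchors, M=4)
print(len(patches), len(patches[0]))           # M patches, 5 values each
```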
(3) Discriminative feature strengthening (DFS) sub-network
The discriminative feature strengthening sub-network mines and exploits the correlations between the elements of each patch feature vector; it consists of a feature-aware filtering layer and an enhancement layer. The feature-aware filtering layer generates a global filter that removes the negative, uninformative values of the feature vector; the enhancement layer adaptively learns the interdependencies through a weighted sum of the discriminative elements of the feature vector;
The feature vector V'_P ∈ R^(C×1) is input to the feature-aware filter to filter out useless information as follows:

Ṽ_P = ReLU(BN(W · V'_P + b_P)),  (10)

where W and b_P are the weight matrix and bias of the linear layer, BN and ReLU denote batch normalization and the rectified linear unit, and Ṽ_P ∈ R^(C×1) is the filtered discriminative feature vector;
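Eq. (10) can be illustrated as below; batch normalization is simplified to a per-vector standardisation, an assumption made only to keep the sketch self-contained:

```python
import numpy as np

def feature_filter(v_in, W, b):
    """Feature-aware filtering of eq. (10): linear layer, then a stand-in
    for batch normalisation (per-vector standardisation), then ReLU, which
    zeroes the negative, uninformative elements (a sketch)."""
    z = W @ v_in + b
    z = (z - z.mean()) / (z.std() + 1e-5)      # simplified BN
    return np.maximum(z, 0.0)                  # ReLU keeps informative elements

rng = np.random.default_rng(1)
C = 8
v = rng.normal(size=(C,))
v_f = feature_filter(v, rng.normal(size=(C, C)), np.zeros(C))
print(v_f.min())                               # no negative values survive
```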
Then Ṽ_P is input to the enhancement layer; specifically, the interdependency score map S_E ∈ R^(C×C) of the discriminative elements is generated by a matrix operation between Ṽ_P and its transpose:

S̃_E = σ(Ṽ_P · Ṽ_P^T),  (11)
where σ is the softmax function used for normalization; (S_E)_{ij} is the interdependency between the i-th and the j-th discriminative elements before normalization, and (S̃_E)_{ij} is the interdependency after normalization; the larger the value between any two elements, the stronger their interdependency;
Next, the discriminative power of the feature vector is improved through matrix multiplication between the patch feature vector Ṽ_P and the interdependency score map S̃_E:

V = Ṽ_P ⊙ S̃_E,  (12)
This takes the internal spatial interdependencies between the discriminative elements of the feature vector into account, strengthening informative elements while suppressing less useful ones; a residual learning mechanism is also introduced to ensure the robustness of the network:
Ṽ = β · V + Ṽ_P,  (13)

where β is a weight that is gradually learned from 0 and adjusted to an accurate value by back-propagation; Ṽ contains both the enhanced feature vector V and the filtered input feature vector Ṽ_P for final classification;
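The enhancement layer (eqs. (11)–(13)) can be sketched as follows, with a row-wise softmax assumed for the normalisation σ:

```python
import numpy as np

def enhance(v_f, beta=0.0):
    """Enhancement sketch: interdependency map S_E = v v^T, softmax
    normalisation row-wise (eq. (11)), recombination with the filtered
    vector (eq. (12)), and a residual connection (eq. (13))."""
    s_e = np.outer(v_f, v_f)                   # C x C interdependency scores
    e = np.exp(s_e - s_e.max(axis=1, keepdims=True))
    s_tilde = e / e.sum(axis=1, keepdims=True) # row-wise softmax
    v_enh = s_tilde @ v_f                      # eq. (12): weighted recombination
    return beta * v_enh + v_f                  # eq. (13): residual connection

v_f = np.maximum(np.random.default_rng(2).normal(size=(8,)), 0.0)
out = enhance(v_f, beta=0.0)                   # beta starts at 0, as in the text
```

With beta = 0 the output equals the filtered vector, mirroring how the adaptive weight starts neutral and is tuned by back-propagation.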
(4) Loss function
The complete multi-task loss function L is expressed as:

L = L_cls + λ_1 · L_G + λ_2 · L_C + λ_3 · L_R,  (14)

where L_cls denotes the fine-grained classification loss, and L_G, L_C and L_R denote the guidance loss, the correlation loss and the rank loss respectively; the balance between these losses is controlled by the hyper-parameters λ_1, λ_2, λ_3; after repeated experimental verification, the parameters are set to λ_1 = λ_2 = λ_3 = 1;
The selected discriminative patches are denoted P = {P_1, P_2, ..., P_M} and the corresponding discriminative probability scores S = {s_1, s_2, ..., s_M}; the guidance loss, the correlation loss and the rank loss are then defined as follows:

L_G = Σ_i max(0, C(X) − C(P_i)),
L_C = Σ_i max(0, C(P_i) − C(P_c)),
L_R = Σ_{(i,j): C(P_i) < C(P_j)} max(0, s_i − s_j),

where X is the original image, the function C is a confidence function reflecting the probability of classification into the correct class, and P_c is the concatenation of all selected patch features;
The purpose of the guidance loss is to guide the network to select more discriminative regions: when the prediction probability of a selected region is lower than that of the whole image, the network is penalized and its weights are adjusted through back-propagation. The correlation loss ensures that the prediction probability of the combined feature is greater than that of any single patch feature. The rank loss encourages the discriminative scores and the final classification probabilities of the selected patches to follow the same order, keeping the two consistent.
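One consistent hinge-style reading of the three losses described above, sketched in Python (the exact functional forms are not given in this text, so these are assumptions that match the stated behaviour):

```python
def guidance_loss(c_full, c_patches):
    """Penalise a selected patch whose confidence is below that of the
    whole image (a sketch, not the exact patented form)."""
    return sum(max(0.0, c_full - c) for c in c_patches)

def correlation_loss(c_concat, c_patches):
    """Penalise patches whose individual confidence exceeds that of the
    concatenated feature P_c."""
    return sum(max(0.0, c - c_concat) for c in c_patches)

def rank_loss(scores, c_patches):
    """Penalise pairs whose discriminative-score order disagrees with
    the order of their classification confidences."""
    loss = 0.0
    for i, ci in enumerate(c_patches):
        for j, cj in enumerate(c_patches):
            if ci < cj:                         # patch j classifies better,
                loss += max(0.0, scores[i] - scores[j])  # so s_j should be higher
    return loss

c_patches = [0.6, 0.8]
print(guidance_loss(0.7, c_patches))            # only the weaker patch is penalised
print(rank_loss([0.9, 0.2], c_patches))         # score order disagrees -> penalty
```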
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910986800.8A CN110796183A (en) | 2019-10-17 | 2019-10-17 | Weak supervision fine-grained image classification algorithm based on relevance-guided discriminant learning |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110796183A true CN110796183A (en) | 2020-02-14 |
Family
ID=69439314
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910986800.8A Withdrawn CN110796183A (en) | 2019-10-17 | 2019-10-17 | Weak supervision fine-grained image classification algorithm based on relevance-guided discriminant learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110796183A (en) |
Citations (15)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101894275A (en) * | 2010-06-29 | 2010-11-24 | 武汉大学 | Weakly supervised method for classifying SAR images |
US20160132750A1 (en) * | 2014-11-07 | 2016-05-12 | Adobe Systems Incorporated | Local feature representation for image recognition |
US20160140424A1 (en) * | 2014-11-13 | 2016-05-19 | Nec Laboratories America, Inc. | Object-centric Fine-grained Image Classification |
US20160140438A1 (en) * | 2014-11-13 | 2016-05-19 | Nec Laboratories America, Inc. | Hyper-class Augmented and Regularized Deep Learning for Fine-grained Image Classification |
US20160210533A1 (en) * | 2015-01-19 | 2016-07-21 | Ebay Inc | Fine-grained categorization |
US20160307072A1 (en) * | 2015-04-17 | 2016-10-20 | Nec Laboratories America, Inc. | Fine-grained Image Classification by Exploring Bipartite-Graph Labels |
CN107766890A (en) * | 2017-10-31 | 2018-03-06 | 天津大学 | The improved method that identification segment learns in a kind of fine granularity identification |
CN109002834A (en) * | 2018-06-15 | 2018-12-14 | 东南大学 | Fine granularity image classification method based on multi-modal characterization |
WO2019136946A1 (en) * | 2018-01-15 | 2019-07-18 | 中山大学 | Deep learning-based weakly supervised salient object detection method and system |
WO2019140767A1 (en) * | 2018-01-18 | 2019-07-25 | 苏州大学张家港工业技术研究院 | Recognition system for security check and control method thereof |
CN110097067A (en) * | 2018-12-25 | 2019-08-06 | 西北工业大学 | It is a kind of based on layer into the Weakly supervised fine granularity image classification method of formula eigentransformation |
CN110097090A (en) * | 2019-04-10 | 2019-08-06 | 东南大学 | A kind of image fine granularity recognition methods based on multi-scale feature fusion |
CN110135502A (en) * | 2019-05-17 | 2019-08-16 | 东南大学 | A kind of image fine granularity recognition methods based on intensified learning strategy |
CN110147834A (en) * | 2019-05-10 | 2019-08-20 | 上海理工大学 | Fine granularity image classification method based on rarefaction bilinearity convolutional neural networks |
CN110309858A (en) * | 2019-06-05 | 2019-10-08 | 大连理工大学 | Based on the fine granularity image classification algorithms for differentiating study |
Non-Patent Citations (9)
Title |
---|
JIALI XI ET AL.: "Fine-Grained Fusion With Distractor Suppression for Video-Based Person Re-Identification", 《IEEE ACCESS》 * |
LIN WU ET AL.: "Deep Attention-Based Spatially Recursive Networks for Fine-Grained Visual Recognition", 《IEEE TRANSACTIONS ON CYBERNETICS》 * |
PENG ZHANG ET AL.: "REAPS: Towards Better Recognition of Fine-grained Images by Region Attending and Part Sequencing", 《ARXIV》 * |
ZHIHUI WANG ET AL.: "accurate and fast fine-grained image classification via discriminative learning", 《2019 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO (ICME)》 * |
ZHIHUI WANG ET AL.: "Weakly Supervised Fine-grained Image Classification via Correlation-guided Discriminative Learning", 《PROCEEDINGS OF THE 27TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA》 * |
YANG JUAN ET AL.: "Fine-grained vehicle type recognition with region proposal networks", 《JOURNAL OF IMAGE AND GRAPHICS》 * |
WANG HONG: "Fine-grained image retrieval algorithms with multi-feature fusion", 《CHINA MASTER'S THESES FULL-TEXT DATABASE, INFORMATION SCIENCE AND TECHNOLOGY》 * |
ZHENG GUANGJIAN: "Fine-grained image recognition based on click-based deep models", 《CHINA MASTER'S THESES FULL-TEXT DATABASE, INFORMATION SCIENCE AND TECHNOLOGY》 * |
JIN KE: "Research on image classification methods based on bilinear convolutional neural networks", 《WANFANG》 * |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111507403A (en) * | 2020-04-17 | 2020-08-07 | 腾讯科技(深圳)有限公司 | Image classification method and device, computer equipment and storage medium |
CN112541463A (en) * | 2020-12-21 | 2021-03-23 | 上海眼控科技股份有限公司 | Model training method, appearance segmentation method, device and storage medium |
CN117173422A (en) * | 2023-08-07 | 2023-12-05 | 广东第二师范学院 | Fine granularity image recognition method based on graph fusion multi-scale feature learning |
CN117173422B (en) * | 2023-08-07 | 2024-02-13 | 广东第二师范学院 | Fine granularity image recognition method based on graph fusion multi-scale feature learning |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110837836B (en) | Semi-supervised semantic segmentation method based on maximized confidence | |
CN110796183A (en) | Weak supervision fine-grained image classification algorithm based on relevance-guided discriminant learning | |
Wang et al. | Salient object detection based on multi-scale contrast | |
CN103679132B (en) | A kind of nude picture detection method and system | |
CN111062438B (en) | Image propagation weak supervision fine granularity image classification algorithm based on correlation learning | |
CN108920643B (en) | Weighted multi-feature fusion fine-grained image retrieval method | |
CN110309858B (en) | Fine-grained image classification method based on discriminant learning | |
CN108921047B (en) | Multi-model voting mean value action identification method based on cross-layer fusion | |
Lachaize et al. | Evidential framework for error correcting output code classification | |
CN105138672A (en) | Multi-feature fusion image retrieval method | |
US20220277192A1 (en) | Visual Analytics System to Assess, Understand, and Improve Deep Neural Networks | |
Liu et al. | Cross-part learning for fine-grained image classification | |
Ge et al. | Semantic-guided reinforced region embedding for generalized zero-shot learning | |
CN115412324A (en) | Air-space-ground network intrusion detection method based on multi-mode conditional countermeasure field adaptation | |
CN105096293A (en) | Method and device used for processing to-be-processed block of urine sediment image | |
Lin et al. | MCCH: A novel convex hull prior based solution for saliency detection | |
Pang et al. | Over-sampling strategy-based class-imbalanced salient object detection and its application in underwater scene | |
CN114359742B (en) | Weighted loss function calculation method for optimizing small target detection | |
Xiang et al. | Double-branch fusion network with a parallel attention selection mechanism for camouflaged object detection | |
CN111242102B (en) | Fine-grained image recognition algorithm of Gaussian mixture model based on discriminant feature guide | |
CN116109649A (en) | 3D point cloud instance segmentation method based on semantic error correction | |
Carlson et al. | Application of a weighted projection measure for robust hidden Markov model based speech recognition | |
CN112836511A (en) | Knowledge graph context embedding method based on cooperative relationship | |
CN114168780A (en) | Multimodal data processing method, electronic device, and storage medium | |
Faria et al. | Classifier selection based on the correlation of diversity measures: When fewer is more |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
WW01 | Invention patent application withdrawn after publication | ||
Application publication date: 20200214 |