CN113936145A - Fine-grained identification method based on attention diagram sorting - Google Patents

Info

Publication number
CN113936145A
CN113936145A
Authority
CN
China
Prior art keywords
attention
map
feature
fine
area
Prior art date
Legal status
Granted
Application number
CN202111173394.7A
Other languages
Chinese (zh)
Other versions
CN113936145B (en)
Inventor
张小瑞
王营营
孙伟
宋爱国
刘青山
张开华
Current Assignee
Nanjing University of Information Science and Technology
Original Assignee
Nanjing University of Information Science and Technology
Priority date
Filing date
Publication date
Application filed by Nanjing University of Information Science and Technology
Priority to CN202111173394.7A
Publication of CN113936145A
Application granted
Publication of CN113936145B
Legal status: Active
Anticipated expiration


Classifications

    • G06N3/045: Computing arrangements based on biological models; neural networks; architectures; combinations of networks
    • G06N3/08: Computing arrangements based on biological models; neural networks; learning methods
    • G06T3/4007: Geometric image transformations; scaling of whole images or parts thereof based on interpolation, e.g. bilinear interpolation

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The invention discloses a fine-grained identification method based on attention map sorting, belonging to the technical field of computer vision and pattern recognition. First, original image features are acquired: the output of the third convolutional layer is processed with a 3 × 3 convolution (Conv), global maximum pooling (GMP) and global average pooling (GAP), and the results are concatenated and passed through a fully connected layer to obtain multi-scale original image features. Weakly supervised attention learning is then performed. Next, locating and refining are carried out: a fine-grained identification region is located through a bounding box and its features are extracted. The attention maps are then sorted by an importance ranking algorithm, the most discriminative region is selected through a bounding box, and its features are extracted; the ranking reinforces learning of the most discriminative region. Finally, the features of the original image, the located fine-grained identification region and the selected most discriminative region are concatenated for prediction.

Description

Fine-grained identification method based on attention map sorting
Technical Field
The invention relates to a fine-grained identification method based on attention map sorting, and belongs to the technical field of computer vision and pattern recognition.
Background
Fine-grained image recognition classifies the subclasses of a given class, for example distinguishing not only coarse classes such as flowers, birds and dogs, but also the subclasses of dogs, such as Husky, Samoyed and Golden Retriever. These subclasses are almost identical in overall appearance and must be distinguished through local details, and the positions of those details in the image vary with the pose of the target, so fine-grained image recognition is more difficult than conventional image recognition.
Fine-grained image recognition has long been a challenge in computer vision, mainly for the following reasons: (1) high intra-class variance: objects belonging to the same category often exhibit significantly different poses; (2) low inter-class variance: objects belonging to different categories can be very similar apart from minor differences, such as head color or beak shape in birds; (3) limited training data: labeling fine-grained classes typically requires considerable expertise and annotation time, so fine-grained recognition datasets are usually small. For these reasons, it is difficult to obtain accurate classification results with existing coarse-grained convolutional neural networks (CNNs) alone.
To distinguish different subclasses, for example different species of birds, it is necessary not only to extract features from the whole picture but also to select discriminative local regions, such as the head, beak and feet of a bird, and use their features to assist the final class judgment. Background information such as flowers and grass is unimportant for judging the category: different birds may perch on trees or lawns, so tree and lawn information plays no decisive role in bird identification. Introducing an attention mechanism into image recognition is therefore very effective, since it lets the deep learning model focus on discriminative local areas. Because the discriminative local differences in fine-grained recognition are subtle, the invention classifies with intermediate-layer features: compared with high-layer features they have higher resolution and contain more position and detail information, while avoiding the low semantics and heavy noise of low-layer features. At the same time, multi-scale information is obtained through convolution, global average pooling and global maximum pooling, which benefits fine-grained recognition tasks whose differences lie only in small local areas.
Labeling fine-grained classes typically requires domain experts to spend substantial annotation time, so fine-grained recognition datasets are usually small, which makes data augmentation particularly necessary. Conventional data augmentation randomly crops the picture, which easily yields crops of background or of incomplete parts; such crops amount to noise, and the smaller the object to be recognized, the more noise is introduced.
Disclosure of Invention
In view of these problems, the invention provides a fine-grained identification method based on attention map sorting, which extracts features at three levels, the original image, the located fine-grained identification region, and the selected most discriminative region, for category prediction, thereby improving the accuracy of fine-grained recognition.
The technical scheme of the invention is as follows:
To achieve the above purpose, the invention provides a fine-grained identification method based on attention map sorting, comprising the following steps:
(1) acquiring original image features;
(2) performing weakly supervised attention learning;
(3) locating and refining: locating a fine-grained identification region through a bounding box and extracting its features;
(4) sorting the attention maps with an importance ranking algorithm and selecting the most discriminative region to participate in category prediction;
(5) concatenating the features of the original image, the located fine-grained identification region and the selected most discriminative region for final prediction.
Further, in step (1), acquiring the original image features specifically comprises:
extracting features of the training-set images with the first three convolutional layers of the convolutional neural network Inception v3, then processing the output X_3 of the third convolutional layer with a 3 × 3 convolution (Conv), global maximum pooling (GMP) and global average pooling (GAP), respectively, giving the three processed features

X_conv = Conv(X_3), X_gmp = GMP(X_3), X_gap = GAP(X_3)

which are concatenated into the cascaded feature

X = concat(X_conv, X_gmp, X_gap)

The cascaded feature then undergoes Batch Normalization to accelerate training of the convolutional network, and a feature map of the image is obtained through a fully connected layer; the feature maps are resized to the same size by bilinear interpolation, yielding the original image features for final class prediction.
Further, in step (2), performing the weakly supervised attention learning comprises:
(2.1) obtaining the feature map and the attention maps;
(2.2) bilinear attention pooling;
(2.3) attention regularization;
(2.4) attention-map-guided data augmentation during training, comprising the enhancement map, attention cropping and attention dropping.
Further, in step (2.1), obtaining the feature map and the attention maps specifically comprises:
extracting features of the training-set images with a convolutional neural network to obtain the feature map F ∈ R^{H×W×N}, where R denotes the real space, H and W are the height and width of the feature map, and N is its number of channels. The distribution of the object's parts is described by the attention maps A ∈ R^{H×W×M}, where M is the number of attention maps. A is obtained from F by

A = f(F), A = (A_1, A_2, …, A_M)

where F is the feature map, f(F) is a convolution operation on F, k is a counter with k ∈ [1, M], and A_k denotes the k-th attention map.
Further, in step (2.2), bilinear attention pooling specifically comprises:
after the attention maps A are obtained, features are extracted from the corresponding parts with bilinear attention pooling (BAP): the feature map F is multiplied element-wise by each attention map to generate the part feature maps,

F_k = A_k ⊙ F (k = 1, 2, …, M)

where F_k ∈ R^{H×W×N} denotes the k-th part feature map and ⊙ denotes element-wise multiplication;
discriminative local features are then further extracted by a feature extraction operation, giving the k-th further-extracted part feature f_k ∈ R^{1×N}:

f_k = g(F_k)

where f_k denotes the k-th further-extracted part feature and g(F_k) performs the feature extraction operation on the k-th part feature map F_k;
the features of the whole object are described by the part feature matrix P ∈ R^{M×N}, formed by stacking the further-extracted part features:

P = (f_1; f_2; …; f_M)

where M denotes the number of attention maps and N the number of feature map channels.
Further, in step (2.3), the attention regularization specifically comprises:
for each fine-grained category, the k-th attention map A_k is expected to represent the k-th part of the object. The variation of further-extracted features belonging to the same part is penalized: the k-th further-extracted part feature f_k is drawn toward the k-th global feature center c_k ∈ R^{1×N}, and the k-th attention map A_k is activated at the same part of the object. The attention regularization loss L_A is

L_A = Σ_{k=1}^{M} ‖f_k − c_k‖₂²

and c_k is updated by

c_k ← c_k + β(f_k − c_k)

where M is the number of attention maps, k is a counter with k ∈ [1, M], f_k is the k-th further-extracted part feature, c_k is the k-th global feature center, ‖f_k − c_k‖₂² is the squared two-norm of the difference between f_k and the k-th global feature center, and β is the update rate of c_k.
Further, in step (2.4), the attention-map-guided data augmentation during training comprises the enhancement map, attention cropping and attention dropping, specifically:
The enhancement map is constructed as follows:
when the object is small, a large part of the image is background and random data augmentation is inefficient. For each training image, one attention map is randomly selected to guide the augmentation process and is normalized into the enhancement map A*_k:

A*_k = (A_k − min(A_k)) / (max(A_k) − min(A_k))

where A*_k ∈ R^{H×W} denotes the enhancement map of the k-th attention map, R denotes the real space, H and W are the height and width of the enhancement map, A_k is the k-th attention map, min(A_k) is the smallest pixel value in the k-th attention map, and max(A_k) is the largest;
Attention cropping proceeds as follows:
first, the cropping mask is set to 1 at every pixel whose value in A*_k exceeds a manually set cropping threshold θ_c ∈ [0, 1], and to 0 elsewhere:

C_k(i, j) = 1 if A*_k(i, j) > θ_c, otherwise 0

where (i, j) denotes the pixel with horizontal and vertical coordinates i and j, C_k(i, j) is the cropping mask at pixel (i, j) obtained from the k-th enhancement map, and A*_k(i, j) is the value of pixel (i, j) in the k-th enhancement map;
the bounding box B_k determined from the k-th enhancement map covers the positive region of C_k(i, j); the region enclosed by B_k is enlarged from the original image and used as augmented input data, from which finer-grained features are extracted;
Attention dropping proceeds as follows:
the drop mask is set to 0 at every pixel whose value in A*_k exceeds a manually set dropping threshold θ_d ∈ [0, 1], and to 1 elsewhere:

D_k(i, j) = 0 if A*_k(i, j) > θ_d, otherwise 1

where D_k(i, j) is the drop mask at pixel (i, j) obtained from the k-th enhancement map.
Further, in step (3), the locating and refining, locating a fine-grained identification region through a bounding box and extracting its features, specifically comprises:
with the trained network model, the attention maps A are obtained as in step (2.1); the average of the M attention maps, which indicates the position of the object, is

A_aver = (1/M) Σ_{k=1}^{M} A_k

Following the attention cropping procedure of step (2.4), the object region indicated by A_aver is cropped from the original image according to A_aver; this is the located fine-grained identification region. The region is enlarged by bilinear interpolation and its features are extracted with the same network structure, yielding the fine-grained identification region features for final class prediction.
Further, in step (4), sorting the attention maps with the importance ranking algorithm and selecting the most discriminative region to participate in category prediction specifically comprises:
with the trained network model, the attention maps A are obtained as in step (2.1); the object region indicated by each A_k is cropped from the original image, enlarged by bilinear interpolation, and its features are extracted with the same network structure; from these features the probabilities Q_1, Q_2, Q_3, …, Q_M that each region belongs to the ground-truth class are computed. The region with the largest ground-truth probability Q_k is selected, and the corresponding A_k is regarded as the anchor node of highest importance. The coordinates of the geometric center of each region are computed, and all regions whose geometric centers lie within a distance margin of the anchor node's geometric center are selected; their corresponding attention maps A_k, A_l, …, A_t are averaged to obtain A_aver. The object region indicated by A_aver is cropped from the original image; this is the most discriminative region. It is enlarged by bilinear interpolation and its features are extracted with the same network structure, yielding the most discriminative region features for final class prediction.
Advantageous effects
1. The invention provides an attention-map importance ranking algorithm that orders the attention maps by importance, so that the most discriminative area of the original image can be located according to attention-map importance and its learning reinforced; this also avoids the excessive unnecessary noise introduced by the strong randomness of random-cropping data augmentation;
2. the first three convolutional layers are used when extracting original image features; compared with high-layer features, the extracted intermediate-layer features have higher resolution and contain more position and detail information, while avoiding the low semantics and heavy noise of low-layer features; the output of the third convolutional layer is then processed with a 3 × 3 convolution (Conv), global maximum pooling (GMP) and global average pooling (GAP), providing multi-scale information that benefits fine-grained recognition tasks whose differences lie only in local areas;
3. the invention uses attention cropping and attention dropping, applying the idea of reinforcement learning to drive the network to extract more discriminative features.
Drawings
FIG. 1 is a flow chart of the fine-grained identification method based on attention map sorting according to the present invention;
FIG. 2 is an overall block diagram of the fine-grained identification method based on attention map sorting according to the present invention;
FIG. 3 is a schematic illustration of the weakly supervised attention learning process of FIG. 2;
FIG. 4 is a schematic diagram of the bilinear attention pooling process of FIG. 2;
FIG. 5 is a schematic diagram of the attention-map importance ranking algorithm of FIG. 2.
Detailed Description
The present application will be described in further detail with reference to the following drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the relevant invention and not restrictive of the invention. It should be noted that, for convenience of description, only the portions related to the related invention are shown in the drawings.
It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The present application will be described in detail below with reference to the embodiments with reference to the attached drawings.
In this embodiment, the flow of the fine-grained identification method based on attention map sorting is shown in FIG. 1, and its overall framework is shown in FIG. 2; the method comprises the following steps:
(1) Acquiring original image features;
extracting features of images in the training set by using the first three convolutional layers of convolutional neural network inclusion v3, and then respectively processing the output result X3 of the third convolutional layer by using 3 × 3 convolutional Conv, global maximum pooling GAP and global average pooling GMP, as shown in fig. 3, processing the obtained three features:
Figure BDA0003292618490000071
Figure BDA0003292618490000072
cascading to obtain the cascaded characteristics
Figure BDA0003292618490000073
Then, carrying out Batch standardized Batch Normalization processing on the cascaded features to accelerate the training speed of the convolution network, and obtaining a feature map of the image through full-connection processing; and adjusting the obtained feature map to the same size by a bilinear interpolation method, thereby obtaining the features of the original image for final class prediction.
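A minimal NumPy sketch of this multi-scale fusion may help. The function, shapes and fusion order below are illustrative assumptions (the patent names only the three operations and a subsequent concatenation); the conv branch is collapsed spatially here so the three branches can be concatenated into one vector, and the random kernel stands in for learned weights.

```python
import numpy as np

def multiscale_features(x3, rng=None):
    """Sketch of step (1): combine a 3x3 convolution, global max pooling
    (GMP) and global average pooling (GAP) over the third-layer output X3
    of shape (channels, height, width)."""
    rng = rng or np.random.default_rng(0)
    c, h, w = x3.shape
    # 3x3 convolution with 'same' padding; random kernel stands in for learned weights
    k = rng.standard_normal((c, c, 3, 3)) * 0.01
    pad = np.pad(x3, ((0, 0), (1, 1), (1, 1)))
    conv = np.zeros_like(x3)
    for i in range(h):
        for j in range(w):
            patch = pad[:, i:i + 3, j:j + 3]       # (c, 3, 3) receptive field
            conv[:, i, j] = np.tensordot(k, patch, axes=3)
    gmp = x3.max(axis=(1, 2))                       # (c,) global max pooling
    gap = x3.mean(axis=(1, 2))                      # (c,) global average pooling
    conv_vec = conv.mean(axis=(1, 2))               # collapse conv branch spatially
    return np.concatenate([conv_vec, gmp, gap])     # cascaded multi-scale feature
```

In the patent the cascaded feature would then pass through Batch Normalization and a fully connected layer, which are omitted here.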
(2) Performing weakly supervised attention learning;
(2.1) Obtaining the feature map and the attention maps:
features of the training-set images are extracted with a convolutional neural network to obtain the feature map F ∈ R^{H×W×N}; the distribution of the object's parts is described by the attention maps A ∈ R^{H×W×M}, where R denotes the real space, H and W are the height and width, N is the number of feature map channels, and M is the number of attention maps. A is obtained from F by

A = f(F), A = (A_1, A_2, …, A_M)

where F is the feature map, f(F) is a convolution operation on F, k is a counter with k ∈ [1, M], and A_k denotes the k-th attention map.
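The step above can be sketched as follows. Modeling f as a 1 × 1 convolution followed by ReLU is an assumption (the patent only says "convolution operation"); a 1 × 1 kernel reduces to a per-pixel matrix product over channels, which keeps the sketch simple.

```python
import numpy as np

def attention_maps(F, W):
    """Sketch of step (2.1): A = f(F), mapping a feature map F of shape
    (H, W, N) to M attention maps of shape (H, W, M) via weights W (N, M)."""
    A = F @ W                   # per-pixel linear map over the N channels
    return np.maximum(A, 0.0)   # ReLU keeps attention values non-negative
```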
(2.2) Bilinear attention pooling:
after the attention maps A are obtained, features are extracted from the corresponding parts with bilinear attention pooling (BAP), whose process is shown in FIG. 4: the feature map F is multiplied element-wise by each attention map to generate the part feature maps,

F_k = A_k ⊙ F (k = 1, 2, …, M)

where F_k ∈ R^{H×W×N} denotes the k-th part feature map and ⊙ denotes element-wise multiplication.
Discriminative local features are then further extracted by a feature extraction operation, giving the k-th further-extracted part feature f_k ∈ R^{1×N}:

f_k = g(F_k)

where f_k denotes the k-th further-extracted part feature and g(F_k) performs the feature extraction operation on the k-th part feature map F_k.
The features of the whole object are described by the part feature matrix P ∈ R^{M×N}, formed by stacking the further-extracted part features:

P = (f_1; f_2; …; f_M)

where M denotes the number of attention maps and N the number of feature map channels.
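BAP can be sketched directly from the formulas above. Using global average pooling for g is an assumption, since the patent only calls it a "feature extraction operation":

```python
import numpy as np

def bilinear_attention_pooling(F, A):
    """Sketch of step (2.2), bilinear attention pooling (BAP):
    F_k = A_k ⊙ F over spatial positions, then g(.) reduces each part
    feature map to a vector f_k in R^{1xN}; the rows f_k are stacked
    into the part feature matrix P in R^{MxN}."""
    H, W, N = F.shape
    M = A.shape[-1]
    P = np.empty((M, N))                 # part feature matrix P
    for k in range(M):
        Fk = A[..., k:k + 1] * F         # (H, W, N): k-th part feature map
        P[k] = Fk.mean(axis=(0, 1))      # f_k = g(F_k), here a spatial average
    return P
```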
(2.3) Attention regularization:
for each fine-grained category, the k-th attention map A_k is expected to represent the k-th part of the object, so the invention proposes an attention regularization loss to weakly supervise the attention learning process. The variation of further-extracted features belonging to the same part of an object is penalized: the k-th further-extracted part feature f_k is drawn toward the k-th global feature center c_k ∈ R^{1×N}, and the k-th attention map A_k is activated at the same part of the object. The attention regularization loss L_A is

L_A = Σ_{k=1}^{M} ‖f_k − c_k‖₂²

and c_k is updated by

c_k ← c_k + β(f_k − c_k)

where M is the number of attention maps, k is a counter with k ∈ [1, M], f_k is the k-th further-extracted part feature, c_k is the k-th global feature center, ‖f_k − c_k‖₂² is the squared two-norm of the difference between f_k and the k-th global feature center, and β is the update rate of c_k.
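One training step of this regularization can be sketched as below; the function name and the default β are illustrative:

```python
import numpy as np

def attention_reg_step(P, centers, beta=0.05):
    """Sketch of step (2.3): L_A = sum_k ||f_k - c_k||_2^2 pulls each part
    feature f_k (row k of P) toward its global feature center c_k, and each
    center is updated by c_k <- c_k + beta * (f_k - c_k)."""
    diffs = P - centers                      # (M, N) per-part deviations
    loss = float((diffs ** 2).sum())         # sum of squared two-norms
    new_centers = centers + beta * diffs     # moving-average center update
    return loss, new_centers
```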
(2.4) Attention-map-guided data augmentation during training, comprising the enhancement map, attention cropping and attention dropping.
The enhancement map is constructed as follows:
when the object is small, a large part of the image is background, in which case random data augmentation is inefficient; with the attention maps, the data can be augmented more effectively. For each training image, one attention map is randomly selected to guide the augmentation process, and the k-th attention map is normalized into the enhancement map A*_k:

A*_k = (A_k − min(A_k)) / (max(A_k) − min(A_k))

where A*_k ∈ R^{H×W}, R denotes the real space, H and W are the height and width, A_k is the k-th attention map, min(A_k) is the smallest pixel value in the k-th attention map, and max(A_k) is the largest.
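The normalization above is plain min-max scaling; a sketch, with a guard for a flat map added as an assumption:

```python
import numpy as np

def enhancement_map(Ak):
    """Sketch of step (2.4): min-max normalize the k-th attention map into
    an enhancement map A*_k = (A_k - min(A_k)) / (max(A_k) - min(A_k)),
    so its values lie in [0, 1] and can be thresholded."""
    lo, hi = Ak.min(), Ak.max()
    if hi == lo:                               # degenerate flat map
        return np.zeros_like(Ak, dtype=float)
    return (Ak - lo) / (hi - lo)
```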
Attention cropping proceeds as follows:
with the enhancement map, the corresponding area of the original image is enlarged and finer local features are extracted. Specifically, the cropping mask is first set to 1 at every pixel whose value in A*_k exceeds a manually set cropping threshold θ_c ∈ [0, 1], and to 0 elsewhere:

C_k(i, j) = 1 if A*_k(i, j) > θ_c, otherwise 0

where (i, j) denotes the pixel with horizontal and vertical coordinates i and j, C_k(i, j) is the cropping mask at pixel (i, j) obtained from the k-th enhancement map, and A*_k(i, j) is the value of pixel (i, j) in the k-th enhancement map.
The bounding box B_k determined from the k-th enhancement map covers the positive region of C_k(i, j); the area enclosed by B_k is enlarged from the original image and used as augmented input data, as shown in FIG. 3. Because the object then occupies a larger proportion of the input, the object is seen better and finer-grained features are extracted.
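A sketch of the cropping step. It assumes the enhancement map E has already been upsampled to the image's spatial size, and it omits the final bilinear resize of the crop:

```python
import numpy as np

def attention_crop(image, E, theta_c=0.5):
    """Sketch of attention cropping: build the crop mask
    C_k(i, j) = 1 where A*_k(i, j) > theta_c, take the bounding box B_k
    covering all positive mask entries, and return that region."""
    mask = E > theta_c
    if not mask.any():
        return image                        # nothing exceeds the threshold
    rows, cols = np.where(mask)
    r0, r1 = rows.min(), rows.max() + 1
    c0, c1 = cols.min(), cols.max() + 1     # bounding box B_k
    return image[r0:r1, c0:c1]
```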
Attention dropping proceeds as follows:
the attention regularization loss supervises the k-th attention map A_k to represent the k-th part of the same object, yet different attention maps may still focus on similar parts. Attention dropping is therefore proposed to encourage the attention maps to represent multiple distinct regions of the object. Specifically, the drop mask is set to 0 at every pixel whose value in A*_k exceeds a manually set dropping threshold θ_d ∈ [0, 1], and to 1 elsewhere:

D_k(i, j) = 0 if A*_k(i, j) > θ_d, otherwise 1

where D_k(i, j) is the drop mask at pixel (i, j) obtained from the k-th enhancement map.
Masking the original image with D_k(i, j) removes the k-th region, which encourages the network to extract other discriminative parts; this improves the robustness of classification and the accuracy of localization.
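The dropping step is the mirror image of cropping; a sketch, again assuming E matches the image's spatial size:

```python
import numpy as np

def attention_drop(image, E, theta_d=0.5):
    """Sketch of attention dropping: D_k(i, j) = 0 where A*_k(i, j) > theta_d,
    1 elsewhere; multiplying the image by D_k erases the k-th region and
    pushes the network to find other discriminative parts."""
    D = (E <= theta_d).astype(image.dtype)   # drop mask D_k
    return image * D
```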
(3) Locating and refining: locating the fine-grained identification region through a bounding box and extracting its features.
With the trained network model, the attention maps A are obtained as in step (2.1); the average of the M attention maps, which indicates the position of the object, is

A_aver = (1/M) Σ_{k=1}^{M} A_k

Following the attention cropping procedure of step (2.4), the object region indicated by A_aver is cropped from the original image according to A_aver; this is the located fine-grained identification region. The region is enlarged by bilinear interpolation and its features are extracted with the same network structure, yielding the fine-grained identification region features for final class prediction.
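The averaging in step (3) is a one-liner; in this sketch A is assumed to be an H × W × M array with the attention maps stacked along the last axis, after which the result would be cropped as in step (2.4):

```python
import numpy as np

def average_attention(A):
    """Sketch of step (3): A_aver = (1/M) * sum_k A_k over the M maps
    stacked along the last axis of A."""
    return A.mean(axis=-1)
```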
(4) The attention maps are sorted by the importance ranking algorithm and the most discriminative region is selected to participate in category prediction; FIG. 5 is a schematic diagram of the attention-map importance ranking algorithm.
With the trained network model, the attention maps A are obtained as in step (2.1). Following the attention cropping procedure of step (2.4), the object region indicated by each A_k is cropped from the original image, enlarged by bilinear interpolation, and its features are extracted with the same network structure; from these features the probabilities Q_1, Q_2, Q_3, …, Q_M that each region belongs to the ground-truth class are computed. The region with the largest ground-truth probability Q_k is selected, and the corresponding A_k is regarded as the most important and taken as the anchor node. The coordinates of the geometric center of each region are computed, and all regions whose geometric centers lie within a distance margin of the anchor node's geometric center are selected; their corresponding attention maps A_k, A_l, …, A_t are averaged to obtain A_aver. Following the attention cropping of step (2.4), the object region indicated by A_aver is cropped from the original image; this is the most discriminative region. It is enlarged by bilinear interpolation and its features are extracted with the same network structure, yielding the most discriminative region features for final class prediction;
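The selection logic of the ranking can be sketched as below. The function operates on precomputed ground-truth probabilities and region centers; the names are illustrative, and the Euclidean distance between geometric centers is an assumption since the patent does not specify the distance measure:

```python
import numpy as np

def rank_and_select(probs, centers, margin):
    """Sketch of step (4), attention-map importance ranking: each region
    has a ground-truth-class probability Q_k and a geometric center; the
    region with the largest Q_k is the anchor, and every region whose
    center lies within `margin` of the anchor's center is kept. The
    selected maps would then be averaged into A_aver and cropped as in
    step (3). Returns the anchor index and the selected indices."""
    anchor = int(np.argmax(probs))                  # most discriminative region
    ref = np.asarray(centers[anchor], dtype=float)
    dists = np.linalg.norm(np.asarray(centers, dtype=float) - ref, axis=1)
    selected = [k for k in range(len(probs)) if dists[k] < margin]
    return anchor, selected
```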
(5) Cascading (concat) the features of the original image, the localized fine-grained recognition area, and the selected most discriminative area for the final prediction.
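The cascade in step (5) can be sketched as a channel-wise concatenation of the three feature vectors; the feature dimension 768 and the function name below are illustrative assumptions, not taken from the patent:

```python
import numpy as np

def fuse_features(raw_feat, region_feat, discrim_feat):
    # Cascade (concat) the three feature vectors along the channel axis
    # before feeding them to the final classifier.
    return np.concatenate([raw_feat, region_feat, discrim_feat], axis=-1)

raw = np.ones(768)      # features of the original image
region = np.ones(768)   # features of the localized fine-grained area
discrim = np.ones(768)  # features of the most discriminative area
fused = fuse_features(raw, region, discrim)   # shape (2304,)
```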
The above description is only a preferred embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the embodiments of the present invention are included in the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (9)

1. A fine-grained identification method based on attention map sorting is characterized by comprising the following steps:
(1) acquiring the characteristics of an original image;
(2) performing weak supervision attention learning;
(3) positioning and refining, namely positioning a fine-grained identification area through a bounding box and extracting the characteristics of the area;
(4) sorting the attention diagrams according to an importance sorting algorithm, and selecting the most discriminative region to participate in category prediction;
(5) and cascading the characteristics of the original image, the positioned fine-grained identification area and the selected most discriminative area for final prediction.
2. The fine-grained identification method according to claim 1, wherein in the step (1), the obtaining of the original image features specifically comprises:
extracting the features of the images in the training set with the first three convolutional layers of the convolutional neural network Inception v3, then processing the output X_3 of the third convolutional layer with a 3 × 3 convolution Conv, global maximum pooling GMP, and global average pooling GAP respectively, giving the three processed features:

X_Conv = Conv_{3×3}(X_3), X_GMP = GMP(X_3), X_GAP = GAP(X_3)

and cascading them to obtain the cascaded feature

X = concat(X_Conv, X_GMP, X_GAP)
Then Batch Normalization is applied to the cascaded feature to accelerate the training of the convolutional network, and the feature map of the image is obtained through a fully connected layer; the obtained feature maps are adjusted to the same size by bilinear interpolation, giving the original-image features for final class prediction.
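A minimal numpy sketch of the three-branch processing of X_3 described above (GMP, GAP, and a stand-in for the 3 × 3 convolution branch), followed by cascading and an inference-form batch normalization; all shapes and the random projection are illustrative assumptions, not the patent's actual network:

```python
import numpy as np
rng = np.random.default_rng(0)

H, W, C = 8, 8, 16
X3 = rng.standard_normal((H, W, C))   # stand-in for the 3rd-layer output X_3

x_gmp = X3.max(axis=(0, 1))           # global maximum pooling GMP -> (C,)
x_gap = X3.mean(axis=(0, 1))          # global average pooling GAP -> (C,)

# stand-in for the 3x3 conv branch: a random linear projection collapsed
# to a vector, just to produce a fixed-length feature like the Conv output
W_conv = rng.standard_normal((C, C))
x_conv = (X3.reshape(-1, C) @ W_conv).mean(axis=0)

x = np.concatenate([x_conv, x_gmp, x_gap])            # cascaded feature, length 3C
x_bn = (x - x.mean()) / np.sqrt(x.var() + 1e-5)       # batch-norm, inference form
```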
3. The fine-grained identification method according to claim 2, wherein in the step (2), the performing the weakly supervised attention learning comprises:
(2.1) acquiring a characteristic diagram and an attention diagram;
(2.2) bilinear attention pooling BAP;
(2.3) attention regularization;
and (2.4) carrying out attention-map-oriented data expansion during training, wherein the data expansion comprises enhancement-map generation, attention cropping, and attention dropping.
4. The fine grain identification method according to claim 3, wherein in the step (2.1), the obtaining of the feature map and the attention map is specifically:
extracting the features of the images in the training set with a convolutional neural network to obtain a feature map F ∈ R^{H×W×N}, where R denotes the real space, H and W denote the height and width of the feature map, and N denotes the number of channels of the feature map; the distribution of each object part is captured in the attention map A ∈ R^{H×W×M}, where M denotes the number of attention maps; the attention map A is obtained from F by:

A = f(F) = ∪_{k=1}^{M} A_k

where F denotes the feature map, f(F) denotes a convolution operation on the feature map, k is a counter with k ∈ [1, M], and A_k denotes the kth attention map.
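The mapping A = f(F) can be sketched as a 1 × 1 convolution across the channel axis; the kernel, shapes, and the ReLU (used here only to keep the maps non-negative) are illustrative assumptions, not taken from the patent:

```python
import numpy as np
rng = np.random.default_rng(1)

H, W, N, M = 8, 8, 32, 4
F = rng.standard_normal((H, W, N))   # feature map F in R^{H x W x N}
W1 = rng.standard_normal((N, M))     # assumed 1x1 conv kernel for f(.)

# A = f(F): each attention map A_k is one output channel of the 1x1 conv;
# a 1x1 conv is just a matrix product over the channel dimension
A = np.maximum(F.reshape(-1, N) @ W1, 0).reshape(H, W, M)
```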
5. The fine-grained identification method according to claim 4, wherein in the step (2.2), the bilinear attention pooling is specifically:
after the attention map A is obtained, features are extracted from these parts using bilinear attention pooling BAP: the feature map F is multiplied element-wise by each attention map to generate a part feature map, as shown in the following formula:

F_k = A_k ⊙ F (k = 1, 2, ..., M)

where F_k denotes the kth part feature map and ⊙ denotes the element-by-element multiplication operation;

distinctive local features are then obtained through a feature-extraction operation, giving the kth further-extracted part feature f_k ∈ R^{1×N}, as shown in the following formula:

f_k = g(F_k)

where f_k denotes the kth further-extracted part feature and g(F_k) denotes the feature-extraction operation applied to the kth part feature map F_k;

the features of the whole object are represented by a part feature matrix P ∈ R^{M×N}, formed by stacking these further-extracted part features:

P = (f_1; f_2; ...; f_M)

where M denotes the number of attention maps and N denotes the number of feature-map channels.
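The BAP steps above (the element-wise product F_k = A_k ⊙ F, a pooling operation for g(·), and stacking into the part feature matrix) can be sketched in numpy; global average pooling is assumed for g(·), and all shapes are illustrative:

```python
import numpy as np
rng = np.random.default_rng(2)

H, W, N, M = 8, 8, 32, 4
F = rng.standard_normal((H, W, N))   # feature map
A = rng.random((H, W, M))            # M attention maps

parts = []
for k in range(M):
    Fk = A[:, :, k:k+1] * F          # F_k = A_k (x) F, broadcast to (H, W, N)
    fk = Fk.mean(axis=(0, 1))        # g(F_k): assumed global average pooling -> (N,)
    parts.append(fk)
P = np.stack(parts)                  # part feature matrix, (M, N)
```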
6. A fine-grained identification method according to claim 5, wherein in step (2.3), the attention regularization is specifically:
for each fine-grained category, the kth attention map A_k is expected to represent the kth same part of the object; differences between further-extracted part features belonging to the same part are penalized, so that the kth further-extracted part feature f_k approaches the kth global feature center c_k ∈ R^{1×N} and the kth attention map A_k is activated on the same part of the object. The attention regularization loss L_A is:

L_A = Σ_{k=1}^{M} ||f_k − c_k||_2^2

c_k is updated by the following formula:

c_k ← c_k + β(f_k − c_k)

where M denotes the number of attention maps, k is a counter with k ∈ [1, M], f_k denotes the kth further-extracted part feature, c_k denotes the kth global feature center, ||f_k − c_k||_2^2 denotes the squared two-norm of the difference between the kth further-extracted part feature f_k and the kth global feature center, and β denotes the update rate of c_k.
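A minimal numpy sketch of the regularization loss and the center update above; the feature values, M, N, and β are illustrative assumptions:

```python
import numpy as np
rng = np.random.default_rng(3)

M, N, beta = 4, 32, 0.05
f = rng.standard_normal((M, N))   # f_k: further-extracted part features
c = np.zeros((M, N))              # c_k: global feature centers (initialized to zero)

# L_A = sum_k || f_k - c_k ||_2^2
loss = float(((f - c) ** 2).sum())

# c_k <- c_k + beta * (f_k - c_k): moving-average update of each center
c = c + beta * (f - c)
```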
7. The fine-grained recognition method according to claim 6, wherein in the step (2.4), the attention-map-oriented data expansion during training comprises enhancement-map generation, attention cropping, and attention dropping, specifically:

the enhancement-map data expansion is as follows:

when the object is small, a large part of the image is background and random data enhancement is inefficient in this case; for each training image, one attention map is therefore randomly selected to guide the enhancement process and is normalized into an enhancement map:

A_k* = (A_k − min(A_k)) / (max(A_k) − min(A_k))

where A_k* ∈ R^{H×W} denotes the enhancement map of the kth attention map, R denotes the real space, H and W denote the height and width of the enhancement map respectively, A_k denotes the kth attention map, min(A_k) denotes the pixel value of the pixel with the smallest value in the kth attention map, and max(A_k) denotes the pixel value of the pixel with the largest value in the kth attention map;
the attention-cropping data expansion is as follows:

first, the crop mask is set to 1 at the pixels whose value in A_k* is greater than a manually set cropping threshold θ_c ∈ [0, 1], and to 0 at all other pixels, as shown in the following formula:

C_k(i, j) = 1 if A_k*(i, j) > θ_c, and C_k(i, j) = 0 otherwise

where (i, j) denotes the pixel whose horizontal and vertical coordinates are i and j respectively, C_k(i, j) denotes the crop mask at pixel (i, j) obtained from the kth enhancement map, and A_k*(i, j) denotes the value of pixel (i, j) in the kth enhancement map;
the bounding box B_k determined from the kth enhancement map covers the region where C_k(i, j) is positive; the area enclosed by B_k is enlarged from the original image and used as enhanced input data, from which finer-grained features are extracted; the attention-dropping data expansion is as follows:
the drop mask is set to 0 at the pixels whose value in A_k* is greater than a manually set dropping threshold θ_d ∈ [0, 1], and to 1 at all other pixels, as shown in the following formula:

D_k(i, j) = 0 if A_k*(i, j) > θ_d, and D_k(i, j) = 1 otherwise

where D_k(i, j) denotes the drop mask at pixel (i, j) obtained from the kth enhancement map.
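A minimal numpy sketch of the three augmentation quantities in this claim (the enhancement map A_k*, the crop mask C_k, the drop mask D_k, and the bounding box B_k); the map size and threshold values are illustrative assumptions:

```python
import numpy as np
rng = np.random.default_rng(4)

Ak = rng.random((8, 8))                        # kth attention map
# enhancement map: min-max normalize A_k to [0, 1]
Ak_star = (Ak - Ak.min()) / (Ak.max() - Ak.min())

theta_c, theta_d = 0.5, 0.5                    # cropping / dropping thresholds
Ck = (Ak_star > theta_c).astype(int)           # crop mask: 1 where attention is high
Dk = (Ak_star <= theta_d).astype(int)          # drop mask: 0 where attention is high

# bounding box B_k: smallest box covering all pixels with C_k(i, j) = 1
ys, xs = np.nonzero(Ck)
Bk = (ys.min(), xs.min(), ys.max(), xs.max())
```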
8. The fine grain identification method according to claim 7, wherein in the step (3), the positioning and refining, the fine grain identification area is positioned through the bounding box and the feature of the area is extracted, specifically:
using the trained network model, the attention map A is obtained after step (2.1); the average A_aver of the M attention maps, which indicates the position of the object, is calculated by the following formula:

A_aver = (1/M) Σ_{k=1}^{M} A_k
according to A_aver, the object area indicated by A_aver is cropped from the original image following the attention-cropping data-expansion step described in step (2.4); this area is the localized fine-grained recognition area; the area is enlarged by bilinear interpolation, and its features are extracted with the same network structure to obtain the fine-grained recognition area features for final class prediction.
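A minimal numpy sketch of the localization in claim 8: averaging the M attention maps into A_aver and taking the bounding box of the indicated region; the maps, threshold, and stand-in image are illustrative assumptions:

```python
import numpy as np
rng = np.random.default_rng(5)

M, H, W = 4, 8, 8
A = rng.random((M, H, W))            # trained attention maps A_1..A_M
A_aver = A.mean(axis=0)              # A_aver = (1/M) * sum_k A_k

# locate the fine-grained region: min-max normalize A_aver, threshold it,
# and take the bounding box of the activated pixels
norm = (A_aver - A_aver.min()) / (A_aver.max() - A_aver.min())
ys, xs = np.nonzero(norm > 0.5)
box = (ys.min(), xs.min(), ys.max() + 1, xs.max() + 1)
crop = np.ones((H, W))[box[0]:box[2], box[1]:box[3]]   # stand-in image crop
```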
9. The fine grain identification method according to claim 8, wherein in the step (4), the attention maps are ranked according to an importance ranking algorithm, and the selection of the most discriminative region to participate in the category prediction specifically comprises:
using the trained network model, the attention map A is obtained after step (2.1); the object area indicated by each A_k is cropped from the original image and enlarged by bilinear interpolation, the features of each area are extracted with the same network structure, and from these features the probabilities Q_1, Q_2, Q_3, ..., Q_M that each area belongs to the ground-truth class are computed; the area corresponding to the maximum ground-truth-class probability Q_k is regarded as the anchor node with the highest importance; the geometric center of each area is calculated, all areas whose geometric centers lie within a distance margin of the anchor node's center are selected, and their corresponding attention maps A_k, A_l, ..., A_t are averaged to obtain A_aver; the object region indicated by A_aver is cropped from the original image as the most discriminative region, enlarged by bilinear interpolation, and its features are extracted with the same network structure to obtain the most discriminative region features for final class prediction.
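The importance-ranking selection in this claim can be sketched as follows; the probabilities Q, the margin value, and the attention-weighted center definition are illustrative assumptions (the patent does not specify how the geometric center is computed):

```python
import numpy as np
rng = np.random.default_rng(6)

M, H, W = 6, 8, 8
A = rng.random((M, H, W))        # attention maps, one per candidate region
Q = rng.random(M)                # assumed ground-truth-class probability per region
margin = 3.0                     # assumed distance threshold

def center(a):
    # attention-weighted geometric center of one map (an assumption)
    ys, xs = np.mgrid[0:a.shape[0], 0:a.shape[1]]
    w = a / a.sum()
    return np.array([(ys * w).sum(), (xs * w).sum()])

anchor = int(np.argmax(Q))       # anchor node: region with maximum Q_k
centers = np.stack([center(A[k]) for k in range(M)])
dist = np.linalg.norm(centers - centers[anchor], axis=1)
selected = dist < margin         # all maps whose center lies within `margin`
A_aver = A[selected].mean(axis=0)   # average the selected attention maps
```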
CN202111173394.7A 2021-10-08 2021-10-08 Fine granularity identification method based on attention-seeking diagram ordering Active CN113936145B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111173394.7A CN113936145B (en) 2021-10-08 2021-10-08 Fine granularity identification method based on attention-seeking diagram ordering


Publications (2)

Publication Number Publication Date
CN113936145A true CN113936145A (en) 2022-01-14
CN113936145B CN113936145B (en) 2024-06-11


Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200160124A1 (en) * 2017-07-19 2020-05-21 Microsoft Technology Licensing, Llc Fine-grained image recognition
US20210012146A1 (en) * 2019-07-12 2021-01-14 Wuyi University Method and apparatus for multi-scale sar image recognition based on attention mechanism
CN112686242A (en) * 2020-12-29 2021-04-20 昆明理工大学 Fine-grained image classification method based on multilayer focusing attention network
CN112699902A (en) * 2021-01-11 2021-04-23 福州大学 Fine-grained sensitive image detection method based on bilinear attention pooling mechanism
CN112949655A (en) * 2021-03-01 2021-06-11 南京航空航天大学 Fine-grained image recognition method combined with attention mixed cutting


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
WEI SUN等: "A Multi-Feature Learning Model with Enhanced Local Attention for Vehicle Re-Identification", 《TECH SCIENCE PRESS》, 24 August 2021 (2021-08-24) *



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant