CN113936145B - Fine-grained recognition method based on attention map ranking - Google Patents


Publication number
CN113936145B
CN113936145B · Application CN202111173394.7A
Authority
CN
China
Prior art keywords
attention
map
area
fine
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111173394.7A
Other languages
Chinese (zh)
Other versions
CN113936145A (en)
Inventor
张小瑞
王营营
孙伟
宋爱国
刘青山
张开华
Current Assignee
Nanjing University of Information Science and Technology
Original Assignee
Nanjing University of Information Science and Technology
Priority date
Filing date
Publication date
Application filed by Nanjing University of Information Science and Technology filed Critical Nanjing University of Information Science and Technology
Priority to CN202111173394.7A priority Critical patent/CN113936145B/en
Publication of CN113936145A publication Critical patent/CN113936145A/en
Application granted granted Critical
Publication of CN113936145B publication Critical patent/CN113936145B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06N — COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 — Computing arrangements based on biological models
    • G06N 3/02 — Neural networks
    • G06N 3/04 — Architecture, e.g. interconnection topology
    • G06N 3/045 — Combinations of networks
    • G06N 3/08 — Learning methods
    • G06T — IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 3/00 — Geometric image transformations in the plane of the image
    • G06T 3/40 — Scaling of whole images or parts thereof, e.g. expanding or contracting
    • G06T 3/4007 — Scaling of whole images or parts thereof based on interpolation, e.g. bilinear interpolation

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The invention discloses a fine-grained recognition method based on attention map ranking, belonging to the technical fields of computer vision and pattern recognition. First, original-image features are acquired: the output of the third convolutional layer is processed by a 3×3 convolution Conv, global average pooling GAP, and global max pooling GMP, and the results are passed through a fully connected layer to obtain multi-scale original-image features. Weakly supervised attention learning is then performed. Next, localization and refinement locate the fine-grained recognition region with a bounding box and extract its features. The attention maps are then ranked by an importance ranking algorithm, the most discriminative region is selected with a bounding box, and its features are extracted; the importance ranking of the attention maps reinforces learning of the most discriminative region. Finally, the features of the three levels, namely the original image, the localized fine-grained recognition region, and the selected most discriminative region, are concatenated.

Description

Fine-grained recognition method based on attention map ranking
Technical Field
The invention relates to a fine-grained recognition method based on attention map ranking, and belongs to the technical fields of computer vision and pattern recognition.
Background
Fine-grained image recognition categorizes the subclasses of a given class: not only coarse classes such as flowers, birds, and dogs, but also the different subclasses of dogs, for example Husky, Samoyed, and Golden Retriever. Such subclasses often share a similar overall appearance and must be distinguished by local details, and these details appear at different positions in the image depending on the pose of the target, so fine-grained image recognition is harder than traditional image recognition.
The fine-grained image recognition task has long been a challenge in computer vision, mainly for the following reasons: (1) high intra-class variance: objects of the same category often exhibit markedly different poses; (2) low inter-class variance: objects of different classes can be nearly identical apart from minor differences, such as the head color or beak shape of a bird; (3) limited training data: labeling fine-grained categories usually requires substantial expertise and annotation time, so fine-grained recognition datasets are typically small. For these reasons, it is difficult to obtain accurate classification results with an ordinary coarse-grained convolutional neural network (CNN) alone.
To distinguish different subclasses, for example different species of birds, it is necessary not only to extract features from the whole picture but also to pick out discriminative local regions, such as the head, beak, or feet of the bird, and to use their features to support the final class decision. Background content such as flowers and grass is unimportant for the class decision: different birds perch on trees and grassland alike, so tree and grass information cannot play a decisive role in bird identification. Introducing an attention mechanism into image recognition is therefore very effective, because it focuses the deep learning model on discriminative local regions. Since the discriminative local differences in fine-grained recognition are relatively subtle, intermediate-layer features are used for classification: compared with high-level features they have higher resolution and contain more positional and detail information, while avoiding the low semantics and high noise of low-level features. In addition, multi-scale information obtained through convolution, global average pooling, and global max pooling benefits fine-grained recognition tasks in which only subtle local differences exist.
Labeling fine-grained categories usually requires domain experts to spend considerable annotation time, so fine-grained recognition datasets are typically small, which makes data augmentation particularly necessary. Conventional augmentation crops the picture randomly, which easily yields background regions or incomplete part regions; such crops amount to noise, and when the object to be recognized is small, even more noise is introduced.
Disclosure of Invention
To address these problems, the invention provides a fine-grained recognition method based on attention map ranking, by which features are extracted from three levels, namely the original image, the localized fine-grained recognition region, and the selected most discriminative region, and category prediction is performed, improving the accuracy of fine-grained recognition.
The technical scheme of the invention is as follows:
To achieve this purpose, the invention provides a fine-grained recognition method based on attention map ranking, comprising the following steps:
(1) Acquiring original-image features;
(2) Performing weakly supervised attention learning;
(3) Localization and refinement: locating the fine-grained recognition region with a bounding box and extracting its features;
(4) Ranking the attention maps with an importance ranking algorithm and selecting the most discriminative region to participate in category prediction;
(5) Concatenating the features of the three levels, namely the original image, the localized fine-grained recognition region, and the selected most discriminative region, for the final prediction.
Further, in step (1), acquiring the original-image features specifically comprises:
Extracting features of the training-set images with the first three convolutional layers of the convolutional neural network Inception v3; processing the output X 3 of the third convolutional layer separately with a 3×3 convolution Conv, global average pooling GAP, and global max pooling GMP; concatenating the three resulting features; applying batch normalization (Batch Normalization) to the concatenated features to speed up training of the convolutional network, and obtaining the feature map of the image through a fully connected layer; and resizing the obtained feature maps to the same size by bilinear interpolation, thereby obtaining the original-image features used for the final category prediction.
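The multi-scale fusion step above can be sketched in numpy as follows. This is a minimal illustration only: the toy random 3×3 kernel, the pooling of the convolution branch before concatenation, and all tensor shapes are assumptions for demonstration, not the patent's exact Inception v3 configuration.

```python
import numpy as np

def multiscale_features(x3, rng=None):
    """Sketch of the multi-scale step: the third-stage output X3 (H, W, C)
    is processed by a 3x3 convolution, global average pooling (GAP), and
    global max pooling (GMP); the three results are concatenated."""
    H, W, C = x3.shape
    rng = rng or np.random.default_rng(0)
    kernel = rng.standard_normal((3, 3))   # toy stand-in for the learned Conv weights
    pad = np.pad(x3, ((1, 1), (1, 1), (0, 0)))
    conv = np.zeros_like(x3)
    for i in range(H):
        for j in range(W):
            # apply the same 3x3 kernel to every channel of the window
            conv[i, j] = np.tensordot(pad[i:i+3, j:j+3], kernel, axes=([0, 1], [0, 1]))
    gap = x3.mean(axis=(0, 1))             # global average pooling -> (C,)
    gmp = x3.max(axis=(0, 1))              # global max pooling     -> (C,)
    # pool the conv branch too, then cascade the three branches into one vector
    return np.concatenate([conv.mean(axis=(0, 1)), gap, gmp])
```

In the patent the concatenated vector would then pass through batch normalization and a fully connected layer; those learned stages are omitted here.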
Further, in step (2), performing the weakly supervised attention learning comprises:
(2.1) obtaining the feature map and the attention maps;
(2.2) bilinear attention pooling;
(2.3) attention regularization;
(2.4) attention-guided data augmentation during training, including generation of the augmentation map, attention cropping, and attention dropping.
Further, in step (2.1), obtaining the feature map and the attention maps specifically comprises:
Extracting features of the training-set images with a convolutional neural network to obtain a feature map F, where F ∈ R H×W×N, R denotes the real space, H and W denote the height and width of the feature map, and N denotes its number of channels. The distribution of the object parts is represented by attention maps A ∈ R H×W×M, where M denotes the number of attention maps A. A is obtained from F by the following formula:
A = (A 1, A 2, …, A M) = f(F)
where F denotes the feature map, f(F) denotes a convolution applied to the feature map, k denotes a counter with k ∈ [1, M], and A k denotes the k-th attention map.
Further, in step (2.2), the bilinear attention pooling specifically comprises:
After the attention maps A are obtained, features are extracted from these parts by bilinear attention pooling (BAP): the feature map F is multiplied element-wise by each attention map to generate the part feature maps, as shown in the following formula:
Fk=Ak⊙F (k=1,2,…,M)
where F k ∈ R H×W×N denotes the k-th part feature map and ⊙ denotes element-wise multiplication;
The discriminative local features are further extracted by a feature-extraction operation, giving the k-th further-extracted part feature f k ∈ R 1×N, as shown in the following formula:
fk=g(Fk)
where f k denotes the k-th further-extracted part feature and g(F k) denotes the feature-extraction operation applied to the k-th part feature map F k;
The overall feature of the object is represented by the part feature matrix P ∈ R M×N, formed by stacking these further-extracted part features:
P = (f 1, f 2, …, f M) T
where M denotes the number of attention maps and N denotes the number of feature-map channels.
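The BAP step above can be sketched directly from the formulas. One assumption is labeled in the code: the pooling g(·) is taken to be global average pooling, which the patent does not fix explicitly.

```python
import numpy as np

def bilinear_attention_pooling(F, A):
    """Sketch of bilinear attention pooling (BAP).
    F: feature map of shape (H, W, N); A: attention maps of shape (H, W, M).
    Each part feature map F_k = A_k (.) F is reduced by g(.) to a part
    feature f_k in R^{1xN}; stacking the f_k gives P in R^{MxN}."""
    H, W, N = F.shape
    M = A.shape[2]
    P = np.zeros((M, N))
    for k in range(M):
        Fk = A[:, :, k:k+1] * F        # element-wise product A_k with F -> (H, W, N)
        P[k] = Fk.mean(axis=(0, 1))    # g(F_k): assumed global average pooling
    return P
```

The rows of the returned matrix are the part features f 1 … f M described in the formulas.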
Further, in step (2.3), the attention regularization specifically comprises:
For each fine-grained category, the k-th attention map A k is expected to represent the same k-th part of the object. Differences among the further-extracted part features belonging to the same part are penalized: the k-th further-extracted part feature f k is drawn close to the k-th global feature center c k ∈ R 1×N, and the k-th attention map A k is activated at the same part of the object. The attention regularization loss L A is shown in the following formula:
L A = Σ k=1 M ‖f k − c k‖ 2 2
The update formula of c k is as follows:
ck←ck+β(fk−ck)
where M denotes the number of attention maps, k denotes a counter with k ∈ [1, M], f k denotes the k-th further-extracted part feature, c k denotes the k-th global feature center, ‖f k − c k‖ 2 2 denotes the squared distance between the k-th further-extracted part feature f k and the k-th global feature center, and β denotes the update rate of c k.
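The regularization loss and the moving-average center update above are simple enough to sketch exactly; the part-feature matrix P stacks the f k row-wise as in step (2.2).

```python
import numpy as np

def attention_reg_loss(P, centers):
    """L_A = sum over k of ||f_k - c_k||_2^2, where row k of P is f_k
    and row k of `centers` is the global feature center c_k."""
    return np.sum((P - centers) ** 2)

def update_centers(P, centers, beta=0.05):
    """Moving-average update c_k <- c_k + beta * (f_k - c_k)."""
    return centers + beta * (P - centers)
```

Pulling each f k toward its center c k encourages attention map k to fire on the same object part across images of a category.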
Further, in step (2.4), the attention-guided data augmentation during training includes generation of the augmentation map, attention cropping, and attention dropping, specifically:
The augmentation map is generated as follows:
When the object is small, a large portion of the image is background, and random data augmentation is inefficient in this case. For each training image an attention map is randomly selected to guide the augmentation process and is normalized into an augmentation map A k *, which can be expressed by the following formula:
A k * = (A k − min(A k)) / (max(A k) − min(A k))
where A k * ∈ R H×W denotes the augmentation map of the k-th attention map, R denotes the real space, H and W denote the height and width of the augmentation map, A k denotes the k-th attention map, and min(A k) and max(A k) denote the minimum and maximum pixel values in the k-th attention map, respectively;
The attention cropping proceeds as follows:
First, the crop mask of every pixel of A k * whose value exceeds a manually set crop threshold θ c ∈ [0,1] is set to 1, and the crop mask of all other pixels is set to 0, as shown in the following formula:
C k(i,j) = 1 if A k *(i,j) > θ c, otherwise 0
where (i,j) denotes the pixel with horizontal coordinate i and vertical coordinate j, C k(i,j) denotes the crop mask of pixel (i,j) obtained from the k-th augmentation map, and A k *(i,j) denotes the value of pixel (i,j) in the k-th augmentation map;
A bounding box B k determined from the k-th augmentation map covers the region where C k(i,j) is positive; the region enclosed by B k is cropped from the original image and enlarged as augmented input data, from which finer-grained features are extracted;
The attention dropping proceeds as follows:
The drop mask of every pixel of A k * whose value exceeds a manually set drop threshold θ d ∈ [0,1] is set to 0, and the drop mask of all other pixels is set to 1, as shown in the following formula:
D k(i,j) = 0 if A k *(i,j) > θ d, otherwise 1
where D k(i,j) denotes the drop mask of pixel (i,j) obtained from the k-th augmentation map.
Further, in step (3), the localization and refinement, locating the fine-grained recognition region with a bounding box and extracting its features, specifically comprises:
Obtaining the attention maps A with the trained network model after step (2.1); the average A aver of the M attention maps, which indicates the object location, is calculated by:
A aver = (1/M) Σ k=1 M A k
According to A aver and the attention-cropping procedure of step (2.4), the object region indicated by A aver is cropped from the original image; this region is the localized fine-grained recognition region. The region is enlarged by bilinear interpolation and its features are extracted with the same network structure, yielding the fine-grained recognition region features used for the final category prediction.
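The two operations used repeatedly in this localization step, averaging the attention maps and enlarging a cropped region by bilinear interpolation, can be sketched as follows (a plain single-channel resize; real images would apply it per channel):

```python
import numpy as np

def average_attention(A):
    """A_aver = (1/M) * sum_k A_k over the M attention maps, A of shape (H, W, M)."""
    return A.mean(axis=2)

def bilinear_resize(img, out_h, out_w):
    """Enlarge a 2-D array with bilinear interpolation."""
    H, W = img.shape
    ys = np.linspace(0, H - 1, out_h)
    xs = np.linspace(0, W - 1, out_w)
    out = np.empty((out_h, out_w))
    for a, y in enumerate(ys):
        y0 = int(np.floor(y)); y1 = min(y0 + 1, H - 1); wy = y - y0
        for b, x in enumerate(xs):
            x0 = int(np.floor(x)); x1 = min(x0 + 1, W - 1); wx = x - x0
            # weighted blend of the four neighboring pixels
            out[a, b] = ((1 - wy) * (1 - wx) * img[y0, x0]
                         + (1 - wy) * wx * img[y0, x1]
                         + wy * (1 - wx) * img[y1, x0]
                         + wy * wx * img[y1, x1])
    return out
```

The enlarged crop is then fed back through the same network structure to extract the refined region features.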
Further, in step (4), ranking the attention maps with the importance ranking algorithm and selecting the most discriminative region to participate in category prediction specifically comprises:
Obtaining the attention maps A with the trained network model after step (2.1). For each A k, the object region indicated by A k is cropped from the original image, enlarged by bilinear interpolation, and its features are extracted with the same network structure; from these features the probabilities Q 1, Q 2, Q 3, …, Q M that each region belongs to the ground-truth class are computed. The region with the largest ground-truth probability Q k, corresponding to A k, is taken as the anchor node. The coordinates of the geometric center of every region are calculated, and all regions whose geometric center lies within a margin of the anchor node's geometric center are selected; their corresponding attention maps A k, A l, …, A t are averaged to obtain A aver. The object region indicated by A aver is cropped from the original image, enlarged by bilinear interpolation, and its features are extracted with the same network structure, yielding the most discriminative region features used for the final category prediction.
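The selection logic of the importance ranking step above can be sketched as follows. This is a simplified sketch: it assumes the per-region ground-truth probabilities and the geometric centers have already been computed by the network, and only performs the anchor selection and margin filtering.

```python
import numpy as np

def select_discriminative_maps(centers, probs, margin):
    """Sketch of the attention-map importance ranking step.
    centers: (M, 2) geometric centers of the M attention regions;
    probs:   (M,) ground-truth class probability predicted from each region.
    The region with the highest probability becomes the anchor node; every
    region whose center lies within `margin` of the anchor's center is kept.
    Returns the anchor index and the indices of the kept maps (their
    attention maps would then be averaged into A_aver)."""
    anchor = int(np.argmax(probs))
    dist = np.linalg.norm(centers - centers[anchor], axis=1)
    keep = np.where(dist < margin)[0]
    return anchor, keep
```

Averaging only the maps near the anchor concentrates A aver on the single most discriminative part rather than on the whole object.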
Advantageous effects
1. The invention provides an attention-map importance ranking algorithm that ranks the attention maps by importance, locates the most discriminative region of the original image according to that importance, and reinforces learning of this region, solving the problem that the strong randomness of random-cropping augmentation introduces too much unnecessary noise;
2. When extracting the original-image features, the invention uses only the first three convolutional layers; compared with high-level features, the extracted intermediate-layer features have higher resolution and contain more positional and detail information, while avoiding the low semantics and high noise of low-level features. The output of the third convolutional layer is then processed with a 3×3 convolution Conv, global average pooling GAP, and global max pooling GMP, so that multi-scale information is obtained, benefiting fine-grained recognition tasks in which only subtle local differences exist;
3. The invention uses attention cropping and attention dropping, applying the idea of reinforcement learning to drive the network to extract more discriminative features.
Drawings
FIG. 1 is a flow chart of the fine-grained recognition method based on attention map ranking according to the present invention;
FIG. 2 is an overall framework diagram of the fine-grained recognition method based on attention map ranking according to the present invention;
FIG. 3 is a schematic diagram of the weakly supervised attention learning process of FIG. 2;
FIG. 4 is a schematic diagram of the bilinear attention pooling process of FIG. 2;
FIG. 5 is a schematic diagram of the attention-map importance ranking algorithm of FIG. 2.
Detailed Description
The application is described in further detail below with reference to the drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the application and are not limiting of the application. It should be noted that, for convenience of description, only the portions related to the present application are shown in the drawings.
It should be noted that, without conflict, the embodiments of the present application and features of the embodiments may be combined with each other. The application will be described in detail below with reference to the drawings in connection with embodiments.
In this embodiment, a fine-grained recognition method based on attention map ranking is shown in FIG. 1, and its overall framework is shown in FIG. 2; it comprises the following steps:
(1) Acquiring original-image features;
The features of the training-set images are extracted with the first three convolutional layers of the convolutional neural network Inception v3; the output X 3 of the third convolutional layer is then processed separately with a 3×3 convolution Conv, global average pooling GAP, and global max pooling GMP, as shown in FIG. 3, and the three resulting features are concatenated. Batch normalization (Batch Normalization) is then applied to the concatenated features to speed up training of the convolutional network, and the feature map of the image is obtained through a fully connected layer. The obtained feature maps are resized to the same size by bilinear interpolation, yielding the original-image features used for the final category prediction.
(2) Performing weakly supervised attention learning;
(2.1) Obtaining the feature map and the attention maps:
The features of the training-set images are extracted with a convolutional neural network to obtain a feature map F, F ∈ R H×W×N; the distribution of the object parts is represented by attention maps A ∈ R H×W×M, where R denotes the real space, H and W denote height and width, N denotes the number of feature-map channels, and M denotes the number of attention maps. A is obtained from F by the following formula:
A = (A 1, A 2, …, A M) = f(F)
where F denotes the feature map, f(F) denotes a convolution applied to the feature map, k denotes a counter with k ∈ [1, M], and A k denotes the k-th attention map.
(2.2) Bilinear attention pooling:
After the attention maps A are obtained, features are extracted from these parts by bilinear attention pooling (Bilinear Attention Pooling, BAP); FIG. 4 shows the BAP process. The feature map F is multiplied element-wise by each attention map to generate the part feature maps, as shown in the following formula:
Fk=Ak⊙F (k=1,2,…,M)
where F k ∈ R H×W×N denotes the k-th part feature map and ⊙ denotes element-wise multiplication.
The discriminative local features are further extracted by a feature-extraction operation, giving the k-th further-extracted part feature f k ∈ R 1×N, as shown in the following formula:
fk=g(Fk)
where f k denotes the k-th further-extracted part feature and g(F k) denotes the feature-extraction operation applied to the k-th part feature map F k.
The overall feature of the object is represented by the part feature matrix P ∈ R M×N, formed by stacking these further-extracted part features:
P = (f 1, f 2, …, f M) T
where M denotes the number of attention maps and N denotes the number of feature-map channels.
(2.3) Attention regularization:
For each fine-grained category, the k-th attention map A k is expected to represent the same k-th part of the object, and the invention proposes an attention regularization loss to weakly supervise the attention learning process. Differences among the further-extracted part features belonging to the same object part are penalized: the k-th further-extracted part feature f k is drawn close to the k-th global feature center c k ∈ R 1×N, and the k-th attention map A k is activated at the same part of the object. The attention regularization loss L A is shown in the following formula:
L A = Σ k=1 M ‖f k − c k‖ 2 2
The update formula of c k is as follows:
ck←ck+β(fk−ck)
where M denotes the number of attention maps, k denotes a counter with k ∈ [1, M], f k denotes the k-th further-extracted part feature, c k denotes the k-th global feature center, ‖f k − c k‖ 2 2 denotes the squared distance between the k-th further-extracted part feature f k and the k-th global feature center, and β denotes the update rate of c k.
(2.4) Attention-guided data augmentation during training, including generation of the augmentation map, attention cropping, and attention dropping.
The augmentation map is generated as follows:
When the object is small, a large portion of the image is background, and random data augmentation is inefficient in this case. With the attention maps, the data can be augmented more effectively. For each training image an attention map is randomly selected to guide the augmentation process, and the k-th attention map is normalized into an augmentation map A k *:
A k * = (A k − min(A k)) / (max(A k) − min(A k))
where A k * ∈ R H×W, R denotes the real space, H and W denote height and width, A k denotes the k-th attention map, and min(A k) and max(A k) denote the minimum and maximum pixel values in the k-th attention map, respectively.
The attention cropping proceeds as follows:
With the augmentation map, more detailed local features are extracted by enlarging the region of the original image that corresponds to the augmentation map. Specifically, the crop mask of every pixel of A k * whose value exceeds a manually set crop threshold θ c ∈ [0,1] is first set to 1, and the crop mask of all other pixels is set to 0, as shown in the following formula:
C k(i,j) = 1 if A k *(i,j) > θ c, otherwise 0
where (i,j) denotes the pixel with horizontal coordinate i and vertical coordinate j, C k(i,j) denotes the crop mask of pixel (i,j) obtained from the k-th augmentation map, and A k *(i,j) denotes the value of pixel (i,j) in the k-th augmentation map.
A bounding box B k determined from the k-th augmentation map covers the region where C k(i,j) is positive, and the region enclosed by B k is cropped from the original image and enlarged as augmented input data, as shown in FIG. 3. As the proportion of the object parts increases, the object can be seen better and finer-grained features are extracted.
The attention dropping proceeds as follows:
The attention regularization loss supervises the k-th attention map A k to represent the k-th part of the same object, but different attention maps may focus on similar parts. To encourage the attention maps to represent multiple distinct parts of the object, attention dropping is proposed. Specifically, the drop mask of every pixel of A k * whose value exceeds a manually set drop threshold θ d ∈ [0,1] is set to 0, and the drop mask of all other pixels is set to 1, as shown in the following formula:
D k(i,j) = 0 if A k *(i,j) > θ d, otherwise 1
where D k(i,j) denotes the drop mask of pixel (i,j) obtained from the k-th augmentation map.
Masking the original image with D k(i,j) removes the k-th part region; because that part is eliminated from the image, the network is encouraged to extract other discriminative parts, which means the object can also be seen better, improving the robustness of classification and the accuracy of localization.
(3) Localization and refinement: locating the fine-grained recognition region with a bounding box and extracting its features:
The attention maps A are obtained with the trained network model after step (2.1); the average A aver of the M attention maps, which indicates the object location, is calculated by:
A aver = (1/M) Σ k=1 M A k
According to A aver and the attention-cropping procedure of step (2.4), the object region indicated by A aver is cropped from the original image; this region is the localized fine-grained recognition region. The region is enlarged by bilinear interpolation and its features are extracted with the same network structure, yielding the fine-grained recognition region features used for the final category prediction.
(4) The attention maps are ranked by the importance ranking algorithm and the most discriminative part is selected to participate in category prediction; FIG. 5 is a schematic diagram of the attention-map importance ranking algorithm.
The attention maps A are obtained with the trained network model after step (2). Following the attention-cropping procedure of step (2.4), the object region indicated by each A k is cropped from the original image, enlarged by bilinear interpolation, and its features are extracted with the same network structure; from these features the probabilities Q 1, Q 2, Q 3, …, Q M that each region belongs to the ground-truth class are computed. The region with the largest ground-truth probability Q k, corresponding to A k, is taken as the anchor node; its importance is the highest. The coordinates of the geometric center of every region are calculated, and all regions whose geometric center lies within a margin of the anchor node's geometric center are selected; their corresponding attention maps A k, A l, …, A t are averaged to obtain A aver. Following the attention-cropping procedure of step (2.4), the object region indicated by A aver, which is the most discriminative region, is cropped from the original image, enlarged by bilinear interpolation, and its features are extracted with the same network structure, yielding the most discriminative region features used for the final category prediction;
(5) The features of the three levels, namely the original image, the localized fine-grained recognition region, and the selected most discriminative region, are concatenated (concat) for the final prediction.
The foregoing is only a preferred embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions easily contemplated by those skilled in the art within the technical scope of the present invention disclosed in the embodiments of the present invention should be covered by the present invention. Therefore, the protection scope of the present invention should be subject to the protection scope of the claims.

Claims (7)

1. A fine-grained recognition method based on attention map ranking, characterized by comprising the following steps:
(1) Acquiring original image features, wherein the acquiring of the original image features specifically comprises:
Extracting features of the images in the training set using the first three convolution layers of the convolutional neural network Inception v; then processing the output X3 of the third convolution layer separately with a 3×3 convolution Conv, global maximum pooling GMP, and global average pooling GAP; cascading the three resulting features to obtain the concatenated feature; then applying batch normalization (Batch Normalization) to the cascaded feature to accelerate training of the convolutional network, and obtaining the feature map of the image through fully connected processing; adjusting the obtained feature maps to the same size by bilinear interpolation, thereby obtaining the original-image features used for the final category prediction;
(2) Performing weak supervised attention learning, the performing weak supervised attention learning comprising:
(2.1) obtaining a feature map and an attention map;
(2.2) bilinear attention pooling;
(2.3) attention regularization;
(2.4) attention-map-guided data augmentation during training, including generation of enhancement maps, attention cropping, and attention dropping;
(3) Positioning and refining, namely locating the fine-grained identification region through a bounding box and extracting the features of the region;
(4) Ordering the attention maps according to an importance ranking algorithm and selecting the most discriminative region to participate in the category prediction;
(5) Cascading the features of the three levels, the original image, the located fine-grained identification region, and the selected most discriminative region, for the final prediction.
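Step (1) of claim 1 can be sketched as follows. The tensor shapes and toy values are illustrative assumptions, and the 3×3-convolution branch is represented only schematically (by a pooled copy of X3) so the sketch stays self-contained; it is not the patented implementation.

```python
import numpy as np

def gap(x):
    # global average pooling over the spatial dimensions: (H, W, C) -> (C,)
    return x.mean(axis=(0, 1))

def gmp(x):
    # global max pooling over the spatial dimensions: (H, W, C) -> (C,)
    return x.max(axis=(0, 1))

rng = np.random.default_rng(0)
X3 = rng.standard_normal((8, 8, 16))     # toy output of the third convolution layer

# stand-in for the 3x3-convolution branch (an assumption: pooled to a vector)
conv_branch = gap(X3)

# cascade (concatenate) the three branch features, then batch-normalize
feat = np.concatenate([conv_branch, gmp(X3), gap(X3)])
feat_bn = (feat - feat.mean()) / np.sqrt(feat.var() + 1e-5)
print(feat_bn.shape)   # (48,)
```

A fully connected layer and bilinear resizing would follow in the actual pipeline.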
2. The fine-grained identification method according to claim 1, wherein in step (2.1) the obtaining of the feature map and the attention maps is specifically:
Extracting features of the images in the training set with a convolutional neural network to obtain a feature map F, where F ∈ R^(H×W×N), R denotes the dimension space, H and W denote the height and width of the feature map, and N denotes the number of channels of the feature map; the distribution of the parts of the object is represented by the attention maps A ∈ R^(H×W×M), where M denotes the number of attention maps; A is obtained from F by the following formula:
A = f(F) = {A1, A2, ..., AM}
where F represents the feature map, f(F) represents a convolution operation on the feature map, k represents a counter with k ∈ [1, M], and Ak represents the kth attention map.
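The attention-map generation of claim 2 can be sketched as a 1×1 convolution from N feature channels to M attention channels. The kernel, the ReLU non-negativity, and all shapes are illustrative assumptions, not details fixed by the patent.

```python
import numpy as np

rng = np.random.default_rng(1)
H, W, N, M = 8, 8, 16, 4
F = rng.standard_normal((H, W, N))   # feature map F in R^(H×W×N)
Wc = rng.standard_normal((N, M))     # hypothetical 1x1-convolution kernel f(.)

# A = f(F): a 1x1 convolution mapping N feature channels to M attention maps,
# with ReLU to keep attention values non-negative (an assumption)
A = np.maximum(F @ Wc, 0.0)          # A in R^(H×W×M); A[..., k-1] is the kth map
print(A.shape)                       # (8, 8, 4)
```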
3. The fine-grained identification method according to claim 2, wherein in step (2.2) the bilinear attention pooling is specifically:
After the attention maps A are obtained, features are extracted from the corresponding parts by bilinear attention pooling BAP: the feature map F is multiplied element-wise by each attention map to generate a part feature map, as shown in the following formula:
Fk = Ak ⊙ F (k = 1, 2, ..., M)
where Fk ∈ R^(H×W×N) represents the kth part feature map and ⊙ represents element-wise multiplication;
The discriminative local features are further extracted through a feature extraction operation, obtaining the kth further-extracted part feature fk ∈ R^(1×N), as shown in the following formula:
fk = g(Fk)
where fk represents the kth further-extracted part feature and g(Fk) represents the feature extraction operation performed on the kth part feature map Fk;
The overall feature of the object is represented by a part feature matrix P ∈ R^(M×N), formed by stacking the further-extracted part features; the part feature matrix can be expressed by the following formula:
P = (f1; f2; ...; fM)
where M represents the number of attention maps and N represents the number of feature map channels.
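Bilinear attention pooling as described in claim 3 can be sketched as follows. Global average pooling is assumed for the feature extraction operation g(·), and the toy shapes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
H, W, N, M = 8, 8, 16, 4
F = rng.standard_normal((H, W, N))                   # feature map
A = np.maximum(rng.standard_normal((H, W, M)), 0.0)  # toy attention maps

def g(Fk):
    # feature extraction g(.); global average pooling is assumed here
    return Fk.mean(axis=(0, 1))                      # length-N vector (R^(1×N))

# BAP: Fk = Ak (element-wise) F, fk = g(Fk); P stacks the fk row-wise
P = np.stack([g(A[:, :, k:k + 1] * F) for k in range(M)])  # P in R^(M×N)
print(P.shape)                                       # (4, 16)
```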
4. The fine-grained identification method according to claim 3, wherein in step (2.3) the attention regularization is specifically:
For each fine-grained category, the kth attention map Ak is expected to represent the kth part of the object at the same location; differences among the further-extracted part features belonging to the same part are penalized, so that the kth further-extracted part feature fk is drawn close to the kth global feature center ck ∈ R^(1×N) and the kth attention map Ak is activated on the same part of the object; the attention regularization loss LA is given by the following formula:
LA = Σ (k=1 to M) ‖fk − ck‖₂²
The update formula for ck is as follows:
ck ← ck + β(fk − ck)
where M represents the number of attention maps, k represents a counter with k ∈ [1, M], fk represents the kth further-extracted part feature, ck represents the kth global feature center, ‖fk − ck‖₂² represents the squared difference between the kth further-extracted part feature fk and the kth global feature center, and β represents the update rate of ck.
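The regularization loss and the moving-average center update of claim 4 can be sketched directly. The dimensions, the zero initialization of the centers, and the value of β are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
M, N, beta = 4, 16, 0.05
f = rng.standard_normal((M, N))   # further-extracted part features f_k
c = np.zeros((M, N))              # global feature centers c_k (assumed zero-initialized)

# attention regularization loss: LA = sum_k ||f_k - c_k||_2^2
LA = float(np.sum((f - c) ** 2))

# moving-average update of the centers: c_k <- c_k + beta * (f_k - c_k)
c = c + beta * (f - c)
print(LA >= 0.0)                  # True
```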
5. The fine-grained identification method according to claim 4, wherein in step (2.4) the attention-map-guided data augmentation during training comprises generation of enhancement maps, attention cropping, and attention dropping, specifically:
The enhancement-map step of the data augmentation is as follows:
When the object is small, a large portion of the image is background, and random data augmentation is inefficient in this case; therefore, for each training image, one attention map is randomly selected to guide the augmentation process and is normalized into an enhancement map, which can be expressed by the following formula:
A*k = (Ak − min(Ak)) / (max(Ak) − min(Ak))
where A*k ∈ R^(H×W) represents the enhancement map of the kth attention map, R denotes the dimension space, H and W denote the height and width of the enhancement map, Ak represents the kth attention map, and min(Ak) and max(Ak) represent the minimum and maximum pixel values in the kth attention map, respectively;
The attention-cropping step of the data augmentation is as follows:
The cropping mask of each pixel of A*k whose value is greater than a manually set cropping threshold θc ∈ [0, 1] is set to 1, and the cropping masks of the other pixels are set to 0, as shown in the following formula:
Ck(i, j) = 1 if A*k(i, j) > θc, otherwise Ck(i, j) = 0
where (i, j) represents the pixel with coordinates i and j on the horizontal and vertical axes, Ck(i, j) represents the cropping mask of pixel (i, j) obtained from the kth enhancement map, and A*k(i, j) represents the value of pixel (i, j) in the kth enhancement map;
A bounding box Bk determined from the kth enhancement map covers the region where Ck(i, j) is positive; the region enclosed by Bk is enlarged from the original image as augmented input data, from which finer-grained features are extracted;
The attention-dropping step of the data augmentation is as follows:
The dropping mask of each pixel of A*k whose value is greater than a manually set dropping threshold θd ∈ [0, 1] is set to 0, and the dropping masks of the other pixels are set to 1, as shown in the following formula:
Dk(i, j) = 0 if A*k(i, j) > θd, otherwise Dk(i, j) = 1
where Dk(i, j) represents the dropping mask of pixel (i, j) obtained from the kth enhancement map.
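The enhancement-map normalization, cropping mask, dropping mask, and bounding box of claim 5 can be sketched together. The toy attention map and the thresholds are illustrative assumptions.

```python
import numpy as np

def enhancement_map(Ak):
    # min-max normalize an attention map to [0, 1]
    return (Ak - Ak.min()) / (Ak.max() - Ak.min())

def crop_mask(Ek, theta_c):
    # Ck(i,j) = 1 where the enhancement map exceeds the cropping threshold, else 0
    return (Ek > theta_c).astype(int)

def drop_mask(Ek, theta_d):
    # Dk(i,j) = 0 where the enhancement map exceeds the dropping threshold, else 1
    return (Ek <= theta_d).astype(int)

def bounding_box(Ck):
    # smallest box covering all positive entries of the cropping mask
    rows, cols = np.nonzero(Ck)
    return rows.min(), rows.max(), cols.min(), cols.max()

Ak = np.array([[0.0, 8.0, 1.0],
               [2.0, 7.0, 6.0],
               [0.0, 1.0, 0.0]])
Ek = enhancement_map(Ak)        # values in [0, 1]
Ck = crop_mask(Ek, 0.5)
print(bounding_box(Ck))         # (0, 1, 1, 2): rows 0-1, cols 1-2
print(drop_mask(Ek, 0.5))       # high-attention pixels zeroed out
```

The region inside the bounding box would be enlarged as augmented input; the dropping mask erases the same salient region, pushing the network toward other parts.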
6. The fine-grained identification method according to claim 5, wherein in step (3) the positioning and refining locate the fine-grained identification region through a bounding box and extract the features of the region, specifically:
Obtaining the attention maps A with the trained network model after step (2.1); the average Aaver of the M attention maps, which indicates the object location, is calculated by:
Aaver = (1/M) Σ (k=1 to M) Ak
Cutting out the object region indicated by Aaver from the original image according to the attention-cropping data augmentation step in step (2.4), this region being the located fine-grained identification region; amplifying the region by bilinear interpolation; extracting the features of the region with the same network structure; and obtaining the fine-grained identification region features for the final category prediction.
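The averaging of the M attention maps in claim 6 is a one-line operation; the toy shapes below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)
H, W, M = 6, 6, 4
A = np.maximum(rng.standard_normal((H, W, M)), 0.0)  # toy attention maps

# Aaver = (1/M) * sum_k Ak : the mean attention map indicating the object location
A_aver = A.sum(axis=2) / M
print(A_aver.shape)                                  # (6, 6)
```

Aaver would then be thresholded like an enhancement map to crop the object region before bilinear enlargement.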
7. The fine-grained identification method according to claim 6, wherein in step (4) the ordering of the attention maps according to an importance ranking algorithm and the selection of the most discriminative region to participate in the category prediction are specifically:
Obtaining the attention maps A with the trained network model after step (2.1); cutting out the object region indicated by Ak from the original image; amplifying the region by bilinear interpolation; extracting the features of the region with the same network structure; judging from these features the probabilities Q1, Q2, Q3, ..., Qm that the region belongs to the ground-truth class; selecting the region corresponding to the maximum ground-truth probability Qk and regarding it as the anchor node; calculating the coordinates of the geometric center of each region; selecting all regions whose geometric centers lie within the margin of the geometric center of the anchor node, the attention maps corresponding to these regions being Ak, Al, ..., At; averaging these attention maps to obtain Aaver; cutting out the object region indicated by Aaver from the original image; amplifying the region by bilinear interpolation; extracting the features of the region with the same network structure; and obtaining the most discriminative region features for the final category prediction.
Publications (2)

Publication Number Publication Date
CN113936145A CN113936145A (en) 2022-01-14
CN113936145B true CN113936145B (en) 2024-06-11
