CN113936145A - Fine-grained identification method based on attention diagram sorting - Google Patents

Info

Publication number
CN113936145A
CN113936145A
Authority
CN
China
Prior art keywords
attention
map
feature
fine
area
Prior art date
Legal status
Granted
Application number
CN202111173394.7A
Other languages
Chinese (zh)
Other versions
CN113936145B (en)
Inventor
张小瑞
王营营
孙伟
宋爱国
刘青山
张开华
Current Assignee
Nanjing University of Information Science and Technology
Original Assignee
Nanjing University of Information Science and Technology
Priority date
Filing date
Publication date
Application filed by Nanjing University of Information Science and Technology
Priority to CN202111173394.7A
Publication of CN113936145A
Application granted
Publication of CN113936145B
Legal status: Active
Anticipated expiration


Classifications

    • G06N3/045: Computing arrangements based on biological models; neural networks; architectures; combinations of networks
    • G06N3/08: Computing arrangements based on biological models; neural networks; learning methods
    • G06T3/4007: Geometric image transformations; scaling of whole images or parts thereof based on interpolation, e.g. bilinear interpolation

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The invention discloses a fine-grained identification method based on attention map sorting, belonging to the technical field of computer vision and pattern recognition. First, original image features are acquired: the output of the third convolutional layer is processed with a 3 × 3 convolution (Conv), global maximum pooling (GMP) and global average pooling (GAP), and the results are concatenated and passed through a fully connected layer to obtain multi-scale original image features. Weakly supervised attention learning is then performed. Next, locating and refining are carried out: a fine-grained identification region is located through a bounding box and its features are extracted. The attention maps are then sorted by an importance ranking algorithm, the most discriminative region is selected through a bounding box, and its features are extracted; the ranking reinforces learning of the most discriminative region. Finally, the features of the original image, the located fine-grained identification region and the selected most discriminative region are concatenated for prediction.

Description

Fine-grained identification method based on attention map sorting
Technical Field
The invention relates to a fine-grained identification method based on attention map sorting, and belongs to the technical field of computer vision and pattern recognition.
Background
Fine-grained image recognition classifies the subclasses of a given class, for example distinguishing not only coarse classes such as flowers, birds and dogs, but also the subclasses of dogs, such as Husky, Samoyed and Golden Retriever. These subclasses are almost identical in overall appearance and must be distinguished through local details, and the positions of those details in the image vary with the pose of the target, so fine-grained image recognition is more difficult than conventional image recognition.
Fine-grained image recognition has long been a challenge in computer vision, mainly for the following reasons: (1) high intra-class variance: objects belonging to the same category often exhibit significantly different poses; (2) low inter-class variance: objects belonging to different categories can be very similar apart from minor differences, such as head color or beak shape in birds; (3) limited training data: labeling fine-grained classes typically requires considerable expertise and annotation time, so fine-grained recognition datasets are usually small. For these reasons, it is difficult to obtain accurate classification results with existing coarse-grained convolutional neural networks (CNNs) alone.
To distinguish different subclasses, for example different species of birds, it is necessary not only to extract features from the whole picture but also to select discriminative local regions, such as the head, beak and feet of a bird, and use their features to assist the final class judgment. Background information such as flowers and grass is unimportant for judging the category: different birds may perch on trees or lawns, so tree and lawn information plays no decisive role in bird identification. Introducing an attention mechanism into image recognition is therefore very effective, since it lets the deep learning model focus on discriminative local areas. Because the discriminative local differences in fine-grained recognition are subtle, the invention classifies with intermediate-layer features: compared with high-layer features they have higher resolution and contain more position and detail information, while avoiding the low semantics and heavy noise of low-layer features. At the same time, multi-scale information is obtained through convolution, global average pooling and global maximum pooling, which benefits fine-grained recognition tasks whose differences lie only in small local areas.
Labeling fine-grained classes typically requires domain experts to spend substantial annotation time, so fine-grained recognition datasets are usually small, which makes data augmentation particularly necessary. Conventional data augmentation randomly crops the picture, which easily yields crops of background or of incomplete parts; such crops amount to noise, and the smaller the object to be recognized, the more noise is introduced.
Disclosure of Invention
In view of these problems, the invention provides a fine-grained identification method based on attention map sorting, which extracts features at three levels, the original image, the located fine-grained identification region, and the selected most discriminative region, for category prediction, thereby improving the accuracy of fine-grained recognition.
The technical scheme of the invention is as follows:
To achieve the above purpose, the invention provides a fine-grained identification method based on attention map sorting, comprising the following steps:
(1) acquiring original image features;
(2) performing weakly supervised attention learning;
(3) locating and refining: locating a fine-grained identification region through a bounding box and extracting its features;
(4) sorting the attention maps with an importance ranking algorithm and selecting the most discriminative region to participate in category prediction;
(5) concatenating the features of the original image, the located fine-grained identification region and the selected most discriminative region for final prediction.
Further, in step (1), acquiring the original image features specifically comprises:
extracting features of the training-set images with the first three convolutional layers of the convolutional neural network Inception v3, then processing the output X_3 of the third convolutional layer with a 3 × 3 convolution (Conv), global maximum pooling (GMP) and global average pooling (GAP), respectively, giving the three processed features

X_conv = Conv(X_3), X_gmp = GMP(X_3), X_gap = GAP(X_3)

which are concatenated into the cascaded feature

X = concat(X_conv, X_gmp, X_gap)

The cascaded feature then undergoes Batch Normalization to accelerate training of the convolutional network, and a feature map of the image is obtained through a fully connected layer; the feature maps are resized to the same size by bilinear interpolation, yielding the original image features for final class prediction.
Further, in step (2), performing the weakly supervised attention learning comprises:
(2.1) obtaining the feature map and the attention maps;
(2.2) bilinear attention pooling;
(2.3) attention regularization;
(2.4) attention-map-guided data augmentation during training, comprising the enhancement map, attention cropping and attention dropping.
Further, in step (2.1), obtaining the feature map and the attention maps specifically comprises:
extracting features of the training-set images with a convolutional neural network to obtain the feature map F ∈ R^{H×W×N}, where R denotes the real space, H and W are the height and width of the feature map, and N is its number of channels. The distribution of the object's parts is described by the attention maps A ∈ R^{H×W×M}, where M is the number of attention maps. A is obtained from F by

A = f(F), A = (A_1, A_2, …, A_M)

where F is the feature map, f(F) is a convolution operation on F, k is a counter with k ∈ [1, M], and A_k denotes the k-th attention map.
Further, in step (2.2), bilinear attention pooling specifically comprises:
after the attention maps A are obtained, features are extracted from the corresponding parts with bilinear attention pooling (BAP): the feature map F is multiplied element-wise by each attention map to generate the part feature maps,

F_k = A_k ⊙ F (k = 1, 2, …, M)

where F_k ∈ R^{H×W×N} denotes the k-th part feature map and ⊙ denotes element-wise multiplication;
discriminative local features are then further extracted by a feature extraction operation, giving the k-th further-extracted part feature f_k ∈ R^{1×N}:

f_k = g(F_k)

where f_k denotes the k-th further-extracted part feature and g(F_k) performs the feature extraction operation on the k-th part feature map F_k;
the features of the whole object are described by the part feature matrix P ∈ R^{M×N}, formed by stacking the further-extracted part features:

P = (f_1; f_2; …; f_M)

where M denotes the number of attention maps and N the number of feature map channels.
Further, in step (2.3), the attention regularization specifically comprises:
for each fine-grained category, the k-th attention map A_k is expected to represent the k-th part of the object. The variation of further-extracted features belonging to the same part is penalized: the k-th further-extracted part feature f_k is drawn toward the k-th global feature center c_k ∈ R^{1×N}, and the k-th attention map A_k is activated at the same part of the object. The attention regularization loss L_A is

L_A = Σ_{k=1}^{M} ‖f_k − c_k‖₂²

and c_k is updated by

c_k ← c_k + β(f_k − c_k)

where M is the number of attention maps, k is a counter with k ∈ [1, M], f_k is the k-th further-extracted part feature, c_k is the k-th global feature center, ‖f_k − c_k‖₂² is the squared two-norm of the difference between f_k and the k-th global feature center, and β is the update rate of c_k.
Further, in step (2.4), the attention-map-guided data augmentation during training comprises the enhancement map, attention cropping and attention dropping, specifically:
The enhancement map is constructed as follows:
when the object is small, a large part of the image is background and random data augmentation is inefficient. For each training image, one attention map is randomly selected to guide the augmentation process and is normalized into the enhancement map A*_k:

A*_k = (A_k − min(A_k)) / (max(A_k) − min(A_k))

where A*_k ∈ R^{H×W} denotes the enhancement map of the k-th attention map, R denotes the real space, H and W are the height and width of the enhancement map, A_k is the k-th attention map, min(A_k) is the smallest pixel value in the k-th attention map, and max(A_k) is the largest;
Attention cropping proceeds as follows:
first, the cropping mask is set to 1 at every pixel whose value in A*_k exceeds a manually set cropping threshold θ_c ∈ [0, 1], and to 0 elsewhere:

C_k(i, j) = 1 if A*_k(i, j) > θ_c, otherwise 0

where (i, j) denotes the pixel with horizontal and vertical coordinates i and j, C_k(i, j) is the cropping mask at pixel (i, j) obtained from the k-th enhancement map, and A*_k(i, j) is the value of pixel (i, j) in the k-th enhancement map;
the bounding box B_k determined from the k-th enhancement map covers the positive region of C_k(i, j); the region enclosed by B_k is enlarged from the original image and used as augmented input data, from which finer-grained features are extracted;
Attention dropping proceeds as follows:
the drop mask is set to 0 at every pixel whose value in A*_k exceeds a manually set dropping threshold θ_d ∈ [0, 1], and to 1 elsewhere:

D_k(i, j) = 0 if A*_k(i, j) > θ_d, otherwise 1

where D_k(i, j) is the drop mask at pixel (i, j) obtained from the k-th enhancement map.
Further, in step (3), the locating and refining, locating a fine-grained identification region through a bounding box and extracting its features, specifically comprises:
with the trained network model, the attention maps A are obtained as in step (2.1); the average of the M attention maps, which indicates the position of the object, is

A_aver = (1/M) Σ_{k=1}^{M} A_k

Following the attention cropping procedure of step (2.4), the object region indicated by A_aver is cropped from the original image according to A_aver; this is the located fine-grained identification region. The region is enlarged by bilinear interpolation and its features are extracted with the same network structure, yielding the fine-grained identification region features for final class prediction.
Further, in step (4), sorting the attention maps with the importance ranking algorithm and selecting the most discriminative region to participate in category prediction specifically comprises:
with the trained network model, the attention maps A are obtained as in step (2.1); the object region indicated by each A_k is cropped from the original image, enlarged by bilinear interpolation, and its features are extracted with the same network structure; from these features the probabilities Q_1, Q_2, Q_3, …, Q_M that each region belongs to the ground-truth class are computed. The region with the largest ground-truth probability Q_k is selected, and the corresponding A_k is regarded as the anchor node of highest importance. The coordinates of the geometric center of each region are computed, and all regions whose geometric centers lie within a distance margin of the anchor node's geometric center are selected; their corresponding attention maps A_k, A_l, …, A_t are averaged to obtain A_aver. The object region indicated by A_aver is cropped from the original image; this is the most discriminative region. It is enlarged by bilinear interpolation and its features are extracted with the same network structure, yielding the most discriminative region features for final class prediction.
Advantageous effects
1. The invention provides an attention-map importance ranking algorithm that orders the attention maps by importance, so that the most discriminative area of the original image can be located according to attention-map importance and its learning reinforced; this also avoids the excessive unnecessary noise introduced by the strong randomness of random-cropping data augmentation;
2. the first three convolutional layers are used when extracting original image features; compared with high-layer features, the extracted intermediate-layer features have higher resolution and contain more position and detail information, while avoiding the low semantics and heavy noise of low-layer features; the output of the third convolutional layer is then processed with a 3 × 3 convolution (Conv), global maximum pooling (GMP) and global average pooling (GAP), providing multi-scale information that benefits fine-grained recognition tasks whose differences lie only in local areas;
3. the invention uses attention cropping and attention dropping, applying the idea of reinforcement learning to drive the network to extract more discriminative features.
Drawings
FIG. 1 is a flow chart of the fine-grained identification method based on attention map sorting according to the present invention;
FIG. 2 is an overall block diagram of the fine-grained identification method based on attention map sorting according to the present invention;
FIG. 3 is a schematic illustration of the weakly supervised attention learning process of FIG. 2;
FIG. 4 is a schematic diagram of the bilinear attention pooling process of FIG. 2;
FIG. 5 is a schematic diagram of the attention-map importance ranking algorithm of FIG. 2.
Detailed Description
The present application will be described in further detail with reference to the following drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the relevant invention and not restrictive of the invention. It should be noted that, for convenience of description, only the portions related to the related invention are shown in the drawings.
It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The present application will be described in detail below with reference to the embodiments with reference to the attached drawings.
In this embodiment, the flow of the fine-grained identification method based on attention map sorting is shown in FIG. 1, and its overall framework is shown in FIG. 2; the method comprises the following steps:
(1) Acquiring original image features;
extracting features of images in the training set by using the first three convolutional layers of convolutional neural network inclusion v3, and then respectively processing the output result X3 of the third convolutional layer by using 3 × 3 convolutional Conv, global maximum pooling GAP and global average pooling GMP, as shown in fig. 3, processing the obtained three features:
Figure BDA0003292618490000071
Figure BDA0003292618490000072
cascading to obtain the cascaded characteristics
Figure BDA0003292618490000073
Then, carrying out Batch standardized Batch Normalization processing on the cascaded features to accelerate the training speed of the convolution network, and obtaining a feature map of the image through full-connection processing; and adjusting the obtained feature map to the same size by a bilinear interpolation method, thereby obtaining the features of the original image for final class prediction.
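A minimal NumPy sketch of this multi-scale fusion may help. The function, shapes and fusion order below are illustrative assumptions (the patent names only the three operations and a subsequent concatenation); the conv branch is collapsed spatially here so the three branches can be concatenated into one vector, and the random kernel stands in for learned weights.

```python
import numpy as np

def multiscale_features(x3, rng=None):
    """Sketch of step (1): combine a 3x3 convolution, global max pooling
    (GMP) and global average pooling (GAP) over the third-layer output X3
    of shape (channels, height, width)."""
    rng = rng or np.random.default_rng(0)
    c, h, w = x3.shape
    # 3x3 convolution with 'same' padding; random kernel stands in for learned weights
    k = rng.standard_normal((c, c, 3, 3)) * 0.01
    pad = np.pad(x3, ((0, 0), (1, 1), (1, 1)))
    conv = np.zeros_like(x3)
    for i in range(h):
        for j in range(w):
            patch = pad[:, i:i + 3, j:j + 3]       # (c, 3, 3) receptive field
            conv[:, i, j] = np.tensordot(k, patch, axes=3)
    gmp = x3.max(axis=(1, 2))                       # (c,) global max pooling
    gap = x3.mean(axis=(1, 2))                      # (c,) global average pooling
    conv_vec = conv.mean(axis=(1, 2))               # collapse conv branch spatially
    return np.concatenate([conv_vec, gmp, gap])     # cascaded multi-scale feature
```

In the patent the cascaded feature would then pass through Batch Normalization and a fully connected layer, which are omitted here.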
(2) Performing weakly supervised attention learning;
(2.1) Obtaining the feature map and the attention maps:
features of the training-set images are extracted with a convolutional neural network to obtain the feature map F ∈ R^{H×W×N}; the distribution of the object's parts is described by the attention maps A ∈ R^{H×W×M}, where R denotes the real space, H and W are the height and width, N is the number of feature map channels, and M is the number of attention maps. A is obtained from F by

A = f(F), A = (A_1, A_2, …, A_M)

where F is the feature map, f(F) is a convolution operation on F, k is a counter with k ∈ [1, M], and A_k denotes the k-th attention map.
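The step above can be sketched as follows. Modeling f as a 1 × 1 convolution followed by ReLU is an assumption (the patent only says "convolution operation"); a 1 × 1 kernel reduces to a per-pixel matrix product over channels, which keeps the sketch simple.

```python
import numpy as np

def attention_maps(F, W):
    """Sketch of step (2.1): A = f(F), mapping a feature map F of shape
    (H, W, N) to M attention maps of shape (H, W, M) via weights W (N, M)."""
    A = F @ W                   # per-pixel linear map over the N channels
    return np.maximum(A, 0.0)   # ReLU keeps attention values non-negative
```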
(2.2) Bilinear attention pooling:
after the attention maps A are obtained, features are extracted from the corresponding parts with bilinear attention pooling (BAP), whose process is shown in FIG. 4: the feature map F is multiplied element-wise by each attention map to generate the part feature maps,

F_k = A_k ⊙ F (k = 1, 2, …, M)

where F_k ∈ R^{H×W×N} denotes the k-th part feature map and ⊙ denotes element-wise multiplication.
Discriminative local features are then further extracted by a feature extraction operation, giving the k-th further-extracted part feature f_k ∈ R^{1×N}:

f_k = g(F_k)

where f_k denotes the k-th further-extracted part feature and g(F_k) performs the feature extraction operation on the k-th part feature map F_k.
The features of the whole object are described by the part feature matrix P ∈ R^{M×N}, formed by stacking the further-extracted part features:

P = (f_1; f_2; …; f_M)

where M denotes the number of attention maps and N the number of feature map channels.
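BAP can be sketched directly from the formulas above. Using global average pooling for g is an assumption, since the patent only calls it a "feature extraction operation":

```python
import numpy as np

def bilinear_attention_pooling(F, A):
    """Sketch of step (2.2), bilinear attention pooling (BAP):
    F_k = A_k ⊙ F over spatial positions, then g(.) reduces each part
    feature map to a vector f_k in R^{1xN}; the rows f_k are stacked
    into the part feature matrix P in R^{MxN}."""
    H, W, N = F.shape
    M = A.shape[-1]
    P = np.empty((M, N))                 # part feature matrix P
    for k in range(M):
        Fk = A[..., k:k + 1] * F         # (H, W, N): k-th part feature map
        P[k] = Fk.mean(axis=(0, 1))      # f_k = g(F_k), here a spatial average
    return P
```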
(2.3) Attention regularization:
for each fine-grained category, the k-th attention map A_k is expected to represent the k-th part of the object, so the invention proposes an attention regularization loss to weakly supervise the attention learning process. The variation of further-extracted features belonging to the same part of an object is penalized: the k-th further-extracted part feature f_k is drawn toward the k-th global feature center c_k ∈ R^{1×N}, and the k-th attention map A_k is activated at the same part of the object. The attention regularization loss L_A is

L_A = Σ_{k=1}^{M} ‖f_k − c_k‖₂²

and c_k is updated by

c_k ← c_k + β(f_k − c_k)

where M is the number of attention maps, k is a counter with k ∈ [1, M], f_k is the k-th further-extracted part feature, c_k is the k-th global feature center, ‖f_k − c_k‖₂² is the squared two-norm of the difference between f_k and the k-th global feature center, and β is the update rate of c_k.
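One training step of this regularization can be sketched as below; the function name and the default β are illustrative:

```python
import numpy as np

def attention_reg_step(P, centers, beta=0.05):
    """Sketch of step (2.3): L_A = sum_k ||f_k - c_k||_2^2 pulls each part
    feature f_k (row k of P) toward its global feature center c_k, and each
    center is updated by c_k <- c_k + beta * (f_k - c_k)."""
    diffs = P - centers                      # (M, N) per-part deviations
    loss = float((diffs ** 2).sum())         # sum of squared two-norms
    new_centers = centers + beta * diffs     # moving-average center update
    return loss, new_centers
```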
(2.4) Attention-map-guided data augmentation during training, comprising the enhancement map, attention cropping and attention dropping.
The enhancement map is constructed as follows:
when the object is small, a large part of the image is background, in which case random data augmentation is inefficient; with the attention maps, the data can be augmented more effectively. For each training image, one attention map is randomly selected to guide the augmentation process, and the k-th attention map is normalized into the enhancement map A*_k:

A*_k = (A_k − min(A_k)) / (max(A_k) − min(A_k))

where A*_k ∈ R^{H×W}, R denotes the real space, H and W are the height and width, A_k is the k-th attention map, min(A_k) is the smallest pixel value in the k-th attention map, and max(A_k) is the largest.
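The normalization above is plain min-max scaling; a sketch, with a guard for a flat map added as an assumption:

```python
import numpy as np

def enhancement_map(Ak):
    """Sketch of step (2.4): min-max normalize the k-th attention map into
    an enhancement map A*_k = (A_k - min(A_k)) / (max(A_k) - min(A_k)),
    so its values lie in [0, 1] and can be thresholded."""
    lo, hi = Ak.min(), Ak.max()
    if hi == lo:                               # degenerate flat map
        return np.zeros_like(Ak, dtype=float)
    return (Ak - lo) / (hi - lo)
```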
Attention cropping proceeds as follows:
with the enhancement map, the corresponding area of the original image is enlarged and finer local features are extracted. Specifically, the cropping mask is first set to 1 at every pixel whose value in A*_k exceeds a manually set cropping threshold θ_c ∈ [0, 1], and to 0 elsewhere:

C_k(i, j) = 1 if A*_k(i, j) > θ_c, otherwise 0

where (i, j) denotes the pixel with horizontal and vertical coordinates i and j, C_k(i, j) is the cropping mask at pixel (i, j) obtained from the k-th enhancement map, and A*_k(i, j) is the value of pixel (i, j) in the k-th enhancement map.
The bounding box B_k determined from the k-th enhancement map covers the positive region of C_k(i, j); the area enclosed by B_k is enlarged from the original image and used as augmented input data, as shown in FIG. 3. Because the object then occupies a larger proportion of the input, the object is seen better and finer-grained features are extracted.
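A sketch of the cropping step. It assumes the enhancement map E has already been upsampled to the image's spatial size, and it omits the final bilinear resize of the crop:

```python
import numpy as np

def attention_crop(image, E, theta_c=0.5):
    """Sketch of attention cropping: build the crop mask
    C_k(i, j) = 1 where A*_k(i, j) > theta_c, take the bounding box B_k
    covering all positive mask entries, and return that region."""
    mask = E > theta_c
    if not mask.any():
        return image                        # nothing exceeds the threshold
    rows, cols = np.where(mask)
    r0, r1 = rows.min(), rows.max() + 1
    c0, c1 = cols.min(), cols.max() + 1     # bounding box B_k
    return image[r0:r1, c0:c1]
```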
Attention dropping proceeds as follows:
the attention regularization loss supervises the k-th attention map A_k to represent the k-th part of the same object, yet different attention maps may still focus on similar parts. Attention dropping is therefore proposed to encourage the attention maps to represent multiple distinct regions of the object. Specifically, the drop mask is set to 0 at every pixel whose value in A*_k exceeds a manually set dropping threshold θ_d ∈ [0, 1], and to 1 elsewhere:

D_k(i, j) = 0 if A*_k(i, j) > θ_d, otherwise 1

where D_k(i, j) is the drop mask at pixel (i, j) obtained from the k-th enhancement map.
Masking the original image with D_k(i, j) removes the k-th region, which encourages the network to extract other discriminative parts; this improves the robustness of classification and the accuracy of localization.
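The dropping step is the mirror image of cropping; a sketch, again assuming E matches the image's spatial size:

```python
import numpy as np

def attention_drop(image, E, theta_d=0.5):
    """Sketch of attention dropping: D_k(i, j) = 0 where A*_k(i, j) > theta_d,
    1 elsewhere; multiplying the image by D_k erases the k-th region and
    pushes the network to find other discriminative parts."""
    D = (E <= theta_d).astype(image.dtype)   # drop mask D_k
    return image * D
```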
(3) Locating and refining: locating the fine-grained identification region through a bounding box and extracting its features.
With the trained network model, the attention maps A are obtained as in step (2.1); the average of the M attention maps, which indicates the position of the object, is

A_aver = (1/M) Σ_{k=1}^{M} A_k

Following the attention cropping procedure of step (2.4), the object region indicated by A_aver is cropped from the original image according to A_aver; this is the located fine-grained identification region. The region is enlarged by bilinear interpolation and its features are extracted with the same network structure, yielding the fine-grained identification region features for final class prediction.
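The averaging in step (3) is a one-liner; in this sketch A is assumed to be an H × W × M array with the attention maps stacked along the last axis, after which the result would be cropped as in step (2.4):

```python
import numpy as np

def average_attention(A):
    """Sketch of step (3): A_aver = (1/M) * sum_k A_k over the M maps
    stacked along the last axis of A."""
    return A.mean(axis=-1)
```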
(4) The attention maps are sorted by the importance ranking algorithm and the most discriminative region is selected to participate in category prediction; FIG. 5 is a schematic diagram of the attention-map importance ranking algorithm.
With the trained network model, the attention maps A are obtained as in step (2.1). Following the attention cropping procedure of step (2.4), the object region indicated by each A_k is cropped from the original image, enlarged by bilinear interpolation, and its features are extracted with the same network structure; from these features the probabilities Q_1, Q_2, Q_3, …, Q_M that each region belongs to the ground-truth class are computed. The region with the largest ground-truth probability Q_k is selected, and the corresponding A_k is regarded as the most important and taken as the anchor node. The coordinates of the geometric center of each region are computed, and all regions whose geometric centers lie within a distance margin of the anchor node's geometric center are selected; their corresponding attention maps A_k, A_l, …, A_t are averaged to obtain A_aver. Following the attention cropping of step (2.4), the object region indicated by A_aver is cropped from the original image; this is the most discriminative region. It is enlarged by bilinear interpolation and its features are extracted with the same network structure, yielding the most discriminative region features for final class prediction;
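The selection logic of the ranking can be sketched as below. The function operates on precomputed ground-truth probabilities and region centers; the names are illustrative, and the Euclidean distance between geometric centers is an assumption since the patent does not specify the distance measure:

```python
import numpy as np

def rank_and_select(probs, centers, margin):
    """Sketch of step (4), attention-map importance ranking: each region
    has a ground-truth-class probability Q_k and a geometric center; the
    region with the largest Q_k is the anchor, and every region whose
    center lies within `margin` of the anchor's center is kept. The
    selected maps would then be averaged into A_aver and cropped as in
    step (3). Returns the anchor index and the selected indices."""
    anchor = int(np.argmax(probs))                  # most discriminative region
    ref = np.asarray(centers[anchor], dtype=float)
    dists = np.linalg.norm(np.asarray(centers, dtype=float) - ref, axis=1)
    selected = [k for k in range(len(probs)) if dists[k] < margin]
    return anchor, selected
```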
(5) Cascading (concat) the features of the original image, the localized fine-grained recognition area, and the selected most discriminative area for the final prediction.
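The cascade in step (5) can be sketched as a channel-wise concatenation of the three feature vectors; the feature dimension 768 and the function name below are illustrative assumptions, not taken from the patent:

```python
import numpy as np

def fuse_features(raw_feat, region_feat, discrim_feat):
    # Cascade (concat) the three feature vectors along the channel axis
    # before feeding them to the final classifier.
    return np.concatenate([raw_feat, region_feat, discrim_feat], axis=-1)

raw = np.ones(768)      # features of the original image
region = np.ones(768)   # features of the localized fine-grained area
discrim = np.ones(768)  # features of the most discriminative area
fused = fuse_features(raw, region, discrim)   # shape (2304,)
```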
The above description is only a preferred embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the embodiments of the present invention are included in the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (9)

1. A fine-grained identification method based on attention map sorting is characterized by comprising the following steps:
(1) acquiring the characteristics of an original image;
(2) performing weak supervision attention learning;
(3) positioning and refining, namely positioning a fine-grained identification area through a bounding box and extracting the characteristics of the area;
(4) sorting the attention diagrams according to an importance sorting algorithm, and selecting the most discriminative region to participate in category prediction;
(5) and cascading the characteristics of the original image, the positioned fine-grained identification area and the selected most discriminative area for final prediction.
2. The fine-grained identification method according to claim 1, wherein in the step (1), the obtaining of the original image features specifically comprises:
extracting the features of the images in the training set with the first three convolutional layers of the convolutional neural network Inception v3, then processing the output X_3 of the third convolutional layer with a 3 × 3 convolution Conv, global maximum pooling GMP, and global average pooling GAP respectively, giving the three processed features:

X_Conv = Conv_{3×3}(X_3), X_GMP = GMP(X_3), X_GAP = GAP(X_3)

and cascading them to obtain the cascaded feature

X = concat(X_Conv, X_GMP, X_GAP)
Then Batch Normalization is applied to the cascaded feature to accelerate the training of the convolutional network, and the feature map of the image is obtained through a fully connected layer; the obtained feature maps are adjusted to the same size by bilinear interpolation, giving the original-image features for final class prediction.
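A minimal numpy sketch of the three-branch processing of X_3 described above (GMP, GAP, and a stand-in for the 3 × 3 convolution branch), followed by cascading and an inference-form batch normalization; all shapes and the random projection are illustrative assumptions, not the patent's actual network:

```python
import numpy as np
rng = np.random.default_rng(0)

H, W, C = 8, 8, 16
X3 = rng.standard_normal((H, W, C))   # stand-in for the 3rd-layer output X_3

x_gmp = X3.max(axis=(0, 1))           # global maximum pooling GMP -> (C,)
x_gap = X3.mean(axis=(0, 1))          # global average pooling GAP -> (C,)

# stand-in for the 3x3 conv branch: a random linear projection collapsed
# to a vector, just to produce a fixed-length feature like the Conv output
W_conv = rng.standard_normal((C, C))
x_conv = (X3.reshape(-1, C) @ W_conv).mean(axis=0)

x = np.concatenate([x_conv, x_gmp, x_gap])            # cascaded feature, length 3C
x_bn = (x - x.mean()) / np.sqrt(x.var() + 1e-5)       # batch-norm, inference form
```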
3. The fine-grained identification method according to claim 2, wherein in the step (2), the performing the weakly supervised attention learning comprises:
(2.1) acquiring a characteristic diagram and an attention diagram;
(2.2) bilinear attention pooling BAP;
(2.3) attention regularization;
and (2.4) carrying out attention-map-oriented data expansion during training, wherein the data expansion comprises enhancement-map generation, attention cropping, and attention dropping.
4. The fine grain identification method according to claim 3, wherein in the step (2.1), the obtaining of the feature map and the attention map is specifically:
extracting the features of the images in the training set with a convolutional neural network to obtain a feature map F ∈ R^{H×W×N}, where R denotes the real space, H and W denote the height and width of the feature map, and N denotes the number of channels of the feature map; the distribution of each object part is captured in the attention map A ∈ R^{H×W×M}, where M denotes the number of attention maps; the attention map A is obtained from F by:

A = f(F) = ∪_{k=1}^{M} A_k

where F denotes the feature map, f(F) denotes a convolution operation on the feature map, k is a counter with k ∈ [1, M], and A_k denotes the kth attention map.
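The mapping A = f(F) can be sketched as a 1 × 1 convolution across the channel axis; the kernel, shapes, and the ReLU (used here only to keep the maps non-negative) are illustrative assumptions, not taken from the patent:

```python
import numpy as np
rng = np.random.default_rng(1)

H, W, N, M = 8, 8, 32, 4
F = rng.standard_normal((H, W, N))   # feature map F in R^{H x W x N}
W1 = rng.standard_normal((N, M))     # assumed 1x1 conv kernel for f(.)

# A = f(F): each attention map A_k is one output channel of the 1x1 conv;
# a 1x1 conv is just a matrix product over the channel dimension
A = np.maximum(F.reshape(-1, N) @ W1, 0).reshape(H, W, M)
```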
5. The fine-grained identification method according to claim 4, wherein in the step (2.2), the bilinear attention pooling is specifically:
after the attention map A is obtained, features are extracted from these parts using bilinear attention pooling BAP: the feature map F is multiplied element-wise by each attention map to generate a part feature map, as shown in the following formula:

F_k = A_k ⊙ F (k = 1, 2, ..., M)

where F_k denotes the kth part feature map and ⊙ denotes the element-by-element multiplication operation;

distinctive local features are then obtained through a feature-extraction operation, giving the kth further-extracted part feature f_k ∈ R^{1×N}, as shown in the following formula:

f_k = g(F_k)

where f_k denotes the kth further-extracted part feature and g(F_k) denotes the feature-extraction operation applied to the kth part feature map F_k;

the features of the whole object are represented by a part feature matrix P ∈ R^{M×N}, formed by stacking these further-extracted part features:

P = (f_1; f_2; ...; f_M)

where M denotes the number of attention maps and N denotes the number of feature-map channels.
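The BAP steps above (the element-wise product F_k = A_k ⊙ F, a pooling operation for g(·), and stacking into the part feature matrix) can be sketched in numpy; global average pooling is assumed for g(·), and all shapes are illustrative:

```python
import numpy as np
rng = np.random.default_rng(2)

H, W, N, M = 8, 8, 32, 4
F = rng.standard_normal((H, W, N))   # feature map
A = rng.random((H, W, M))            # M attention maps

parts = []
for k in range(M):
    Fk = A[:, :, k:k+1] * F          # F_k = A_k (x) F, broadcast to (H, W, N)
    fk = Fk.mean(axis=(0, 1))        # g(F_k): assumed global average pooling -> (N,)
    parts.append(fk)
P = np.stack(parts)                  # part feature matrix, (M, N)
```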
6. A fine-grained identification method according to claim 5, wherein in step (2.3), the attention regularization is specifically:
for each fine-grained category, the kth attention map A_k is expected to represent the kth same part of the object; differences between further-extracted part features belonging to the same part are penalized, so that the kth further-extracted part feature f_k approaches the kth global feature center c_k ∈ R^{1×N} and the kth attention map A_k is activated on the same part of the object. The attention regularization loss L_A is:

L_A = Σ_{k=1}^{M} ||f_k − c_k||_2^2

c_k is updated by the following formula:

c_k ← c_k + β(f_k − c_k)

where M denotes the number of attention maps, k is a counter with k ∈ [1, M], f_k denotes the kth further-extracted part feature, c_k denotes the kth global feature center, ||f_k − c_k||_2^2 denotes the squared two-norm of the difference between the kth further-extracted part feature f_k and the kth global feature center, and β denotes the update rate of c_k.
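A minimal numpy sketch of the regularization loss and the center update above; the feature values, M, N, and β are illustrative assumptions:

```python
import numpy as np
rng = np.random.default_rng(3)

M, N, beta = 4, 32, 0.05
f = rng.standard_normal((M, N))   # f_k: further-extracted part features
c = np.zeros((M, N))              # c_k: global feature centers (initialized to zero)

# L_A = sum_k || f_k - c_k ||_2^2
loss = float(((f - c) ** 2).sum())

# c_k <- c_k + beta * (f_k - c_k): moving-average update of each center
c = c + beta * (f - c)
```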
7. The fine-grained recognition method according to claim 6, wherein in the step (2.4), the attention-map-oriented data expansion during training comprises enhancement-map generation, attention cropping, and attention dropping, specifically:

the enhancement-map data expansion is as follows:

when the object is small, a large part of the image is background and random data enhancement is inefficient in this case; for each training image, one attention map is therefore randomly selected to guide the enhancement process and is normalized into an enhancement map:

A_k* = (A_k − min(A_k)) / (max(A_k) − min(A_k))

where A_k* ∈ R^{H×W} denotes the enhancement map of the kth attention map, R denotes the real space, H and W denote the height and width of the enhancement map respectively, A_k denotes the kth attention map, min(A_k) denotes the pixel value of the pixel with the smallest value in the kth attention map, and max(A_k) denotes the pixel value of the pixel with the largest value in the kth attention map;
the attention-cropping data expansion is as follows:

first, the crop mask is set to 1 at the pixels whose value in A_k* is greater than a manually set cropping threshold θ_c ∈ [0, 1], and to 0 at all other pixels, as shown in the following formula:

C_k(i, j) = 1 if A_k*(i, j) > θ_c, and C_k(i, j) = 0 otherwise

where (i, j) denotes the pixel whose horizontal and vertical coordinates are i and j respectively, C_k(i, j) denotes the crop mask at pixel (i, j) obtained from the kth enhancement map, and A_k*(i, j) denotes the value of pixel (i, j) in the kth enhancement map;
the bounding box B_k determined from the kth enhancement map covers the region where C_k(i, j) is positive; the area enclosed by B_k is enlarged from the original image and used as enhanced input data, from which finer-grained features are extracted; the attention-dropping data expansion is as follows:
the drop mask is set to 0 at the pixels whose value in A_k* is greater than a manually set dropping threshold θ_d ∈ [0, 1], and to 1 at all other pixels, as shown in the following formula:

D_k(i, j) = 0 if A_k*(i, j) > θ_d, and D_k(i, j) = 1 otherwise

where D_k(i, j) denotes the drop mask at pixel (i, j) obtained from the kth enhancement map.
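A minimal numpy sketch of the three augmentation quantities in this claim (the enhancement map A_k*, the crop mask C_k, the drop mask D_k, and the bounding box B_k); the map size and threshold values are illustrative assumptions:

```python
import numpy as np
rng = np.random.default_rng(4)

Ak = rng.random((8, 8))                        # kth attention map
# enhancement map: min-max normalize A_k to [0, 1]
Ak_star = (Ak - Ak.min()) / (Ak.max() - Ak.min())

theta_c, theta_d = 0.5, 0.5                    # cropping / dropping thresholds
Ck = (Ak_star > theta_c).astype(int)           # crop mask: 1 where attention is high
Dk = (Ak_star <= theta_d).astype(int)          # drop mask: 0 where attention is high

# bounding box B_k: smallest box covering all pixels with C_k(i, j) = 1
ys, xs = np.nonzero(Ck)
Bk = (ys.min(), xs.min(), ys.max(), xs.max())
```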
8. The fine grain identification method according to claim 7, wherein in the step (3), the positioning and refining, the fine grain identification area is positioned through the bounding box and the feature of the area is extracted, specifically:
using the trained network model, the attention map A is obtained after step (2.1); the average A_aver of the M attention maps, which indicates the position of the object, is calculated by the following formula:

A_aver = (1/M) Σ_{k=1}^{M} A_k
according to A_aver, the object area indicated by A_aver is cropped from the original image following the attention-cropping data-expansion step described in step (2.4); this area is the localized fine-grained recognition area; the area is enlarged by bilinear interpolation, and its features are extracted with the same network structure to obtain the fine-grained recognition area features for final class prediction.
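A minimal numpy sketch of the localization in claim 8: averaging the M attention maps into A_aver and taking the bounding box of the indicated region; the maps, threshold, and stand-in image are illustrative assumptions:

```python
import numpy as np
rng = np.random.default_rng(5)

M, H, W = 4, 8, 8
A = rng.random((M, H, W))            # trained attention maps A_1..A_M
A_aver = A.mean(axis=0)              # A_aver = (1/M) * sum_k A_k

# locate the fine-grained region: min-max normalize A_aver, threshold it,
# and take the bounding box of the activated pixels
norm = (A_aver - A_aver.min()) / (A_aver.max() - A_aver.min())
ys, xs = np.nonzero(norm > 0.5)
box = (ys.min(), xs.min(), ys.max() + 1, xs.max() + 1)
crop = np.ones((H, W))[box[0]:box[2], box[1]:box[3]]   # stand-in image crop
```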
9. The fine grain identification method according to claim 8, wherein in the step (4), the attention maps are ranked according to an importance ranking algorithm, and the selection of the most discriminative region to participate in the category prediction specifically comprises:
using the trained network model, the attention map A is obtained after step (2.1); the object area indicated by each A_k is cropped from the original image and enlarged by bilinear interpolation, the features of each area are extracted with the same network structure, and from these features the probabilities Q_1, Q_2, Q_3, ..., Q_M that each area belongs to the ground-truth class are computed; the area corresponding to the maximum ground-truth-class probability Q_k is regarded as the anchor node with the highest importance; the geometric center of each area is calculated, all areas whose geometric centers lie within a distance margin of the anchor node's center are selected, and their corresponding attention maps A_k, A_l, ..., A_t are averaged to obtain A_aver; the object region indicated by A_aver is cropped from the original image as the most discriminative region, enlarged by bilinear interpolation, and its features are extracted with the same network structure to obtain the most discriminative region features for final class prediction.
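The importance-ranking selection in this claim can be sketched as follows; the probabilities Q, the margin value, and the attention-weighted center definition are illustrative assumptions (the patent does not specify how the geometric center is computed):

```python
import numpy as np
rng = np.random.default_rng(6)

M, H, W = 6, 8, 8
A = rng.random((M, H, W))        # attention maps, one per candidate region
Q = rng.random(M)                # assumed ground-truth-class probability per region
margin = 3.0                     # assumed distance threshold

def center(a):
    # attention-weighted geometric center of one map (an assumption)
    ys, xs = np.mgrid[0:a.shape[0], 0:a.shape[1]]
    w = a / a.sum()
    return np.array([(ys * w).sum(), (xs * w).sum()])

anchor = int(np.argmax(Q))       # anchor node: region with maximum Q_k
centers = np.stack([center(A[k]) for k in range(M)])
dist = np.linalg.norm(centers - centers[anchor], axis=1)
selected = dist < margin         # all maps whose center lies within `margin`
A_aver = A[selected].mean(axis=0)   # average the selected attention maps
```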
CN202111173394.7A 2021-10-08 2021-10-08 Fine granularity identification method based on attention-seeking diagram ordering Active CN113936145B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111173394.7A CN113936145B (en) 2021-10-08 2021-10-08 Fine granularity identification method based on attention-seeking diagram ordering


Publications (2)

Publication Number Publication Date
CN113936145A true CN113936145A (en) 2022-01-14
CN113936145B CN113936145B (en) 2024-06-11


Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200160124A1 (en) * 2017-07-19 2020-05-21 Microsoft Technology Licensing, Llc Fine-grained image recognition
US20210012146A1 (en) * 2019-07-12 2021-01-14 Wuyi University Method and apparatus for multi-scale sar image recognition based on attention mechanism
CN112686242A (en) * 2020-12-29 2021-04-20 昆明理工大学 Fine-grained image classification method based on multilayer focusing attention network
CN112699902A (en) * 2021-01-11 2021-04-23 福州大学 Fine-grained sensitive image detection method based on bilinear attention pooling mechanism
CN112949655A (en) * 2021-03-01 2021-06-11 南京航空航天大学 Fine-grained image recognition method combined with attention mixed cutting


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
WEI SUN等: "A Multi-Feature Learning Model with Enhanced Local Attention for Vehicle Re-Identification", 《TECH SCIENCE PRESS》, 24 August 2021 (2021-08-24) *



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant