CN113936145B - Fine-grained recognition method based on attention map ranking - Google Patents


Publication number
CN113936145B
CN113936145B · Application CN202111173394.7A
Authority
CN
China
Prior art keywords
attention
map
area
fine
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111173394.7A
Other languages
Chinese (zh)
Other versions
CN113936145A (en)
Inventor
张小瑞
王营营
孙伟
宋爱国
刘青山
张开华
Current Assignee
Nanjing University of Information Science and Technology
Original Assignee
Nanjing University of Information Science and Technology
Priority date
Filing date
Publication date
Application filed by Nanjing University of Information Science and Technology filed Critical Nanjing University of Information Science and Technology
Priority to CN202111173394.7A priority Critical patent/CN113936145B/en
Publication of CN113936145A publication Critical patent/CN113936145A/en
Application granted granted Critical
Publication of CN113936145B publication Critical patent/CN113936145B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06N — COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 — Computing arrangements based on biological models
    • G06N 3/02 — Neural networks
    • G06N 3/04 — Architecture, e.g. interconnection topology
    • G06N 3/045 — Combinations of networks
    • G06N 3/08 — Learning methods
    • G06T — IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 3/00 — Geometric image transformations in the plane of the image
    • G06T 3/40 — Scaling of whole images or parts thereof, e.g. expanding or contracting
    • G06T 3/4007 — Scaling of whole images or parts thereof based on interpolation, e.g. bilinear interpolation

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The invention discloses a fine-grained recognition method based on attention map ranking, belonging to the technical fields of computer vision and pattern recognition. First, original-image features are acquired: the output of the third convolutional layer is processed by a 3×3 convolution Conv, global average pooling GAP, and global max pooling GMP, and the results are passed through a fully connected layer to obtain multi-scale original-image features. Weakly supervised attention learning is then performed. Next, localization and refinement locate the fine-grained recognition region with a bounding box and extract its features. The attention maps are then ranked by an importance ranking algorithm, the most discriminative region is selected with a bounding box, and its features are extracted; the importance ranking of the attention maps reinforces learning of the most discriminative region. Finally, the features of the three levels, namely the original image, the localized fine-grained recognition region, and the selected most discriminative region, are concatenated.

Description

Fine-grained recognition method based on attention map ranking
Technical Field
The invention relates to a fine-grained recognition method based on attention map ranking, and belongs to the technical fields of computer vision and pattern recognition.
Background
Fine-grained image recognition categorizes the subclasses of a given class: not only coarse classes such as flowers, birds, and dogs, but also the different subclasses of dogs, for example Husky, Samoyed, and Golden Retriever. Such subclasses often share a similar overall appearance and must be distinguished by local details, and these details appear at different positions in the image depending on the pose of the target, so fine-grained image recognition is harder than traditional image recognition.
The fine-grained image recognition task has long been a challenge in computer vision, mainly for the following reasons: (1) high intra-class variance: objects of the same category often exhibit markedly different poses; (2) low inter-class variance: objects of different classes can be nearly identical apart from minor differences, such as the head color or beak shape of a bird; (3) limited training data: labeling fine-grained categories usually requires substantial expertise and annotation time, so fine-grained recognition datasets are typically small. For these reasons, it is difficult to obtain accurate classification results with an ordinary coarse-grained convolutional neural network (CNN) alone.
To distinguish different subclasses, for example different species of birds, it is necessary not only to extract features from the whole picture but also to pick out discriminative local regions, such as the head, beak, or feet of the bird, and to use their features to support the final class decision. Background content such as flowers and grass is unimportant for the class decision: different birds perch on trees and grassland alike, so tree and grass information cannot play a decisive role in bird identification. Introducing an attention mechanism into image recognition is therefore very effective, because it focuses the deep learning model on discriminative local regions. Since the discriminative local differences in fine-grained recognition are relatively subtle, intermediate-layer features are used for classification: compared with high-level features they have higher resolution and contain more positional and detail information, while avoiding the low semantics and high noise of low-level features. In addition, multi-scale information obtained through convolution, global average pooling, and global max pooling benefits fine-grained recognition tasks in which only subtle local differences exist.
Labeling fine-grained categories usually requires domain experts to spend considerable annotation time, so fine-grained recognition datasets are typically small, which makes data augmentation particularly necessary. Conventional augmentation crops the picture randomly, which easily yields background regions or incomplete part regions; such crops amount to noise, and when the object to be recognized is small, even more noise is introduced.
Disclosure of Invention
To address these problems, the invention provides a fine-grained recognition method based on attention map ranking, by which features are extracted from three levels, namely the original image, the localized fine-grained recognition region, and the selected most discriminative region, and category prediction is performed, improving the accuracy of fine-grained recognition.
The technical scheme of the invention is as follows:
To achieve this purpose, the invention provides a fine-grained recognition method based on attention map ranking, comprising the following steps:
(1) Acquiring original-image features;
(2) Performing weakly supervised attention learning;
(3) Localization and refinement: locating the fine-grained recognition region with a bounding box and extracting its features;
(4) Ranking the attention maps with an importance ranking algorithm and selecting the most discriminative region to participate in category prediction;
(5) Concatenating the features of the three levels, namely the original image, the localized fine-grained recognition region, and the selected most discriminative region, for the final prediction.
Further, in step (1), acquiring the original-image features specifically comprises:
Extracting features of the training-set images with the first three convolutional layers of the convolutional neural network Inception v3; processing the output X 3 of the third convolutional layer separately with a 3×3 convolution Conv, global average pooling GAP, and global max pooling GMP; concatenating the three resulting features; applying batch normalization (Batch Normalization) to the concatenated features to speed up training of the convolutional network, and obtaining the feature map of the image through a fully connected layer; and resizing the obtained feature maps to the same size by bilinear interpolation, thereby obtaining the original-image features used for the final category prediction.
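The multi-scale fusion step above can be sketched in numpy as follows. This is a minimal illustration only: the toy random 3×3 kernel, the pooling of the convolution branch before concatenation, and all tensor shapes are assumptions for demonstration, not the patent's exact Inception v3 configuration.

```python
import numpy as np

def multiscale_features(x3, rng=None):
    """Sketch of the multi-scale step: the third-stage output X3 (H, W, C)
    is processed by a 3x3 convolution, global average pooling (GAP), and
    global max pooling (GMP); the three results are concatenated."""
    H, W, C = x3.shape
    rng = rng or np.random.default_rng(0)
    kernel = rng.standard_normal((3, 3))   # toy stand-in for the learned Conv weights
    pad = np.pad(x3, ((1, 1), (1, 1), (0, 0)))
    conv = np.zeros_like(x3)
    for i in range(H):
        for j in range(W):
            # apply the same 3x3 kernel to every channel of the window
            conv[i, j] = np.tensordot(pad[i:i+3, j:j+3], kernel, axes=([0, 1], [0, 1]))
    gap = x3.mean(axis=(0, 1))             # global average pooling -> (C,)
    gmp = x3.max(axis=(0, 1))              # global max pooling     -> (C,)
    # pool the conv branch too, then cascade the three branches into one vector
    return np.concatenate([conv.mean(axis=(0, 1)), gap, gmp])
```

In the patent the concatenated vector would then pass through batch normalization and a fully connected layer; those learned stages are omitted here.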
Further, in step (2), performing the weakly supervised attention learning comprises:
(2.1) obtaining the feature map and the attention maps;
(2.2) bilinear attention pooling;
(2.3) attention regularization;
(2.4) attention-guided data augmentation during training, including generation of the augmentation map, attention cropping, and attention dropping.
Further, in step (2.1), obtaining the feature map and the attention maps specifically comprises:
Extracting features of the training-set images with a convolutional neural network to obtain a feature map F, where F ∈ R H×W×N, R denotes the real space, H and W denote the height and width of the feature map, and N denotes its number of channels. The distribution of the object parts is represented by attention maps A ∈ R H×W×M, where M denotes the number of attention maps A. A is obtained from F by the following formula:
A = (A 1, A 2, …, A M) = f(F)
where F denotes the feature map, f(F) denotes a convolution applied to the feature map, k denotes a counter with k ∈ [1, M], and A k denotes the k-th attention map.
Further, in step (2.2), the bilinear attention pooling specifically comprises:
After the attention maps A are obtained, features are extracted from these parts by bilinear attention pooling (BAP): the feature map F is multiplied element-wise by each attention map to generate the part feature maps, as shown in the following formula:
Fk=Ak⊙F (k=1,2,…,M)
where F k ∈ R H×W×N denotes the k-th part feature map and ⊙ denotes element-wise multiplication;
The discriminative local features are further extracted by a feature-extraction operation, giving the k-th further-extracted part feature f k ∈ R 1×N, as shown in the following formula:
fk=g(Fk)
where f k denotes the k-th further-extracted part feature and g(F k) denotes the feature-extraction operation applied to the k-th part feature map F k;
The overall feature of the object is represented by the part feature matrix P ∈ R M×N, formed by stacking these further-extracted part features:
P = (f 1, f 2, …, f M) T
where M denotes the number of attention maps and N denotes the number of feature-map channels.
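The BAP step above can be sketched directly from the formulas. One assumption is labeled in the code: the pooling g(·) is taken to be global average pooling, which the patent does not fix explicitly.

```python
import numpy as np

def bilinear_attention_pooling(F, A):
    """Sketch of bilinear attention pooling (BAP).
    F: feature map of shape (H, W, N); A: attention maps of shape (H, W, M).
    Each part feature map F_k = A_k (.) F is reduced by g(.) to a part
    feature f_k in R^{1xN}; stacking the f_k gives P in R^{MxN}."""
    H, W, N = F.shape
    M = A.shape[2]
    P = np.zeros((M, N))
    for k in range(M):
        Fk = A[:, :, k:k+1] * F        # element-wise product A_k with F -> (H, W, N)
        P[k] = Fk.mean(axis=(0, 1))    # g(F_k): assumed global average pooling
    return P
```

The rows of the returned matrix are the part features f 1 … f M described in the formulas.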
Further, in step (2.3), the attention regularization specifically comprises:
For each fine-grained category, the k-th attention map A k is expected to represent the same k-th part of the object. Differences among the further-extracted part features belonging to the same part are penalized: the k-th further-extracted part feature f k is drawn close to the k-th global feature center c k ∈ R 1×N, and the k-th attention map A k is activated at the same part of the object. The attention regularization loss L A is shown in the following formula:
L A = Σ k=1 M ‖f k − c k‖ 2 2
The update formula of c k is as follows:
ck←ck+β(fk−ck)
where M denotes the number of attention maps, k denotes a counter with k ∈ [1, M], f k denotes the k-th further-extracted part feature, c k denotes the k-th global feature center, ‖f k − c k‖ 2 2 denotes the squared distance between the k-th further-extracted part feature f k and the k-th global feature center, and β denotes the update rate of c k.
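The regularization loss and the moving-average center update above are simple enough to sketch exactly; the part-feature matrix P stacks the f k row-wise as in step (2.2).

```python
import numpy as np

def attention_reg_loss(P, centers):
    """L_A = sum over k of ||f_k - c_k||_2^2, where row k of P is f_k
    and row k of `centers` is the global feature center c_k."""
    return np.sum((P - centers) ** 2)

def update_centers(P, centers, beta=0.05):
    """Moving-average update c_k <- c_k + beta * (f_k - c_k)."""
    return centers + beta * (P - centers)
```

Pulling each f k toward its center c k encourages attention map k to fire on the same object part across images of a category.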
Further, in step (2.4), the attention-guided data augmentation during training includes generation of the augmentation map, attention cropping, and attention dropping, specifically:
The augmentation map is generated as follows:
When the object is small, a large portion of the image is background, and random data augmentation is inefficient in this case. For each training image an attention map is randomly selected to guide the augmentation process and is normalized into an augmentation map A k *, which can be expressed by the following formula:
A k * = (A k − min(A k)) / (max(A k) − min(A k))
where A k * ∈ R H×W denotes the augmentation map of the k-th attention map, R denotes the real space, H and W denote the height and width of the augmentation map, A k denotes the k-th attention map, and min(A k) and max(A k) denote the minimum and maximum pixel values in the k-th attention map, respectively;
The attention cropping proceeds as follows:
First, the crop mask of every pixel of A k * whose value exceeds a manually set crop threshold θ c ∈ [0,1] is set to 1, and the crop mask of all other pixels is set to 0, as shown in the following formula:
C k(i,j) = 1 if A k *(i,j) > θ c, otherwise 0
where (i,j) denotes the pixel with horizontal coordinate i and vertical coordinate j, C k(i,j) denotes the crop mask of pixel (i,j) obtained from the k-th augmentation map, and A k *(i,j) denotes the value of pixel (i,j) in the k-th augmentation map;
A bounding box B k determined from the k-th augmentation map covers the region where C k(i,j) is positive; the region enclosed by B k is cropped from the original image and enlarged as augmented input data, from which finer-grained features are extracted;
The attention dropping proceeds as follows:
The drop mask of every pixel of A k * whose value exceeds a manually set drop threshold θ d ∈ [0,1] is set to 0, and the drop mask of all other pixels is set to 1, as shown in the following formula:
D k(i,j) = 0 if A k *(i,j) > θ d, otherwise 1
where D k(i,j) denotes the drop mask of pixel (i,j) obtained from the k-th augmentation map.
Further, in step (3), the localization and refinement, locating the fine-grained recognition region with a bounding box and extracting its features, specifically comprises:
Obtaining the attention maps A with the trained network model after step (2.1); the average A aver of the M attention maps, which indicates the object location, is calculated by:
A aver = (1/M) Σ k=1 M A k
According to A aver and the attention-cropping procedure of step (2.4), the object region indicated by A aver is cropped from the original image; this region is the localized fine-grained recognition region. The region is enlarged by bilinear interpolation and its features are extracted with the same network structure, yielding the fine-grained recognition region features used for the final category prediction.
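The two operations used repeatedly in this localization step, averaging the attention maps and enlarging a cropped region by bilinear interpolation, can be sketched as follows (a plain single-channel resize; real images would apply it per channel):

```python
import numpy as np

def average_attention(A):
    """A_aver = (1/M) * sum_k A_k over the M attention maps, A of shape (H, W, M)."""
    return A.mean(axis=2)

def bilinear_resize(img, out_h, out_w):
    """Enlarge a 2-D array with bilinear interpolation."""
    H, W = img.shape
    ys = np.linspace(0, H - 1, out_h)
    xs = np.linspace(0, W - 1, out_w)
    out = np.empty((out_h, out_w))
    for a, y in enumerate(ys):
        y0 = int(np.floor(y)); y1 = min(y0 + 1, H - 1); wy = y - y0
        for b, x in enumerate(xs):
            x0 = int(np.floor(x)); x1 = min(x0 + 1, W - 1); wx = x - x0
            # weighted blend of the four neighboring pixels
            out[a, b] = ((1 - wy) * (1 - wx) * img[y0, x0]
                         + (1 - wy) * wx * img[y0, x1]
                         + wy * (1 - wx) * img[y1, x0]
                         + wy * wx * img[y1, x1])
    return out
```

The enlarged crop is then fed back through the same network structure to extract the refined region features.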
Further, in step (4), ranking the attention maps with the importance ranking algorithm and selecting the most discriminative region to participate in category prediction specifically comprises:
Obtaining the attention maps A with the trained network model after step (2.1). For each A k, the object region indicated by A k is cropped from the original image, enlarged by bilinear interpolation, and its features are extracted with the same network structure; from these features the probabilities Q 1, Q 2, Q 3, …, Q M that each region belongs to the ground-truth class are computed. The region with the largest ground-truth probability Q k, corresponding to A k, is taken as the anchor node. The coordinates of the geometric center of every region are calculated, and all regions whose geometric center lies within a margin of the anchor node's geometric center are selected; their corresponding attention maps A k, A l, …, A t are averaged to obtain A aver. The object region indicated by A aver is cropped from the original image, enlarged by bilinear interpolation, and its features are extracted with the same network structure, yielding the most discriminative region features used for the final category prediction.
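The selection logic of the importance ranking step above can be sketched as follows. This is a simplified sketch: it assumes the per-region ground-truth probabilities and the geometric centers have already been computed by the network, and only performs the anchor selection and margin filtering.

```python
import numpy as np

def select_discriminative_maps(centers, probs, margin):
    """Sketch of the attention-map importance ranking step.
    centers: (M, 2) geometric centers of the M attention regions;
    probs:   (M,) ground-truth class probability predicted from each region.
    The region with the highest probability becomes the anchor node; every
    region whose center lies within `margin` of the anchor's center is kept.
    Returns the anchor index and the indices of the kept maps (their
    attention maps would then be averaged into A_aver)."""
    anchor = int(np.argmax(probs))
    dist = np.linalg.norm(centers - centers[anchor], axis=1)
    keep = np.where(dist < margin)[0]
    return anchor, keep
```

Averaging only the maps near the anchor concentrates A aver on the single most discriminative part rather than on the whole object.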
Advantageous effects
1. The invention provides an attention-map importance ranking algorithm that ranks the attention maps by importance, locates the most discriminative region of the original image according to that importance, and reinforces learning of this region, solving the problem that the strong randomness of random-cropping augmentation introduces too much unnecessary noise;
2. When extracting the original-image features, the invention uses only the first three convolutional layers; compared with high-level features, the extracted intermediate-layer features have higher resolution and contain more positional and detail information, while avoiding the low semantics and high noise of low-level features. The output of the third convolutional layer is then processed with a 3×3 convolution Conv, global average pooling GAP, and global max pooling GMP, so that multi-scale information is obtained, benefiting fine-grained recognition tasks in which only subtle local differences exist;
3. The invention uses attention cropping and attention dropping, applying the idea of reinforcement learning to drive the network to extract more discriminative features.
Drawings
FIG. 1 is a flow chart of the fine-grained recognition method based on attention map ranking according to the present invention;
FIG. 2 is an overall framework diagram of the fine-grained recognition method based on attention map ranking according to the present invention;
FIG. 3 is a schematic diagram of the weakly supervised attention learning process of FIG. 2;
FIG. 4 is a schematic diagram of the bilinear attention pooling process of FIG. 2;
FIG. 5 is a schematic diagram of the attention-map importance ranking algorithm of FIG. 2.
Detailed Description
The application is described in further detail below with reference to the drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the application and are not limiting of the application. It should be noted that, for convenience of description, only the portions related to the present application are shown in the drawings.
It should be noted that, without conflict, the embodiments of the present application and features of the embodiments may be combined with each other. The application will be described in detail below with reference to the drawings in connection with embodiments.
In this embodiment, a fine-grained recognition method based on attention map ranking is shown in FIG. 1, and its overall framework is shown in FIG. 2; it comprises the following steps:
(1) Acquiring original-image features;
The features of the training-set images are extracted with the first three convolutional layers of the convolutional neural network Inception v3; the output X 3 of the third convolutional layer is then processed separately with a 3×3 convolution Conv, global average pooling GAP, and global max pooling GMP, as shown in FIG. 3, and the three resulting features are concatenated. Batch normalization (Batch Normalization) is then applied to the concatenated features to speed up training of the convolutional network, and the feature map of the image is obtained through a fully connected layer. The obtained feature maps are resized to the same size by bilinear interpolation, yielding the original-image features used for the final category prediction.
(2) Performing weakly supervised attention learning;
(2.1) Obtaining the feature map and the attention maps:
The features of the training-set images are extracted with a convolutional neural network to obtain a feature map F, F ∈ R H×W×N; the distribution of the object parts is represented by attention maps A ∈ R H×W×M, where R denotes the real space, H and W denote height and width, N denotes the number of feature-map channels, and M denotes the number of attention maps. A is obtained from F by the following formula:
A = (A 1, A 2, …, A M) = f(F)
where F denotes the feature map, f(F) denotes a convolution applied to the feature map, k denotes a counter with k ∈ [1, M], and A k denotes the k-th attention map.
(2.2) Bilinear attention pooling:
After the attention maps A are obtained, features are extracted from these parts by bilinear attention pooling (Bilinear Attention Pooling, BAP); FIG. 4 shows the BAP process. The feature map F is multiplied element-wise by each attention map to generate the part feature maps, as shown in the following formula:
Fk=Ak⊙F (k=1,2,…,M)
where F k ∈ R H×W×N denotes the k-th part feature map and ⊙ denotes element-wise multiplication.
The discriminative local features are further extracted by a feature-extraction operation, giving the k-th further-extracted part feature f k ∈ R 1×N, as shown in the following formula:
fk=g(Fk)
where f k denotes the k-th further-extracted part feature and g(F k) denotes the feature-extraction operation applied to the k-th part feature map F k.
The overall feature of the object is represented by the part feature matrix P ∈ R M×N, formed by stacking these further-extracted part features:
P = (f 1, f 2, …, f M) T
where M denotes the number of attention maps and N denotes the number of feature-map channels.
(2.3) Attention regularization:
For each fine-grained category, the k-th attention map A k is expected to represent the same k-th part of the object, and the invention proposes an attention regularization loss to weakly supervise the attention learning process. Differences among the further-extracted part features belonging to the same object part are penalized: the k-th further-extracted part feature f k is drawn close to the k-th global feature center c k ∈ R 1×N, and the k-th attention map A k is activated at the same part of the object. The attention regularization loss L A is shown in the following formula:
L A = Σ k=1 M ‖f k − c k‖ 2 2
The update formula of c k is as follows:
ck←ck+β(fk−ck)
where M denotes the number of attention maps, k denotes a counter with k ∈ [1, M], f k denotes the k-th further-extracted part feature, c k denotes the k-th global feature center, ‖f k − c k‖ 2 2 denotes the squared distance between the k-th further-extracted part feature f k and the k-th global feature center, and β denotes the update rate of c k.
(2.4) Attention-guided data augmentation during training, including generation of the augmentation map, attention cropping, and attention dropping.
The augmentation map is generated as follows:
When the object is small, a large portion of the image is background, and random data augmentation is inefficient in this case. With the attention maps, the data can be augmented more effectively. For each training image an attention map is randomly selected to guide the augmentation process, and the k-th attention map is normalized into an augmentation map A k *:
A k * = (A k − min(A k)) / (max(A k) − min(A k))
where A k * ∈ R H×W, R denotes the real space, H and W denote height and width, A k denotes the k-th attention map, and min(A k) and max(A k) denote the minimum and maximum pixel values in the k-th attention map, respectively.
The attention cropping proceeds as follows:
With the augmentation map, more detailed local features are extracted by enlarging the region of the original image that corresponds to the augmentation map. Specifically, the crop mask of every pixel of A k * whose value exceeds a manually set crop threshold θ c ∈ [0,1] is first set to 1, and the crop mask of all other pixels is set to 0, as shown in the following formula:
C k(i,j) = 1 if A k *(i,j) > θ c, otherwise 0
where (i,j) denotes the pixel with horizontal coordinate i and vertical coordinate j, C k(i,j) denotes the crop mask of pixel (i,j) obtained from the k-th augmentation map, and A k *(i,j) denotes the value of pixel (i,j) in the k-th augmentation map.
A bounding box B k determined from the k-th augmentation map covers the region where C k(i,j) is positive, and the region enclosed by B k is cropped from the original image and enlarged as augmented input data, as shown in FIG. 3. As the proportion of the object parts increases, the object can be seen better and finer-grained features are extracted.
The attention dropping proceeds as follows:
The attention regularization loss supervises the k-th attention map A k to represent the k-th part of the same object, but different attention maps may focus on similar parts. To encourage the attention maps to represent multiple distinct parts of the object, attention dropping is proposed. Specifically, the drop mask of every pixel of A k * whose value exceeds a manually set drop threshold θ d ∈ [0,1] is set to 0, and the drop mask of all other pixels is set to 1, as shown in the following formula:
D k(i,j) = 0 if A k *(i,j) > θ d, otherwise 1
where D k(i,j) denotes the drop mask of pixel (i,j) obtained from the k-th augmentation map.
Masking the original image with D k(i,j) removes the k-th part region; because that part is eliminated from the image, the network is encouraged to extract other discriminative parts, which means the object can also be seen better, improving the robustness of classification and the accuracy of localization.
(3) Localization and refinement: locating the fine-grained recognition region with a bounding box and extracting its features:
The attention maps A are obtained with the trained network model after step (2.1); the average A aver of the M attention maps, which indicates the object location, is calculated by:
A aver = (1/M) Σ k=1 M A k
According to A aver and the attention-cropping procedure of step (2.4), the object region indicated by A aver is cropped from the original image; this region is the localized fine-grained recognition region. The region is enlarged by bilinear interpolation and its features are extracted with the same network structure, yielding the fine-grained recognition region features used for the final category prediction.
(4) The attention maps are ranked by the importance ranking algorithm and the most discriminative part is selected to participate in category prediction; FIG. 5 is a schematic diagram of the attention-map importance ranking algorithm.
The attention maps A are obtained with the trained network model after step (2). Following the attention-cropping procedure of step (2.4), the object region indicated by each A k is cropped from the original image, enlarged by bilinear interpolation, and its features are extracted with the same network structure; from these features the probabilities Q 1, Q 2, Q 3, …, Q M that each region belongs to the ground-truth class are computed. The region with the largest ground-truth probability Q k, corresponding to A k, is taken as the anchor node; its importance is the highest. The coordinates of the geometric center of every region are calculated, and all regions whose geometric center lies within a margin of the anchor node's geometric center are selected; their corresponding attention maps A k, A l, …, A t are averaged to obtain A aver. Following the attention-cropping procedure of step (2.4), the object region indicated by A aver, which is the most discriminative region, is cropped from the original image, enlarged by bilinear interpolation, and its features are extracted with the same network structure, yielding the most discriminative region features used for the final category prediction;
(5) The features of the three levels, namely the original image, the localized fine-grained recognition region, and the selected most discriminative region, are concatenated (concat) for the final prediction.
The foregoing is only a preferred embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions easily contemplated by those skilled in the art within the technical scope of the present invention disclosed in the embodiments of the present invention should be covered by the present invention. Therefore, the protection scope of the present invention should be subject to the protection scope of the claims.

Claims (7)

1. A fine-grained recognition method based on attention map ranking, characterized by comprising the following steps:
(1) Acquiring original image features, wherein the acquiring of the original image features specifically comprises:
Extracting features of the images in the training set using the first three convolution layers of the convolutional neural network Inception v; then processing the output X3 of the third convolution layer separately with a 3×3 convolution Conv, global maximum pooling GMP, and global average pooling GAP; cascading the three resulting features to obtain the concatenated feature; then applying batch normalization (Batch Normalization) to the cascaded feature to accelerate training of the convolutional network, and obtaining the feature map of the image through fully connected processing; adjusting the obtained feature maps to the same size by bilinear interpolation, thereby obtaining the original-image features used for the final category prediction;
(2) Performing weak supervised attention learning, the performing weak supervised attention learning comprising:
(2.1) obtaining a feature map and an attention map;
(2.2) bilinear attention pooling;
(2.3) attention regularization;
(2.4) attention-map-guided data augmentation during training, including generation of enhancement maps, attention cropping, and attention dropping;
(3) Positioning and refining, namely locating the fine-grained identification region through a bounding box and extracting the features of the region;
(4) Ordering the attention maps according to an importance ranking algorithm and selecting the most discriminative region to participate in the category prediction;
(5) Cascading the features of the three levels, the original image, the located fine-grained identification region, and the selected most discriminative region, for the final prediction.
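Step (1) of claim 1 can be sketched as follows. The tensor shapes and toy values are illustrative assumptions, and the 3×3-convolution branch is represented only schematically (by a pooled copy of X3) so the sketch stays self-contained; it is not the patented implementation.

```python
import numpy as np

def gap(x):
    # global average pooling over the spatial dimensions: (H, W, C) -> (C,)
    return x.mean(axis=(0, 1))

def gmp(x):
    # global max pooling over the spatial dimensions: (H, W, C) -> (C,)
    return x.max(axis=(0, 1))

rng = np.random.default_rng(0)
X3 = rng.standard_normal((8, 8, 16))     # toy output of the third convolution layer

# stand-in for the 3x3-convolution branch (an assumption: pooled to a vector)
conv_branch = gap(X3)

# cascade (concatenate) the three branch features, then batch-normalize
feat = np.concatenate([conv_branch, gmp(X3), gap(X3)])
feat_bn = (feat - feat.mean()) / np.sqrt(feat.var() + 1e-5)
print(feat_bn.shape)   # (48,)
```

A fully connected layer and bilinear resizing would follow in the actual pipeline.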
2. The fine-grained identification method according to claim 1, wherein in step (2.1) the obtaining of the feature map and the attention maps is specifically:
Extracting features of the images in the training set with a convolutional neural network to obtain a feature map F, where F ∈ R^(H×W×N), R denotes the dimension space, H and W denote the height and width of the feature map, and N denotes the number of channels of the feature map; the distribution of the parts of the object is represented by the attention maps A ∈ R^(H×W×M), where M denotes the number of attention maps; A is obtained from F by the following formula:
A = f(F) = {A1, A2, ..., AM}
where F represents the feature map, f(F) represents a convolution operation on the feature map, k represents a counter with k ∈ [1, M], and Ak represents the kth attention map.
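The attention-map generation of claim 2 can be sketched as a 1×1 convolution from N feature channels to M attention channels. The kernel, the ReLU non-negativity, and all shapes are illustrative assumptions, not details fixed by the patent.

```python
import numpy as np

rng = np.random.default_rng(1)
H, W, N, M = 8, 8, 16, 4
F = rng.standard_normal((H, W, N))   # feature map F in R^(H×W×N)
Wc = rng.standard_normal((N, M))     # hypothetical 1x1-convolution kernel f(.)

# A = f(F): a 1x1 convolution mapping N feature channels to M attention maps,
# with ReLU to keep attention values non-negative (an assumption)
A = np.maximum(F @ Wc, 0.0)          # A in R^(H×W×M); A[..., k-1] is the kth map
print(A.shape)                       # (8, 8, 4)
```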
3. The fine-grained identification method according to claim 2, wherein in step (2.2) the bilinear attention pooling is specifically:
After the attention maps A are obtained, features are extracted from the corresponding parts by bilinear attention pooling BAP: the feature map F is multiplied element-wise by each attention map to generate a part feature map, as shown in the following formula:
Fk = Ak ⊙ F (k = 1, 2, ..., M)
where Fk ∈ R^(H×W×N) represents the kth part feature map and ⊙ represents element-wise multiplication;
The discriminative local features are further extracted through a feature extraction operation, obtaining the kth further-extracted part feature fk ∈ R^(1×N), as shown in the following formula:
fk = g(Fk)
where fk represents the kth further-extracted part feature and g(Fk) represents the feature extraction operation performed on the kth part feature map Fk;
The overall feature of the object is represented by a part feature matrix P ∈ R^(M×N), formed by stacking the further-extracted part features; the part feature matrix can be expressed by the following formula:
P = (f1; f2; ...; fM)
where M represents the number of attention maps and N represents the number of feature map channels.
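Bilinear attention pooling as described in claim 3 can be sketched as follows. Global average pooling is assumed for the feature extraction operation g(·), and the toy shapes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
H, W, N, M = 8, 8, 16, 4
F = rng.standard_normal((H, W, N))                   # feature map
A = np.maximum(rng.standard_normal((H, W, M)), 0.0)  # toy attention maps

def g(Fk):
    # feature extraction g(.); global average pooling is assumed here
    return Fk.mean(axis=(0, 1))                      # length-N vector (R^(1×N))

# BAP: Fk = Ak (element-wise) F, fk = g(Fk); P stacks the fk row-wise
P = np.stack([g(A[:, :, k:k + 1] * F) for k in range(M)])  # P in R^(M×N)
print(P.shape)                                       # (4, 16)
```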
4. The fine-grained identification method according to claim 3, wherein in step (2.3) the attention regularization is specifically:
For each fine-grained category, the kth attention map Ak is expected to represent the kth part of the object at the same location; differences among the further-extracted part features belonging to the same part are penalized, so that the kth further-extracted part feature fk is drawn close to the kth global feature center ck ∈ R^(1×N) and the kth attention map Ak is activated on the same part of the object; the attention regularization loss LA is given by the following formula:
LA = Σ (k=1 to M) ‖fk − ck‖₂²
The update formula for ck is as follows:
ck ← ck + β(fk − ck)
where M represents the number of attention maps, k represents a counter with k ∈ [1, M], fk represents the kth further-extracted part feature, ck represents the kth global feature center, ‖fk − ck‖₂² represents the squared difference between the kth further-extracted part feature fk and the kth global feature center, and β represents the update rate of ck.
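The regularization loss and the moving-average center update of claim 4 can be sketched directly. The dimensions, the zero initialization of the centers, and the value of β are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
M, N, beta = 4, 16, 0.05
f = rng.standard_normal((M, N))   # further-extracted part features f_k
c = np.zeros((M, N))              # global feature centers c_k (assumed zero-initialized)

# attention regularization loss: LA = sum_k ||f_k - c_k||_2^2
LA = float(np.sum((f - c) ** 2))

# moving-average update of the centers: c_k <- c_k + beta * (f_k - c_k)
c = c + beta * (f - c)
print(LA >= 0.0)                  # True
```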
5. The fine-grained identification method according to claim 4, wherein in step (2.4) the attention-map-guided data augmentation during training comprises generation of enhancement maps, attention cropping, and attention dropping, specifically:
The enhancement-map step of the data augmentation is as follows:
When the object is small, a large portion of the image is background, and random data augmentation is inefficient in this case; therefore, for each training image, one attention map is randomly selected to guide the augmentation process and is normalized into an enhancement map, which can be expressed by the following formula:
A*k = (Ak − min(Ak)) / (max(Ak) − min(Ak))
where A*k ∈ R^(H×W) represents the enhancement map of the kth attention map, R denotes the dimension space, H and W denote the height and width of the enhancement map, Ak represents the kth attention map, and min(Ak) and max(Ak) represent the minimum and maximum pixel values in the kth attention map, respectively;
The attention-cropping step of the data augmentation is as follows:
The cropping mask of each pixel of A*k whose value is greater than a manually set cropping threshold θc ∈ [0, 1] is set to 1, and the cropping masks of the other pixels are set to 0, as shown in the following formula:
Ck(i, j) = 1 if A*k(i, j) > θc, otherwise Ck(i, j) = 0
where (i, j) represents the pixel with coordinates i and j on the horizontal and vertical axes, Ck(i, j) represents the cropping mask of pixel (i, j) obtained from the kth enhancement map, and A*k(i, j) represents the value of pixel (i, j) in the kth enhancement map;
A bounding box Bk determined from the kth enhancement map covers the region where Ck(i, j) is positive; the region enclosed by Bk is enlarged from the original image as augmented input data, from which finer-grained features are extracted;
The attention-dropping step of the data augmentation is as follows:
The dropping mask of each pixel of A*k whose value is greater than a manually set dropping threshold θd ∈ [0, 1] is set to 0, and the dropping masks of the other pixels are set to 1, as shown in the following formula:
Dk(i, j) = 0 if A*k(i, j) > θd, otherwise Dk(i, j) = 1
where Dk(i, j) represents the dropping mask of pixel (i, j) obtained from the kth enhancement map.
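The enhancement-map normalization, cropping mask, dropping mask, and bounding box of claim 5 can be sketched together. The toy attention map and the thresholds are illustrative assumptions.

```python
import numpy as np

def enhancement_map(Ak):
    # min-max normalize an attention map to [0, 1]
    return (Ak - Ak.min()) / (Ak.max() - Ak.min())

def crop_mask(Ek, theta_c):
    # Ck(i,j) = 1 where the enhancement map exceeds the cropping threshold, else 0
    return (Ek > theta_c).astype(int)

def drop_mask(Ek, theta_d):
    # Dk(i,j) = 0 where the enhancement map exceeds the dropping threshold, else 1
    return (Ek <= theta_d).astype(int)

def bounding_box(Ck):
    # smallest box covering all positive entries of the cropping mask
    rows, cols = np.nonzero(Ck)
    return rows.min(), rows.max(), cols.min(), cols.max()

Ak = np.array([[0.0, 8.0, 1.0],
               [2.0, 7.0, 6.0],
               [0.0, 1.0, 0.0]])
Ek = enhancement_map(Ak)        # values in [0, 1]
Ck = crop_mask(Ek, 0.5)
print(bounding_box(Ck))         # (0, 1, 1, 2): rows 0-1, cols 1-2
print(drop_mask(Ek, 0.5))       # high-attention pixels zeroed out
```

The region inside the bounding box would be enlarged as augmented input; the dropping mask erases the same salient region, pushing the network toward other parts.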
6. The fine-grained identification method according to claim 5, wherein in step (3) the positioning and refining locate the fine-grained identification region through a bounding box and extract the features of the region, specifically:
Obtaining the attention maps A with the trained network model after step (2.1); the average Aaver of the M attention maps, which indicates the object location, is calculated by:
Aaver = (1/M) Σ (k=1 to M) Ak
Cutting out the object region indicated by Aaver from the original image according to the attention-cropping data augmentation step in step (2.4), this region being the located fine-grained identification region; amplifying the region by bilinear interpolation; extracting the features of the region with the same network structure; and obtaining the fine-grained identification region features for the final category prediction.
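The averaging of the M attention maps in claim 6 is a one-line operation; the toy shapes below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)
H, W, M = 6, 6, 4
A = np.maximum(rng.standard_normal((H, W, M)), 0.0)  # toy attention maps

# Aaver = (1/M) * sum_k Ak : the mean attention map indicating the object location
A_aver = A.sum(axis=2) / M
print(A_aver.shape)                                  # (6, 6)
```

Aaver would then be thresholded like an enhancement map to crop the object region before bilinear enlargement.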
7. The fine-grained identification method according to claim 6, wherein in step (4) the ordering of the attention maps according to an importance ranking algorithm and the selection of the most discriminative region to participate in the category prediction are specifically:
Obtaining the attention maps A with the trained network model after step (2.1); cutting out the object region indicated by Ak from the original image; amplifying the region by bilinear interpolation; extracting the features of the region with the same network structure; judging from these features the probabilities Q1, Q2, Q3, ..., Qm that the region belongs to the ground-truth class; selecting the region corresponding to the maximum ground-truth probability Qk and regarding it as the anchor node; calculating the coordinates of the geometric center of each region; selecting all regions whose geometric centers lie within the margin of the geometric center of the anchor node, the attention maps corresponding to these regions being Ak, Al, ..., At; averaging these attention maps to obtain Aaver; cutting out the object region indicated by Aaver from the original image; amplifying the region by bilinear interpolation; extracting the features of the region with the same network structure; and obtaining the most discriminative region features for the final category prediction.
Publications (2)

Publication Number Publication Date
CN113936145A CN113936145A (en) 2022-01-14
CN113936145B true CN113936145B (en) 2024-06-11
