CN115984627A - Local diversity guided weak supervision fine-grained image classification method and system - Google Patents


Info

Publication number
CN115984627A
CN115984627A (application CN202310077544.7A)
Authority
CN
China
Prior art keywords
feature
guided
image classification
diversified
attention
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310077544.7A
Other languages
Chinese (zh)
Inventor
刘光辉
占华
孟月波
段中兴
王博
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xian University of Architecture and Technology
Original Assignee
Xian University of Architecture and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xian University of Architecture and Technology filed Critical Xian University of Architecture and Technology
Priority to CN202310077544.7A priority Critical patent/CN115984627A/en
Publication of CN115984627A publication Critical patent/CN115984627A/en
Pending legal-status Critical Current

Classifications

    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Image Analysis (AREA)

Abstract

The invention provides a local diversity-guided weak supervision fine-grained image classification method and system. A local diversity-guided weak supervision fine-grained image classification network is constructed, comprising a basic backbone network together with a cross-layer attention interaction module and a bilinear pooling module connected behind Layer3 and Layer4 of the backbone, with a random selection strategy connected between the cross-layer attention interaction module and the bilinear pooling module. The network is trained to obtain a local diversity-guided weak supervision fine-grained image classification network model, and the preprocessed training data set is sent into the model to obtain image classification results. The invention solves the problem of inaccurate classification results caused by distinguishable features in existing fine-grained image classification tasks being too fine to capture and by the lack of effective use of local information.

Description

Local diversity guided weak supervision fine-grained image classification method and system
Technical Field
The invention belongs to the technical field of electronic information, and in particular relates to a local diversity-guided weak supervision fine-grained image classification method and system.
Background
Fine-grained image classification, also called sub-category image classification, aims to divide images belonging to the same basic category (automobiles, dogs, flowers, birds and the like) into finer sub-categories. It has broad business demands and application scenarios in industry and daily life: it can be used for the identification and study of animals and plants in field environments, providing an important technical basis for biology; it can be applied to visual tasks such as clothing detection and clothing recognition; it can be deployed for automatic checkout services in retail settings; and it can quickly and accurately identify vehicles travelling at high speed. Compared with the common image classification task, fine-grained image classification is more difficult because of the subtle inter-class differences and larger intra-class differences among sub-categories.
In terms of image recognition algorithms, methods can be divided into strongly supervised and weakly supervised directions: strongly supervised methods require annotation boxes, part annotation points and the like in addition to category labels, whereas weakly supervised methods complete model training using only category labels. For fine-grained tasks, enabling the neural network to locate distinguishable parts and learn distinguishable features is the key to solving the fine-grained problem, and strongly supervised methods were often adopted in early work. The strongly supervised approach depends excessively on additional manual annotation information, such as annotated bounding boxes and part annotations, which is expensive to obtain, limiting its practicality. With the development of deep learning and the deepening of related research, weakly supervised approaches that complete classification relying only on class labels can achieve good classification performance without additional manual annotation. Weakly supervised classification algorithms not only reduce extra annotation cost but also better meet the needs of practical application, and are the development trend of current research.
The attention mechanism helps increase the focus on discriminative parts and thereby improves classification accuracy. However, although an attention mechanism can guide the model to focus on discriminative parts, such methods usually attend to only a few salient parts, do not comprehensively explore potential discriminative parts, and treat each feature in isolation, so classification results are inaccurate when images from the same application scenario are classified at fine granularity.
Disclosure of Invention
In order to solve the problems in the prior art, the invention provides a local diversity-guided weak supervision fine-grained image classification method and system. By constructing a local diversity-guided weak supervision fine-grained classification network model, it solves the problem of inaccurate classification results caused by distinguishable features in existing fine-grained image classification tasks being too fine to capture and by the lack of effective use of local information.
In order to achieve the purpose, the invention provides the following technical scheme: a local diversity-guided weak supervision fine-grained image classification method specifically comprises the following steps:
S1, constructing a local diversity-guided weak supervision fine-grained image classification network, wherein the classification network comprises a basic backbone network ConvNeXt, a cross-layer attention interaction module CAIM and a bilinear pooling module BP connected behind Layer3 and Layer4 of the backbone ConvNeXt, with a random selection strategy RSS connected between the cross-layer attention interaction module CAIM and the bilinear pooling module BP;
S2, training the local diversity-guided weak supervision fine-grained image classification network to obtain a local diversity-guided weak supervision fine-grained image classification network model;
S3, sending the preprocessed training data set into the local diversity-guided weak supervision fine-grained image classification network model to obtain an image classification result.
Further, the basic backbone network ConvNeXt includes 4 layers, where Layer1 preprocesses the input image and Layer2-Layer4 each consist of multiple ConvNeXt blocks. Layer3 and Layer4 of the backbone ConvNeXt output initial feature maps D ∈ R^(C×W×H) at different scales, where C, W and H denote the number of channels, width and height of the feature map.
Further, a cross-layer attention interaction module CAIM is constructed, and the method specifically comprises the following steps:
1) Apply 3×3 convolution to the initial feature maps D ∈ R^(C×W×H) obtained from Layer3 and Layer4 of the backbone to produce multi-level attention maps A_1 ∈ R^(C'×W_1×H_1) and A_2 ∈ R^(C'×W_2×H_2);
2) Perform spatial interaction and channel interaction on the multi-level attention maps A_1 and A_2 to obtain the spatially interacted feature maps A_s12, A_s21 and the channel-interacted feature maps A_c12, A_c21;
3) Combine A_s12 with A_c12 to obtain the diversified feature map A_12, and combine A_s21 with A_c21 to obtain the diversified feature map A_21.
Further, the spatial interaction includes:
1) Reshape the multi-level attention maps A_1 ∈ R^(C'×W_1×H_1) and A_2 ∈ R^(C'×W_2×H_2) into feature maps A'_1 ∈ R^(C'×L_1) and A'_2 ∈ R^(C'×L_2), where L_1 = W_1×H_1 and L_2 = W_2×H_2;
2) Take the inner product of A'_1 and A'_2 to obtain the spatial similarity matrix W_1, and normalize -W_1 to obtain the spatial interaction feature map M_1:
M_1 = softmax(-W_1) ∈ [0,1]^(L_1×L_2), W_1 = A'_1^T × A'_2;
3) Map the spatial interaction feature map M_1 onto A'_1 and A'_2 to obtain A'_s12 and A'_s21:
A'_s12 = M_1^T × A'_1^T + A'_2^T
A'_s21 = M_1 × A'_2^T + A'_1^T;
4) Reshape A'_s12 and A'_s21 to obtain feature maps A_s12 ∈ R^(C'×W_2×H_2) and A_s21 ∈ R^(C'×W_1×H_1) that reflect spatial dependencies.
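The spatial interaction can be sketched in a few lines of NumPy. This is an illustrative reconstruction, not the patent's implementation: the exact placement of the transposes and the softmax axis are ambiguous in the published text and are chosen here for dimensional consistency.

```python
import numpy as np

def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def spatial_interaction(a1, a2):
    """Spatial interaction between two multi-level attention maps.

    a1: (C', W1, H1), a2: (C', W2, H2).
    Returns A_s12 of shape (C', W2, H2) and A_s21 of shape (C', W1, H1).
    """
    c, w1, h1 = a1.shape
    _, w2, h2 = a2.shape
    a1f = a1.reshape(c, w1 * h1)      # A'_1 in R^(C' x L1)
    a2f = a2.reshape(c, w2 * h2)      # A'_2 in R^(C' x L2)
    w = a1f.T @ a2f                   # spatial similarity W_1: (L1, L2)
    m = softmax(-w, axis=1)           # M_1 = softmax(-W_1): low similarity -> high weight
    s12 = m.T @ a1f.T + a2f.T         # (L2, C')
    s21 = m @ a2f.T + a1f.T           # (L1, C')
    return s12.T.reshape(c, w2, h2), s21.T.reshape(c, w1, h1)
```

Negating the similarity before the softmax is what makes low-similarity (highly complementary) pixel pairs receive the largest interaction weights.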
Further, the channel interaction comprises:
1) Downsample and reshape the multi-level attention map A_2 to obtain the feature map A'_2d ∈ R^(C'×L_1); reshape the multi-level attention map A_1 to obtain A'_1, and treat A'_1 and A'_2d as an image pair;
2) Take the inner product of A'_1 and A'_2d to obtain the channel similarity matrix W_2, evaluate the similarity of the image pair through W_2, and normalize -W_2^T to obtain the channel interaction feature map M_2:
M_2 = softmax(-W_2^T) ∈ [0,1]^(C'×C'), W_2 = A'_1 × A'_2d^T;
3) Map the channel interaction feature map M_2 onto A'_1 and A'_2d to obtain A'_c12 and A'_c21:
A'_c12 = M_2 × A'_1 + A'_2d
A'_c21 = M_2^T × A'_2d + A'_1;
4) Reshape A'_c12 to obtain A_c12; apply the above process once more to A'_c21 and reshape to obtain A_c21.
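A channel-interaction sketch under the same caveats: the published transposes are garbled by extraction, so W_2 = A'_1 × A'_2d^T is assumed here so that M_2 has the C'×C' shape the text states.

```python
import numpy as np

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def channel_interaction(a1, a2d):
    """Channel interaction between reshaped attention maps.

    a1:  A'_1  of shape (C', L1).
    a2d: A'_2d of shape (C', L1), i.e. A_2 after downsampling and reshaping.
    Returns A'_c12 and A'_c21, both (C', L1).
    """
    w2 = a1 @ a2d.T                   # channel similarity W_2: (C', C')
    m2 = softmax(-w2.T, axis=1)       # M_2 in [0,1]^(C' x C')
    c12 = m2 @ a1 + a2d               # complementary channel info mapped onto A'_1
    c21 = m2.T @ a2d + a1             # and onto A'_2d
    return c12, c21
```

Each attention channel acts as a part descriptor, so this mixes complementary information across channels rather than across spatial positions.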
Further, the method for constructing the random selection strategy RSS comprises the following specific steps:
1) Slice the feature map F ∈ R^(C'×W×H) output by the cross-layer attention interaction module CAIM into n equal parts along the width dimension to obtain n feature slices F^(k) ∈ R^(C'×(W/n)×H), k ∈ [1, n];
2) For each feature slice F^(k), randomly select the most significant inhibition branch, the feature enhancement branch or the no-operation branch to process the data and obtain S^(k), where the probabilities of selecting the three branches are α, β and γ respectively, with α + β + γ = 1;
3) Concatenate the processed feature slices S^(k) along the width dimension in the order they were cut to obtain the aggregated attention map A ∈ R^(C'×W×H):
A = concat(S^(k)).
Further, the most significant inhibition branch includes:
1) Perform channel-average pooling on the feature slice F^(k) to obtain F_p(k) ∈ R^((W/n)×H):
F_p(k) = (1/C') Σ_{c=1}^{C'} F^(k)_c;
2) Generate an erasure mask P_(k)drop by setting a threshold rate δ on the maximum pixel intensity of F_p(k): pixels greater than the threshold are set to 0, and pixels smaller than the threshold are set to 1:
P_(k)drop(i, j) = 0 if F_p(k)(i, j) > δ · max(F_p(k)), else 1;
3) Apply the erasure mask P_(k)drop to the feature slice F^(k) point by point to obtain the suppressed feature map.
The feature enhancement branch includes:
1) Generate an enhancement mask P_(k)important from the feature slice F^(k) using a sigmoid activation function:
P_(k)important = sigmoid(F_p(k)) ∈ [0,1]^((W/n)×H);
2) Apply the enhancement mask P_(k)important to the slice F^(k) point by point to obtain the enhanced feature map.
The no-operation branch leaves the feature slice F^(k) unprocessed.
Further, the bilinear pooling module BP fuses the output of the random selection strategy RSS with the output of the basic backbone network ConvNeXt, with the following specific steps:
1) Obtain the initial feature map D output by the backbone ConvNeXt and the aggregated attention map A ∈ R^(C'×H×W) output by the random selection strategy RSS;
2) Multiply the aggregated attention map A with the initial feature map D by element-wise dot product to obtain the part feature maps D_k:
D_k = A_k ⊙ D (k = 1, 2, ..., C');
3) Process each part feature map D_k with global max pooling GMP to obtain the attention feature d_k ∈ R^(1×C) of each specific part:
d_k = GMP(D_k);
4) Stack the attention features d_k into the part feature matrix P ∈ R^(M×C):
P = (d_1, d_2, ..., d_M)^T.
Further, Layer3 and Layer4 of the basic backbone network ConvNeXt each yield a part feature matrix P ∈ R^(M×C); the two part feature matrices are spliced along the feature dimension to obtain a feature map y, which is mapped into a one-dimensional feature vector, and the final classification of the image is realized by softmax logistic regression in a fully connected layer.
The invention also provides a local diversity-guided weak supervision fine-grained image classification system, which comprises:
a classification network construction module, used for constructing a local diversity-guided weak supervision fine-grained image classification network, wherein the classification network comprises a basic backbone network ConvNeXt, a cross-layer attention interaction module CAIM and a bilinear pooling module BP connected behind Layer3 and Layer4 of the backbone ConvNeXt, with a random selection strategy RSS connected between the cross-layer attention interaction module CAIM and the bilinear pooling module BP;
a classification network training module, used for training the local diversity-guided weak supervision fine-grained image classification network to obtain a local diversity-guided weak supervision fine-grained image classification network model;
and an image classification module, used for sending the preprocessed training data set into the local diversity-guided weak supervision fine-grained image classification network model to obtain an image classification result.
Compared with the prior art, the invention has at least the following beneficial effects:
the invention provides a local diversity-guided weak supervision fine-grained image classification method, which comprises the steps of firstly, taking ConvNeXt as a backbone network to extract multi-level initial fine-grained characteristics to obtain a characteristic diagram; then, designing a cross-layer attention interaction module (CAIM), representing each part of the object by adopting different-level attention diagrams, establishing a multi-level attention diagram, promoting semantic expression of the attention diagram by adopting a channel interaction and space interaction mode, and sharing the mined information; then, in order to avoid feature assimilation, a Random Selection Strategy (RSS) is provided, attention is promoted to try to capture more local information with discriminability in a random selection mode, and the network is further promoted to obtain richer local features; and finally, fusing the attention diagram and the feature diagram by adopting a bilinear pooling module to construct a feature representation with strong capability to enhance the network fitting capability, and completing classification tasks through a full connection layer, thereby solving the problems that distinguishable features in a fine-grained image classification task are too fine and difficult to capture, local information is lack of effective utilization and the like, and improving the classification accuracy of images belonging to the same basic category.
Drawings
FIG. 1 is a diagram of a fine-grained image classification network architecture with local diversity guidance;
FIG. 2 is a block diagram of a cross-layer attention interaction module;
FIG. 3 is a diagram of a random selection strategy architecture;
FIG. 4 is a visualization effect diagram.
Detailed Description
The invention is further described with reference to the following figures and detailed description.
The weak supervision fine-grained image classification method guided by local diversity of the invention has the following processes:
1. Download the CUB-200-2011 (birds 1), NABirds (birds 2), ISIA Food-200 (food) and self-built ancient tower datasets, screen the dataset images to ensure the integrity of the data images, and preprocess them. In the preprocessing stage, a data augmentation method randomly applies operations such as cropping, rotation and scaling to the sample images, expanding the number of dataset samples and enhancing the robustness of the CNN model.
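The augmentation step can be sketched with NumPy alone; the crop size of 224 and the flip probability of 0.5 are illustrative assumptions rather than values from the patent, and the rotation and scaling operations it mentions are omitted for brevity.

```python
import numpy as np

def augment(img, rng, out_size=224):
    """Random crop plus random horizontal flip on an H x W x 3 image array."""
    h, w, _ = img.shape
    top = rng.integers(0, h - out_size + 1)           # random crop origin
    left = rng.integers(0, w - out_size + 1)
    crop = img[top:top + out_size, left:left + out_size]
    if rng.random() < 0.5:                            # random horizontal flip
        crop = crop[:, ::-1]
    return crop
```

In practice this work is usually delegated to a data-loading library; the sketch only pins down the semantics of "randomly crop and flip to expand the sample set".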
2. Construct the local diversity-guided weak supervision fine-grained image classification network. The specific steps are as follows:
the structure of the convolutional neural network based on local diversity guidance as shown in fig. 1 includes: basic backbone network (ConvNeXt), cross-layer Attention Interaction Module (CAIM), random Selection Strategy (RSS), bilinear Pooling Module (BP), and Cross-layer splicing classification operations. The design of the cross-Layer attention interaction module CAIM is to perform feature interaction modeling on the output of the Layer3 and the Layer4 of the main network, output a diversified feature map after the feature map passes through the CAIM, further output a part feature matrix by adopting a random selection strategy RSS for the output diversified feature map, and finally splice the two part feature matrices and input the spliced part feature matrices into a classifier to complete classification tasks.
First, the basic backbone network (ConvNeXt) is built; it consists mainly of 4 layers: Layer1 has a simple structure and can be regarded as preprocessing of the input image, while Layer2-Layer4 are all composed of multiple ConvNeXt blocks and have similar structures. The input image passes through the different layers to obtain initial feature maps at different scales, each denoted D ∈ R^(C×W×H), where C, W and H denote the number of channels, width and height of the feature map.
Secondly, for a fine-grained task, capturing more discriminative local information is the key to classification performance. Accordingly, the invention generates multi-level attention maps A_1 ∈ R^(C'×W_1×H_1) and A_2 ∈ R^(C'×W_2×H_2) by applying 3×3 convolutions to the outputs of backbone layers Layer3 and Layer4. The two multi-level attention maps can express specific parts of the target object; compared with other attention models, this makes it easier to locate most object parts and thus obtain better local feature representations. The number of channels C' is set to 16 according to engineering experience.
Meanwhile, if the part feature representations were output directly after being obtained, the model's ability to compare cues across different features would be limited; different global features should therefore not be treated in isolation.
In addition, in the cross-layer attention interaction module, if unconstrained, attention tends to assimilate: it focuses on the same salient region of the target object and ignores other secondary salient regions with equal discriminative power. The invention therefore proposes a random selection strategy with three candidate branches: a most significant inhibition branch that suppresses the most discriminative part, a feature enhancement branch that rewards the most discriminative part, and a no-operation branch that, as the name implies, takes no action. Executed at random, the three candidate branches balance punishment and reward without discarding the original features, prompting attention to focus on more effective local information at minimal cost.
After the random selection strategy is completed, each channel in the attention map can represent a different part feature in the image; if it can be effectively fused with the original feature map, a stronger feature representation can be constructed. In view of this, the invention uses the bilinear pooling module BP to fuse the aggregated attention map A output by the random selection strategy with the initial feature map D.
In the classifier part, the fused and spliced feature map is mapped into a one-dimensional feature vector, and the final classification of the image is realized by softmax logistic regression in a fully connected layer.
3. The basic backbone network is executed, and the specific steps comprise:
Send the images in the preprocessed training data set into ConvNeXt to generate initial feature maps at different scales, denoted D ∈ R^(C×W×H), where C, W and H denote the number of channels, width and height of the feature map.
4. Construct the cross-layer attention interaction module CAIM.
Fig. 2 is a structural diagram of a cross-layer attention interaction module according to the present invention, and as shown in fig. 2, the cross-layer attention interaction module is mainly divided into a channel interaction and a spatial interaction modeling.
In the spatial interaction part, first, the initial feature maps D ∈ R^(C×W×H) at different scales obtained from backbone layers Layer3 and Layer4 are passed through 3×3 convolutions to produce the multi-level attention maps A_1 ∈ R^(C'×W_1×H_1) and A_2 ∈ R^(C'×W_2×H_2); these are reconstructed as an image pair and reshaped to A'_1 ∈ R^(C'×L_1) and A'_2 ∈ R^(C'×L_2), where L_1 = W_1×H_1 and L_2 = W_2×H_2.
Then, image similarity evaluation is performed: the inner product of A'_1 and A'_2 gives the spatial similarity matrix W_1, whose element w_ij represents the similarity between the i-th pixel of A'_1 and the j-th pixel of A'_2. The lower the similarity of two pixels, the greater the complementarity between them; therefore -W_1 is normalized to express the interaction relation of the image pair, giving the spatial interaction feature map M_1:
M_1 = softmax(-W_1) ∈ [0,1]^(L_1×L_2), W_1 = A'_1^T × A'_2.
Next, the spatial interaction feature map M_1 is mapped onto A'_1 and A'_2 to obtain A'_s12 and A'_s21:
A'_s12 = M_1^T × A'_1^T + A'_2^T
A'_s21 = M_1 × A'_2^T + A'_1^T.
Finally, A'_s12 and A'_s21 are reshaped into feature maps reflecting spatial dependencies, from which more discriminative features are learned, outputting A_s12 ∈ R^(C'×W_2×H_2) and A_s21 ∈ R^(C'×W_1×H_1).
In the channel interaction part, each channel in the attention diagram can be regarded as a feature representation of a specific part, and if a correlation model is established among different channels, complementary information among the channels can be enhanced, and based on the following steps:
first, image pair A 'is constructed' 1 、A′ 2d Wherein the characteristic diagram
Figure BDA0004066531890000099
Is a multi-level attention diagram A 2 Down-sampling and size conversion; characteristic diagram A' 1 As above, a multi-level attention diagram A 1 Converting the size to obtain;
secondly, much like the above-described spatially interactive section, by matching feature map A ′T 1 And A' 2d Channel similarity matrix W obtained by inner product operation 2 Through the channel similarity matrix W 2 Performing similarity evaluation on the image pair, and comparing-W 2 T After normalization operation, the method is used for expressing the interaction relation of the image pair to obtain a channel interaction feature map M 2
M 2 =softmax(-W 2 T )∈[0,1] C′×C′ ,W 2 =A ′T 1 ×A′ 2d
Different from the space interaction part, the channel interaction part mainly mines complementary information along the channel dimension;
then, the channel interaction feature map M is processed 2 Mapping to feature map A' 1 、A′ 2d Obtaining a characteristic diagram A' c12 And A' c21
A′ c12 =M 2 ×A ′T 1 +A ′T 2d
Figure BDA0004066531890000106
Finally, the feature map is compared
Figure BDA0004066531890000101
Performs size conversion to obtain a feature map>
Figure BDA0004066531890000102
For a feature map>
Figure BDA0004066531890000103
It is necessary to convert the size of the product into L after adopting the process in one more step 1 X C', and then subjected to size conversion to obtain a characteristic map>
Figure BDA0004066531890000104
After the above operations are completed, the spatially interacted A_s12 is combined with the channel-interacted A_c12 to obtain the rich, diversified feature map A_12, and the spatially interacted A_s21 is combined with the channel-interacted A_c21 to obtain the rich, diversified feature map A_21.
5. Construct the random selection strategy RSS. The specific steps are as follows:
FIG. 3 is a structural diagram of the random selection strategy according to the invention. In theory the strategy input can be an arbitrary feature map F; in this section F refers to the diversified feature maps A_12 and A_21 output by the cross-layer attention interaction module.
First, the feature map F ∈ R^(C'×W×H) is sliced into n equal parts along the width dimension, giving n feature slices F^(k) ∈ R^(C'×(W/n)×H), k ∈ [1, n].
Then each feature slice F^(k) randomly selects a candidate branch to complete the corresponding operation, where n is determined by engineering experience and is set to 7 in the invention.
1) For the most significant inhibition branch, perform channel-average pooling on the feature slice F^(k) to obtain F_p(k) ∈ R^((W/n)×H):
F_p(k) = (1/C') Σ_{c=1}^{C'} F^(k)_c.
The value range of each pixel in the feature slice F^(k) is the same as that of the input feature map and represents the key feature expression obtained by the classification model. Since the proposed randomly selected globally diversified classification network is trained for a classification task, F_p(k) approximately reflects the spatial distribution of the most discriminative parts: the higher the value of an element, the stronger its discriminative power. That is, for the classification task, the intensity of each pixel in F_p(k) represents its discriminative ability. To eliminate the most discriminative part, an erasure mask P_(k)drop is generated by setting a threshold rate δ on the maximum pixel intensity of F_p(k); pixels greater than the threshold are set to 0, and pixels smaller than the threshold are set to 1:
P_(k)drop(i, j) = 0 if F_p(k)(i, j) > δ · max(F_p(k)), else 1.
Finally, the erasure mask P_(k)drop is applied to the slice F^(k) by point-wise multiplication to obtain the suppressed feature map.
2) For the feature enhancement branch, an enhancement mask P_(k)important is generated from the feature slice F^(k) using a sigmoid activation function and applied to the slice point by point:
P_(k)important = sigmoid(F_p(k)) ∈ [0,1]^((W/n)×H).
3) The no-operation branch leaves the feature slice F^(k) unprocessed.
The probabilities of the above three candidate branches are denoted α, β and γ respectively, where α + β + γ = 1. Each feature slice F^(k) randomly selects one of the branches to obtain the selected feature S^(k).
Finally, the processed selected features S^(k) are concatenated (concat) along the width dimension in the order they were cut to obtain the aggregated attention map A ∈ R^(C'×W×H):
A = concat(S^(k)).
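The strategy can be sketched end to end as follows. The branch probabilities and δ are illustrative assumptions, and the suppression and enhancement masks are inlined so the sketch stands alone.

```python
import numpy as np

def random_selection(feat, n=7, alpha=0.4, beta=0.4, gamma=0.2, delta=0.5, rng=None):
    """RSS sketch: slice F (C', W, H) into n width-slices, pick one of the three
    candidate branches per slice with probabilities alpha/beta/gamma, and
    concatenate back into the aggregated attention map A (C', W, H)."""
    rng = rng if rng is not None else np.random.default_rng()
    out = []
    for f in np.split(feat, n, axis=1):               # F^(k): (C', W/n, H)
        branch = rng.choice(3, p=[alpha, beta, gamma])
        fp = f.mean(axis=0)                           # channel-average pooling
        if branch == 0:                               # most significant inhibition
            f = f * (fp <= delta * fp.max())
        elif branch == 1:                             # feature enhancement
            f = f * (1.0 / (1.0 + np.exp(-fp)))
        out.append(f)                                 # branch == 2: no operation
    return np.concatenate(out, axis=1)
```

Note that `np.split` requires W to be divisible by n, which matches the patent's "n equal parts" slicing.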
6. The bilinear pooling module BP specifically comprises the following steps:
after the random selection strategy is completed, the polymerization attention diagram A belongs to R C×H×W Each channel in the image can represent different part features in the image, and if the image can be effectively fused with the original feature map D, a stronger feature representation can be constructed. In view of this, the present invention uses the bilinear pooling module BP to aggregate the attention map a and the initial feature map D, and the specific process is as follows:
first, the aggregate attention map A is multiplied by the initial feature map D element by element to obtain a part feature map D k
Figure BDA0004066531890000114
Then, the partial feature map D is mapped using global maximum pooling GMP k Is processedObtaining attention characteristics d of each specific part k ∈R 1×C
d k =GMP(D k )
Finally, attention is paid to the feature d k The characteristic matrix P of the stacked parts belongs to R M×C
P=(d 1 ,d 2 ,...,d M ) T
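The three BP steps reduce to a gate-and-pool loop; a minimal sketch, assuming M equals the number of attention channels C' and taking GMP over the spatial axes:

```python
import numpy as np

def bilinear_pool(attn, feat):
    """attn: aggregated attention map A of shape (C', W, H);
    feat: initial feature map D of shape (C, W, H).
    Returns the part feature matrix P of shape (C', C)."""
    parts = []
    for k in range(attn.shape[0]):
        dk = attn[k][None, :, :] * feat               # D_k = A_k * D, broadcast over channels
        parts.append(dk.max(axis=(1, 2)))             # d_k = GMP(D_k): one value per channel
    return np.stack(parts)                            # P = (d_1, ..., d_M)^T
```

Each row of P is one part descriptor: the attention channel gates where in the image the pooling looks, and GMP keeps the strongest response per feature channel.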
7. Feature splicing, with the following specific steps:
layer3 and Layer4 of the basic backbone network ConvNeXt are subjected to CAIM and RSS to obtain two different part feature matrixes, dimension splicing is carried out on the two part feature matrixes to obtain a feature graph y, and the feature graph y is input into a classifier for classification;
in the classifier part, the feature graph y after fusion splicing is mapped into a one-dimensional feature vector, and the final classification of the image is realized by softmax logistic regression in a full-connected layer mode.
8. Loss calculation, which specifically comprises the following steps:
the preprocessed training data set is fed into the local-diversity-guided weakly supervised fine-grained image classification network to obtain predicted classification results; a loss function computes the loss value of the classification results via the Euclidean distance, and the network is trained with the Adam optimization algorithm to obtain the final local-diversity-guided weakly supervised fine-grained image classification network model.
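A minimal single-step training sketch under stated assumptions: `model` stands in for the full classification network, and cross-entropy stands in for the loss, since the patent only states that a loss value is computed from the prediction (it mentions a Euclidean-distance-based loss whose exact form is not given) and optimized with Adam:

```python
import torch
import torch.nn as nn

model = nn.Linear(32, 10)                    # stand-in for the full classification network
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()            # assumed surrogate for the patent's loss

x = torch.rand(8, 32)                        # a preprocessed mini-batch
labels = torch.randint(0, 10, (8,))
optimizer.zero_grad()
loss = criterion(model(x), labels)           # loss between prediction and label
loss.backward()
optimizer.step()                             # one Adam update
```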
9. Visual analysis, which comprises the following steps:
to further verify the effectiveness of the proposed algorithm, a visual experimental analysis was carried out. Fig. 4 shows visualization results of the proposed method on the CUB-200-2011 dataset and the self-built ancient pagoda dataset. Compared with the original baseline model, the method focuses not only on the most salient region but also on other overlooked key points, thanks to the joint action of the cross-layer attention interaction module CAIM and the random selection strategy RSS, which force the network to comprehensively mine multiple distinct discriminative features.
10. Comparison method verification
The invention adopts an excellent feature extraction backbone network, constructs a strong initial feature representation, and effectively improves algorithm accuracy; it establishes a feature interaction method that strengthens the association between multi-level attention maps and enhances the mining of semantically complementary information; it proposes a random selection strategy with three operations (most significant suppression, feature enhancement and no operation), whose random execution forces attention toward more comprehensive local information and increases the network's attention to more local features; and it conducts thorough experiments on multiple data sets, demonstrating the effectiveness of the method.
In the invention, a comparative experiment was carried out on the CUB-200-2011 dataset. Since input image resolution and the choice of backbone network are important factors affecting the classification accuracy of fine-grained algorithms, Table 1 lists the input image size and backbone network adopted by each comparison method. As the results in Table 1 show, the proposed method achieves an excellent accuracy of 92.3% on this dataset, far higher than Part-based R-CNNs, PoseNorm, KERL, Mask-CNN and BCNN. FBSD and API-Net lead the other CNN-based classification algorithms with 89.8% and 90.0%, but both fall behind the proposed method. Transformer-based methods such as RAMS-Trans, TPSKG, AFTrans, FFVT and TransFG perform strongly; the proposed method, taking ConvNeXt as the backbone network and combining the cross-layer attention interaction module with the random selection strategy, achieves still better performance.
Table 1: comparison of accuracy of different algorithms on CUB-200-2011 (bird 1) data set
(Table 1 data are rendered as an image in the source document.)
NABirds contains 48,562 bird images in 555 categories, each with around 100 images; it is a much larger fine-grained dataset than CUB-200-2011 and more challenging for fine-grained tasks. As Table 2 shows, the proposed method achieves 91.5% accuracy on this dataset, higher than PC-CNN, MaxEnt, Cross-X, HGNet and GaRD, and 1.4 and 0.7 percentage points higher than the strong TPSKG and TransFG, respectively. The method thus retains good performance on larger datasets.
Table 2: comparison of accuracy of different algorithms on NABirds (birds 2) datasets
(Table 2 data are rendered as an image in the source document.)
ISIA Food-200 contains 200 food categories, about 200,000 food images and 319 ingredients; it is more diverse and far larger than the two datasets above, so only a few methods have attempted experiments on it. As Table 3 shows, IG-CMAN and TPSKG achieve 67.5% and 69.5% accuracy on this dataset, respectively. The proposed method, which realizes diversity of multi-scale local features, reaches 72.8%, exceeding IG-CMAN and TPSKG by 5.3 and 3.3 percentage points, respectively.
Table 3: comparison of accuracy of different algorithms on the ISIA Food-200 dataset
(Table 3 data are rendered as an image in the source document.)
For the ancient pagoda dataset, the invention collected 13,815 pagoda images in 229 categories, of which 6,915 images were used for training and 6,900 for validation. On this dataset, experiments were first carried out with several backbone networks commonly used for fine-grained tasks; as Table 4 shows, ResNet50, ResNet101, DenseNet161 and ViT reach accuracies of 92.3%, 93.4%, 93.6% and 93.8%, respectively. Based on the three dataset experiments above, several excellent algorithms were selected for comparative experiments on this dataset, specifically FBSD, PMG, FFVT and TransFG, with results shown in Table 4. The proposed method achieves 95.2% accuracy on the self-built ancient pagoda dataset, outperforming the other algorithms.
Table 4: comparison of accuracy of different algorithms on a pyramid data set
(Table 4 data are rendered as an image in the source document.)
The invention also provides a local-diversity-guided weakly supervised fine-grained image classification system, which comprises:
a classification network construction module, used for constructing a local-diversity-guided weakly supervised fine-grained image classification network, the classification network comprising a basic backbone network ConvNeXt, a cross-layer attention interaction module CAIM and a bilinear pooling module BP connected after Layer3 and Layer4 of the basic backbone network ConvNeXt, with a random selection strategy RSS connected between the cross-layer attention interaction module CAIM and the bilinear pooling module BP;
a classification network training module, used for training the local-diversity-guided weakly supervised fine-grained image classification network to obtain a local-diversity-guided weakly supervised fine-grained image classification network model;
and an image classification module, used for feeding the preprocessed training data set into the local-diversity-guided weakly supervised fine-grained image classification network model to obtain an image classification result.

Claims (10)

1. A local-diversity-guided weakly supervised fine-grained image classification method, characterized by comprising the following specific steps:
S1, constructing a local-diversity-guided weakly supervised fine-grained image classification network, wherein the classification network comprises a basic backbone network ConvNeXt, a cross-layer attention interaction module CAIM and a bilinear pooling module BP connected after Layer3 and Layer4 of the basic backbone network ConvNeXt, with a random selection strategy RSS connected between the cross-layer attention interaction module CAIM and the bilinear pooling module BP;
S2, training the local-diversity-guided weakly supervised fine-grained image classification network to obtain a local-diversity-guided weakly supervised fine-grained image classification network model;
and S3, feeding the preprocessed training data set into the local-diversity-guided weakly supervised fine-grained image classification network model to obtain an image classification result.
2. The local-diversity-guided weakly supervised fine-grained image classification method according to claim 1, wherein the basic backbone network ConvNeXt comprises 4 layers: Layer1 preprocesses the input image; Layer2-Layer4 are each composed of multiple ConvNeXt blocks; and Layer3 and Layer4 of the ConvNeXt backbone output initial feature maps D ∈ R^(C×W×H) at different scales, where C, W and H denote the number of channels, width and height of the feature map.
3. The local-diversity-guided weakly supervised fine-grained image classification method according to claim 1, characterized in that the cross-layer attention interaction module CAIM is constructed by the following specific steps:
1) The initial feature maps D ∈ R^(C×W×H) of different scales obtained from Layer3 and Layer4 of the backbone network are each passed through a 3×3 convolution to obtain the multi-level attention maps A_1 and A_2;
2) The multi-level attention maps A_1 and A_2 are subjected to spatial interaction and channel interaction, obtaining the feature maps A_s12 and A_s21 from the completed spatial interaction and the feature maps A_c12 and A_c21 from the completed channel interaction;
3) A_s12 from the spatial interaction is combined with A_c12 from the channel interaction to obtain the diversified feature map A_12, and A_s21 from the spatial interaction is combined with A_c21 from the channel interaction to obtain the diversified feature map A_21.
4. The method of claim 3, wherein the spatial interaction comprises:
1) The multi-level attention maps A_1 ∈ R^(C′×W_1×H_1) and A_2 ∈ R^(C′×W_2×H_2) are size-converted to obtain the feature maps A′_1 ∈ R^(C′×L_1) and A′_2 ∈ R^(C′×L_2), where L_1 = W_1 × H_1 and L_2 = W_2 × H_2;
2) An inner-product operation on the feature maps A′_1 and A′_2 yields the spatial similarity matrix W_1, which is normalized to obtain the spatial interaction feature map M_1:
M_1 = softmax(W_1) ∈ [0, 1]^(L_1×L_2), W_1 = A′^T_1 × A′_2
3) The spatial interaction feature map M_1 is mapped back onto the feature maps A′_1 and A′_2 to obtain the feature maps A′_s12 and A′_s21:
A′_s12 = M^T_1 × A′^T_1 + A′^T_2
A′_s21 = M_1 × A′^T_2 + A′^T_1
4) The feature maps A′_s12 and A′_s21 are size-converted to obtain the feature maps A_s12 and A_s21 reflecting the spatial dependence.
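A plausible PyTorch reconstruction of the spatial interaction above. The exact placement of transposes in the source text is garbled by extraction, so `spatial_interaction` below is a dimension-consistent sketch, not the patent's authoritative formulation:

```python
import torch

def spatial_interaction(A1, A2):
    """Spatial interaction sketch: flatten both attention maps, build a
    spatial-similarity matrix, softmax-normalise it, and map it back onto
    each feature map as a residual."""
    C = A1.shape[0]
    F1 = A1.reshape(C, -1)              # A'_1, (C, L1)
    F2 = A2.reshape(C, -1)              # A'_2, (C, L2)
    W1 = F1.t() @ F2                    # spatial similarity matrix, (L1, L2)
    M1 = torch.softmax(W1, dim=-1)      # spatial interaction feature map
    s12 = M1.t() @ F1.t() + F2.t()      # (L2, C): A1's cues mapped onto A2's grid
    s21 = M1 @ F2.t() + F1.t()          # (L1, C): A2's cues mapped onto A1's grid
    # size conversion back to the original spatial layouts
    return s12.t().reshape_as(A2), s21.t().reshape_as(A1)
```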
5. The method of claim 3, wherein the channel interaction comprises:
1) The multi-level attention map A_2 is down-sampled and size-converted to obtain the feature map A′_2d ∈ R^(C′×L_1), and the multi-level attention map A_1 is size-converted to obtain the feature map A′_1; A′_1 and A′_2d are taken as a map pair;
2) An inner-product operation on A′_1 and A′_2d yields the channel similarity matrix W_2, through which the similarity of the map pair is evaluated; -W^T_2 is normalized to obtain the channel interaction feature map M_2:
M_2 = softmax(-W^T_2) ∈ [0, 1]^(C′×C′), W_2 = A′_1 × A′^T_2d
3) The channel interaction feature map M_2 is mapped back onto the feature maps A′_1 and A′_2d to obtain the feature maps A′_c12 and A′_c21:
A′_c12 = M_2 × A′_1 + A′_2d
A′_c21 = M^T_2 × A′_2d + A′_1
4) The feature map A′_c12 is size-converted to obtain A_c12, and the feature map A′_c21 is up-sampled and size-converted to obtain A_c21.
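A plausible PyTorch reconstruction of the channel interaction above; the resampling of A_2 and the placement of transposes are dimension-consistent assumptions rather than the patent's exact formulation. Note the negated similarity matrix, which gives higher weight to complementary (dissimilar) channels:

```python
import torch
import torch.nn.functional as nnf

def channel_interaction(A1, A2):
    """Channel interaction sketch: resample A2 to A1's spatial size, build a
    C x C channel-similarity matrix, negate it before the softmax, and map
    the result back onto each flattened map as a residual."""
    C, H1, W1 = A1.shape
    A2d = nnf.adaptive_avg_pool2d(A2.unsqueeze(0), (H1, W1)).squeeze(0)  # A'_2d on A1's grid
    F1 = A1.reshape(C, -1)                  # A'_1, (C, L1)
    F2 = A2d.reshape(C, -1)                 # A'_2d, (C, L1)
    W2 = F1 @ F2.t()                        # channel similarity matrix, (C, C)
    M2 = torch.softmax(-W2.t(), dim=-1)     # negative sign favours complementary channels
    c12 = (M2 @ F1 + F2).reshape(C, H1, W1)
    c21 = (M2.t() @ F2 + F1).reshape(C, H1, W1)
    return c12, c21
```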
6. The local-diversity-guided weakly supervised fine-grained image classification method according to claim 1, wherein the construction of the random selection strategy RSS comprises the following steps:
1) The feature map F ∈ R^(C′×W×H) output by the cross-layer attention interaction module CAIM is sliced into n equal parts along the width dimension, obtaining n feature slices F^(k) ∈ R^(C′×(W/n)×H), k ∈ [1, n];
2) For each feature slice F^(k), the most significant suppression branch, the feature enhancement branch or the no-operation branch is randomly selected for data processing, obtaining the selection feature S^(k):
S^(k) = suppress(F^(k)) with probability α; enhance(F^(k)) with probability β; F^(k) (no operation) with probability γ
wherein the probabilities of selecting the most significant suppression branch, the feature enhancement branch and the no-operation branch are α, β and γ, respectively, with α + β + γ = 1;
3) The processed selection features S^(k) are spliced along the width dimension in which the slicing was performed, obtaining the aggregated attention map A ∈ R^(C′×W×H):
A = concat(S^(k)).
7. The method of claim 6, wherein the most significant suppression branch comprises:
1) A channel-average pooling operation is applied to the feature slice F^(k) ∈ R^(C′×(W/n)×H) to obtain F_p(k) ∈ R^((W/n)×H):
F_p(k) = (1/C′) · Σ_{c=1}^{C′} F^(k)(c)
2) A threshold is set from the maximum pixel value via the threshold rate δ, generating the elimination mask P_(k)drop: pixels larger than the threshold are set to 0, and pixels smaller than the threshold are set to 1:
P_(k)drop(i, j) = 0, if F_p(k)(i, j) > δ · max(F_p(k)); 1, otherwise
3) The elimination mask P_(k)drop is applied to the feature slice F^(k) in an element-wise manner to obtain the elimination feature map;
the feature enhancement branch comprises:
1) From the feature slice F^(k), an enhancement mask P_(k)important is generated using the sigmoid activation function:
P_(k)important = sigmoid(F_p(k)) ∈ [0, 1]^((W/n)×H)
2) The enhancement mask P_(k)important is applied to the slice F^(k) in an element-wise manner to obtain the enhanced feature map;
and the no-operation branch leaves the feature slice F^(k) unprocessed.
8. The local-diversity-guided weakly supervised fine-grained image classification method according to claim 1, wherein the outputs of the random selection strategy RSS and of the basic backbone network ConvNeXt are processed by the bilinear pooling module BP, the specific steps comprising:
1) Acquiring the initial feature map D output by the basic backbone network ConvNeXt and the aggregated attention map A ∈ R^(C′×H×W) output by the random selection strategy RSS;
2) Multiplying the aggregated attention map A with the initial feature map D element-wise to obtain the part feature maps D_k:
D_k = A_k ⊙ D (k = 1, 2, ..., C′)
3) Processing the part feature map D_k with global maximum pooling GMP to obtain the attention feature of each specific part, d_k ∈ R^(1×C):
d_k = GMP(D_k)
4) Stacking the attention features d_k into the part feature matrix P ∈ R^(M×C):
P = (d_1, d_2, ..., d_M)^T.
9. The local-diversity-guided weakly supervised fine-grained image classification method according to claim 8, wherein Layer3 and Layer4 of the basic backbone network ConvNeXt each yield a part feature matrix P ∈ R^(M×C); the two part feature matrices are spliced along the dimension axis to obtain the feature map y, y is mapped into a one-dimensional feature vector, and the final classification of the image is realized by softmax logistic regression in the form of a fully connected layer.
10. A local-diversity-guided weakly supervised fine-grained image classification system, characterized by comprising:
a classification network construction module, used for constructing a local-diversity-guided weakly supervised fine-grained image classification network, the classification network comprising a basic backbone network ConvNeXt, a cross-layer attention interaction module CAIM and a bilinear pooling module BP connected after Layer3 and Layer4 of the basic backbone network ConvNeXt, with a random selection strategy RSS connected between the cross-layer attention interaction module CAIM and the bilinear pooling module BP;
a classification network training module, used for training the local-diversity-guided weakly supervised fine-grained image classification network to obtain a local-diversity-guided weakly supervised fine-grained image classification network model;
and an image classification module, used for feeding the preprocessed training data set into the local-diversity-guided weakly supervised fine-grained image classification network model to obtain an image classification result.
CN202310077544.7A 2023-01-29 2023-01-29 Local diversity guided weak supervision fine-grained image classification method and system Pending CN115984627A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310077544.7A CN115984627A (en) 2023-01-29 2023-01-29 Local diversity guided weak supervision fine-grained image classification method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310077544.7A CN115984627A (en) 2023-01-29 2023-01-29 Local diversity guided weak supervision fine-grained image classification method and system

Publications (1)

Publication Number Publication Date
CN115984627A true CN115984627A (en) 2023-04-18

Family

ID=85963113

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310077544.7A Pending CN115984627A (en) 2023-01-29 2023-01-29 Local diversity guided weak supervision fine-grained image classification method and system

Country Status (1)

Country Link
CN (1) CN115984627A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117853875A (en) * 2024-03-04 2024-04-09 华东交通大学 Fine-granularity image recognition method and system
CN117853875B (en) * 2024-03-04 2024-05-14 华东交通大学 Fine-granularity image recognition method and system
CN117911798A (en) * 2024-03-19 2024-04-19 青岛奥克生物开发有限公司 Stem cell quality classification method and system based on image enhancement
CN117911798B (en) * 2024-03-19 2024-05-28 青岛奥克生物开发有限公司 Stem cell quality classification method and system based on image enhancement

Similar Documents

Publication Publication Date Title
CN115984627A (en) Local diversity guided weak supervision fine-grained image classification method and system
US8503792B2 (en) Patch description and modeling for image subscene recognition
CN110414554A (en) One kind being based on the improved Stacking integrated study fish identification method of multi-model
CN106023065A (en) Tensor hyperspectral image spectrum-space dimensionality reduction method based on deep convolutional neural network
CN104346620A (en) Inputted image pixel classification method and device, and image processing system
Badawi et al. A hybrid memetic algorithm (genetic algorithm and great deluge local search) with back-propagation classifier for fish recognition
Lodhi et al. Multipath-DenseNet: A Supervised ensemble architecture of densely connected convolutional networks
Marburg et al. Deep learning for benthic fauna identification
CN110674685B (en) Human body analysis segmentation model and method based on edge information enhancement
CN111126401B (en) License plate character recognition method based on context information
CN113034506B (en) Remote sensing image semantic segmentation method and device, computer equipment and storage medium
Nguyen et al. Satellite image classification using convolutional learning
Zhang et al. MATNet: A combining multi-attention and transformer network for hyperspectral image classification
CN112733912B (en) Fine granularity image recognition method based on multi-granularity countering loss
CN111062438B (en) Image propagation weak supervision fine granularity image classification algorithm based on correlation learning
CN116563680B (en) Remote sensing image feature fusion method based on Gaussian mixture model and electronic equipment
CN111325766A (en) Three-dimensional edge detection method and device, storage medium and computer equipment
CN115641473A (en) Remote sensing image classification method based on CNN-self-attention mechanism hybrid architecture
Hassam et al. A single stream modified mobilenet v2 and whale controlled entropy based optimization framework for citrus fruit diseases recognition
CN113496221B (en) Point supervision remote sensing image semantic segmentation method and system based on depth bilateral filtering
CN112149526A (en) Lane line detection method and system based on long-distance information fusion
CN117173702A (en) Multi-view multi-mark learning method based on depth feature map fusion
Luan et al. Sunflower seed sorting based on convolutional neural network
Nie et al. Learning enhanced features and inferring twice for fine-grained image classification
Tian et al. Recognition of geological legends on a geological profile via an improved deep learning method with augmented data using transfer learning strategies

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination