CN115984627A - Local diversity guided weak supervision fine-grained image classification method and system - Google Patents


Info

Publication number
CN115984627A
CN115984627A (application CN202310077544.7A)
Authority
CN
China
Prior art keywords
feature
guided
image classification
diversified
attention
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310077544.7A
Other languages
Chinese (zh)
Inventor
刘光辉
占华
孟月波
段中兴
王博
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xian University of Architecture and Technology
Original Assignee
Xian University of Architecture and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xian University of Architecture and Technology filed Critical Xian University of Architecture and Technology
Priority to CN202310077544.7A priority Critical patent/CN115984627A/en
Publication of CN115984627A publication Critical patent/CN115984627A/en
Pending legal-status Critical Current

Classifications

    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Image Analysis (AREA)

Abstract

The invention provides a local diversity-guided weak supervision fine-grained image classification method and system. A local diversity-guided weak supervision fine-grained image classification network is constructed, comprising a basic backbone network together with a cross-layer attention interaction module and a bilinear pooling module connected behind Layer3 and Layer4 of the backbone, with a random selection strategy connected between the cross-layer attention interaction module and the bilinear pooling module. The network is trained to obtain a local diversity-guided weak supervision fine-grained image classification network model, and the preprocessed training data set is sent into the model to obtain image classification results. The invention solves the problem of inaccurate classification results caused by distinguishable features in existing fine-grained image classification tasks being too fine to capture and by the lack of effective use of local information.

Description

Local diversity guided weak supervision fine-grained image classification method and system
Technical Field
The invention belongs to the technical field of electronic information, and in particular relates to a local diversity-guided weak supervision fine-grained image classification method and system.
Background
Fine-grained image classification, also called sub-category image classification, aims to divide images belonging to the same basic category (automobiles, dogs, flowers, birds and the like) into finer sub-categories. It has broad business demands and application scenarios in industry and daily life: it can be used for the identification and study of animals and plants in field environments, providing an important technical basis for biology; it can be applied to visual tasks such as clothing detection and clothing recognition; it can be deployed for automatic checkout services in retail settings; and it can quickly and accurately identify vehicles travelling at high speed. Compared with the common image classification task, fine-grained image classification is more difficult because of the subtle inter-class differences and larger intra-class differences among sub-categories.
In terms of image recognition algorithms, methods can be divided into strongly supervised and weakly supervised directions: strongly supervised methods require annotation boxes, part annotation points and the like in addition to category labels, whereas weakly supervised methods complete model training using only category labels. For fine-grained tasks, enabling the neural network to locate distinguishable parts and learn distinguishable features is the key to solving the fine-grained problem, and strongly supervised methods were often adopted in early work. The strongly supervised approach depends excessively on additional manual annotation information, such as annotated bounding boxes and part annotations, which is expensive to obtain, limiting its practicality. With the development of deep learning and the deepening of related research, weakly supervised approaches that complete classification relying only on class labels can achieve good classification performance without additional manual annotation. Weakly supervised classification algorithms not only reduce extra annotation cost but also better meet the needs of practical application, and are the development trend of current research.
The attention mechanism helps increase the focus on discriminative parts and thereby improves classification accuracy. However, although an attention mechanism can guide the model to focus on discriminative parts, such methods usually attend to only a few salient parts, do not comprehensively explore potential discriminative parts, and treat each feature in isolation, so classification results are inaccurate when images from the same application scenario are classified at fine granularity.
Disclosure of Invention
In order to solve the problems in the prior art, the invention provides a local diversity-guided weak supervision fine-grained image classification method and system. By constructing a local diversity-guided weak supervision fine-grained classification network model, it solves the problem of inaccurate classification results caused by distinguishable features in existing fine-grained image classification tasks being too fine to capture and by the lack of effective use of local information.
In order to achieve the purpose, the invention provides the following technical scheme: a local diversity-guided weak supervision fine-grained image classification method specifically comprises the following steps:
S1, constructing a local diversity-guided weak supervision fine-grained image classification network, wherein the classification network comprises a basic backbone network ConvNeXt, a cross-layer attention interaction module CAIM and a bilinear pooling module BP connected behind Layer3 and Layer4 of the backbone ConvNeXt, with a random selection strategy RSS connected between the cross-layer attention interaction module CAIM and the bilinear pooling module BP;
S2, training the local diversity-guided weak supervision fine-grained image classification network to obtain a local diversity-guided weak supervision fine-grained image classification network model;
S3, sending the preprocessed training data set into the local diversity-guided weak supervision fine-grained image classification network model to obtain an image classification result.
Further, the basic backbone network ConvNeXt includes 4 layers, where Layer1 preprocesses the input image and Layer2-Layer4 each consist of multiple ConvNeXt blocks. Layer3 and Layer4 of the backbone ConvNeXt output initial feature maps D ∈ R^(C×W×H) at different scales, where C, W and H denote the number of channels, width and height of the feature map.
Further, a cross-layer attention interaction module CAIM is constructed, and the method specifically comprises the following steps:
1) Apply 3×3 convolution to the initial feature maps D ∈ R^(C×W×H) obtained from Layer3 and Layer4 of the backbone to produce multi-level attention maps A_1 ∈ R^(C'×W_1×H_1) and A_2 ∈ R^(C'×W_2×H_2);
2) Perform spatial interaction and channel interaction on the multi-level attention maps A_1 and A_2 to obtain the spatially interacted feature maps A_s12, A_s21 and the channel-interacted feature maps A_c12, A_c21;
3) Combine A_s12 with A_c12 to obtain the diversified feature map A_12, and combine A_s21 with A_c21 to obtain the diversified feature map A_21.
Further, the spatial interaction includes:
1) Reshape the multi-level attention maps A_1 ∈ R^(C'×W_1×H_1) and A_2 ∈ R^(C'×W_2×H_2) into feature maps A'_1 ∈ R^(C'×L_1) and A'_2 ∈ R^(C'×L_2), where L_1 = W_1×H_1 and L_2 = W_2×H_2;
2) Take the inner product of A'_1 and A'_2 to obtain the spatial similarity matrix W_1, and normalize -W_1 to obtain the spatial interaction feature map M_1:
M_1 = softmax(-W_1) ∈ [0,1]^(L_1×L_2), W_1 = A'_1^T × A'_2;
3) Map the spatial interaction feature map M_1 onto A'_1 and A'_2 to obtain A'_s12 and A'_s21:
A'_s12 = M_1^T × A'_1^T + A'_2^T
A'_s21 = M_1 × A'_2^T + A'_1^T;
4) Reshape A'_s12 and A'_s21 to obtain feature maps A_s12 ∈ R^(C'×W_2×H_2) and A_s21 ∈ R^(C'×W_1×H_1) that reflect spatial dependencies.
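The spatial interaction can be sketched in a few lines of NumPy. This is an illustrative reconstruction, not the patent's implementation: the exact placement of the transposes and the softmax axis are ambiguous in the published text and are chosen here for dimensional consistency.

```python
import numpy as np

def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def spatial_interaction(a1, a2):
    """Spatial interaction between two multi-level attention maps.

    a1: (C', W1, H1), a2: (C', W2, H2).
    Returns A_s12 of shape (C', W2, H2) and A_s21 of shape (C', W1, H1).
    """
    c, w1, h1 = a1.shape
    _, w2, h2 = a2.shape
    a1f = a1.reshape(c, w1 * h1)      # A'_1 in R^(C' x L1)
    a2f = a2.reshape(c, w2 * h2)      # A'_2 in R^(C' x L2)
    w = a1f.T @ a2f                   # spatial similarity W_1: (L1, L2)
    m = softmax(-w, axis=1)           # M_1 = softmax(-W_1): low similarity -> high weight
    s12 = m.T @ a1f.T + a2f.T         # (L2, C')
    s21 = m @ a2f.T + a1f.T           # (L1, C')
    return s12.T.reshape(c, w2, h2), s21.T.reshape(c, w1, h1)
```

Negating the similarity before the softmax is what makes low-similarity (highly complementary) pixel pairs receive the largest interaction weights.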
Further, the channel interaction comprises:
1) Downsample and reshape the multi-level attention map A_2 to obtain the feature map A'_2d ∈ R^(C'×L_1); reshape the multi-level attention map A_1 to obtain A'_1, and treat A'_1 and A'_2d as an image pair;
2) Take the inner product of A'_1 and A'_2d to obtain the channel similarity matrix W_2, evaluate the similarity of the image pair through W_2, and normalize -W_2^T to obtain the channel interaction feature map M_2:
M_2 = softmax(-W_2^T) ∈ [0,1]^(C'×C'), W_2 = A'_1 × A'_2d^T;
3) Map the channel interaction feature map M_2 onto A'_1 and A'_2d to obtain A'_c12 and A'_c21:
A'_c12 = M_2 × A'_1 + A'_2d
A'_c21 = M_2^T × A'_2d + A'_1;
4) Reshape A'_c12 to obtain A_c12; apply the above process once more to A'_c21 and reshape to obtain A_c21.
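A channel-interaction sketch under the same caveats: the published transposes are garbled by extraction, so W_2 = A'_1 × A'_2d^T is assumed here so that M_2 has the C'×C' shape the text states.

```python
import numpy as np

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def channel_interaction(a1, a2d):
    """Channel interaction between reshaped attention maps.

    a1:  A'_1  of shape (C', L1).
    a2d: A'_2d of shape (C', L1), i.e. A_2 after downsampling and reshaping.
    Returns A'_c12 and A'_c21, both (C', L1).
    """
    w2 = a1 @ a2d.T                   # channel similarity W_2: (C', C')
    m2 = softmax(-w2.T, axis=1)       # M_2 in [0,1]^(C' x C')
    c12 = m2 @ a1 + a2d               # complementary channel info mapped onto A'_1
    c21 = m2.T @ a2d + a1             # and onto A'_2d
    return c12, c21
```

Each attention channel acts as a part descriptor, so this mixes complementary information across channels rather than across spatial positions.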
Further, the method for constructing the random selection strategy RSS comprises the following specific steps:
1) Slice the feature map F ∈ R^(C'×W×H) output by the cross-layer attention interaction module CAIM into n equal parts along the width dimension to obtain n feature slices F^(k) ∈ R^(C'×(W/n)×H), k ∈ [1, n];
2) For each feature slice F^(k), randomly select the most significant inhibition branch, the feature enhancement branch or the no-operation branch to process the data and obtain S^(k), where the probabilities of selecting the three branches are α, β and γ respectively, with α + β + γ = 1;
3) Concatenate the processed feature slices S^(k) along the width dimension in the order they were cut to obtain the aggregated attention map A ∈ R^(C'×W×H):
A = concat(S^(k)).
Further, the most significant inhibition branch includes:
1) Perform channel-average pooling on the feature slice F^(k) to obtain F_p(k) ∈ R^((W/n)×H):
F_p(k) = (1/C') Σ_{c=1}^{C'} F^(k)_c;
2) Generate an erasure mask P_(k)drop by setting a threshold rate δ on the maximum pixel intensity of F_p(k): pixels greater than the threshold are set to 0, and pixels smaller than the threshold are set to 1:
P_(k)drop(i, j) = 0 if F_p(k)(i, j) > δ · max(F_p(k)), else 1;
3) Apply the erasure mask P_(k)drop to the feature slice F^(k) point by point to obtain the suppressed feature map.
The feature enhancement branch includes:
1) Generate an enhancement mask P_(k)important from the feature slice F^(k) using a sigmoid activation function:
P_(k)important = sigmoid(F_p(k)) ∈ [0,1]^((W/n)×H);
2) Apply the enhancement mask P_(k)important to the slice F^(k) point by point to obtain the enhanced feature map.
The no-operation branch leaves the feature slice F^(k) unprocessed.
Further, the bilinear pooling module BP fuses the output of the random selection strategy RSS with the output of the basic backbone network ConvNeXt, with the following specific steps:
1) Obtain the initial feature map D output by the backbone ConvNeXt and the aggregated attention map A ∈ R^(C'×H×W) output by the random selection strategy RSS;
2) Multiply the aggregated attention map A with the initial feature map D by element-wise dot product to obtain the part feature maps D_k:
D_k = A_k ⊙ D (k = 1, 2, ..., C');
3) Process each part feature map D_k with global max pooling GMP to obtain the attention feature d_k ∈ R^(1×C) of each specific part:
d_k = GMP(D_k);
4) Stack the attention features d_k into the part feature matrix P ∈ R^(M×C):
P = (d_1, d_2, ..., d_M)^T.
Further, Layer3 and Layer4 of the basic backbone network ConvNeXt each yield a part feature matrix P ∈ R^(M×C); the two part feature matrices are spliced along the feature dimension to obtain a feature map y, which is mapped into a one-dimensional feature vector, and the final classification of the image is realized by softmax logistic regression in a fully connected layer.
The invention also provides a local diversity-guided weak supervision fine-grained image classification system, which comprises:
a classification network construction module, used for constructing a local diversity-guided weak supervision fine-grained image classification network, wherein the classification network comprises a basic backbone network ConvNeXt, a cross-layer attention interaction module CAIM and a bilinear pooling module BP connected behind Layer3 and Layer4 of the backbone ConvNeXt, with a random selection strategy RSS connected between the cross-layer attention interaction module CAIM and the bilinear pooling module BP;
a classification network training module, used for training the local diversity-guided weak supervision fine-grained image classification network to obtain a local diversity-guided weak supervision fine-grained image classification network model;
and an image classification module, used for sending the preprocessed training data set into the local diversity-guided weak supervision fine-grained image classification network model to obtain an image classification result.
Compared with the prior art, the invention has at least the following beneficial effects:
the invention provides a local diversity-guided weak supervision fine-grained image classification method, which comprises the steps of firstly, taking ConvNeXt as a backbone network to extract multi-level initial fine-grained characteristics to obtain a characteristic diagram; then, designing a cross-layer attention interaction module (CAIM), representing each part of the object by adopting different-level attention diagrams, establishing a multi-level attention diagram, promoting semantic expression of the attention diagram by adopting a channel interaction and space interaction mode, and sharing the mined information; then, in order to avoid feature assimilation, a Random Selection Strategy (RSS) is provided, attention is promoted to try to capture more local information with discriminability in a random selection mode, and the network is further promoted to obtain richer local features; and finally, fusing the attention diagram and the feature diagram by adopting a bilinear pooling module to construct a feature representation with strong capability to enhance the network fitting capability, and completing classification tasks through a full connection layer, thereby solving the problems that distinguishable features in a fine-grained image classification task are too fine and difficult to capture, local information is lack of effective utilization and the like, and improving the classification accuracy of images belonging to the same basic category.
Drawings
FIG. 1 is a diagram of a fine-grained image classification network architecture with local diversity guidance;
FIG. 2 is a block diagram of a cross-layer attention interaction module;
FIG. 3 is a diagram of a random selection strategy architecture;
FIG. 4 is a visualization effect diagram.
Detailed Description
The invention is further described with reference to the following figures and detailed description.
The weak supervision fine-grained image classification method guided by local diversity of the invention has the following processes:
1. Download the CUB-200-2011 (birds 1), NABirds (birds 2), ISIA Food-200 (food) and self-built ancient tower datasets, screen the dataset images to ensure the integrity of the data images, and preprocess them. In the preprocessing stage, a data augmentation method randomly applies operations such as cropping, rotation and scaling to the sample images, expanding the number of dataset samples and enhancing the robustness of the CNN model.
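The augmentation step can be sketched with NumPy alone; the crop size of 224 and the flip probability of 0.5 are illustrative assumptions rather than values from the patent, and the rotation and scaling operations it mentions are omitted for brevity.

```python
import numpy as np

def augment(img, rng, out_size=224):
    """Random crop plus random horizontal flip on an H x W x 3 image array."""
    h, w, _ = img.shape
    top = rng.integers(0, h - out_size + 1)           # random crop origin
    left = rng.integers(0, w - out_size + 1)
    crop = img[top:top + out_size, left:left + out_size]
    if rng.random() < 0.5:                            # random horizontal flip
        crop = crop[:, ::-1]
    return crop
```

In practice this work is usually delegated to a data-loading library; the sketch only pins down the semantics of "randomly crop and flip to expand the sample set".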
2. Construct the local diversity-guided weak supervision fine-grained image classification network. The specific steps are as follows:
the structure of the convolutional neural network based on local diversity guidance as shown in fig. 1 includes: basic backbone network (ConvNeXt), cross-layer Attention Interaction Module (CAIM), random Selection Strategy (RSS), bilinear Pooling Module (BP), and Cross-layer splicing classification operations. The design of the cross-Layer attention interaction module CAIM is to perform feature interaction modeling on the output of the Layer3 and the Layer4 of the main network, output a diversified feature map after the feature map passes through the CAIM, further output a part feature matrix by adopting a random selection strategy RSS for the output diversified feature map, and finally splice the two part feature matrices and input the spliced part feature matrices into a classifier to complete classification tasks.
First, the basic backbone network (ConvNeXt) is built; it consists mainly of 4 layers: Layer1 has a simple structure and can be regarded as preprocessing of the input image, while Layer2-Layer4 are all composed of multiple ConvNeXt blocks and have similar structures. The input image passes through the different layers to obtain initial feature maps at different scales, each denoted D ∈ R^(C×W×H), where C, W and H denote the number of channels, width and height of the feature map.
Secondly, for a fine-grained task, capturing more discriminative local information is the key to classification performance. Accordingly, the invention generates multi-level attention maps A_1 ∈ R^(C'×W_1×H_1) and A_2 ∈ R^(C'×W_2×H_2) by applying 3×3 convolutions to the outputs of backbone layers Layer3 and Layer4. The two multi-level attention maps can express specific parts of the target object; compared with other attention models, this makes it easier to locate most object parts and thus obtain better local feature representations. The number of channels C' is set to 16 according to engineering experience.
Meanwhile, if the part feature representations were output directly after being obtained, the model's ability to compare cues across different features would be limited; different global features should therefore not be treated in isolation.
In addition, in the cross-layer attention interaction module, if unconstrained, attention tends to assimilate: it focuses on the same salient region of the target object and ignores other secondary salient regions with equal discriminative power. The invention therefore proposes a random selection strategy with three candidate branches: a most significant inhibition branch that suppresses the most discriminative part, a feature enhancement branch that rewards the most discriminative part, and a no-operation branch that, as the name implies, takes no action. Executed at random, the three candidate branches balance punishment and reward without discarding the original features, prompting attention to focus on more effective local information at minimal cost.
After the random selection strategy is completed, each channel in the attention map can represent a different part feature in the image; if it can be effectively fused with the original feature map, a stronger feature representation can be constructed. In view of this, the invention uses the bilinear pooling module BP to fuse the aggregated attention map A output by the random selection strategy with the initial feature map D.
In the classifier part, the fused and spliced feature map is mapped into a one-dimensional feature vector, and the final classification of the image is realized by softmax logistic regression in a fully connected layer.
3. The basic backbone network is executed, and the specific steps comprise:
Send the images in the preprocessed training data set into ConvNeXt to generate initial feature maps at different scales, denoted D ∈ R^(C×W×H), where C, W and H denote the number of channels, width and height of the feature map.
4. Construct the cross-layer attention interaction module CAIM.
Fig. 2 is a structural diagram of a cross-layer attention interaction module according to the present invention, and as shown in fig. 2, the cross-layer attention interaction module is mainly divided into a channel interaction and a spatial interaction modeling.
In the spatial interaction part, first, the initial feature maps D ∈ R^(C×W×H) at different scales obtained from backbone layers Layer3 and Layer4 are passed through 3×3 convolutions to produce the multi-level attention maps A_1 ∈ R^(C'×W_1×H_1) and A_2 ∈ R^(C'×W_2×H_2); these are reconstructed as an image pair and reshaped to A'_1 ∈ R^(C'×L_1) and A'_2 ∈ R^(C'×L_2), where L_1 = W_1×H_1 and L_2 = W_2×H_2.
Then, image similarity evaluation is performed: the inner product of A'_1 and A'_2 gives the spatial similarity matrix W_1, whose element w_ij represents the similarity between the i-th pixel of A'_1 and the j-th pixel of A'_2. The lower the similarity of two pixels, the greater the complementarity between them; therefore -W_1 is normalized to express the interaction relation of the image pair, giving the spatial interaction feature map M_1:
M_1 = softmax(-W_1) ∈ [0,1]^(L_1×L_2), W_1 = A'_1^T × A'_2.
Next, the spatial interaction feature map M_1 is mapped onto A'_1 and A'_2 to obtain A'_s12 and A'_s21:
A'_s12 = M_1^T × A'_1^T + A'_2^T
A'_s21 = M_1 × A'_2^T + A'_1^T.
Finally, A'_s12 and A'_s21 are reshaped into feature maps reflecting spatial dependencies, from which more discriminative features are learned, outputting A_s12 ∈ R^(C'×W_2×H_2) and A_s21 ∈ R^(C'×W_1×H_1).
In the channel interaction part, each channel in the attention diagram can be regarded as a feature representation of a specific part, and if a correlation model is established among different channels, complementary information among the channels can be enhanced, and based on the following steps:
first, image pair A 'is constructed' 1 、A′ 2d Wherein the characteristic diagram
Figure BDA0004066531890000099
Is a multi-level attention diagram A 2 Down-sampling and size conversion; characteristic diagram A' 1 As above, a multi-level attention diagram A 1 Converting the size to obtain;
secondly, much like the above-described spatially interactive section, by matching feature map A ′T 1 And A' 2d Channel similarity matrix W obtained by inner product operation 2 Through the channel similarity matrix W 2 Performing similarity evaluation on the image pair, and comparing-W 2 T After normalization operation, the method is used for expressing the interaction relation of the image pair to obtain a channel interaction feature map M 2
M 2 =softmax(-W 2 T )∈[0,1] C′×C′ ,W 2 =A ′T 1 ×A′ 2d
Different from the space interaction part, the channel interaction part mainly mines complementary information along the channel dimension;
then, the channel interaction feature map M is processed 2 Mapping to feature map A' 1 、A′ 2d Obtaining a characteristic diagram A' c12 And A' c21
A′ c12 =M 2 ×A ′T 1 +A ′T 2d
Figure BDA0004066531890000106
Finally, the feature map is compared
Figure BDA0004066531890000101
Performs size conversion to obtain a feature map>
Figure BDA0004066531890000102
For a feature map>
Figure BDA0004066531890000103
It is necessary to convert the size of the product into L after adopting the process in one more step 1 X C', and then subjected to size conversion to obtain a characteristic map>
Figure BDA0004066531890000104
After the above operations are completed, the spatially interacted A_s12 is combined with the channel-interacted A_c12 to obtain the rich, diversified feature map A_12, and the spatially interacted A_s21 is combined with the channel-interacted A_c21 to obtain the rich, diversified feature map A_21.
5. Construct the random selection strategy RSS. The specific steps are as follows:
FIG. 3 is a structural diagram of the random selection strategy according to the invention. In theory the strategy input can be an arbitrary feature map F; in this section F refers to the diversified feature maps A_12 and A_21 output by the cross-layer attention interaction module.
First, the feature map F ∈ R^(C'×W×H) is sliced into n equal parts along the width dimension, giving n feature slices F^(k) ∈ R^(C'×(W/n)×H), k ∈ [1, n].
Then each feature slice F^(k) randomly selects a candidate branch to complete the corresponding operation, where n is determined by engineering experience and is set to 7 in the invention.
1) For the most significant inhibition branch, perform channel-average pooling on the feature slice F^(k) to obtain F_p(k) ∈ R^((W/n)×H):
F_p(k) = (1/C') Σ_{c=1}^{C'} F^(k)_c.
The value range of each pixel in the feature slice F^(k) is the same as that of the input feature map and represents the key feature expression obtained by the classification model. Since the proposed randomly selected globally diversified classification network is trained for a classification task, F_p(k) approximately reflects the spatial distribution of the most discriminative parts: the higher the value of an element, the stronger its discriminative power. That is, for the classification task, the intensity of each pixel in F_p(k) represents its discriminative ability. To eliminate the most discriminative part, an erasure mask P_(k)drop is generated by setting a threshold rate δ on the maximum pixel intensity of F_p(k); pixels greater than the threshold are set to 0, and pixels smaller than the threshold are set to 1:
P_(k)drop(i, j) = 0 if F_p(k)(i, j) > δ · max(F_p(k)), else 1.
Finally, the erasure mask P_(k)drop is applied to the slice F^(k) by point-wise multiplication to obtain the suppressed feature map.
2) For the feature enhancement branch, an enhancement mask P_(k)important is generated from the feature slice F^(k) using a sigmoid activation function and applied to the slice point by point:
P_(k)important = sigmoid(F_p(k)) ∈ [0,1]^((W/n)×H).
3) The no-operation branch leaves the feature slice F^(k) unprocessed.
The probabilities of the above three candidate branches are denoted α, β and γ respectively, where α + β + γ = 1. Each feature slice F^(k) randomly selects one of the branches to obtain the selected feature S^(k).
Finally, the processed selected features S^(k) are concatenated (concat) along the width dimension in the order they were cut to obtain the aggregated attention map A ∈ R^(C'×W×H):
A = concat(S^(k)).
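The strategy can be sketched end to end as follows. The branch probabilities and δ are illustrative assumptions, and the suppression and enhancement masks are inlined so the sketch stands alone.

```python
import numpy as np

def random_selection(feat, n=7, alpha=0.4, beta=0.4, gamma=0.2, delta=0.5, rng=None):
    """RSS sketch: slice F (C', W, H) into n width-slices, pick one of the three
    candidate branches per slice with probabilities alpha/beta/gamma, and
    concatenate back into the aggregated attention map A (C', W, H)."""
    rng = rng if rng is not None else np.random.default_rng()
    out = []
    for f in np.split(feat, n, axis=1):               # F^(k): (C', W/n, H)
        branch = rng.choice(3, p=[alpha, beta, gamma])
        fp = f.mean(axis=0)                           # channel-average pooling
        if branch == 0:                               # most significant inhibition
            f = f * (fp <= delta * fp.max())
        elif branch == 1:                             # feature enhancement
            f = f * (1.0 / (1.0 + np.exp(-fp)))
        out.append(f)                                 # branch == 2: no operation
    return np.concatenate(out, axis=1)
```

Note that `np.split` requires W to be divisible by n, which matches the patent's "n equal parts" slicing.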
6. The bilinear pooling module BP specifically comprises the following steps:
after the random selection strategy is completed, the polymerization attention diagram A belongs to R C×H×W Each channel in the image can represent different part features in the image, and if the image can be effectively fused with the original feature map D, a stronger feature representation can be constructed. In view of this, the present invention uses the bilinear pooling module BP to aggregate the attention map a and the initial feature map D, and the specific process is as follows:
first, the aggregate attention map A is multiplied by the initial feature map D element by element to obtain a part feature map D k
Figure BDA0004066531890000114
Then, the partial feature map D is mapped using global maximum pooling GMP k Is processedObtaining attention characteristics d of each specific part k ∈R 1×C
d k =GMP(D k )
Finally, attention is paid to the feature d k The characteristic matrix P of the stacked parts belongs to R M×C
P=(d 1 ,d 2 ,...,d M ) T
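The three BP steps reduce to a gate-and-pool loop; a minimal sketch, assuming M equals the number of attention channels C' and taking GMP over the spatial axes:

```python
import numpy as np

def bilinear_pool(attn, feat):
    """attn: aggregated attention map A of shape (C', W, H);
    feat: initial feature map D of shape (C, W, H).
    Returns the part feature matrix P of shape (C', C)."""
    parts = []
    for k in range(attn.shape[0]):
        dk = attn[k][None, :, :] * feat               # D_k = A_k * D, broadcast over channels
        parts.append(dk.max(axis=(1, 2)))             # d_k = GMP(D_k): one value per channel
    return np.stack(parts)                            # P = (d_1, ..., d_M)^T
```

Each row of P is one part descriptor: the attention channel gates where in the image the pooling looks, and GMP keeps the strongest response per feature channel.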
7. Feature splicing, with the following specific steps:
layer3 and Layer4 of the basic backbone network ConvNeXt are subjected to CAIM and RSS to obtain two different part feature matrixes, dimension splicing is carried out on the two part feature matrixes to obtain a feature graph y, and the feature graph y is input into a classifier for classification;
in the classifier part, the feature graph y after fusion splicing is mapped into a one-dimensional feature vector, and the final classification of the image is realized by softmax logistic regression in a full-connected layer mode.
8. Loss calculation, which specifically comprises the following steps:
the preprocessed training data set is fed into the local-diversity-guided weakly supervised fine-grained image classification network to obtain predicted classification results; a loss function computes the loss value of the classification results via the Euclidean distance, and the network is trained with the Adam optimization algorithm to obtain the final local-diversity-guided weakly supervised fine-grained image classification network model.
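A minimal single-step training sketch under stated assumptions: `model` stands in for the full classification network, and cross-entropy stands in for the loss, since the patent only states that a loss value is computed from the prediction (it mentions a Euclidean-distance-based loss whose exact form is not given) and optimized with Adam:

```python
import torch
import torch.nn as nn

model = nn.Linear(32, 10)                    # stand-in for the full classification network
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()            # assumed surrogate for the patent's loss

x = torch.rand(8, 32)                        # a preprocessed mini-batch
labels = torch.randint(0, 10, (8,))
optimizer.zero_grad()
loss = criterion(model(x), labels)           # loss between prediction and label
loss.backward()
optimizer.step()                             # one Adam update
```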
9. Visual analysis, which comprises the following steps:
to further verify the effectiveness of the proposed algorithm, a visual experimental analysis was carried out. Fig. 4 shows visualization results of the proposed method on the CUB-200-2011 dataset and the self-built ancient pagoda dataset. Compared with the original baseline model, the method focuses not only on the most salient region but also on other overlooked key points, thanks to the joint action of the cross-layer attention interaction module CAIM and the random selection strategy RSS, which force the network to comprehensively mine multiple distinct discriminative features.
10. Comparison method verification
The invention adopts an excellent feature extraction backbone network, constructs a strong initial feature representation, and effectively improves algorithm accuracy; it establishes a feature interaction method that strengthens the association between multi-level attention maps and enhances the mining of semantically complementary information; it proposes a random selection strategy with three operations (most significant suppression, feature enhancement and no operation), whose random execution forces attention toward more comprehensive local information and increases the network's attention to more local features; and it conducts thorough experiments on multiple data sets, demonstrating the effectiveness of the method.
In the invention, a comparative experiment was carried out on the CUB-200-2011 dataset. Since input image resolution and the choice of backbone network are important factors affecting the classification accuracy of fine-grained algorithms, Table 1 lists the input image size and backbone network adopted by each comparison method. As the results in Table 1 show, the proposed method achieves an excellent accuracy of 92.3% on this dataset, far higher than Part-based R-CNNs, PoseNorm, KERL, Mask-CNN and BCNN. FBSD and API-Net lead the other CNN-based classification algorithms with 89.8% and 90.0%, but both fall behind the proposed method. Transformer-based methods such as RAMS-Trans, TPSKG, AFTrans, FFVT and TransFG perform strongly; the proposed method, taking ConvNeXt as the backbone network and combining the cross-layer attention interaction module with the random selection strategy, achieves still better performance.
Table 1: comparison of accuracy of different algorithms on CUB-200-2011 (bird 1) data set
(Table 1 data are rendered as an image in the source document.)
NABirds contains 48,562 bird images in 555 categories, each with around 100 images; it is a much larger fine-grained dataset than CUB-200-2011 and more challenging for fine-grained tasks. As Table 2 shows, the proposed method achieves 91.5% accuracy on this dataset, higher than PC-CNN, MaxEnt, Cross-X, HGNet and GaRD, and 1.4 and 0.7 percentage points higher than the strong TPSKG and TransFG, respectively. The method thus retains good performance on larger datasets.
Table 2: comparison of accuracy of different algorithms on NABirds (birds 2) datasets
(Table 2 data are rendered as an image in the source document.)
ISIA Food-200 contains 200 food categories, about 200,000 food images and 319 ingredients; it is more diverse and far larger than the two datasets above, so only a few methods have attempted experiments on it. As Table 3 shows, IG-CMAN and TPSKG achieve 67.5% and 69.5% accuracy on this dataset, respectively. The proposed method, which realizes diversity of multi-scale local features, reaches 72.8%, exceeding IG-CMAN and TPSKG by 5.3 and 3.3 percentage points, respectively.
Table 3: comparison of accuracy of different algorithms on the ISIA Food-200 dataset
(Table 3 data are rendered as an image in the source document.)
For the ancient pagoda dataset, the invention collected 13,815 pagoda images in 229 categories, of which 6,915 images were used for training and 6,900 for validation. On this dataset, experiments were first carried out with several backbone networks commonly used for fine-grained tasks; as Table 4 shows, ResNet50, ResNet101, DenseNet161 and ViT reach accuracies of 92.3%, 93.4%, 93.6% and 93.8%, respectively. Based on the three dataset experiments above, several excellent algorithms were selected for comparative experiments on this dataset, specifically FBSD, PMG, FFVT and TransFG, with results shown in Table 4. The proposed method achieves 95.2% accuracy on the self-built ancient pagoda dataset, outperforming the other algorithms.
Table 4: comparison of accuracy of different algorithms on a pyramid data set
(Table 4 data are rendered as an image in the source document.)
The invention also provides a local-diversity-guided weakly supervised fine-grained image classification system, which comprises:
a classification network construction module, used for constructing a local-diversity-guided weakly supervised fine-grained image classification network, the classification network comprising a basic backbone network ConvNeXt, a cross-layer attention interaction module CAIM and a bilinear pooling module BP connected after Layer3 and Layer4 of the basic backbone network ConvNeXt, with a random selection strategy RSS connected between the cross-layer attention interaction module CAIM and the bilinear pooling module BP;
a classification network training module, used for training the local-diversity-guided weakly supervised fine-grained image classification network to obtain a local-diversity-guided weakly supervised fine-grained image classification network model;
and an image classification module, used for feeding the preprocessed training data set into the local-diversity-guided weakly supervised fine-grained image classification network model to obtain an image classification result.

Claims (10)

1. A local-diversity-guided weakly supervised fine-grained image classification method, characterized by comprising the following specific steps:
S1, constructing a local-diversity-guided weakly supervised fine-grained image classification network, wherein the classification network comprises a basic backbone network ConvNeXt, a cross-layer attention interaction module CAIM and a bilinear pooling module BP connected after Layer3 and Layer4 of the basic backbone network ConvNeXt, with a random selection strategy RSS connected between the cross-layer attention interaction module CAIM and the bilinear pooling module BP;
S2, training the local-diversity-guided weakly supervised fine-grained image classification network to obtain a local-diversity-guided weakly supervised fine-grained image classification network model;
and S3, feeding the preprocessed training data set into the local-diversity-guided weakly supervised fine-grained image classification network model to obtain an image classification result.
2. The local-diversity-guided weakly supervised fine-grained image classification method according to claim 1, wherein the basic backbone network ConvNeXt comprises 4 layers: Layer1 preprocesses the input image; Layer2-Layer4 are each composed of multiple ConvNeXt blocks; and Layer3 and Layer4 of the ConvNeXt backbone output initial feature maps D ∈ R^(C×W×H) at different scales, where C, W and H denote the number of channels, width and height of the feature map.
3. The local-diversity-guided weakly supervised fine-grained image classification method according to claim 1, characterized in that the cross-layer attention interaction module CAIM is constructed by the following specific steps:
1) The initial feature maps D ∈ R^(C×W×H) of different scales obtained from Layer3 and Layer4 of the backbone network are each passed through a 3×3 convolution to obtain the multi-level attention maps A_1 and A_2;
2) The multi-level attention maps A_1 and A_2 are subjected to spatial interaction and channel interaction, obtaining the feature maps A_s12 and A_s21 from the completed spatial interaction and the feature maps A_c12 and A_c21 from the completed channel interaction;
3) A_s12 from the spatial interaction is combined with A_c12 from the channel interaction to obtain the diversified feature map A_12, and A_s21 from the spatial interaction is combined with A_c21 from the channel interaction to obtain the diversified feature map A_21.
4. The method of claim 3, wherein the spatial interaction comprises:
1) The multi-level attention maps A_1 ∈ R^(C′×W_1×H_1) and A_2 ∈ R^(C′×W_2×H_2) are size-converted to obtain the feature maps A′_1 ∈ R^(C′×L_1) and A′_2 ∈ R^(C′×L_2), where L_1 = W_1 × H_1 and L_2 = W_2 × H_2;
2) An inner-product operation on the feature maps A′_1 and A′_2 yields the spatial similarity matrix W_1, which is normalized to obtain the spatial interaction feature map M_1:
M_1 = softmax(W_1) ∈ [0, 1]^(L_1×L_2), W_1 = A′^T_1 × A′_2
3) The spatial interaction feature map M_1 is mapped back onto the feature maps A′_1 and A′_2 to obtain the feature maps A′_s12 and A′_s21:
A′_s12 = M^T_1 × A′^T_1 + A′^T_2
A′_s21 = M_1 × A′^T_2 + A′^T_1
4) The feature maps A′_s12 and A′_s21 are size-converted to obtain the feature maps A_s12 and A_s21 reflecting the spatial dependence.
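A plausible PyTorch reconstruction of the spatial interaction above. The exact placement of transposes in the source text is garbled by extraction, so `spatial_interaction` below is a dimension-consistent sketch, not the patent's authoritative formulation:

```python
import torch

def spatial_interaction(A1, A2):
    """Spatial interaction sketch: flatten both attention maps, build a
    spatial-similarity matrix, softmax-normalise it, and map it back onto
    each feature map as a residual."""
    C = A1.shape[0]
    F1 = A1.reshape(C, -1)              # A'_1, (C, L1)
    F2 = A2.reshape(C, -1)              # A'_2, (C, L2)
    W1 = F1.t() @ F2                    # spatial similarity matrix, (L1, L2)
    M1 = torch.softmax(W1, dim=-1)      # spatial interaction feature map
    s12 = M1.t() @ F1.t() + F2.t()      # (L2, C): A1's cues mapped onto A2's grid
    s21 = M1 @ F2.t() + F1.t()          # (L1, C): A2's cues mapped onto A1's grid
    # size conversion back to the original spatial layouts
    return s12.t().reshape_as(A2), s21.t().reshape_as(A1)
```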
5. The method of claim 3, wherein the channel interaction comprises:
1) The multi-level attention map A_2 is down-sampled and size-converted to obtain the feature map A′_2d ∈ R^(C′×L_1), and the multi-level attention map A_1 is size-converted to obtain the feature map A′_1; A′_1 and A′_2d are taken as a map pair;
2) An inner-product operation on A′_1 and A′_2d yields the channel similarity matrix W_2, through which the similarity of the map pair is evaluated; -W^T_2 is normalized to obtain the channel interaction feature map M_2:
M_2 = softmax(-W^T_2) ∈ [0, 1]^(C′×C′), W_2 = A′_1 × A′^T_2d
3) The channel interaction feature map M_2 is mapped back onto the feature maps A′_1 and A′_2d to obtain the feature maps A′_c12 and A′_c21:
A′_c12 = M_2 × A′_1 + A′_2d
A′_c21 = M^T_2 × A′_2d + A′_1
4) The feature map A′_c12 is size-converted to obtain A_c12, and the feature map A′_c21 is up-sampled and size-converted to obtain A_c21.
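A plausible PyTorch reconstruction of the channel interaction above; the resampling of A_2 and the placement of transposes are dimension-consistent assumptions rather than the patent's exact formulation. Note the negated similarity matrix, which gives higher weight to complementary (dissimilar) channels:

```python
import torch
import torch.nn.functional as nnf

def channel_interaction(A1, A2):
    """Channel interaction sketch: resample A2 to A1's spatial size, build a
    C x C channel-similarity matrix, negate it before the softmax, and map
    the result back onto each flattened map as a residual."""
    C, H1, W1 = A1.shape
    A2d = nnf.adaptive_avg_pool2d(A2.unsqueeze(0), (H1, W1)).squeeze(0)  # A'_2d on A1's grid
    F1 = A1.reshape(C, -1)                  # A'_1, (C, L1)
    F2 = A2d.reshape(C, -1)                 # A'_2d, (C, L1)
    W2 = F1 @ F2.t()                        # channel similarity matrix, (C, C)
    M2 = torch.softmax(-W2.t(), dim=-1)     # negative sign favours complementary channels
    c12 = (M2 @ F1 + F2).reshape(C, H1, W1)
    c21 = (M2.t() @ F2 + F1).reshape(C, H1, W1)
    return c12, c21
```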
6. The local-diversity-guided weakly supervised fine-grained image classification method according to claim 1, wherein the construction of the random selection strategy RSS comprises the following steps:
1) The feature map F ∈ R^(C′×W×H) output by the cross-layer attention interaction module CAIM is sliced into n equal parts along the width dimension, obtaining n feature slices F^(k) ∈ R^(C′×(W/n)×H), k ∈ [1, n];
2) For each feature slice F^(k), the most significant suppression branch, the feature enhancement branch or the no-operation branch is randomly selected for data processing, obtaining the selection feature S^(k):
S^(k) = suppress(F^(k)) with probability α; enhance(F^(k)) with probability β; F^(k) (no operation) with probability γ
wherein the probabilities of selecting the most significant suppression branch, the feature enhancement branch and the no-operation branch are α, β and γ, respectively, with α + β + γ = 1;
3) The processed selection features S^(k) are spliced along the width dimension in which the slicing was performed, obtaining the aggregated attention map A ∈ R^(C′×W×H):
A = concat(S^(k)).
7. The method of claim 6, wherein the most significant suppression branch comprises:
1) A channel-average pooling operation is applied to the feature slice F^(k) ∈ R^(C′×(W/n)×H) to obtain F_p(k) ∈ R^((W/n)×H):
F_p(k) = (1/C′) · Σ_{c=1}^{C′} F^(k)(c)
2) A threshold is set from the maximum pixel value via the threshold rate δ, generating the elimination mask P_(k)drop: pixels larger than the threshold are set to 0, and pixels smaller than the threshold are set to 1:
P_(k)drop(i, j) = 0, if F_p(k)(i, j) > δ · max(F_p(k)); 1, otherwise
3) The elimination mask P_(k)drop is applied to the feature slice F^(k) in an element-wise manner to obtain the elimination feature map;
the feature enhancement branch comprises:
1) From the feature slice F^(k), an enhancement mask P_(k)important is generated using the sigmoid activation function:
P_(k)important = sigmoid(F_p(k)) ∈ [0, 1]^((W/n)×H)
2) The enhancement mask P_(k)important is applied to the slice F^(k) in an element-wise manner to obtain the enhanced feature map;
and the no-operation branch leaves the feature slice F^(k) unprocessed.
8. The local-diversity-guided weakly supervised fine-grained image classification method according to claim 1, wherein the outputs of the random selection strategy RSS and of the basic backbone network ConvNeXt are processed by the bilinear pooling module BP, the specific steps comprising:
1) Acquiring the initial feature map D output by the basic backbone network ConvNeXt and the aggregated attention map A ∈ R^(C′×H×W) output by the random selection strategy RSS;
2) Multiplying the aggregated attention map A with the initial feature map D element-wise to obtain the part feature maps D_k:
D_k = A_k ⊙ D (k = 1, 2, ..., C′)
3) Processing the part feature map D_k with global maximum pooling GMP to obtain the attention feature of each specific part, d_k ∈ R^(1×C):
d_k = GMP(D_k)
4) Stacking the attention features d_k into the part feature matrix P ∈ R^(M×C):
P = (d_1, d_2, ..., d_M)^T.
9. The local-diversity-guided weakly supervised fine-grained image classification method according to claim 8, wherein Layer3 and Layer4 of the basic backbone network ConvNeXt each yield a part feature matrix P ∈ R^(M×C); the two part feature matrices are spliced along the dimension axis to obtain the feature map y, y is mapped into a one-dimensional feature vector, and the final classification of the image is realized by softmax logistic regression in the form of a fully connected layer.
10. A local-diversity-guided weakly supervised fine-grained image classification system, characterized by comprising:
a classification network construction module, used for constructing a local-diversity-guided weakly supervised fine-grained image classification network, the classification network comprising a basic backbone network ConvNeXt, a cross-layer attention interaction module CAIM and a bilinear pooling module BP connected after Layer3 and Layer4 of the basic backbone network ConvNeXt, with a random selection strategy RSS connected between the cross-layer attention interaction module CAIM and the bilinear pooling module BP;
a classification network training module, used for training the local-diversity-guided weakly supervised fine-grained image classification network to obtain a local-diversity-guided weakly supervised fine-grained image classification network model;
and an image classification module, used for feeding the preprocessed training data set into the local-diversity-guided weakly supervised fine-grained image classification network model to obtain an image classification result.
CN202310077544.7A 2023-01-29 2023-01-29 Local diversity guided weak supervision fine-grained image classification method and system Pending CN115984627A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310077544.7A CN115984627A (en) 2023-01-29 2023-01-29 Local diversity guided weak supervision fine-grained image classification method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310077544.7A CN115984627A (en) 2023-01-29 2023-01-29 Local diversity guided weak supervision fine-grained image classification method and system

Publications (1)

Publication Number Publication Date
CN115984627A true CN115984627A (en) 2023-04-18

Family

ID=85963113

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310077544.7A Pending CN115984627A (en) 2023-01-29 2023-01-29 Local diversity guided weak supervision fine-grained image classification method and system

Country Status (1)

Country Link
CN (1) CN115984627A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117853875A (en) * 2024-03-04 2024-04-09 华东交通大学 Fine-granularity image recognition method and system
CN117853875B (en) * 2024-03-04 2024-05-14 华东交通大学 Fine-granularity image recognition method and system
CN117911798A (en) * 2024-03-19 2024-04-19 青岛奥克生物开发有限公司 Stem cell quality classification method and system based on image enhancement
CN117911798B (en) * 2024-03-19 2024-05-28 青岛奥克生物开发有限公司 Stem cell quality classification method and system based on image enhancement

Similar Documents

Publication Publication Date Title
CN115984627A (en) Local diversity guided weak supervision fine-grained image classification method and system
US8503792B2 (en) Patch description and modeling for image subscene recognition
CN110414554A (en) One kind being based on the improved Stacking integrated study fish identification method of multi-model
CN106023065A (en) Tensor hyperspectral image spectrum-space dimensionality reduction method based on deep convolutional neural network
CN104346620A (en) Inputted image pixel classification method and device, and image processing system
Badawi et al. A hybrid memetic algorithm (genetic algorithm and great deluge local search) with back-propagation classifier for fish recognition
Lodhi et al. Multipath-DenseNet: A Supervised ensemble architecture of densely connected convolutional networks
Marburg et al. Deep learning for benthic fauna identification
CN110674685B (en) Human body analysis segmentation model and method based on edge information enhancement
CN111126401B (en) License plate character recognition method based on context information
CN113034506B (en) Remote sensing image semantic segmentation method and device, computer equipment and storage medium
Nguyen et al. Satellite image classification using convolutional learning
Zhang et al. MATNet: A combining multi-attention and transformer network for hyperspectral image classification
CN112733912B (en) Fine granularity image recognition method based on multi-granularity countering loss
CN111062438B (en) Image propagation weak supervision fine granularity image classification algorithm based on correlation learning
CN116563680B (en) Remote sensing image feature fusion method based on Gaussian mixture model and electronic equipment
CN111325766A (en) Three-dimensional edge detection method and device, storage medium and computer equipment
CN115641473A (en) Remote sensing image classification method based on CNN-self-attention mechanism hybrid architecture
Hassam et al. A single stream modified mobilenet v2 and whale controlled entropy based optimization framework for citrus fruit diseases recognition
CN113496221B (en) Point supervision remote sensing image semantic segmentation method and system based on depth bilateral filtering
CN112149526A (en) Lane line detection method and system based on long-distance information fusion
CN117173702A (en) Multi-view multi-mark learning method based on depth feature map fusion
Luan et al. Sunflower seed sorting based on convolutional neural network
Nie et al. Learning enhanced features and inferring twice for fine-grained image classification
Tian et al. Recognition of geological legends on a geological profile via an improved deep learning method with augmented data using transfer learning strategies

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination