CN110738247B - Fine-grained image classification method based on selective sparse sampling - Google Patents

Fine-grained image classification method based on selective sparse sampling

Info

Publication number
CN110738247B
CN110738247B
Authority
CN
China
Prior art keywords
class
image
response
classification
peak
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910942790.8A
Other languages
Chinese (zh)
Other versions
CN110738247A (en)
Inventor
焦建彬 (Jiao Jianbin)
丁瑶 (Ding Yao)
叶齐祥 (Ye Qixiang)
韩振军 (Han Zhenjun)
万方 (Wan Fang)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Chinese Academy of Sciences
Original Assignee
University of Chinese Academy of Sciences
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Chinese Academy of Sciences filed Critical University of Chinese Academy of Sciences
Priority to CN201910942790.8A priority Critical patent/CN110738247B/en
Publication of CN110738247A publication Critical patent/CN110738247A/en
Application granted granted Critical
Publication of CN110738247B publication Critical patent/CN110738247B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/25 Fusion techniques
    • G06F 18/253 Fusion techniques of extracted features

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a fine-grained image classification method based on selective sparse sampling, comprising the following steps: locate important parts with a classification network by extracting class response maps from the image, so that the key parts that are effective for classification are located on the target as comprehensively as possible; locally magnify the learned groups of key parts by sparse resampling; and extract features from the locally magnified images and, combined with the features of the original image, determine the image class with a classifier. The method exploits the fact that class peak responses correspond to visual cues to locate key parts quickly, which is faster and more effective than locating parts with detection boxes. By locally magnifying key parts through sparse resampling, image details are enhanced while background information is retained, avoiding information loss. The method therefore has good practicability and extensibility and is of great significance for the fine-grained image classification task.

Description

Fine-grained image classification method based on selective sparse sampling
Technical Field
The invention relates to the fields of computer vision and image processing, and in particular to a fine-grained image classification method based on selective sparse sampling.
Background
Fine-grained image classification is one of the important problems in the vision field, with significant application value in animal and plant protection, medical image analysis, and other areas. Conventional fine-grained image classification models often require that the position of each object, and even of each part on the object, be accurately annotated in the image. Although such methods can rely on large amounts of annotation to learn discriminative information, they place very high demands on dataset collection and construction. Accurately annotating every target in an image dataset is time-consuming and labor-intensive, especially as the dataset grows, which greatly limits the application of such algorithms to large-scale fine-grained image datasets.
To reduce manual annotation and supervision during modeling, fine-grained image classification frameworks based only on image-level class labels have been proposed. Such a framework requires only the class of the target in the image to be labeled, without other forms of annotation such as bounding boxes. This labeling scheme greatly reduces the annotation workload and makes it possible to collect large-scale datasets directly from massive internet image resources. However, when training current fine-grained classification algorithms based only on image labels, the lack of precise part position information introduces considerable randomness into part localization, which affects the stability and accuracy of the algorithm and places higher demands on its fine-feature learning capability.
Existing fine-grained image classification methods fall into three main categories: 1. feature-learning-based methods, typified by bilinear models built on a classification network; 2. fine-feature learning models based on locating discriminative parts, which mostly use weakly supervised object detection to locate the discriminative parts, crop those parts from the original image according to the localization result, extract their features, and complete feature learning by combining them with the features of the original image; 3. attention-based methods, which first locate the most discriminative part through iterative learning and then fuse the intermediate outputs of the iteration, i.e., the features of the part at different scales. These methods have been continuously optimized and achieve state-of-the-art performance.
However, these methods have shortcomings. The first category is general-purpose but is not tailored to the small inter-class differences characteristic of fine-grained classification. In the second category, part localization based only on image labels is complex and time-consuming; the number of parts must be specified manually, so the method does not adapt to image content; moreover, parts are extracted by cropping, so a large amount of useful information can be lost when localization is inaccurate. The third category relies on iterative learning, which easily accumulates errors. These deficiencies limit the robustness and generalization of the learned models.
Disclosure of Invention
To overcome these problems, the inventors conducted intensive research and propose a fine-grained image classification method based on selective sparse sampling. The rich semantic information of the classification network's response maps (class activation maps) is used to locate discriminative parts, improving the efficiency and flexibility of the model; the discriminative parts are then learned at a larger scale through local magnification, avoiding information loss. Experiments show that the method improves both the speed and the accuracy of fine part localization and outperforms the best existing methods (such as NTS-CNN), thereby completing the invention.
The invention aims to provide the following technical scheme:
The invention provides a fine-grained image classification method based on selective sparse sampling, which comprises a process of training a classification model for target classification, wherein the training process comprises the following steps:
Step 1: key part localization: input the image into a classification network, output the class response map of the image, and extract the class peak responses on the class response map;
Step 2: class peak response grouping: group the class peak responses obtained in step 1 by response strength into a discriminative attention group and a complementary attention group; each class peak response generates an attention map, so the two groups of class peak responses generate two groups of attention maps;
Step 3: resampling: aggregate the attention maps within each group to generate two saliency maps, resample the image under the guidance of the saliency maps to locally magnify the corresponding key parts, and obtain two resampled images;
Step 4: feature fusion and classification model construction: input the resampled images obtained in step 3 into the classification network of step 1 to extract features, combine the features of the original image and the resampled images, and classify with a classifier to obtain the classification model.
The fine-grained image classification method based on selective sparse sampling provided by the invention has the following beneficial effects:
(1) the method learns from image-level class labels and requires no strong annotation (bounding boxes or part annotations); it uses the rich semantics of class peak responses on the class response map to locate key parts quickly, significantly improving feasibility and practicability;
(2) the method groups class peak responses by response value, preventing strong responses from dominating the learning process, allowing parts with weak responses to be learned as well, and improving feature robustness;
(3) the method locally magnifies key parts through image resampling, enhancing important image details while retaining background information and avoiding information loss;
(4) the method fuses the features of the resampled images with those of the original image, combining local and global image features. In addition, the resampled images and the original image share a feature extraction network, which realizes a special form of data augmentation and helps improve the generalization of the model;
(5) in the method, the class response map is updated as the network trains and the image features are learned, and the updated class response map guides the generation of new resampled images. Discriminative part localization and feature learning therefore reinforce each other, forming a special closed-loop iterative learning scheme.
Drawings
FIG. 1 is a schematic diagram of the model structure of the selective-sparse-sampling-based fine-grained image classification method;
FIG. 2 shows examples and the distribution of the number of class peak responses located by the model;
FIG. 3 is a schematic of selective sparse sampling during model training;
FIG. 4 illustrates target classification results of the proposed method;
FIG. 5 illustrates target localization results of the proposed method.
Detailed Description
The invention is explained in further detail below with reference to the drawings. The features and advantages of the invention will become more apparent from the description.
As shown in fig. 1, the invention provides a fine-grained image classification method based on selective sparse sampling, which comprises a process of training a classification model for target classification, wherein the training process comprises the following steps:
Step 1: key part localization: input the image into a classification network, output the class response map of the image, and extract the class peak responses on the class response map;
wherein a class peak response corresponds to a key part in the image, a key part being a region discriminative for classification;
the class peak response is preferably a local maximum on the class response map;
Step 2: class peak response grouping: group the class peak responses obtained in step 1 by response strength into a discriminative attention group and a complementary attention group; each class peak response generates an attention map, so the two groups of class peak responses generate two groups of attention maps;
Step 3: resampling: aggregate the attention maps within each group to generate two saliency maps, resample the image under the guidance of the saliency maps to locally magnify the corresponding key parts, and obtain two resampled images;
Step 4: input the resampled images obtained in step 3 into the classification network of step 1 to extract features, combine the features of the original image and the resampled images, and classify with a classifier to obtain the classification model.
Step 1 of the invention: key part localization. The key part localization algorithm is based on the rich semantic information of the class response map and integrates the part features corresponding to the class peak response points on the map, aiming to find the parts, and their positions, that matter for the classification decision. Compared with part localization based on a weakly supervised object detection framework, the method omits the search and screening of important parts, making classification more efficient.
In a preferred embodiment of the invention, in step 1, the image carries only an image-level label; neither the whole target nor the positions of its parts need to be annotated, and the key parts are located quickly using the rich semantics of the class peak responses on the class response map.
In a preferred embodiment of the invention, step 1 comprises the following sub-steps:
Step 1.1: input the image into a classification network and compute the class response map;
the classification network is preferably a convolutional neural network, and may be selected from any one of AlexNet, ResNet, VGGNet, google lenet, and the like.
Define a fine-grained image classification dataset: C denotes the number of classes and N the number of samples, the training set containing N_train samples and the test set N_test samples. Given an image I in the training set, input it into the classification network and extract the feature maps S ∈ R^(D×H×W) output by the deepest convolutional layer. After global average pooling, S is fed into a fully connected layer FC to obtain the network's prediction score s ∈ R^C for each class. Denote the weight of the fully connected layer FC as W^fc ∈ R^(D×C); the entry connecting class c and feature map S_d is w^fc_(d,c).
The class response map M_c for class c within the class response maps M is computed as shown in equation (1):
M_c = Σ_d w^fc_(d,c) · S_d    (1)
This formula relates the features learned by the network to the image classes and helps to understand intuitively which regions support the class decision.
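To make equation (1) concrete, here is a minimal PyTorch sketch; the storage layout of W^fc as a (D, C) matrix follows the definition above, while the function and variable names are illustrative:

```python
import torch

def class_response_maps(S: torch.Tensor, W_fc: torch.Tensor) -> torch.Tensor:
    """Equation (1): M_c = sum_d w_fc[d, c] * S_d.

    S:    feature maps of the deepest conv layer, shape (D, H, W).
    W_fc: fully connected layer weight, shape (D, C).
    Returns the class response maps M, shape (C, H, W).
    """
    # Weighted sum of the feature channels for every class.
    return torch.einsum('dc,dhw->chw', W_fc, S)
```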
Step 1.2: using the classification result s obtained in step 1.1, compute the set P of the network's predicted probabilities of the image for each class, sort the probabilities in descending order, and select the top five, p_1, …, p_5. Compute their entropy as shown in equation (2):
E = -Σ_(i=1)^5 p_i · log p_i    (2)
This entropy measures the confidence of the current classification network's prediction. Classification results of multiple networks on datasets such as CUB-200-2011, Stanford Cars and FGVC-Aircraft show that top-5 classification performance can reach 99.9%, i.e., one of the network's top five predictions is almost always correct, so selecting the top five prediction probabilities is both necessary and sufficient.
Step 1.3: taking into account both the precision and the recall of class peak response localization, compute the class response map R_o used to extract class peak responses according to equation (3):
R_o = M_1, if E ≤ δ;  R_o = Σ_(i=1)^5 p_i · M_i, otherwise    (3)
where M_i is the class response map corresponding to the class with the i-th highest prediction probability p_i, and δ is a threshold, chosen as 0.2 based on a control experiment.
The rule defined by equations (2) and (3) is: when the top-1 probability predicted by the classification network is high, i.e., the prediction is credible, only the class response map of the top-1 class is selected, so the extracted class peak responses contain no noise; when the top-1 probability is low, i.e., the sample is hard to predict and the prediction is not credible, the class response maps of the top-5 classes are combined to ensure recall of the class peak responses. To keep the training and testing processes consistent, the selection is based on the predicted classification scores rather than on class labels.
To avoid numerical difficulties caused by the magnitude of the variables, the class response map R_o is normalized in max-min fashion, as in equation (4):
R = (R_o - min(R_o)) / (max(R_o) - min(R_o))    (4)
where R is the normalized class response map, R_o is the class response map obtained from equation (3), min(R_o) is its minimum, and max(R_o) its maximum.
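A sketch of equations (2)-(4), assuming the class response maps M and the class scores s are available; obtaining probabilities from the scores via softmax, and reading the case split of equation (3) as an entropy test against δ, follow the rule described above and should be treated as assumptions:

```python
import torch

def select_response_map(M: torch.Tensor, s: torch.Tensor, delta: float = 0.2):
    """M: class response maps, shape (C, H, W); s: class scores, shape (C,)."""
    p = torch.softmax(s, dim=0)
    top_p, top_i = p.topk(5)                                # top-5 probabilities
    E = -(top_p * torch.log(top_p + 1e-12)).sum()           # equation (2)
    if E <= delta:                                          # credible: top-1 map only
        R_o = M[top_i[0]]
    else:                                                   # uncertain: weighted top-5
        R_o = (top_p[:, None, None] * M[top_i]).sum(dim=0)  # equation (3)
    # Equation (4): max-min normalization.
    return (R_o - R_o.min()) / (R_o.max() - R_o.min() + 1e-12)
```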
Step 1.4: extract local maxima from the class response map R within a window of set size to obtain the set of class peak response positions T = {(x_1, y_1), (x_2, y_2), …, (x_n, y_n)}, where n is the number of class peak responses.
In a preferred embodiment, the window size is 3×3, 5×5, or 7×7, preferably 3×3.
The number and positions of the class peak responses extracted in this process adapt to the image content rather than being fixed; their distribution over several fine-grained image classification datasets is shown in FIG. 2. The proposed framework is therefore more flexible and can be applied to different domains, such as birds, airplanes and cars, without tuning hyper-parameters for each specific task.
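Step 1.4 can be sketched with max pooling, which marks a position as a peak when it equals the maximum of its window; keeping only peaks above the map mean is an extra assumption made here to suppress flat background responses:

```python
import torch
import torch.nn.functional as F

def extract_class_peaks(R: torch.Tensor, window: int = 3):
    """R: normalized class response map, shape (H, W).
    Returns the peak position set T as a list of (x, y) tuples."""
    pooled = F.max_pool2d(R[None, None], window, stride=1, padding=window // 2)[0, 0]
    is_peak = (R == pooled) & (R > R.mean())  # local maximum with above-mean response
    ys, xs = torch.nonzero(is_peak, as_tuple=True)
    return [(int(x), int(y)) for x, y in zip(xs, ys)]
```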
Step 2 of the invention: class peak response grouping. Grouping class peak responses by response value prevents strong responses from dominating the learning process, allows parts with weak responses to be learned as well, and improves feature robustness. In an embodiment, step 2 comprises the following sub-steps:
step 2.1: will be described in detailClass-peak response in 1 divided into two sets TdAnd TcThe method is divided into the following formulas (5) and (6):
Td={(x,y)|(x,y)∈T if Rx,ynot less than ζ } formula (5)
Tc={(x,y)|(x,y)∈T if Rx,y<ζ formula (6)
Wherein R isx,yThe response value is the peak-like response (x, y), zeta is the division number, zeta is chosen to be a random number that can be (0,1) evenly distributed, or the median of all peak-like responses, etc., TdFor the discriminative class peak response set, i.e. the component decisive for the class decision, TcIs a complementary peak-like response set, i.e. a component that plays a complementary role for class determination.
Step 2.2: and (3) calculating a corresponding attention diagram for each class peak response by using a Gaussian kernel function in a way of formula (7), wherein the corresponding two groups of class peak responses generate two groups of attention diagrams:
Figure BDA0002223379210000081
wherein the content of the first and second substances,
Figure BDA0002223379210000082
is a peak-like response (x)i,yi) β, β1And β2Are learnable parameters. The meaning of the formula is: the stronger the response value, the more the region is amplified.
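A sketch of steps 2.1 and 2.2 together, taking ζ as the median of the peak responses (one of the options named in step 2.1); the placement of β_1 and β_2 follows the reconstructed equation (7):

```python
import torch

def group_and_attend(R: torch.Tensor, peaks, beta1: float, beta2: float):
    """R: (H, W) response map; peaks: list of (x, y); beta1, beta2: learnable scalars.
    Returns the attention maps of the discriminative and complementary groups."""
    H, W = R.shape
    vals = torch.tensor([float(R[y, x]) for x, y in peaks])
    zeta = vals.median()                      # division threshold (equations (5)-(6))
    ys, xs = torch.meshgrid(torch.arange(H, dtype=torch.float32),
                            torch.arange(W, dtype=torch.float32), indexing='ij')
    A_d, A_c = [], []
    for (x, y), v in zip(peaks, vals):
        # Equation (7): Gaussian centered on the peak, scaled by its response value.
        A = beta1 * v * torch.exp(-((xs - x) ** 2 + (ys - y) ** 2) / beta2 ** 2)
        (A_d if v >= zeta else A_c).append(A)
    return A_d, A_c
```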
Step 3 of the invention: resampling. Local magnification of key parts is realized by resampling the image, so that important details are enhanced while background information is retained, avoiding information loss.
In a preferred embodiment of the present invention, step 3 comprises the following substeps:
Step 3.1: using the attention maps obtained in step 2, sum the attention maps within each group to obtain the saliency maps Q_d and Q_c that guide resampling, computed as in equations (8) and (9):
Q_d = Σ A_i, if (x_i, y_i) ∈ T_d    (8)
Q_c = Σ A_i, if (x_i, y_i) ∈ T_c    (9)
where Q_d denotes the discriminative-branch saliency map and Q_c the complementary-branch saliency map.
Step 3.2: the two sets of saliency maps calculated from step 3.1 then guide the resampling of the original image.
The image I is regarded as a grid consisting of a point set V and an edge set E (a set of connecting lines between two adjacent points in the point set V), where V ═ x [ (×)0,y0),(x1,y1),…,(xend,yend)],(xi,yi) And the coordinates are corresponding to the image pixel points i. The points and edges form criss-cross grid lines. The goal of image resampling is to find a new set of coordinate points, V '═ x'0,y′0),(x′1,y′1),…,(x′end,y′end)]So that in the new coordinate system, important regions in the original image are uniformly sampled, while unimportant regions allow a certain degree of compression. This problem can be translated into finding a mapping between the original image and the resampled image, the mapping comprising two mapping functions f (x, y) and g (x, y), the resampled image then being Inew(x,y)=I(f(x,y),g(x,y))。
f(x, y) and g(x, y) distribute the saliency computed on the original image uniformly over the resampled image, so that regions with larger saliency values occupy a larger area in the resampled image. An approximate solution is given by equations (10) and (11):
f(x, y) = Σ_(x',y') Q(x', y') · k((x', y'), (x, y)) · x' / Σ_(x',y') Q(x', y') · k((x', y'), (x, y))    (10)
g(x, y) = Σ_(x',y') Q(x', y') · k((x', y'), (x, y)) · y' / Σ_(x',y') Q(x', y') · k((x', y'), (x, y))    (11)
where k((x', y'), (x, y)) is a Gaussian kernel that acts as a regularization term to avoid extreme cases, such as all pixels converging to the same value. Substituting the saliency maps Q_d and Q_c computed in equations (8) and (9) into equations (10) and (11) yields the two resampled images. The image corresponding to Q_d, named the discriminative-branch resampled image, highlights the regions decisive for classification; the image corresponding to Q_c, named the complementary-branch resampled image, magnifies the regions that provide supplementary evidence for classification and can push the model to learn more supporting evidence.
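A dense sketch of equations (10) and (11) followed by bilinear sampling; the kernel width sigma and the use of grid_sample with normalized coordinates are implementation assumptions:

```python
import torch
import torch.nn.functional as F

def saliency_resample(img: torch.Tensor, Q: torch.Tensor, sigma: float = 0.3):
    """img: (1, 3, H, W) image; Q: (h, w) saliency map (Q_d or Q_c).
    Returns the locally magnified resampled image, shape (1, 3, h, w)."""
    h, w = Q.shape
    ys, xs = torch.meshgrid(torch.linspace(-1, 1, h), torch.linspace(-1, 1, w),
                            indexing='ij')
    coords = torch.stack([xs.reshape(-1), ys.reshape(-1)], dim=1)    # (h*w, 2)
    d2 = ((coords[:, None, :] - coords[None, :, :]) ** 2).sum(-1)    # pairwise distances
    k = torch.exp(-d2 / (2 * sigma ** 2))                            # Gaussian kernel
    wgt = Q.reshape(1, -1) * k                                       # Q(x', y') * k(., .)
    grid = (wgt @ coords) / wgt.sum(dim=1, keepdim=True)             # equations (10)-(11)
    return F.grid_sample(img, grid.reshape(1, h, w, 2), align_corners=True)
```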
As shown in fig. 3, the selective sparse sampling provided by the method prevents strong features from dominating gradient learning, encouraging the network to learn a more comprehensive feature representation.
The whole resampling process is implemented with convolution operations, can be embedded in any neural network, and supports end-to-end learning and training, so the classification loss computed on the resampled images can optimize the parameters β_1 and β_2.
In step 3, local magnification of key parts is realized by image resampling, enhancing important details while retaining background information and avoiding information loss.
Step 4 of the invention: feature fusion and classification model construction. The features of the resampled images and of the original image are integrated (the resampled images are input into the classification network of step 1 to extract features, which are concatenated with the original image features extracted in step 1 to generate a new feature description of the image), fusing the global and local features of the image.
Through steps 1, 2 and 3, two resampled images are derived from one input image. These two resampled images are input into the classification network used in step 1 to extract features. To aggregate global and local features, define the image feature as F_J = {F_O, F_D, F_C}, where F_O, F_D and F_C are the original-image features, the discriminative-branch image features and the complementary-branch image features, respectively. The features are concatenated and fed into a fully connected layer with softmax to obtain the image classification result. The resampled images and the original image share the feature extraction network, which realizes a special form of data augmentation and helps improve the generalization of the model.
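A sketch of the joint classifier of step 4; the feature dimension and class count are illustrative, and global average pooling of each branch's features is assumed to happen upstream:

```python
import torch
import torch.nn as nn

class JointHead(nn.Module):
    """Classify the concatenated original, discriminative and complementary features."""
    def __init__(self, feat_dim: int = 2048, num_classes: int = 200):
        super().__init__()
        self.fc = nn.Linear(3 * feat_dim, num_classes)

    def forward(self, F_O, F_D, F_C):
        # F_J = {F_O, F_D, F_C}: concatenation fuses global and local features.
        return self.fc(torch.cat([F_O, F_D, F_C], dim=1))
```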
In a preferred embodiment, the fine-grained image classification method based on selective sparse sampling further includes a model optimization process, comprising the following steps:
Step 4.1: design a cross-entropy loss function, compute the gradient of the classification network according to the loss, back-propagate the gradient through the whole network, and update the network parameters;
The model's classification cross-entropy loss function is defined as shown in equation (12):
L_cls = Σ_(i∈{O,D,C}) L(Y_i, Y*) + L(Y_J, Y*)    (12)
where L_cls denotes the total cross-entropy loss, L(·,·) is the cross-entropy between a prediction and the label, Y_i (i ∈ {O, D, C}) are the prediction vectors of the original image and the two resampled images, Y_J is the prediction vector of the joint features, and Y* is the image label.
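Equation (12) as a sketch, assuming the logits of the three branches and of the joint features are available; F.cross_entropy applies softmax internally:

```python
import torch.nn.functional as F

def s3n_loss(logits_O, logits_D, logits_C, logits_J, target):
    """Equation (12): cross entropy over each branch plus the joint prediction."""
    branch_loss = sum(F.cross_entropy(y, target)
                      for y in (logits_O, logits_D, logits_C))
    return branch_loss + F.cross_entropy(logits_J, target)
```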
Step 4.2: judge from the classification error computed by the cross-entropy loss whether the network has converged (i.e., the error no longer decreases) or the maximum number of iterations has been reached; if so, stop training, otherwise jump to step 1.
Unknown images in the test set are input into the trained model to obtain the target classification result, as shown in fig. 4. It can be seen that, compared with a generic classification network, the method improves classification performance by activating more regions.
The method can improve the accuracy of fine-grained image classification and also the target localization capability. For target localization, the invention comprises the following steps:
Step 1: compute the top-1 class response maps M_O, M_D and M_C corresponding to the original image, the discriminative branch and the complementary branch;
Step 2: map the class response map M_D of the discriminative branch and the class response map M_C of the complementary branch back to the space of the original class response map M_O through the corresponding inverse transformations, then add M_O, M_D and M_C to generate the final class response map M_final;
the inverse transformation is the transformation that restores the locally magnified image to the original image;
Step 3: upsample the final class response map M_final to the size of the original image, threshold the upsampled map by its mean value, and take the minimum bounding box of the largest connected component as the target localization result.
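A sketch of these localization steps, using scipy.ndimage for the connected-component analysis; reading "dividing by the average value" as mean-thresholding follows step 3 above:

```python
import numpy as np
import torch.nn.functional as F
from scipy import ndimage

def locate_target(M_final, img_h: int, img_w: int):
    """M_final: (h, w) tensor, the fused class response map.
    Returns the bounding box (x1, y1, x2, y2) of the largest connected component."""
    up = F.interpolate(M_final[None, None], size=(img_h, img_w),
                       mode='bilinear', align_corners=False)[0, 0]
    mask = (up > up.mean()).cpu().numpy()               # threshold by the mean value
    labels, n = ndimage.label(mask)                     # connected components
    if n == 0:
        return None
    sizes = ndimage.sum(mask, labels, range(1, n + 1))  # component sizes
    ys, xs = np.where(labels == 1 + int(np.argmax(sizes)))
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())
```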
The localization result of the method is shown in fig. 5: compared with the baseline method, the localization is more accurate and comprehensive, and the information loss caused by overfitting in the baseline model is clearly alleviated.
Examples
Example 1
1. Database and sample classification
The method is used to classify fine-grained images. For accuracy and comparability of the experiments, the public CUB-200-2011, Stanford Cars and FGVC-Aircraft datasets, widely used in fine-grained image classification, are adopted. CUB-200-2011 is a bird dataset with 11788 images of 200 species, split into a training part of 5994 images and a test part of 5794 images. Stanford Cars is a car dataset with 16185 images of 196 car models, with 8144 images for training and 8041 for testing. FGVC-Aircraft is an airplane dataset with 10000 images, of which 6667 are for training and validation and 3333 for testing. Only image-level class labels are used; no additional strong annotation, such as bounding boxes, part keypoints, or hierarchical semantic relations between class labels, is used.
Model construction: the method uses a 50-layer residual convolutional network (ResNet-50) as the feature extractor, trained for 60 epochs with stochastic gradient descent with momentum and a batch size of 16. Weight decay is set to 1e-4 and momentum to 0.9. Parameters initialized from the ImageNet-pretrained model use an initial learning rate of 0.001; the other parameters use an initial learning rate of 0.01. Input images are resized to 448 × 448 pixels.
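The optimizer settings above can be sketched as follows; which parameters fall in each group is an assumption (backbone vs. newly added layers and the resampling parameters β_1, β_2):

```python
import torch

def build_optimizer(pretrained_params, new_params):
    """SGD with momentum per the setup above: lr 0.001 for ImageNet-pretrained
    parameters, lr 0.01 for new ones, weight decay 1e-4, momentum 0.9."""
    return torch.optim.SGD(
        [{'params': pretrained_params, 'lr': 0.001},   # pretrained backbone
         {'params': new_params, 'lr': 0.01}],          # classifier, beta_1, beta_2, ...
        momentum=0.9, weight_decay=1e-4)
```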
2. Performance evaluation criteria
2.1 Classification performance evaluation criterion
To evaluate the algorithm and compare with other methods, we choose the evaluation measure widely used in image classification: top-1 accuracy. For a single image, the class with the maximum value in the predicted probability vector is taken as the prediction; if it matches the labeled class, the prediction is correct. Over the whole dataset, the proportion of correctly predicted images is the dataset's top-1 accuracy.
In addition, to evaluate the localization performance of the framework, the dataset's ground-truth boxes are used during evaluation.
Box localization evaluation: the method produces a predicted box for the target; if the IoU between this box and the labeled box of the target in the original image exceeds 0.5, the localization is considered correct, otherwise incorrect. For each class, the percentage of correctly localized images over all images is taken as the box localization performance. The IoU is defined as:
IoU(B_p, B_gt) = area(B_p ∩ B_gt) / area(B_p ∪ B_gt)
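The box localization criterion as a small sketch; boxes are assumed to be (x1, y1, x2, y2) tuples:

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = ((a[2] - a[0]) * (a[3] - a[1]) +
             (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union

def box_localization_accuracy(pred_boxes, gt_boxes):
    """Fraction of images whose predicted box has IoU > 0.5 with the ground truth."""
    hits = sum(iou(p, g) > 0.5 for p, g in zip(pred_boxes, gt_boxes))
    return hits / len(pred_boxes)
```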
3. Results and analysis of the experiments
The quality of the located class peak responses is evaluated, and the effectiveness of the discriminative branch, the complementary branch and the sparse attention module are verified in turn.
3.1) Quality of class peak responses
A class peak response point is considered accurately located when it falls inside the target's annotation box; the localization accuracy of a single image is the proportion of class peak response points inside the annotation box over all detected points. The accuracy on a dataset is the average over all its images.
Table 1. Accuracy (%) of class peak responses in locating parts on each dataset
Dataset                 CUB      Aircraft   Cars
Localization accuracy   94.63    97.22      98.76
As can be seen from Table 1, class peak responses locate parts on the target very well.
3.2) Effectiveness of each branch
The effect of each branch, including the original branch (O-branch), the discriminative branch (D-branch) and the complementary branch (C-branch), is verified on CUB-200-2011. The impact on classification and box localization performance after removing different branches is measured; the results are shown in Table 2.
Table 2. Classification and box localization results (%) of each branch on CUB-200-2011
Method   Branches   Localization   O-branch   D-branch   C-branch   Total
S3N      O          57.7           86.0       -          -          86.0
S3N      O+D        59.2           87.0       86.5       -          87.6
S3N      O+C        56.6           86.8       -          85.3       87.3
S3N      D+C        62.6           -          87.1       85.6       87.5
S3N      O+D+C      65.2           87.9       86.7       85.1       88.5
The "Localization" column of Table 2 reports box localization performance, following the box localization criterion above; the remaining columns report top-1 classification performance, following the classification criterion.
As can be seen from Table 2, both the discriminative branch (D-branch) and the complementary branch (C-branch) improve the classification performance of the model, confirming that both facilitate the learning of fine features. The discriminative branch improves classification more than the complementary branch, showing that it focuses on parts decisive for classification while the complementary branch focuses on parts that support classification weakly. Classification is best when the original branch (O-branch), the discriminative branch and the complementary branch coexist, showing that the features they learn are complementary. Moreover, the presence of the D-branch and the C-branch improves the classification performance of the O-branch, showing that weight sharing in the backbone realizes a special form of data augmentation and improves the generalization of the network.
3.3) Effect of sparse attention
The effectiveness of the proposed sparse attention module for the selective sampling problem is verified by measuring the effect of several different attention mechanisms on classification performance on CUB-200-2011; the results are shown in Table 3.
Table 3. Classification accuracy (%) of several different attention mechanisms
Attention mechanism                   Top-1 accuracy   Note
Saliency-based attention              85.9             class-independent
Class-response-map-based attention    87.8             class-dependent
Sparse attention                      88.5             part-dependent
Saliency-based attention is proposed in "Recasens A, Kellnhofer P, Stent S, et al. Learning to zoom: a saliency-based sampling layer for neural networks. In ECCV, 2018."
Attention based on class response maps is proposed in "Zhou B, Khosla A, Lapedriza A, et al. Learning deep features for discriminative localization. In CVPR, 2016."
As can be seen from Table 3, as attention becomes less related to the classes, classification performance drops significantly, indicating that low-level saliency features of the network cannot guide feature learning from the perspective of high-level semantics. Compared with attention based on class response maps, the sparse attention mechanism captures finer visual cues while discarding regions irrelevant, or even harmful, to the classification decision. The proposed algorithm thus locates the fine parts useful for fine-grained classification and thereby improves classification performance.
3.4) Comparison with existing fine-grained image classification methods
The proposed method is compared against existing fine-grained image classification methods: feature-learning-based methods (B-CNN, Low-rank B-CNN, Boosted CNN, HIHCA, DFL-CNN), attention-based methods (RA-CNN, MA-CNN, DT-RAM), and NTS (a method based on a weakly supervised object detection framework). The CUB-200-2011, Stanford Cars and FGVC-Aircraft datasets are used, with the same classification evaluation criterion as above.
B-CNN is proposed in "Tsung-Yu Lin, Aruni RoyChowdhury, and Subhransu Maji. Bilinear CNN models for fine-grained visual recognition. In International Conference on Computer Vision, pages 1449-1457, 2015."
Low-rank B-CNN is proposed in "Shu Kong and Charless C. Fowlkes. Low-rank bilinear pooling for fine-grained classification. In Computer Vision and Pattern Recognition, 2017."
HIHCA is proposed in "Sijia Cai, Wangmeng Zuo, and Lei Zhang. Higher-order integration of hierarchical convolutional activations for fine-grained visual categorization. In IEEE International Conference on Computer Vision, 2017."
Boosted CNN is proposed in "Mohammad Moghimi, Serge J. Belongie, Mohammad J. Saberian, Jian Yang, Nuno Vasconcelos, and Li-Jia Li. Boosted convolutional neural networks. In Proceedings of the British Machine Vision Conference, 2016."
DFL-CNN is proposed in "Yaming Wang, Vlad I. Morariu, and Larry S. Davis. Learning a discriminative filter bank within a CNN for fine-grained recognition. In Computer Vision and Pattern Recognition, pages 4148-4157, 2018."
RA-CNN is proposed in "Jianlong Fu, Heliang Zheng, and Tao Mei. Look closer to see better: recurrent attention convolutional neural network for fine-grained image recognition. In Computer Vision and Pattern Recognition, 2017."
MA-CNN is proposed in "Heliang Zheng, Jianlong Fu, Tao Mei, and Jiebo Luo. Learning multi-attention convolutional neural network for fine-grained image recognition. In IEEE International Conference on Computer Vision, 2017."
DT-RAM is proposed in "Zhichao Li, Yi Yang, Xiao Liu, Feng Zhou, Shilei Wen, and Wei Xu. Dynamic computational time for visual attention. In International Conference on Computer Vision Workshops, pages 1199-1209, 2017."
NTS is proposed in "Ze Yang, Tiange Luo, Dong Wang, Zhiqiang Hu, Jun Gao, and Liwei Wang. Learning to navigate for fine-grained classification. In ECCV, pages 438-454, 2018."
Table 4. Comparison results (%) of the selective-sparse-sampling-based fine-grained image classification method (S3N) and the other methods on the three datasets
As can be seen from Table 4, the accuracy of the proposed method S3N is higher than that of the existing fine-grained image classification methods that use only image-level class labels (B-CNN, RA-CNN, MA-CNN, DFL-CNN, NTS, etc.). With selective sparse sampling, the method mines more visual cues and learns discriminative parts at a larger scale, allowing the model to learn finer features.
3.5) Influence of the threshold on classification
Comparison results (%) of the selective-sparse-sampling-based method (S3N) with different thresholds δ on CUB-200-2011.
Table 5. Effect of different thresholds δ on classification
δ           0       0.05    0.1     0.15    0.2     0.25    0.3
top-1 (%)   88.14   88.23   88.40   88.50   88.52   88.47   88.43
As can be seen from Table 5, the value of the threshold δ used when computing the class response map for extracting class peak responses has a certain influence on the classification result; performance is best at δ = 0.2.
The invention has been described above in connection with preferred embodiments, but these embodiments are merely exemplary and illustrative. Various substitutions and modifications can be made on this basis, and all of them fall within the protection scope of the invention.

Claims (9)

1. A fine-grained image classification method based on selective sparse sampling, comprising a process of training a classification model for target classification, wherein the training process comprises the following steps:
Step 1: key part localization: input the image into a classification network, output the class response map of the image, and extract the class peak responses on the class response map;
Step 2: class peak response grouping: group the class peak responses obtained in step 1 by response strength into a discriminative attention group and a complementary attention group; each class peak response generates an attention map, so the two groups of class peak responses generate two groups of attention maps;
step 2 comprises the following sub-steps:
Step 2.1: divide the class peak responses of step 1 into two sets T_d and T_c as follows:
T_d = {(x, y) | (x, y) ∈ T, R_(x,y) ≥ ζ},
T_c = {(x, y) | (x, y) ∈ T, R_(x,y) < ζ},
where R_(x,y) is the response value at (x, y), ζ is the division threshold, T_d is the discriminative class peak response set, T_c is the complementary class peak response set, T is the set of class peak response positions, and (x, y) is a class peak response position;
Step 2.2: compute an attention map for each class peak response using a Gaussian kernel, as follows:
A_i(x, y) = β_1 · R_(x_i,y_i) · exp(-((x - x_i)² + (y - y_i)²) / β_2²),
where R_(x_i,y_i) is the response value at (x_i, y_i), and β_1 and β_2 are learnable parameters controlling the degree of local magnification;
Step 3: resampling: aggregate the attention maps within each group to generate two saliency maps, resample the image under the guidance of the saliency maps to locally magnify the corresponding key parts, and obtain two resampled images;
Step 4: input the resampled images obtained in step 3 into the classification network of step 1 to extract features, combine the features of the original image and the resampled images, and classify with a classifier to obtain the classification model.
2. The method of claim 1, wherein, in step 1, the class peak response is a local maximum on the class response map.
3. The method according to claim 1, wherein, in step 1, the image is given only an image-level label, and neither the whole target nor the positions of its parts are annotated.
4. The method according to claim 1, wherein step 1 comprises the following sub-steps:
Step 1.1: input the image into a classification network and compute the class response map, defined as:
M_c = Σ_d w^fc_(d,c) · S_d,
where W^fc is the weight of the fully connected layer, i.e., the classifier; the entry connecting class c and feature map S_d is w^fc_(d,c); D is the number of feature channels and d is a channel index of the feature maps S;
Step 1.2: obtain the classification result with the classification network of step 1.1, compute the set P of predicted probabilities of the current image for each class, and select the top five prediction probabilities p_1, …, p_5; compute the entropy:
E = -Σ_(i=1)^5 p_i · log p_i;
Step 1.3: define the class response map used to extract class peak responses:
R_o = M_1, if E ≤ δ;  R_o = Σ_(i=1)^5 p_i · M_i, otherwise,
where p_i is the probability of predicting the input image as the i-th class, M_i is the corresponding class response map, δ is the threshold, and M_1 is the class response map of the class with the highest prediction probability;
Step 1.4: extract local maxima from the class response map within a window of set size to obtain the set of class peak response positions T = {(x_1, y_1), (x_2, y_2), …, (x_n, y_n)}, where n is the number of class peak responses.
5. The method according to claim 4, wherein, in step 1.3, the class response map R_o is normalized in max-min fashion, and the set T of class peak response positions is obtained from the normalized class response map:
R = (R_o - min(R_o)) / (max(R_o) - min(R_o)),
where R is the normalized class response map, R_o is the class response map obtained in step 1.3, min(R_o) is its minimum, and max(R_o) its maximum.
6. The method according to claim 1, wherein step 3 comprises the following sub-steps:
Step 3.1: using the attention maps obtained in step 2, sum the attention maps of each group to obtain the saliency maps Q_d and Q_c that guide resampling:
Q_d = Σ A_i, if (x_i, y_i) ∈ T_d,
Q_c = Σ A_i, if (x_i, y_i) ∈ T_c,
where Q_d denotes the discriminative-branch saliency map, Q_c the complementary-branch saliency map, A_i the attention map of the i-th class peak response, T_d the set of discriminative class peak responses, and T_c the set of complementary class peak responses;
Step 3.2: the two saliency maps computed in step 3.1 guide the resampling of the original image to obtain the resampled images, computed as:
f(x, y) = Σ_(x',y') Q(x', y') · k((x', y'), (x, y)) · x' / Σ_(x',y') Q(x', y') · k((x', y'), (x, y)),
g(x, y) = Σ_(x',y') Q(x', y') · k((x', y'), (x, y)) · y' / Σ_(x',y') Q(x', y') · k((x', y'), (x, y)),
where (x', y') are the coordinates of pixels in the original image, f(x, y) and g(x, y) are the corresponding abscissa and ordinate in the sampled image, Q is Q_d or Q_c, and k((x', y'), (x, y)) is a Gaussian kernel.
7. The method of claim 6, wherein the resampling process is performed by a convolution operation.
8. The method according to claim 1, wherein the selective-sparse-sampling-based fine-grained image classification method further comprises a model optimization process, comprising the following steps:
Step 4.1: design a cross-entropy loss function, compute the gradient of the classification network according to the loss, back-propagate the gradient through the whole network, and update the network parameters;
the model's classification cross-entropy loss function being defined as:
L_cls = Σ_(i∈{O,D,C}) L(Y_i, Y*) + L(Y_J, Y*),
where L_cls denotes the cross-entropy loss, Y_i are the prediction vectors of the original image and the resampled images, Y_J is the prediction vector of the joint features, Y* is the image label, O denotes the original image, D the discriminative resampled image, and C the complementary resampled image;
Step 4.2: judge from the classification error computed by the cross-entropy loss whether the network has converged or the maximum number of iterations has been reached; if so, stop training, otherwise jump to step 1.
9. The method of claim 1, wherein the selective-sparse-sampling-based fine-grained image classification method is further applied to target localization, comprising the following steps:
Step 1: compute the top-1 class response maps M_O, M_D and M_C corresponding to the original image, the discriminative branch and the complementary branch;
Step 2: map the class response map M_D of the discriminative branch and the class response map M_C of the complementary branch back to the space of the original class response map M_O through the corresponding inverse transformations, then add M_O, M_D and M_C to generate the final class response map M_final;
Step 3: upsample the final class response map M_final to the size of the original image, threshold the upsampled map by its mean value, and take the minimum bounding box of the largest connected component as the target localization result.
CN201910942790.8A 2019-09-30 2019-09-30 Fine-grained image classification method based on selective sparse sampling Active CN110738247B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910942790.8A CN110738247B (en) 2019-09-30 2019-09-30 Fine-grained image classification method based on selective sparse sampling

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910942790.8A CN110738247B (en) 2019-09-30 2019-09-30 Fine-grained image classification method based on selective sparse sampling

Publications (2)

Publication Number Publication Date
CN110738247A CN110738247A (en) 2020-01-31
CN110738247B (en) 2020-08-25

Family

ID=69269842

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910942790.8A Active CN110738247B (en) 2019-09-30 2019-09-30 Fine-grained image classification method based on selective sparse sampling

Country Status (1)

Country Link
CN (1) CN110738247B (en)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111507403A (en) * 2020-04-17 2020-08-07 腾讯科技(深圳)有限公司 Image classification method and device, computer equipment and storage medium
CN111368942B (en) * 2020-05-27 2020-08-25 深圳创新奇智科技有限公司 Commodity classification identification method and device, electronic equipment and storage medium
CN111915618B (en) * 2020-06-02 2024-05-14 华南理工大学 Peak response enhancement-based instance segmentation algorithm and computing device
CN111784673B (en) * 2020-06-30 2023-04-18 创新奇智(上海)科技有限公司 Defect detection model training and defect detection method, device and storage medium
CN111816308B (en) * 2020-07-13 2023-09-29 中国医学科学院阜外医院 System for predicting coronary heart disease onset risk through facial image analysis
CN111967527B (en) * 2020-08-21 2022-09-06 菏泽学院 Peony variety identification method and system based on artificial intelligence
KR102483693B1 (en) * 2020-12-02 2023-01-03 울산대학교 산학협력단 Method and apparatus of explainable multi electrocardiogram arrhythmia diagnosis
CN112906810B (en) * 2021-03-08 2024-04-16 共达地创新技术(深圳)有限公司 Target detection method, electronic device, and storage medium
CN113177546A (en) * 2021-04-30 2021-07-27 中国科学技术大学 Target detection method based on sparse attention module
CN113470029B (en) * 2021-09-03 2021-12-03 北京字节跳动网络技术有限公司 Training method and device, image processing method, electronic device and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101729911A (en) * 2009-12-23 2010-06-09 宁波大学 Multi-view image color correction method based on visual perception
CN109583305A (en) * 2018-10-30 2019-04-05 南昌大学 A kind of advanced method that the vehicle based on critical component identification and fine grit classification identifies again
KR20190109194A (en) * 2018-03-16 2019-09-25 주식회사 에이아이트릭스 Apparatus and method for learning neural network capable of modeling uncertainty

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10558750B2 (en) * 2016-11-18 2020-02-11 Salesforce.Com, Inc. Spatial attention model for image captioning
CN109284749A (en) * 2017-07-19 2019-01-29 微软技术许可有限责任公司 Refine image recognition
CN108334901A (en) * 2018-01-30 2018-07-27 福州大学 A kind of flowers image classification method of the convolutional neural networks of combination salient region
CN110197202A (en) * 2019-04-30 2019-09-03 杰创智能科技股份有限公司 A kind of local feature fine granularity algorithm of target detection

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101729911A (en) * 2009-12-23 2010-06-09 宁波大学 Multi-view image color correction method based on visual perception
KR20190109194A (en) * 2018-03-16 2019-09-25 주식회사 에이아이트릭스 Apparatus and method for learning neural network capable of modeling uncertainty
CN109583305A (en) * 2018-10-30 2019-04-05 南昌大学 A kind of advanced method that the vehicle based on critical component identification and fine grit classification identifies again

Also Published As

Publication number Publication date
CN110738247A (en) 2020-01-31

Similar Documents

Publication Publication Date Title
CN110738247B (en) Fine-grained image classification method based on selective sparse sampling
US11960568B2 (en) Model and method for multi-source domain adaptation by aligning partial features
CN113378632B (en) Pseudo-label optimization-based unsupervised domain adaptive pedestrian re-identification method
CN107657279B (en) Remote sensing target detection method based on small amount of samples
CN110443818B (en) Graffiti-based weak supervision semantic segmentation method and system
CN109993102B (en) Similar face retrieval method, device and storage medium
CN107633226B (en) Human body motion tracking feature processing method
CN109871875B (en) Building change detection method based on deep learning
CN106408030A (en) SAR image classification method based on middle lamella semantic attribute and convolution neural network
CN110929802A (en) Information entropy-based subdivision identification model training and image identification method and device
US20210142046A1 (en) Deep face recognition based on clustering over unlabeled face data
CN111274926B (en) Image data screening method, device, computer equipment and storage medium
CN110728694A (en) Long-term visual target tracking method based on continuous learning
CN112528022A (en) Method for extracting characteristic words corresponding to theme categories and identifying text theme categories
CN110751027A (en) Pedestrian re-identification method based on deep multi-instance learning
CN113255573A (en) Pedestrian re-identification method based on mixed cluster center label learning and storage medium
CN115311449A (en) Weak supervision image target positioning analysis system based on class reactivation mapping chart
Jiang et al. Dynamic proposal sampling for weakly supervised object detection
Zhang et al. An efficient class-constrained DBSCAN approach for large-scale point cloud clustering
CN117371511A (en) Training method, device, equipment and storage medium for image classification model
CN110287970B (en) Weak supervision object positioning method based on CAM and covering
CN114445716B (en) Key point detection method, key point detection device, computer device, medium, and program product
US20220319002A1 (en) Tumor cell isolines
CN115661542A (en) Small sample target detection method based on feature relation migration
CN110502669A (en) The unsupervised chart dendrography learning method of lightweight and device based on the side N DFS subgraph

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant