CN110738247A - fine-grained image classification method based on selective sparse sampling - Google Patents

fine-grained image classification method based on selective sparse sampling

Info

Publication number
CN110738247A
Authority
CN
China
Prior art keywords
class
classification
response
image
peak
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910942790.8A
Other languages
Chinese (zh)
Other versions
CN110738247B (en)
Inventor
焦建彬
丁瑶
叶齐祥
韩振军
万方
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Chinese Academy of Sciences
Original Assignee
University of Chinese Academy of Sciences
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Chinese Academy of Sciences filed Critical University of Chinese Academy of Sciences
Priority to CN201910942790.8A priority Critical patent/CN110738247B/en
Publication of CN110738247A publication Critical patent/CN110738247A/en
Application granted granted Critical
Publication of CN110738247B publication Critical patent/CN110738247B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/24 - Classification techniques
    • G06F18/241 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/25 - Fusion techniques
    • G06F18/253 - Fusion techniques of extracted features

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a fine-grained image classification method based on selective sparse sampling. A classification network locates important parts by extracting class response maps from images, locating as comprehensively as possible the key parts on the target that are effective for classification; the learned groups of key parts are then locally amplified by sparse resampling; features are extracted from the locally amplified images; and a classifier determines the image category by combining these features with the original image features.

Description

fine-grained image classification method based on selective sparse sampling
Technical Field
The invention relates to the fields of computer vision and image processing, and in particular to a fine-grained image classification method based on selective sparse sampling, which can be applied in areas such as cultural relic protection and medical imaging.
Background
The task of fine-grained image classification is an important problem in the field of computer vision and has significant application value in fields such as animal and plant protection and medical image analysis. Traditional fine-grained image classification models usually require accurate annotation of the positions of targets, or even of parts on the targets, in the images. Although such methods can learn discriminative target information from the large amount of annotation, they place very high demands on the collection and production of datasets: accurately annotating targets in an image dataset is time-consuming and labor-intensive, especially when the dataset is large, which greatly limits the application of such algorithms to large-scale fine-grained image datasets.
To reduce manual annotation and supervision during modeling, fine-grained image classification frameworks based only on image category labels have been proposed. Such a framework requires only the category label of the image; other forms of annotation, such as bounding boxes, are not required. This annotation mode greatly reduces the annotation workload, and massive internet image resources can be used directly to collect large-scale datasets. However, in the training of current fine-grained image classification algorithms based only on image labels, the lack of precise part position information leads to greater randomness in part localization, which affects the stability and precision of the algorithms and places higher demands on their fine feature learning ability.
Existing fine-grained image classification methods mainly fall into three types: 1. methods based on feature learning, typically represented by bilinear models built on a classification network; 2. fine feature learning models based on localizing discriminative parts, which mostly use weakly supervised object detection to localize the discriminative parts, crop those parts from the original image according to the localization results, extract their features, and complete feature learning by combining them with the features of the original image; 3. attention-mechanism methods, which first localize the most discriminative parts through iterative learning and then fuse the intermediate outputs of the iterative process, i.e., the features of the parts at different scales. These methods are increasingly optimized and achieve state-of-the-art performance.
However, these methods have disadvantages. The first type is rather general and is not tailored to the characteristic of fine-grained classification tasks that the differences between categories are slight. In the second type, the part localization process based on image labels is complex and time-consuming, and the number of parts must be specified manually, so the method is not adaptive to the image content; moreover, because parts are extracted by cropping, a large amount of useful information is lost when part localization is inaccurate. The third type adopts iterative learning, which easily causes error accumulation.
Disclosure of Invention
To overcome these problems, the inventors carried out intensive research and propose a fine-grained image classification method based on selective sparse sampling. Localization of discriminative parts is realized by exploiting the rich semantic information of the class response map (class activation map) of a classification network, which improves the efficiency and flexibility of the model; the discriminative parts are then learned at a larger scale through local amplification, avoiding information loss.
The invention aims to provide the following technical scheme:
The invention provides a fine-grained image classification method based on selective sparse sampling, which comprises a process of training a classification model for target classification, wherein the training process of the classification model comprises the following steps:
Step 1, key part localization: inputting an image into a classification network, outputting the corresponding class response map, and extracting the class peak responses on the class response map;
Step 2, class peak response grouping: grouping the class peak responses obtained in step 1 according to response strength into a discriminative attention group and a complementary attention group; each class peak response generates an attention map, so the two groups of class peak responses generate two groups of attention maps;
Step 3, resampling: aggregating the attention maps within each of the two groups to generate two saliency maps, resampling the image under the guidance of the saliency maps to realize local amplification of the corresponding key parts, and obtaining two resampled images;
Step 4, completing the construction of the feature fusion and classification model: inputting the resampled images obtained in step 3 into the classification network of step 1 to extract features, combining the features of the original image and the resampled images, and classifying with a classifier to obtain the classification model.
The fine-grained image classification method based on selective sparse sampling provided by the invention has the following beneficial effects:
(1) the method learns from image category labels alone, without requiring strong annotation data (target bounding boxes or part annotations) in the relevant scenes; it uses the rich semantics of class peak responses on the class response map to quickly localize key parts, remarkably improving feasibility and practicality;
(2) the method groups the class peak responses according to their response values, which prevents strong responses from dominating the learning process, allows parts corresponding to weak responses to be learned, and improves the robustness of the features;
(3) the method realizes local amplification of key parts by image resampling, so that important details in the image are enhanced while background information is retained, avoiding information loss;
(4) in addition, the resampled images and the original image share a feature extraction network, realizing a special form of data augmentation that helps improve the generalization of the model;
(5) consequently, discriminative part localization and feature learning mutually reinforce each other, forming a special closed-loop iterative learning mode.
Drawings
FIG. 1 is a schematic diagram of a model structure of a selective sparse sampling-based fine-grained image classification method;
FIG. 2 shows examples and the distribution of the number of class peak responses located by the model;
FIG. 3 shows a schematic of selective sparse sampling during model training;
FIG. 4 is a diagram illustrating the result of target classification in the proposed method;
fig. 5 shows a diagram of the target location result of the method proposed by the present invention.
Detailed Description
The present invention is further illustrated by the following detailed description and figures; the features and advantages of the present invention will become more apparent from the description.
As shown in fig. 1, the present invention provides a fine-grained image classification method based on selective sparse sampling. The method includes a process of training a classification model for object classification; the training process of the classification model comprises the following steps:
Step 1, key part localization: inputting an image into a classification network, outputting the corresponding class response map, and extracting the class peak responses on the class response map;
wherein the class peak responses correspond to key parts in the image, a key part being a distinctive region for classification;
the class peak response is preferably a local maximum on the class response map;
Step 2, class peak response grouping: grouping the class peak responses obtained in step 1 according to response strength into a discriminative attention group and a complementary attention group; each class peak response generates an attention map, so the two groups of class peak responses generate two groups of attention maps;
Step 3, resampling: aggregating the attention maps within each of the two groups to generate two saliency maps, resampling the image under the guidance of the saliency maps to realize local amplification of the corresponding key parts, and obtaining two resampled images;
Step 4: inputting the resampled images obtained in step 3 into the classification network of step 1 to extract features, combining the features of the original image and the resampled images, and classifying with a classifier to obtain the classification model.
Step 1 of the invention: key part localization. The key part localization algorithm of the method is based on the rich semantic information of the class response map and integrates the part features corresponding to the class peak response points on the class response map, aiming to find the parts, and their positions, that are important for classification judgment. Compared with part localization under a weakly supervised object detection framework, the method omits the search and screening of important parts, so classification can be more efficient.
In a preferred embodiment of the present invention, in step 1, the picture is given only an image label, without annotating the whole target or the positions of its parts; the rich semantics of class peak responses on the class response map are used to realize fast localization of the key parts.
In a preferred embodiment of the invention, step 1 comprises the following substeps:
Step 1.1: inputting the image into a classification network and computing the class response map;
the classification network is preferably a convolutional neural network, and may be selected from any of AlexNet, ResNet, VGGNet, google lenet, and the like.
Define a fine-grained image classification dataset, where C denotes the number of classes and N the number of samples; the training set contains N_train samples and the test set contains N_test samples. Given an image I in the training set, input it into the classification network and extract the feature map set S ∈ R^{D×H×W} output by the deepest convolutional layer, where D is the number of feature channels and H and W are the height and width of the feature maps, respectively. After passing through a global average pooling layer, S is fed into a fully connected layer FC to obtain the network's prediction score s ∈ R^C for each class of the image. Denote the weights of the fully connected layer FC as W_fc ∈ R^{D×C}; each class c and each feature map S_d corresponds to one value w_{d,c} in W_fc.
Then the class response map M_c corresponding to class c in the class response map M is computed as in equation (1):

M_c = Σ_{d=1}^{D} w_{d,c} · S_d    (1)
This formula defines the relationship between the features learned by the network and the image categories, and helps to intuitively understand which regions contribute to category judgment.
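As an illustrative sketch (not the patent's own code), equation (1) can be computed in a few lines; PyTorch is assumed here, and the function name is invented for the example:

```python
import torch

def class_response_maps(S: torch.Tensor, W_fc: torch.Tensor) -> torch.Tensor:
    """Compute class response maps M per equation (1).

    S:    feature maps of the deepest conv layer, shape (D, H, W)
    W_fc: weights of the fully connected classifier, shape (D, C)
    Returns M of shape (C, H, W), where M[c] = sum_d W_fc[d, c] * S[d].
    """
    # Weight each feature channel by the classifier weights of class c
    # and sum over channels, for all classes at once.
    return torch.einsum('dhw,dc->chw', S, W_fc)
```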
Step 1.2: from the classification result s obtained in step 1.1, compute the set P of the network's predicted probabilities of the image for each class, sort them in descending order, and select the top-five prediction probabilities {p_1, p_2, …, p_5}. Compute their entropy as shown in equation (2):

E = −Σ_{i=1}^{5} p_i log p_i    (2)
As can be seen from the classification results of multiple classification networks on datasets such as CUB_200_2011, Stanford Cars and FGVC-Aircraft, the top-5 accuracy of the network output predictions can reach 99.9 percent, i.e., the correct class is almost always among the network's top five predictions, so selecting the top-five prediction probabilities is sufficient.
Step 1.3: taking into account both the precision and the recall of class peak response localization, compute the class response map used for extracting class peak responses according to equation (3):

R_o = M_{c_1} if E ≤ δ; otherwise R_o = (1/5) Σ_{i=1}^{5} M_{c_i}    (3)

where M_{c_i} is the class response map corresponding to the class with the i-th highest prediction probability, and δ is a threshold, selected as 0.2 based on controlled experiments.
The rule defined by equations (2) and (3) is: when the top-1 probability predicted by the classification network is high, i.e., the prediction is more credible, only the class response map corresponding to the top-1 class is selected, so that the extracted class peak responses contain little noise; when the top-1 probability is low, i.e., the sample is harder to predict and the prediction is less reliable, the class response maps corresponding to the top-5 classes are selected, ensuring the recall of the class peak responses.
To avoid numerical difficulties caused by the magnitude of the variables, max-min normalization is performed on the class response map R_o, as in equation (4):

R = (R_o − min(R_o)) / (max(R_o) − min(R_o))    (4)

where R is the normalized class response map, R_o is the class response map obtained from equation (3), min(R_o) is the minimum value of the class response map, and max(R_o) is its maximum value.
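A non-authoritative sketch of the selection and normalization rules of equations (2)-(4) follows; the top-5 averaging in the uncertain branch reflects the reconstruction of equation (3) above, and `delta` corresponds to the threshold δ:

```python
import torch
import torch.nn.functional as F

def select_response_map(M: torch.Tensor, s: torch.Tensor, delta: float = 0.2):
    """Select and normalize the class response map per equations (2)-(4).

    M: class response maps, shape (C, H, W)
    s: classification scores, shape (C,)
    """
    p = F.softmax(s, dim=0)
    top_p, top_idx = torch.topk(p, k=5)           # top-five prediction probabilities
    entropy = -(top_p * torch.log(top_p + 1e-12)).sum()
    if entropy <= delta:                          # confident: top-1 map only
        R_o = M[top_idx[0]]
    else:                                         # uncertain: combine top-5 maps
        R_o = M[top_idx].mean(dim=0)
    # max-min normalization, equation (4)
    return (R_o - R_o.min()) / (R_o.max() - R_o.min() + 1e-12)
```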
Step 1.4: extract local maxima from the class response map R within a window of set size to obtain the set of class peak response positions T = {(x_1, y_1), (x_2, y_2), …, (x_n, y_n)}, where n is the number of class peak responses.
In preferred embodiments, the window size is 3×3, 5×5, or 7×7; preferably, the window size is 3×3.
The number and positions of the class peak responses extracted by this process adapt to the image content and are not fixed; their distribution over several fine-grained image classification datasets is shown in FIG. 2. The proposed framework is therefore more flexible and can be applied to different domains, such as birds, airplanes and cars, without adjusting hyper-parameters for each specific task.
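Local maxima in a small window can be found with a max-pooling comparison, as sketched below (an illustration under the same PyTorch assumption; `min_value` is an invented guard against flat background, not a parameter from the patent):

```python
import torch
import torch.nn.functional as F

def extract_class_peaks(R: torch.Tensor, window: int = 3, min_value: float = 0.0):
    """Extract local maxima of the normalized response map R (step 1.4).

    R: normalized class response map, shape (H, W)
    Returns a list of (x, y) peak positions and their response values.
    """
    pad = window // 2
    pooled = F.max_pool2d(R[None, None], window, stride=1, padding=pad)[0, 0]
    # a point is a peak iff it equals the maximum of its neighborhood
    peak_mask = (R == pooled) & (R > min_value)
    ys, xs = torch.nonzero(peak_mask, as_tuple=True)
    return [(int(x), int(y)) for x, y in zip(xs, ys)], R[ys, xs]
```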
Step 2 of the invention: class peak response grouping. The class peak responses are grouped according to their response values, which prevents strong responses from dominating the learning process, allows parts corresponding to weak responses to be learned, and improves the robustness of the features. In an embodiment, step 2 comprises the following sub-steps:
Step 2.1: divide the class peak responses of step 1 into two sets T_d and T_c according to equations (5) and (6):

T_d = {(x, y) | (x, y) ∈ T, R_{x,y} ≥ ζ}    (5)
T_c = {(x, y) | (x, y) ∈ T, R_{x,y} < ζ}    (6)
where R_{x,y} is the response value of the class peak response at (x, y), and ζ is the division threshold; ζ may be chosen as a random number uniformly distributed on (0, 1), as the median of all class peak responses, etc. T_d is the discriminative class peak response set, i.e., the parts decisive for class judgment; T_c is the complementary class peak response set, i.e., the parts that play a complementary role in class judgment.
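Grouping then reduces to a threshold split; below is a sketch using the median as ζ, one of the choices mentioned above:

```python
import torch

def group_peaks(peaks, values: torch.Tensor):
    """Split class peak responses into discriminative (T_d) and complementary
    (T_c) sets per equations (5) and (6), using the median as zeta."""
    zeta = values.median()
    T_d = [p for p, v in zip(peaks, values) if v >= zeta]
    T_c = [p for p, v in zip(peaks, values) if v < zeta]
    return T_d, T_c
```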
Step 2.2: for each class peak response, compute a corresponding attention map using a Gaussian kernel function as shown in equation (7); the two groups of class peak responses generate two groups of attention maps:

A_i(x, y) = R_{x_i,y_i} · β_1 exp(−((x − x_i)² + (y − y_i)²) / (2β_2²))    (7)

where R_{x_i,y_i} is the response value of the class peak response (x_i, y_i), and β_1 and β_2 are learnable parameters. The meaning of the formula is: the stronger the response value, the more the corresponding region is amplified.
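A sketch of the attention-map computation, under the Gaussian form reconstructed in equation (7); in a real model β_1 and β_2 would be `nn.Parameter`s so the classification loss can optimize them:

```python
import torch

def peak_attention_map(shape, peak, value, beta1, beta2):
    """Attention map of one class peak response (equation (7), as reconstructed):
    a Gaussian centered at the peak, scaled by its response value so that
    stronger responses are amplified more.

    shape: (H, W); peak: (x_i, y_i); value: response value R_{x_i, y_i}
    beta1, beta2: learnable scalars (amplitude and width)
    """
    H, W = shape
    ys = torch.arange(H).view(H, 1).float()
    xs = torch.arange(W).view(1, W).float()
    x_i, y_i = peak
    dist2 = (xs - x_i) ** 2 + (ys - y_i) ** 2
    return value * beta1 * torch.exp(-dist2 / (2 * beta2 ** 2))
```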
Step 3 of the invention: resampling. Local amplification of the key parts is realized by image resampling, so that important details in the image are enhanced while background information is retained, avoiding information loss.
In a preferred embodiment of the present invention, step 3 comprises the following sub-steps:
Step 3.1: using the attention maps obtained in step 2, sum the attention maps within each group to obtain the saliency maps Q_d and Q_c that guide resampling, computed as in equations (8) and (9):

Q_d = Σ_i A_i, if (x_i, y_i) ∈ T_d    (8)
Q_c = Σ_i A_i, if (x_i, y_i) ∈ T_c    (9)

where Q_d denotes the discriminative-branch saliency map and Q_c denotes the complementary-branch saliency map.
Step 3.2: the two sets of saliency maps calculated from step 3.1 then guide the resampling of the original image.
The image I is regarded as a grid consisting of a point set V and an edge set E (the set of connecting lines between adjacent points in V), where V = [(x_0, y_0), (x_1, y_1), …, (x_end, y_end)] and (x_i, y_i) are pixel coordinates. The goal of image resampling is to find a new coordinate point set V′ = [(x′_0, y′_0), (x′_1, y′_1), …, (x′_end, y′_end)] such that, in the new coordinate system, important regions of the original image are uniformly sampled while unimportant regions allow some degree of compression. That is, we seek a mapping between the original image and the resampled image, consisting of two mapping functions f(x, y) and g(x, y), with the resampled image given by I_new(x, y) = I(f(x, y), g(x, y)).
f(x, y) and g(x, y) distribute the saliency map computed from the original image uniformly into the resampled image. This condition can be formalized as requiring the saliency mass per unit of resampled area to be constant:

Q(f(x, y), g(x, y)) · |det J(x, y)| = const

where J(x, y) is the Jacobian of the mapping (f, g).
Approximate solutions of this equation are given by equations (10) and (11):

f(x, y) = Σ_{x′,y′} Q(x′, y′) k((x′, y′), (x, y)) x′ / Σ_{x′,y′} Q(x′, y′) k((x′, y′), (x, y))    (10)

g(x, y) = Σ_{x′,y′} Q(x′, y′) k((x′, y′), (x, y)) y′ / Σ_{x′,y′} Q(x′, y′) k((x′, y′), (x, y))    (11)
where k((x′, y′), (x, y)) is a Gaussian kernel function acting as a regularization term to avoid degenerate solutions, such as all pixels converging to the same value. Substituting the saliency maps Q_d and Q_c computed by equations (8) and (9) into equations (10) and (11) yields the two resampled images. The image corresponding to Q_d is named the discriminative-branch resampled image; it highlights the regions that are decisive for classification. The image corresponding to Q_c is named the complementary-branch resampled image; it enlarges regions that provide supplementary evidence for classification and can stimulate the model to learn more supporting evidence.
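A sketch of saliency-guided resampling per equations (10) and (11); it assumes a low-resolution saliency map so that the dense kernel matrix fits in memory, the kernel width `sigma` is an illustrative choice, and `F.grid_sample` realizes I_new(x, y) = I(f(x, y), g(x, y)):

```python
import torch
import torch.nn.functional as F

def saliency_resample(image: torch.Tensor, Q: torch.Tensor, sigma: float = 0.3):
    """Saliency-guided resampling (equations (10) and (11)).

    image: (1, 3, H, W);  Q: saliency map, (h, w), assumed low resolution
    """
    h, w = Q.shape
    # normalized grid coordinates of the saliency map, in [-1, 1]
    ys = torch.linspace(-1, 1, h)
    xs = torch.linspace(-1, 1, w)
    gy, gx = torch.meshgrid(ys, xs, indexing='ij')
    coords = torch.stack([gx.reshape(-1), gy.reshape(-1)], dim=1)   # (h*w, 2)
    q = Q.reshape(-1)
    # kernel-weighted average: f(x, y) and g(x, y) for every target grid point
    d2 = ((coords[:, None, :] - coords[None, :, :]) ** 2).sum(-1)   # (h*w, h*w)
    k = torch.exp(-d2 / (2 * sigma ** 2))
    weight = k * q[None, :]                                         # Q(x',y') k(.,.)
    grid = (weight @ coords) / weight.sum(dim=1, keepdim=True)
    grid = grid.view(1, h, w, 2)
    # sample the original image at the computed (f, g) coordinates
    return F.grid_sample(image, grid, align_corners=True)
```

Because every operation above is differentiable, gradients can flow back into the saliency map, which is consistent with the end-to-end training described below.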
As shown in fig. 3, the selective sparse sampling provided in the method prevents strong features from dominating the gradient learning process, thereby promoting more comprehensive feature expression in the network's learning.
The whole resampling process is realized through convolution operations, can be embedded into any neural network, and supports end-to-end learning and training, so the classification loss computed on the resampled images can optimize the parameters β_1 and β_2.
In step 3, local amplification of the key parts is realized by image resampling, so that important details in the image are enhanced while background information is retained, avoiding information loss.
Step 4 of the invention: completing feature fusion and classification model construction. The features of the resampled images and the features of the original image are integrated (the resampled images are input into the step-1 classification network to extract features, which are concatenated with the original-image features extracted in step 1 to generate a new feature description of the image), realizing the fusion of the global and local features of the image.
Two resampled images are obtained from the input image through steps 1, 2 and 3. They are input into the classification network used in step 1 to extract features. To aggregate global and local features, the joint image feature is defined as F_J = {F_o, F_d, F_c}, where F_o, F_d and F_c are the features of the original image, the discriminative-branch resampled image and the complementary-branch resampled image, respectively. The features are concatenated and fed into a fully connected layer with softmax to obtain the image classification result. Since the resampled images and the original image share the feature extraction network, a special form of data augmentation is realized, which helps improve the generalization of the model.
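The fusion step can be sketched as a module with a shared backbone (illustrative names; the backbone is assumed to end in global average pooling so that it returns a feature vector):

```python
import torch
import torch.nn as nn

class S3NHead(nn.Module):
    """Sketch of step 4: features of the original image and the two resampled
    images, extracted by one shared backbone, are concatenated and classified."""

    def __init__(self, backbone: nn.Module, feat_dim: int, num_classes: int):
        super().__init__()
        self.backbone = backbone                        # shared feature extractor
        self.fc = nn.Linear(3 * feat_dim, num_classes)  # joint classifier

    def forward(self, img_o, img_d, img_c):
        F_o = self.backbone(img_o)   # original-image features
        F_d = self.backbone(img_d)   # discriminative-branch features
        F_c = self.backbone(img_c)   # complementary-branch features
        F_j = torch.cat([F_o, F_d, F_c], dim=1)
        return self.fc(F_j)          # softmax is folded into the loss
```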
In a preferred embodiment, the fine-grained image classification method based on selective sparse sampling further includes a model optimization process, comprising the following steps:
Step 4.1: design a cross-entropy loss function, compute the gradient of the classification network according to the loss function, back-propagate the gradient through the whole classification network, and update the network parameters;
The model classification cross-entropy loss function is defined as in equation (12):

L_cls = Σ_i L_CE(Y_i, Y*) + L_CE(Y_j, Y*)    (12)

where L_cls denotes the total cross-entropy loss, L_CE the cross-entropy, Y_i the prediction vectors corresponding to the original image and the resampled images, Y_j the prediction vector corresponding to the joint feature, and Y* the image label.
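A sketch of the loss of equation (12) as reconstructed above, with one cross-entropy term per branch prediction plus one for the joint feature:

```python
import torch.nn.functional as F

def s3n_loss(logits_o, logits_d, logits_c, logits_joint, target):
    """Total classification loss (equation (12), as reconstructed)."""
    branch_loss = sum(F.cross_entropy(y, target)
                      for y in (logits_o, logits_d, logits_c))
    return branch_loss + F.cross_entropy(logits_joint, target)
```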
Step 4.2: judge, from the classification error computed by the cross-entropy loss function, whether the network has converged (i.e., the error no longer decreases) or whether the maximum number of iterations has been reached; if so, stop network training; otherwise, jump to step 1.
The unknown images in the test set are input into the trained model to obtain the target classification results, as shown in fig. 4. It can be seen that, compared with a general classification network, the method of the present invention improves classification performance by activating more regions.
The method can improve the accuracy of fine-grained image classification and the target localization capability. When used for target localization, the invention comprises the following steps:
step 1: class response map M for computing top-1 corresponding to original image, discriminant branch and complementary branchO,MD,MC
Step 2: class response map M corresponding to discriminant branchDClass response graph M corresponding to complementary branchesCMapping to an original class response map M by a corresponding inverse transformationOSpace, then MO,MD,MCThe three are added to generate a final class response graph Mf inal
The inverse transformation is a transformation for restoring the locally enlarged image to the original image.
And step 3: the final class response graph Mf inalAnd upsampling to the size of the original image, dividing the upsampled image by using the average value, and selecting the minimum bounding box of the maximum connected domain as a positioning result of the target.
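A sketch of this post-processing step, using NumPy and SciPy for upsampling and connected-component analysis (an implementation choice for illustration, not mandated by the invention):

```python
import numpy as np
from scipy import ndimage

def localize(M_final: np.ndarray, img_h: int, img_w: int):
    """Step 3 of localization: upsample M_final to image size, threshold by its
    mean, and return the bounding box of the largest connected component."""
    zoom = (img_h / M_final.shape[0], img_w / M_final.shape[1])
    up = ndimage.zoom(M_final, zoom, order=1)        # bilinear upsampling
    mask = up > up.mean()                            # threshold by the mean value
    labels, num = ndimage.label(mask)
    if num == 0:
        return None
    sizes = ndimage.sum(mask, labels, range(1, num + 1))
    largest = 1 + int(np.argmax(sizes))              # largest connected component
    ys, xs = np.where(labels == largest)
    return xs.min(), ys.min(), xs.max(), ys.max()    # minimal bounding box
```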
The localization results of the method of the invention are shown in fig. 5, which shows that, compared with the baseline method, the method localizes more accurately and comprehensively, and significantly alleviates the information loss caused by overfitting in the baseline model.
Examples
Example 1
1. Database and sample classification
The method is used for fine-grained image classification. For accuracy and comparability of the experiments, widely used public datasets in the field of fine-grained image classification are adopted, namely the CUB-200-2011, Stanford Cars and FGVC-Aircraft datasets. The CUB-200-2011 dataset is a bird dataset with 11788 images in total, covering 200 species; the dataset divides the whole image set into a training part and a testing part, containing 5994 and 5794 images, respectively. The Stanford Cars dataset is a car dataset with 16185 images in total, covering 196 car models; the images used for training and testing number 8144 and 8041, respectively. The FGVC-Aircraft dataset is an aircraft dataset with 10000 images in total; 6667 images are used for training and validation, and 3333 images are used for testing.
Constructing the model: with the method of the invention, a 50-layer residual convolutional neural network (ResNet-50) is used as the feature extractor. The model is trained for 60 epochs using stochastic gradient descent with momentum, with a batch size of 16. The weight decay is set to 1e-4 and the momentum to 0.9. For parameters initialized from the model pre-trained on ImageNet, an initial learning rate of 0.001 is used; for the other parameters, an initial learning rate of 0.01 is used. Input images are all resized to 448 × 448 pixels.
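The optimizer configuration above can be sketched as follows (PyTorch assumed; how the two parameter groups are separated is left to the caller):

```python
import torch

def build_optimizer(pretrained_params, new_params):
    """SGD configuration described above: momentum 0.9, weight decay 1e-4,
    lr 0.001 for ImageNet-initialized parameters and 0.01 for the rest."""
    return torch.optim.SGD(
        [{'params': pretrained_params, 'lr': 0.001},
         {'params': new_params, 'lr': 0.01}],
        momentum=0.9, weight_decay=1e-4)
```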
2. Performance evaluation criteria
2.1 Classification Performance evaluation criterion
To evaluate the performance of the algorithm and compare with other methods, an evaluation method widely used in image classification is selected, namely top-1 classification accuracy. For a single image, the class corresponding to the maximum value of the predicted probability vector is taken as the prediction result; if the predicted class is consistent with the annotated class of the image, the prediction is correct. For the whole dataset, the proportion of correctly predicted images in the dataset is the top-1 classification accuracy of the dataset.
In addition, to evaluate the localization performance of the algorithm framework, the annotated bounding boxes of the dataset are used in the evaluation.
Evaluating the localization performance of the framework: the method of the invention obtains a predicted bounding box of the target; if the IoU between this box and the annotated box of the target in the original image is greater than 0.5, the localization is considered correct, otherwise incorrect. For each category, the percentage of correctly localized images among all images is computed as the performance evaluation result for box localization.
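For reference, a minimal sketch of the IoU computation underlying this criterion, with boxes given as (x1, y1, x2, y2) corner coordinates:

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes; a localization is counted correct
    when the IoU with the annotated box exceeds 0.5."""
    x1 = max(box_a[0], box_b[0]); y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2]); y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)
```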
3. Results and analysis of the experiments
The quality of the located class peak responses in the model is evaluated, and the effectiveness of the discriminative branch, the complementary branch and the sparse attention module is verified respectively.
3.1) quality of class Peak response
A class peak response point is considered accurately localized when it falls inside the annotated box of the target. The localization accuracy of a single image is the proportion of class peak response points falling inside the annotated box among all detected class peak response points. The accuracy on the whole dataset is the average accuracy over all images.
Table 1. Accuracy (%) of class peak responses in localizing parts, verified on each dataset

Dataset                  CUB      Aircraft   Cars
Localization accuracy    94.63    97.22      98.76
As can be seen from table 1, the class peak responses localize the parts on the target very well.
3.2) Effectiveness of each branch
The effect of each branch, including the original branch (O-branch), the discriminative branch (D-branch) and the complementary branch (C-branch), is verified; the experiments are performed on CUB-200-2011. The effect on the algorithm's classification and box localization performance after removing different branches is measured, with results shown in table 2.
Table 2. Classification and box localization results (%) of each branch, verified on CUB-200-2011

Setting      Localization   O-branch   D-branch   C-branch   Overall
S3N O        57.7           86.0       -          -          86.0
S3N O+D      59.2           87.0       86.5       -          87.6
S3N O+C      56.6           86.8       -          85.3       87.3
S3N D+C      62.6           -          87.1       85.6       87.5
S3N O+D+C    65.2           87.9       86.7       85.1       88.5
The second column of table 2 is the box localization performance; evaluation follows the box localization criteria above. The third to sixth columns are top-1 classification performance; evaluation follows the classification criteria above.
As can be seen from table 2, both the discriminative branch (D-branch) and the complementary branch (C-branch) can improve the classification performance of the model, confirming that both facilitate the learning of fine features. Second, the discriminative branch improves classification performance more than the complementary branch, which supports the view that the discriminative branch focuses on parts with a decisive influence on classification, while the complementary branch focuses on parts that provide weaker supporting evidence. The classification performance of the model is optimal when the original-image branch (O-branch), the discriminative branch and the complementary branch all exist, showing that the features learned by the three branches are complementary. The presence of the discriminative and complementary branches also improves the classification performance of the original-image branch, which shows that weight sharing in the backbone network realizes a special form of data augmentation and improves the generalization of the network.
3.3) Effect of sparse attention
The effectiveness of the proposed sparse attention module for the selective sampling problem is verified; the influence of several different attention mechanisms on classification performance is measured on the CUB-200-2011 dataset, with results shown in table 3.
Table 3. Classification accuracy (%) of several different attention mechanisms

Attention mechanism                  Top-1 accuracy   Notes
Saliency-based attention             85.9             Class-independent
Class-response-map-based attention   87.8             Class-dependent
Sparse attention                     88.5             Part-dependent
Among these, saliency-based attention is set forth in the literature "Recasens A, Kellnhofer P, Stent S, et al. Learning to Zoom: a Saliency-Based Sampling Layer for Neural Networks. ECCV 2018."
Attention based on class response maps is presented in the literature "Zhou B, Khosla A, Lapedriza A, et al. Learning Deep Features for Discriminative Localization. CVPR 2016."
As can be seen from table 3, as the relevance between attention and categories decreases, classification and localization performance drops significantly, indicating that saliency features from the bottom layers of the network cannot guide feature learning from the perspective of high-level semantics. Second, compared with the class-response-map-based attention mechanism, the sparse attention mechanism can capture finer visual cues while discarding regions irrelevant, or even harmful, to the classification decision. Compared with the prior art, the proposed algorithm can well localize the fine parts useful for fine-grained classification judgment, thereby improving classification performance.
3.4) comparison of Fine-grained image classification learning method
The method is compared with existing fine-grained image classification learning methods: feature-learning methods (B-CNN, Low-rank B-CNN, Boosted-CNN, HIHCA and DFL-CNN), attention-mechanism methods (RA-CNN, MA-CNN and DT-RAM) and NTS (a method based on a weakly supervised object detection framework). The CUB-200-2011, Stanford Cars and FGVC-Aircraft datasets are adopted; the evaluation criteria for image classification performance are the same as above.
B-CNN is described in the literature "Tsung-Yu Lin, Aruni RoyChowdhury, and Subhransu Maji. Bilinear CNN models for fine-grained visual recognition. International Conference on Computer Vision, pages 1449-1457, 2015.";
Low-rank B-CNN is proposed in the literature "Shu Kong and Charless C. Fowlkes. Low-rank bilinear pooling for fine-grained classification. Computer Vision and Pattern Recognition, 2017.";
HIHCA is set forth in the literature "Sijia Cai, Wangmeng Zuo, and Lei Zhang. Higher-order integration of hierarchical convolutional activations for fine-grained visual categorization. In IEEE International Conference on Computer Vision, 2017.";
Boosted-CNN is set forth in the literature "Mohammad Moghimi, Serge J. Belongie, Mohammad J. Saberian, Jian Yang, Nuno Vasconcelos, and Li-Jia Li. Boosted convolutional neural networks. In Proceedings of the British Machine Vision Conference, 2016.";
DFL-CNN is proposed in the literature "Yaming Wang, Vlad I. Morariu, and Larry S. Davis. Learning a discriminative filter bank within a CNN for fine-grained recognition. Computer Vision and Pattern Recognition, pages 4148-4157, 2018.";
RA-CNN is proposed in the literature "Jianlong Fu, Heliang Zheng, and Tao Mei. Look closer to see better: Recurrent attention convolutional neural network for fine-grained image recognition. Computer Vision and Pattern Recognition, 2017.";
MA-CNN is set forth in the literature "Heliang Zheng, Jianlong Fu, Tao Mei, and Jiebo Luo. Learning multi-attention convolutional neural network for fine-grained image recognition. In IEEE International Conference on Computer Vision, 2017.";
DT-RAM is described in the literature "Zhichao Li, Yi Yang, Xiao Liu, Feng Zhou, Shilei Wen, and Wei Xu. Dynamic computational time for visual attention. International Conference on Computer Vision, pages 1199-1209, 2017.";
NTS is proposed in the literature "Ze Yang, Tiange Luo, Dong Wang, Zhiqiang Hu, Jun Gao, and Liwei Wang. Learning to navigate for fine-grained classification. In ECCV 2018, pages 438-454."
Table 4 comparison results (%) of the fine-grained image classification method based on selective sparse sampling (S3N) and other methods on three data sets
As can be seen from table 4, the accuracy of the proposed method S3N in these tests is higher than that of existing fine-grained image classification methods using only image class labels (B-CNN, RA-CNN, MA-CNN, DPL-CNN, DFL-CNN, NTS-CNN). After selective sparse sampling is used, the proposed method mines more visual cues and performs feature learning on the discriminative parts at a larger scale, so the model can learn finer features.
3.5) influence of the selection of the threshold δ on the classification
Comparison results (%) of the selective-sparse-sampling-based fine-grained image classification method (S3N) with different thresholds on CUB-200-2011.
TABLE 5 Effect of different thresholds on classification
δ           0        0.05     0.1      0.15     0.2      0.25     0.3
top-1 (%)   88.14    88.23    88.40    88.50    88.52    88.47    88.43
As can be seen from table 5, when computing the class response map used for extracting class peak responses, the value of the threshold δ has a definite influence on the classification result; classification performance is best when δ is 0.2.
The present invention has been described above in connection with preferred embodiments, but these embodiments are merely exemplary and illustrative. Various substitutions and modifications can be made on this basis, all of which fall within the protection scope of the invention.

Claims (10)

  1. A fine-grained image classification method based on selective sparse sampling, the method comprising a process of training a classification model for target classification, the training process of the classification model comprising the following steps:
    step 1, key part localization: inputting an image into a classification network, outputting the corresponding class response map, and extracting the class peak responses on the class response map;
    step 2, class peak response grouping: grouping the class peak responses obtained in step 1 according to response strength into a discriminative attention group and a complementary attention group, each class peak response generating an attention map, the two groups of class peak responses generating two groups of attention maps;
    step 3, resampling: aggregating the attention maps within each of the two groups to generate two saliency maps, resampling the image under the guidance of the saliency maps to realize local amplification of the corresponding key parts, and obtaining two resampled images;
    step 4: inputting the resampled images obtained in step 3 into the classification network of step 1 to extract features, combining the features of the original image and the resampled images, and classifying with a classifier to obtain the classification model.
  2. The method of claim 1, wherein, in step 1, the class peak response is a local maximum on the class response map.
  3. The method according to claim 1, wherein, in step 1, the picture is given only an image label, and neither the whole target nor the positions of its parts are annotated.
  4. The method according to claim 1, wherein step 1 comprises the following sub-steps:
    step 1.1: inputting the image into a classification network and computing the class response map, defined as:

    M_c = Σ_{d=1}^{D} w_{d,c} · S_d

    where W_fc is the weight of the fully connected layer, i.e., the classifier, and each class c and each feature map S_d corresponds to one value w_{d,c} in W_fc;
    step 1.2: obtaining the classification result with the classification network of step 1.1, computing the set P of predicted probabilities of the current image for each class, selecting the top-five prediction probabilities {p_1, …, p_5}, and computing the entropy:

    E = −Σ_{i=1}^{5} p_i log p_i

    step 1.3: defining the class response map used for extracting class peak responses:

    R_o = M_{c_1} if E ≤ δ; otherwise R_o = (1/5) Σ_{i=1}^{5} M_{c_i}

    where M_{c_i} is the class response map corresponding to the class with the i-th highest prediction probability and δ is a threshold;
    step 1.4: extracting local maxima from the class response map within a window of set size to obtain the set of class peak response positions T = {(x_1, y_1), (x_2, y_2), …, (x_n, y_n)}, where n is the number of class peak responses.
  5. The method according to claim 4, wherein, in step 1.3, max-min normalization is performed on the class response map R_o, and the set T of class peak response positions is obtained based on the normalized class response map:

    R = (R_o − min(R_o)) / (max(R_o) − min(R_o))

    where R is the normalized class response map, R_o is the class response map obtained in step 1.3, min(R_o) is the minimum value of the class response map, and max(R_o) is its maximum value.
  6. The method according to claim 1, wherein step 2 comprises the following sub-steps:
    step 2.1: dividing the class peak responses of step 1 into two sets T_d and T_c as follows:
    T_d = {(x, y) | (x, y) ∈ T, R_{x,y} ≥ ζ},
    T_c = {(x, y) | (x, y) ∈ T, R_{x,y} < ζ},
    where R_{x,y} is the response value at (x, y), ζ is the division threshold, T_d is the discriminative class peak response set, and T_c is the complementary class peak response set;
    step 2.2: computing a corresponding attention map for each class peak response with a Gaussian kernel function, as follows:

    A_i(x, y) = R_{x_i,y_i} · β_1 exp(−((x − x_i)² + (y − y_i)²) / (2β_2²))

    where R_{x_i,y_i} is the response value of the class peak response (x_i, y_i), and β_1 and β_2 are learnable parameters for controlling the degree of local amplification.
  7. The method according to claim 1, wherein step 3 comprises the following sub-steps:
    step 3.1: using the attention maps obtained in step 2, summing the attention maps within each group to obtain the saliency maps Q_d and Q_c that guide resampling:
    Q_d = Σ_i A_i, if (x_i, y_i) ∈ T_d,
    Q_c = Σ_i A_i, if (x_i, y_i) ∈ T_c,
    where Q_d denotes the discriminative-branch saliency map and Q_c denotes the complementary-branch saliency map;
    step 3.2: guiding the resampling of the original image with the two saliency maps computed in step 3.1 to obtain the resampled images, computed as:

    f(x, y) = Σ_{x′,y′} Q(x′, y′) k((x′, y′), (x, y)) x′ / Σ_{x′,y′} Q(x′, y′) k((x′, y′), (x, y))

    g(x, y) = Σ_{x′,y′} Q(x′, y′) k((x′, y′), (x, y)) y′ / Σ_{x′,y′} Q(x′, y′) k((x′, y′), (x, y))

    where (x′, y′) are the coordinates of a pixel in the original image, f(x, y) is the corresponding abscissa in the sampled image, g(x, y) is the corresponding ordinate in the sampled image, k is a Gaussian kernel, and Q is Q_d or Q_c.
  8. The method of claim 7, wherein the resampling process is performed by a convolution operation.
  9. The method according to claim 1, wherein the selective sparse sampling based fine-grained image classification method further comprises a model optimization process, comprising the following steps:
    step 4.1: designing a cross-entropy loss function, computing the gradient of the classification network according to the loss function, back-propagating the gradient through the whole classification network, and updating the network parameters;
    the model classification cross-entropy loss function being defined as:

    L_cls = Σ_i L_CE(Y_i, Y*) + L_CE(Y_j, Y*)

    where L_cls denotes the total cross-entropy loss, L_CE the cross-entropy, Y_i the prediction vectors corresponding to the original image and the resampled images, Y_j the prediction vector corresponding to the joint feature, and Y* the image label;
    step 4.2: judging, from the classification error computed by the cross-entropy loss function, whether the network has converged or whether the maximum number of iterations has been reached; if so, stopping network training; otherwise, jumping to step 1.
  10. The method of claim 1, wherein the selective sparse sampling based fine-grained image classification method is further applied to target localization, comprising the following steps:
    step 1: computing the top-1 class response maps M_O, M_D and M_C corresponding to the original image, the discriminative branch and the complementary branch;
    step 2: mapping the class response map M_D of the discriminative branch and the class response map M_C of the complementary branch back into the space of the original class response map M_O through the corresponding inverse transformations, and then adding M_O, M_D and M_C to generate the final class response map M_final;
    step 3: upsampling the final class response map M_final to the size of the original image, thresholding the upsampled map with its mean value, and selecting the minimal bounding box of the largest connected component as the localization result of the target.
CN201910942790.8A 2019-09-30 2019-09-30 Fine-grained image classification method based on selective sparse sampling Active CN110738247B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910942790.8A CN110738247B (en) 2019-09-30 2019-09-30 Fine-grained image classification method based on selective sparse sampling

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910942790.8A CN110738247B (en) 2019-09-30 2019-09-30 Fine-grained image classification method based on selective sparse sampling

Publications (2)

Publication Number Publication Date
CN110738247A true CN110738247A (en) 2020-01-31
CN110738247B CN110738247B (en) 2020-08-25

Family

ID=69269842

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910942790.8A Active CN110738247B (en) 2019-09-30 2019-09-30 Fine-grained image classification method based on selective sparse sampling

Country Status (1)

Country Link
CN (1) CN110738247B (en)

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111368942A (en) * 2020-05-27 2020-07-03 深圳创新奇智科技有限公司 Commodity classification identification method and device, electronic equipment and storage medium
CN111507403A (en) * 2020-04-17 2020-08-07 腾讯科技(深圳)有限公司 Image classification method and device, computer equipment and storage medium
CN111784673A (en) * 2020-06-30 2020-10-16 创新奇智(上海)科技有限公司 Defect detection model training and defect detection method, device and storage medium
CN111816308A (en) * 2020-07-13 2020-10-23 中国医学科学院阜外医院 System for predicting coronary heart disease onset risk through facial picture analysis
CN111915618A (en) * 2020-06-02 2020-11-10 华南理工大学 Example segmentation algorithm and computing device based on peak response enhancement
CN111967527A (en) * 2020-08-21 2020-11-20 菏泽学院 Peony variety identification method and system based on artificial intelligence
CN112906810A (en) * 2021-03-08 2021-06-04 共达地创新技术(深圳)有限公司 Object detection method, electronic device, and storage medium
CN113177546A (en) * 2021-04-30 2021-07-27 中国科学技术大学 Target detection method based on sparse attention module
CN113470029A (en) * 2021-09-03 2021-10-01 北京字节跳动网络技术有限公司 Training method and device, image processing method, electronic device and storage medium
WO2022119155A1 (en) * 2020-12-02 2022-06-09 재단법인 아산사회복지재단 Apparatus and method for diagnosing explainable multiple electrocardiogram arrhythmias

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101729911A (en) * 2009-12-23 2010-06-09 宁波大学 Multi-view image color correction method based on visual perception
CN108334901A (en) * 2018-01-30 2018-07-27 福州大学 A kind of flowers image classification method of the convolutional neural networks of combination salient region
CN109284749A (en) * 2017-07-19 2019-01-29 微软技术许可有限责任公司 Refine image recognition
CN109583305A (en) * 2018-10-30 2019-04-05 南昌大学 A kind of advanced method that the vehicle based on critical component identification and fine grit classification identifies again
CN110168573A (en) * 2016-11-18 2019-08-23 易享信息技术有限公司 Spatial attention model for image labeling
CN110197202A (en) * 2019-04-30 2019-09-03 杰创智能科技股份有限公司 A kind of local feature fine granularity algorithm of target detection
KR20190109194A (en) * 2018-03-16 2019-09-25 주식회사 에이아이트릭스 Apparatus and method for learning neural network capable of modeling uncertainty

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101729911A (en) * 2009-12-23 2010-06-09 宁波大学 Multi-view image color correction method based on visual perception
CN110168573A (en) * 2016-11-18 2019-08-23 易享信息技术有限公司 Spatial attention model for image labeling
CN109284749A (en) * 2017-07-19 2019-01-29 微软技术许可有限责任公司 Refine image recognition
CN108334901A (en) * 2018-01-30 2018-07-27 福州大学 A kind of flowers image classification method of the convolutional neural networks of combination salient region
KR20190109194A (en) * 2018-03-16 2019-09-25 주식회사 에이아이트릭스 Apparatus and method for learning neural network capable of modeling uncertainty
CN109583305A (en) * 2018-10-30 2019-04-05 南昌大学 A kind of advanced method that the vehicle based on critical component identification and fine grit classification identifies again
CN110197202A (en) * 2019-04-30 2019-09-03 杰创智能科技股份有限公司 A kind of local feature fine granularity algorithm of target detection

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
ADRIÀ RECASENS ET AL.: "Learning to Zoom: a Saliency-Based Sampling Layer for Neural Networks", 《ECCV 2018》 *
BOLEI ZHOU ET AL.: "Learning Deep Features for Discriminative Localization", 《2016 IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION》 *
HELIANG ZHENG ET AL.: "Learning Multi-Attention Convolutional Neural Network for Fine-Grained Image Recognition", 《2017 IEEE INTERNATIONAL CONFERENCE ON COMPUTER VISION》 *
YANZHAO ZHOU: "Weakly Supervised Instance Segmentation using Class Peak Response", 《ARXIV》 *

Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111507403A (en) * 2020-04-17 2020-08-07 腾讯科技(深圳)有限公司 Image classification method and device, computer equipment and storage medium
CN111368942A (en) * 2020-05-27 2020-07-03 深圳创新奇智科技有限公司 Commodity classification identification method and device, electronic equipment and storage medium
CN111915618A (en) * 2020-06-02 2020-11-10 华南理工大学 Example segmentation algorithm and computing device based on peak response enhancement
CN111915618B (en) * 2020-06-02 2024-05-14 华南理工大学 Peak response enhancement-based instance segmentation algorithm and computing device
CN111784673B (en) * 2020-06-30 2023-04-18 创新奇智(上海)科技有限公司 Defect detection model training and defect detection method, device and storage medium
CN111784673A (en) * 2020-06-30 2020-10-16 创新奇智(上海)科技有限公司 Defect detection model training and defect detection method, device and storage medium
CN111816308A (en) * 2020-07-13 2020-10-23 中国医学科学院阜外医院 System for predicting coronary heart disease onset risk through facial picture analysis
CN111816308B (en) * 2020-07-13 2023-09-29 中国医学科学院阜外医院 System for predicting coronary heart disease onset risk through facial image analysis
CN111967527A (en) * 2020-08-21 2020-11-20 菏泽学院 Peony variety identification method and system based on artificial intelligence
CN111967527B (en) * 2020-08-21 2022-09-06 菏泽学院 Peony variety identification method and system based on artificial intelligence
WO2022119155A1 (en) * 2020-12-02 2022-06-09 재단법인 아산사회복지재단 Apparatus and method for diagnosing explainable multiple electrocardiogram arrhythmias
CN112906810A (en) * 2021-03-08 2021-06-04 共达地创新技术(深圳)有限公司 Object detection method, electronic device, and storage medium
CN112906810B (en) * 2021-03-08 2024-04-16 共达地创新技术(深圳)有限公司 Target detection method, electronic device, and storage medium
CN113177546A (en) * 2021-04-30 2021-07-27 中国科学技术大学 Target detection method based on sparse attention module
CN113470029A (en) * 2021-09-03 2021-10-01 北京字节跳动网络技术有限公司 Training method and device, image processing method, electronic device and storage medium
CN113470029B (en) * 2021-09-03 2021-12-03 北京字节跳动网络技术有限公司 Training method and device, image processing method, electronic device and storage medium

Also Published As

Publication number Publication date
CN110738247B (en) 2020-08-25

Similar Documents

Publication Publication Date Title
CN110738247B (en) Fine-grained image classification method based on selective sparse sampling
US11960568B2 (en) Model and method for multi-source domain adaptation by aligning partial features
CN107657279B (en) Remote sensing target detection method based on small amount of samples
CN110443818B (en) Graffiti-based weak supervision semantic segmentation method and system
CN113378632B (en) Pseudo-label optimization-based unsupervised domain adaptive pedestrian re-identification method
CN107103754B (en) Road traffic condition prediction method and system
CN107633226B (en) Human body motion tracking feature processing method
CN109978893A (en) Training method, device, equipment and the storage medium of image, semantic segmentation network
CN109871875B (en) Building change detection method based on deep learning
CN114092832B (en) High-resolution remote sensing image classification method based on parallel hybrid convolutional network
CN111027481B (en) Behavior analysis method and device based on human body key point detection
CN107256017B (en) Route planning method and system
CN106408030A (en) SAR image classification method based on middle lamella semantic attribute and convolution neural network
CN111274926B (en) Image data screening method, device, computer equipment and storage medium
CN110659601B (en) Depth full convolution network remote sensing image dense vehicle detection method based on central point
WO2022218396A1 (en) Image processing method and apparatus, and computer readable storage medium
CN110889421A (en) Target detection method and device
CN112365497A (en) High-speed target detection method and system based on Trident Net and Cascade-RCNN structures
CN112132014A (en) Target re-identification method and system based on non-supervised pyramid similarity learning
CN110728694A (en) Long-term visual target tracking method based on continuous learning
CN112528022A (en) Method for extracting characteristic words corresponding to theme categories and identifying text theme categories
CN115311449A (en) Weak supervision image target positioning analysis system based on class reactivation mapping chart
CN108805181B (en) Image classification device and method based on multi-classification model
Jiang et al. Dynamic proposal sampling for weakly supervised object detection
CN113283467A (en) Weak supervision picture classification method based on average loss and category-by-category selection

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant