CN110738247A - fine-grained image classification method based on selective sparse sampling - Google Patents

fine-grained image classification method based on selective sparse sampling

Info

Publication number
CN110738247A
Authority
CN
China
Prior art keywords
class
classification
response
image
peak
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910942790.8A
Other languages
Chinese (zh)
Other versions
CN110738247B (en)
Inventor
焦建彬
丁瑶
叶齐祥
韩振军
万方
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Chinese Academy of Sciences
Original Assignee
University of Chinese Academy of Sciences
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Chinese Academy of Sciences filed Critical University of Chinese Academy of Sciences
Priority to CN201910942790.8A priority Critical patent/CN110738247B/en
Publication of CN110738247A publication Critical patent/CN110738247A/en
Application granted granted Critical
Publication of CN110738247B publication Critical patent/CN110738247B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/24 - Classification techniques
    • G06F18/241 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/25 - Fusion techniques
    • G06F18/253 - Fusion techniques of extracted features

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a fine-grained image classification method based on selective sparse sampling. A classification network locates important parts by extracting class response maps from images, locating as comprehensively as possible the key parts on the target that are effective for classification; the learned groups of key parts are then locally amplified by sparse resampling; features are extracted from the locally amplified images; and a classifier determines the image category by combining these features with the original image features.

Description

fine-grained image classification method based on selective sparse sampling
Technical Field
The invention relates to the fields of computer vision and image processing, and in particular to a fine-grained image classification method based on selective sparse sampling, which can be applied in areas such as cultural relic protection and medical imaging.
Background
The task of fine-grained image classification is an important problem in the field of computer vision and has significant application value in fields such as animal and plant protection and medical image analysis. Traditional fine-grained image classification models usually require accurate annotation of the positions of targets, or even of parts on the targets, in the images. Although such methods can learn discriminative target information from the large amount of annotation, they place very high demands on the collection and production of datasets: accurately annotating targets in an image dataset is time-consuming and labor-intensive, especially when the dataset is large, which greatly limits the application of such algorithms to large-scale fine-grained image datasets.
To reduce manual annotation and supervision during modeling, fine-grained image classification frameworks based only on image category labels have been proposed. Such a framework requires only the category label of the image; other forms of annotation, such as bounding boxes, are not required. This annotation mode greatly reduces the annotation workload, and massive internet image resources can be used directly to collect large-scale datasets. However, in the training of current fine-grained image classification algorithms based only on image labels, the lack of precise part position information leads to greater randomness in part localization, which affects the stability and precision of the algorithms and places higher demands on their fine feature learning ability.
Existing fine-grained image classification methods mainly fall into three types: 1. methods based on feature learning, typically represented by bilinear models built on a classification network; 2. fine feature learning models based on localizing discriminative parts, which mostly use weakly supervised object detection to localize the discriminative parts, crop those parts from the original image according to the localization results, extract their features, and complete feature learning by combining them with the features of the original image; 3. attention-mechanism methods, which first localize the most discriminative parts through iterative learning and then fuse the intermediate outputs of the iterative process, i.e., the features of the parts at different scales. These methods are increasingly optimized and achieve state-of-the-art performance.
However, these methods have disadvantages. The first type is rather general and is not tailored to the characteristic of fine-grained classification tasks that the differences between categories are slight. In the second type, the part localization process based on image labels is complex and time-consuming, and the number of parts must be specified manually, so the method is not adaptive to the image content; moreover, because parts are extracted by cropping, a large amount of useful information is lost when part localization is inaccurate. The third type adopts iterative learning, which easily causes error accumulation.
Disclosure of Invention
To overcome these problems, the inventors carried out intensive research and propose a fine-grained image classification method based on selective sparse sampling. Localization of discriminative parts is realized by exploiting the rich semantic information of the class response map (class activation map) of a classification network, which improves the efficiency and flexibility of the model; the discriminative parts are then learned at a larger scale through local amplification, avoiding information loss.
The invention aims to provide the following technical scheme:
The invention provides a fine-grained image classification method based on selective sparse sampling, which comprises a process of training a classification model for target classification, wherein the training process of the classification model comprises the following steps:
Step 1, key part localization: inputting an image into a classification network, outputting the corresponding class response map, and extracting the class peak responses on the class response map;
Step 2, class peak response grouping: grouping the class peak responses obtained in step 1 according to response strength into a discriminative attention group and a complementary attention group; each class peak response generates an attention map, so the two groups of class peak responses generate two groups of attention maps;
Step 3, resampling: aggregating the attention maps within each of the two groups to generate two saliency maps, resampling the image under the guidance of the saliency maps to realize local amplification of the corresponding key parts, and obtaining two resampled images;
Step 4, completing the construction of the feature fusion and classification model: inputting the resampled images obtained in step 3 into the classification network of step 1 to extract features, combining the features of the original image and the resampled images, and classifying with a classifier to obtain the classification model.
The fine-grained image classification method based on selective sparse sampling provided by the invention has the following beneficial effects:
(1) the method learns from image category labels alone, without requiring strong annotation data (target bounding boxes or part annotations) in the relevant scenes; it uses the rich semantics of class peak responses on the class response map to quickly localize key parts, remarkably improving feasibility and practicality;
(2) the method groups the class peak responses according to their response values, which prevents strong responses from dominating the learning process, allows parts corresponding to weak responses to be learned, and improves the robustness of the features;
(3) the method realizes local amplification of key parts by image resampling, so that important details in the image are enhanced while background information is retained, avoiding information loss;
(4) in addition, the resampled images and the original image share a feature extraction network, realizing a special form of data augmentation that helps improve the generalization of the model;
(5) consequently, discriminative part localization and feature learning mutually reinforce each other, forming a special closed-loop iterative learning mode.
Drawings
FIG. 1 is a schematic diagram of a model structure of a selective sparse sampling-based fine-grained image classification method;
FIG. 2 shows examples and the distribution of the number of class peak responses located by the model;
FIG. 3 shows a schematic of selective sparse sampling during model training;
FIG. 4 is a diagram illustrating the result of target classification in the proposed method;
fig. 5 shows a diagram of the target location result of the method proposed by the present invention.
Detailed Description
The present invention is further illustrated by the following detailed description and figures; the features and advantages of the present invention will become more apparent from the description.
As shown in fig. 1, the present invention provides a fine-grained image classification method based on selective sparse sampling. The method includes a process of training a classification model for object classification; the training process of the classification model comprises the following steps:
Step 1, key part localization: inputting an image into a classification network, outputting the corresponding class response map, and extracting the class peak responses on the class response map;
wherein the class peak responses correspond to key parts in the image, a key part being a distinctive region for classification;
the class peak response is preferably a local maximum on the class response map;
Step 2, class peak response grouping: grouping the class peak responses obtained in step 1 according to response strength into a discriminative attention group and a complementary attention group; each class peak response generates an attention map, so the two groups of class peak responses generate two groups of attention maps;
Step 3, resampling: aggregating the attention maps within each of the two groups to generate two saliency maps, resampling the image under the guidance of the saliency maps to realize local amplification of the corresponding key parts, and obtaining two resampled images;
Step 4: inputting the resampled images obtained in step 3 into the classification network of step 1 to extract features, combining the features of the original image and the resampled images, and classifying with a classifier to obtain the classification model.
Step 1 of the invention: key part localization. The key part localization algorithm of the method is based on the rich semantic information of the class response map and integrates the part features corresponding to the class peak response points on the class response map, aiming to find the parts, and their positions, that are important for classification judgment. Compared with part localization under a weakly supervised object detection framework, the method omits the search and screening of important parts, so classification can be more efficient.
In a preferred embodiment of the present invention, in step 1, the picture is given only an image label, without annotating the whole target or the positions of its parts; the rich semantics of class peak responses on the class response map are used to realize fast localization of the key parts.
In a preferred embodiment of the invention, step 1 comprises the following substeps:
Step 1.1: inputting the image into a classification network and computing the class response map;
the classification network is preferably a convolutional neural network, and may be selected from any of AlexNet, ResNet, VGGNet, google lenet, and the like.
Define a fine-grained image classification dataset, where C denotes the number of classes and N the number of samples; the training set contains N_train samples and the test set contains N_test samples. Given an image I in the training set, input it into the classification network and extract the feature map set S ∈ R^{D×H×W} output by the deepest convolutional layer, where D is the number of feature channels and H and W are the height and width of the feature maps, respectively. After passing through a global average pooling layer, S is fed into a fully connected layer FC to obtain the network's prediction score s ∈ R^C for each class of the image. Denote the weights of the fully connected layer FC as W_fc ∈ R^{D×C}; each class c and each feature map S_d corresponds to one value w_{d,c} in W_fc.
Then the class response map M_c corresponding to class c in the class response map M is computed as in equation (1):

M_c = Σ_{d=1}^{D} w_{d,c} · S_d    (1)
This formula defines the relationship between the features learned by the network and the image categories, and helps to intuitively understand which regions contribute to category judgment.
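As an illustrative sketch (not the patent's own code), equation (1) can be computed in a few lines; PyTorch is assumed here, and the function name is invented for the example:

```python
import torch

def class_response_maps(S: torch.Tensor, W_fc: torch.Tensor) -> torch.Tensor:
    """Compute class response maps M per equation (1).

    S:    feature maps of the deepest conv layer, shape (D, H, W)
    W_fc: weights of the fully connected classifier, shape (D, C)
    Returns M of shape (C, H, W), where M[c] = sum_d W_fc[d, c] * S[d].
    """
    # Weight each feature channel by the classifier weights of class c
    # and sum over channels, for all classes at once.
    return torch.einsum('dhw,dc->chw', S, W_fc)
```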
Step 1.2: from the classification result s obtained in step 1.1, compute the set P of the network's predicted probabilities of the image for each class, sort them in descending order, and select the top-five prediction probabilities {p_1, p_2, …, p_5}. Compute their entropy as shown in equation (2):

E = −Σ_{i=1}^{5} p_i log p_i    (2)
As can be seen from the classification results of multiple classification networks on datasets such as CUB_200_2011, Stanford Cars and FGVC-Aircraft, the top-5 accuracy of the network output predictions can reach 99.9 percent, i.e., the correct class is almost always among the network's top five predictions, so selecting the top-five prediction probabilities is sufficient.
Step 1.3: taking into account both the precision and the recall of class peak response localization, compute the class response map used for extracting class peak responses according to equation (3):

R_o = M_{c_1} if E ≤ δ; otherwise R_o = (1/5) Σ_{i=1}^{5} M_{c_i}    (3)

where M_{c_i} is the class response map corresponding to the class with the i-th highest prediction probability, and δ is a threshold, selected as 0.2 based on controlled experiments.
The rule defined by equations (2) and (3) is: when the top-1 probability predicted by the classification network is high, i.e., the prediction is more credible, only the class response map corresponding to the top-1 class is selected, so that the extracted class peak responses contain little noise; when the top-1 probability is low, i.e., the sample is harder to predict and the prediction is less reliable, the class response maps corresponding to the top-5 classes are selected, ensuring the recall of the class peak responses.
To avoid numerical difficulties caused by the magnitude of the variables, max-min normalization is performed on the class response map R_o, as in equation (4):

R = (R_o − min(R_o)) / (max(R_o) − min(R_o))    (4)

where R is the normalized class response map, R_o is the class response map obtained from equation (3), min(R_o) is the minimum value of the class response map, and max(R_o) is its maximum value.
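A non-authoritative sketch of the selection and normalization rules of equations (2)-(4) follows; the top-5 averaging in the uncertain branch reflects the reconstruction of equation (3) above, and `delta` corresponds to the threshold δ:

```python
import torch
import torch.nn.functional as F

def select_response_map(M: torch.Tensor, s: torch.Tensor, delta: float = 0.2):
    """Select and normalize the class response map per equations (2)-(4).

    M: class response maps, shape (C, H, W)
    s: classification scores, shape (C,)
    """
    p = F.softmax(s, dim=0)
    top_p, top_idx = torch.topk(p, k=5)           # top-five prediction probabilities
    entropy = -(top_p * torch.log(top_p + 1e-12)).sum()
    if entropy <= delta:                          # confident: top-1 map only
        R_o = M[top_idx[0]]
    else:                                         # uncertain: combine top-5 maps
        R_o = M[top_idx].mean(dim=0)
    # max-min normalization, equation (4)
    return (R_o - R_o.min()) / (R_o.max() - R_o.min() + 1e-12)
```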
Step 1.4: extract local maxima from the class response map R within a window of set size to obtain the set of class peak response positions T = {(x_1, y_1), (x_2, y_2), …, (x_n, y_n)}, where n is the number of class peak responses.
In preferred embodiments, the window size is 3×3, 5×5, or 7×7; preferably, the window size is 3×3.
The number and positions of the class peak responses extracted by this process adapt to the image content and are not fixed; their distribution over several fine-grained image classification datasets is shown in FIG. 2. The proposed framework is therefore more flexible and can be applied to different domains, such as birds, airplanes and cars, without adjusting hyper-parameters for each specific task.
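Local maxima in a small window can be found with a max-pooling comparison, as sketched below (an illustration under the same PyTorch assumption; `min_value` is an invented guard against flat background, not a parameter from the patent):

```python
import torch
import torch.nn.functional as F

def extract_class_peaks(R: torch.Tensor, window: int = 3, min_value: float = 0.0):
    """Extract local maxima of the normalized response map R (step 1.4).

    R: normalized class response map, shape (H, W)
    Returns a list of (x, y) peak positions and their response values.
    """
    pad = window // 2
    pooled = F.max_pool2d(R[None, None], window, stride=1, padding=pad)[0, 0]
    # a point is a peak iff it equals the maximum of its neighborhood
    peak_mask = (R == pooled) & (R > min_value)
    ys, xs = torch.nonzero(peak_mask, as_tuple=True)
    return [(int(x), int(y)) for x, y in zip(xs, ys)], R[ys, xs]
```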
Step 2 of the invention: class peak response grouping. The class peak responses are grouped according to their response values, which prevents strong responses from dominating the learning process, allows parts corresponding to weak responses to be learned, and improves the robustness of the features. In an embodiment, step 2 comprises the following sub-steps:
Step 2.1: divide the class peak responses of step 1 into two sets T_d and T_c according to equations (5) and (6):

T_d = {(x, y) | (x, y) ∈ T, R_{x,y} ≥ ζ}    (5)
T_c = {(x, y) | (x, y) ∈ T, R_{x,y} < ζ}    (6)
where R_{x,y} is the response value of the class peak response at (x, y), and ζ is the division threshold; ζ may be chosen as a random number uniformly distributed on (0, 1), as the median of all class peak responses, etc. T_d is the discriminative class peak response set, i.e., the parts decisive for class judgment; T_c is the complementary class peak response set, i.e., the parts that play a complementary role in class judgment.
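Grouping then reduces to a threshold split; below is a sketch using the median as ζ, one of the choices mentioned above:

```python
import torch

def group_peaks(peaks, values: torch.Tensor):
    """Split class peak responses into discriminative (T_d) and complementary
    (T_c) sets per equations (5) and (6), using the median as zeta."""
    zeta = values.median()
    T_d = [p for p, v in zip(peaks, values) if v >= zeta]
    T_c = [p for p, v in zip(peaks, values) if v < zeta]
    return T_d, T_c
```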
Step 2.2: for each class peak response, compute a corresponding attention map using a Gaussian kernel function as shown in equation (7); the two groups of class peak responses generate two groups of attention maps:

A_i(x, y) = R_{x_i,y_i} · β_1 exp(−((x − x_i)² + (y − y_i)²) / (2β_2²))    (7)

where R_{x_i,y_i} is the response value of the class peak response (x_i, y_i), and β_1 and β_2 are learnable parameters. The meaning of the formula is: the stronger the response value, the more the corresponding region is amplified.
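A sketch of the attention-map computation, under the Gaussian form reconstructed in equation (7); in a real model β_1 and β_2 would be `nn.Parameter`s so the classification loss can optimize them:

```python
import torch

def peak_attention_map(shape, peak, value, beta1, beta2):
    """Attention map of one class peak response (equation (7), as reconstructed):
    a Gaussian centered at the peak, scaled by its response value so that
    stronger responses are amplified more.

    shape: (H, W); peak: (x_i, y_i); value: response value R_{x_i, y_i}
    beta1, beta2: learnable scalars (amplitude and width)
    """
    H, W = shape
    ys = torch.arange(H).view(H, 1).float()
    xs = torch.arange(W).view(1, W).float()
    x_i, y_i = peak
    dist2 = (xs - x_i) ** 2 + (ys - y_i) ** 2
    return value * beta1 * torch.exp(-dist2 / (2 * beta2 ** 2))
```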
Step 3 of the invention: resampling. Local amplification of the key parts is realized by image resampling, so that important details in the image are enhanced while background information is retained, avoiding information loss.
In a preferred embodiment of the present invention, step 3 comprises the following sub-steps:
Step 3.1: using the attention maps obtained in step 2, sum the attention maps within each group to obtain the saliency maps Q_d and Q_c that guide resampling, computed as in equations (8) and (9):

Q_d = Σ_i A_i, if (x_i, y_i) ∈ T_d    (8)
Q_c = Σ_i A_i, if (x_i, y_i) ∈ T_c    (9)

where Q_d denotes the discriminative-branch saliency map and Q_c denotes the complementary-branch saliency map.
Step 3.2: the two sets of saliency maps calculated from step 3.1 then guide the resampling of the original image.
The image I is regarded as a grid consisting of a point set V and an edge set E (the set of connecting lines between adjacent points in V), where V = [(x_0, y_0), (x_1, y_1), …, (x_end, y_end)] and (x_i, y_i) are pixel coordinates. The goal of image resampling is to find a new coordinate point set V′ = [(x′_0, y′_0), (x′_1, y′_1), …, (x′_end, y′_end)] such that, in the new coordinate system, important regions of the original image are uniformly sampled while unimportant regions allow some degree of compression. That is, we seek a mapping between the original image and the resampled image, consisting of two mapping functions f(x, y) and g(x, y), with the resampled image given by I_new(x, y) = I(f(x, y), g(x, y)).
f(x, y) and g(x, y) distribute the saliency map computed from the original image uniformly into the resampled image. This condition can be formalized as requiring the saliency mass per unit of resampled area to be constant:

Q(f(x, y), g(x, y)) · |det J(x, y)| = const

where J(x, y) is the Jacobian of the mapping (f, g).
Approximate solutions of this equation are given by equations (10) and (11):

f(x, y) = Σ_{x′,y′} Q(x′, y′) k((x′, y′), (x, y)) x′ / Σ_{x′,y′} Q(x′, y′) k((x′, y′), (x, y))    (10)

g(x, y) = Σ_{x′,y′} Q(x′, y′) k((x′, y′), (x, y)) y′ / Σ_{x′,y′} Q(x′, y′) k((x′, y′), (x, y))    (11)
where k((x′, y′), (x, y)) is a Gaussian kernel function acting as a regularization term to avoid degenerate solutions, such as all pixels converging to the same value. Substituting the saliency maps Q_d and Q_c computed by equations (8) and (9) into equations (10) and (11) yields the two resampled images. The image corresponding to Q_d is named the discriminative-branch resampled image; it highlights the regions that are decisive for classification. The image corresponding to Q_c is named the complementary-branch resampled image; it enlarges regions that provide supplementary evidence for classification and can stimulate the model to learn more supporting evidence.
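A sketch of saliency-guided resampling per equations (10) and (11); it assumes a low-resolution saliency map so that the dense kernel matrix fits in memory, the kernel width `sigma` is an illustrative choice, and `F.grid_sample` realizes I_new(x, y) = I(f(x, y), g(x, y)):

```python
import torch
import torch.nn.functional as F

def saliency_resample(image: torch.Tensor, Q: torch.Tensor, sigma: float = 0.3):
    """Saliency-guided resampling (equations (10) and (11)).

    image: (1, 3, H, W);  Q: saliency map, (h, w), assumed low resolution
    """
    h, w = Q.shape
    # normalized grid coordinates of the saliency map, in [-1, 1]
    ys = torch.linspace(-1, 1, h)
    xs = torch.linspace(-1, 1, w)
    gy, gx = torch.meshgrid(ys, xs, indexing='ij')
    coords = torch.stack([gx.reshape(-1), gy.reshape(-1)], dim=1)   # (h*w, 2)
    q = Q.reshape(-1)
    # kernel-weighted average: f(x, y) and g(x, y) for every target grid point
    d2 = ((coords[:, None, :] - coords[None, :, :]) ** 2).sum(-1)   # (h*w, h*w)
    k = torch.exp(-d2 / (2 * sigma ** 2))
    weight = k * q[None, :]                                         # Q(x',y') k(.,.)
    grid = (weight @ coords) / weight.sum(dim=1, keepdim=True)
    grid = grid.view(1, h, w, 2)
    # sample the original image at the computed (f, g) coordinates
    return F.grid_sample(image, grid, align_corners=True)
```

Because every operation above is differentiable, gradients can flow back into the saliency map, which is consistent with the end-to-end training described below.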
As shown in fig. 3, the selective sparse sampling provided in the method prevents strong features from dominating the gradient learning process, thereby promoting more comprehensive feature expression in the network's learning.
The whole resampling process is realized through convolution operations, can be embedded into any neural network, and supports end-to-end learning and training, so the classification loss computed on the resampled images can optimize the parameters β_1 and β_2.
In step 3, local amplification of the key parts is realized by image resampling, so that important details in the image are enhanced while background information is retained, avoiding information loss.
Step 4 of the invention: completing feature fusion and classification model construction. The features of the resampled images and the features of the original image are integrated (the resampled images are input into the step-1 classification network to extract features, which are concatenated with the original-image features extracted in step 1 to generate a new feature description of the image), realizing the fusion of the global and local features of the image.
Two resampled images are obtained from the input image through steps 1, 2 and 3. They are input into the classification network used in step 1 to extract features. To aggregate global and local features, the joint image feature is defined as F_J = {F_o, F_d, F_c}, where F_o, F_d and F_c are the features of the original image, the discriminative-branch resampled image and the complementary-branch resampled image, respectively. The features are concatenated and fed into a fully connected layer with softmax to obtain the image classification result. Since the resampled images and the original image share the feature extraction network, a special form of data augmentation is realized, which helps improve the generalization of the model.
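The fusion step can be sketched as a module with a shared backbone (illustrative names; the backbone is assumed to end in global average pooling so that it returns a feature vector):

```python
import torch
import torch.nn as nn

class S3NHead(nn.Module):
    """Sketch of step 4: features of the original image and the two resampled
    images, extracted by one shared backbone, are concatenated and classified."""

    def __init__(self, backbone: nn.Module, feat_dim: int, num_classes: int):
        super().__init__()
        self.backbone = backbone                        # shared feature extractor
        self.fc = nn.Linear(3 * feat_dim, num_classes)  # joint classifier

    def forward(self, img_o, img_d, img_c):
        F_o = self.backbone(img_o)   # original-image features
        F_d = self.backbone(img_d)   # discriminative-branch features
        F_c = self.backbone(img_c)   # complementary-branch features
        F_j = torch.cat([F_o, F_d, F_c], dim=1)
        return self.fc(F_j)          # softmax is folded into the loss
```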
In a preferred embodiment, the fine-grained image classification method based on selective sparse sampling further includes a model optimization process, comprising the following steps:
Step 4.1: design a cross-entropy loss function, compute the gradient of the classification network according to the loss function, back-propagate the gradient through the whole classification network, and update the network parameters;
The model classification cross-entropy loss function is defined as in equation (12):

L_cls = Σ_i L_CE(Y_i, Y*) + L_CE(Y_j, Y*)    (12)

where L_cls denotes the total cross-entropy loss, L_CE the cross-entropy, Y_i the prediction vectors corresponding to the original image and the resampled images, Y_j the prediction vector corresponding to the joint feature, and Y* the image label.
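A sketch of the loss of equation (12) as reconstructed above, with one cross-entropy term per branch prediction plus one for the joint feature:

```python
import torch.nn.functional as F

def s3n_loss(logits_o, logits_d, logits_c, logits_joint, target):
    """Total classification loss (equation (12), as reconstructed)."""
    branch_loss = sum(F.cross_entropy(y, target)
                      for y in (logits_o, logits_d, logits_c))
    return branch_loss + F.cross_entropy(logits_joint, target)
```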
Step 4.2: judge, from the classification error computed by the cross-entropy loss function, whether the network has converged (i.e., the error no longer decreases) or whether the maximum number of iterations has been reached; if so, stop network training; otherwise, jump to step 1.
The unknown images in the test set are input into the trained model to obtain the target classification results, as shown in fig. 4. It can be seen that, compared with a general classification network, the method of the present invention improves classification performance by activating more regions.
The method can improve the accuracy of fine-grained image classification and the target localization capability. When used for target localization, the invention comprises the following steps:
step 1: class response map M for computing top-1 corresponding to original image, discriminant branch and complementary branchO,MD,MC
Step 2: class response map M corresponding to discriminant branchDClass response graph M corresponding to complementary branchesCMapping to an original class response map M by a corresponding inverse transformationOSpace, then MO,MD,MCThe three are added to generate a final class response graph Mf inal
The inverse transformation is a transformation for restoring the locally enlarged image to the original image.
And step 3: the final class response graph Mf inalAnd upsampling to the size of the original image, dividing the upsampled image by using the average value, and selecting the minimum bounding box of the maximum connected domain as a positioning result of the target.
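A sketch of this post-processing step, using NumPy and SciPy for upsampling and connected-component analysis (an implementation choice for illustration, not mandated by the invention):

```python
import numpy as np
from scipy import ndimage

def localize(M_final: np.ndarray, img_h: int, img_w: int):
    """Step 3 of localization: upsample M_final to image size, threshold by its
    mean, and return the bounding box of the largest connected component."""
    zoom = (img_h / M_final.shape[0], img_w / M_final.shape[1])
    up = ndimage.zoom(M_final, zoom, order=1)        # bilinear upsampling
    mask = up > up.mean()                            # threshold by the mean value
    labels, num = ndimage.label(mask)
    if num == 0:
        return None
    sizes = ndimage.sum(mask, labels, range(1, num + 1))
    largest = 1 + int(np.argmax(sizes))              # largest connected component
    ys, xs = np.where(labels == largest)
    return xs.min(), ys.min(), xs.max(), ys.max()    # minimal bounding box
```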
The localization results of the method of the invention are shown in fig. 5, which shows that, compared with the baseline method, the method localizes more accurately and comprehensively, and significantly alleviates the information loss caused by overfitting in the baseline model.
Examples
Example 1
1. Database and sample classification
The method is used for fine-grained image classification. For accuracy and comparability of the experiments, widely used public datasets in the field of fine-grained image classification are adopted, namely the CUB-200-2011, Stanford Cars and FGVC-Aircraft datasets. The CUB-200-2011 dataset is a bird dataset with 11788 images in total, covering 200 species; the dataset divides the whole image set into a training part and a testing part, containing 5994 and 5794 images, respectively. The Stanford Cars dataset is a car dataset with 16185 images in total, covering 196 car models; the images used for training and testing number 8144 and 8041, respectively. The FGVC-Aircraft dataset is an aircraft dataset with 10000 images in total; 6667 images are used for training and validation, and 3333 images are used for testing.
Constructing the model: with the method of the invention, a 50-layer residual convolutional neural network (ResNet-50) is used as the feature extractor. The model is trained for 60 epochs using stochastic gradient descent with momentum, with a batch size of 16. The weight decay is set to 1e-4 and the momentum to 0.9. For parameters initialized from the model pre-trained on ImageNet, an initial learning rate of 0.001 is used; for the other parameters, an initial learning rate of 0.01 is used. Input images are all resized to 448 × 448 pixels.
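The optimizer configuration above can be sketched as follows (PyTorch assumed; how the two parameter groups are separated is left to the caller):

```python
import torch

def build_optimizer(pretrained_params, new_params):
    """SGD configuration described above: momentum 0.9, weight decay 1e-4,
    lr 0.001 for ImageNet-initialized parameters and 0.01 for the rest."""
    return torch.optim.SGD(
        [{'params': pretrained_params, 'lr': 0.001},
         {'params': new_params, 'lr': 0.01}],
        momentum=0.9, weight_decay=1e-4)
```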
2. Performance evaluation criteria
2.1 Classification Performance evaluation criterion
To evaluate the performance of the algorithm and compare with other methods, an evaluation method widely used in image classification is selected, namely top-1 classification accuracy. For a single image, the class corresponding to the maximum value of the predicted probability vector is taken as the prediction result; if the predicted class is consistent with the annotated class of the image, the prediction is correct. For the whole dataset, the proportion of correctly predicted images in the dataset is the top-1 classification accuracy of the dataset.
In addition, to evaluate the localization performance of the algorithm framework, the annotated bounding boxes of the dataset are used in the evaluation.
Evaluating the localization performance of the framework: the method of the invention obtains a predicted bounding box of the target; if the IoU between this box and the annotated box of the target in the original image is greater than 0.5, the localization is considered correct, otherwise incorrect. For each category, the percentage of correctly localized images among all images is computed as the performance evaluation result for box localization.
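For reference, a minimal sketch of the IoU computation underlying this criterion, with boxes given as (x1, y1, x2, y2) corner coordinates:

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes; a localization is counted correct
    when the IoU with the annotated box exceeds 0.5."""
    x1 = max(box_a[0], box_b[0]); y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2]); y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)
```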
3. Results and analysis of the experiments
The quality of the located class peak responses in the model is evaluated, and the effectiveness of the discriminative branch, the complementary branch and the sparse attention module is verified respectively.
3.1) quality of class Peak response
A class peak response point is considered accurately localized when it falls inside the annotated box of the target. The localization accuracy of a single image is the proportion of class peak response points falling inside the annotated box among all detected class peak response points. The accuracy on the whole dataset is the average accuracy over all images.
Table 1. Accuracy (%) of class peak responses in localizing parts, verified on each dataset

Dataset                  CUB      Aircraft   Cars
Localization accuracy    94.63    97.22      98.76
As can be seen from table 1, the class peak responses localize the parts on the target very well.
3.2) Effectiveness of each branch
The effect of each branch, including the original branch (O-branch), the discriminative branch (D-branch) and the complementary branch (C-branch), is verified; the experiments are performed on CUB-200-2011. The effect on the algorithm's classification and box localization performance after removing different branches is measured, with results shown in table 2.
Table 2. Classification and box localization results (%) of each branch, verified on CUB-200-2011

Setting      Localization   O-branch   D-branch   C-branch   Overall
S3N O        57.7           86.0       -          -          86.0
S3N O+D      59.2           87.0       86.5       -          87.6
S3N O+C      56.6           86.8       -          85.3       87.3
S3N D+C      62.6           -          87.1       85.6       87.5
S3N O+D+C    65.2           87.9       86.7       85.1       88.5
The second column of table 2 is the box localization performance; evaluation follows the box localization criteria above. The third to sixth columns are top-1 classification performance; evaluation follows the classification criteria above.
As can be seen from table 2, both the discriminative branch (D-branch) and the complementary branch (C-branch) can improve the classification performance of the model, confirming that both facilitate the learning of fine features. Second, the discriminative branch improves classification performance more than the complementary branch, which supports the view that the discriminative branch focuses on parts with a decisive influence on classification, while the complementary branch focuses on parts that provide weaker supporting evidence. The classification performance of the model is optimal when the original-image branch (O-branch), the discriminative branch and the complementary branch all exist, showing that the features learned by the three branches are complementary. The presence of the discriminative and complementary branches also improves the classification performance of the original-image branch, which shows that weight sharing in the backbone network realizes a special form of data augmentation and improves the generalization of the network.
3.3) Effect of sparse attention
The effectiveness of the proposed sparse attention module for the selective sampling problem is verified; the influence of several different attention mechanisms on classification performance is measured on the CUB-200-2011 dataset, with results shown in table 3.
Table 3. Classification accuracy (%) of several different attention mechanisms

Attention mechanism                  Top-1 accuracy   Notes
Saliency-based attention             85.9             Class-independent
Class-response-map-based attention   87.8             Class-dependent
Sparse attention                     88.5             Part-dependent
Among these, saliency-based attention is set forth in the literature "Recasens A, Kellnhofer P, Stent S, et al. Learning to Zoom: a Saliency-Based Sampling Layer for Neural Networks. ECCV 2018."
Attention based on class response maps is presented in the literature "Zhou B, Khosla A, Lapedriza A, et al. Learning Deep Features for Discriminative Localization. CVPR 2016."
As can be seen from table 3, as the relevance between attention and categories decreases, classification and localization performance drops significantly, indicating that saliency features from the bottom layers of the network cannot guide feature learning from the perspective of high-level semantics. Second, compared with the class-response-map-based attention mechanism, the sparse attention mechanism can capture finer visual cues while discarding regions irrelevant, or even harmful, to the classification decision. Compared with the prior art, the proposed algorithm can well localize the fine parts useful for fine-grained classification judgment, thereby improving classification performance.
3.4) comparison of Fine-grained image classification learning method
The method is compared with existing fine-grained image classification learning methods: feature-learning methods (B-CNN, Low-rank B-CNN, Boosted-CNN, HIHCA and DFL-CNN), attention-mechanism methods (RA-CNN, MA-CNN and DT-RAM) and NTS (a method based on a weakly supervised object detection framework). The CUB-200-2011, Stanford Cars and FGVC-Aircraft datasets are adopted; the evaluation criteria for image classification performance are the same as above.
B-CNN is described in the literature "Tsung-Yu Lin, Aruni RoyChowdhury, and Subhransu Maji. Bilinear CNN models for fine-grained visual recognition. International Conference on Computer Vision, pages 1449-1457, 2015.";
Low-rank B-CNN is proposed in the literature "Shu Kong and Charless C. Fowlkes. Low-rank bilinear pooling for fine-grained classification. Computer Vision and Pattern Recognition, 2017.";
HIHCA is set forth in the literature "Sijia Cai, Wangmeng Zuo, and Lei Zhang. Higher-order integration of hierarchical convolutional activations for fine-grained visual categorization. In IEEE International Conference on Computer Vision, 2017.";
Boosted-CNN is set forth in the literature "Mohammad Moghimi, Serge J. Belongie, Mohammad J. Saberian, Jian Yang, Nuno Vasconcelos, and Li-Jia Li. Boosted convolutional neural networks. In Proceedings of the British Machine Vision Conference, 2016.";
DFL-CNN is proposed in the literature "Yaming Wang, Vlad I. Morariu, and Larry S. Davis. Learning a discriminative filter bank within a CNN for fine-grained recognition. Computer Vision and Pattern Recognition, pages 4148-4157, 2018.";
RA-CNN is proposed in the literature "Jianlong Fu, Heliang Zheng, and Tao Mei. Look closer to see better: Recurrent attention convolutional neural network for fine-grained image recognition. Computer Vision and Pattern Recognition, 2017.";
MA-CNN is set forth in the literature "Heliang Zheng, Jianlong Fu, Tao Mei, and Jiebo Luo. Learning multi-attention convolutional neural network for fine-grained image recognition. In IEEE International Conference on Computer Vision, 2017.";
DT-RAM is described in the literature "Zhichao Li, Yi Yang, Xiao Liu, Feng Zhou, Shilei Wen, and Wei Xu. Dynamic computational time for visual attention. International Conference on Computer Vision, pages 1199-1209, 2017.";
NTS is proposed in the literature "Ze Yang, Tiange Luo, Dong Wang, Zhiqiang Hu, Jun Gao, and Liwei Wang. Learning to navigate for fine-grained classification. In ECCV 2018, pages 438-454."
Table 4 comparison results (%) of the fine-grained image classification method based on selective sparse sampling (S3N) and other methods on three data sets
As can be seen from table 4, the accuracy of the proposed method S3N in these tests is higher than that of existing fine-grained image classification methods using only image class labels (B-CNN, RA-CNN, MA-CNN, DPL-CNN, DFL-CNN, NTS-CNN). After selective sparse sampling is used, the proposed method mines more visual cues and performs feature learning on the discriminative parts at a larger scale, so the model can learn finer features.
3.5) influence of the selection of the threshold δ on the classification
Comparison results (%) of the selective-sparse-sampling-based fine-grained image classification method (S3N) with different thresholds on CUB-200-2011.
TABLE 5 Effect of different thresholds on classification
δ           0        0.05     0.1      0.15     0.2      0.25     0.3
top-1 (%)   88.14    88.23    88.40    88.50    88.52    88.47    88.43
As can be seen from table 5, when computing the class response map used for extracting class peak responses, the value of the threshold δ has a definite influence on the classification result; classification performance is best when δ is 0.2.
The present invention has been described above in connection with preferred embodiments, but these embodiments are merely exemplary and illustrative. Various substitutions and modifications can be made on this basis, all of which fall within the protection scope of the invention.

Claims (10)

  1. A fine-grained image classification method based on selective sparse sampling, the method comprising a process of training a classification model for target classification, the training process of the classification model comprising the following steps:
    step 1, key part localization: inputting an image into a classification network, outputting the corresponding class response map, and extracting the class peak responses on the class response map;
    step 2, class peak response grouping: grouping the class peak responses obtained in step 1 according to response strength into a discriminative attention group and a complementary attention group, each class peak response generating an attention map, the two groups of class peak responses generating two groups of attention maps;
    step 3, resampling: aggregating the attention maps within each of the two groups to generate two saliency maps, resampling the image under the guidance of the saliency maps to realize local amplification of the corresponding key parts, and obtaining two resampled images;
    step 4: inputting the resampled images obtained in step 3 into the classification network of step 1 to extract features, combining the features of the original image and the resampled images, and classifying with a classifier to obtain the classification model.
  2. The method of claim 1, wherein, in step 1, the class peak response is a local maximum on the class response map.
  3. The method according to claim 1, wherein, in step 1, the picture is given only an image label, and neither the whole target nor the positions of its parts are annotated.
  4. The method according to claim 1, wherein step 1 comprises the following sub-steps:
    step 1.1: inputting the image into a classification network and computing the class response map, defined as:

    M_c = Σ_{d=1}^{D} w_{d,c} · S_d

    where W_fc is the weight of the fully connected layer, i.e., the classifier, and each class c and each feature map S_d corresponds to one value w_{d,c} in W_fc;
    step 1.2: obtaining the classification result with the classification network of step 1.1, computing the set P of predicted probabilities of the current image for each class, selecting the top-five prediction probabilities {p_1, …, p_5}, and computing the entropy:

    E = −Σ_{i=1}^{5} p_i log p_i

    step 1.3: defining the class response map used for extracting class peak responses:

    R_o = M_{c_1} if E ≤ δ; otherwise R_o = (1/5) Σ_{i=1}^{5} M_{c_i}

    where M_{c_i} is the class response map corresponding to the class with the i-th highest prediction probability and δ is a threshold;
    step 1.4: extracting local maxima from the class response map within a window of set size to obtain the set of class peak response positions T = {(x_1, y_1), (x_2, y_2), …, (x_n, y_n)}, where n is the number of class peak responses.
  5. The method according to claim 4, wherein, in step 1.3, max-min normalization is performed on the class response map R_o, and the set T of class peak response positions is obtained based on the normalized class response map:

    R = (R_o − min(R_o)) / (max(R_o) − min(R_o))

    where R is the normalized class response map, R_o is the class response map obtained in step 1.3, min(R_o) is the minimum value of the class response map, and max(R_o) is its maximum value.
  6. The method according to claim 1, wherein step 2 comprises the following sub-steps:
    step 2.1: dividing the class peak responses of step 1 into two sets T_d and T_c as follows:
    T_d = {(x, y) | (x, y) ∈ T, R_{x,y} ≥ ζ},
    T_c = {(x, y) | (x, y) ∈ T, R_{x,y} < ζ},
    where R_{x,y} is the response value at (x, y), ζ is the division threshold, T_d is the discriminative class peak response set, and T_c is the complementary class peak response set;
    step 2.2: computing a corresponding attention map for each class peak response with a Gaussian kernel function, as follows:

    A_i(x, y) = R_{x_i,y_i} · β_1 exp(−((x − x_i)² + (y − y_i)²) / (2β_2²))

    where R_{x_i,y_i} is the response value of the class peak response (x_i, y_i), and β_1 and β_2 are learnable parameters for controlling the degree of local amplification.
  7. The method according to claim 1, wherein step 3 comprises the following sub-steps:
    step 3.1: using the attention maps obtained in step 2, summing the attention maps within each group to obtain the saliency maps Q_d and Q_c that guide resampling:
    Q_d = Σ_i A_i, if (x_i, y_i) ∈ T_d,
    Q_c = Σ_i A_i, if (x_i, y_i) ∈ T_c,
    where Q_d denotes the discriminative-branch saliency map and Q_c denotes the complementary-branch saliency map;
    step 3.2: guiding the resampling of the original image with the two saliency maps computed in step 3.1 to obtain the resampled images, computed as:

    f(x, y) = Σ_{x′,y′} Q(x′, y′) k((x′, y′), (x, y)) x′ / Σ_{x′,y′} Q(x′, y′) k((x′, y′), (x, y))

    g(x, y) = Σ_{x′,y′} Q(x′, y′) k((x′, y′), (x, y)) y′ / Σ_{x′,y′} Q(x′, y′) k((x′, y′), (x, y))

    where (x′, y′) are the coordinates of a pixel in the original image, f(x, y) is the corresponding abscissa in the sampled image, g(x, y) is the corresponding ordinate in the sampled image, k is a Gaussian kernel, and Q is Q_d or Q_c.
  8. The method of claim 7, wherein the resampling process is performed by a convolution operation.
  9. The method according to claim 1, wherein the selective sparse sampling based fine-grained image classification method further comprises a model optimization process, comprising the following steps:
    step 4.1: designing a cross-entropy loss function, computing the gradient of the classification network according to the loss function, back-propagating the gradient through the whole classification network, and updating the network parameters;
    the model classification cross-entropy loss function being defined as:

    L_cls = Σ_i L_CE(Y_i, Y*) + L_CE(Y_j, Y*)

    where L_cls denotes the total cross-entropy loss, L_CE the cross-entropy, Y_i the prediction vectors corresponding to the original image and the resampled images, Y_j the prediction vector corresponding to the joint feature, and Y* the image label;
    step 4.2: judging, from the classification error computed by the cross-entropy loss function, whether the network has converged or whether the maximum number of iterations has been reached; if so, stopping network training; otherwise, jumping to step 1.
  10. The method of claim 1, wherein the selective sparse sampling based fine-grained image classification method is further applied to target localization, comprising the following steps:
    step 1: computing the top-1 class response maps M_O, M_D and M_C corresponding to the original image, the discriminative branch and the complementary branch;
    step 2: mapping the class response map M_D of the discriminative branch and the class response map M_C of the complementary branch back into the space of the original class response map M_O through the corresponding inverse transformations, and then adding M_O, M_D and M_C to generate the final class response map M_final;
    step 3: upsampling the final class response map M_final to the size of the original image, thresholding the upsampled map with its mean value, and selecting the minimal bounding box of the largest connected component as the localization result of the target.
CN201910942790.8A 2019-09-30 2019-09-30 Fine-grained image classification method based on selective sparse sampling Active CN110738247B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910942790.8A CN110738247B (en) 2019-09-30 2019-09-30 Fine-grained image classification method based on selective sparse sampling

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910942790.8A CN110738247B (en) 2019-09-30 2019-09-30 Fine-grained image classification method based on selective sparse sampling

Publications (2)

Publication Number Publication Date
CN110738247A true CN110738247A (en) 2020-01-31
CN110738247B CN110738247B (en) 2020-08-25

Family

ID=69269842

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910942790.8A Active CN110738247B (en) 2019-09-30 2019-09-30 Fine-grained image classification method based on selective sparse sampling

Country Status (1)

Country Link
CN (1) CN110738247B (en)

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111368942A (en) * 2020-05-27 2020-07-03 深圳创新奇智科技有限公司 Commodity classification identification method and device, electronic equipment and storage medium
CN111507403A (en) * 2020-04-17 2020-08-07 腾讯科技(深圳)有限公司 Image classification method and device, computer equipment and storage medium
CN111784673A (en) * 2020-06-30 2020-10-16 创新奇智(上海)科技有限公司 Defect detection model training and defect detection method, device and storage medium
CN111816308A (en) * 2020-07-13 2020-10-23 中国医学科学院阜外医院 System for predicting coronary heart disease onset risk through facial picture analysis
CN111915618A (en) * 2020-06-02 2020-11-10 华南理工大学 Example segmentation algorithm and computing device based on peak response enhancement
CN111967527A (en) * 2020-08-21 2020-11-20 菏泽学院 Peony variety identification method and system based on artificial intelligence
CN112906810A (en) * 2021-03-08 2021-06-04 共达地创新技术(深圳)有限公司 Object detection method, electronic device, and storage medium
CN113177546A (en) * 2021-04-30 2021-07-27 中国科学技术大学 Target detection method based on sparse attention module
CN113470029A (en) * 2021-09-03 2021-10-01 北京字节跳动网络技术有限公司 Training method and device, image processing method, electronic device and storage medium
WO2022119155A1 (en) * 2020-12-02 2022-06-09 재단법인 아산사회복지재단 Apparatus and method for diagnosing explainable multiple electrocardiogram arrhythmias

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101729911A (en) * 2009-12-23 2010-06-09 宁波大学 Multi-view image color correction method based on visual perception
CN108334901A (en) * 2018-01-30 2018-07-27 福州大学 A kind of flowers image classification method of the convolutional neural networks of combination salient region
CN109284749A (en) * 2017-07-19 2019-01-29 微软技术许可有限责任公司 Refine image recognition
CN109583305A (en) * 2018-10-30 2019-04-05 南昌大学 A kind of advanced method that the vehicle based on critical component identification and fine grit classification identifies again
CN110168573A (en) * 2016-11-18 2019-08-23 易享信息技术有限公司 Spatial attention model for image labeling
CN110197202A (en) * 2019-04-30 2019-09-03 杰创智能科技股份有限公司 A kind of local feature fine granularity algorithm of target detection
KR20190109194A (en) * 2018-03-16 2019-09-25 주식회사 에이아이트릭스 Apparatus and method for learning neural network capable of modeling uncertainty

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101729911A (en) * 2009-12-23 2010-06-09 宁波大学 Multi-view image color correction method based on visual perception
CN110168573A (en) * 2016-11-18 2019-08-23 易享信息技术有限公司 Spatial attention model for image labeling
CN109284749A (en) * 2017-07-19 2019-01-29 微软技术许可有限责任公司 Refine image recognition
CN108334901A (en) * 2018-01-30 2018-07-27 福州大学 A kind of flowers image classification method of the convolutional neural networks of combination salient region
KR20190109194A (en) * 2018-03-16 2019-09-25 주식회사 에이아이트릭스 Apparatus and method for learning neural network capable of modeling uncertainty
CN109583305A (en) * 2018-10-30 2019-04-05 南昌大学 A kind of advanced method that the vehicle based on critical component identification and fine grit classification identifies again
CN110197202A (en) * 2019-04-30 2019-09-03 杰创智能科技股份有限公司 A kind of local feature fine granularity algorithm of target detection

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
ADRIÀ RECASENS ET AL.: "Learning to Zoom: a Saliency-Based Sampling Layer for Neural Networks", 《ECCV 2018》 *
BOLEI ZHOU ET AL.: "Learning Deep Features for Discriminative Localization", 《2016 IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION》 *
HELIANG ZHENG ET AL.: "Learning Multi-Attention Convolutional Neural Network for Fine-Grained Image Recognition", 《2017 IEEE INTERNATIONAL CONFERENCE ON COMPUTER VISION》 *
YANZHAO ZHOU: "Weakly Supervised Instance Segmentation using Class Peak Response", 《ARXIV》 *

Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111507403A (en) * 2020-04-17 2020-08-07 腾讯科技(深圳)有限公司 Image classification method and device, computer equipment and storage medium
CN111368942A (en) * 2020-05-27 2020-07-03 深圳创新奇智科技有限公司 Commodity classification identification method and device, electronic equipment and storage medium
CN111915618A (en) * 2020-06-02 2020-11-10 华南理工大学 Example segmentation algorithm and computing device based on peak response enhancement
CN111915618B (en) * 2020-06-02 2024-05-14 华南理工大学 Peak response enhancement-based instance segmentation algorithm and computing device
CN111784673B (en) * 2020-06-30 2023-04-18 创新奇智(上海)科技有限公司 Defect detection model training and defect detection method, device and storage medium
CN111784673A (en) * 2020-06-30 2020-10-16 创新奇智(上海)科技有限公司 Defect detection model training and defect detection method, device and storage medium
CN111816308A (en) * 2020-07-13 2020-10-23 中国医学科学院阜外医院 System for predicting coronary heart disease onset risk through facial picture analysis
CN111816308B (en) * 2020-07-13 2023-09-29 中国医学科学院阜外医院 System for predicting coronary heart disease onset risk through facial image analysis
CN111967527A (en) * 2020-08-21 2020-11-20 菏泽学院 Peony variety identification method and system based on artificial intelligence
CN111967527B (en) * 2020-08-21 2022-09-06 菏泽学院 Peony variety identification method and system based on artificial intelligence
WO2022119155A1 (en) * 2020-12-02 2022-06-09 재단법인 아산사회복지재단 Apparatus and method for diagnosing explainable multiple electrocardiogram arrhythmias
CN112906810A (en) * 2021-03-08 2021-06-04 共达地创新技术(深圳)有限公司 Object detection method, electronic device, and storage medium
CN112906810B (en) * 2021-03-08 2024-04-16 共达地创新技术(深圳)有限公司 Target detection method, electronic device, and storage medium
CN113177546A (en) * 2021-04-30 2021-07-27 中国科学技术大学 Target detection method based on sparse attention module
CN113470029A (en) * 2021-09-03 2021-10-01 北京字节跳动网络技术有限公司 Training method and device, image processing method, electronic device and storage medium
CN113470029B (en) * 2021-09-03 2021-12-03 北京字节跳动网络技术有限公司 Training method and device, image processing method, electronic device and storage medium

Also Published As

Publication number Publication date
CN110738247B (en) 2020-08-25

Similar Documents

Publication Publication Date Title
CN110738247B (en) Fine-grained image classification method based on selective sparse sampling
US11960568B2 (en) Model and method for multi-source domain adaptation by aligning partial features
CN107657279B (en) Remote sensing target detection method based on small amount of samples
CN110443818B (en) Graffiti-based weak supervision semantic segmentation method and system
CN113378632B (en) Pseudo-label optimization-based unsupervised domain adaptive pedestrian re-identification method
CN107103754B (en) Road traffic condition prediction method and system
CN107633226B (en) Human body motion tracking feature processing method
CN109978893A (en) Training method, device, equipment and the storage medium of image, semantic segmentation network
CN109871875B (en) Building change detection method based on deep learning
CN114092832B (en) High-resolution remote sensing image classification method based on parallel hybrid convolutional network
CN111027481B (en) Behavior analysis method and device based on human body key point detection
CN107256017B (en) Route planning method and system
CN106408030A (en) SAR image classification method based on middle lamella semantic attribute and convolution neural network
CN111274926B (en) Image data screening method, device, computer equipment and storage medium
CN110659601B (en) Depth full convolution network remote sensing image dense vehicle detection method based on central point
WO2022218396A1 (en) Image processing method and apparatus, and computer readable storage medium
CN110889421A (en) Target detection method and device
CN112365497A (en) High-speed target detection method and system based on Trident Net and Cascade-RCNN structures
CN112132014A (en) Target re-identification method and system based on non-supervised pyramid similarity learning
CN110728694A (en) Long-term visual target tracking method based on continuous learning
CN112528022A (en) Method for extracting characteristic words corresponding to theme categories and identifying text theme categories
CN115311449A (en) Weak supervision image target positioning analysis system based on class reactivation mapping chart
CN108805181B (en) Image classification device and method based on multi-classification model
Jiang et al. Dynamic proposal sampling for weakly supervised object detection
CN113283467A (en) Weak supervision picture classification method based on average loss and category-by-category selection

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant