CN112613574B - Training method of image classification model, image classification method and device
Training method of image classification model, image classification method and device
- Publication number
- CN112613574B (grant) · CN202011609917.3A (application)
- Authority
- CN
- China
- Prior art keywords
- image classification
- classification model
- loss
- cam
- training
- Prior art date
- 2020-12-30
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Life Sciences & Earth Sciences (AREA)
- Artificial Intelligence (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Evolutionary Computation (AREA)
- Molecular Biology (AREA)
- Computational Linguistics (AREA)
- Software Systems (AREA)
- Mathematical Physics (AREA)
- Health & Medical Sciences (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Computing Systems (AREA)
- General Health & Medical Sciences (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Evolutionary Biology (AREA)
- Bioinformatics & Computational Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Image Analysis (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The embodiment of the application discloses a training method for an image classification model, an image classification method, and a device, where the image classification model is implemented based on a convolutional neural network. The training method includes: for each picture input into the image classification model during training, performing the following operations: acquiring a class activation map (CAM) for a preset class of the picture and a class-agnostic activation map (CAAM) that is independent of the preset class; calculating a loss term between the CAM and the CAAM using a preset algorithm; determining a final loss function using the loss term; and adjusting parameters of the image classification model using the determined final loss function. With this scheme, during training of the image classification model the CAAM is constrained by the CAM, and the parameters of the image classification model are adjusted accordingly.
Description
Technical Field
The embodiments of the present application relate to, but are not limited to, the field of image classification, and in particular to a training method for an image classification model, an image classification method, and a device.
Background
Image classification is currently among the most active research areas in machine learning, pattern recognition, and computer vision. With the development of deep learning, handling the image classification problem with deep learning models has become mainstream. In recent years, deep neural network architectures have become larger and deeper, and their accuracy on visual classification tasks keeps improving. However, while gaining strong learning capacity, deep networks readily face the key problem of overfitting. Many researchers have proposed effective regularization schemes such as Dropout, weight decay, stochastic depth, and Mixup.
In a deep learning framework, the loss function is the essential link that measures the gap between the predicted result and the ground truth, thereby guiding the network to adjust its parameters toward more accurate predictions. An effective way to address overfitting is to design loss functions that yield a more discriminative feature distribution, i.e., one with greater intra-class compactness and inter-class separability. Inspired by this, researchers proposed Center-loss and Triplet-loss, which introduce additional constraints on top of the conventional Softmax-loss: Center-loss requires the sum of squared distances between each sample's features and its class center within a batch to be as small as possible, while Triplet-loss requires the distance between similar samples to be as small as possible and the distance between dissimilar samples to be as large as possible. However, both loss functions are computationally expensive, which limits their application to large-scale datasets such as ImageNet. Researchers have also proposed L-Softmax and SM-Softmax, which mathematically modify the original Softmax function so that image feature representations gain larger angular separability; however, such loss functions are not visually intuitive. Moreover, with all of the loss functions above, an image is always represented by a one-dimensional feature vector that contains no visual spatial information. A loss function that incorporates visual spatial information without introducing heavy computation is therefore of strong practical value for alleviating the overfitting problem of deep learning networks and improving their overall performance (including classification ability, localization ability, interpretability, and so on).
Disclosure of Invention
The following is a summary of the subject matter described in detail herein. This summary is not intended to limit the scope of the claims.
The embodiments of the present disclosure provide a training method and device for an image classification model and an image classification method, in which the class-agnostic activation map (CAAM) is constrained by the class activation map (CAM) and the parameters of the image classification model are adjusted accordingly.
The present disclosure provides a training method of an image classification model, where the image classification model is implemented based on a convolutional neural network, the training method including:
for each picture input into the image classification model during training, performing the following operations:
acquiring a class activation map (CAM) for a preset class of the picture and a class-agnostic activation map (CAAM) that is independent of the preset class;
calculating a loss term between the CAM and the CAAM using a preset algorithm;
determining a final loss function using the loss term;
adjusting parameters of the image classification model using the determined final loss function.
In an exemplary embodiment, acquiring the CAM for the preset class of the picture and the CAAM includes:
obtaining a plurality of feature maps output by the last convolutional layer of the image classification model for the picture;
for each pixel position, computing a weighted sum over all feature maps with the weighting coefficients corresponding to the preset class, to obtain the CAM for the preset class;
and, for each pixel position, summing the values over all feature maps to obtain the CAAM.
In an exemplary embodiment, calculating the loss term between the CAM and the CAAM using the preset algorithm includes:
normalizing the CAM and the CAAM respectively;
and calculating the loss term between the normalized CAM and the normalized CAAM using a distance metric.
In an exemplary embodiment, the distance metric may be any distance metric in pixel space.
In an exemplary embodiment, determining the final loss function using the loss term includes:
computing a weighted sum of the determined loss term and a preset classification cross-entropy loss term to obtain the final loss function.
In an exemplary embodiment, determining the final loss function using the loss term includes:
the final loss function includes: CAM-loss = α·Loss_cam + Loss_cross;
where α is a preset hyper-parameter that adjusts the weight of the loss term, Loss_cam is the loss term between the CAM and the CAAM, and Loss_cross is the classification cross-entropy loss term.
In an exemplary embodiment, the convolutional neural network employs a linear classifier or a cosine classifier.
The present disclosure also provides an image classification method, including:
training the image classification model according to the training method of the image classification model in any embodiment;
and classifying the input pictures by adopting a trained image classification model.
The present disclosure also provides an apparatus comprising a memory and a processor; the memory is configured to store a training or image classification program for an image classification model, and the processor is configured to read and execute the training or image classification program for an image classification model, and execute the training method for an image classification model or the image classification method according to any one of the above embodiments.
The present disclosure also provides a storage medium having stored therein a training or image classification program for an image classification model, the program being arranged to perform, when running, the training method of the image classification model according to any one of the above embodiments or the image classification method according to the above embodiments.
The embodiments of the present disclosure provide a training method for an image classification model, an image classification method, and a device, where the image classification model is implemented based on a convolutional neural network. The training method includes: for each picture input into the image classification model during training, performing the following operations: acquiring a class activation map (CAM) for a preset class of the picture and a class-agnostic activation map (CAAM) that is independent of the preset class; calculating a loss term between the CAM and the CAAM using a preset algorithm; determining a final loss function using the loss term; and adjusting parameters of the image classification model using the determined final loss function. With the disclosed scheme, during training the CAAM is constrained by the CAM and the parameters of the image classification model are adjusted accordingly.
Other aspects will be apparent upon reading and understanding the attached figures and detailed description.
Drawings
FIG. 1 is a flowchart of a training method of an image classification model according to an embodiment of the present application;
FIG. 2 is a schematic diagram of the calculation of CAM, CAAM, CAM-loss in some exemplary embodiments;
FIG. 3 is a schematic view of an apparatus according to an embodiment of the present application;
FIG. 4 is a visual comparison of the CAM-loss method and the Softmax-loss baseline method in some exemplary embodiments.
Detailed Description
Hereinafter, embodiments of the present application will be described in detail with reference to the accompanying drawings. It should be noted that the features of the embodiments and examples of the present application may be arbitrarily combined with each other without conflict.
The steps illustrated in the flow charts of the figures may be performed in a computer system, for example as a set of computer-executable instructions. Also, while a logical order is shown in the flow diagrams, in some cases the steps shown or described may be performed in an order different from that shown here.
The embodiment of the present disclosure provides a training method for an image classification model, where the image classification model is implemented based on a convolutional neural network. As shown in FIG. 1, the method includes:
for each picture input into the image classification model during training, the following operations are performed:
S100, acquiring a class activation map (CAM) for a preset class of the picture and a class-agnostic activation map (CAAM) that is independent of the preset class;
S200, calculating a loss term between the CAM and the CAAM using a preset algorithm;
s300, determining a final loss function by using the loss item;
s400, adjusting parameters of the image classification model by using the determined final loss function.
In this embodiment, the class activation map (CAM) is a heat map containing rich spatial information; it highlights the most discriminative core regions used to identify a particular class. Directly summing the feature maps output by the last convolutional layer yields a class-agnostic activation map (CAAM), which shows the spatial distribution of all features of the picture input into the image classification model.
In this embodiment, acquiring the CAM for the preset class of the picture and the CAAM includes: obtaining a plurality of feature maps output by the last convolutional layer of the image classification model for the picture; for each pixel position, computing a weighted sum over all feature maps with the weighting coefficients corresponding to the preset class, to obtain the CAM for the preset class; and, for each pixel position, summing the values over all feature maps to obtain the CAAM.
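The acquisition described above can be sketched in code. The following is a minimal illustration, assuming PyTorch, a tensor `feature_maps` of shape (N, K, H, W) from the last convolutional layer, and a linear classifier weight matrix `fc_weight` of shape (num_classes, K); all names and shapes are illustrative assumptions, not part of the patented method.

```python
import torch

def compute_cam_caam(feature_maps: torch.Tensor,
                     fc_weight: torch.Tensor,
                     target_class: int):
    """Sketch: CAM for `target_class` and the class-agnostic CAAM.

    feature_maps: (N, K, H, W) output of the last conv layer
    fc_weight:    (num_classes, K) linear classifier weights
    """
    # CAM: per-pixel weighted sum of the K feature maps, using the
    # classifier weights w_k^c that connect feature k to class c.
    weights = fc_weight[target_class]                      # (K,)
    cam = torch.einsum('k,nkhw->nhw', weights, feature_maps)
    # CAAM: per-pixel plain sum over all feature maps, class-agnostic.
    caam = feature_maps.sum(dim=1)                         # (N, H, W)
    return cam, caam
```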
In this embodiment, calculating the loss term between the CAM and the CAAM using the preset algorithm includes: normalizing the CAM and the CAAM respectively; and calculating the loss term between the normalized CAM and the normalized CAAM using a distance metric.
In an exemplary embodiment, the distance metric may include, but is not limited to, any distance metric in pixel space.
In this embodiment, determining the final loss function using the loss term includes:
computing a weighted sum of the determined loss term and a preset classification cross-entropy loss term to obtain the final loss function.
In an exemplary embodiment, the final loss function determined from the loss term may adopt a weighted fusion strategy; the classification loss term is not limited to the cross-entropy loss, and the fusion strategy is not limited to a weighted sum.
In this embodiment, determining the final loss function using the loss term includes: the final loss function includes CAM-loss = α·Loss_cam + Loss_cross, where α is a preset hyper-parameter that adjusts the weight of the loss term, Loss_cam is the loss term between the CAM and the CAAM, and Loss_cross is the classification cross-entropy loss term.
In this embodiment, the convolutional neural network employs a linear classifier or a cosine classifier; the network may adopt a deep neural network architecture such as ResNet, DenseNet, or ResNeXt.
In this embodiment, during training of the image classification model the CAAM is constrained by the CAM, so that the network highlights the feature expression of the preset class (i.e., the target class) and suppresses the feature expression of non-target classes, thereby enhancing intra-class compactness and inter-class separability.
The above embodiment is described below by way of an example.
This example presents a method of training an image classification model. In each iteration of training, the following operations are performed for each picture input into the image classification model:

Step 1, obtaining the feature maps and the class weights;

In this step, the last convolutional layer of the deep neural network outputs n feature maps, where f_k(x, y) denotes the value at coordinates (x, y) in the k-th feature map and each feature map has fixed size H×W. Global average pooling of the k-th feature map gives F_k = (1/(H·W))·Σ_{x,y} f_k(x, y). After training and learning, for the target class c, the weight associated with F_k is w_k^c; these class weights are obtained through deep network training.
Step 2, calculating the CAM and the CAAM under category c;
In this step, the CAM of the input picture under category c is denoted CAM_c, and its value at each pixel coordinate can be expressed as: CAM_c(x, y) = Σ_k w_k^c·f_k(x, y).
The CAAM of the input picture has, at each pixel coordinate, the value CAAM(x, y) = Σ_k f_k(x, y). A schematic diagram of the CAM and the CAAM computed under category c is shown in FIG. 2.
Step 3, calculating the loss term between the CAM and the CAAM using a preset algorithm;
In this step, to constrain the CAAM with CAM_c, Loss_cam is defined to measure the gap between CAM_c and the CAAM. First, CAM_c and the CAAM are normalized to CAM'_c and CAAM'; second, Loss_cam is calculated with a distance metric in pixel space, such as the Manhattan (L1) distance: Loss_cam = (1/(H·W))·Σ_{x,y} |CAM'_c(x, y) − CAAM'(x, y)|.
Other distance metrics, such as the Euclidean (L2) distance, may also be used; the choice is not particularly limited.
Step 4, calculating the loss function.
In this step, the final loss function CAM-loss is formed by combining Loss_cam with the cross-entropy loss term Loss_cross; the final loss function includes:
CAM-loss = α·Loss_cam + Loss_cross;
where α is a preset hyper-parameter that adjusts the weight of the loss term, Loss_cam is the loss term between the CAM and the CAAM, and Loss_cross is the classification cross-entropy loss term. A schematic of this calculation is shown in FIG. 2.
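Putting the pieces together, a hedged sketch of the full CAM-loss for one batch could look as follows; a `model` that returns both logits and last-layer feature maps, and an `fc` attribute holding the linear classifier, are assumed interfaces rather than anything mandated by the patent.

```python
import torch
import torch.nn.functional as F

def cam_loss(model, images, labels, alpha: float = 1.0):
    """Sketch: CAM-loss = alpha * Loss_cam + Loss_cross for one batch."""
    logits, feature_maps = model(images)          # assumed model interface
    loss_cross = F.cross_entropy(logits, labels)  # classification term

    # CAM under each sample's ground-truth class; CAAM over all features.
    weights = model.fc.weight[labels]             # (N, K) classifier rows
    cam = torch.einsum('nk,nkhw->nhw', weights, feature_maps)
    caam = feature_maps.sum(dim=1)

    loss_cam = cam_loss_term(cam, caam)           # from the sketch above
    return alpha * loss_cam + loss_cross
```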
Step 5, adjusting the parameters of the image classification model with the determined final loss function, using a parameter update method such as gradient descent.
Steps 1-5 are repeated iteratively to train the image classification model and obtain the final image classification model.
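A minimal training-loop sketch tying Steps 1-5 together, assuming the `cam_loss` helper above, SGD with momentum, and the cosine learning-rate decay mentioned in the experiments below; all hyper-parameter values are placeholders.

```python
import torch

def train(model, loader, epochs: int = 120, lr: float = 0.4, alpha: float = 1.0):
    optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)
    for _ in range(epochs):
        for images, labels in loader:
            loss = cam_loss(model, images, labels, alpha=alpha)  # Steps 1-4
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()                                     # Step 5
        scheduler.step()
    return model
```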
In this example, during training of the convolutional-neural-network-based image classification model, each training iteration obtains, for each picture, a CAM corresponding to a specific class and a class-agnostic CAAM. The CAM is used to constrain the CAAM, generating a new loss term that guides the network to express target-class features more prominently and to suppress non-target-class features. This new loss term is fused by weighted summation with the classification cross-entropy loss term to obtain the new loss function CAM-loss. Training with this loss function, in combination with a conventional convolutional neural network model, yields an efficient deep-learning image classification model. In this way, visual spatial information is fused into the loss function, effectively improving the accuracy of the image classification task while saving computing resources.
The disclosed embodiment also provides an apparatus, as shown in fig. 3, including a memory and a processor; the memory is configured to store a training or image classification program for an image classification model, and the processor is configured to read and execute the training or image classification program for an image classification model, and execute the training method for an image classification model or the image classification method according to any one of the above embodiments.
Embodiments of the present disclosure also provide a storage medium, where a training or image classification program for an image classification model is stored, and the program is configured to execute, when running, the training method for the image classification model according to any one of the foregoing embodiments or the image classification method according to the foregoing embodiments.
The embodiment of the present disclosure further provides an image classification method, including:
training the image classification model according to the training method of the image classification model in the embodiment;
and classifying the input pictures by adopting a trained image classification model.
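For completeness, a hedged inference sketch using the trained model; the two-output model interface is the same assumption as in the training sketches above.

```python
import torch

@torch.no_grad()
def classify(model, images):
    """Classify a batch of input pictures with the trained model."""
    model.eval()
    logits, _ = model(images)      # assumed (logits, feature_maps) interface
    return logits.argmax(dim=1)    # predicted class index per picture
```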
The effect of the above-described image classification is explained below with an example.
The present example uses an actual image classification model training process as an example to illustrate the effect of the image classification model training method.
Taking the image classification benchmark datasets CIFAR-100 and ImageNet as examples, Softmax-loss serves as the comparison baseline, and CAM-loss replaces Softmax-loss as the method under test. The hyper-parameter α is fixed to 1, and a linear classifier is used.
(1) On ImageNet, training runs for 120 epochs with a batch size of 1024 and an initial learning rate lr = 0.4 (for ResNeXt-101, the batch size is 512 and the initial lr is 0.2 due to compute constraints), with a cosine learning-rate decay strategy.
(2) On CIFAR-100, training runs for 300 epochs with a batch size of 128 and an initial learning rate lr = 0.4, with cosine learning-rate decay.
The comparative experimental results of the CAM-loss method under different network architectures are shown in Table 1.
In Table 1, the Top-1 error rate is reported, and the results of the CAM-loss method are shown in bold. It can be seen that CAM-loss applies broadly across network architectures and clearly improves on the baselines on CIFAR-100 and ImageNet: specifically, by 0.69-1.46% on CIFAR-100 and by 0.51-0.70% on ImageNet, a notable margin by the standards of deep network architectures.
FIG. 4 shows the visualization results of the CAM-loss method and the Softmax-loss method, from which it can be seen that:
(1) Comparing column (d) with column (b), the CAAM obtained with CAM-loss is more accurate than the CAAM obtained with Softmax-loss: the feature representation of the target class is more prominent, and the feature representation of non-target classes is suppressed to some extent. For example, the correct category of the third image is the pan; the CAAM from Softmax-loss (b) focuses more on the food in the pan, while the CAAM from CAM-loss (d) focuses more on the surrounding pan and at the same time suppresses the expression of the food region (position shown as a black box).
(2) Comparing column (e) with column (c), the CAM obtained with CAM-loss is more accurate than the CAM obtained with Softmax-loss: the feature representation of the target class is more prominent, and the feature representation of non-target classes is suppressed to some extent. For example, the correct category of the second image is the bookshelf; the CAM from Softmax-loss (c) focuses more on the desk, while the CAM from CAM-loss (e) attends more to the bookshelf region in the upper-right corner and at the same time suppresses the expression of the desk region (positions shown as white boxes).
In this example, constraining the CAAM with the CAM makes the network highlight the feature expression of the target class and suppress the feature expression of non-target classes, which simultaneously enhances intra-class compactness and inter-class separability and fuses visual spatial information into the loss function, thereby effectively improving the accuracy of the image classification task.
It will be understood by those of ordinary skill in the art that all or some of the steps of the methods, systems, and functional modules/units in the devices disclosed above may be implemented as software, firmware, hardware, or suitable combinations thereof. In a hardware implementation, the division between functional modules/units mentioned in the above description does not necessarily correspond to the division of physical components; for example, one physical component may have multiple functions, or one function or step may be performed by several physical components in cooperation. Some or all of the components may be implemented as software executed by a processor, such as a digital signal processor or microprocessor, as hardware, or as an integrated circuit, such as an application-specific integrated circuit. Such software may be distributed on computer-readable media, which may include computer storage media (or non-transitory media) and communication media (or transitory media). The term computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer-readable instructions, data structures, program modules, or other data, as is well known to those skilled in the art. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by a computer. In addition, communication media typically embodies computer-readable instructions, data structures, program modules, or other data in a modulated data signal such as a carrier wave or other transport mechanism, and includes any information delivery media, as is well known to those skilled in the art.
Claims (9)
1. A training method of an image classification model, wherein the image classification model is implemented based on a convolutional neural network, the training method comprising:
for each picture input into the image classification model during training, performing the following operations:
acquiring a class activation map (CAM) for a preset class of the picture and a class-agnostic activation map (CAAM) that is independent of the preset class, comprising:
obtaining a plurality of feature maps output by the last convolutional layer of the image classification model for the picture;
for each pixel position, computing a weighted sum over all feature maps with the weighting coefficients corresponding to the preset class, to obtain the CAM for the preset class;
for each pixel position, summing the values over all feature maps to obtain the CAAM;
calculating a loss term between the CAM and the CAAM using a distance metric;
determining a final loss function using the loss term;
adjusting parameters of the image classification model using the determined final loss function.
2. The training method for an image classification model according to claim 1, wherein calculating the loss term between the CAM and the CAAM using a preset algorithm comprises:
normalizing the CAM and the CAAM respectively;
and calculating the loss term between the normalized CAM and the normalized CAAM using a distance metric.
3. The training method according to claim 2, wherein the distance metric comprises: any distance metric in pixel space.
4. The training method for an image classification model according to claim 3, wherein determining the final loss function using the loss term comprises:
computing a weighted sum of the determined loss term and a preset classification cross-entropy loss term to obtain the final loss function.
5. The training method for an image classification model according to claim 4, wherein determining the final loss function using the loss term comprises:
the final loss function includes: CAM-loss = α·Loss_cam + Loss_cross;
where α is a preset hyper-parameter that adjusts the weight of the loss term, Loss_cam is the loss term between the CAM and the CAAM, and Loss_cross is the classification cross-entropy loss term.
6. The method for training the image classification model according to claim 1, wherein the convolutional neural network adopts a linear classifier or a cosine classifier.
7. An image classification method, comprising:
training the image classification model according to the training method for an image classification model as claimed in any one of claims 1 to 6;
and classifying the input pictures by adopting a trained image classification model.
8. An apparatus comprising a memory and a processor; wherein the memory is configured to store a training or image classification program for an image classification model, and the processor is configured to read and execute the training or image classification program for an image classification model, performing the training method for an image classification model according to any one of claims 1 to 6 or the image classification method according to claim 7.
9. A storage medium in which a training or image classification program for an image classification model is stored, the program being arranged to perform, when running, the method of training an image classification model according to any one of claims 1 to 6 or the method of image classification according to claim 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011609917.3A CN112613574B (en) | 2020-12-30 | 2020-12-30 | Training method of image classification model, image classification method and device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011609917.3A CN112613574B (en) | 2020-12-30 | 2020-12-30 | Training method of image classification model, image classification method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112613574A CN112613574A (en) | 2021-04-06 |
CN112613574B (en) | 2022-07-19
Family
ID=75249447
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011609917.3A Active CN112613574B (en) | 2020-12-30 | 2020-12-30 | Training method of image classification model, image classification method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112613574B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113077466A (en) * | 2021-05-11 | 2021-07-06 | 清华大学深圳国际研究生院 | Medical image classification method and device based on multi-scale perception loss |
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2019231104A1 (en) * | 2018-05-31 | 2019-12-05 | 주식회사 뷰노 | Method for classifying images by using deep neural network and apparatus using same |
CN111310800A (en) * | 2020-01-20 | 2020-06-19 | 世纪龙信息网络有限责任公司 | Image classification model generation method and device, computer equipment and storage medium |
CN112130200A (en) * | 2020-09-23 | 2020-12-25 | 电子科技大学 | Fault identification method based on grad-CAM attention guidance |
Non-Patent Citations (2)
Title |
---|
Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization; Ramprasaath R. Selvaraju et al.; ICCV 2017; 2017-12-31; entire document *
Towards Learning Spatially Discriminative Feature Representations; Chaofei Wang et al.; ICCV 2021; 2021-12-31; entire document *
Also Published As
Publication number | Publication date |
---|---|
CN112613574A (en) | 2021-04-06 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |