CN114972791A

CN114972791A - Image classification model training method, image classification method and related device

Info

Publication number: CN114972791A
Application number: CN202210626206.XA
Authority: CN
Inventors: 邓佳丽; 丁子霖; 刘明; 龚海刚; 王晓敏; 刘明辉; 程旋; 解天舒
Original assignee: Yangtze River Delta Research Institute of UESTC Huzhou
Current assignee: Yangtze River Delta Research Institute of UESTC Huzhou
Priority date: 2022-06-02
Filing date: 2022-06-02
Publication date: 2022-08-30

Abstract

The embodiment of the invention provides an image classification model training method, an image classification method and a related device, and relates to the field of image processing. Firstly, acquiring an original image and a type label of the original image; inputting an original image into a pre-constructed image classification model, wherein the image classification model comprises a feature extraction network and a type prediction network, and the feature extraction network comprises N convolutional layers which are sequentially connected in series; then, feature extraction is carried out on the original image by utilizing a feature extraction network to obtain a first feature map output by the Nth convolutional layer and a second feature map output by the (N-1) th convolutional layer; generating a first training set based on the first feature map, the type label and a plurality of preset binary masks, and generating a second training set based on the second feature map, the type label and the plurality of binary masks; and finally, training the type prediction network by utilizing the first training set and the second training set to obtain a trained image classification model, thereby reducing the additional overhead generated in the model training process.

Description

Image classification model training method, image classification method and related device

Technical Field

The invention relates to the field of image processing, in particular to an image classification model training method, an image classification method and a related device.

Background

With the development of artificial intelligence technology, the application scenarios of calling the image classification model to predict the type of the image are more and more, and before calling the image classification model to predict the type of the image, the image classification model needs to be trained.

In order to improve the accuracy of the model, in the prior art, the original image data is expanded by several times and then input into the convolutional neural network for training, but in the convolutional neural network, the time consumption of the convolutional operation is long, so that the additional overhead of the model training process is several times of the original training overhead.

Disclosure of Invention

In order to overcome the defects of the prior art, embodiments of the present invention provide an image classification model training method, an image classification method, and a related apparatus, which can reduce the overhead generated in the model training process.

Embodiments of the invention may be implemented as follows:

in a first aspect, the present invention provides a method for training an image classification model, the method comprising:

acquiring an original image and a type label of the original image;

inputting the original image into a pre-constructed image classification model, wherein the image classification model comprises a feature extraction network and a type prediction network, and the feature extraction network comprises N sequentially connected convolution layers in series;

performing feature extraction on the original image by using the feature extraction network to obtain a first feature map output by the Nth convolutional layer and a second feature map output by the (N-1) th convolutional layer;

generating a first training set based on the first feature map, the type labels and a plurality of preset binary masks;

generating a second training set based on the second feature map, the type label, and the plurality of binary masks;

and training the type prediction network by utilizing the first training set and the second training set to obtain a trained image classification model.

In an optional embodiment, the step of generating a first training set based on the first feature map, the type label, and a plurality of preset binary masks includes:

processing the first feature map by using the multiple binary masks to obtain multiple first training samples, wherein the multiple first training samples correspond to the multiple binary masks in a one-to-one mode;

and generating a label of each first training sample according to the plurality of binary masks and the type label to obtain a first training set, wherein the first training set comprises the plurality of first training samples and the label of each first training sample.

In an alternative embodiment, the side length of each binary mask is the same as that of the first feature map, and each binary mask includes all-1 regions and/or all-0 regions;

the step of processing the first feature map with the plurality of binary masks to obtain a plurality of first training samples includes:

aiming at any one target binary mask in the multiple binary masks, reserving a region, which is overlapped with all 1 regions of the target binary mask, in the first feature map, and erasing a region, which is overlapped with all 0 regions of the target binary mask, in the first feature map to obtain a first training sample corresponding to the target binary mask;

and traversing each binary mask to obtain a first training sample corresponding to each binary mask.

In an alternative embodiment, the step of generating a label for each of the first training samples according to the plurality of binary masks and the type label includes:

and associating the identifier of the binary mask corresponding to each first training sample with the type label to obtain the label of each first training sample.

In an alternative embodiment, the step of generating a second training set based on the second feature map, the type label, and the plurality of binary masks includes:

processing the second feature map by using the plurality of binary masks to obtain a plurality of second training samples, wherein the plurality of second training samples are in one-to-one correspondence with the plurality of binary masks;

and generating a label of each second training sample according to the plurality of binary masks and the type label to obtain a second training set, wherein the second training set comprises the plurality of second training samples and the label of each second training sample.

In an optional embodiment, the step of training the type prediction network by using the first training set and the second training set to obtain a trained image classification model includes:

training the type prediction network by using the first training set to obtain a first type prediction network;

training the type prediction network by using the second training set to obtain a second type prediction network;

distilling the parameters of the first type of prediction network and the parameters of the second type of prediction network to obtain the parameters of the type of prediction network so as to finish the training of the image classification model.

In a second aspect, the present invention provides a method of image classification, the method comprising:

acquiring an image to be classified;

inputting the image to be classified into an image classification model obtained by the image classification model training method of any one of the preceding embodiments, and predicting the type of the image to be classified.

In a third aspect, the present invention provides an image classification model training apparatus, including:

the device comprises a first acquisition unit, a second acquisition unit and a third acquisition unit, wherein the first acquisition unit is used for acquiring an original image and a type label of the original image;

a processing unit to:

and the training unit is used for training the type prediction network by utilizing the first training set and the second training set to obtain a trained image classification model.

In a fourth aspect, the present invention provides an image classification apparatus, comprising:

the second acquisition unit is used for acquiring the image to be classified;

and the prediction unit is used for inputting the image to be classified into the image classification model obtained by utilizing the image classification model training method in any one of the preceding embodiments and predicting the type of the image to be classified.

In a fifth aspect, the present invention provides a computer device, comprising a memory and a processor, wherein the memory stores a computer program, and the processor implements the image classification model training method according to any one of the foregoing embodiments when executing the computer program, and/or the image classification method according to the foregoing embodiments.

Compared with the prior art, the image classification model training method, the image classification method and the related device provided by the embodiment of the invention have the advantages that firstly, the original image and the type label of the original image are obtained; then, inputting an original image into a pre-constructed image classification model, wherein the image classification model comprises a feature extraction network and a type prediction network, and the feature extraction network comprises N sequentially connected convolution layers in series; then, feature extraction is carried out on the original image by utilizing a feature extraction network to obtain a first feature map output by the Nth convolutional layer and a second feature map output by the (N-1) th convolutional layer; then, generating a first training set based on the first feature map, the type label and a plurality of preset binary masks, and generating a second training set based on the second feature map, the type label and the plurality of binary masks; and finally, training the type prediction network by utilizing the first training set and the second training set to obtain a trained image classification model. According to the embodiment of the invention, a first training set is generated based on a first feature map output by the Nth convolutional layer, a type label and a plurality of preset binary masks, a second training set is generated based on a second feature map output by the N-1 th convolutional layer, the type label and the binary masks, and then the first training set and the second training set are used for training the type prediction network to obtain the trained image classification model, so that the convolution operation of original image data expanded by multiple times is avoided, and the extra overhead generated in the model training process is reduced.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the embodiments will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present invention and therefore should not be considered as limiting the scope, and for those skilled in the art, other related drawings can be obtained according to the drawings without inventive efforts.

Fig. 1 is a schematic flow chart of an image classification model training method according to an embodiment of the present invention;

fig. 2 is a flowchart illustrating an implementation manner of step S104 according to an embodiment of the present invention;

fig. 3 is an exemplary diagram of a binary mask according to an embodiment of the present invention;

FIG. 4 is a diagram illustrating an exemplary acquisition process of a first training sample according to an embodiment of the present invention;

FIG. 5 is a diagram illustrating an example of a process for generating labels for a first training sample according to an embodiment of the present invention;

fig. 6 is a flowchart illustrating an implementation manner of step S105 according to an embodiment of the present invention;

FIG. 7 is a schematic flow chart of an image classification model training method according to an embodiment of the present invention;

FIG. 8 is a diagram illustrating an example of an image classification model training process according to an embodiment of the present invention;

FIG. 9 is a comparison graph of network test results provided by embodiments of the present invention;

FIG. 10 is a graph of results of an ablation learning experiment for a hyperparameter β provided by an embodiment of the present invention;

FIG. 11 is a block diagram illustrating an exemplary configuration of a computing device according to an embodiment of the present invention;

FIG. 12 is a block diagram of functional units of an apparatus for training an image classification model according to an embodiment of the present invention;

fig. 13 is a functional unit block diagram of an image classification apparatus according to an embodiment of the present invention.

Icon: 100-a computer device; 110-a memory; 120-a processor; 200-image classification model training device; 201-a first acquisition unit; 202-a processing unit; 203-a training unit; 300-image classification means; 301-a second acquisition unit; 302-prediction unit.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. The components of embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations.

Thus, the following detailed description of the embodiments of the present invention, presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures.

Furthermore, the appearances of the terms "first," "second," and the like, if any, are used solely to distinguish one from another and are not to be construed as indicating or implying relative importance.

It should be noted that the features of the embodiments of the present invention may be combined with each other without conflict.

In order to avoid performing convolution operation on original image data expanded by multiple times and reduce the additional overhead generated in the model training process, the embodiment of the invention provides an image classification model training method, which is described in detail below.

Referring to fig. 1, fig. 1 shows a flow of an image classification model training method according to an embodiment of the present invention, where the image classification model training method includes steps S101 to S106.

S101, acquiring an original image and a type label of the original image.

The original image is an image required by training an image classification model, and a corresponding type label can be generally allocated to the original image according to the type of the original image, wherein the type label can be a digital label, a text label, a symbol label, and other labels defined by a user according to actual requirements.

S102, inputting the original image into a pre-constructed image classification model.

The image classification model comprises a feature extraction network and a type prediction network. The feature extraction network comprises N convolutional layers which are sequentially connected in series. The type prediction network comprises a full connectivity layer and a softmax classifier.

S103, feature extraction is carried out on the original image by using a feature extraction network, and a first feature map output by the Nth convolutional layer and a second feature map output by the (N-1) th convolutional layer are obtained.

The convolution operation is carried out on the original image by utilizing N convolution layers which are sequentially connected in series in the feature extraction network so as to carry out feature extraction. Since the features extracted by the shallow convolutional layer are often rough and are not suitable for training the type prediction network, the embodiment of the present invention performs training of the type prediction network based on the feature maps output by the two reciprocal convolutional layers in the feature extraction network, that is, the first feature map output by the nth convolutional layer and the second feature map output by the N-1 th convolutional layer.

And S104, generating a first training set based on the first feature map, the type labels and a plurality of preset binary masks.

The first characteristic diagram from the Nth convolution layer is erased by using a plurality of different binary masks to obtain a plurality of new first characteristic diagrams. And expanding the type labels of the original image by using the plurality of binary masks to obtain a plurality of new labels. And matching the new first feature graph corresponding to the same binary mask with the new label to obtain a first training set.

And S105, generating a second training set based on the second feature map, the type label and the plurality of binary masks.

Wherein, for the second characteristic diagram from the (N-1) th convolution layer, the different regions on the second characteristic diagram are erased by using a plurality of different binary masks, respectively, so as to obtain a plurality of new second characteristic diagrams. And expanding the type labels of the original image by using the plurality of binary masks to obtain a plurality of new labels. And matching the new second feature map corresponding to the same binary mask with the new label to obtain a second training set.

And S106, training the type prediction network by using the first training set and the second training set to obtain a trained image classification model.

And training the first training set input type prediction network to obtain a first softmax classifier. And inputting the second training set into the type prediction network for training to obtain a second softmax classifier. And distilling information contained in the joint reasoning result output by the first softmax classifier and the second softmax classifier to a new softmax classifier, and obtaining a type prediction network in the trained image classification model by the full-link layer and the new softmax classifier.

The method provided by the embodiment of the invention has the advantages that a first training set is generated based on the first characteristic diagram and the type label output by the Nth convolutional layer and a plurality of preset binary masks, a second training set is generated based on the second characteristic diagram and the type label output by the (N-1) th convolutional layer and a plurality of binary masks, and then the first training set and the second training set are used for training the type prediction network to obtain the trained image classification model, so that the convolution operation on original image data expanded by multiple times is avoided, and the additional overhead generated in the model training process is reduced.

Step S104 will be described in detail below.

Referring to fig. 2, fig. 2 shows a flow of an implementation manner of step S104 according to an embodiment of the present invention, where step S104 includes sub-steps S104-1 to S104-2.

S104-1, processing the first feature map by using the plurality of binary masks to obtain a plurality of first training samples.

The plurality of first training samples correspond to the plurality of binary masks one to one. In an embodiment of the invention, the binary mask comprises M ₀ 、M ₁ 、M ₂ 、M ₃ And M ₄ (see fig. 3), each binary mask includes a full 1 region: (I.e., white area) and/or all-0 area (i.e., black area), and the boundary coordinates of all-0 area may be expressed as

As shown in FIG. 3, M ₀ Boundary coordinates B of all-0 region ₀ ＝(0,0,0,0)，M ₁ Boundary coordinates of all-0 region

M ₂ Boundary coordinates of all-0 region

M ₃ Boundary coordinates of all-0 region

M ₄ Boundary coordinates of all-0 region

In the above expression c is the side length of the first feature map f,

and

respectively, the coordinates of the four vertices on the boundary of the all 0 region.

Using binary masks M respectively ₀ 、M ₁ 、M ₂ 、M ₃ And M ₄ Processing the first characteristic diagram f to obtain a first training sample

And

can indicate a process as [ ], i.e.

Specifically, the implementation process of the sub-step S104-1 may be as follows:

firstly, aiming at any one target binary mask in a plurality of binary masks, reserving an area which is overlapped with all 1 areas of the target binary mask in a first characteristic diagram, and erasing the area which is overlapped with all 0 areas of the target binary mask in the first characteristic diagram to obtain a first training sample corresponding to the target binary mask;

then, traversing each binary mask to obtain a first training sample corresponding to each binary mask.

Wherein, the first feature map f and the binary mask M ₀ 、M ₁ 、M ₂ 、M ₃ And M ₄ All have the same side length. As shown in FIG. 4, M ₀ Not including all 0 regions, M ₀ Corresponding first training sample

All the regions in the first feature map f, that is, the first training sample, are retained

This is the first profile f. And M ₁ 、M ₂ 、M ₃ And M ₄ All 0 regions are respectively at the upper left, upper right, lower left and lower right, so as to understand that the upper left, upper right, lower left and lower right regions in the first feature map f are respectively erased to obtain the first training sample

And

in the embodiment of the method, as the characteristics of the partial region in the first characteristic diagram f are erased, the subsequent type prediction network can deepen the learning of the partial characteristics in the process of learning the semantic information of different regions of the characteristic diagram, and has more definite cognition on the target region, and meanwhile, the training difficulty of the type prediction network is improved to a certain extent, so that the over-fitting phenomenon can be effectively inhibited, in addition, the characteristics of the target region and the characteristics of the background region are separated to a certain extent, the possibility of forming the shortcut learning by the network is reduced, and the coupling effect of the target region and the background region caused by the shortcut learning is effectively relieved.

S104-2, generating a label of each first training sample according to the plurality of binary masks and the type labels to obtain a first training set.

Wherein the first training set comprises a plurality of first training samples and a label for each first training sample. In order to reduce the manual labeling cost and enable the obtained first training sample to be directly applied to network training of a supervised learning scene, the embodiment of the invention provides a binary mask M ₀ 、M ₁ 、M ₂ 、M ₃ And M ₄ Expanding the type label y of the original image to respectively obtain a first training sample

And

is marked with a label

And

specifically, the sub-step S104-2 is implemented as follows:

Wherein the first training sample

Is marked with a label

Can be expressed as

As shown in FIG. 5, the type labels of the original image are "dog", "cat", "fish" and "bird", and a binary mask M is used for any one type label, for example, "dog ₀ 、M ₁ 、M ₂ 、M ₃ And M ₄ Expanding the training data to obtain a first training sample

And

are respectively (dog, M) ₀ ) (dog, M) ₁ ) (dog, M) ₂ ) (dog, M) ₃ ) And (dog, M) ₄ )。

Step S105 will be described in detail below.

Referring to fig. 6, fig. 6 shows a flow of an implementation manner of step S105 according to an embodiment of the present invention, where step S105 includes sub-steps S105-1 to S105-2.

And S105-1, processing the second feature map by using the plurality of binary masks to obtain a plurality of second training samples.

The plurality of second training samples correspond to the plurality of binary masks one to one, and it can be understood that the implementation process of the substep S105-1 is the same as that of the substep S104-1, and thus is not described again;

and S105-2, generating a label of each second training sample according to the plurality of binary masks and the type labels to obtain a second training set.

The second training set includes a plurality of second training samples and a label of each second training sample, and the implementation process of the substep S105-2 is the same as that of the substep S104-2, and therefore, the description is not repeated.

Step S106 will be described in detail below.

Referring to fig. 7, fig. 7 shows another flow of the image classification model training method according to the embodiment of the present invention, and step S106 includes sub-steps S106-1 to S106-3.

S106-1, training the type prediction network by using the first training set to obtain a first type prediction network.

The first type prediction network is obtained by inputting first training set data into the type prediction network for training, and the first type prediction network comprises a first softmax classifier.

And S106-2, training the type prediction network by using the second training set to obtain a second type prediction network.

The second type prediction network is obtained by inputting second training set data into the type prediction network for training, and the second type prediction network comprises a second softmax classifier.

And S106-3, distilling the parameters of the first type prediction network and the parameters of the second type prediction network to obtain the parameters of the type prediction network so as to finish the training of the image classification model.

The parameters of the first type of prediction network refer to information contained in the joint reasoning result output by the first softmax classifier, and the parameters of the second type of prediction network refer to information contained in the joint reasoning result output by the second softmax classifier. And distilling information contained in a joint reasoning result output by the first softmax classifier and the second softmax classifier to a brand-new softmax classifier in a knowledge migration mode by adopting a Self-Distillation learning method (Self-Distillation), and obtaining a type prediction network in the trained image classification model by using the full connection layer and the softmax classifier.

To explain step S106 more visually, please refer to fig. 8, fig. 8 is an exemplary diagram of an image classification model training process according to an embodiment of the present invention, in which feature map processing modules, full connection layers, and softmax classifiers are respectively disposed after two reciprocal convolution layers of a convolutional neural network.

Order to

A softmax classifier attached after the ith convolution layer, wherein ω is a weight parameter in the classifier, f _i Is the characteristic diagram of the ith convolution layer output.

Original probability of output of existing softmax classifier Fc (f; omega) for N-class classification tasks

Where y represents the original kind of input image x.

In the embodiment of the invention, the combined softmax classifier is used for replacing the original softmax classifier to carry out feature map processing by using binary mask

And (6) performing prediction. At this time, the joint softmax classifier outputs a joint probability

Where j denotes a label generated from the binary mask and the type label of the input image x, and is referred to herein as a "pseudo label".

Training loss function derived from a combined softmax classifier connected after the last convolutional layer

Cross entropy loss term therein

Can be expressed as:

in this case, the single inference result is the result output by the combined softmax classifier connected after the last convolutional layer

Will be used to represent the ability of the network to classify the original data set.

In bookIn the embodiment of the invention, two combined softmax classifiers Fc are introduced _m (. theta.) with Fc _m-1 Theta). Network training total loss function at this time

Will combine the respective penalty functions from the two joint classifiers

And

can be expressed as

Where β is a network training hyper-parameter that needs to be set in advance to limit the combined classifier Fc _m-1 The influence of [ theta ] on the net prediction results, so that beta will be in [0,1]]Taking values. As can be seen from the ablation learning study, the value of β usually takes 1.

For the mini-batch training strategy, the optimization goal of the network is to make the network training process of each batch have a good balance

The average value of (c) is minimal.

The optimization process can be expressed as

Wherein, M is a preset mini-batch size, i.e. the number of images input to the network each time.

In the embodiment of the invention, each feature map after binary mask processing is endowed with a label formed by combining the same original label and an independent self-supervision 'pseudo label', so that a possibility guarantee is provided for a network to integrate all feature maps in a testing stage to predict the original label.

Thus, embodiments of the present invention provide a "single inference" result P _SI On the basis of (y | x), a more comprehensive 'aggregation reasoning' result is provided as a final prediction result of the network. And integrating the prediction results of all the joint classifiers output on the characteristic maps to predict the original class y of the input image x. The method can effectively improve the robustness and the generalization of the network, and simultaneously improves the fault tolerance of network prediction to a certain extent.

The process of sourcing the "aggregation inference" results is given here:

the "aggregate reasoning" result under a single joint softmax classifier can be expressed as:

where m indicates that the joint softmax classifier is preceded by the last convolutional layer group.

Because the embodiment of the invention uses two joint softmax classifiers to optimize the joint loss, the network can integrate respective 'aggregation inference' results of the two classifiers to obtain a more comprehensive 'aggregation inference' result. So far, the result P of the "aggregation inference" output by the network of the embodiment of the present invention _AG The way (y | x) is calculated can be expressed as:

wherein, P _m,AG (y | x) and P _m-1,AG (y | x) represents the independent "aggregate inference" results given by the joint softmax classifier connected to the last and penultimate convolutional layers, respectively.

In order to restore the original structure of the network, the embodiment of the invention utilizes the existing Self-Distillation learning method (Self-Distillation) to restore the added feature map processing module and softmax classifier in the network.

The network distills knowledge learned by self-training, so that a classification promoting effect similar to integrated learning is achieved, and an additional teacher network is not required to be introduced in the whole training process. In addition, in the method provided by the embodiment of the invention, the self-distillation learning method can be introduced to restore the original structure of the network and compress the network parameters in the testing stage, and meanwhile, the convolution mathematics in the original deep convolutional neural network achieves effective supervised learning characterization content.

After the self-distillation learning method is introduced into the method of the embodiment of the invention, the network outputs the aggregation inference result P through a knowledge migration mode _AG The information contained in (. |) is distilled into a completely new softmax classifier Fc (.;. mu.). At this time, it should be noted that the weight parameter μ in the new classifier Fc (-) μ should be the same size as the weight parameter ω in the classifier Fc (-) ω originally used for the N classification task, thereby achieving the goal of restoring the network classifier.

After the network training is completed by introducing self-distillation learning, the new classifier Fc (·;. mu.) should provide better test classification performance than the reference network without the network using binary mask-based feature maps and labels.

At this time, the optimization goal of the network in the training process is

Change to

Can be expressed as:

wherein the content of the first and second substances,

representing the Kullback-Leibler divergence, with which the classifier Fc (f; μ) can be approximated to the "aggregate inference" result P of the network output _AG (. x) toThe purpose of self-distillation learning is achieved.

Kullback-Leibler divergence

Can be expressed as:

probabilistic results output by additional classifiers introduced by self-distillation learning, referred to herein as "self-distillation" results P _SD (y | x), which can be expressed as:

FIG. 8 illustrates the implementation of the method of an embodiment of the present invention with three different output results ("Single inference" result P) _SI (y | x), "aggregated reasoning" result P _AG (y | x) and "self-distillation" result P _SD (y | x)).

The embodiment of the invention also provides a series of experiments to further embody the effectiveness of the image classification model training method.

First, the software and hardware configuration of the experiment and the training strategy of the network are described. In the embodiment of the invention, in order to ensure the objectivity and effectiveness of the experiment, the relevant experiment is completed by using uniform software and hardware configuration, and specific software and hardware configuration information is shown in table 1.

TABLE 1

The experiment uses a deep convolutional neural network to complete the image classification task, and the related network structures comprise ResNet, Wide ResNet and pyramidNet; the image datasets involved include the CIFAR-10/100 dataset, the Tiny-ImageNet dataset; fine-grained image dataset: stanford Dog/Car dataset, Caltech-UCSD libraries-200 and 2011(CUB200-2011) dataset.

All experiments are repeated for four times, then the average value and variance of each index are given, and the highest classification accuracy is represented by bold characters. Wherein the initial learning rate of the network is 0.1, the initial momentum is 0.9, and the initial weight decay is 10-4. For a typical image classification task, the network would be trained for 300 cycles with the batch size of the CIFAR-10/100 dataset set to 128 and the batch size of the Tiny-ImageNet dataset set to 256. For the fine-grained image classification task, the network would also be trained for 300 cycles, with the batch size of the Stanford Dog/Car dataset and the CUB200-2011 dataset set to 16. In addition, a random gradient descent strategy is uniformly adopted to optimize the network weight in the network training process. The learning rate of the network will be multiplied by a decay factor having a value of 0.1 when the network is trained to the 150 th and 225 th cycles.

Table 2 and table 3 show the classification performance of the method provided in the embodiment of the present invention on different data sets and different network structures, respectively, where the classification performance includes a reference network classification result and a prediction result of a network output after training by the method of the embodiment of the present invention, and the highest classification accuracy is represented by a bolded word. It should be noted that the data set generalization experiment uniformly uses the ResNet-110 network; the network structure generalization experiment uniformly uses a CIFAR-100 data set. In addition, the experiment incorporates the validation set data in the dataset into the test set due to the smaller number of test set data in the Tiny-ImageNet.

TABLE 2

TABLE 3

The experimental result shows that the method provided by the embodiment of the invention can obviously improve the supervised learning characterization capability of different networks on different data sets. Meanwhile, the self-distillation result can obtain higher classification accuracy than the aggregation inference result due to the gain effect of the self-distillation learning on the network.

Table 4 shows the results of the overlay experiment, including the classification accuracy of the reference network, and the classification accuracy of the network output after being trained by the method (Ours in the table), the existing regularization method, and the existing self-supervised learning method according to the embodiments of the present invention. The experiment selects the aggregation inference result output by the network to display.

TABLE 4

From the experimental results, when the method of the embodiment of the invention and the existing method are used in a superposition manner, the reference network obtains higher classification accuracy than the method of the embodiment of the invention used alone or the existing method used alone. Especially, when the self-supervision learning method SLA and the method of the embodiment of the invention are superposed on the ResNet-110, the 'aggregation inference' result output by the network can reach the highest classification accuracy 80.61% in the superposition experiment.

Experiments respectively apply the method of the embodiment of the invention and the SLA (self-supervision learning) method in the prior art to the ResNet-110 network, and perform an image classification task under a CIFAR-100 data set, so as to compare the classification performance with the calculation cost. Table 5 shows the comparison results of the GPU memory consumption of the image processor, the time consumed by the network training for each batch, and the classification accuracy. It should be noted that the classification accuracy of the SLA method and the method of the embodiment of the present invention in the table are both given by a "single inference" result.

TABLE 5

As can be seen from table 5, compared with 252.1% of extra GPU memory consumption and 258.1% of extra training time overhead caused by the SLA method, the method of the embodiment of the present invention only causes 10.1% of extra GPU memory consumption and 11.7% of extra training time overhead, and the accuracy gains caused by the two methods to the network are almost the same. Therefore, the method provided by the embodiment of the invention can effectively improve the network classification performance and simultaneously can not bring a large amount of additional training overhead to the network.

FIG. 9 illustrates regions of interest for target objects on different test samples from a reference network and a network trained by a method of an embodiment of the invention. It should be noted that the test samples were randomly selected from the Stanford Dog dataset, and the effect of the method according to the embodiment of the present invention can be objectively demonstrated.

Compared with a reference network, the network trained by the method provided by the embodiment of the invention has the advantages that the target object area is more accurately judged, and the classification performance of the network is improved. In addition, the visualization result shows that the method of the embodiment of the invention effectively relieves the coupling effect of the target area and the background area caused by shortcut learning, the characteristics of the target object are more clearly known in the network training process, and the network does not continuously pay attention to the background object area which appears in large amount in the data set. Therefore, the network reduces the dependence of the network on 'shortcut learning', and further realizes the decoupling of the target object region and the background redundant region.

In the embodiment of the invention, the ablation learning is mainly used for explaining the effectiveness of the hyper-parameter beta, and the optimal value range of the hyper-parameter beta is given through an ablation experiment. The hyper-parameter introduced by the method of the embodiment of the invention is used for limiting the output result of the joint classifier connected with the penultimate convolution layer, and the hyper-parameter beta is subjected to value selection in [0,1 ]. Therefore, the method provided by the embodiment of the invention is adopted to train the ResNet-110 network to classify the CIFAR-100 data set, the hyperparameter beta is sequentially valued in [0,1], and the influence brought by the hyperparameter beta is observed according to the variation trend of the network classification performance under different values.

Fig. 10 shows the results of the ablation learning experiment with respect to the hyper-parameter β. It can be seen that the method of the present example exhibits a very high tolerance to the hyper-parameter β, except for the case where β is 0.1. When β is 0.1, the joint classifier connected with the penultimate convolution layer is difficult to train. With the increase of the hyper-parameter beta, the classification performance of the network can be effectively improved. Therefore, to ensure the versatility of the method of the embodiment of the present invention, the value of the hyper-parameter β is 1 by default.

Through the experimental results, the beneficial effects of the image classification model training method provided by the embodiment of the invention compared with the prior art can be obviously seen.

The embodiment of the invention also provides an image classification method, which comprises the steps S201 to S202.

S201, acquiring an image to be classified.

S202, inputting the image to be classified into an image classification model, and predicting the type of the image to be classified.

The image classification model is obtained according to the image classification model training method in the embodiment of the method.

Further, an embodiment of the present invention also provides a schematic block diagram of a structure of the computer device 100, and referring to fig. 11, the computer device 100 may include a memory 110 and a processor 120.

The processor 120 may be a general-purpose Central Processing Unit (CPU), a microprocessor, an Application-Specific Integrated Circuit (ASIC), or one or more ics for controlling the image classification model training method provided by the above embodiments of the method and/or the program execution of the image classification method.

The MEMory 110 may be, but is not limited to, a ROM or other type of static storage device that can store static information and instructions, a RAM or other type of dynamic storage device that can store information and instructions, an Electrically Erasable programmable Read-Only MEMory (EEPROM), a compact disc Read-Only MEMory (CD-ROM) or other optical disk storage, optical disk storage (including compact disc, laser disc, optical disc, digital versatile disc, blu-ray disc, etc.), magnetic disk storage or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer. The memory 110 may be self-contained and coupled to the processor 120 via a communication bus. The memory 110 may also be integrated with the processor 120. Memory 110 is used to store, among other things, machine-executable instructions for performing aspects of the present application. Processor 120 is operative to execute machine executable instructions stored in memory 110 to implement the method embodiments described above.

Embodiments of the present invention also provide a computer-readable storage medium containing a computer program, which when executed, can be used to perform the image classification model training method provided in the above-mentioned method embodiments, and/or the related operations in the image classification method.

Referring to fig. 12, fig. 12 is a functional unit block diagram of an image classification model training apparatus 200 according to an embodiment of the present invention. The image classification model training apparatus 200 is applied to the computer device 100, and may include a first obtaining unit 201, a processing unit 202, and a training unit 203, where the first obtaining unit 201, the processing unit 202, and the training unit 203 may all be stored in a memory or a computer-readable storage medium in the form of software. It should be noted that the basic principle and the resulting technical effect of the image classification model training device 200 provided by the embodiment of the present invention are the same as those of the above-mentioned embodiments, and for the sake of brief description, no reference is made to the embodiment of the present invention.

A first obtaining unit 201, configured to obtain an original image and a type tag of the original image.

The processing unit 202 is used for inputting an original image into a pre-constructed image classification model, wherein the image classification model comprises a feature extraction network and a type prediction network, and the feature extraction network comprises N sequentially-connected convolution layers in series; extracting the features of the original image by using a feature extraction network to obtain a first feature map output by the Nth convolutional layer and a second feature map output by the (N-1) th convolutional layer; generating a first training set based on the first feature map, the type labels and a plurality of preset binary masks; generating a second training set based on the second feature map, the type label and the plurality of binary masks;

and the training unit 203 is configured to train the type prediction network by using the first training set and the second training set to obtain a trained image classification model.

In an implementation manner, the processing unit 202 is further specifically configured to process the first feature map by using a plurality of binary masks to obtain a plurality of first training samples, where the plurality of first training samples correspond to the plurality of binary masks one to one; and generating a label of each first training sample according to the plurality of binary masks and the type labels to obtain a first training set, wherein the first training set comprises the plurality of first training samples and the label of each first training sample.

In an implementation manner, the side length of each binary mask is the same as the side length of the first feature map, each binary mask includes a full 1 region and/or a full 0 region, and the processing unit 202 is further specifically configured to, for any one target binary mask in the plurality of binary masks, reserve a region in the first feature map that coincides with the full 1 region of the target binary mask, and erase the region in the first feature map that coincides with the full 0 region of the target binary mask, to obtain a first training sample corresponding to the target binary mask, when the processing unit is configured to process the first feature map using the plurality of binary masks, to obtain a plurality of first training samples; and traversing each binary mask to obtain a first training sample corresponding to each binary mask.

In an implementation manner, when the processing unit 202 is configured to generate a label of each first training sample according to a plurality of binary masks and type labels, the processing unit is further specifically configured to associate an identifier of a binary mask corresponding to each first training sample with a type label, so as to obtain a label of each first training sample.

In an implementation manner, the processing unit 202 is further specifically configured to process the second feature map by using a plurality of binary masks to obtain a plurality of second training samples, where the plurality of second training samples correspond to the plurality of binary masks one to one; and generating a label of each second training sample according to the plurality of binary masks and the type labels to obtain a second training set, wherein the second training set comprises the plurality of second training samples and the label of each second training sample.

In one implementation, the training unit 203 is further specifically configured to train the type prediction network with a first training set to obtain a first type prediction network; training the type prediction network by using a second training set to obtain a second type prediction network; and distilling the parameters of the first type of prediction network and the parameters of the second type of prediction network to obtain the parameters of the type of prediction network so as to finish the training of the image classification model.

Referring to fig. 13, fig. 13 is a functional unit block diagram of an image classification apparatus 300 according to an embodiment of the present invention. The image classification apparatus 300 applied to the computer device 100 may include a second obtaining unit 301 and a predicting unit 302, wherein the second obtaining unit 301 and the predicting unit 302 may be stored in a memory or a computer readable storage medium in a software form. It should be noted that the image classification apparatus 300 according to the embodiment of the present invention has the same basic principle and technical effect as those of the above-mentioned embodiments, and for the sake of brief description, no reference is made to the embodiment of the present invention.

A second obtaining unit 301, configured to obtain an image to be classified;

the prediction unit 302 is configured to input the image to be classified into the image classification model obtained by using the aforementioned image classification model training method, and predict the type of the image to be classified.

To sum up, the image classification model training method, the image classification method and the related device provided by the embodiment of the invention firstly obtain an original image and a type label of the original image; then, inputting an original image into a pre-constructed image classification model, wherein the image classification model comprises a feature extraction network and a type prediction network, and the feature extraction network comprises N sequentially connected convolution layers in series; then, feature extraction is carried out on the original image by utilizing a feature extraction network to obtain a first feature map output by the Nth convolutional layer and a second feature map output by the (N-1) th convolutional layer; then, generating a first training set based on the first feature map, the type label and a plurality of preset binary masks, and generating a second training set based on the second feature map, the type label and the plurality of binary masks; and finally, training the type prediction network by utilizing the first training set and the second training set to obtain a trained image classification model. According to the embodiment of the invention, a first training set is generated based on a first characteristic diagram output by the Nth convolutional layer, a type label and a plurality of preset binary masks, a second training set is generated based on a second characteristic diagram output by the Nth-1 th convolutional layer, the type label and the plurality of binary masks, and then the type prediction network is trained by utilizing the first training set and the second training set to obtain a trained image classification model, so that the convolution operation on original image data expanded by multiple times is avoided, and the extra overhead generated in the model training process is reduced.

The above description is only for the specific embodiments of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims

1. An image classification model training method, characterized in that the method comprises:

acquiring an original image and a type label of the original image;

2. The method of claim 1, wherein the step of generating a first training set based on the first feature map, the type label, and a preset plurality of binary masks comprises:

3. The method of claim 2, wherein a side length of each of the binary masks is the same as a side length of the first feature map, each of the binary masks including all-1 regions and/or all-0 regions;

4. The method of claim 2, wherein the step of generating a label for each of the first training samples based on the plurality of binary masks and the type label comprises:

5. The method of claim 1, wherein the step of generating a second training set based on the second feature map, the type labels, and the plurality of binary masks comprises:

6. The method of claim 1, wherein the step of training the type prediction network using the first training set and the second training set to obtain a trained image classification model comprises:

7. A method of image classification, the method comprising:

acquiring an image to be classified;

inputting the image to be classified into an image classification model obtained by the image classification model training method according to any one of claims 1 to 6, and predicting the type of the image to be classified.

8. An apparatus for training an image classification model, the apparatus comprising:

a processing unit to:

9. An image classification apparatus, characterized in that the apparatus comprises:

the second acquisition unit is used for acquiring the image to be classified;

a prediction unit, configured to input the image to be classified into an image classification model obtained by using the image classification model training method according to any one of claims 1 to 6, and predict a type of the image to be classified.

10. A computer device comprising a memory storing a computer program and a processor implementing the image classification model training method according to any one of claims 1 to 6 and/or the image classification method according to claim 7 when the computer program is executed.