CN111950643A - Model training method, image classification method and corresponding device - Google Patents

Model training method, image classification method and corresponding device

Info

Publication number: CN111950643A
Authority: CN (China)
Prior art keywords: image, sampling, neural network, classification, attention
Legal status: Granted
Application number: CN202010834837.1A
Other languages: Chinese (zh)
Other versions: CN111950643B (en)
Inventors: Qin Yongqiang (秦永强), Li Suying (李素莹), Song Liang (宋亮), Gao Dahui (高达辉)
Current Assignee: Innovation Wisdom Shanghai Technology Co., Ltd.
Original Assignee: Innovation Wisdom Shanghai Technology Co., Ltd.
Application filed by Innovation Wisdom Shanghai Technology Co., Ltd.
Priority to CN202010834837.1A
Publication of CN111950643A
Application granted; publication of CN111950643B
Legal status: Active

Classifications

    • G06F18/2415 Classification techniques relating to the classification model based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/25 Fusion techniques
    • G06N3/045 Combinations of networks
    • G06N3/084 Backpropagation, e.g. using gradient descent


Abstract

The application relates to the technical field of artificial intelligence, and provides a model training method, an image classification method and corresponding devices. The model training method comprises the following steps: inputting a training image into a first neural network for processing to obtain a first feature map; obtaining a first attention map based on the first feature map; non-uniformly sampling the training image according to the information of all channels and the information of a single channel in the first attention map, respectively, to obtain a first sampled image and a second sampled image; inputting the first sampled image into a second neural network for processing to obtain a first classification probability, and inputting the second sampled image into a third neural network for processing to obtain a second classification probability; and calculating a classification prediction loss according to the first classification probability and the second classification probability, and updating the parameters of each neural network according to the classification prediction loss. In this method the first attention map learns to automatically locate the key details required for classification without depending on annotation information, which saves training cost.

Description

Model training method, image classification method and corresponding device
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a model training method, an image classification method and a corresponding device.
Background
Fine-grained image classification refers to subdividing a coarse image category into finer sub-categories. Because the differences among the sub-categories are subtle, different classes can often be distinguished only by means of small local differences.
Currently, the vast majority of fine-grained image classification methods follow this general framework: first locate the foreground object and its local regions, then extract features from each local region separately, and finally train and run a classifier on the extracted features. When training such a model, in addition to the class label of the image, extra manual annotation such as the positions of local regions is often required, and this extra annotation is costly, time-consuming and labor-intensive to obtain.
Disclosure of Invention
An object of the embodiments of the present application is to provide a model training method, an image classification method and corresponding apparatuses, so as to address the above technical problems.
In order to achieve the above purpose, the present application provides the following technical solutions:
in a first aspect, an embodiment of the present application provides a model training method, including: inputting a training image into a first neural network for processing to obtain a first feature map output by the first neural network; obtaining a first attention map based on the first feature map, wherein the value of a pixel in the first attention map is positively correlated with the probability that the corresponding pixel in the training image is sampled; non-uniformly sampling the training image according to the information of all channels in the first attention map to obtain a first sampled image, and non-uniformly sampling the training image according to the information of a single channel in the first attention map to obtain a second sampled image; inputting the first sampled image into a second neural network for processing to obtain a first classification probability output by the second neural network, and inputting the second sampled image into a third neural network for processing to obtain a second classification probability output by the third neural network; and calculating a classification prediction loss according to the first classification probability and the second classification probability, and updating the parameters of the first neural network, the second neural network and the third neural network by using a back propagation algorithm according to the classification prediction loss.
In the method, the value of a pixel in the first attention map is positively correlated with the probability that the corresponding pixel in the training image is sampled. Therefore, when the training image is non-uniformly sampled according to the first attention map, regions with larger pixel values in the first attention map (i.e., regions where attention is concentrated) are assigned more sampling points, and their influence on the classification prediction result is more significant.
Further, the first attention map is not preset but is computed by the first neural network, and the first neural network continuously adjusts its parameters according to the classification prediction result during training, so that regions with large pixel values in the first attention map gradually fall onto the key regions of the training image that are beneficial to classification. That is to say, as training deepens, the first attention map can gradually locate the details that play a key role in correctly classifying the training image; this detail-locating capability is generated automatically through learning without depending on additional annotation information, which saves training cost, improves training efficiency and improves the practicability of the method.
In addition, the image classification network used in the above method can be regarded as comprising two branch networks. The global branch network generates the first classification probability based on the first sampled image; since the first sampled image is obtained by non-uniformly sampling the training image according to the information of all channels in the first attention map, the global contour information of the training image is retained in it. The local branch network generates the second classification probability based on the second sampled image; since the second sampled image is obtained by non-uniformly sampling the training image according to the information of a single channel in the first attention map, the local detail information of the training image is retained in it. When the prediction loss is finally calculated, the method considers the first classification probability and the second classification probability simultaneously, which is equivalent to blending the local detail information extracted by the local branch network, which contributes to image classification, into the global branch network through knowledge distillation. That is, the information in the image is fully and comprehensively utilized for classification, so the trained image classification network has better performance.
It should be noted that the trained image classification network can be used to perform both fine-grained image classification tasks and general image classification tasks.
In an implementation manner of the first aspect, the non-uniformly sampling the training image according to information of all channels in the first attention map to obtain a first sampled image includes: performing average pooling on all channels in the first attention map to obtain an average attention map; sampling the training image with a first non-uniform sampling function according to the average attention map to obtain the first sampled image.
In the above implementation, average pooling averages the pixel values of all channels of the first attention map at each position; the resulting average attention map fuses the information of every channel of the first attention map and can therefore reflect the attention distribution in the training image as a whole. Sampling the training image according to the average attention map thus yields a first sampled image that retains the global contour information of the training image.
In an implementation manner of the first aspect, the sampling the training image with a first non-uniform sampling function according to the average attention map to obtain the first sampled image includes: calculating the first sampled image according to the following formula:

$$I_s(i,j) = S(I, A(M))(i,j) = I\left(F_w^{-1}(i),\ F_h^{-1}(j)\right),\qquad 1 \le i \le w,\ 1 \le j \le h$$

wherein $I_s$ represents the first sampled image, $S$ represents the first non-uniform sampling function, $M$ represents the first attention map, $A(M)$ represents the average attention map, $I$ represents the training image, $w$ represents the width of the training image, $h$ represents the height of the training image, $i$ represents a pixel index in the $w$ direction, $j$ represents a pixel index in the $h$ direction, and $F_w^{-1}$ and $F_h^{-1}$ are respectively the inverse functions of

$$F_w(m) = \int_1^m \sum_{n=1}^{h} A(M)(x, n)\, dx,\qquad 1 \le m \le w,$$

$$F_h(n) = \int_1^n \sum_{m=1}^{w} A(M)(m, y)\, dy,\qquad 1 \le n \le h,$$

wherein $F_w$ denotes the integral of $A(M)$ in the $w$ direction and $F_h$ denotes the integral of $A(M)$ in the $h$ direction.
In one implementation manner of the first aspect, the non-uniformly sampling the training image according to information of a single channel in the first attention map to obtain a second sampled image includes: randomly selecting one channel from all channels of the first attention map; sampling the training image with a second non-uniform sampling function according to the selected channel to obtain the second sampled image; and re-performing the random channel selection every preset training period.
In one implementation manner of the first aspect, the non-uniformly sampling the training image according to information of a single channel in the first attention map to obtain a second sampled image includes: selecting one channel from all channels of the first attention map according to a predetermined order; sampling the training image with a second non-uniform sampling function according to the selected channel to obtain the second sampled image; and re-performing the channel selection according to the predetermined order every preset training period, wherein the predetermined order is an arrangement order of all channels in the first attention map.
In both implementations, a single channel of the first attention map is used to sample the training image; since each channel represents one visual pattern, the resulting second sampled image retains the local detail information of the training image for that attention channel.
Two ways of selecting a channel from the first attention map are provided above: random selection and selection in a predetermined order. In either way, when the training time is long enough, all channels of the first attention map are traversed, i.e., local detail information at different levels is eventually extracted and used to train the image classification network. In addition, only one channel of the first attention map is selected for sampling in each training step, which reduces the amount of computation and improves training efficiency.
In one implementation manner of the first aspect, the obtaining a first attention map based on the first feature map includes: calculating the first attention map according to the relations among the channels in the first feature map.
In the implementation manner, since the information of each channel in the first feature map is fused when the first attention map is calculated, the calculated first attention map can more effectively reflect the attention distribution in the training image.
In an implementation manner of the first aspect, the calculating a classification prediction loss according to the first classification probability and the second classification probability includes: calculating a first loss according to the first classification probability and a label of the training image, and calculating a second loss according to the first classification probability and the second classification probability; wherein the first loss characterizes a difference between the predicted classification result of the second neural network and a true classification result, and the second loss characterizes a difference between the predicted classification result of the second neural network and a predicted classification result of the third neural network; and carrying out weighted summation on the first loss and the second loss to obtain the classified prediction loss.
In the above implementation, the total classification prediction loss is obtained by weighted summation of a first loss and a second loss, the first loss is a traditional classification prediction loss, and training based on the first loss can make the classification result predicted by the second neural network close to the real classification result; the second loss is a loss newly proposed by the application and is used for fusing the local detail information extracted by the local branch network into the global branch network, and training based on the second loss can enable the classification result predicted by the third neural network to be close to the classification result predicted by the second neural network.
In a second aspect, an embodiment of the present application provides an image classification method, including: inputting an image to be classified into a first neural network for processing to obtain a second feature map output by the first neural network; obtaining a second attention map based on the second feature map, wherein the value of a pixel in the second attention map is positively correlated with the probability that the corresponding pixel in the image to be classified is sampled; non-uniformly sampling the image to be classified according to the information of all channels in the second attention map to obtain a third sampled image, and non-uniformly sampling the image to be classified according to the information of a single channel in the second attention map to obtain a fourth sampled image; inputting the third sampled image into a second neural network for processing to obtain a third classification probability output by the second neural network, and inputting the fourth sampled image into a third neural network for processing to obtain a fourth classification probability output by the third neural network; and determining the final classification result of the image to be classified according to the third classification probability and the fourth classification probability.
In the method, the image classification network provided by the first aspect is used for classifying the image to be classified, so that the second attention map acquired based on the trained first neural network can automatically locate the key details related to classification in the image to be classified, and the non-uniform sampling of the image to be classified based on the second attention map can strengthen the key details in the obtained sampled image.
Since the third sampled image is obtained by non-uniformly sampling the image to be classified according to the information of all channels in the second attention map, the global contour information of the image to be classified is retained in the third sampled image; since the fourth sampled image is obtained by non-uniformly sampling the image to be classified according to the information of a single channel in the second attention map, the local detail information of the image to be classified is retained in the fourth sampled image.
When the final classification result of the image to be classified is determined, the third classification probability predicted from the third sampled image and the fourth classification probability predicted from the fourth sampled image are considered simultaneously, i.e., both the key global contour information and the key local detail information are taken into account, so the classification precision is high, and the method is very suitable for performing fine-grained image classification tasks.
In a third aspect, an embodiment of the present application provides a model training apparatus, including: a first feature map acquisition module, configured to input a training image into a first neural network for processing to obtain a first feature map output by the first neural network; a first attention map acquisition module, configured to obtain a first attention map based on the first feature map, wherein the value of a pixel in the first attention map is positively correlated with the probability that the corresponding pixel in the training image is sampled; a first sampling module, configured to non-uniformly sample the training image according to the information of all channels in the first attention map to obtain a first sampled image, and to non-uniformly sample the training image according to the information of a single channel in the first attention map to obtain a second sampled image; a first classification prediction module, configured to input the first sampled image into a second neural network for processing to obtain a first classification probability output by the second neural network, and to input the second sampled image into a third neural network for processing to obtain a second classification probability output by the third neural network; and a parameter updating module, configured to calculate a classification prediction loss according to the first classification probability and the second classification probability, and to update the parameters of the first neural network, the second neural network and the third neural network by using a back propagation algorithm according to the classification prediction loss.
In a fourth aspect, an embodiment of the present application provides an image classification apparatus, including: a second feature map acquisition module, configured to input an image to be classified into a first neural network for processing to obtain a second feature map output by the first neural network; a second attention map acquisition module, configured to obtain a second attention map based on the second feature map, wherein the value of a pixel in the second attention map is positively correlated with the probability that the corresponding pixel in the image to be classified is sampled; a second sampling module, configured to non-uniformly sample the image to be classified according to the information of all channels in the second attention map to obtain a third sampled image, and to non-uniformly sample the image to be classified according to the information of a single channel in the second attention map to obtain a fourth sampled image; a second classification prediction module, configured to input the third sampled image into a second neural network for processing to obtain a third classification probability output by the second neural network, and to input the fourth sampled image into a third neural network for processing to obtain a fourth classification probability output by the third neural network; and a classification result acquisition module, configured to determine the final classification result of the image to be classified according to the third classification probability and the fourth classification probability.
In a fifth aspect, an embodiment of the present application provides a computer-readable storage medium, where the computer-readable storage medium stores computer program instructions, and when the computer program instructions are read and executed by a processor, the computer program instructions perform the method provided by the first aspect or any one of the possible implementation manners of the first aspect.
In a sixth aspect, an embodiment of the present application provides an electronic device, including: a memory in which computer program instructions are stored, and a processor, where the computer program instructions are read and executed by the processor to perform the method provided by the first aspect or any one of the possible implementation manners of the first aspect.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are required to be used in the embodiments of the present application will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present application and therefore should not be considered as limiting the scope, and that those skilled in the art can also obtain other related drawings based on the drawings without inventive efforts.
FIG. 1 is a flow chart illustrating a model training method provided by an embodiment of the present application;
FIG. 2 is a block diagram illustrating an image classification network used in a model training method according to an embodiment of the present disclosure;
FIG. 3 is a flow chart of an image classification method provided by an embodiment of the present application;
FIG. 4 is a block diagram of a model training apparatus provided in an embodiment of the present application;
FIG. 5 is a block diagram of an image classification apparatus according to an embodiment of the present application;
fig. 6 shows a schematic diagram of an electronic device provided in an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described below with reference to the drawings in the embodiments of the present application. It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures. The terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element. The terms "first," "second," "third," and the like are used solely to distinguish one from another and are not to be construed as indicating or implying relative importance.
Fig. 1 is a flowchart illustrating a model training method provided in an embodiment of the present application, and fig. 2 is a block diagram illustrating an image classification network that can be used in the model training method of fig. 1, and will be described with reference to fig. 2 when describing steps of the method of fig. 1. The model training method in fig. 1 may be, but is not limited to, performed by an electronic device, and fig. 6 shows a possible structure of the electronic device, which is described in detail with reference to fig. 6 later. Referring to fig. 1, the method includes:
step S110: and inputting the training image into the first neural network for processing to obtain a first characteristic diagram output by the first neural network.
In the solution proposed in the present application, the type of the neural networks (including the first neural network, the second neural network, and the third neural network) is not limited; each may be, for example, a Convolutional Neural Network (CNN), a Recurrent Neural Network (RNN), a Deep Neural Network (DNN), or the like. Convolutional neural networks are the most common choice in the field of image processing; a convolutional neural network contains at least convolutional layers and may further include pooling layers, fully connected layers, and the like.
The training images may refer to images in a training set, and when training the image classification network, a batch (batch) training method may be adopted, that is, a batch of images in the training set are input into the image classification network for training each time, but for simplicity, the case where each batch includes only one training image is taken as an example when the scheme of the present application is introduced, and the case where each batch includes multiple training images is similar.
After the training image is input to the first neural network, feature extraction is carried out through the first neural network to obtain a first feature map. For example, let the training image be I, the first feature map be X, then the dimension of I may be h × w (h is the height of I, and w is the width of I), and the dimension of X may be c × h × w (c is the number of channels of X).
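As an illustration, a minimal sketch of step S110 is given below, assuming a PyTorch convolutional backbone; the patent does not fix a particular architecture, so the truncated ResNet-18 used here is purely an assumption.

```python
import torch
import torchvision.models as models

# First neural network: any feature extractor producing a c x h x w map works;
# here a ResNet-18 with its pooling and classification head removed (assumption).
# Note: the patent writes h x w for both I and X, but in practice the spatial
# size of X is downsampled relative to the input image.
backbone = torch.nn.Sequential(*list(models.resnet18(weights=None).children())[:-2])

image = torch.randn(1, 3, 224, 224)  # a single training image I (batch of one)
feature_map = backbone(image)        # first feature map X
print(feature_map.shape)             # torch.Size([1, 512, 7, 7]), i.e. c x h x w
```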
Step S120: a first attention map is obtained based on the first feature map.
The first attention map reflects the attention distribution in the training image, and the more concentrated the attention distribution is, the more likely the region contains the key information (e.g., small details for distinguishing two similar commodities) required for classifying the image, and the more the region should be focused by the image classification model. In the first attention map, the degree of concentration of such an attention distribution is characterized by the size of the pixel value (larger pixel values indicate more concentrated attention distribution, whereas less concentrated distribution).
On the other hand, the value of a pixel in the first attention map is positively correlated with the probability that the corresponding pixel in the training image is sampled, so when the training image is non-uniformly sampled according to the first attention map (the specific sampling process is described in step S130), regions with larger pixel values in the first attention map (i.e., regions where attention is concentrated) are assigned more sampling points, and their influence on the classification prediction result is more significant. Such a sampling method is suitable for image classification: as mentioned above, regions where attention is concentrated are more likely to contain the key information required for classifying the image, and giving them a larger proportion of the sampling result helps to enhance that key information and thereby improve the classification result, while regions where attention is sparse are less helpful for classification and should occupy a smaller proportion of the sampling result, or even not be sampled at all.
Of course, the first attention map is not inherently capable of locating the key information required for classification; this capability is obtained through the continual training of the first neural network. The first neural network continuously adjusts its parameters according to the classification prediction result during training (see step S150), so that the regions with larger pixel values in the first attention map gradually fall onto the key regions of the training image that are beneficial to classification. That is, as training progresses, the first attention map is increasingly able to locate the details that are critical to correctly classifying the training image. It should be noted that this detail-locating capability of the first attention map is generated automatically through learning without depending on additional annotation information, which saves training cost, improves training efficiency and improves the practicability of the training method.
The first attention map has the same dimensions as the first feature map. Denote the first attention map as M; the dimension of M is also c × h × w, while the dimension of the training image is h × w, so each pixel in M corresponds to the pixel at the same position in the training image. In some implementations, the first feature map may be used directly as the first attention map; in other implementations, the first attention map may be generated by operating on the first feature map: for example, the first attention map may be calculated according to the relations among the channels in the first feature map. This implementation is advantageous in that the information of the channels of the first feature map is fused when the first attention map is calculated, so that the calculated first attention map reflects the attention distribution in the training image more effectively. A specific example is given below:
$$M = (XX^T)X$$

where $T$ represents the transpose of a matrix. The term $XX^T$ embodies the relations among the channels of the first feature map; since the formula uses only $X$ itself, the matrix operation constitutes a self-attention mechanism.
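This computation can be sketched as follows, under the assumption that X is flattened to a c × (h·w) matrix so that XX^T becomes the c × c inter-channel relation matrix; the patent leaves the exact matrix shapes implicit.

```python
import torch

def first_attention_map(x: torch.Tensor) -> torch.Tensor:
    """Compute M = (X X^T) X for a single c x h x w feature map."""
    c, h, w = x.shape
    flat = x.reshape(c, h * w)          # X as a c x (h*w) matrix (assumption)
    relation = flat @ flat.T            # X X^T: c x c inter-channel relations
    m = relation @ flat                 # (X X^T) X, still c x (h*w)
    return m.reshape(c, h, w)           # first attention map M, c x h x w

M = first_attention_map(torch.rand(512, 7, 7))
```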
Step S130: non-uniformly sampling the training image according to the information of all channels in the first attention map to obtain a first sampled image; and non-uniformly sampling the training image according to the information of a single channel in the first attention map to obtain a second sampled image.
Step S130 is divided into two sub-steps, the first sub-step is to sample and obtain a first sampled image, and the second sub-step is to sample and obtain a second sampled image. The two sub-steps may be executed in parallel, without limitation to the order of execution, and the first sub-step of step S130 is described below:
in some implementations, all channels of the first attention map are average-pooled; average pooling here means that the pixel values of all channels of the first attention map at the same position are averaged, so after average pooling the plurality of channels is merged into a single channel, called the average attention map. A pixel value in the average attention map is likewise positively correlated with the probability that the corresponding pixel in the training image is sampled.
Then, the training image is sampled with a first non-uniform sampling function according to the average attention map to obtain the first sampled image. Non-uniform sampling means that more sampling points are assigned to regions with larger pixel values in the average attention map (i.e., regions of the training image where attention is concentrated), and fewer or even no sampling points are assigned to regions with smaller pixel values (i.e., regions where attention is sparse).
The first non-uniform sampling function is any function that implements the non-uniform sampling described above. Denoting the first non-uniform sampling function as S and the first sampled image as Is, we have:

$$I_s = S(I, A(M))$$

where I represents the training image, M represents the first attention map, and A(·) represents average pooling (so A(M) represents the average attention map). The present application does not limit the specific form of the first non-uniform sampling function; a specific example is given below.
In one implementation, the first sampled image may be obtained by calculation using the following formula:

$$I_s(i,j) = I\left(F_w^{-1}(i),\ F_h^{-1}(j)\right),\qquad 1 \le i \le w,\ 1 \le j \le h$$

where $w$ denotes the width of the training image, $h$ denotes the height of the training image, $i$ denotes the pixel index in the $w$ direction, $j$ denotes the pixel index in the $h$ direction, and $F_w^{-1}$ and $F_h^{-1}$ are respectively the inverse functions of

$$F_w(m) = \int_1^m \sum_{n=1}^{h} A(M)(x, n)\, dx,\qquad 1 \le m \le w,$$

$$F_h(n) = \int_1^n \sum_{m=1}^{w} A(M)(m, y)\, dy,\qquad 1 \le n \le h,$$

where $F_w$ denotes the integral of $A(M)$ in the $w$ direction and $F_h$ denotes the integral of $A(M)$ in the $h$ direction.
In the above implementation, S can be regarded as a mapping from I to Is. Note that although Is has the same dimensions as I, namely h × w, pixels at certain positions in I are sampled repeatedly; these are the positions where the pixel values of A(M) are larger, i.e., where attention is concentrated.
Since the average attention map fuses the information of each channel of the first attention map, it reflects the attention distribution in the training image as a whole. Sampling the training image according to the average attention map therefore yields a first sampled image that effectively retains the global contour information of the training image, and according to the foregoing analysis of the first attention map, this global contour information is key information for image classification.
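The sampling described above can be sketched as follows; this NumPy version assumes the attention marginals are normalized so that F_w and F_h span the pixel ranges (a detail the formulas leave implicit), and is an illustration rather than the patent's exact discretization.

```python
import numpy as np

def sample_first_image(I: np.ndarray, M: np.ndarray) -> np.ndarray:
    """Non-uniformly sample image I (h x w) using attention map M (c x h x w)."""
    A = M.mean(axis=0)                      # average attention map A(M)
    h, w = A.shape
    # Marginal mass of A(M) along each axis, normalized to sum to 1 (assumption).
    col_mass = A.sum(axis=0) / A.sum()      # integral "in the w direction"
    row_mass = A.sum(axis=1) / A.sum()      # integral "in the h direction"
    # Cumulative distributions F_w, F_h, scaled to pixel coordinates.
    F_w = np.cumsum(col_mass) * w
    F_h = np.cumsum(row_mass) * h
    # Inverse mapping: output index i picks source column F_w^{-1}(i), etc.
    src_cols = np.clip(np.searchsorted(F_w, np.arange(1, w + 1)), 0, w - 1)
    src_rows = np.clip(np.searchsorted(F_h, np.arange(1, h + 1)), 0, h - 1)
    # F rises steeply where attention is concentrated, so many output pixels
    # map into those regions, i.e. they receive more sampling points.
    return I[np.ix_(src_rows, src_cols)]

Is = sample_first_image(np.random.rand(224, 224), np.random.rand(16, 224, 224))
```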
It is to be understood that the way of fusing the multi-channel information of the first attention map is not limited to average pooling; it may also be, for example, direct summation over the channels, max pooling, etc.
The second substep of step S130 is described below:
First, one channel is selected from the plurality of channels of the first attention map; the value of a pixel in this channel is likewise positively correlated with the probability that the corresponding pixel in the training image is sampled. Then, the training image is sampled with a second non-uniform sampling function according to the selected channel to obtain the second sampled image. Since each channel represents one visual pattern, the resulting second sampled image retains the local detail information of the training image for that attention channel, and according to the foregoing analysis of the first attention map, this local detail information is key information for image classification.
The second non-uniform sampling function may be the same as or different from the first non-uniform sampling function, and will not be described in detail. The following illustrates how to select a channel for sampling from the plurality of channels in the first attention map, including at least the following two approaches:
(1) Randomly selecting one channel from all channels of the first attention map for obtaining the second sampled image, and re-performing the random channel selection every preset training period.
If the second non-uniform sampling function is also denoted as S and the second sampled image is denoted as Id, then:
$$I_d = S(I, R(M))$$
where I represents a training image, M represents a first attention map, and R (·) represents a random selection of one channel from among a plurality of channels of the image (R (M) represents a random selection of one channel from among a plurality of channels of the first attention map).
The preset training period may be a preset duration, a preset number of steps, or a preset number of rounds in the training process. For example, in one implementation, the channel used for sampling in the first attention map is randomly re-selected once per training round (one round meaning that all images in the training set have participated in training once). The reason for re-selecting the channel is that sampling based on different channels extracts local detail information at different levels of the training image; when the training time is long enough, all channels of the first attention map will have been traversed, i.e., the local detail information corresponding to every channel will have been extracted and used to train the image classification network.
(2) Selecting one channel from all channels of the first attention map in a predetermined order for obtaining the second sampled image, and performing the channel selection again in the predetermined order every preset training period.
Here the predetermined order is an arrangement order of all channels in the first attention map, so that all channels of the first attention map are traversed when the training time is long enough. For example, in one implementation, the channel used for sampling in the first attention map is re-selected once per training round, and the selection proceeds sequentially: the first channel is selected for the first training round, the second channel for the second training round, and so on.
In either approach (1) or approach (2), only one channel of the first attention map is selected for sampling in each training step (a step here may refer to training with one batch of training images), and the channel is re-selected every preset training period, which reduces the amount of computation required for extracting local detail information and improves training efficiency.
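Both selection schemes can be expressed in a few lines; treating one training round (epoch) as the preset training period, as in the examples above, is an assumption:

```python
import random

def select_channel(num_channels: int, epoch: int, random_choice: bool) -> int:
    """Pick the attention channel used for the second sampled image this period."""
    if random_choice:                   # approach (1): random re-selection
        return random.randrange(num_channels)
    return epoch % num_channels         # approach (2): predetermined cyclic order
```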
Referring to fig. 2, the image classification network can be divided into two branch networks: a global branch network and a local branch network. The global branch network non-uniformly samples the training image with the first non-uniform sampling function according to the information of all channels in the first attention map to obtain the first sampled image containing global contour information, and then performs classification prediction on the first sampled image. The local branch network non-uniformly samples the training image with the second non-uniform sampling function according to the information of a single channel in the first attention map to obtain the second sampled image containing local detail information, and then performs classification prediction on the second sampled image.
Step S140: inputting the first sampling image into a second neural network for processing to obtain a first classification probability output by the second neural network; and inputting the second sampling image into a third neural network for processing to obtain a second classification probability output by the third neural network.
The present application does not limit the architecture of the second neural network; it may be, for example, VGG, ResNet, GoogLeNet, or the like. The end of the second neural network may include a fully connected layer and a softmax classifier, so that after the first sampled image is processed by the second neural network, a first classification probability is output. The first classification probability is a vector whose elements are the probability values of the individual classes, and it also represents the classification result predicted by the second neural network (the class corresponding to the largest element of the vector may be taken as the predicted class). The third neural network is similar to the second neural network and will not be described again.
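As a sketch, the second (or third) neural network could be assembled as below; the BranchClassifier name and the global-pooling step are illustrative assumptions, since the patent only requires a backbone followed by a fully connected layer and softmax.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BranchClassifier(nn.Module):
    """Second/third neural network: backbone -> fully connected layer -> softmax."""
    def __init__(self, backbone: nn.Module, feat_dim: int, num_classes: int):
        super().__init__()
        self.backbone = backbone
        self.fc = nn.Linear(feat_dim, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feats = self.backbone(x)                            # c x h x w features
        feats = F.adaptive_avg_pool2d(feats, 1).flatten(1)  # global pooling (assumption)
        return torch.softmax(self.fc(feats), dim=1)         # classification probability
```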
Step S150: and calculating the classification prediction loss according to the first classification probability and the second classification probability, and updating parameters of the first neural network, the second neural network and the third neural network by using a back propagation algorithm according to the classification prediction loss.
The present application does not specifically limit how the classification prediction loss is calculated from the first and second classification probabilities, but it does require that both probabilities be considered simultaneously. As mentioned above, the first sampled image retains the global contour information of the training image, so the first classification probability that the second neural network predicts from it represents global information; the second sampled image retains the local detail information of the training image, so the second classification probability that the third neural network predicts from it represents local information. Considering both factors when calculating the classification prediction loss is therefore equivalent to fusing, through knowledge distillation, the local detail information extracted by the local branch network with the global contour information extracted by the global branch network. That is, the key information in the image that is helpful for classification is fully and comprehensively utilized, and the trained image classification network has better performance.
In some implementations, the total classification prediction loss can be obtained as a weighted sum of a first loss and a second loss (shown in fig. 2 by the arrows pointing to the classification prediction loss). The first loss is calculated from the first classification probability and the label of the training image (shown in fig. 2 by the arrow pointing to the first loss); it is the traditional classification prediction loss and characterizes the difference between the classification result predicted by the second neural network and the real classification result, so training based on the first loss makes the classification result predicted by the second neural network approach the real classification result. For example, if the first classification probability is denoted qs and the label of the training image is denoted y (one-hot coding can be adopted), the first loss can be written as L0(qs, y); for the specific expression of L0, reference may be made to the prior art.
The second loss is calculated from the first classification probability and the second classification probability (shown in fig. 2 by the arrow pointing to the second loss). This loss is newly proposed in the present application; it characterizes the difference between the classification result predicted by the second neural network and that predicted by the third neural network, and training based on the second loss makes the classification result predicted by the third neural network approach that predicted by the second neural network. For example, if the second classification probability is denoted qd, the second loss may be written as L1(qs, qd); if cross-entropy is adopted for L1, the specific calculation formula of the second loss is:
$$L_1(q_s, q_d) = -\sum_{k=1}^{N} q_s^{(k)} \log q_d^{(k)}$$
where N represents the total number of categories, and qd and qs are both N-dimensional vectors (with $q_s^{(k)}$ and $q_d^{(k)}$ their k-th elements). It is understood that L1 may also employ another loss function; cross-entropy is not required. With the above notation, the total classification prediction loss can be calculated as:
$$L = L_0(q_s, y) + \alpha\, L_1(q_s, q_d)$$
where α is a weighting coefficient; α = 1, i.e., directly summing L0 and L1, can be regarded as a special case of the weighted summation. Through this weighted summation, the local detail information extracted by the local branch network is fused into the global branch network; the process can also be regarded as knowledge distillation. It is understood that in other implementations the classification prediction loss may include loss terms besides the first loss and the second loss, for example a third loss calculated based on the second classification probability and the label of the training image.
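Putting the pieces together, the classification prediction loss can be sketched as below, with cross-entropy used for both terms as in the example above; using qs as the target of L1, so that the third network is pulled toward the second, follows the text, while the exact gradient treatment is not specified by the patent.

```python
import torch

def classification_prediction_loss(qs, qd, y_onehot, alpha=1.0, eps=1e-12):
    """L = L0(qs, y) + alpha * L1(qs, qd), both cross-entropies (assumption)."""
    L0 = -(y_onehot * torch.log(qs + eps)).sum(dim=1).mean()  # first loss: vs. label
    L1 = -(qs * torch.log(qd + eps)).sum(dim=1).mean()        # second loss: distillation
    return L0 + alpha * L1
```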
As for updating the parameters of the neural networks with the back propagation algorithm in step S150, reference may be made to the prior art, which is not described here. It should be noted that the image classification network includes not only the first, second and third neural networks; the sampling structures are also part of the network (the solid-line blocks in fig. 2 can all be regarded as components of the network). However, since the sampling structures have no parameters that need updating, only the parameter updates of the first, second and third neural networks are mentioned here. It is understood that if the image classification network further includes other parts requiring parameter updates, those parameters may also be updated when step S150 is performed.
Steps S110 to S150 are repeated during training until a training end condition is satisfied. The training end condition may be one or more of: the image classification network converging, training for a preset duration, training for a preset number of rounds, and the like. The trained image classification network may be used to perform both fine-grained image classification tasks and general image classification tasks, which is not limited by this application.
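An end-to-end pass over steps S110 to S150 might then look like the following sketch, in which every dependency (the three networks, the attention and sampling helpers, the loss) is an injected placeholder rather than an API fixed by the patent; note that in practice the sampling step must be implemented differentiably (e.g., via a sampling grid) for gradients to reach the first neural network.

```python
def train_one_epoch(net1, net2, net3, train_loader, attention_fn,
                    sample_all_channels, sample_one_channel, loss_fn,
                    optimizer, epoch):
    """One pass of steps S110-S150 over the training set (schematic sketch)."""
    for image, y_onehot in train_loader:
        X = net1(image)                            # S110: first feature map
        M = attention_fn(X)                        # S120: first attention map
        Is = sample_all_channels(image, M)         # S130: first sampled image
        Id = sample_one_channel(image, M, epoch)   # S130: second sampled image
        qs, qd = net2(Is), net3(Id)                # S140: classification probabilities
        loss = loss_fn(qs, qd, y_onehot)           # S150: classification prediction loss
        optimizer.zero_grad()
        loss.backward()                            # back propagation
        optimizer.step()                           # update net1, net2 and net3
```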
Fig. 3 is a flowchart illustrating an image classification method provided in an embodiment of the present application; the image classification network adopted in the method is obtained by training with the model training method provided above. The image classification method in fig. 3 may be, but is not limited to being, performed by an electronic device; fig. 6 shows a possible structure of the electronic device, described in detail below. Referring to fig. 3, the method includes:
step S210: and inputting the image to be classified into the first neural network for processing to obtain a second characteristic diagram output by the first neural network.
Step S220: obtaining a second attention map based on the second feature map, wherein the value of a pixel in the second attention map is positively correlated with the probability that the corresponding pixel in the image to be classified is sampled.
The above two steps are similar to steps S110 and S120, and reference may be made to the corresponding contents, which are not repeated.
Step S230: non-uniformly sampling the image to be classified according to the information of all channels in the second attention map to obtain a third sampled image, and non-uniformly sampling the image to be classified according to the information of a single channel in the second attention map to obtain a fourth sampled image.
Step S230 is similar to step S130, and reference may be made to the corresponding contents above. It should be noted, however, that in the trained image classification network, whichever channel of the second attention map is selected for sampling to obtain the fourth sampled image, the influence on the final classification result is small. Therefore, in step S230, the channel used for sampling in the second attention map may be chosen randomly or fixed to one channel (e.g., the first channel).
Step S240: inputting the third sampled image into the second neural network for processing to obtain a third classification probability output by the second neural network, and inputting the fourth sampled image into the third neural network for processing to obtain a fourth classification probability output by the third neural network.
Step S240 is similar to step S140, and reference may be made to the corresponding contents above.
Step S250: and determining the final classification result of the image to be classified according to the third classification probability and the fourth classification probability.
In some implementations, the mean of the third classification probability and the fourth classification probability (the element-wise mean of the two vectors) may be calculated, and the category corresponding to the largest element of the resulting mean vector is taken as the final classification result of the image to be classified. Of course, in other implementations, the third and fourth classification probabilities may instead be summed or weighted-summed element-wise, with the category corresponding to the largest element of the resulting vector taken as the final classification result, and so on.
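A sketch of the mean-fusion variant of step S250 described above, assuming the two probabilities are given as vectors:

```python
import torch

def final_classification(q3: torch.Tensor, q4: torch.Tensor) -> int:
    """Fuse the third and fourth classification probabilities by their mean."""
    fused = (q3 + q4) / 2              # element-wise mean of the two vectors
    return int(torch.argmax(fused))    # category with the largest fused value
```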
According to the image classification method, the image classification network trained by the model training method is adopted for classifying the images to be classified, so that the second attention map acquired based on the trained first neural network can automatically locate the key details related to classification in the images to be classified, and the non-uniform sampling of the images to be classified based on the second attention map can strengthen the key details in the obtained sampling images.
Since the third sampled image is obtained by non-uniformly sampling the image to be classified according to the information of all channels in the second attention map, the global contour information of the image to be classified is retained in the third sampled image; since the fourth sampled image is obtained by non-uniformly sampling the image to be classified according to the information of a single channel in the second attention map, the local detail information of the image to be classified is retained in the fourth sampled image. When determining the final classification result, the image classification method considers the third classification probability predicted from the third sampled image and the fourth classification probability predicted from the fourth sampled image simultaneously, i.e., it takes both the key global contour information and the key local detail information of the image to be classified into account, so the classification precision is high, and the method is very suitable for performing fine-grained image classification tasks.
Fig. 4 shows a functional block diagram of a model training apparatus 300 according to an embodiment of the present application. Referring to fig. 4, the model training apparatus 300 includes:
a first feature map obtaining module 310, configured to input a training image to a first neural network for processing, and obtain a first feature map output by the first neural network;
a first attention map obtaining module 320, configured to obtain a first attention map based on the first feature map, where a value of a pixel in the first attention map is positively correlated with a probability that a corresponding pixel in the training image is sampled;
a first sampling module 330, configured to perform non-uniform sampling on the training image according to the information of all channels in the first attention map to obtain a first sampled image, and to perform non-uniform sampling on the training image according to the information of a single channel in the first attention map to obtain a second sampled image;
a first classification prediction module 340, configured to input the first sampled image to a second neural network for processing, so as to obtain a first classification probability output by the second neural network, and input the second sampled image to a third neural network for processing, so as to obtain a second classification probability output by the third neural network;
a parameter updating module 350, configured to calculate a classification prediction loss according to the first classification probability and the second classification probability, and update parameters of the first neural network, the second neural network, and the third neural network by using a back propagation algorithm according to the classification prediction loss.
In one implementation of the model training apparatus 300, the non-uniform sampling of the training image by the first sampling module 330 according to the information of all channels in the first attention map to obtain a first sampled image includes: performing average pooling on all channels in the first attention map to obtain an average attention map; sampling the training image with a first non-uniform sampling function according to the average attention map to obtain the first sampled image.
In one implementation of the model training apparatus 300, the sampling of the training image by the first sampling module 330 with a first non-uniform sampling function according to the average attention map includes: calculating the first sampled image according to the following formula:
$$I_s(i, j) = S\big(I, A(M)\big)(i, j) = I\!\left(F_w^{-1}\!\left(\frac{i}{w}\,F_w(w)\right),\; F_h^{-1}\!\left(\frac{j}{h}\,F_h(h)\right)\right)$$

wherein $I_s$ denotes the first sampled image, $S$ denotes the first non-uniform sampling function, $M$ denotes the first attention map, $A(M)$ denotes the average attention map, $I$ denotes the training image, $w$ denotes the width of the training image, $h$ denotes the height of the training image, $i$ denotes a pixel index in the $w$ direction, $j$ denotes a pixel index in the $h$ direction, and $F_w^{-1}$ and $F_h^{-1}$ are respectively the inverse functions of the following two functions:

$$F_w(m) = \int_1^m \max_{1 \le n \le h} A(M)(m', n)\,\mathrm{d}m', \qquad 1 \le m \le w$$

$$F_h(n) = \int_1^n \max_{1 \le m \le w} A(M)(m, n')\,\mathrm{d}n', \qquad 1 \le n \le h$$

wherein $F_w$ denotes the integral of $A(M)$ in the $w$ direction and $F_h$ denotes the integral of $A(M)$ in the $h$ direction.
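Read operationally, the formula is inverse-transform sampling: the attention mass is accumulated along each axis into a cumulative distribution, and sampling coordinates are taken where that distribution crosses a uniform grid, so that image regions with high attention receive more output pixels. The following PyTorch sketch realizes this reading; the max-based marginals, the searchsorted inversion and the grid_sample interpolation are implementation choices of ours and are not disclosed details.

```python
import torch
import torch.nn.functional as F

def nonuniform_sample(image, avg_attn, out_size=None):
    """Sketch of the first non-uniform sampling function S(I, A(M)).

    image:    (B, 3, H, W) training image I
    avg_attn: (B, 1, H, W) average attention map A(M)
    """
    B, _, H, W = image.shape
    oh, ow = out_size or (H, W)
    eps = 1e-6

    # marginal attention mass along each axis (max over the other axis)
    mw = avg_attn.amax(dim=2).squeeze(1) + eps          # (B, W)
    mh = avg_attn.amax(dim=3).squeeze(1) + eps          # (B, H)

    # normalized cumulative distributions F_w and F_h in [0, 1]
    Fw = torch.cumsum(mw, dim=1)
    Fw = Fw / Fw[:, -1:]
    Fh = torch.cumsum(mh, dim=1)
    Fh = Fh / Fh[:, -1:]

    # invert the CDFs on a uniform grid: u -> F^{-1}(u)
    uw = torch.linspace(0.0, 1.0, ow, device=image.device).expand(B, ow)
    uh = torch.linspace(0.0, 1.0, oh, device=image.device).expand(B, oh)
    ix = torch.searchsorted(Fw.contiguous(), uw.contiguous()).clamp(max=W - 1)
    iy = torch.searchsorted(Fh.contiguous(), uh.contiguous()).clamp(max=H - 1)

    # normalized coordinates in [-1, 1] for grid_sample
    gx = ix.float() / (W - 1) * 2 - 1                   # (B, ow)
    gy = iy.float() / (H - 1) * 2 - 1                   # (B, oh)
    gx, gy = torch.broadcast_tensors(gx[:, None, :], gy[:, :, None])
    grid = torch.stack((gx, gy), dim=-1)                # (B, oh, ow, 2)

    return F.grid_sample(image, grid, align_corners=True)
```

Because searchsorted returns integer indices, this sketch is not differentiable with respect to the attention map; a training-time implementation would need a smooth interpolation of the inverse cumulative distribution so that the back propagation performed by the parameter updating module 350 can reach the first neural network.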
In one implementation of the model training apparatus 300, the non-uniform sampling of the training image by the first sampling module 330 according to the information of a single channel in the first attention map to obtain the second sampled image includes: randomly selecting one channel from all channels of the first attention map; sampling the training image with a second non-uniform sampling function according to the selected channel to obtain the second sampled image; and re-performing the random channel selection once every preset training period.
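A small stateful helper suffices for this schedule; the class below is a sketch under our own naming, since the application specifies only that the channel is re-drawn at random once every preset training period:

```python
import random
import torch

class ChannelSelector:
    """Selects one channel of the first attention map, re-drawing the
    choice at random once every `period` training iterations."""

    def __init__(self, num_channels: int, period: int):
        self.num_channels = num_channels
        self.period = period
        self.step = 0
        self.current = random.randrange(num_channels)

    def select(self, attn: torch.Tensor) -> torch.Tensor:
        # attn: (B, C, H, W) first attention map
        if self.step > 0 and self.step % self.period == 0:
            self.current = random.randrange(self.num_channels)
        self.step += 1
        return attn[:, self.current:self.current + 1]   # (B, 1, H, W)
```

The sequential variant described next replaces the random draw with a round-robin pointer, for example self.current = (self.current + 1) % self.num_channels.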
In one implementation of the model training apparatus 300, the non-uniform sampling of the training image by the first sampling module 330 according to the information of a single channel in the first attention map to obtain the second sampled image includes: selecting one channel from all channels of the first attention map according to a preset sequence; sampling the training image with a second non-uniform sampling function according to the selected channel to obtain the second sampled image; and re-performing the channel selection according to the preset sequence once every preset training period, wherein the preset sequence is an arrangement order of all channels in the first attention map.
In one implementation of the model training apparatus 300, the first attention map obtaining module 320 obtains the first attention map based on the first feature map by calculating the first attention map according to the relations among the channels in the first feature map.
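The application does not disclose the concrete operator behind the relations among the channels. One plausible reading, given purely as a sketch, is a channel-wise self-attention in which each channel is re-weighted by its similarity to every other channel, with the result normalized to [0, 1] so that pixel values can act as sampling probabilities:

```python
import torch

def compute_attention(feat: torch.Tensor) -> torch.Tensor:
    """Hypothetical channel-relation attention (one possible reading)."""
    B, C, H, W = feat.shape
    x = feat.flatten(2)                                  # (B, C, H*W)
    rel = torch.softmax(x @ x.transpose(1, 2), dim=-1)   # (B, C, C) channel relations
    attn = (rel @ x).view(B, C, H, W)                    # channels re-weighted by relations
    amin = attn.amin(dim=(2, 3), keepdim=True)
    amax = attn.amax(dim=(2, 3), keepdim=True)
    return (attn - amin) / (amax - amin + 1e-6)          # per-channel min-max to [0, 1]
```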
In one implementation of the model training apparatus 300, the calculation of the classification prediction loss by the parameter updating module 350 according to the first classification probability and the second classification probability includes: calculating a first loss according to the first classification probability and the label of the training image, and calculating a second loss according to the first classification probability and the second classification probability, wherein the first loss characterizes the difference between the predicted classification result of the second neural network and the true classification result, and the second loss characterizes the difference between the predicted classification results of the second neural network and the third neural network; and performing a weighted summation of the first loss and the second loss to obtain the classification prediction loss.
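In code, the weighted summation might look as follows; the cross-entropy for the first loss and the KL divergence for the second loss are assumptions, since the application states only what each loss characterizes, and the weights w1 and w2 are likewise unspecified:

```python
import torch.nn.functional as F

def prediction_loss(p1, p2, label, w1=1.0, w2=1.0):
    """Classification prediction loss of module 350 (a hedged sketch).

    p1, p2: logits of the second and third neural networks;
    label:  ground-truth class indices of the training images.
    """
    loss1 = F.cross_entropy(p1, label)             # net2 prediction vs. true result
    loss2 = F.kl_div(F.log_softmax(p2, dim=1),     # net3 prediction vs. net2 prediction
                     F.softmax(p1, dim=1),
                     reduction='batchmean')
    return w1 * loss1 + w2 * loss2                 # weighted summation
```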
The implementation principle of the model training apparatus 300 provided by the embodiment of the present application has been described in the foregoing method embodiments; for brevity, reference may be made to the corresponding contents of the method embodiments for any portion of the apparatus embodiment not mentioned here.
Fig. 5 is a functional block diagram of an image classification apparatus 400 according to an embodiment of the present application. Referring to fig. 5, the image classification apparatus 400 includes:
a second feature map obtaining module 410, configured to input an image to be classified into a first neural network for processing, and obtain a second feature map output by the first neural network;
a second attention map obtaining module 420, configured to obtain a second attention map based on the second feature map, where a value of a pixel in the second attention map is positively correlated with a probability that a corresponding pixel in the image to be classified is sampled;
the second sampling module 430 is configured to perform non-uniform sampling on the image to be classified according to information of all channels in the second attention map to obtain a third sampled image, and perform non-uniform sampling on the image to be classified according to information of a single channel in the second attention map to obtain a fourth sampled image;
the second classification prediction module 440 is configured to input the third sampled image to a second neural network for processing, so as to obtain a third classification probability output by the second neural network, and input the fourth sampled image to a third neural network for processing, so as to obtain a fourth classification probability output by the third neural network;
a classification result obtaining module 450, configured to determine a final classification result of the image to be classified according to the third classification probability and the fourth classification probability.
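At inference time the same building blocks are reused without gradient tracking, as in the sketch below, which reuses the hypothetical helpers introduced earlier. Averaging the two probability vectors is one reasonable way to determine the final result; the application leaves the exact combination rule open, as well as which single channel is used (channel 0 here):

```python
import torch

@torch.no_grad()
def classify(net1, net2, net3, image, channel=0):
    """Inference sketch mirroring modules 410-450."""
    feat = net1(image)                                             # second feature map
    attn = compute_attention(feat)                                 # second attention map
    img3 = nonuniform_sample(image, average_attention(attn))       # all channels
    img4 = nonuniform_sample(image, attn[:, channel:channel + 1])  # single channel
    p3 = torch.softmax(net2(img3), dim=1)                          # third classification probability
    p4 = torch.softmax(net3(img4), dim=1)                          # fourth classification probability
    return ((p3 + p4) / 2).argmax(dim=1)                           # final classification result
```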
Likewise, the image classification apparatus 400 provided by the embodiment of the present application has been described in the foregoing method embodiments; for brevity, reference may be made to the corresponding contents of the method embodiments.
Fig. 6 shows a possible structure of an electronic device 500 provided in an embodiment of the present application. Referring to fig. 6, the electronic device 500 includes: a processor 510, a memory 520, and a communication interface 530, which are interconnected and in communication with each other via a communication bus 540 and/or other form of connection mechanism (not shown).
The memory 520 includes one or more memories (only one is shown in the figure), which may be, but are not limited to, a Random Access Memory (RAM), a Read Only Memory (ROM), a Programmable Read-Only Memory (PROM), an Erasable Programmable Read-Only Memory (EPROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), and the like. The processor 510, and possibly other components, may access, read, and/or write data in the memory 520.
The processor 510 includes one or more processors (only one is shown), each of which may be an integrated circuit chip having signal processing capability. The processor 510 may be a general-purpose processor, including a Central Processing Unit (CPU), a Micro Control Unit (MCU), a Network Processor (NP), or another conventional processor; it may also be a special-purpose processor, including a Graphics Processing Unit (GPU), a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components. Moreover, when there are multiple processors 510, some of them may be general-purpose processors and the others may be special-purpose processors.
The communication interface 530 includes one or more interfaces (only one is shown) that can be used to communicate, directly or indirectly, with other devices for data interaction. The communication interface 530 may include interfaces for wired and/or wireless communication.
One or more computer program instructions may be stored in the memory 520 and read and executed by the processor 510 to implement the model training method and/or the image classification method provided by the embodiments of the present application.
It will be appreciated that the configuration shown in FIG. 6 is merely illustrative and that electronic device 500 may include more or fewer components than shown in FIG. 6 or have a different configuration than shown in FIG. 6. The components shown in fig. 6 may be implemented in hardware, software, or a combination thereof. The electronic device 500 may be a physical device, such as a PC, a laptop, a tablet, a cell phone, a server, an embedded device, etc., or may be a virtual device, such as a virtual machine, a virtualized container, etc. The electronic device 500 is not limited to a single device, and may be a combination of a plurality of devices or a cluster including a large number of devices.
The embodiment of the present application further provides a computer-readable storage medium, on which computer program instructions are stored; when the computer program instructions are read and executed by a processor of a computer, they perform the model training method and/or the image classification method provided in the embodiment of the present application. For example, the computer-readable storage medium may be embodied as the memory 520 in the electronic device 500 of FIG. 6.
The above description is only an example of the present application and is not intended to limit the scope of the present application, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims (10)

1. A method of model training, comprising:
inputting a training image into a first neural network for processing to obtain a first feature map output by the first neural network;
obtaining a first attention map based on the first feature map, wherein the value of a pixel in the first attention map is positively correlated with the probability that a corresponding pixel in the training image is sampled;
non-uniformly sampling the training image according to the information of all channels in the first attention map to obtain a first sampling image, and non-uniformly sampling the training image according to the information of a single channel in the first attention map to obtain a second sampling image;
inputting the first sampling image into a second neural network for processing to obtain a first classification probability output by the second neural network, and inputting the second sampling image into a third neural network for processing to obtain a second classification probability output by the third neural network;
and calculating a classification prediction loss according to the first classification probability and the second classification probability, and updating parameters of the first neural network, the second neural network and the third neural network by using a back propagation algorithm according to the classification prediction loss.
2. The model training method of claim 1, wherein the non-uniformly sampling the training image according to the information of all channels in the first attention map to obtain a first sampled image comprises:
performing average pooling on all channels in the first attention map to obtain an average attention map;
sampling the training image with a first non-uniform sampling function according to the average attention map to obtain the first sampled image.
3. The model training method of claim 2, wherein the sampling the training image with a first non-uniform sampling function according to the average attention map to obtain the first sampled image comprises:
the first sampling image is obtained by calculation according to the following formula:
$$I_s(i, j) = S\big(I, A(M)\big)(i, j) = I\!\left(F_w^{-1}\!\left(\frac{i}{w}\,F_w(w)\right),\; F_h^{-1}\!\left(\frac{j}{h}\,F_h(h)\right)\right)$$

wherein $I_s$ denotes the first sampled image, $S$ denotes the first non-uniform sampling function, $M$ denotes the first attention map, $A(M)$ denotes the average attention map, $I$ denotes the training image, $w$ denotes the width of the training image, $h$ denotes the height of the training image, $i$ denotes a pixel index in the $w$ direction, $j$ denotes a pixel index in the $h$ direction, and $F_w^{-1}$ and $F_h^{-1}$ are respectively the inverse functions of the following two functions:

$$F_w(m) = \int_1^m \max_{1 \le n \le h} A(M)(m', n)\,\mathrm{d}m', \qquad 1 \le m \le w$$

$$F_h(n) = \int_1^n \max_{1 \le m \le w} A(M)(m, n')\,\mathrm{d}n', \qquad 1 \le n \le h$$

wherein $F_w$ denotes the integral of $A(M)$ in the $w$ direction and $F_h$ denotes the integral of $A(M)$ in the $h$ direction.
4. The model training method of claim 1, wherein the non-uniformly sampling the training image according to the information of the single channel in the first attention map to obtain a second sampled image comprises:
randomly selecting one channel from all channels of the first attention map;
sampling the training image by using a second non-uniform sampling function according to the selected channel to obtain a second sampling image;
and re-performing the random channel selection once every preset training period.
5. The model training method of claim 1, wherein the non-uniformly sampling the training image according to the information of the single channel in the first attention map to obtain a second sampled image comprises:
selecting one channel from all channels of the first attention map according to a preset sequence;
sampling the training image by using a second non-uniform sampling function according to the selected channel to obtain a second sampling image;
and re-performing the channel selection according to the preset sequence once every preset training period, wherein the preset sequence is an arrangement order of all channels in the first attention map.
6. The model training method of claim 1, wherein the obtaining a first attention map based on the first feature map comprises:
calculating the first attention map according to the relations among the channels in the first feature map.
7. The model training method of claim 1, wherein said calculating a classification prediction loss based on said first classification probability and said second classification probability comprises:
calculating a first loss according to the first classification probability and a label of the training image, and calculating a second loss according to the first classification probability and the second classification probability; wherein the first loss characterizes a difference between the predicted classification result of the second neural network and a true classification result, and the second loss characterizes a difference between the predicted classification result of the second neural network and a predicted classification result of the third neural network;
and performing a weighted summation of the first loss and the second loss to obtain the classification prediction loss.
8. An image classification method, comprising:
inputting an image to be classified into a first neural network for processing to obtain a second feature map output by the first neural network;
obtaining a second attention map based on the second feature map, wherein the value of a pixel in the second attention map is positively correlated with the probability that a corresponding pixel in the image to be classified is sampled;
non-uniformly sampling the image to be classified according to the information of all channels in the second attention map to obtain a third sampling image, and non-uniformly sampling the image to be classified according to the information of a single channel in the second attention map to obtain a fourth sampling image;
inputting the third sampling image into a second neural network for processing to obtain a third classification probability output by the second neural network, and inputting the fourth sampling image into a third neural network for processing to obtain a fourth classification probability output by the third neural network;
and determining the final classification result of the image to be classified according to the third classification probability and the fourth classification probability.
9. A model training apparatus, comprising:
the first feature map acquisition module is used for inputting a training image into a first neural network for processing to obtain a first feature map output by the first neural network;
the first attention map obtaining module is used for obtaining a first attention map based on the first feature map, wherein the value of a pixel in the first attention map is positively correlated with the probability that a corresponding pixel in the training image is sampled;
the first sampling module is used for carrying out non-uniform sampling on the training image according to the information of all channels in the first attention map to obtain a first sampling image, and carrying out non-uniform sampling on the training image according to the information of a single channel in the first attention map to obtain a second sampling image;
the first classification prediction module is used for inputting the first sampling image into a second neural network for processing to obtain a first classification probability output by the second neural network, and inputting the second sampling image into a third neural network for processing to obtain a second classification probability output by the third neural network;
and the parameter updating module is used for calculating the classification prediction loss according to the first classification probability and the second classification probability and updating the parameters of the first neural network, the second neural network and the third neural network by utilizing a back propagation algorithm according to the classification prediction loss.
10. An image classification apparatus, comprising:
the second feature map acquisition module is used for inputting the image to be classified into the first neural network for processing to obtain a second feature map output by the first neural network;
the second attention map obtaining module is used for obtaining a second attention map based on the second feature map, wherein the value of a pixel in the second attention map is positively correlated with the probability that a corresponding pixel in the image to be classified is sampled;
the second sampling module is used for carrying out non-uniform sampling on the image to be classified according to the information of all channels in the second attention map to obtain a third sampling image, and carrying out non-uniform sampling on the image to be classified according to the information of a single channel in the second attention map to obtain a fourth sampling image;
the second classification prediction module is used for inputting the third sampling image into a second neural network for processing to obtain a third classification probability output by the second neural network, and inputting the fourth sampling image into the third neural network for processing to obtain a fourth classification probability output by the third neural network;
and the classification result acquisition module is used for determining the final classification result of the image to be classified according to the third classification probability and the fourth classification probability.