CN115457329B - Training method of image classification model, image classification method and device

Training method of image classification model, image classification method and device

Info

Publication number
CN115457329B
Authority
CN
China
Prior art keywords: image, feature, classification model, training, feature extraction
Prior art date
Legal status
Active
Application number
CN202211165540.6A
Other languages
Chinese (zh)
Other versions
CN115457329A (en)
Inventor
王兆玮 (Wang Zhaowei)
杨叶辉 (Yang Yehui)
武秉泓 (Wu Binghong)
王晓荣 (Wang Xiaorong)
黄海峰 (Huang Haifeng)
Current Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202211165540.6A priority Critical patent/CN115457329B/en
Publication of CN115457329A publication Critical patent/CN115457329A/en
Application granted granted Critical
Publication of CN115457329B publication Critical patent/CN115457329B/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/082 Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/40 Extraction of image or video features
    • G06V 10/42 Global feature extraction by analysis of the whole pattern, e.g. using frequency domain transformations or autocorrelation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/40 Extraction of image or video features
    • G06V 10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y04 INFORMATION OR COMMUNICATION TECHNOLOGIES HAVING AN IMPACT ON OTHER TECHNOLOGY AREAS
    • Y04S SYSTEMS INTEGRATING TECHNOLOGIES RELATED TO POWER NETWORK OPERATION, COMMUNICATION OR INFORMATION TECHNOLOGIES FOR IMPROVING THE ELECTRICAL POWER GENERATION, TRANSMISSION, DISTRIBUTION, MANAGEMENT OR USAGE, i.e. SMART GRIDS
    • Y04S 10/00 Systems supporting electrical power generation, transmission or distribution
    • Y04S 10/50 Systems or methods supporting the power network operation or management, involving a certain degree of interaction with the load-side end user applications

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The disclosure provides a training method of an image classification model, an image classification method, and an image classification device, relating to the technical field of artificial intelligence, and in particular to the technical fields of image processing and deep learning. The specific implementation scheme is as follows: a first feature image of a sample image is determined using a first feature extraction network in a first image classification model; a mask image of the sample image is input to a second feature extraction network in a second image classification model to obtain a second feature image of the same size as the first feature image; and the second feature extraction network of the second image classification model is trained according to the first feature image and the second feature image. In this way, the first image classification model is used to perform self-supervised training of the second image classification model, which reduces the training cost of the second image classification model while enabling the trained model to achieve a good classification effect.

Description

Training method of image classification model, image classification method and device
Technical Field
The disclosure relates to the technical field of computers, in particular to the technical field of artificial intelligence and the technical fields of image processing, deep learning and the like, and especially to a training method of an image classification model, an image classification method, and an image classification device.
Background
In the related art, images often need to be classified in certain scenes, and how to obtain an image classification model is therefore important for image classification.
Disclosure of Invention
The disclosure provides a training method of an image classification model, an image classification method and an image classification device.
According to an aspect of the present disclosure, there is provided a training method of an image classification model, the method including: determining a mask image of the sample image; determining a first feature image of the sample image using a first feature extraction network in a first image classification model; inputting the mask image into a second feature extraction network in a second image classification model to obtain a second feature image, wherein the first feature image and the second feature image are the same in size; training the second feature extraction network according to the first feature image and the second feature image.
According to another aspect of the present disclosure, there is provided an image classification method, the method including: acquiring an image to be processed; inputting the image to be processed into a second feature extraction network in a second image classification model to obtain a third feature image of the image to be processed, wherein the second feature extraction network is obtained by training based on a first feature image and a second feature image, the first feature image is obtained by carrying out feature extraction on a sample image by using the first feature extraction network in the first image classification model, and the second feature image is obtained by carrying out feature extraction on a mask image of the sample image by using the second feature extraction network; and classifying the third characteristic image by using a classification network in the second image classification model to obtain type label information of the image to be processed.
According to another aspect of the present disclosure, there is provided a training apparatus of an image classification model, including: a first determining module for determining a mask image of the sample image; a second determining module for determining a first feature image of the sample image using a first feature extraction network in a first image classification model; the feature extraction module is used for inputting the mask image into a second feature extraction network in a second image classification model to obtain a second feature image, wherein the sizes of the first feature image and the second feature image are the same; and the first training module is used for training the second feature extraction network according to the first feature image and the second feature image.
According to another aspect of the present disclosure, there is provided an image classification apparatus, including: the acquisition module is used for acquiring the image to be processed; the feature extraction module is used for inputting the image to be processed into a second feature extraction network in a second image classification model to obtain a third feature image of the image to be processed, wherein the second feature extraction network is obtained by training based on a first feature image and a second feature image, the first feature image is obtained by carrying out feature extraction on a sample image by utilizing the first feature extraction network in the first image classification model, and the second feature image is obtained by carrying out feature extraction on a mask image of the sample image by utilizing the second feature extraction network; and the classification module is used for classifying the third feature image by utilizing a classification network in the second image classification model so as to obtain the type label information of the image to be processed.
According to another aspect of the present disclosure, there is provided an electronic device including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform a training method or an image classification method of the image classification model of the present disclosure.
According to another aspect of the present disclosure, there is provided a non-transitory computer-readable storage medium storing computer instructions for causing the computer to perform a training method or an image classification method of an image classification model disclosed by an embodiment of the present disclosure.
According to another aspect of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements a training method or an image classification method of an image classification model of the present disclosure.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.
Drawings
The drawings are for a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 is a schematic diagram according to a first embodiment of the present disclosure;
FIG. 2 is a schematic diagram according to a second embodiment of the present disclosure;
FIG. 3 is an exemplary diagram of a process of knowledge distillation training;
FIG. 4 is a schematic diagram according to a third embodiment of the present disclosure;
FIG. 5 is a schematic diagram according to a fourth embodiment of the present disclosure;
FIG. 6 is a schematic diagram according to a fifth embodiment of the present disclosure;
FIG. 7 is an exemplary diagram of self-supervised training of a self-attention-based Transformer network model in conjunction with a trained CNN model;
FIG. 8 is a schematic diagram according to a sixth embodiment of the present disclosure;
FIG. 9 is a schematic diagram according to a seventh embodiment of the present disclosure;
FIG. 10 is a schematic diagram according to an eighth embodiment of the present disclosure;
FIG. 11 is a schematic diagram according to a ninth embodiment of the present disclosure;
fig. 12 is a block diagram of an electronic device used to implement a training method or image classification method of an image classification model in accordance with an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present disclosure to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
In medical scenes, how to train an image classification model is important for quickly classifying medical images. In the related art, in order to enable an image classification model to accurately classify medical images, an initial image classification model (e.g., a Transformer network) is generally trained based on a large amount of sample image data with type labels to obtain the image classification model. However, a large amount of labeled sample image data is difficult to acquire, so the cost of training the image classification model is high.
To this end, the present disclosure determines a first feature image of a sample image using a first feature extraction network in a first image classification model, inputs a mask image of the sample image to a second feature extraction network in a second image classification model to obtain a second feature image of the same size as the first feature image, and trains the second feature extraction network of the second image classification model according to the first feature image and the second feature image. In this way, the first image classification model is used to perform self-supervised training of the second image classification model, which reduces the training cost of the second image classification model while enabling the trained model to achieve a good classification effect.
The training method, the image classification method and the device of the image classification model according to the embodiment of the present disclosure are described below with reference to the accompanying drawings.
Fig. 1 is a schematic diagram of a first embodiment according to the present disclosure, which provides a training method of an image classification model.
As shown in fig. 1, the training method of the image classification model may include:
step 101, a mask image of the sample image is determined.
The execution subject of the training method of the image classification model in this embodiment is a training device of the image classification model. The training device may be implemented by software and/or hardware, and may itself be an electronic device or be configured in an electronic device.
The electronic device may include, but is not limited to, a terminal device, a server, etc., and the embodiment is not particularly limited to the electronic device.
In some exemplary embodiments, in a medical scene, the sample image may be a medical sample image of a specified body part.
In some exemplary embodiments, the designated body part may be one of the body parts corresponding to a human or animal. For example, the specified body part may be an eye of a human body, and specifically, the sample image in this example may be a fundus sample image.
The mask image is an image obtained by masking the sample image.
In some exemplary embodiments, the implementation of determining the mask image of the sample image differs across application scenarios; exemplary implementations are as follows:
as one example, a sample image is partitioned to obtain a plurality of image blocks of the sample image; masking a portion of the plurality of image blocks to obtain a masked image of the sample image. Thus, by masking a part of the image blocks of the sample image, a mask image of the sample image can be accurately obtained.
Specifically, a partial image block may be randomly selected from a plurality of image blocks, and masking processing may be performed on the randomly selected partial image block to obtain a masked image of the sample image.
As another example, a mask image corresponding to a sample image may be acquired from a correspondence between the sample image and the mask image stored in advance.
As another example, the sample image may be directly randomly masked to obtain a masked image of the sample image.
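As an illustrative sketch of the first example above (splitting the sample image into blocks and masking a randomly selected subset), the following PyTorch-style function is one way this could look; the block size, mask ratio, and zero-filling of masked blocks are assumptions, not values taken from the disclosure:

```python
import torch

def mask_image_blocks(image: torch.Tensor, block_size: int = 32,
                      mask_ratio: float = 0.4) -> torch.Tensor:
    """Randomly mask a fraction of non-overlapping blocks of `image`.

    `image` is (C, H, W); H and W are assumed divisible by `block_size`.
    Masked blocks are zeroed out.
    """
    c, h, w = image.shape
    gh, gw = h // block_size, w // block_size
    num_blocks = gh * gw
    num_masked = int(num_blocks * mask_ratio)

    # Randomly pick which blocks to mask.
    masked_ids = torch.randperm(num_blocks)[:num_masked]

    masked = image.clone()
    for idx in masked_ids:
        row, col = divmod(int(idx), gw)
        y, x = row * block_size, col * block_size
        masked[:, y:y + block_size, x:x + block_size] = 0.0
    return masked
```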
Step 102, determining a first feature image of the sample image by using a first feature extraction network in the first image classification model.
Wherein the size of the first feature image is smaller than the size of the sample image in this example, for example, the first feature image may be reduced by a factor of 32 with respect to the sample image, i.e. the size of the sample image is 32 times the size of the first feature image.
In one embodiment of the present disclosure, in order that local feature information of the sample image may be accurately extracted, the first feature extraction network in this example may be a convolutional neural network; for example, the first feature extraction network in the first image classification model may be the residual network ResNet-50. ResNet-50 can perform multiple downsampling processes on the sample image through multiple feature extraction layers to obtain the first feature image of the sample image.
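As a hedged illustration of this step, the snippet below strips the classification head from torchvision's ResNet-50 so the network returns the final 32x-downsampled feature map; the input size and the use of torchvision are assumptions, not specified by the disclosure:

```python
import torch
from torch import nn
import torchvision.models as models

# Drop the global-average-pool and fully-connected head so the network
# returns the final feature map instead of class logits.
resnet = models.resnet50(weights=None)
backbone = nn.Sequential(*list(resnet.children())[:-2])

x = torch.randn(1, 3, 224, 224)              # a 224x224 sample image
feat = backbone(x)                           # -> (1, 2048, 7, 7)
assert feat.shape[-1] == x.shape[-1] // 32   # 32x spatial downsampling
```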
Step 103, inputting the mask image into a second feature extraction network in the second image classification model to obtain a second feature image, wherein the sizes of the first feature image and the second feature image are the same.
In some examples, the second feature extraction network may be based on a self-attention Transformer network, so that semantic feature extraction is performed on the mask image through the self-attention Transformer network.
Step 104, training the second feature extraction network according to the first feature image and the second feature image.
In some exemplary embodiments, to accurately train the second feature extraction network in the second image classification model, one possible implementation of training the second feature extraction network from the first feature image and the second feature image is: determining a mean square error loss between the first feature image and the second feature image; and training the second feature extraction network according to the mean square error loss.
Specifically, the network parameters of the second feature extraction network may be adjusted according to the mean square error loss, so as to obtain an adjusted second image classification model, and training is continued on the adjusted second image classification model until the mean square error loss meets a preset condition.
The preset condition is a condition for ending model training and can be configured according to actual requirements. For example, the mean square error loss may satisfy the preset condition when it is smaller than a preset value, or when its change is smooth, i.e., the difference between the mean square error losses of two or more adjacent training rounds is smaller than a set value, meaning the loss has essentially stopped changing.
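A minimal sketch of this training step, assuming PyTorch and treating the first (teacher) feature map as fixed; the function and parameter names are illustrative:

```python
import torch
import torch.nn.functional as F

def self_supervised_step(first_feat: torch.Tensor,
                         second_net: torch.nn.Module,
                         mask_image: torch.Tensor,
                         optimizer: torch.optim.Optimizer) -> float:
    """One step aligning the second network's feature map with the first."""
    second_feat = second_net(mask_image)      # same size as first_feat
    # Mean square error loss; the teacher feature map is detached so
    # gradients only flow into the second feature extraction network.
    loss = F.mse_loss(second_feat, first_feat.detach())

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()   # caller checks this against the preset condition
```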
Based on the above description, it can be seen that in this example the first image classification model performs self-supervised training on the second feature extraction network in the second image classification model; since a large number of labeled sample images need not be collected during training, the training cost of the model can be reduced. In addition, in the process of training the second feature extraction network, it can learn the feature extraction capability of the first feature extraction network in the first image classification model, which improves the feature extraction accuracy of the second feature extraction network and thus the classification accuracy of the trained second image classification model.
According to the training method of the image classification model, a first feature extraction network in a first image classification model is used to determine a first feature image of a sample image, a mask image of the sample image is input to a second feature extraction network in a second image classification model to obtain a second feature image of the same size as the first feature image, and the second feature extraction network of the second image classification model is trained according to the first feature image and the second feature image. In this way, the first image classification model is used to perform self-supervised training of the second image classification model, which reduces the training cost of the second image classification model while enabling the trained model to achieve a good classification effect.
It will be appreciated that, in some embodiments, in order to further improve the classification accuracy of the second image classification model, knowledge distillation training may be performed on the second image classification model in conjunction with the first image classification model. In order to clearly understand this knowledge distillation process, the training method of the image classification model of this embodiment is exemplarily described below in conjunction with fig. 2.
Fig. 2 is a schematic diagram according to a second embodiment of the present disclosure.
As shown in fig. 2, the training method of the image classification model may include:
in step 201, a mask image of the sample image is determined.
Step 202, determining a first feature image of the sample image using a first feature extraction network in a first image classification model.
And step 203, classifying the first characteristic image by using a classification network in the first image classification model to obtain first type tag information of the sample image.
That is, in the present example, the sample image may be input into the first image classification model to obtain the first type tag information of the sample image through the first image classification model. The processing of the first image classification model to process the sample image to obtain the first type tag information of the sample image is shown in step 202 and step 203.
Step 204, inputting the sample image into a second image classification model to obtain second type label information of the sample image.
In some exemplary embodiments, after the sample image is input into the second image classification model, the specific exemplary process of the second image classification model obtaining the second type of tag information of the sample image is: and the second feature extraction network in the second image classification model performs feature extraction on the sample image, the extracted feature image is input into the classification network in the second image classification model, and correspondingly, the classification network in the second image classification model classifies the extracted feature image to obtain second type label information of the sample image.
Step 205, training the second image classification model according to the first type label information and the second type label information.
That is, knowledge distillation training may also be performed on the second image classification model based on the first image classification model prior to the self-supervised training of the second image classification model.
As can be seen from the description in this example, during the knowledge distillation training the teacher model is the first image classification model and the student model is the second image classification model.
In some exemplary embodiments, in order for the second image classification model to accurately inherit the performance of the first image classification model, one possible implementation of training the second image classification model according to the first type of tag information and the second type of tag information is: determining a distillation loss value according to the first type tag information and the second type tag information; and adjusting model parameters of the second image classification model according to the distillation loss value to realize training.
In some exemplary embodiments, the first type tag information and the second type tag information in this example may be distribution probability information of the sample image on preset various types of tags. That is, in some examples, the first type of tag information may be first type of tag distribution probability information and the second type of tag information may be second type of tag distribution probability information.
For example, assume the sample image is a fundus sample image, the first image classification model is an image classification model based on a convolutional neural network CNN, the second image classification model is an image classification model based on a self-attention Transformer network, and both the first type tag information and the second type tag information are type tag distribution probabilities. Since the CNN-based image classification model can accurately extract local feature information of the sample image, and in order for the Transformer-based image classification model to inherit this capability of extracting local features from the sample image, knowledge distillation training may be performed on the Transformer-based image classification model based on the CNN-based image classification model. An example diagram of the knowledge distillation training process is shown in fig. 3. Specifically, the fundus sample image may be input into the CNN-based image classification model to obtain first type tag distribution probability information output by that model, the fundus sample image may be input into the Transformer-based image classification model to obtain second type tag distribution probability information of the fundus sample image, a loss value may be determined according to the first and second type tag distribution probability information, and the Transformer-based image classification model may be trained based on the loss value.
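The disclosure does not fix the exact form of the distillation loss; a temperature-scaled KL divergence between the two type tag distribution probabilities is one common choice, sketched below (the temperature value is an assumption):

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature: float = 4.0):
    """Soft-label distillation loss; the temperature value is illustrative."""
    t = temperature
    student_log_probs = F.log_softmax(student_logits / t, dim=-1)
    teacher_probs = F.softmax(teacher_logits / t, dim=-1)
    # KL divergence between the teacher and student label distributions,
    # scaled by t^2 to keep gradient magnitudes comparable across temperatures.
    return F.kl_div(student_log_probs, teacher_probs,
                    reduction="batchmean") * (t * t)
```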
Step 206, inputting the mask image into a second feature extraction network in the second image classification model to obtain a second feature image, wherein the first feature image and the second feature image have the same size.
Step 207 trains the second feature extraction network from the first feature image and the second feature image.
It should be noted that, regarding the specific implementation manner of step 206 and step 207, reference may be made to the related description in the embodiments of the present disclosure, which is not repeated here.
In this example, the first type tag information of the sample image is output by the first image classification model, the second type tag information is output by the second image classification model, and knowledge distillation training is performed on the second image classification model based on the first type tag information and the second type tag information, so that the second image classification model can inherit the capability of the first image classification model, improving the classification accuracy of the second image classification model.
In one embodiment of the present disclosure, in order to enable the first image classification model to accurately determine the type tag information of the sample image and thereby facilitate knowledge distillation training of the second image classification model based on the first image classification model, in some exemplary embodiments the first image classification model may be trained in combination with the sample image and corresponding type annotation data. In order that this process may be clearly understood, the training method of the image classification model is exemplarily described below in connection with fig. 4.
Fig. 4 is a schematic diagram according to a third embodiment of the present disclosure.
As shown in fig. 4, the training method of the image classification model may include:
in step 401, type labeling data of the sample image is obtained, wherein the type labeling data includes third type label information.
It will be appreciated that the third type tag information in this example is the annotation obtained by type-labeling the sample image. In some examples, the sample image may be manually type-labeled, and the third type tag information labeled for the sample image may thus be obtained.
Step 402, a mask image of a sample image is determined.
It should be noted that, regarding the specific implementation manner of step 402, reference may be made to the related description of the embodiments of the present disclosure, which is not repeated here.
Step 403, determining a first feature image of the sample image using the first feature extraction network in the first image classification model.
Step 404, classifying the first feature image by using a classification network in the first image classification model to obtain first type tag information of the sample image.
In this example, the sample image may be input into the first image classification model to obtain first type tag information of the sample image through the first image classification model. The first image classification model processes the sample image to obtain first type tag information of the sample image, as shown in step 403 and step 404. Correspondingly, a first feature extraction network and a classification network in the first image classification model sequentially process the sample image to obtain first type tag information of the sample image.
Step 405, training the first image classification model according to the third type of label information and the first type of label information.
That is, in this example, the first image classification model may be trained based on the sample image and corresponding type tag data prior to knowledge distillation training of the second image classification model by the first image classification model. Therefore, the first image classification model can accurately determine the type label information of the input image.
Step 406, inputting the sample image into a second image classification model to obtain second type label information of the sample image.
Step 407, training the second image classification model according to the first type label information and the second type label information.
In this example, the second image classification model may undergo knowledge distillation training based on the trained first image classification model.
Step 408, inputting the mask image to a second feature extraction network in the second image classification model to obtain a second feature image, wherein the first feature image and the second feature image are the same in size.
Step 409, training the second feature extraction network based on the first feature image and the second feature image.
It should be noted that, regarding the specific implementation manner of step 408 and step 409, reference may be made to the related description of the embodiments of the present disclosure, which is not repeated herein.
In some examples, the first image classification model in this example may be a convolutional neural network (CNN) based image classification model, since a convolutional neural network model performs better on a small sample collection, and the number of sample images with type labeling data required to train a CNN-based image classification model is typically small. In order to better capture image semantics, the second image classification model in this example is an image classification model based on a self-attention Transformer network. The convolutional neural network in the CNN-based image classification model can perform local feature extraction on the sample image; correspondingly, the self-attention Transformer network in the Transformer-based image classification model can perform global semantic feature extraction on the sample image.
In this example, before knowledge distillation training is performed on the second image classification model through the first image classification model, the first image classification model is trained based on the sample image and corresponding type labeling data, so that the trained first image classification model can accurately determine the type tag information of the sample image, which in turn allows knowledge distillation training of the second image classification model to be realized accurately.
Based on the embodiments shown in fig. 2 or fig. 4, in order to further improve the classification accuracy of the second image classification model, the second image classification model may also be trained again in combination with the sample image and the corresponding type tag data. In order to clearly understand how this is done, the training method of the image classification model of this embodiment is exemplarily described below in conjunction with fig. 5.
Fig. 5 is a schematic diagram according to a fourth embodiment of the present disclosure.
As shown in fig. 5, the training method of the image classification model may include:
in step 501, a mask image of the sample image is determined.
It should be noted that, regarding the specific implementation manner of step 501, reference may be made to the related description of the embodiments of the present disclosure, which is not repeated here.
Step 502, determining a first feature image of the sample image using a first feature extraction network in the first image classification model.
In some example embodiments, the first image classification model may also be trained based on the sample image and corresponding type annotation data.
Step 503, classifying the first feature image by using a classification network in the first image classification model to obtain first type tag information of the sample image.
Step 504, the sample image is input to a second image classification model to obtain a second type of label information for the sample image.
Step 505, training the second image classification model according to the first type label information and the second type label information.
Step 506, inputting the mask image to a second feature extraction network in the second image classification model to obtain a second feature image, wherein the first feature image and the second feature image are the same in size.
Step 507, training the second feature extraction network according to the first feature image and the second feature image.
And step 508, obtaining type labeling data of the sample image, wherein the type labeling data comprises third type label information.
Step 509, training the second image classification model according to the third type of label information and the second type of label information.
In some exemplary embodiments, the corresponding cross entropy loss value may be determined based on the third type of tag information and the second type of tag information, and model parameters of the second image classification model may be adjusted based on the cross entropy loss value to obtain an adjusted second image classification model, and training of the adjusted second image classification model may be continued until the corresponding cross entropy loss value meets a preset condition.
The preset condition is a condition for ending model training and can be configured according to actual requirements. For example, the cross entropy loss value may satisfy the preset condition when it is smaller than a preset value, or when its change is smooth, i.e., the difference between the cross entropy loss values of two or more adjacent training rounds is smaller than a set value, meaning the loss has essentially stopped changing.
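A minimal sketch of this fine-tuning stage, assuming PyTorch; the data loader and helper name are illustrative, not from the disclosure:

```python
import torch
import torch.nn.functional as F

def finetune_epoch(model: torch.nn.Module,
                   loader,                      # yields (images, labels) batches
                   optimizer: torch.optim.Optimizer) -> float:
    """One fine-tuning pass with cross-entropy against the annotated labels.

    Returns the mean loss so the caller can apply the stopping rule
    described above (loss below a threshold, or the change between
    adjacent epochs below a set value).
    """
    total, steps = 0.0, 0
    for images, labels in loader:               # labels: third type tag info
        logits = model(images)                  # predicted (second) type tags
        loss = F.cross_entropy(logits, labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        total, steps = total + loss.item(), steps + 1
    return total / max(steps, 1)
```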
In this example, after the second image classification model has undergone knowledge distillation training and self-supervised training through the first image classification model, it is retrained based on the sample image and the corresponding type labeling data. This fine-tunes the model parameters of the second image classification model so that, on top of the capability inherited from the first image classification model, it can learn global feature information in the sample image; the trained second image classification model can thus have classification capability exceeding that of the first image classification model, further improving its classification accuracy.
In order that the present disclosure may be clearly understood, the training method of this embodiment is further described below in connection with fig. 6. In this exemplary embodiment, a convolutional neural network (Convolutional Neural Networks, CNN) model is taken as the first image classification model, and a self-attention-based Transformer network model is taken as the second image classification model.
Fig. 6 is a schematic diagram according to a fifth embodiment of the present disclosure.
As shown in fig. 6, the method may include:
and step 601, training the CNN model based on the sample image and the corresponding type labeling data to obtain a trained CNN model.
In some exemplary embodiments, the sample image may be input into the CNN model to obtain a prediction type label of the sample image, and the CNN model is trained for multiple rounds according to the type labeling data and the prediction type label to obtain a trained CNN model.
Step 602, performing knowledge distillation training on the self-attention-based Transformer network model based on the trained CNN model.
In some exemplary embodiments, the sample image may be input to the trained CNN model to obtain first type tag distribution probability information of the sample image, the sample image may be input to the self-attention-based Transformer network model to obtain second type tag distribution probability information of the sample image, and the self-attention-based Transformer network model may be subjected to knowledge distillation training according to the first type tag distribution probability information and the second type tag distribution probability information.
Based on the above description, it can be seen that in this example knowledge distillation training is performed using the trained CNN model as the teacher model and the self-attention-based Transformer network model as the student model.
Step 603, obtaining a first feature image obtained by feature extraction of the sample image by the feature extraction network in the CNN model, and obtaining a second feature image obtained by feature extraction of the mask image of the sample image by the feature extraction network in the self-attention-based Transformer network model.
Step 604, training the self-attention-based Transformer network model based on the first feature image and the second feature image.
In order to further improve the self-attention Transformer network model's ability to learn image semantics, a self-supervised task is added as cooperative training while the Transformer network model learns from the CNN model: the sample image is fed to the CNN model, and the first deep feature image, downsampled 32 times by the CNN model, is extracted; the sample image is then partially masked to obtain a mask image, which is fed into the self-attention Transformer network model so that it recovers the deep feature map of the CNN model, yielding a second deep feature image. The Transformer network model is trained based on the first deep feature image and the second deep feature image, so that it can learn the CNN model's ability to capture image features.
For example, taking a fundus sample image as the sample image, an example of self-supervised training of the self-attention Transformer network model in combination with the trained CNN model is shown in fig. 7. Specifically, the fundus sample image may be input into the convolutional neural network of the trained CNN model to obtain a first deep feature image; the fundus sample image is masked to obtain a mask image, which is then input into the self-attention Transformer network model; correspondingly, the feature extraction network in the Transformer network model performs feature extraction on the mask image to obtain a second deep feature image; the mean square error loss between the first and second deep feature images is determined, and the feature extraction network in the Transformer network model is trained based on this loss.
The self-attention Transformer network model in this example is based on the Swin-Transformer architecture, and 2 layers of convolutional downsampling are added to the 8-times-downsampled feature map output by the network, so as to align it with the feature map of the convolutional neural network in the CNN model.
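One plausible form of the 2-layer convolutional downsampling head described here, assuming PyTorch; the channel widths and activation are assumptions, and only the 8x-to-32x stride alignment follows from the text:

```python
import torch
from torch import nn

# Two stride-2 convolutions take an 8x-downsampled Transformer feature map
# to 32x, matching the CNN's deep feature map. Channel widths are assumed.
align_head = nn.Sequential(
    nn.Conv2d(512, 1024, kernel_size=3, stride=2, padding=1),
    nn.GELU(),
    nn.Conv2d(1024, 2048, kernel_size=3, stride=2, padding=1),
)

swin_feat = torch.randn(1, 512, 28, 28)   # 8x map of a 224x224 input
aligned = align_head(swin_feat)           # -> (1, 2048, 7, 7), i.e. 32x
```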
Based on the above embodiments, the embodiments of the present disclosure further provide an image classification method.
Fig. 8 is a schematic diagram of a sixth embodiment according to the present disclosure, which provides an image classification method.
As shown in fig. 8, the image classification method may include:
step 801, an image to be processed is acquired.
Step 802, inputting the image to be processed into a second feature extraction network in a second image classification model to obtain a third feature image of the image to be processed.
The second feature extraction network is trained based on a first feature image and a second feature image, wherein the first feature image is obtained by feature extraction of a sample image by using the first feature extraction network in the first image classification model, and the second feature image is obtained by feature extraction of a mask image of the sample image by using the second feature extraction network.
It should be noted that, for the related description of training the second image classification model, reference may be made to the related description of the embodiments of the present disclosure, which is not repeated herein.
And 803, classifying the third characteristic image by using a classification network in the second image classification model to obtain type label information of the image to be processed.
According to the image classification method provided by this embodiment, the second image classification model is self-supervised trained based on the first image classification model; the image to be processed is input into the trained second image classification model, feature extraction is performed on it by the feature extraction network in the second image classification model to obtain a corresponding feature image, and this feature image is classified by the classification network in the second image classification model to obtain the type tag information of the image to be processed. In this way, the trained second image classification model can accurately classify the image to be processed, improving the classification accuracy of the second image classification model.
Fig. 9 is a schematic diagram of a seventh embodiment of the present disclosure, which provides a training apparatus of an image classification model.
As shown in fig. 9, the training apparatus 90 of the image classification model may include a first determining module 901, a second determining module 902, a feature extracting module 903, and a first training module 904, wherein:
a first determining module 901 is configured to determine a mask image of the sample image.
A second determining module 902 is configured to determine a first feature image of the sample image using the first feature extraction network in the first image classification model.
The feature extraction module 903 is configured to input the mask image to a second feature extraction network in the second image classification model to obtain a second feature image, where the first feature image and the second feature image have the same size.
A first training module 904 is configured to train the second feature extraction network according to the first feature image and the second feature image.
The training device for the image classification model of the embodiment of the disclosure determines a first feature image of a sample image by using a first feature extraction network in a first image classification model, inputs a mask image of the sample image to a second feature extraction network in a second image classification model to obtain a second feature image of the same size as the first feature image, and trains the second feature extraction network of the second image classification model according to the first feature image and the second feature image. In this way, the first image classification model is used to perform self-supervised training of the second image classification model, which reduces the training cost of the second image classification model while enabling the trained model to achieve a good classification effect.
In one embodiment of the present disclosure, as shown in fig. 10, the training apparatus 100 of the image classification model may include: a first determination module 1001, a second determination module 1002, a feature extraction module 1003, a first training module 1004, a first classification module 1005, a second classification module 1006, a second training module 1007, a first acquisition module 1008, a third training module 1009, a second acquisition module 1010, and a fourth training module 1011.
It should be noted that, for the detailed description of the second determining module 1002 and the feature extraction module 1003, reference may be made to the description of the second determining module 902 and the feature extraction module 903 in fig. 9, which is not repeated here.
In one embodiment of the present disclosure, the first classification module 1005 is configured to classify the first feature image using a classification network in a first image classification model to obtain first type tag information of the sample image.
A second classification module 1006 is configured to input the sample image into a second image classification model to obtain second type tag information of the sample image.
A second training module 1007 is configured to train the second image classification model according to the first type of tag information and the second type of tag information.
In one embodiment of the present disclosure, the apparatus further comprises:
a first obtaining module 1008, configured to obtain type labeling data of the sample image, where the type labeling data includes third type tag information;
the third training module 1009 is configured to train the first image classification model according to the third type of tag information and the first type of tag information.
In one embodiment of the present disclosure, the second training module 1007 is specifically configured to: determining a distillation loss value according to the first type tag information and the second type tag information; and adjusting model parameters of the second image classification model according to the distillation loss value to realize training.
In one embodiment of the present disclosure, the first training module 1004 is specifically configured to: determining a mean square error loss between the first feature image and the second feature image; and training the second characteristic extraction network according to the mean square error loss.
In one embodiment of the present disclosure, the apparatus further comprises:
a second obtaining module 1010, configured to obtain type labeling data of the sample image, where the type labeling data includes third type tag information;
and a fourth training module 1011 for training the second image classification model according to the third type tag information and the second type tag information.
In one embodiment of the present disclosure, the first feature extraction network is a convolutional neural network and the second feature extraction network is a self-attention mechanism based conversion transformer network.
In one embodiment of the present disclosure, the first determining module 1001 is specifically configured to: dividing the sample image to obtain a plurality of image blocks of the sample image; masking a portion of the plurality of image blocks to obtain a masked image of the sample image.
It should be noted that the training method and the explanation of the image classification method described above are also applicable to the training device of the image classification model in this embodiment, and this embodiment will not be described in detail.
Fig. 11 is a schematic view of a ninth embodiment according to the present disclosure, which provides an image classification apparatus.
As shown in fig. 11, the image classification apparatus 110 may include an acquisition module 1101, a feature extraction module 1102, and a classification module 1103, wherein:
the acquisition module 1101 is configured to acquire an image to be processed.
The feature extraction module 1102 is configured to input an image to be processed into a second feature extraction network in a second image classification model to obtain a third feature image of the image to be processed, where the second feature extraction network is obtained by training based on a first feature image and a second feature image, the first feature image is obtained by feature extraction of a sample image by using the first feature extraction network in the first image classification model, and the second feature image is obtained by feature extraction of a mask image of the sample image by using the second feature extraction network.
The classification module 1103 is configured to classify the third feature image by using the classification network in the second image classification model, so as to obtain type tag information of the image to be processed.
The image classification apparatus provided by this embodiment performs self-supervised training of the second image classification model based on the first image classification model; the image to be processed is input into the trained second image classification model, feature extraction is performed on it by the feature extraction network in the second image classification model to obtain a corresponding feature image, and the corresponding feature image is classified by the classification network in the second image classification model to obtain the type tag information of the image to be processed. In this way, the trained second image classification model can accurately classify the image to be processed, improving the classification accuracy of the second image classification model.
According to embodiments of the present disclosure, the present disclosure also provides an electronic device and a readable storage medium and a computer program product.
Fig. 12 shows a schematic block diagram of an example electronic device 1200 that can be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 12, the electronic device 1200 may include a computing unit 1201 that may perform various suitable actions and processes according to a computer program stored in a Read Only Memory (ROM) 1202 or a computer program loaded from a storage unit 1208 into a Random Access Memory (RAM) 1203. In the RAM 1203, various programs and data required for the operation of the device 1200 may also be stored. The computing unit 1201, the ROM 1202, and the RAM 1203 are connected to each other via a bus 1204. An input/output (I/O) interface 1205 is also connected to the bus 1204.
Various components in device 1200 are connected to I/O interface 1205, including: an input unit 1206 such as a keyboard, mouse, etc.; an output unit 1207 such as various types of displays, speakers, and the like; a storage unit 1208 such as a magnetic disk, an optical disk, or the like; and a communication unit 1209, such as a network card, modem, wireless communication transceiver, etc. The communication unit 1209 allows the device 1200 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunications networks.
The computing unit 1201 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of the computing unit 1201 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, Digital Signal Processors (DSPs), and any suitable processor, controller, microcontroller, etc. The computing unit 1201 performs the various methods and processes described above, such as the training method of the image classification model. For example, in some embodiments, the training method of the image classification model may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 1208. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 1200 via the ROM 1202 and/or the communication unit 1209. When the computer program is loaded into the RAM 1203 and executed by the computing unit 1201, one or more steps of the above-described training method of the image classification model may be performed. Alternatively, in other embodiments, the computing unit 1201 may be configured to perform the training method of the image classification model in any other suitable way (e.g., by means of firmware).
In some embodiments, the computing unit 1201 likewise performs the image classification method described above. For example, the image classification method may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 1208. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 1200 via the ROM 1202 and/or the communication unit 1209. When the computer program is loaded into the RAM 1203 and executed by the computing unit 1201, one or more steps of the image classification method described above may be performed. Alternatively, in other embodiments, the computing unit 1201 may be configured to perform the image classification method in any other suitable way (e.g., by means of firmware).
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuits, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be a special purpose or general purpose programmable processor, operable to receive data and instructions from, and to transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for carrying out methods of the present disclosure may be written in any combination of one or more programming languages. This program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowcharts and/or block diagrams to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine, or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to the user; and a keyboard and pointing device (e.g., a mouse or trackball) by which the user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback), and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here may be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system may be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: Local Area Networks (LANs), Wide Area Networks (WANs), the internet, and blockchain networks.
A computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, also called a cloud computing server or cloud host, which is a host product in a cloud computing service system and overcomes the drawbacks of traditional physical hosts and Virtual Private Server (VPS) services, namely high management difficulty and weak service scalability. The server may also be a server of a distributed system, or a server combined with a blockchain.
It should be noted that artificial intelligence is the discipline of making computers simulate certain human thought processes and intelligent behaviors (such as learning, reasoning, thinking, and planning), and it involves technologies at both the hardware and the software level. Artificial intelligence hardware technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, and the like; artificial intelligence software technologies mainly include computer vision, speech recognition, natural language processing, machine learning/deep learning, big data processing, and knowledge graph technologies.
It should be appreciated that steps may be reordered, added, or deleted in the various flows shown above. For example, the steps recited in the present disclosure may be performed in parallel, sequentially, or in a different order, provided that the desired results of the technical solutions of the present disclosure can be achieved; no limitation is imposed herein.
The above detailed description should not be taken as limiting the scope of the present disclosure. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present disclosure are intended to be included within the scope of the present disclosure.

Claims (20)

1. A method of training an image classification model, comprising:
determining a mask image of a sample image;
determining a first feature image of the sample image by using a first feature extraction network in a first image classification model, wherein the first image classification model is obtained by training based on the sample image and corresponding type label data;
inputting the mask image into a second feature extraction network in a second image classification model to obtain a second feature image, wherein the first feature image and the second feature image are the same in size;
training the second feature extraction network according to the first feature image and the second feature image, so as to realize self-supervised training of the second image classification model;
wherein the training the second feature extraction network according to the first feature image and the second feature image comprises:
determining a mean square error loss between the first feature image and the second feature image;
and training the second feature extraction network according to the mean square error loss.
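By way of illustration only, the self-supervised step recited in claim 1 admits a minimal PyTorch-style sketch such as the following; the function name, the frozen teacher, and the optimizer handling are assumptions added for completeness, since the claim itself fixes only the mean square error loss between the two feature images.

import torch
import torch.nn.functional as F

def self_supervised_step(teacher, student, optimizer, sample_image, mask_image):
    # teacher: first feature extraction network (frozen here by assumption)
    # student: second feature extraction network being trained
    with torch.no_grad():
        first_feature = teacher(sample_image)        # first feature image
    second_feature = student(mask_image)             # second feature image, same size
    loss = F.mse_loss(second_feature, first_feature) # mean square error loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                                 # update the second feature extraction network
    return loss.item()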
2. The method of claim 1, wherein the method further comprises:
classifying the first feature image by using a classification network in the first image classification model to obtain first type label information of the sample image;
inputting the sample image into the second image classification model to obtain second type label information of the sample image;
and training the second image classification model according to the first type label information and the second type label information.
3. The method of claim 2, wherein the method further comprises:
acquiring type labeling data of the sample image, wherein the type labeling data comprises third type label information;
and training the first image classification model according to the third type label information and the first type label information.
4. The method of claim 2, wherein the training the second image classification model according to the first type label information and the second type label information comprises:
determining a distillation loss value according to the first type label information and the second type label information;
and adjusting model parameters of the second image classification model according to the distillation loss value to realize training.
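By way of illustration only, claim 4 does not fix the distillation loss formula; one common choice, assumed here, is temperature-scaled KL divergence between the teacher's and the student's soft labels (Hinton-style distillation), sketched below in PyTorch.

import torch.nn.functional as F

def distillation_loss(first_logits, second_logits, temperature: float = 2.0):
    # first_logits: first type label information (teacher model)
    # second_logits: second type label information (student model)
    teacher_prob = F.softmax(first_logits / temperature, dim=-1)
    student_log_prob = F.log_softmax(second_logits / temperature, dim=-1)
    # KL divergence, scaled by T^2 to keep gradient magnitudes comparable
    return F.kl_div(student_log_prob, teacher_prob, reduction="batchmean") * temperature ** 2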
5. The method of claim 2, wherein the method further comprises:
acquiring type labeling data of the sample image, wherein the type labeling data comprises third type label information;
and training the second image classification model according to the third type label information and the second type label information.
6. The method of claim 1, wherein the first feature extraction network is a convolutional neural network and the second feature extraction network is a Transformer network based on a self-attention mechanism.
7. The method of any of claims 1-6, wherein the determining a mask image of the sample image comprises:
dividing the sample image to obtain a plurality of image blocks of the sample image;
masking a portion of the plurality of image blocks to obtain the mask image of the sample image.
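By way of illustration only, dividing the sample image into image blocks and masking a portion of them can be sketched as follows; the patch size and mask ratio are illustrative assumptions, as the claim does not specify them.

import torch

def make_mask_image(sample_image: torch.Tensor, patch: int = 16, ratio: float = 0.4) -> torch.Tensor:
    # sample_image is assumed to have shape (C, H, W) with H, W divisible by patch.
    _, h, w = sample_image.shape
    masked = sample_image.clone()
    for y in range(0, h, patch):                # iterate over the image blocks
        for x in range(0, w, patch):
            if torch.rand(()).item() < ratio:   # mask this block with probability `ratio`
                masked[:, y:y + patch, x:x + patch] = 0.0
    return masked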
8. An image classification method, comprising:
acquiring an image to be processed;
inputting the image to be processed into a second feature extraction network in a second image classification model to obtain a third feature image of the image to be processed, wherein the second feature extraction network is obtained by training based on a first feature image and a second feature image, the first feature image is obtained by performing feature extraction on a sample image by using a first feature extraction network in a first image classification model, the second feature image is obtained by performing feature extraction on a mask image of the sample image by using the second feature extraction network, and the first image classification model is obtained by training based on the sample image and corresponding type label data;
classifying the third feature image by using a classification network in the second image classification model to obtain type label information of the image to be processed;
wherein the second feature extraction network being obtained by training based on the first feature image and the second feature image specifically comprises:
determining a mean square error loss between the first feature image and the second feature image;
and training the second feature extraction network according to the mean square error loss.
9. The method of claim 8, wherein the first feature extraction network is a convolutional neural network and the second feature extraction network is a Transformer network based on a self-attention mechanism.
10. A training apparatus for an image classification model, comprising:
a first determining module for determining a mask image of a sample image;
a second determining module, used for determining a first feature image of the sample image by using a first feature extraction network in a first image classification model, the first image classification model being obtained by training based on the sample image and corresponding type label data;
a feature extraction module, used for inputting the mask image into a second feature extraction network in a second image classification model to obtain a second feature image, wherein the first feature image and the second feature image are the same in size; and
a first training module, used for training the second feature extraction network according to the first feature image and the second feature image, so as to realize self-supervised training of the second image classification model;
wherein the first training module is specifically configured to:
determining a mean square error loss between the first feature image and the second feature image;
and training the second feature extraction network according to the mean square error loss.
11. The apparatus of claim 10, wherein the apparatus further comprises:
the first classification module is used for classifying the first feature image by using a classification network in the first image classification model to obtain first type label information of the sample image;
the second classification module is used for inputting the sample image into the second image classification model to obtain second type label information of the sample image;
and the second training module is used for training the second image classification model according to the first type label information and the second type label information.
12. The apparatus of claim 11, wherein the apparatus further comprises:
the first acquisition module is used for acquiring type labeling data of the sample image, wherein the type labeling data comprises third type label information;
and the third training module is used for training the first image classification model according to the third type label information and the first type label information.
13. The apparatus of claim 11, wherein the second training module is specifically configured to:
determining a distillation loss value according to the first type label information and the second type label information;
and adjusting model parameters of the second image classification model according to the distillation loss value to realize training.
14. The apparatus of claim 11, wherein the apparatus further comprises:
the second acquisition module is used for acquiring type labeling data of the sample image, wherein the type labeling data comprises third type label information;
and the fourth training module is used for training the second image classification model according to the third type label information and the second type label information.
15. The apparatus of claim 10, wherein the first feature extraction network is a convolutional neural network and the second feature extraction network is a Transformer network based on a self-attention mechanism.
16. The apparatus according to any one of claims 10-15, wherein the first determining module is specifically configured to:
dividing the sample image to obtain a plurality of image blocks of the sample image;
masking a portion of the plurality of image blocks to obtain the mask image of the sample image.
17. An image classification apparatus comprising:
the acquisition module is used for acquiring the image to be processed;
the feature extraction module is used for inputting the image to be processed into a second feature extraction network in a second image classification model to obtain a third feature image of the image to be processed, wherein the second feature extraction network is obtained by training based on a first feature image and a second feature image, the first feature image is obtained by performing feature extraction on a sample image by using a first feature extraction network in a first image classification model, the second feature image is obtained by performing feature extraction on a mask image of the sample image by using the second feature extraction network, and the first image classification model is obtained by training based on the sample image and corresponding type label data;
the classification module is used for classifying the third feature image by using a classification network in the second image classification model to obtain type label information of the image to be processed;
wherein the second feature extraction network being obtained by training based on the first feature image and the second feature image specifically comprises:
determining a mean square error loss between the first feature image and the second feature image;
and training the second feature extraction network according to the mean square error loss.
18. The apparatus of claim 17, wherein the first feature extraction network is a convolutional neural network and the second feature extraction network is a Transformer network based on a self-attention mechanism.
19. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-7 or the method of any one of claims 8-9.
20. A non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the method of any one of claims 1-7 or the method of any one of claims 8-9.
CN202211165540.6A 2022-09-23 2022-09-23 Training method of image classification model, image classification method and device Active CN115457329B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211165540.6A CN115457329B (en) 2022-09-23 2022-09-23 Training method of image classification model, image classification method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211165540.6A CN115457329B (en) 2022-09-23 2022-09-23 Training method of image classification model, image classification method and device

Publications (2)

Publication Number Publication Date
CN115457329A CN115457329A (en) 2022-12-09
CN115457329B true CN115457329B (en) 2023-11-10

Family

ID=84306460

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211165540.6A Active CN115457329B (en) 2022-09-23 2022-09-23 Training method of image classification model, image classification method and device

Country Status (1)

Country Link
CN (1) CN115457329B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116453121B (en) * 2023-06-13 2023-12-22 合肥市正茂科技有限公司 Training method and device for lane line recognition model

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114724007A (en) * 2022-03-31 2022-07-08 北京百度网讯科技有限公司 Training classification model, data classification method, device, equipment, medium and product
CN114972791A (en) * 2022-06-02 2022-08-30 电子科技大学长三角研究院(衢州) Image classification model training method, image classification method and related device

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10885400B2 (en) * 2018-07-03 2021-01-05 General Electric Company Classification based on annotation information




Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant