CN113762508A - Training method, device, equipment and medium for image classification network model


Info

Publication number
CN113762508A
CN113762508A
Authority
CN
China
Prior art keywords
sample image
image
training
positive
classification
Prior art date
Legal status
Pending
Application number
CN202111038731.1A
Other languages
Chinese (zh)
Inventor
刘浩
Current Assignee
Jingdong Kunpeng Jiangsu Technology Co Ltd
Original Assignee
Jingdong Kunpeng Jiangsu Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Jingdong Kunpeng Jiangsu Technology Co Ltd
Priority to CN202111038731.1A
Publication of CN113762508A


Classifications

    • G06N 3/084 (neural networks; learning methods): Backpropagation, e.g. using gradient descent
    • G06F 18/2155 (pattern recognition; generating training patterns): characterised by the incorporation of unlabelled data, e.g. multiple instance learning [MIL], semi-supervised techniques using expectation-maximisation [EM] or naïve labelling
    • G06F 18/22 (pattern recognition; analysing): Matching criteria, e.g. proximity measures
    • G06F 18/241 (pattern recognition; classification techniques): Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06N 3/045 (neural networks; architecture): Combinations of networks


Abstract

The embodiment of the invention discloses a training method, device, equipment and medium for an image classification network model, where the model comprises a feature extraction submodel and a classification submodel corresponding to at least one category. The training method comprises: training the feature extraction submodel on unlabeled sample images by way of sample-image contrast, the unlabeled sample images comprising a first sample image pair consisting of a positive sample image and a negative sample image; after this training ends, inputting a labeled sample image into the target feature extraction submodel to obtain a target feature map, and inputting the target feature map and the label category corresponding to the labeled sample image into the classification submodel to obtain a prediction category for the labeled sample image; and ending training of the classification submodel, and with it the image classification network model, when the loss function determined from the label category and the prediction category meets a preset convergence condition. Labeling cost can thus be reduced while high model performance is ensured.

Description

Training method, device, equipment and medium for image classification network model
Technical Field
The embodiment of the invention relates to computer technology, and in particular to a training method, device, equipment and medium for an image classification network model.
Background
With the rapid development of deep learning technology, objects in an image can be identified and classified by using an image classification network model based on deep learning.
At present, before using an image classification network model, the image classification network model is usually trained in a supervised manner, so that the trained image classification network model can accurately perform image processing operations.
However, in the process of implementing the present invention, the inventor finds that at least the following problems exist in the prior art:
in the process of supervised training of the image classification network model, the performance of the model is limited by the quantity of acquired sample data and the quality of its labeling. To obtain a high-performance image classification network model, a large amount of labeled data is typically needed to train the model, so the labor cost of data acquisition and labeling is very high and the cycle is long, which hinders model iteration.
Disclosure of Invention
The embodiment of the invention provides a training method, device, equipment and medium for an image classification network model, which reduce the labeling cost while ensuring high performance of the image classification network model.
In a first aspect, an embodiment of the present invention provides a method for training an image classification network model, where the image classification network model includes a feature extraction submodel and a classification submodel corresponding to at least one category, the training method comprising:
training the feature extraction sub-model based on a label-free sample image in a sample image contrast mode, wherein the label-free sample image comprises a first sample image pair consisting of a positive sample image and a negative sample image;
under the condition that the image similarity between the positive sample image and the negative sample image in the first sample image pair meets a preset convergence condition, finishing training of the feature extraction submodel to obtain a target feature extraction submodel;
inputting the labeled sample image into the target feature extraction submodel to obtain a target feature map, and inputting the target feature map and the label category corresponding to the labeled sample image into the classification submodel to obtain the prediction category of the labeled sample image;
and finishing the training of the classification submodel under the condition that the loss function determined by the label category and the prediction category meets a preset convergence condition, and finishing the training of the image classification network model.
In a second aspect, an embodiment of the present invention further provides a device for training an image classification network model, where the image classification network model includes a feature extraction submodel and a classification submodel corresponding to at least one category, the training device comprising:
the characteristic extraction sub-model training module is used for training the characteristic extraction sub-model based on a label-free sample image in a sample image comparison mode, wherein the label-free sample image comprises a first sample image pair consisting of a positive sample image and a negative sample image; under the condition that the image similarity between the positive sample image and the negative sample image in the first sample image pair meets a preset convergence condition, finishing training of the feature extraction submodel to obtain a target feature extraction submodel;
the classification sub-model training module is used for inputting the labeled sample image into the target feature extraction sub-model to obtain a target feature map, and inputting the target feature map and the label category corresponding to the labeled sample image into the classification sub-model to obtain the prediction category of the labeled sample image; and finishing the training of the classification submodel under the condition that the loss function determined by the label category and the prediction category meets a preset convergence condition, and finishing the training of the image classification network model.
In a third aspect, an embodiment of the present invention further provides an electronic device, where the electronic device includes:
one or more processors;
a memory for storing one or more programs;
when the one or more programs are executed by the one or more processors, the one or more processors implement the training method for an image classification network model as provided by any embodiment of the invention.
In a fourth aspect, an embodiment of the present invention further provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the method for training the image classification network model according to any embodiment of the present invention.
The embodiment of the invention has the following advantages or beneficial effects:
the feature extraction submodel in the image classification network model is trained on unlabeled sample images, i.e. on first sample image pairs each consisting of a positive sample image and a negative sample image; when the image similarity between the positive and negative sample images in a first sample image pair meets a preset convergence condition, training of the feature extraction submodel ends and the target feature extraction submodel is obtained. The feature extraction submodel can thus be trained in a self-supervised manner, learning from a large amount of unlabeled sample data so that it accurately extracts feature information. After this stage, a labeled sample image is input into the target feature extraction submodel to obtain a target feature map, and the target feature map together with the label category corresponding to the labeled sample image is input into the classification submodel to obtain a prediction category for the labeled sample image. When the loss function determined from the label category and the prediction category meets a preset convergence condition, training of the classification submodel, and with it the image classification network model, is finished. The classification submodel can therefore be trained on only a small number of labeled sample images and can accurately classify the feature information extracted by the trained target feature extraction submodel, ensuring high performance of the image classification network model while only a small amount of sample data needs labeling, which greatly reduces labeling cost.
Drawings
FIG. 1 is a flowchart of a training method of an image classification network model according to an embodiment of the present invention;
FIG. 2 is an example of an image classification network model according to an embodiment of the present invention;
FIG. 3 is an example of a feature extraction submodel training process according to an embodiment of the invention;
FIG. 4 is a flowchart of a training method of an image classification network model according to a second embodiment of the present invention;
FIG. 5 is an example of an image classification network model according to a second embodiment of the present invention;
FIG. 6 is a schematic structural diagram of a training apparatus for an image classification network model according to a third embodiment of the present invention;
FIG. 7 is a schematic structural diagram of an electronic device according to a fourth embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting of the invention. It should be further noted that, for the convenience of description, only some of the structures related to the present invention are shown in the drawings, not all of the structures.
Example one
Fig. 1 is a flowchart of a training method for an image classification network model according to an embodiment of the present invention. The method is applicable to training an image classification network model, and especially to training a model for identifying objects in an automatic driving scene. The method can be executed by a training device for the image classification network model, which can be implemented in software and/or hardware and integrated in an electronic device.
Fig. 2 shows an example of an image classification network model. As shown in fig. 2, the image classification network model in this embodiment may include: the feature extraction submodel and the classification submodel corresponding to at least one category. Wherein the feature extraction submodel may be configured to: and performing feature extraction on the input image, acquiring a feature map corresponding to the input image, and inputting the feature map into the classification submodel. The classification submodel may be used to: and classifying the image based on the input feature map, predicting the category to which the object in the input image belongs, and outputting the category. The class to which the object in the image belongs can be identified by using an image classification network model.
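As a rough illustration of this two-part structure, the following is a minimal PyTorch sketch (the patent does not name a framework; the ResNet-18 backbone and layer sizes are assumptions for illustration only):

import torch
import torch.nn as nn
import torchvision.models as models

class ImageClassificationNetwork(nn.Module):
    """Hypothetical two-part model: feature extraction submodel + classifier."""
    def __init__(self, num_classes: int):
        super().__init__()
        # Feature extraction submodel: a convolutional backbone (e.g. ResNet).
        backbone = models.resnet18(weights=None)
        self.feature_extractor = nn.Sequential(*list(backbone.children())[:-1])
        # Classification submodel: maps the extracted features to class scores.
        self.classifier = nn.Linear(512, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feature_map = self.feature_extractor(x)             # (N, 512, 1, 1)
        features = torch.flatten(feature_map, start_dim=1)  # (N, 512)
        return self.classifier(features)                    # (N, num_classes)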
As shown in fig. 1, the training method of the image classification network model specifically includes the following steps:
s110, training the feature extraction sub-model based on the label-free sample image in a sample image contrast mode, wherein the label-free sample image comprises a first sample image pair consisting of a positive sample image and a negative sample image.
The feature extraction submodel may refer to the convolutional network structure in the image classification network model that is used to extract image feature information. For example, the feature extraction submodel may be, but is not limited to, a VGG (Visual Geometry Group) convolutional network or a ResNet convolutional network. An unlabeled sample image may be a directly acquired sample image without image annotation. The positive sample image may refer to a scene image in the application scene, and the negative sample image to an image completely unrelated to the application scene. For example, a scene image from an automatic driving scene may be taken as a positive sample image, while images completely unrelated to the automatic driving scene, such as indoor scene images and non-road scene images, are taken as negative sample images.
Specifically, a positive sample image and a negative sample image among the unlabeled sample images can form a first sample image pair that is input into the feature extraction submodel, and the feature extraction submodel is trained in a self-supervised manner by way of sample-image contrast, so that it can extract generic feature information from a large number of unlabeled sample images and thereby learn from a large amount of unlabeled data. Because self-supervised training requires no data annotation, the feature extraction submodel can be trained with a number of unlabeled sample images tens or even hundreds of times larger than the number of labeled sample images, giving the trained submodel a strong background-distinguishing capability that helps the classification submodel distinguish positive and negative samples.
And S120, under the condition that the image similarity between the positive sample image and the negative sample image in the first sample image pair meets a preset convergence condition, finishing training of the feature extraction sub-model to obtain a target feature extraction sub-model.
Specifically, the image similarity between the positive sample image and the negative sample image in the first sample image pair may be determined from the feature maps output by the feature extraction submodel, and the submodel may be trained with this image similarity as the training target, so that the trained target feature extraction submodel has a strong background-distinguishing capability. For example, the image similarity may be back-propagated to the feature extraction submodel and its weights adjusted until a preset convergence condition is reached, e.g. the image similarity falls below a preset similarity, its variation tends to be stable, or the number of training iterations equals a preset number. Training of the feature extraction submodel then ends, and the trained target feature extraction submodel helps the classification submodel distinguish positive sample images from negative sample images.
S130, inputting the labeled sample image into a target feature extraction submodel to obtain a target feature map, and inputting the target feature map and the label category corresponding to the labeled sample image into a classification submodel to obtain the prediction category of the labeled sample image.
The classification submodel may refer to the network structure in the image classification network model that classifies images based on image feature information, so as to determine the category to which an object in the image belongs. The labeled sample images may be sample images corresponding to the categories to be identified in the application scene, and these require data annotation, i.e. labeling. For example, in an automatic driving scene, if objects such as vehicles, pedestrians, and traffic signs need to be recognized, images containing a vehicle, a pedestrian, or a traffic sign may be used as sample images and labeled with their categories.
Specifically, the trained target feature extraction submodel can accurately extract image feature information, and at the moment, the trained target feature extraction submodel can be used for supervised training of the classification submodel based on a small amount of labeled sample images, so that the classification submodel can accurately identify the classes to which the articles in the images belong, and the high performance of the image classification network model is ensured. For example, the training process of the classification submodel may include: and inputting the labeled sample image into a target feature extraction submodel to obtain an output target feature image, and then inputting the target feature image and the label category corresponding to the labeled sample image into a classification submodel to obtain a prediction category output by the classification submodel, namely the prediction category to which the sample object in the labeled sample image belongs.
And S140, under the condition that the loss function determined by the label category and the prediction category meets the preset convergence condition, finishing the training of the classification sub-model, and finishing the training of the image classification network model.
Specifically, based on a loss function, a training error may be determined according to an output prediction category and a label category corresponding to a labeled sample image, the training error is propagated back to the classification submodel, and each network parameter in the classification submodel is adjusted until training is finished when a preset convergence condition is reached, for example, when the training error is smaller than the preset error or an error change range tends to be stable, or when the iterative training number is equal to a preset number, it is indicated that training of the classification submodel is finished, so that a trained classification submodel may be obtained, and at this time, training of the image classification network model is finished.
It should be noted that, when the classification submodel is trained, all network parameters in the target feature extraction submodel are locked, so as to ensure that the network parameters in the target feature extraction submodel are not updated in the process of training the classification submodel.
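A hedged sketch of this supervised stage, reusing the ImageClassificationNetwork sketched above; the optimizer, learning rate, and loader name are illustrative assumptions:

import torch
import torch.nn as nn

def train_classification_submodel(model, labeled_loader, epochs=10, lr=1e-3):
    # Lock all network parameters of the target feature extraction submodel
    # so they are not updated while the classification submodel is trained.
    for p in model.feature_extractor.parameters():
        p.requires_grad = False

    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.classifier.parameters(), lr=lr)

    for _ in range(epochs):
        for images, label_categories in labeled_loader:
            predictions = model(images)
            # Loss determined by the label category and the prediction category.
            loss = criterion(predictions, label_categories)
            optimizer.zero_grad()
            loss.backward()   # gradients flow only into the classifier
            optimizer.step()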
It should be noted that the feature extraction submodel is used as a core in the image classification network model, and a training effect of the feature extraction submodel directly affects model performance of the image classification network model, so that the embodiment performs self-supervision training on the feature extraction submodel by using a large number of unlabeled sample images to obtain a high-performance feature extraction submodel, and after the training of the feature extraction submodel is finished, the high performance of the image classification network model can be ensured only by performing supervision training on the classification submodel by using a small number of labeled sample images, and only a small number of sample data need to be labeled, so that labeling cost is greatly reduced.
In the technical scheme of this embodiment, the feature extraction submodel in the image classification network model is trained on unlabeled sample images by way of sample-image contrast, i.e. on first sample image pairs each consisting of a positive sample image and a negative sample image; when the image similarity between the positive and negative sample images in a first sample image pair meets a preset convergence condition, training of the feature extraction submodel ends and the target feature extraction submodel is obtained. The feature extraction submodel is thus trained in a self-supervised manner, learning from a large amount of unlabeled sample data so that it can accurately extract feature information. After that, a labeled sample image is input into the target feature extraction submodel to obtain a target feature map, and the target feature map and the label category corresponding to the labeled sample image are input into the classification submodel to obtain a prediction category for the labeled sample image; when the loss function determined from the label category and the prediction category meets a preset convergence condition, training of the classification submodel, and with it the image classification network model, is finished. The classification submodel can therefore be trained on only a small number of labeled sample images and can accurately classify the feature information extracted by the trained target feature extraction submodel, ensuring high performance of the image classification network model while only a small amount of sample data needs labeling, which greatly reduces labeling cost.
On the basis of the above technical solution, the unlabeled sample images may further include: a second sample image pair consisting of the positive sample image and its corresponding enhanced sample image.
The enhanced sample image corresponding to a positive sample image may be a sample image obtained by randomly transforming that positive sample image, for example by transformation operations such as rotation, translation, scaling, or noise removal. In an automatic driving scene, for instance, the positive sample image may be an object image related to the scene, the enhanced sample image a random transformation of the positive sample image, and the negative sample image an object image completely unrelated to the automatic driving scene, such as an indoor scene image or a non-road scene image.
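One possible way to produce the enhanced sample image, sketched with torchvision transforms; the specific transform parameters are assumptions, not values from the patent:

import torch
from torchvision import transforms

# Random rotation, translation, and scaling, plus a mild noise perturbation.
augment = transforms.Compose([
    transforms.RandomAffine(degrees=15, translate=(0.1, 0.1), scale=(0.8, 1.2)),
    transforms.ToTensor(),
    transforms.Lambda(lambda t: (t + 0.02 * torch.randn_like(t)).clamp(0.0, 1.0)),
])

# enhanced_image = augment(positive_pil_image)  # positive_pil_image: a PIL image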
Specifically, a positive sample image, an enhanced sample image and a negative sample image corresponding to the positive sample image in the unlabeled sample images can be input into the feature extraction submodel as a group of training data, and the feature extraction submodel can be subjected to more effective self-supervision training in a mode of performing sample image comparison on the first sample image pair and the second sample image pair, so that the feature extraction submodel can extract more universal feature information from a large number of unlabeled sample images, complete the learning of a large number of unlabeled data, further improve the training effect of the feature extraction submodel, and enable the trained target feature extraction submodel to have higher background distinguishing capability.
Illustratively, the training process of the feature extraction submodel using the unlabeled positive sample image, the enhanced sample image corresponding to the positive sample image, and the negative sample image may include the following steps S111-S114:
and S111, respectively inputting the positive sample image, the enhanced sample image corresponding to the positive sample image and the negative sample image into a feature extraction sub-model, and determining a positive feature map corresponding to the positive sample image, an enhanced feature map corresponding to the enhanced sample image and a negative feature map corresponding to the negative sample image according to the output of the feature extraction sub-model.
Specifically, fig. 3 shows an example of the training process of the feature extraction submodel. As shown in fig. 3, a positive sample image, a negative sample image, and the enhanced sample image corresponding to the positive sample image can be randomly selected from the unlabeled sample images, and the image sizes of the three sample images unified, e.g. to 224 × 224 × 3. The three sample images may be input one by one into the same feature extraction submodel, or input simultaneously into three weight-sharing feature extraction submodels (three submodels with identical weights, in essence a single feature extraction submodel) to improve training efficiency, as shown in fig. 3. From the output of the feature extraction submodel, the positive feature map corresponding to the positive sample image, the enhanced feature map corresponding to the enhanced sample image, and the negative feature map corresponding to the negative sample image can be obtained.
And S112, determining the image similarity between the positive sample image and the negative sample image in the first sample image pair according to the positive feature map and the negative feature map.
Specifically, the image similarity between the positive sample image and the negative sample image can be determined according to the positive feature map and the negative feature map by using a similarity calculation method such as a cosine distance.
Exemplarily, S112 may include: flattening the positive feature map and the negative feature map to determine the positive feature vector corresponding to the positive feature map and the negative feature vector corresponding to the negative feature map; and determining the image similarity between the positive sample image and the negative sample image in the first sample image pair from the positive feature vector and the negative feature vector.
Specifically, as shown in fig. 3, the positive feature map and the negative feature map may be flattened over height and width, reducing the three-dimensional feature information to two-dimensional vector information; for example, a feature map of size (64, 32, 128) becomes (2048, 128) after the flattening operation. The flattening operation thus yields the positive feature vector V+ corresponding to the positive feature map and the negative feature vector V− corresponding to the negative feature map. The image similarity between the positive sample image and the negative sample image can then be determined from the positive feature vector and the negative feature vector, using a similarity measure such as the cosine distance.
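A minimal sketch of the flattening and similarity computation; the shapes follow the (64, 32, 128) example in the text, and cosine similarity over the fully flattened vectors is one possible reading of the similarity measure:

import torch
import torch.nn.functional as F

def image_similarity(feat_a: torch.Tensor, feat_b: torch.Tensor) -> torch.Tensor:
    # (64, 32, 128) -> (2048, 128): flatten height and width, keep channels.
    va = feat_a.reshape(-1, feat_a.shape[-1])
    vb = feat_b.reshape(-1, feat_b.shape[-1])
    # Cosine similarity between the fully flattened feature vectors.
    return F.cosine_similarity(va.reshape(1, -1), vb.reshape(1, -1)).squeeze()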
And S113, determining the image similarity between the positive sample image and the enhanced sample image in the second sample image pair according to the positive feature map and the enhanced feature map.
Specifically, the image similarity between the positive sample image and the enhanced sample image can be determined according to the positive feature map and the enhanced feature map by using a similarity calculation mode such as a cosine distance.
Exemplarily, S113 may include: flattening the positive feature map and the enhanced feature map to determine the positive feature vector corresponding to the positive feature map and the enhanced feature vector corresponding to the enhanced feature map; and determining the image similarity between the positive sample image and the enhanced sample image in the second sample image pair from the positive feature vector and the enhanced feature vector.
Specifically, as shown in fig. 3, the positive feature map and the enhanced feature map may be flattened over height and width, reducing the three-dimensional feature information to two-dimensional vector information; for example, a feature map of size (64, 32, 128) becomes (2048, 128) after the flattening operation. The flattening operation thus yields the positive feature vector V+ corresponding to the positive feature map and the enhanced feature vector V+′ corresponding to the enhanced feature map. The image similarity between the positive sample image and the enhanced sample image can then be determined from the positive feature vector and the enhanced feature vector, using a similarity measure such as the cosine distance.
S114, determining a training total error according to the image similarity between the positive sample image and the negative sample image and the image similarity between the positive sample image and the enhanced sample image, reversely propagating the training total error to the feature extraction submodel, adjusting the weight in the feature extraction submodel, and ending the training until reaching a preset convergence condition to obtain the target feature extraction submodel.
Specifically, as shown in fig. 3, the image similarity between the positive sample image and the negative sample image and the image similarity between the positive sample image and the enhanced sample image may be input into a sample metric module, and a preset loss function used to compute the loss for the feature extraction submodel, i.e. the total training error. The total training error is back-propagated to the feature extraction submodel and the weights in the feature extraction submodel are adjusted until a preset convergence condition is reached, for example the total training error is smaller than a preset error, the error variation tends to be stable, or the number of training iterations equals a preset number, indicating that training of the feature extraction submodel is finished.
Exemplarily, determining the total training error in S114 from the image similarity between the positive sample image and the negative sample image and the image similarity between the positive sample image and the enhanced sample image may include: determining the total training error from these two image similarities based on a ternary loss (triplet loss) function.
Specifically, the image similarity between the positive sample image and the negative sample image and the image similarity between the positive sample image and the enhanced sample image are input into the ternary loss function to obtain the total training error. The feature extraction submodel can then be trained with the goal of driving the image similarity between the positive sample image and the enhanced sample image towards 100% and the image similarity between the positive sample image and the negative sample image towards 0, so that the trained target feature extraction submodel has a strong background-distinguishing capability, further improving the performance of the image classification network model.
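A hedged sketch of one such training step. PyTorch's TripletMarginLoss (L2-distance based) stands in for the cosine-similarity-based ternary loss described above; the margin, learning rate, and the name feature_extractor are assumptions:

import torch
import torch.nn as nn

# feature_extractor: the (weight-shared) feature extraction submodel.
triplet_loss = nn.TripletMarginLoss(margin=1.0)
optimizer = torch.optim.Adam(feature_extractor.parameters(), lr=1e-4)

def self_supervised_step(positive, enhanced, negative):
    f_pos = feature_extractor(positive).flatten(start_dim=1)
    f_enh = feature_extractor(enhanced).flatten(start_dim=1)
    f_neg = feature_extractor(negative).flatten(start_dim=1)
    # Pull positive/enhanced together and push positive/negative apart.
    loss = triplet_loss(f_pos, f_enh, f_neg)  # total training error
    optimizer.zero_grad()
    loss.backward()  # back-propagated into the feature extraction submodel
    optimizer.step()
    return loss.item()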
On the basis of the above technical solutions, after S140, the method may further include: acquiring a target image to be processed; and inputting the target image into the image classification network model after training is finished, and obtaining the target class to which the target object in the target image belongs according to the output of the image classification network model.
Specifically, the target image to be processed is input into the image classification network model after the training is finished, and the target class to which the target object in the target image belongs can be obtained according to the output of the image classification network model, so that the accuracy of image processing can be ensured by using the high-performance image classification network model.
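A usage sketch of the trained model; the preprocessing (resize to 224 × 224) and the file name are illustrative assumptions:

import torch
from PIL import Image
from torchvision import transforms

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])

model.eval()
with torch.no_grad():
    target_image = preprocess(Image.open("target.jpg")).unsqueeze(0)
    scores = model(target_image)
    target_class = scores.argmax(dim=1).item()  # class of the target object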
Example two
Fig. 4 is a flowchart of a training method of an image classification network model according to a second embodiment of the present invention. In the embodiment, on the basis of the above embodiments, the classification submodel is further optimized, and on this basis, the training process of the classification submodel is described in detail. Wherein explanations of the same or corresponding terms as those of the above embodiments are omitted.
Fig. 5 shows an example of an image classification network model. As shown in fig. 5, the classification submodel in the present embodiment may include: the coding network module and an independent branch network module corresponding to each category. Wherein each branch network module is configured to: and performing prediction processing on the feature map output by the feature extraction sub-model, determining the confidence coefficient that the object in the input image belongs to the current category, and inputting the confidence coefficient into the coding network module. The encoding network module may be to: and performing bit-wise splicing on the confidence degrees output by each branch network module, determining a target class according to a splicing result, and outputting the target class as a final classification result.
Each category corresponds to an independent branch network module, so the confidence that the object in the input image belongs to that category, in the range [0, 1], can be predicted by the corresponding branch network module. Each branch network module may be composed of several convolutional layers, batch normalization layers, nonlinear activation layers (e.g. the ReLU activation function), and a sigmoid mapping layer.
Specifically, the specific use process of the image classification network model is as follows: the input image is input into a feature extraction sub-model in the image classification network model for feature extraction, and the obtained feature map is input into the branch network module corresponding to each category, so that each branch network module can perform parallel processing on the input feature map, and performance loss caused by increase of the number of branches can be reduced. Each branch network module inputs the predicted confidence coefficient that the object in the input image belongs to the current category into the coding network module, the coding network module performs bit-wise splicing on the input confidence coefficients, the splicing sequence corresponds to the category corresponding to each branch network module, and the category corresponding to the position with the highest confidence coefficient can be used as a target category to be output, so that the final classification result can be obtained.
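An illustrative sketch of this branch-per-category structure; the layer sizes are assumptions, and only the shape of the computation follows fig. 5:

import torch
import torch.nn as nn

class BranchModule(nn.Module):
    """One independent branch: conv + batch norm + ReLU + sigmoid confidence."""
    def __init__(self, in_channels: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, 64, kernel_size=3, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(64, 1),
            nn.Sigmoid(),  # confidence in [0, 1] for this branch's category
        )

    def forward(self, feature_map: torch.Tensor) -> torch.Tensor:
        return self.net(feature_map)  # (N, 1)

class ClassificationSubmodel(nn.Module):
    def __init__(self, in_channels: int, num_classes: int):
        super().__init__()
        self.branches = nn.ModuleList(
            [BranchModule(in_channels) for _ in range(num_classes)]
        )

    def forward(self, feature_map: torch.Tensor) -> torch.Tensor:
        # Branches run independently; the encoding step concatenates their
        # confidences bit-wise, one position per category.
        confidences = torch.cat([b(feature_map) for b in self.branches], dim=1)
        return confidences  # target class = confidences.argmax(dim=1)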
As shown in fig. 4, the training method of the image classification network model provided in this embodiment specifically includes the following steps:
s410, training the feature extraction sub-model based on the label-free sample image in a sample image contrast mode, wherein the label-free sample image comprises a first sample image pair consisting of a positive sample image and a negative sample image.
And S420, under the condition that the image similarity between the positive sample image and the negative sample image in the first sample image pair meets a preset convergence condition, finishing training of the feature extraction sub-model to obtain a target feature extraction sub-model.
And S430, inputting the labeled sample image into a target feature extraction submodel to obtain a target feature map, and inputting the target feature map and the label category corresponding to the labeled sample image into a classification submodel to obtain the prediction category of the labeled sample image.
Specifically, based on the above-described correlation description, the prediction category to which the sample object in the labeled sample image belongs can be obtained using the image classification network model.
S440, determining a training error according to the prediction type and the label type corresponding to the labeled sample image, reversely transmitting the training error to each branch network module, adjusting the network parameters in each branch network module until the training is finished when a preset convergence condition is reached, and finishing the training of the image classification network model.
Specifically, the training error may be propagated back to each branch network module in the classification submodel, and the network parameter of each branch network module is adjusted until the training is finished when the preset convergence condition is reached, for example, when the training error is smaller than the preset error or the error variation range tends to be stable, or the iterative training number is equal to the preset number, it is determined that the training of each branch network module is finished. The coding network module does not need to be trained, so that when the training of each branch network module is finished, the training of the classification sub-model is finished, and the training of the image classification network model is finished at the moment.
It should be noted that, when each branch network module in the classification submodel is trained, all network parameters in the target feature extraction submodel are locked, so as to ensure that the network parameters in the target feature extraction submodel are not updated in the process of training each branch network module.
According to the technical scheme of the embodiment, the classification submodel comprising the coding network module and the independent branch network module corresponding to each category is arranged, so that each branch network module can process the input feature graph in parallel, performance loss caused by increase of the number of branches is reduced, and the performance of the image classification network model is further improved.
On the basis of the technical scheme, the method further comprises the following steps: and if the new category is added, adding a new branch network module parallel to the existing branch network module in the classification sub-model after the training is finished, and performing incremental training on the new branch network module based on the new category sample image corresponding to the new category.
Specifically, the classification submodel in this embodiment may dynamically add a branch network module. When a new category needs to be identified by using the existing image classification network model, a new branch network module parallel to the existing branch network module can be added in the classification sub-model after training is finished, and the network structure of the added new branch network module can be the same as that of the existing branch network module. The new branch network module is subjected to incremental training based on the new category sample image corresponding to the new category, so that an image classification network model capable of identifying the new category can be obtained, and incremental learning is completed.
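A hedged sketch of this incremental step, reusing the BranchModule/ClassificationSubmodel sketches above (here assumed to sit at model.classifier); binary cross-entropy stands in for the binary focal loss introduced below:

import torch
import torch.nn as nn

def add_and_train_new_branch(model, new_class_loader, epochs=5, lr=1e-3):
    # New branch with the same structure as the existing branch modules.
    new_branch = BranchModule(in_channels=512)
    model.classifier.branches.append(new_branch)

    # Lock the target feature extraction submodel and all old branches, so
    # old categories are not forgotten during incremental training.
    for p in model.feature_extractor.parameters():
        p.requires_grad = False
    for old_branch in model.classifier.branches[:-1]:
        for p in old_branch.parameters():
            p.requires_grad = False

    criterion = nn.BCELoss()  # the patent uses a binary focal loss instead
    optimizer = torch.optim.Adam(new_branch.parameters(), lr=lr)
    for _ in range(epochs):
        for images, is_new_class in new_class_loader:  # labels in {0, 1}
            feats = model.feature_extractor(images)
            confidence = new_branch(feats).squeeze(1)
            loss = criterion(confidence, is_new_class.float())
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()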
It should be noted that when the image classification network model learns a new category, the added new branch network module may be subjected to incremental training only, and the image classification network model does not need to be retrained from the beginning, so that the model training time is shortened, and the model iteration cycle is accelerated. The image classification network model in this embodiment may have a persistent learning capability, i.e., the image classification network model may learn a new category while maintaining the recognition capability of an old category.
Illustratively, the incrementally training the new branch network module based on the new class sample image corresponding to the new class may include:
inputting the sample images in the sample set corresponding to the new category into the image classification network model after the new branch network module is added, wherein the positive sample images in the sample set are the sample images of the new category corresponding to the new category, and the negative sample images are the sample images of other categories corresponding to the other categories; determining the confidence coefficient that the sample object in the sample image belongs to the new category according to the output of a new branch network module in the image classification network model; and determining a training error according to the confidence, reversely transmitting the training error to the new branch network module, and adjusting the network parameters in the new branch network module until the training is finished when a preset convergence condition is reached.
Specifically, the new category sample image corresponding to the new category may be used as a positive sample image, the other category sample image corresponding to the other category learned by the model may be used as a negative sample image, and the negative sample image in the unlabeled sample image may also be obtained to perform incremental training on the new branch network module. To balance the effects of the difference in the number of positive and negative sample images, the training error may be determined using a Binary Focal function as a loss function. For example, the Binary Focal function can be expressed as follows:
L = -α_t · (1 - P_t)^γ · log(P_t)
α_t = α when y = 1 (positive sample image); α_t = 1 - α when y = 0 (negative sample image)
where P_t is the confidence, output by the new branch network module, that the sample object in the sample image belongs to the new category, and y indicates the type of the input sample image: when the input sample image is a positive sample image, i.e. y equals 1, the coefficient α_t in the focal loss is α; when the input sample image is a negative sample image, i.e. y equals 0, α_t is 1 - α. When the number of negative sample images is small, α may be set to 0.5 and γ to 1; when the number of negative sample images is large, α may be set to 0.25 and γ to 2. In this way the influence of the imbalance between positive and negative sample images on the performance of the image classification network model can be balanced.
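A minimal implementation of this loss under those settings; the piecewise handling of P_t for negative samples follows the standard binary focal-loss formulation, which is an assumption beyond the text:

import torch

def binary_focal_loss(p: torch.Tensor, y: torch.Tensor,
                      alpha: float = 0.25, gamma: float = 2.0) -> torch.Tensor:
    # p: branch confidence in [0, 1]; y: 1 for positive, 0 for negative samples.
    p_t = torch.where(y == 1, p, 1.0 - p)
    alpha_t = torch.where(y == 1,
                          torch.full_like(p, alpha),
                          torch.full_like(p, 1.0 - alpha))
    loss = -alpha_t * (1.0 - p_t) ** gamma * torch.log(p_t.clamp(min=1e-8))
    return loss.mean()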
In this embodiment, the training error is reversely propagated to the new branch network module, and each network parameter in the new branch network module is adjusted until the training is finished when the preset convergence condition is reached, for example, when the training error is smaller than the preset error or the error variation range tends to be stable, or the iterative training number of times is equal to the preset number of times, it is determined that the incremental training of the new branch network module is finished.
It should be noted that, when a new branch network module in the classification submodel is trained, all network parameters in the target feature extraction submodel and all network parameters in other branch network modules are locked to ensure that the network parameters in the target feature extraction submodel and other branch network modules are not updated in the process of training the new branch network module, so that the situation of catastrophic forgetting in incremental learning can be avoided, the image classification network model can learn a new category, and the recognition capability of an old category is maintained.
On the basis of the above technical solutions, the method further includes: and if the new sample image corresponding to the existing type is added, the new sample image and the existing sample image corresponding to the existing type are used as positive sample images, the other type sample images corresponding to the other types are used as negative sample images, and the branch network module corresponding to the existing type is subjected to incremental training.
Specifically, when a new sample image corresponding to an existing category is acquired, the new sample image is required to be used for performing incremental training on the branch network module corresponding to the existing category, so that the branch network module can have the identification capability of the new sample image. For example, for the category of the vehicle, it is assumed that the branch network module corresponding to the category is previously trained only by using the car image as the sample image, and if new sample images, such as a truck image or a bicycle image, which belong to the category of the vehicle but have not been learned before are currently obtained, the branch network module corresponding to the category needs to be incrementally trained by using the new sample images, so that the trained branch network module can classify the truck image or the bicycle image into the category of the vehicle. The image classification network model in the present embodiment may have a capability of persistent learning.
Illustratively, the incremental training process of the branch network module corresponding to the existing category includes: and taking the new sample image and the representative sample image corresponding to the learned existing class as a positive sample image, taking the sample image of the other class corresponding to the learned other class of the model as a negative sample image, and acquiring the negative sample image in the unlabeled sample image. Selecting a sample image from the sample image sets, inputting the sample image into an image classification network model, and determining the confidence degree of a sample object in the sample image belonging to an existing class according to the output of a branch network module corresponding to the existing class in the image classification network model; and determining a training error according to the confidence, reversely transmitting the training error to the branch network module corresponding to the existing category, and adjusting the network parameters in the branch network module corresponding to the existing category until the training is finished when a preset convergence condition is reached. In order to balance the influence of the number difference of the positive and negative sample images, the Binary Focal function can be used as a loss function to determine a training error, and when the training error is smaller than a preset error or the error variation range tends to be stable, or the iterative training times are equal to the preset times, the incremental training of the branch network module corresponding to the existing category is determined to be completed.
It should be noted that, when the branch network module corresponding to the current existing category in the classification submodel is trained, all network parameters in the target feature extraction submodel and all network parameters in the branch network modules corresponding to other existing categories are locked, so as to ensure that the network parameters in the target feature extraction submodel and the branch network modules corresponding to other existing categories are not updated in the process of training the branch network module corresponding to the current existing category, thereby avoiding the catastrophic forgetting in incremental learning, and through the incremental learning, the diversity of the categories can be increased, and the identification accuracy is ensured.
The following is an embodiment of the training apparatus for an image classification network model according to an embodiment of the present invention, which belongs to the same inventive concept as the training method of the above embodiments; for details not described in this apparatus embodiment, reference may be made to the method embodiments above.
EXAMPLE III
Fig. 6 is a schematic structural diagram of a training apparatus for an image classification network model according to a third embodiment of the present invention, which is applicable to a case of training an image classification network model, and especially applicable to a case of training an image classification network model for identifying an object in a scene in an automatic driving scene. The image classification network model in this embodiment may include: the feature extraction submodel and the classification submodel corresponding to at least one category. Wherein the feature extraction submodel may be configured to: and performing feature extraction on the input image, acquiring a feature map corresponding to the input image, and inputting the feature map into the classification submodel. The classification submodel may be used to: and classifying the image based on the input feature map, determining the category to which the object in the input image belongs, and outputting the category. The class to which the object in the image belongs can be identified by using an image classification network model.
As shown in fig. 6, the training apparatus of the image classification network model in this embodiment specifically includes: a feature extraction submodel training module 610 and a classification submodel training module 620.
The feature extraction submodel training module 610 is configured to train a feature extraction submodel based on a label-free sample image in a sample image comparison manner, where the label-free sample image includes a first sample image pair composed of a positive sample image and a negative sample image; under the condition that the image similarity between the positive sample image and the negative sample image in the first sample image pair meets a preset convergence condition, finishing training of the feature extraction sub-model to obtain a target feature extraction sub-model; the classification submodel training module 620 is used for inputting the labeled sample image into a target feature extraction submodel to obtain a target feature map, and inputting the target feature map and the label category corresponding to the labeled sample image into the classification submodel to obtain the prediction category of the labeled sample image; and finishing the training of the classification submodel under the condition that the loss functions determined by the label category and the prediction category meet the preset convergence condition, and finishing the training of the image classification network model.
In the technical scheme of this embodiment, the feature extraction submodel in the image classification network model is trained on unlabeled sample images by way of sample-image contrast, i.e. on first sample image pairs each consisting of a positive sample image and a negative sample image; when the image similarity between the positive and negative sample images in a first sample image pair meets a preset convergence condition, training of the feature extraction submodel ends and the target feature extraction submodel is obtained, so that the feature extraction submodel is trained in a self-supervised manner and learns from a large amount of unlabeled sample data. After that, a labeled sample image is input into the target feature extraction submodel to obtain a target feature map, and the target feature map and the label category corresponding to the labeled sample image are input into the classification submodel to obtain a prediction category for the labeled sample image; when the loss function determined from the label category and the prediction category meets a preset convergence condition, training of the classification submodel, and with it the image classification network model, is finished. The classification submodel can therefore be trained on only a small number of labeled sample images and can accurately classify the feature information extracted by the trained target feature extraction submodel, ensuring high performance of the image classification network model while only a small amount of sample data needs labeling, which greatly reduces labeling cost.
Optionally, the unlabeled sample images further include: a second sample image pair composed of the positive sample image and the enhanced sample image corresponding to the positive sample image.
Optionally, the feature extraction submodel training module 610 includes:
the feature map determining submodule is used for respectively inputting the positive sample image, the enhanced sample image corresponding to the positive sample image and the negative sample image into the feature extraction submodel, and determining the positive feature map corresponding to the positive sample image, the enhanced feature map corresponding to the enhanced sample image and the negative feature map corresponding to the negative sample image according to the output of the feature extraction submodel;
the first similarity determining submodule is used for determining the image similarity between the positive sample image and the negative sample image in the first sample image pair according to the positive feature map and the negative feature map;
the second similarity determining submodule is used for determining the image similarity between the positive sample image and the enhanced sample image in the second sample image pair according to the positive feature map and the enhanced feature map;
and the error propagation submodule is used for determining a total training error according to the image similarity between the positive sample image and the negative sample image and the image similarity between the positive sample image and the enhanced sample image, back-propagating the total training error to the feature extraction submodel, and adjusting the weights in the feature extraction submodel until a preset convergence condition is reached and the training ends, obtaining the target feature extraction submodel.
Optionally, the first similarity determining submodule is specifically configured to:
flattening the positive feature map and the negative feature map, and determining a positive feature vector corresponding to the positive feature map and a negative feature vector corresponding to the negative feature map; and determining the image similarity between the positive sample image and the negative sample image in the first sample image pair according to the positive feature vector and the negative feature vector.
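The disclosure does not name the similarity metric itself; as one hedged possibility, cosine similarity over the flattened feature vectors could be used (PyTorch assumed, identifiers hypothetical):

```python
import torch
import torch.nn.functional as F

def image_similarity(feature_map_a: torch.Tensor,
                     feature_map_b: torch.Tensor) -> torch.Tensor:
    # flatten each (N, C, H, W) feature map into an (N, C*H*W) feature vector
    vec_a = feature_map_a.flatten(start_dim=1)
    vec_b = feature_map_b.flatten(start_dim=1)
    return F.cosine_similarity(vec_a, vec_b, dim=1)  # one similarity per pair
```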
Optionally, the error propagation submodule is specifically configured to:
determining, based on a ternary loss function (a triplet loss), a total training error according to the image similarity between the positive sample image and the negative sample image and the image similarity between the positive sample image and the enhanced sample image.
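A ternary (triplet) loss built directly from the two image similarities might look as follows; the margin value and the hinge form are assumptions, not prescribed by this disclosure:

```python
import torch

def total_training_error(sim_pos_enhanced: torch.Tensor,
                         sim_pos_negative: torch.Tensor,
                         margin: float = 0.5) -> torch.Tensor:
    # hinge loss: zero once the enhanced pair is more similar than the
    # negative pair by at least `margin`
    return torch.clamp(sim_pos_negative - sim_pos_enhanced + margin,
                       min=0).mean()
```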
Optionally, the positive sample image is an object image associated with an autonomous driving scene; the enhanced sample image corresponding to the positive sample image is a sample image obtained by randomly transforming the positive sample image; and the negative sample image is an object image that is completely unrelated to the autonomous driving scene.
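As an illustrative example of such random transforms (torchvision assumed; the particular transforms and parameters are hypothetical choices, not specified here):

```python
from torchvision import transforms

# one randomly transformed view per positive sample image
random_enhance = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.RandomRotation(degrees=15),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),
])
# enhanced_image = random_enhance(positive_image)
```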
Optionally, the classification submodel includes: a coding network module and an independent branch network module corresponding to each category.
Each branch network module is configured to: perform prediction processing on the feature map output by the feature extraction submodel, determine the confidence that the object in the input image belongs to the current category, and input the confidence into the coding network module.
The coding network module is configured to: splice (concatenate) the confidences output by the branch network modules, determine a target category according to the spliced result, and output the target category as the final classification result.
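One hedged reading of this branch-plus-coding structure is an independent binary head per category whose confidences are spliced before the target category is selected (PyTorch assumed, all names hypothetical):

```python
import torch
import torch.nn as nn

class BranchClassifier(nn.Module):
    """One independent binary branch per category plus a coding step."""
    def __init__(self, feat_dim: int, num_classes: int):
        super().__init__()
        # one branch network module (binary head) per category
        self.branches = nn.ModuleList(
            nn.Sequential(nn.Linear(feat_dim, 1), nn.Sigmoid())
            for _ in range(num_classes))

    def forward(self, feature_vector: torch.Tensor) -> torch.Tensor:
        # "coding": splice the per-branch confidences into one vector
        return torch.cat([b(feature_vector) for b in self.branches], dim=1)

# usage: the target category is the index of the highest spliced confidence
# conf = model(feature_vector); target = conf.argmax(dim=1)
```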
Optionally, the apparatus further comprises:
and the first incremental training module is used for, when a new category is added, adding a new branch network module in parallel with the existing branch network modules in the trained classification submodel, and performing incremental training on the new branch network module based on new-category sample images corresponding to the new category.
Optionally, the first incremental training module is specifically configured to:
inputting the sample images in the sample set corresponding to the new category into the image classification network model to which the new branch network module has been added, where the positive sample images in the sample set are new-category sample images corresponding to the new category and the negative sample images are sample images corresponding to the other categories; determining, according to the output of the new branch network module in the image classification network model, the confidence that the sample object in a sample image belongs to the new category; and determining a training error according to the confidence, back-propagating the training error to the new branch network module, and adjusting the network parameters in the new branch network module until a preset convergence condition is reached and the training ends.
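A sketch of this incremental step, under the assumption of the BranchClassifier structure above: existing parameters are frozen, a new head is appended, and only the new head is trained with new-category positives against other-category negatives. A fixed step budget stands in for the preset convergence condition:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def add_branch_and_train(model, extractor, new_class_loader, feat_dim,
                         lr=1e-3, max_steps=1000):
    # freeze the extractor and every existing branch; only the new branch learns
    for p in list(model.parameters()) + list(extractor.parameters()):
        p.requires_grad_(False)
    model.branches.append(nn.Sequential(nn.Linear(feat_dim, 1), nn.Sigmoid()))
    opt = torch.optim.Adam(model.branches[-1].parameters(), lr=lr)

    for step, (image, is_new_class) in enumerate(new_class_loader):
        if step >= max_steps:  # stands in for the convergence condition
            break
        feat = extractor(image).flatten(1)
        conf = model.branches[-1](feat).squeeze(1)
        # positives: new-category samples; negatives: other-category samples
        loss = F.binary_cross_entropy(conf, is_new_class.float())
        opt.zero_grad(); loss.backward(); opt.step()
```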
Optionally, the apparatus further comprises:
and the second incremental training module is used for, when new sample images corresponding to an existing category are added, taking the new sample images and the existing sample images corresponding to the existing category as positive sample images and sample images corresponding to other categories as negative sample images, and performing incremental training on the branch network module corresponding to the existing category.
Optionally, the apparatus further comprises:
the target image acquisition module is used for acquiring a target image to be processed after the training of the image classification network model is finished;
and the target category obtaining module is used for inputting the target image into the trained image classification network model and obtaining, according to the output of the image classification network model, the target category to which the target object in the target image belongs.
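Inference with the trained model could then look like the following sketch (assuming the extractor/classifier split and the BranchClassifier sketch above; identifiers hypothetical):

```python
import torch

@torch.no_grad()
def predict(extractor, classifier, target_image: torch.Tensor) -> int:
    feature_map = extractor(target_image.unsqueeze(0))  # add batch dimension
    conf = classifier(feature_map.flatten(1))           # spliced confidences
    return int(conf.argmax(dim=1))                      # target category index
```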
The training apparatus for an image classification network model provided by the embodiment of the present invention can execute the training method for an image classification network model provided by any embodiment of the present invention, and has the functional modules and beneficial effects corresponding to the executed training method.
It should be noted that, in the embodiment of the training apparatus for an image classification network model, the units and modules included are only divided according to functional logic but are not limited to this division, as long as the corresponding functions can be realized; in addition, the specific names of the functional units are only for convenience of distinguishing them from each other and are not intended to limit the protection scope of the present invention.
Example four
Fig. 7 is a schematic structural diagram of an electronic device according to a fourth embodiment of the present invention. Fig. 7 shows a block diagram of an exemplary electronic device 12 suitable for implementing embodiments of the present invention. The electronic device 12 shown in Fig. 7 is only an example and should not limit the functions or scope of use of the embodiments of the present invention.
As shown in FIG. 7, electronic device 12 is embodied in the form of a general purpose computing device. The components of electronic device 12 may include, but are not limited to: one or more processors or processing units 16, a system memory 28, and a bus 18 that couples various system components including the system memory 28 and the processing unit 16.
Bus 18 represents one or more of any of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures. By way of example, such architectures include, but are not limited to, an Industry Standard Architecture (ISA) bus, a Micro Channel Architecture (MCA) bus, an Enhanced ISA (EISA) bus, a Video Electronics Standards Association (VESA) local bus, and a Peripheral Component Interconnect (PCI) bus.
Electronic device 12 typically includes a variety of computer system readable media. Such media may be any available media that is accessible by electronic device 12 and includes both volatile and nonvolatile media, removable and non-removable media.
The system memory 28 may include computer system readable media in the form of volatile memory, such as Random Access Memory (RAM) 30 and/or cache memory 32. The electronic device 12 may further include other removable/non-removable, volatile/nonvolatile computer system storage media. By way of example only, storage system 34 may be used to read from and write to non-removable, nonvolatile magnetic media (not shown in Fig. 7, commonly referred to as a "hard drive"). Although not shown in Fig. 7, a magnetic disk drive for reading from and writing to a removable, nonvolatile magnetic disk (e.g., a "floppy disk") and an optical disk drive for reading from or writing to a removable, nonvolatile optical disk (e.g., a CD-ROM, DVD-ROM, or other optical media) may be provided. In these cases, each drive may be connected to bus 18 by one or more data media interfaces. System memory 28 may include at least one program product having a set (e.g., at least one) of program modules that are configured to carry out the functions of embodiments of the invention.
A program/utility 40 having a set (at least one) of program modules 42 may be stored, for example, in system memory 28, such program modules 42 including, but not limited to, an operating system, one or more application programs, other program modules, and program data, each of which examples or some combination thereof may comprise an implementation of a network environment. Program modules 42 generally carry out the functions and/or methodologies of the described embodiments of the invention.
Electronic device 12 may also communicate with one or more external devices 14 (e.g., keyboard, pointing device, display 24, etc.), with one or more devices that enable a user to interact with electronic device 12, and/or with any devices (e.g., network card, modem, etc.) that enable electronic device 12 to communicate with one or more other computing devices. Such communication may be through an input/output (I/O) interface 22. Also, the electronic device 12 may communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network, such as the Internet) via the network adapter 20. As shown, the network adapter 20 communicates with other modules of the electronic device 12 via the bus 18. It should be understood that although not shown in the figures, other hardware and/or software modules may be used in conjunction with electronic device 12, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, among others.
The processing unit 16 executes various functional applications and data processing by running programs stored in the system memory 28, for example implementing the steps of the training method for an image classification network model provided by the embodiments of the present invention, the method including:
training a feature extraction submodel based on unlabeled sample images in a sample-image contrast manner, wherein the unlabeled sample images include a first sample image pair composed of a positive sample image and a negative sample image;
under the condition that the image similarity between the positive sample image and the negative sample image in the first sample image pair meets a preset convergence condition, finishing training of the feature extraction sub-model to obtain a target feature extraction sub-model;
inputting the labeled sample image into a target feature extraction submodel to obtain a target feature map, and inputting the target feature map and the label category corresponding to the labeled sample image into a classification submodel to obtain a prediction category of the labeled sample image;
and ending the training of the classification submodel when the loss function determined by the label category and the prediction category meets a preset convergence condition, thereby completing the training of the image classification network model.
Of course, those skilled in the art can understand that the processor can also implement the technical solution of the training method of the image classification network model provided by any embodiment of the present invention.
Example five
The present embodiment provides a computer-readable storage medium on which a computer program is stored, the computer program, when executed by a processor, implementing the steps of the training method for an image classification network model provided by any embodiment of the present invention.
Computer storage media for embodiments of the invention may employ any combination of one or more computer-readable media. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. The computer-readable storage medium may be, for example but not limited to: an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination thereof. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
A computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
It will be understood by those skilled in the art that the modules or steps of the invention described above may be implemented by a general purpose computing device; they may be centralized on a single computing device or distributed across a network of computing devices, and optionally they may be implemented by program code executable by a computing device, so that they may be stored in a storage device and executed by a computing device, or they may be separately fabricated into individual integrated circuit modules, or multiple modules or steps thereof may be fabricated into a single integrated circuit module. Thus, the present invention is not limited to any specific combination of hardware and software.
It is to be noted that the foregoing is only illustrative of the preferred embodiments of the present invention and the technical principles employed. It will be understood by those skilled in the art that the present invention is not limited to the particular embodiments described herein, but is capable of various obvious changes, rearrangements and substitutions as will now become apparent to those skilled in the art without departing from the scope of the invention. Therefore, although the present invention has been described in greater detail by the above embodiments, the present invention is not limited to the above embodiments, and may include other equivalent embodiments without departing from the spirit of the present invention, and the scope of the present invention is determined by the scope of the appended claims.

Claims (14)

1. A training method for an image classification network model, characterized in that the image classification network model comprises: a feature extraction submodel and a classification submodel corresponding to at least one category; and the training method comprises:
training the feature extraction submodel based on unlabeled sample images in a sample-image contrast manner, wherein the unlabeled sample images comprise a first sample image pair composed of a positive sample image and a negative sample image;
under the condition that the image similarity between the positive sample image and the negative sample image in the first sample image pair meets a preset convergence condition, finishing training of the feature extraction submodel to obtain a target feature extraction submodel;
inputting the labeled sample image into the target feature extraction submodel to obtain a target feature map, and inputting the target feature map and the label category corresponding to the labeled sample image into the classification submodel to obtain the prediction category of the labeled sample image;
and finishing the training of the classification submodel under the condition that the loss function determined by the label category and the prediction category meets a preset convergence condition, completing the training of the image classification network model.
2. The method of claim 1, wherein the unlabeled sample images further comprise: a second sample image pair composed of the positive sample image and the enhanced sample image corresponding to the positive sample image.
3. The method of claim 2, wherein training the feature extraction submodel based on the unlabeled sample images, and finishing the training to obtain the target feature extraction submodel under the condition that the image similarity between the positive sample image and the negative sample image in the first sample image pair meets a preset convergence condition, comprises:
respectively inputting the positive sample image, the enhanced sample image corresponding to the positive sample image and the negative sample image into the feature extraction submodel, and determining a positive feature map corresponding to the positive sample image, an enhanced feature map corresponding to the enhanced sample image and a negative feature map corresponding to the negative sample image according to the output of the feature extraction submodel;
determining image similarity between the positive sample image and the negative sample image in the first sample image pair according to the positive feature map and the negative feature map;
determining image similarity between the positive sample image and the enhanced sample image in the second sample image pair according to the positive feature map and the enhanced feature map;
determining a total training error according to the image similarity between the positive sample image and the negative sample image and the image similarity between the positive sample image and the enhanced sample image, back-propagating the total training error to the feature extraction submodel, and adjusting the weights in the feature extraction submodel until a preset convergence condition is reached and the training is finished, obtaining the target feature extraction submodel.
4. The method of claim 3, wherein determining image similarity between the positive sample image and the negative sample image in the first sample image pair from the positive feature map and the negative feature map comprises:
flattening the positive feature map and the negative feature map, and determining a positive feature vector corresponding to the positive feature map and a negative feature vector corresponding to the negative feature map;
determining an image similarity between the positive sample image and the negative sample image in the first sample image pair from the positive feature vector and the negative feature vector.
5. The method of claim 3, wherein determining a total training error based on the image similarity between the positive sample image and the negative sample image and the image similarity between the positive sample image and the enhanced sample image comprises:
determining, based on a ternary loss function, a total training error according to the image similarity between the positive sample image and the negative sample image and the image similarity between the positive sample image and the enhanced sample image.
6. The method of claim 2, wherein the positive sample image is an object image associated with an autonomous driving scene; the enhanced sample image corresponding to the positive sample image is a sample image obtained by randomly transforming the positive sample image; and the negative sample image is an object image that is completely unrelated to the autonomous driving scene.
7. The method of claim 1, wherein the classification submodel comprises: a coding network module and an independent branch network module corresponding to each category;
wherein each of the branch network modules is configured to: perform prediction processing on the feature map output by the feature extraction submodel, determine the confidence that the object in the input image belongs to the current category, and input the confidence into the coding network module;
and the coding network module is configured to: splice the confidences output by the branch network modules, determine a target category according to the spliced result, and output the target category as the final classification result.
8. The method of claim 7, further comprising:
and if a new category is added, adding a new branch network module in parallel with the existing branch network modules in the trained classification submodel, and performing incremental training on the new branch network module based on new-category sample images corresponding to the new category.
9. The method of claim 8, wherein the incrementally training the new branch network module based on the new class sample image corresponding to the new class comprises:
inputting the sample images in the sample set corresponding to the new category into the image classification network model to which the new branch network module has been added, wherein the positive sample images in the sample set are new-category sample images corresponding to the new category, and the negative sample images are sample images corresponding to the other categories;
determining the confidence degree that the sample object in the sample image belongs to the new category according to the output of the new branch network module in the image classification network model;
and determining a training error according to the confidence, back-propagating the training error to the new branch network module, and adjusting the network parameters in the new branch network module until a preset convergence condition is reached and the training is finished.
10. The method of claim 7, further comprising:
and if new sample images corresponding to an existing category are added, taking the new sample images and the existing sample images corresponding to the existing category as positive sample images and sample images corresponding to other categories as negative sample images, and performing incremental training on the branch network module corresponding to the existing category.
11. The method of any of claims 1-10, after completing the training of the image classification network model, further comprising:
acquiring a target image to be processed;
and inputting the target image into the trained image classification network model, and obtaining, according to the output of the image classification network model, the target category to which the target object in the target image belongs.
12. A training apparatus for an image classification network model, characterized in that the image classification network model comprises: a feature extraction submodel and a classification submodel corresponding to at least one category; and the training apparatus comprises:
the feature extraction submodel training module, configured to train the feature extraction submodel based on unlabeled sample images in a sample-image contrast manner, wherein the unlabeled sample images comprise a first sample image pair composed of a positive sample image and a negative sample image, and to finish the training of the feature extraction submodel to obtain a target feature extraction submodel under the condition that the image similarity between the positive sample image and the negative sample image in the first sample image pair meets a preset convergence condition;
and the classification submodel training module, configured to input a labeled sample image into the target feature extraction submodel to obtain a target feature map, input the target feature map and the label category corresponding to the labeled sample image into the classification submodel to obtain a prediction category of the labeled sample image, and finish the training of the classification submodel under the condition that the loss function determined by the label category and the prediction category meets a preset convergence condition, completing the training of the image classification network model.
13. An electronic device, characterized in that the electronic device comprises:
one or more processors;
a memory for storing one or more programs;
when the one or more programs are executed by the one or more processors, the one or more processors are caused to implement the training method for an image classification network model according to any one of claims 1 to 11.
14. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out a method of training an image classification network model as claimed in any one of claims 1 to 11.