CN114821190A - Image classification model training method, image classification method, device and equipment - Google Patents


Info

Publication number
CN114821190A
CN114821190A (application CN202210551236.9A)
Authority
CN
China
Prior art keywords
classification model
image classification
sample
image
images
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210551236.9A
Other languages
Chinese (zh)
Inventor
夏春龙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Apollo Zhilian Beijing Technology Co Ltd
Original Assignee
Apollo Zhilian Beijing Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Apollo Zhilian Beijing Technology Co Ltd filed Critical Apollo Zhilian Beijing Technology Co Ltd
Priority to CN202210551236.9A priority Critical patent/CN114821190A/en
Publication of CN114821190A publication Critical patent/CN114821190A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2413 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
    • G06F18/24133 Distances to prototypes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Abstract

The disclosure provides an image classification model training method, an image classification method, a device and equipment, and relates to the field of artificial intelligence, in particular to the technical fields of automatic driving, intelligent transportation, computer vision and the like. The specific implementation scheme is as follows: inputting a plurality of sample images into an image classification model, and acquiring a feature vector of each sample image and a prediction category of each sample image output by the image classification model; calculating the feature similarity between every two images in the plurality of sample images; calculating a first loss function value based on the labeling category of each sample image and the feature similarity between every two images; calculating a second loss function value based on the prediction category of each sample image and the annotation category of each sample image; and training the image classification model according to the first loss function value and the second loss function value to obtain the trained image classification model. Fine-grained classification accuracy can be improved without increasing complexity or time consumption.

Description

Image classification model training method, image classification method, device and equipment
Technical Field
The present disclosure relates to the field of artificial intelligence technology, and more particularly to the field of automated driving, intelligent transportation, computer vision, and the like.
Background
Image classification is one of the basic tasks of computer vision, and images can be classified by an image classification model based on deep learning at present.
Disclosure of Invention
The disclosure provides an image classification model training method, an image classification method, a device and equipment.
According to a first aspect of the present disclosure, there is provided an image classification model training method, including:
acquiring a plurality of sample images and the labeling category of each sample image;
inputting the plurality of sample images into an image classification model, obtaining a feature vector of each sample image output by a full connection layer of the image classification model, and obtaining a prediction category of each sample image output by a classifier of the image classification model;
calculating the feature similarity between every two images in the plurality of sample images based on the feature vector of each sample image;
calculating a first loss function value based on the labeling category of each sample image and the feature similarity between every two images;
calculating a second loss function value based on the prediction category of each sample image and the annotation category of each sample image;
and training the image classification model according to the first loss function value and the second loss function value to obtain a trained image classification model.
According to a second aspect of the present disclosure, there is provided an image classification method, comprising:
acquiring a plurality of images to be classified;
inputting the plurality of images to be classified into an image classification model to obtain a prediction category of each image to be classified, wherein the image classification model is a trained image classification model obtained by training according to any one of the methods of the first aspect.
According to a third aspect of the present disclosure, there is provided an image classification model training apparatus, including:
the acquisition module is used for acquiring a plurality of sample images and the labeling category of each sample image;
the input module is used for inputting the plurality of sample images into an image classification model, acquiring a feature vector of each sample image output by a full connection layer of the image classification model, and acquiring a prediction category of each sample image output by a classifier of the image classification model;
the calculation module is used for calculating the feature similarity between every two images in the plurality of sample images based on the feature vector of each sample image;
the calculation module is further used for calculating a first loss function value based on the labeling category of each sample image and the feature similarity between every two images;
the calculation module is further configured to calculate a second loss function value based on the prediction category of each sample image and the annotation category of each sample image;
and the training module is used for training the image classification model according to the first loss function value and the second loss function value to obtain a trained image classification model.
According to a fourth aspect of the present disclosure, there is provided an image classification apparatus including:
the acquisition module is used for acquiring a plurality of images to be classified;
and the classification module is used for inputting the plurality of images to be classified into an image classification model to obtain the prediction category of each image to be classified, wherein the image classification model is a trained image classification model obtained by the apparatus of the third aspect.
According to a fifth aspect of the present disclosure, there is provided an electronic device comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of the first or second aspects.
According to a sixth aspect of the present disclosure, there is provided a non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of any of the first or second aspects.
According to a seventh aspect of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements the method of any of the first or second aspects described above.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
fig. 1 is a schematic flowchart of an image classification model training method provided in an embodiment of the present disclosure;
FIG. 2 is a schematic flow chart diagram of another image classification model training method provided by the embodiment of the present disclosure;
FIG. 3 is a flowchart illustrating a further method for training an image classification model according to an embodiment of the present disclosure;
FIG. 4 is an exemplary flowchart of an image classification model training method provided by an embodiment of the present disclosure;
FIG. 5 is a schematic structural diagram of an image classification model in a training phase according to an embodiment of the present disclosure;
FIG. 6 is a schematic structural diagram of a trained image classification model provided in an embodiment of the present disclosure;
fig. 7 is a flowchart illustrating an image classification method according to an embodiment of the disclosure;
fig. 8 is an exemplary flowchart illustrating an application of the image classification model training method provided by the embodiment of the present disclosure to the fields of automatic driving, intelligent transportation, and the like;
fig. 9 is a schematic structural diagram of an image classification model training apparatus provided in an embodiment of the present disclosure;
fig. 10 is a schematic structural diagram of an image classification apparatus provided in an embodiment of the present disclosure;
fig. 11 is a block diagram of an electronic device provided by an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
In the related art, only coarse-grained classification can be performed on images through an image classification model, and the coarse-grained classification cannot meet the actual business requirements, so that fine-grained classification needs to be performed on the images. For example, in the field of intelligent transportation, an existing coarse-grained image classification model can classify images containing pedestrians and images containing automobiles, but cannot accurately classify more similar images, such as images containing motorcycles and images containing electric vehicles.
At present, in order to implement fine-grained classification of an image, there are two main schemes, one of which is to pre-process the image to be classified by an image detection or image segmentation method, extract a foreground part in the image to be classified, and an image classification model can obtain more accurate features of the image to be classified according to the extracted foreground part of the image, and then perform fine-grained classification on the image to be classified according to the accurate features of the image to be classified. And the other method is to perform pixel-level labeling on the sample image, then select an image classification model capable of performing pixel-level classification on the image, and train the image classification model based on the sample image and the pixel-level labeling, so that the image classification model obtained by training can perform fine-grained classification on the image.
However, the first scheme requires detecting or segmenting the image in advance and extracting the foreground portion, which adds processing steps and thus increases complexity and time consumption. The second scheme requires a model capable of pixel-level classification, whose structure is more complex, and the pixel-level labeling process is also more laborious, which likewise increases classification complexity and time consumption. At present, it is therefore difficult to perform accurate fine-grained classification without increasing classification complexity and time consumption.
In order to solve the above technical problem, an embodiment of the present disclosure provides an image classification model training method, which may be executed by an electronic device, where the electronic device may be a smartphone, a tablet computer, a desktop computer, a server, or the like.
The following describes in detail the image classification model training method provided by the embodiment of the present disclosure.
As shown in fig. 1, an embodiment of the present disclosure provides an image classification model training method, including:
s101, obtaining a plurality of sample images and the labeling category of each sample image.
In the embodiment of the disclosure, a plurality of sample images can be pre-labeled to obtain the labeling type of each sample image, and the labeling type of each sample image is the actual type of the sample image.
The sample image may be, among other things, an image in various application scenarios.
For example, in the field of intelligent driving, the sample image may be an image acquired by a camera of a vehicle, and the labeled category of the sample image is a category of an obstacle in the sample image, for example, the category of the obstacle may include a bus, a truck, a taxi, a bicycle, an electric bicycle, a motorcycle, a pedestrian, and the like.
For another example, in the field of traffic management of public transportation, the sample image may be a road surface image collected by a camera on a road, and the labeling type of the sample image may be a vehicle type in the sample image.
For another example, in the environmental protection field, the sample image may be an image of an animal or plant, and the annotation category of the sample image may be an animal or plant category.
The sample image may be selected according to actual business requirements, which is not specifically limited in this disclosure.
S102, inputting a plurality of sample images into an image classification model, obtaining a feature vector of each sample image output by a full connection layer of the image classification model, and obtaining a prediction category of each sample image output by a classifier of the image classification model.
The image classification model is a basic classification model based on deep learning, and may be, for example, GoogLeNet, Residual Network (ResNet), ResNeXt (a simple, highly modular network structure for image classification), MobileNet, or ShuffleNet.
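The two outputs described in S102, a feature vector from the fully connected layer and a predicted category from the classifier, can be illustrated with a minimal sketch. This is a toy stand-in, not the patent's model: the random matrices `W_fc` and `W_cls` and all shapes are illustrative assumptions in place of a real backbone such as ResNet.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a trained backbone + heads. In the disclosure the model
# would be a deep network; random matrices and small shapes are used here
# purely to illustrate the two outputs of S102.
N, D, C, K = 4, 16, 8, 3          # sample images, backbone dim, feature dim, classes
backbone_out = rng.standard_normal((N, D))

W_fc = rng.standard_normal((D, C))   # "fully connected layer" weights (assumed)
W_cls = rng.standard_normal((C, K))  # "classifier" weights (assumed)

# Output 1: a feature vector per sample image, from the fully connected layer.
feature_vectors = backbone_out @ W_fc          # shape (N, C)

# Output 2: a prediction category per sample image, from the classifier.
logits = feature_vectors @ W_cls               # shape (N, K)
predicted_classes = logits.argmax(axis=1)      # shape (N,)
```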
S103, calculating the feature similarity between every two images in the plurality of sample images based on the feature vector of each sample image.
In the embodiment of the present disclosure, the feature similarity between two images may be obtained by calculating the cosine similarity or the Euclidean distance between the feature vectors of the two sample images. The feature similarity may also be calculated by other feature-vector similarity methods, which is not limited in the embodiment of the present disclosure. In the following, cosine similarity is used as the feature similarity between two images.
The method for calculating the feature similarity between two images of a plurality of sample images will be described in detail below.
And S104, calculating a first loss function value based on the labeling type of each sample image and the feature similarity between every two images.
In the embodiment of the disclosure, for two sample images, if the feature similarity between the two sample images is high, but the labeling categories of the two sample images belong to different categories, it is indicated that the feature vectors extracted by the image classification model for the two sample images are not accurate enough.
And if the similarity between the two sample images is higher, but the labeling categories of the two sample images belong to the same category, the feature vectors extracted by the image classification model for the two sample images are more accurate.
That is, the higher the feature similarity between sample images of the same annotation class is, the lower the feature similarity between sample images of different annotation classes is, the more accurate the feature extraction of the sample images by the image classification model is.
Therefore, when the feature similarity of the sample images of the same labeling type is higher and the feature similarity of the sample images of different labeling types is lower, the first loss function value is smaller.
And S105, calculating a second loss function value based on the prediction category of each sample image and the annotation category of each sample image.
In the embodiment of the present disclosure, an error between the prediction category of each sample image and the annotation category of the sample image may be calculated to obtain the second loss function value.
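The disclosure only states that an error between the predicted category and the annotated category is computed; the sketch below uses cross-entropy over the classifier's scores, a conventional choice for this kind of loss, purely as an assumed example. The function name `second_loss` is illustrative.

```python
import numpy as np

def second_loss(logits, labels):
    """Error between predicted and annotated categories (S105).

    The disclosure does not name the loss; cross-entropy is shown here
    as one conventional choice, not as the patent's exact formula.
    """
    # Log-softmax with a max-shift for numerical stability.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # Mean negative log-likelihood of the annotated category.
    return -log_probs[np.arange(len(labels)), labels].mean()

logits = np.array([[2.0, 0.1, -1.0],
                   [0.2, 1.5,  0.3]])
labels = np.array([0, 1])   # annotated categories
loss = second_loss(logits, labels)
```

The loss shrinks toward zero as the classifier's score for the annotated category dominates the other scores.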
And S106, training the image classification model according to the first loss function value and the second loss function value to obtain the trained image classification model.
In the embodiment of the present disclosure, the image classification model is trained according to the first loss function value and the second loss function value, that is, the first loss function value and the second loss function value obtained by calculation are made smaller by adjusting parameters of the image classification model.
The smaller the first loss function value is, the higher the feature similarity between two sample images of the same annotation class is, and the lower the feature similarity between two sample images of different annotation classes is, which indicates that the feature extraction on the sample images in the image classification model is more accurate, and further, the classification of the sample images by the image classification model is more accurate.
By adopting the embodiment of the disclosure, when the image classification model is trained, the first loss function value and the second loss function value are utilized. The first loss function value is obtained by calculation based on the labeling category of each sample image and the feature similarity between every two images, and the image classification model is trained based on the first loss function value, so that the feature similarity between the images of the same category extracted by the image classification model is high, and the similarity between the images of different categories is low. The second loss function value is calculated based on the prediction category of each sample image and the labeling category of each sample image, and the image classification model is trained based on the second loss function value, so that the image classification model can accurately predict the category of the image. Because the feature similarity among the sample images of the same category extracted by the image classification model is high through the training, the image classification model can more accurately classify the sample images of the same category into one category; and because the similarity among the sample images of different types extracted by the image classification model is low through the training, the image classification model can accurately classify the sample images of different types into different types, and the fine-grained classification of the images can be realized under the condition that the marking type granularity of the sample images is finer. In addition, the image classification process is not added with an additional flow, and the complexity and time consumption are not improved, so that the fine-grained classification of the image can be accurately performed on the premise of not increasing the complexity and time consumption.
In another embodiment of the present disclosure, the feature similarity may be obtained by calculating the cosine similarity between the feature vectors of two sample images. As shown in fig. 2, on the basis of the above embodiment, step S103 may specifically be implemented as:
and S1031, normalizing the feature vector of each sample image to obtain a feature matrix containing the normalized feature vectors of the plurality of sample images.
In the embodiment of the present disclosure, in order to facilitate subsequent feature similarity calculation, normalization processing may be performed on the feature vectors of each sample image.
Specifically, the feature vector of each sample image may be normalized by an L2 norm (L2 norm) normalization method to obtain a feature matrix including the normalized feature vector of each sample image, where each row element in the feature matrix represents the normalized feature vector of one sample image.
For example, if the number of sample images is N, and the normalized feature vector of each sample image is a vector of 1 × C, the feature matrix is a matrix of N × C, i.e., a matrix of N rows and C columns is obtained.
Where the L2 norm normalization refers to dividing each element in the vector by the norm of the vector, i.e., by the length of the vector.
For example, if the feature vector of a sample image is (a, b), the normalized components can be calculated as

x1 = a / sqrt(a^2 + b^2), x2 = b / sqrt(a^2 + b^2),

giving the normalized feature vector (x1, x2) after L2 norm normalization.

It will be appreciated that

x1^2 + x2^2 = (a^2 + b^2) / (a^2 + b^2) = 1,

that is, each normalized feature vector is a unit vector: its modulus is 1.
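The L2 normalization of S1031 can be sketched as follows; the helper `l2_normalize` and the toy 3 x 2 feature matrix are illustrative, not from the disclosure.

```python
import numpy as np

def l2_normalize(features):
    """Divide each feature vector (row) by its L2 norm, i.e. its length."""
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    return features / norms

# N = 3 sample images, each with a 1 x C (C = 2) feature vector,
# stacked into an N x C feature matrix.
features = np.array([[3.0, 4.0],
                     [1.0, 0.0],
                     [0.0, 2.0]])
F = l2_normalize(features)   # every row of F is now a unit vector
# e.g. (3, 4) -> (3/5, 4/5) = (0.6, 0.8)
```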
S1032, carrying out dimension transformation on the feature matrix to obtain a transposed matrix of the feature matrix.
It is understood that the feature matrix of N × C is subjected to dimension transformation, and the resulting transposed matrix is a matrix of C × N, where each column of elements in the transposed matrix of C × N represents a normalized feature vector of one sample image.
In one implementation, the feature matrix may be dimension transformed by a Dimshuffle () function. The Dimshuffle () function is a tool used to change the tensor structure.
And S1033, calculating feature similarity between every two images in the plurality of sample images based on the feature matrix and the transposed matrix.
Taking the feature similarity as the cosine similarity as an example, the feature similarity between two images in the plurality of sample images can be obtained by calculating the cosine similarity between the feature vector formed by one row of elements in the feature matrix and the feature vector formed by each row of elements in the transposed matrix.
The cosine similarity is the cosine of the included angle between two feature vectors. Assuming the included angle between feature vector A and feature vector B is θ, the cosine similarity between them is:

cos θ = (A · B) / (|A| |B|)

In the embodiment of the present disclosure, since the normalized feature vector of each sample image is obtained through the L2 norm normalization process, each normalized feature vector is a unit vector. Therefore, the denominator of the cosine similarity formula is always 1, and the formula can be simplified to cos θ = A · B; that is, the cosine similarity is simply the dot product of the two feature vectors.

That is, the feature similarity between every two images in the plurality of sample images can be obtained by calculating the dot product of the normalized feature vectors of every two sample images.
Each row element in the feature matrix represents a normalized feature vector of one sample image, and each column element in the transposed matrix represents a normalized feature vector of one sample image, so that the feature matrix of N × C and the transposed matrix of C × N can be multiplied to obtain a matrix of N × N, and each element in the matrix of N × N is a feature similarity between two sample images.
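With unit-norm rows, S1032 and S1033 reduce to a single matrix product. A minimal sketch with an illustrative 3 x 2 normalized feature matrix:

```python
import numpy as np

# An illustrative N x C (3 x 2) matrix of L2-normalized feature vectors,
# as produced by S1031 (every row is a unit vector).
F = np.array([[0.6, 0.8],
              [1.0, 0.0],
              [0.6, 0.8]])

# S1032 + S1033: multiply the N x C feature matrix by its C x N transpose.
# Because the rows are unit vectors, entry (i, j) is the cosine similarity
# between sample image i and sample image j.
sim = F @ F.T   # N x N similarity matrix
# Rows 0 and 2 are identical, so sim[0, 2] == 1.0; sim[0, 1] == 0.6.
```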
It should be noted that, in the embodiment of the present disclosure, when the feature similarity is the cosine similarity, its value range is [0, 1]; the larger the cosine similarity, the smaller the included angle between the normalized feature vectors of the two sample images, and the higher the feature similarity.
By adopting the embodiment of the disclosure, the feature vector of each sample image is normalized to obtain a feature matrix containing the normalized feature vectors, and the feature similarity between every two sample images can be obtained by multiplying the feature matrix with its transpose. The loss function value can then be calculated from these feature similarities, so that when the image classification model is trained, the feature similarity it extracts for sample images of the same category becomes higher and that for sample images of different categories becomes lower, improving the accuracy of feature extraction and enabling accurate fine-grained classification of images.
In another embodiment of the present disclosure, as shown in fig. 3, on the basis of the foregoing implementation, the foregoing S104 may specifically be implemented as:
s1041, aiming at every two sample images in the multiple sample images, calculating the same class loss value and different class loss values between the two sample images according to the class values of the two sample images and the feature similarity between the two sample images.
The same-class loss value represents a loss value calculated from the feature similarity between the sample images of two same labeling classes, the different-class loss value represents a loss value calculated from the feature similarity between the sample images of two different labeling classes, and the class value is used for indicating whether the labeling classes of the two sample images are the same class.
Specifically, the same-class loss value is computed as

-mask_{i,j} · log(abs(pred_{i,j}))

and the different-class loss value is computed as

-(1 - mask_{i,j}) · log(1 - abs(pred_{i,j}))

where mask_{i,j} is the category value, taking the value 1 or 0: when the labeling categories of the i-th sample image and the j-th sample image are the same, mask_{i,j} is 1; when they are different, mask_{i,j} is 0. pred_{i,j} denotes the feature similarity between the i-th sample image and the j-th sample image.
S1042, calculating a first loss function value based on the same class loss value and the different class loss values between every two sample images in the plurality of sample images.
The first loss function used to calculate the first loss function value is:

Loss1 = -(1/N^2) · Σ_{i,j} [ mask_{i,j} · log(abs(pred_{i,j})) + (1 - mask_{i,j}) · log(1 - abs(pred_{i,j})) ]

where N is the number of sample images and the sum runs over all pairs (i, j).
As can be seen from this expression, when the annotation categories of the two sample images are the same, the category value mask_{i,j} is 1 and the different-class loss value is 0; in this case, the smaller the same-class loss value, the smaller the first loss function value, i.e., the higher the feature similarity between two sample images of the same labeling category, the smaller the first loss function value.
When the labeling categories of the two sample images are different, the category value mask_{i,j} is 0, so the same-class loss value is 0; in this case, the smaller the different-class loss value, the smaller the first loss function value, i.e., the lower the feature similarity between the sample images of two different annotation categories, the smaller the first loss function value.
Therefore, the image classification model is trained through the first loss function value subsequently, so that the feature similarity of the features extracted from the sample images of the same labeling type by the image classification model is higher, the feature similarity of the features extracted from the sample images of different labeling types is lower, and further, the trained image classification model can perform fine-grained classification on the images to be classified with similar contents.
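The per-pair loss terms described above can be assembled into a sketch of the first loss function. The mask construction follows the description; the averaging over all N^2 pairs and the small epsilon guarding against log(0) are implementation assumptions.

```python
import numpy as np

def first_loss(sim, labels):
    """Sketch of the first loss (S104): pull same-class pairs toward
    similarity 1 and push different-class pairs toward similarity 0.
    The mean over all N^2 pairs and the eps guard are assumptions."""
    labels = np.asarray(labels)
    # mask_ij = 1 if sample images i and j share a labeling category, else 0.
    mask = (labels[:, None] == labels[None, :]).astype(float)
    p = np.abs(sim)                       # abs(pred_ij)
    eps = 1e-12                           # avoid log(0)
    same_class = -mask * np.log(p + eps)
    diff_class = -(1.0 - mask) * np.log(1.0 - p + eps)
    return (same_class + diff_class).mean()

# Similarities for 3 sample images; images 0 and 1 share a labeling category.
sim = np.array([[1.0, 0.9, 0.1],
                [0.9, 1.0, 0.2],
                [0.1, 0.2, 1.0]])
labels = [0, 0, 1]
loss = first_loss(sim, labels)
```

A similarity matrix that is exactly 1 for same-class pairs and 0 for different-class pairs drives the loss to (numerically) zero, matching the analysis above.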
In another embodiment of the present disclosure, in the step S106, the training of the image classification model according to the first loss function value and the second loss function value to obtain the trained image classification model includes the following two implementation manners:
In the first mode, parameters of the image classification model are adjusted according to the first loss function value and the second loss function value until the image classification model converges, to obtain the trained image classification model.
In the second mode, parameters of the image classification model are adjusted according to the first loss function value and the second loss function value until the image classification model converges, and the obtained image classification model is taken as a candidate image classification model; the candidate image classification model is then iteratively trained based on the plurality of sample images to obtain a plurality of candidate image classification models; and one candidate image classification model is selected from the plurality of candidate image classification models as the trained image classification model.
In the embodiment of the disclosure, a plurality of sample images can be used for training the image classification model, and when the image classification model converges, a candidate image classification model is obtained. And then, continuously carrying out iterative training on the converged image classification model to obtain a plurality of candidate image classification models.
Or, an initial image classification model can be trained by using a plurality of groups of sample images, and a candidate image classification model is trained based on each group of sample images.
After obtaining a plurality of candidate image classification models, selecting one candidate image classification model from the plurality of candidate image classification models as a trained image classification model by the following method:
respectively inputting the plurality of test images into each candidate image classification model, and obtaining the classification result of each candidate image classification model on the plurality of test images; determining the classification accuracy of each candidate image classification model based on the labeling classes of the plurality of test images and the classification result of each candidate image classification model on the plurality of test images; and taking the candidate image classification model with the highest classification accuracy as the trained image classification model. Thus, the classification accuracy of the finally obtained image classification model can be further improved.
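The selection procedure above can be sketched as follows; the candidate models and their predict API are hypothetical stand-ins for illustration only:

```python
def select_best_model(candidates, test_images, test_labels):
    """Score each candidate model by its classification accuracy on the
    labeled test images and keep the highest-scoring one."""
    best_name, best_acc = None, -1.0
    for name, predict in candidates.items():
        correct = sum(1 for img, label in zip(test_images, test_labels)
                      if predict(img) == label)
        accuracy = correct / len(test_images)
        if accuracy > best_acc:
            best_name, best_acc = name, accuracy
    return best_name, best_acc

# Hypothetical candidates: one classifies both test images correctly,
# the other does not.
test_images, test_labels = ["img_a", "img_b"], ["bus", "truck"]
candidates = {
    "candidate_1": lambda img: {"img_a": "bus", "img_b": "truck"}[img],
    "candidate_2": lambda img: "bicycle",
}
best_name, best_acc = select_best_model(candidates, test_images, test_labels)
```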
By adopting the embodiment of the disclosure, the image classification model is trained through the first loss function value and the second loss function value, so that the trained image classification model can perform fine-grained classification on the images to be classified with similar contents; by adopting the iterative training method, a plurality of candidate image classification models can be obtained, and then an optimal image classification model can be selected from the candidate image classification models, so that the image classification accuracy can be improved.
Fig. 4 is a schematic diagram of an exemplary flow chart of training an image classification model according to an embodiment of the present disclosure, and is described below with reference to fig. 4.
S401, initializing image classification model parameters in a training phase.
The image classification model in the training stage comprises an associated attribute prediction module.
As shown in fig. 5, fig. 5 is a schematic structural diagram of an image classification model in the training phase. The image classification model in fig. 5 includes an input module (Input), a feature extraction network (Backbone), a pooling layer (Global pool), a fully connected layer (FC) and a classifier (Classifier), and the part in the dashed box is the associated attribute prediction module.
S402, obtaining a plurality of sample images and the labeling category of each sample image.
S403, inputting a plurality of sample images into the image classification model, and performing forward propagation to obtain the prediction category of each sample image and the feature similarity between every two images in the plurality of sample images.
Taking fig. 5 as an example, a plurality of sample images may be input into the input module in fig. 5 and then processed by the Backbone, Global pool and FC in fig. 5; the FC may output a feature vector of each sample image, and the classifier may then output a prediction category of each sample image based on the feature vector of each sample image. In addition, the associated attribute prediction module may calculate the feature similarity between every two images in the plurality of sample images based on the feature vector of each sample image output by the FC.
The L2-norm (L2 norm) module is used for normalizing the feature vectors of the sample images output by the full connection layer to obtain a feature matrix; the transposition module is used for carrying out dimension conversion on the characteristic matrix to obtain a transposition matrix; and calculating the product of the feature matrix and the transposed matrix to obtain the feature similarity pred between every two images in the plurality of sample images.
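The normalize-transpose-multiply computation above can be sketched as follows (NumPy is used for illustration; the function name is not from the patent):

```python
import numpy as np

def pairwise_feature_similarity(features):
    """L2-normalize each row (the L2-norm module), transpose (the
    transposition module), and multiply to obtain the feature
    similarity pred between every two images."""
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    normalized = features / np.clip(norms, 1e-12, None)
    return normalized @ normalized.T  # pred[i, j] = cosine similarity

# Three illustrative feature vectors: the first two are close,
# the third points in a different direction.
features = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
pred = pairwise_feature_similarity(features)
```

Because each row is unit-length after normalization, the diagonal of `pred` is 1 and similar images give off-diagonal values close to 1.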
S404, calculating a first loss function value based on the labeling category of each sample image and the feature similarity between every two sample images.
S405, calculating a second loss function value based on the prediction type and the labeling type of each sample image.
S406, training the image classification model based on the first loss function value and the second loss function value to obtain a plurality of candidate image classification models.
The associated attribute prediction module in the embodiment of the present disclosure is only effective in the training phase; after the model converges, the associated attribute prediction module in fig. 5 may be deleted to obtain a candidate image classification model that does not include the associated attribute prediction module.
S407, sequentially loading a plurality of candidate image classification models, inputting a plurality of test images into the candidate image classification models so that each candidate image classification model classifies the plurality of test images, and obtaining the classification accuracy of each candidate image classification model based on the classification result of each candidate image classification model.
And S408, taking the candidate image classification model with the highest classification accuracy as the trained image classification model.
As shown in fig. 6, fig. 6 is a schematic structural diagram of an exemplary trained image classification model provided in an embodiment of the present disclosure, and the associated attribute prediction module in fig. 5 only participates in the training process of the image classification model, and in the actual application stage of the image classification model, the image classification model does not include the module, and when the image classification model actually classifies an image to be classified, feature similarity calculation is not required, so that the complexity and time consumption of the image classification process are not increased.
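The training-versus-inference structure described above can be sketched as a minimal model class; all names and the feature extractor below are illustrative stand-ins, not the patent's actual network:

```python
class ImageClassifier:
    """Minimal sketch of the structure in figs. 5 and 6: the feature
    similarity branch runs only when `training` is True and is absent
    at inference time, so deployment adds no extra computation.
    The feature extractor here is an illustrative stand-in, not the
    patent's Backbone/Global pool/FC network."""

    def __init__(self):
        self.training = True

    def extract_features(self, images):
        # Stand-in for Backbone -> pooling -> fully connected layer.
        return [list(map(float, img)) for img in images]

    def forward(self, images):
        feats = self.extract_features(images)
        # Classifier: predicted category = index of the largest feature.
        preds = [max(range(len(f)), key=f.__getitem__) for f in feats]
        if self.training:
            # Training-only associated attribute prediction branch.
            sims = [[sum(a * b for a, b in zip(f, g)) for g in feats]
                    for f in feats]
            return preds, sims
        return preds  # inference: no similarity computation
```

Setting `training = False` reproduces the deployed behavior: only predicted categories are produced, with no pairwise similarity work.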
By adopting the embodiment of the disclosure, an associated attribute prediction module is added in the training stage of the image classification model, and a corresponding feature similarity loss function is designed. The loss function makes the inter-class similarity distance of the features of the sample images extracted by the image classification model large and the intra-class similarity distance small; that is, the feature similarity between sample images of the same labeling category is high, and the feature similarity between sample images of different labeling categories is low. Furthermore, the trained image classification model can perform fine-grained classification on the image to be classified; and since the trained image classification model does not include the associated attribute prediction module, the calculation amount of the image classification model is not increased.
Corresponding to the above embodiment, the present disclosure also provides an image classification method, as shown in fig. 7, the method including:
S701, obtaining a plurality of images to be classified.
The image to be classified may be an image to be classified in various scenes, and reference may be specifically made to the description of the sample image in the above embodiments.
S702, inputting a plurality of images to be classified into the image classification model to obtain the prediction category of each image to be classified.
The image classification model is a trained image classification model obtained by training according to the image classification model training method.
By adopting the embodiment of the disclosure, a plurality of images to be classified are input into the image classification model to obtain the prediction category of each image to be classified. Since the image classification model is obtained by training through the image classification model training method, the image classification model can extract features with high feature similarity from images to be classified of the same category and features with low feature similarity from images to be classified of different categories; further, when classifying the images to be classified according to the extracted features, it can distinguish images to be classified that have similar contents but belong to different categories, thereby realizing fine-grained classification.
Taking the application of the method provided by the embodiment of the present disclosure to the fields of automatic driving, intelligent transportation, etc. as an example, the sample image in the above embodiment may be a sample obstacle image, the sample obstacle image may be an image in a road acquired by an automatic driving vehicle or a road camera, and the obstacle category in the sample obstacle image may be a category of an obstacle in the road, such as a bus, a truck, a taxi, a bicycle, an electric bicycle, a motorcycle, a pedestrian, etc. The following describes an image classification model training method provided by the embodiment of the present disclosure with reference to the fields of automatic driving, intelligent transportation, and the like, as shown in fig. 8, the method includes:
S801, obtaining a plurality of sample obstacle images and the labeling category of each sample obstacle image.
Wherein the labeled category of the sample obstacle image is the actual category of the obstacle in the sample obstacle image.
S802, inputting the multiple sample obstacle images into an image classification model, obtaining a feature vector of each sample obstacle image output by a full connection layer of the image classification model, and obtaining a prediction type of each sample obstacle image output by a classifier of the image classification model.
And S803, calculating the feature similarity between every two images in the plurality of sample obstacle images based on the feature vector of each sample obstacle image.
In one implementation, the feature vector of each sample obstacle image may be normalized to obtain a feature matrix including normalized feature vectors of a plurality of sample obstacle images. And carrying out dimension transformation on the characteristic matrix to obtain a transposed matrix of the characteristic matrix. And calculating to obtain the feature similarity between every two images in the plurality of sample obstacle images based on the feature matrix and the transposed matrix.
The method for performing normalization processing on the feature vector of each sample obstacle image, performing dimension transformation on the feature matrix, and calculating the feature similarity based on the feature matrix and the transpose matrix is the same as the method for processing the sample image, and reference may be made to the foregoing description, which is not repeated herein.
S804, calculating a first loss function value based on the labeling type of each sample obstacle image and the feature similarity between every two images.
In one implementation, for every two sample obstacle images in the plurality of sample obstacle images, the same class loss value and the different class loss values between the two sample obstacle images may be calculated according to the class values of the two sample obstacle images and the feature similarity between the two sample obstacle images; the first loss function value is calculated based on the same class loss value and different class loss values between every two sample obstacle images in the plurality of sample obstacle images.
The calculation expressions of the same-class loss values, different-class loss values, and the first loss function values of the two sample obstacle images are the same as the calculation expressions of the same-class loss values, different-class loss values, and the first loss function values of the two sample images, and reference may be made to the foregoing description, which is not repeated here.
For example, if both the sample obstacle images are images including bicycles, the class values of the two sample obstacle images are the same, and the higher the feature similarity of the two sample obstacle images is, the smaller the first loss function value is.
And S805, calculating a second loss function value based on the prediction category of each sample obstacle image and the labeling category of each sample obstacle image.
In the embodiment of the present disclosure, an error between the prediction category of each sample obstacle image and the labeling category of the sample obstacle image may be calculated to obtain the second loss function value.
For example, if the labeled type of the sample obstacle image is a bicycle and the predicted type of the sample obstacle output by the image classification model is a bicycle, the second loss function value is small, whereas if the predicted type of the sample obstacle output by the image classification model is an automobile, the second loss function value is large.
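The patent states only that the second loss is the error between the predicted category and the labeled category; cross-entropy is one standard instantiation of such an error, sketched below:

```python
import math

def second_loss(predicted_probs, label_index):
    """Cross-entropy between the classifier's predicted class
    distribution and the labeled class: small when the model assigns
    high probability to the correct label, large otherwise.
    (One plausible instantiation; the patent does not fix the exact form.)"""
    return -math.log(max(predicted_probs[label_index], 1e-12))

# Labeled class "bicycle" at index 0.
loss_correct = second_loss([0.9, 0.1], 0)  # confident and correct -> small
loss_wrong = second_loss([0.1, 0.9], 0)    # confident but wrong -> large
```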
And S806, training an image classification model according to the first loss function value and the second loss function value to obtain the trained image classification model for classifying the road obstacle.
In the embodiment of the present disclosure, the image classification model is trained according to the first loss function value and the second loss function value, that is, the first loss function value and the second loss function value obtained by calculation are made smaller by adjusting parameters of the image classification model.
The above S806 includes the following two implementation manners:
In the first mode, parameters of the image classification model are adjusted according to the first loss function value and the second loss function value until the image classification model converges to obtain the trained image classification model for classifying road obstacles.
In the second mode, image classification model parameters are adjusted according to the first loss function value and the second loss function value until the image classification model converges, and the obtained image classification model is taken as a candidate image classification model; the candidate image classification model is iteratively trained based on the plurality of sample obstacle images to obtain a plurality of candidate image classification models; and one candidate image classification model is selected from the candidate image classification models as the trained image classification model for classifying road obstacles.
After obtaining a plurality of candidate image classification models, selecting one candidate image classification model from the plurality of candidate image classification models as a trained image classification model by the following method:
respectively inputting a plurality of test images into each candidate image classification model to obtain the classification result of each candidate image classification model on the plurality of test images; determining the classification accuracy of each candidate image classification model based on the labeling classes of the plurality of test images and the classification result of each candidate image classification model on the plurality of test images; and taking the candidate image classification model with the highest classification accuracy as the trained image classification model for classifying the road obstacle. Therefore, the accuracy of the finally obtained image classification model for classifying the road obstacles can be further improved.
By adopting the embodiment of the disclosure, the first loss function value is obtained by calculating the same-class loss value and the different-class loss value between every two sample obstacle images, and the image classification model is trained through the first loss function value, so that the features extracted by the image classification model from sample obstacle images of the same labeling category have high similarity while those extracted from sample obstacle images of different labeling categories have low similarity; the trained image classification model can then perform fine-grained classification on obstacle images with similar contents in the automatic driving field.
In the fields of automatic driving, intelligent transportation and the like, the images to be classified in the embodiment can be specifically images of obstacles to be classified, and the image classification model used for classifying road obstacles and trained in the embodiment can be used for classifying the images of the obstacles to be classified. A plurality of obstacle images to be classified can be acquired, and the plurality of obstacle images to be classified are input into an image classification model for classifying road obstacles, so that the prediction category of each obstacle image to be classified is obtained.
By adopting the embodiment of the disclosure, a plurality of obstacle images to be classified are input into the image classification model to obtain the prediction category of each obstacle image to be classified. Since the image classification model is obtained by training through the image classification model training method in the automatic driving field, it can extract features with high feature similarity from obstacle images of the same category and features with low feature similarity from obstacle images of different categories; further, when the obstacle images to be classified are classified according to the extracted features, obstacle images with similar contents but different categories can be distinguished, thereby realizing fine-grained classification of road obstacles. In the classification process, no feature similarity calculation is involved, and no additional detection, segmentation or labeling operations on the images to be classified are needed, so the implementation process is simplified.
Based on the same inventive concept, an embodiment of the present disclosure further provides an image classification model training apparatus, as shown in fig. 9, the apparatus includes:
an obtaining module 901, configured to obtain a plurality of sample images and an annotation category of each sample image;
an input module 902, configured to input a plurality of sample images into an image classification model, obtain a feature vector of each sample image output by a full connection layer of the image classification model, and obtain a prediction category of each sample image output by a classifier of the image classification model;
a calculating module 903, configured to calculate a feature similarity between every two images in the multiple sample images based on the feature vector of each sample image;
the calculating module 903 is further configured to calculate a first loss function value based on the labeling category of each sample image and the feature similarity between every two images;
a calculating module 903, configured to calculate a second loss function value based on the prediction category of each sample image and the annotation category of each sample image;
and a training module 904, configured to train the image classification model according to the first loss function value and the second loss function value, to obtain a trained image classification model.
In another embodiment of the present disclosure, the calculating module 903 is specifically configured to:
normalizing the feature vector of each sample image to obtain a feature matrix containing the normalized feature vectors of a plurality of sample images;
carrying out dimension transformation on the characteristic matrix to obtain a transposed matrix of the characteristic matrix;
and calculating to obtain the feature similarity between every two images in the plurality of sample images based on the feature matrix and the transposed matrix.
In another embodiment of the present disclosure, the calculating module 903 is specifically configured to:
calculating the same class loss value and different class loss values between two sample images according to the class values of the two sample images and the feature similarity between the two sample images aiming at every two sample images in the plurality of sample images; the category value is used for indicating whether the labeling categories of the two sample images are the same category or not;
the first loss function value is calculated based on the same category loss value and different category loss values between each two sample images in the plurality of sample images.
In another embodiment of the present disclosure, the first loss function used to calculate the first loss function value is:
Figure BDA0003650811800000171
wherein mask_{i,j} is the category value and takes the value 1 or 0: when the labeling categories of the i-th sample image and the j-th sample image are the same, mask_{i,j} is 1; when the labeling categories of the i-th sample image and the j-th sample image are different, mask_{i,j} is 0; and pred_{i,j} represents the feature similarity between the i-th sample image and the j-th sample image.
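The first loss function itself appears only as an image in this text, so its exact expression is not reproduced here; the sketch below implements one plausible form consistent with the stated mask/pred definitions, in which same-class pairs are penalized for low similarity and different-class pairs for high similarity:

```python
import numpy as np

def first_loss(pred, labels):
    """One plausible form of the first loss (an assumption, not the
    patent's verbatim formula): average over all pairs of
    mask * (1 - pred) + (1 - mask) * max(pred, 0)."""
    labels = np.asarray(labels)
    mask = (labels[:, None] == labels[None, :]).astype(float)  # mask[i, j]
    same_class = mask * (1.0 - pred)                 # high pred -> small loss
    diff_class = (1.0 - mask) * np.maximum(pred, 0)  # low pred -> small loss
    return float((same_class + diff_class).mean())

labels = [0, 0, 1]
# Same-class pairs similar, cross-class pairs dissimilar: small loss.
pred_good = np.array([[1.0, 0.95, 0.05],
                      [0.95, 1.0, 0.05],
                      [0.05, 0.05, 1.0]])
# Same-class pairs dissimilar, cross-class pairs similar: large loss.
pred_bad = np.array([[1.0, 0.2, 0.9],
                     [0.2, 1.0, 0.9],
                     [0.9, 0.9, 1.0]])
```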
In another embodiment of the present disclosure, the training module 904 is specifically configured to:
adjusting parameters of the image classification model according to the first loss function value and the second loss function value until the image classification model is converged to obtain a trained image classification model; or,
adjusting parameters of the image classification model according to the first loss function value and the second loss function value until the image classification model is converged, and taking the obtained image classification model as a candidate image classification model;
performing iterative training on the candidate image classification model based on the plurality of sample images to obtain a plurality of candidate image classification models;
and selecting one candidate image classification model from the plurality of candidate image classification models as the trained image classification model.
In another embodiment of the present disclosure, the training module 904 is specifically configured to:
respectively inputting a plurality of test images into each candidate image classification model to obtain the classification result of each candidate image classification model on the plurality of test images;
determining the classification accuracy of each candidate image classification model based on the labeling classes of the plurality of test images and the classification result of each candidate image classification model on the plurality of test images;
and taking the candidate image classification model with the highest classification accuracy as the trained image classification model.
Based on the same inventive concept, the embodiment of the present disclosure further provides an image classification apparatus, as shown in fig. 10, the apparatus including:
an obtaining module 1001 configured to obtain a plurality of images to be classified;
the classification module 1002 is configured to input a plurality of images to be classified into an image classification model to obtain a prediction category of each image to be classified, where the image classification model is the trained image classification model obtained by the image classification model training apparatus described above.
In the technical scheme of the disclosure, the collection, storage, use, processing, transmission, provision, disclosure and other processing of the personal information of the related user are all in accordance with the regulations of related laws and regulations and do not violate the good customs of the public order.
It should be noted that the sample image in the present embodiment is from a public data set.
The present disclosure also provides an electronic device, a readable storage medium, and a computer program product according to embodiments of the present disclosure.
FIG. 11 shows a schematic block diagram of an example electronic device 1100 that may be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 11, the device 1100 comprises a computing unit 1101, which may perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 1102 or a computer program loaded from a storage unit 1108 into a Random Access Memory (RAM) 1103. In the RAM 1103, various programs and data necessary for the operation of the device 1100 may also be stored. The computing unit 1101, the ROM 1102, and the RAM 1103 are connected to each other by a bus 1104. An input/output (I/O) interface 1105 is also connected to the bus 1104.
A number of components in device 1100 connect to I/O interface 1105, including: an input unit 1106 such as a keyboard, a mouse, and the like; an output unit 1107 such as various types of displays, speakers, and the like; a storage unit 1108 such as a magnetic disk, optical disk, or the like; and a communication unit 1109 such as a network card, a modem, a wireless communication transceiver, and the like. The communication unit 1109 allows the device 1100 to exchange information/data with other devices through a computer network such as the internet and/or various telecommunication networks.
The computing unit 1101 can be a variety of general purpose and/or special purpose processing components having processing and computing capabilities. Some examples of the computing unit 1101 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various dedicated Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and the like. The computing unit 1101 performs the respective methods and processes described above, such as an image classification model training method or an image classification method. For example, in some embodiments, the image classification model training method or the image classification method may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 1108. In some embodiments, part or all of the computer program may be loaded and/or installed onto device 1100 via ROM 1102 and/or communication unit 1109. When the computer program is loaded into the RAM 1103 and executed by the computing unit 1101, one or more steps of the image classification model training method or the image classification method described above may be performed. Alternatively, in other embodiments, the computing unit 1101 may be configured to perform the image classification model training method or the image classification method by any other suitable means (e.g., by means of firmware).
Various implementations of the systems and techniques described here above may be implemented in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), system on a chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server combined with a blockchain.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel, sequentially, or in a different order; no limitation is imposed herein as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved.
The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.

Claims (17)

1. An image classification model training method comprises the following steps:
acquiring a plurality of sample images and the labeling category of each sample image;
inputting the plurality of sample images into an image classification model, obtaining a feature vector of each sample image output by a fully connected layer of the image classification model, and obtaining a prediction category of each sample image output by a classifier of the image classification model;
calculating the feature similarity between every two images in the plurality of sample images based on the feature vector of each sample image;
calculating a first loss function value based on the labeling category of each sample image and the feature similarity between every two images;
calculating a second loss function value based on the prediction category of each sample image and the annotation category of each sample image;
and training the image classification model according to the first loss function value and the second loss function value to obtain a trained image classification model.
2. The method of claim 1, wherein the calculating of the feature similarity between every two images in the plurality of sample images based on the feature vector of each sample image comprises:
normalizing the feature vector of each sample image to obtain a feature matrix containing the normalized feature vectors of the plurality of sample images;
performing dimension transformation on the feature matrix to obtain a transposed matrix of the feature matrix;
and calculating the feature similarity between every two images in the plurality of sample images based on the feature matrix and the transposed matrix.
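By way of illustration only (this sketch is not part of the claimed subject matter, and the function name and array shapes are assumptions), the computation of claim 2 can be expressed as: L2-normalize each feature vector, then multiply the feature matrix by its transpose so that entry (i, j) is the cosine similarity of sample images i and j.

```python
import numpy as np

def feature_similarity(features):
    """Normalize each row (one feature vector per sample image), then multiply
    the feature matrix by its transpose: entry (i, j) is the cosine similarity
    between sample images i and j."""
    features = np.asarray(features, dtype=float)
    normalized = features / np.linalg.norm(features, axis=1, keepdims=True)
    return normalized @ normalized.T  # dimension transformation: the transpose

sims = feature_similarity([[1.0, 0.0], [0.0, 2.0], [3.0, 0.0]])
# diagonal entries are 1.0; sims[0, 2] is 1.0 (parallel), sims[0, 1] is 0.0
```

Because the vectors are normalized first, the matrix product directly yields pairwise cosine similarities in [-1, 1] without a separate division step.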
3. The method of claim 1 or 2, wherein the calculating a first loss function value based on the annotation class of each sample image and the feature similarity between the two images comprises:
for every two sample images in the plurality of sample images, calculating a same-class loss value and a different-class loss value between the two sample images according to the class value of the two sample images and the feature similarity between the two sample images, wherein the class value indicates whether the labeling categories of the two sample images are the same;
calculating the first loss function value based on the same-class loss values and different-class loss values between every two sample images in the plurality of sample images.
4. The method of claim 3, wherein the first loss function used to calculate the first loss function value is:
[Formula image: Figure FDA0003650811790000021]
wherein mask_{i,j} is the class value, taking the value 1 or 0: when the labeling categories of the i-th sample image and the j-th sample image are the same, mask_{i,j} is 1; when the labeling categories of the i-th sample image and the j-th sample image are different, mask_{i,j} is 0; and pred_{i,j} represents the feature similarity between the i-th sample image and the j-th sample image.
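The exact first loss function appears only in the formula image (Figure FDA0003650811790000021), which is not reproduced here; the sketch below is one plausible reading of claim 4, not the patented formula: same-class pairs (mask = 1) are penalized for low similarity, and different-class pairs (mask = 0) are penalized for positive similarity.

```python
import numpy as np

def first_loss(sim, labels):
    """sim: pairwise feature-similarity matrix; labels: labeling category of
    each sample image. mask[i, j] is 1 when images i and j share a labeling
    category, else 0 (the class value of claim 4)."""
    labels = np.asarray(labels)
    mask = (labels[:, None] == labels[None, :]).astype(float)
    same_class = mask * (1.0 - sim)                       # want sim -> 1
    different_class = (1.0 - mask) * np.maximum(sim, 0.0)  # want sim <= 0
    return float(np.mean(same_class + different_class))
```

With this form, the loss reaches zero exactly when every same-class pair has similarity 1 and every different-class pair has similarity at most 0.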
5. The method of any of claims 1-4, wherein the training the image classification model based on the first loss function value and the second loss function value, resulting in a trained image classification model, comprises:
adjusting parameters of the image classification model according to the first loss function value and the second loss function value until the image classification model is converged to obtain a trained image classification model; or,
adjusting the image classification model parameters according to the first loss function value and the second loss function value until the image classification model is converged, and taking the obtained image classification model as a candidate image classification model;
performing iterative training on the candidate image classification model based on a plurality of sample images to obtain a plurality of candidate image classification models;
and selecting one candidate image classification model from the candidate image classification models as the trained image classification model.
6. The method of claim 5, wherein the selecting one of the candidate image classification models from the plurality of candidate image classification models as a trained image classification model comprises:
respectively inputting a plurality of test images into each candidate image classification model to obtain the classification result of each candidate image classification model on the plurality of test images;
determining the classification accuracy of each candidate image classification model based on the labeling classes of the plurality of test images and the classification result of each candidate image classification model on the plurality of test images;
and taking the candidate image classification model with the highest classification accuracy as the trained image classification model.
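The model-selection step of claims 5-6 can be sketched as follows (illustrative only; the candidate models are stand-ins, and any callable mapping a test image to a predicted category works here):

```python
def accuracy(model, test_images, test_labels):
    # Fraction of test images whose predicted category matches the labeling category.
    predictions = [model(image) for image in test_images]
    return sum(p == y for p, y in zip(predictions, test_labels)) / len(test_labels)

def select_best(candidate_models, test_images, test_labels):
    # Keep the candidate image classification model with the highest accuracy.
    return max(candidate_models, key=lambda m: accuracy(m, test_images, test_labels))
```

In practice each candidate would be a snapshot of the image classification model from one round of iterative training, all evaluated on the same held-out test images.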
7. An image classification method, comprising:
acquiring a plurality of images to be classified;
inputting the images to be classified into an image classification model to obtain the prediction category of each image to be classified, wherein the image classification model is a trained image classification model obtained by training according to the method of any one of claims 1 to 6.
8. An image classification model training apparatus, comprising:
the acquisition module is used for acquiring a plurality of sample images and the labeling category of each sample image;
the input module is used for inputting the plurality of sample images into an image classification model, acquiring a feature vector of each sample image output by a fully connected layer of the image classification model, and acquiring a prediction category of each sample image output by a classifier of the image classification model;
the calculation module is used for calculating the feature similarity between every two images in the plurality of sample images based on the feature vector of each sample image;
the calculation module is further used for calculating a first loss function value based on the labeling category of each sample image and the feature similarity between every two images;
the calculation module is further used for calculating a second loss function value based on the prediction category of each sample image and the annotation category of each sample image;
and the training module is used for training the image classification model according to the first loss function value and the second loss function value to obtain a trained image classification model.
9. The apparatus of claim 8, wherein the computing module is specifically configured to:
normalizing the feature vector of each sample image to obtain a feature matrix containing the normalized feature vectors of the plurality of sample images;
performing dimension transformation on the feature matrix to obtain a transposed matrix of the feature matrix;
and calculating the feature similarity between every two images in the plurality of sample images based on the feature matrix and the transposed matrix.
10. The apparatus according to claim 8 or 9, wherein the calculation module is specifically configured to:
for every two sample images in the plurality of sample images, calculating a same-class loss value and a different-class loss value between the two sample images according to the class value of the two sample images and the feature similarity between the two sample images, wherein the class value indicates whether the labeling categories of the two sample images are the same;
calculating the first loss function value based on the same-class loss values and different-class loss values between every two sample images in the plurality of sample images.
11. The apparatus of claim 10, wherein a first loss function used to calculate the first loss function value is:
[Formula image: Figure FDA0003650811790000041]
wherein mask_{i,j} is the class value, taking the value 1 or 0: when the labeling categories of the i-th sample image and the j-th sample image are the same, mask_{i,j} is 1; when the labeling categories of the i-th sample image and the j-th sample image are different, mask_{i,j} is 0; and pred_{i,j} represents the feature similarity between the i-th sample image and the j-th sample image.
12. The apparatus according to any one of claims 8-11, wherein the training module is specifically configured to:
adjusting parameters of the image classification model according to the first loss function value and the second loss function value until the image classification model is converged to obtain a trained image classification model; or,
adjusting the image classification model parameters according to the first loss function value and the second loss function value until the image classification model is converged, and taking the obtained image classification model as a candidate image classification model;
performing iterative training on the candidate image classification model based on a plurality of sample images to obtain a plurality of candidate image classification models;
and selecting one candidate image classification model from the candidate image classification models as the trained image classification model.
13. The apparatus of claim 12, wherein the training module is specifically configured to:
respectively inputting a plurality of test images into each candidate image classification model, and obtaining the classification result of each candidate image classification model on the plurality of test images;
determining the classification accuracy of each candidate image classification model based on the labeling classes of the plurality of test images and the classification result of each candidate image classification model on the plurality of test images;
and taking the candidate image classification model with the highest classification accuracy as the trained image classification model.
14. An image classification apparatus comprising:
the acquisition module is used for acquiring a plurality of images to be classified;
a classification module, configured to input the plurality of images to be classified into an image classification model to obtain a prediction category of each image to be classified, wherein the image classification model is a trained image classification model obtained by the apparatus of any one of claims 8 to 13.
15. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-6 or 7.
16. A non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to perform the method of any one of claims 1-6 or 7.
17. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any one of claims 1-6 or 7.
CN202210551236.9A 2022-05-18 2022-05-18 Image classification model training method, image classification method, device and equipment Pending CN114821190A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210551236.9A CN114821190A (en) 2022-05-18 2022-05-18 Image classification model training method, image classification method, device and equipment

Publications (1)

Publication Number Publication Date
CN114821190A true CN114821190A (en) 2022-07-29

Family

ID=82518265

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210551236.9A Pending CN114821190A (en) 2022-05-18 2022-05-18 Image classification model training method, image classification method, device and equipment

Country Status (1)

Country Link
CN (1) CN114821190A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115689928A (en) * 2022-10-31 2023-02-03 国网电力空间技术有限公司 Method and system for removing duplicate of transmission tower inspection image under visible light
CN115689928B (en) * 2022-10-31 2023-11-28 国网电力空间技术有限公司 Method and system for removing duplication of transmission tower inspection images under visible light

Similar Documents

Publication Publication Date Title
US11232318B2 (en) Methods and apparatuses for vehicle appearance feature recognition, methods and apparatuses for vehicle retrieval, storage medium, and electronic devices
CN113920307A (en) Model training method, device, equipment, storage medium and image detection method
CN114118124B (en) Image detection method and device
CN113379718A (en) Target detection method and device, electronic equipment and readable storage medium
EP3443482B1 (en) Classifying entities in digital maps using discrete non-trace positioning data
CN112633276A (en) Training method, recognition method, device, equipment and medium
CN110633594A (en) Target detection method and device
EP4318313A1 (en) Data processing method, training method for neural network model, and apparatus
CN115861400B (en) Target object detection method, training device and electronic equipment
CN113642583A (en) Deep learning model training method for text detection and text detection method
CN113947188A (en) Training method of target detection network and vehicle detection method
CN112749300A (en) Method, apparatus, device, storage medium and program product for video classification
CN114861758A (en) Multi-modal data processing method and device, electronic equipment and readable storage medium
CN114821190A (en) Image classification model training method, image classification method, device and equipment
CN112749701B (en) License plate offset classification model generation method and license plate offset classification method
CN113901998A (en) Model training method, device, equipment, storage medium and detection method
CN117034090A (en) Model parameter adjustment and model application methods, devices, equipment and media
CN116935368A (en) Deep learning model training method, text line detection method, device and equipment
CN115879004A (en) Target model training method, apparatus, electronic device, medium, and program product
CN113344121B (en) Method for training a sign classification model and sign classification
CN115861809A (en) Rod detection and training method and device for model thereof, electronic equipment and medium
CN114882283A (en) Sample image generation method, deep learning model training method and device
CN115272705A (en) Method, device and equipment for training salient object detection model
CN114332509A (en) Image processing method, model training method, electronic device and automatic driving vehicle
CN110705695B (en) Method, device, equipment and storage medium for searching model structure

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination