CN113076840A - Vehicle post-shot image brand training method - Google Patents
Vehicle post-shot image brand training method
- Publication number: CN113076840A
- Application number: CN202110322535.0A
- Authority
- CN
- China
- Legal status: Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/50—Context or environment of the image
- G06V20/52—Surveillance or monitoring of activities, e.g. for recognising suspicious objects
- G06V20/54—Surveillance or monitoring of activities, e.g. for recognising suspicious objects of traffic, e.g. cars on the road, trains or boats
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/253—Fusion techniques of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T3/00—Geometric image transformation in the plane of the image
- G06T3/40—Scaling the whole image or part thereof
- G06T3/4023—Decimation- or insertion-based scaling, e.g. pixel or line decimation
-
- G06T5/70—
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V2201/00—Indexing scheme relating to image or video recognition or understanding
- G06V2201/08—Detecting or categorising vehicles
Abstract
The invention provides a brand training method for vehicle post-shot images based on a triplet loss function and adversarial-noise fusion.
Description
Technical Field
The invention belongs to the technical field of image recognition, and particularly relates to a brand training method for a vehicle post-shot image.
Background
Vehicles, as an important means of transportation, are closely related to people's lives. With the development of modern automobile manufacturing technology, more and more automobiles appear in daily life. In intelligent traffic construction, a monitoring camera can only capture front or rear images of a vehicle, and rear-shot images appear far more often in road monitoring cameras than front-shot images. However, vehicle post-shot images tend to be highly similar, with small differences between categories; different vehicle types may differ only slightly, for example in the positions of the lamps. Fine-grained vehicle identification has therefore usually required experts in the field. Because it is time-consuming and impractical for experts to identify all vehicle brands, an automatic brand identification method for vehicle post-shot images is needed; it would both meet the construction requirements of modern intelligent traffic and promote the development of smart cities. Driven by these practical requirements, how to design an automatic fine-grained vehicle classification method has long been a research direction. This is a very challenging problem, because identifying models requires capturing subtle visual differences between classes or instances that are easily masked by other factors (e.g., viewpoint, lighting, or scene).
In recent years, with deepening research, the release of large public image data sets, and the appearance of high-performance computing systems, it has become possible to train deep neural networks with large numbers of parameters. The convolutional neural network (CNN), with its excellent feature extraction capability, has become one of the classic methods in the visual field and has made breakthrough progress on many visual tasks. How to use a CNN to determine the fine-grained category of a vehicle from its post-shot image has therefore become a current research hot spot.
The flow chart of the existing method for classifying the fine granularity of the vehicle is shown in fig. 1:
Fine-grained vehicle classification means that information such as the brand, car series, and model year of a vehicle can be accurately recognized from an input image containing the vehicle. A high-performance fine-grained vehicle identification method can correctly identify tiny characteristics of a certain part of the vehicle in the picture without the assistance of field experts, and plays an important role in fields such as urban traffic construction and public safety protection.
Existing fine-grained vehicle classification methods are generally based on deep learning, and deep learning-based methods are composed of two parts: a training phase and a testing phase.
1. For the training phase, the basic flow can be summarized as:
basic feature extraction: basic features of the pictures to be classified are extracted by a convolutional neural network (CNN), yielding a feature map of the input picture data.
Classifying by a classifier: the extracted features are input into a classifier, which outputs the fine-grained class of the vehicle contained in the picture.
Calculating a loss function: the class output by the classifier is compared with the correct class of the vehicle in the image, and the loss is computed by a preset loss function.
Back propagation: the back propagation algorithm propagates the loss to each parameter in the model, computes the gradient of each parameter, and optimizes the parameters by gradient descent.
For the testing stage (inference stage), only 1) basic feature extraction and 2) classifier classification from the training stage are needed; the loss computation and back propagation are omitted.
The prior art has the following defects:
1. no consideration is given to supervising the extracted features. In the traditional vehicle identification method, only the supervision information is directly added to the classifier, and the addition of the supervision information to the extracted features is not considered, so that the features extracted by the convolutional neural network cannot completely concern the vehicle and possibly concern background information, and the extracted features are incomplete and cannot be used as the class characterization of the vehicle.
2. The extracted features are only applicable to the trained class. In the traditional method, only class supervision information is added, so that the extracted features are only effective for trained classes, the effect on classes which are not trained is poor, but new brands can continuously appear on vehicles, and the existing method cannot be well suitable for the new vehicle brands.
Disclosure of Invention
The invention provides a brand training method for a vehicle post-shot image, aiming at the problems in the prior art.
The invention is realized by the following technical scheme:
a brand training method for a vehicle post-shot image comprises the following steps:
s1, training a convolutional neural network feature extraction model;
the specific step S1 includes: s1.1, acquiring a plurality of images to form an input element, and selecting a triple;
s1.2, extracting basic features of a first input image through a convolutional neural network model to obtain a basic representation;
s1.3, calculating the corresponding loss according to a triplet loss function;
s1.4, updating parameters in the convolutional neural network model through a back propagation algorithm to obtain a final convolutional neural network feature extraction model;
s2, training a classifier;
s2.1, extracting the features of a second input image by using the final feature extraction model;
s2.2, generating corresponding noise data with an adversarial noise generator, and fusing it with the features extracted in step S2.1 through a fusion strategy;
s2.3, inputting the fused features into a classifier to obtain a classification result;
s2.4, performing loss calculation on the classification result and a preset correct class;
and S2.5, transmitting the loss calculated in the step S2.4 to parameters of the classifier through a back propagation algorithm, and optimizing the parameters of the classifier through a gradient descent method to obtain a final recognition model.
Further, the triplet is composed of anchor data x_a, positive example data x_p, and negative example data x_n.
Further, in step S1.2, extracting the basic features comprises adjusting the size of the image to 256 × 256 by linear interpolation, inputting the adjusted image into a convolutional neural network model, performing feature aggregation on the output of the model through one convolutional network layer, and taking the output of that layer as the extracted image representation, which is a one-dimensional vector with 256 elements.
Further, in step S1.3, the triplet loss function is calculated as:

L = \sum_{i=1}^{N} \max\left( \|f(x_a^i) - f(x_p^i)\|_2^2 - \|f(x_a^i) - f(x_n^i)\|_2^2 + \alpha,\ 0 \right)

where f(\cdot) denotes the convolutional neural network used for feature extraction and \|\cdot\|_2 denotes the L2 distance metric. For the N triplets, the differences between the anchor-positive and anchor-negative distances are computed and summed to obtain the final loss; minimizing it requires the network to make the representation of the positive data as consistent as possible with that of the anchor data, and the representation of the negative data as far from it as possible, so that for all triplets:

\|f(x_a) - f(x_p)\|_2^2 + \alpha < \|f(x_a) - f(x_n)\|_2^2
further, in the step S2.2, the generating corresponding noise data according to the counternoise generator, and fusing the corresponding noise data with the features extracted in the step S2.1 through a fusion strategy, includes the steps of:
s2.2.1, generating noise data with a mean value of 0 and a variance of 1 based on a Gaussian function;
s2.2.2, transforming the image using autoimplementation image enhancement method based on ImageNet training;
and S63, performing linear weighted fusion on the noise data and the features extracted in the step S5.
Further, in the step S2.2.2, the transformation includes rotation, channel change or brightness adjustment.
Further, in step S2.3, the fused feature is a feature map of size [C, H, W]; it is passed through a spatial pooling layer, which averages the features over the [H, W] dimensions and aggregates the feature map into a feature vector with C values. This feature vector representation is then input into a fully connected layer for classification to obtain the final classification result.
Further, in step S2.5, stochastic gradient descent (SGD) is selected as the optimizer, and the parameter update can be expressed as:

\theta \leftarrow \theta - \alpha \nabla_{\theta} L

where \theta is the parameter to be updated and \alpha is the learning rate.
A computer-readable storage medium, on which a computer program is stored which, when executed by a processor, carries out the steps of the brand training method for vehicle post-shot images.
A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps of a vehicle post-shot image brand training method when executing the program.
The invention provides a new brand training method for vehicle post-shot images based on triplet loss and adversarial-noise fusion. The triplet loss supervises the extracted features directly, pulling same-class representations together and pushing different-class representations apart in the metric space, while the adversarial-noise fusion randomly injects noise perturbations during the training stage so that feature extraction attends to the essential characteristics of the vehicle rather than background information. Compared with existing fine-grained vehicle classification methods, the method accounts for the different association relations within and between classes, and the added adversarial noise makes it more robust, so a more complete brand representation of the vehicle can be extracted and a better fine-grained recognition effect achieved.
Compared with the prior art, the invention has the following beneficial effects. Because of the pooling and nonlinear mapping layers, the convolutional neural network has a strong feature extraction capability, and Gaussian noise added to the image input is easily ignored; the Gaussian noise is therefore added before the classifier instead. This gives the image representations input into the classifier diversity, so the classifier continuously expands the boundary of each class subspace and becomes more robust. As a result, even if the representation that the convolutional neural network extracts from test data is inconsistent with the training data, the classifier can still classify correctly, improving classification accuracy.
Drawings
The present invention will be described in further detail with reference to the accompanying drawings;
FIG. 1 shows the basic flow of a conventional vehicle training method, in which (a) shows the training phase and (b) shows the test phase;
FIG. 2 is a flow chart of the training phase of the training method of the present invention, in which (a) shows the convolutional neural network training and (b) shows the classifier training;
FIG. 3 is a flow chart of the present method during the inference phase.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The invention relates to a brand training method using vehicle post-shot images; a flow chart of the training stage is shown in FIG. 2. Compared with the traditional method, the training stage of the proposed method consists of two parts: a convolutional neural network feature extraction model training stage and a classifier training stage. For the training stage of the convolutional neural network feature extraction model, the main process is as follows:
1. Selecting a triplet: triplet loss is a widely used tool for metric learning. It generally has three inputs: anchor data (anchor sample), positive data (positive sample), and negative data (negative sample). It achieves feature learning by pulling the metric distance between the anchor and positive data closer while pushing the metric distance between the anchor and negative data farther, so that features of the same class aggregate in the feature space. Because the triplet loss needs these three inputs, an input element must be formed from several images: the training data is screened and a triplet is selected;
2. Basic feature extraction: basic features of the input picture data are extracted through a classical convolutional neural network architecture to obtain basic representations;
3. Triplet loss calculation: the corresponding loss is calculated according to the proposed triplet loss function;
4. Back propagation: the parameters in the model are optimized through a back propagation algorithm to obtain the final feature extraction model.
For the classifier training part, the main flow is as follows:
1. The convolutional neural network obtained in the feature extraction training is used as a feature extractor to extract the features of the input image.
2. Noise data generated by the adversarial noise generator is fused with the features extracted by the convolutional neural network through a fusion strategy.
3. The fused features are input into a classifier to obtain a classification result.
4. The loss between the classification result and the correct class is calculated.
5. The loss is propagated to the parameters of the classifier through a back propagation algorithm, and the classifier parameters are optimized by gradient descent to obtain the final recognition model.
Example 1 triplet selection
A triplet is composed of anchor data x_a, positive example data x_p, and negative example data x_n, so each time the convolutional neural network in the model is trained, a triplet must be selected from the data. First, two vehicle brand categories A and B are selected from the data, with A set as the reference brand. An image a in category A is selected as the anchor data, and another image of category A is selected as the positive example data. For category B, the negative example data is selected as follows: m images are selected from category B and input into the convolutional neural network to obtain their representations; the distance between each representation and that of a is computed in turn, and the image whose representation is nearest to a and at a distance smaller than the threshold α is selected.
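The negative-example screening just described can be illustrated with a minimal NumPy sketch. This is not the patent's implementation; the `select_negative` helper and the embedding shapes are assumptions for illustration. It picks, from the m candidate representations of category B, the one nearest to the anchor representation, provided that distance is below the threshold α:

```python
import numpy as np

def select_negative(anchor_emb, candidate_embs, alpha):
    """Pick the index of the candidate embedding closest to the anchor,
    provided its L2 distance is below alpha; return None otherwise."""
    dists = np.linalg.norm(candidate_embs - anchor_emb, axis=1)  # L2 distances
    i = int(np.argmin(dists))
    return i if dists[i] < alpha else None

anchor = np.array([0.0, 0.0])
cands = np.array([[1.0, 0.0], [0.5, 0.0], [2.0, 0.0]])
print(select_negative(anchor, cands, alpha=0.6))  # index of nearest candidate
```

Rejecting candidates whose distance exceeds α keeps only "hard but plausible" negatives, which is what makes the triplet informative during training.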
Example 2 basic feature extraction
For a vehicle post-shot image, the size is first adjusted to 256 × 256 by linear interpolation. The image is then input into a convolutional neural network (ResNet50 is used as the basic feature extraction network), the output of the model is aggregated through one convolutional layer, and the output of that layer is taken as the extracted image representation, which is a one-dimensional vector with 256 elements.
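The aggregation head described above can be sketched in NumPy. Since the exact head details are not specified, this sketch assumes the aggregating convolution is a 1 × 1 channel-mixing layer (a matrix multiply over channels) followed by a global spatial average to reach the one-dimensional 256-element representation:

```python
import numpy as np

def aggregate(backbone_out, w_agg):
    """Map a [C, H, W] backbone output to a 256-element vector.
    w_agg has shape (256, C); both the 1x1-conv view and the global
    average are assumptions about the unspecified head."""
    mixed = np.tensordot(w_agg, backbone_out, axes=([1], [0]))  # (256, H, W)
    return mixed.mean(axis=(1, 2))                              # (256,)

feat_map = np.ones((4, 2, 2))        # stand-in for the backbone output
w = np.ones((256, 4))                # hypothetical aggregation weights
rep = aggregate(feat_map, w)
print(rep.shape)                     # (256,)
```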
Example 3 triplet loss function
In order to minimize the distance between representations of the same class in the metric space while making the distance between representations of different classes as large as possible, the triplet loss function is designed as:

L = \sum_{i=1}^{N} \max\left( \|f(x_a^i) - f(x_p^i)\|_2^2 - \|f(x_a^i) - f(x_n^i)\|_2^2 + \alpha,\ 0 \right)

where f(\cdot) denotes the convolutional neural network used for feature extraction and \|\cdot\|_2 denotes the L2 distance metric. For the N triplets, the differences between the anchor-positive and anchor-negative distances are computed and summed to obtain the final loss; minimizing it requires the network to make the representation of the positive sample as consistent as possible with that of the anchor data, and the representation of the negative sample as far from it as possible, so that for all triplets:

\|f(x_a) - f(x_p)\|_2^2 + \alpha < \|f(x_a) - f(x_n)\|_2^2
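A minimal NumPy sketch of the triplet loss just described, assuming the standard hinge form with margin α (the source formula is reconstructed from the surrounding text, so treat the exact form as an assumption):

```python
import numpy as np

def triplet_loss(f_a, f_p, f_n, margin=0.2):
    """Triplet loss over N triplets of embeddings of shape (N, D):
    sum over i of max(||a_i - p_i||^2 - ||a_i - n_i||^2 + margin, 0)."""
    d_pos = np.sum((f_a - f_p) ** 2, axis=1)   # squared L2, anchor-positive
    d_neg = np.sum((f_a - f_n) ** 2, axis=1)   # squared L2, anchor-negative
    return float(np.sum(np.maximum(d_pos - d_neg + margin, 0.0)))

f_a = np.array([[0.0, 0.0]])
f_p = np.array([[2.0, 0.0]])   # anchor-positive distance^2 = 4
f_n = np.array([[1.0, 0.0]])   # anchor-negative distance^2 = 1
print(triplet_loss(f_a, f_p, f_n))  # positive: negative is too close
```

When the negative is already farther from the anchor than the positive by more than the margin, the hinge clamps that triplet's contribution to zero, so only violating triplets drive the gradient.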
Example 4 countering noise Generation
Even post-shot images of the same vehicle type differ because of factors such as changes in viewing angle and vehicle posture. To counter the influence of such input variation, a strategy for adversarial noise generation is provided. First, noise data with a mean of 0 and a variance of 1 is generated by a Gaussian function. Second, the image is transformed (e.g., rotation, channel change, brightness adjustment) using an automatic image enhancement method trained on ImageNet. Finally, the noise data is fused with the image representation extracted by the convolutional neural network by linear weighted fusion, and the fused result is input into the classifier for classification, which in turn yields a more robust classifier after training.
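The noise fusion step can be sketched as follows. The blend weight is a hypothetical hyperparameter: the patent specifies only "linear weighted fusion" of standard-normal noise with the extracted features, not the weighting itself.

```python
import numpy as np

def fuse_with_noise(features, weight=0.1, rng=None):
    """Linearly blend standard-normal noise into a feature vector:
    fused = (1 - weight) * features + weight * noise.
    The 0.1 default weight is an assumption, not from the patent."""
    if rng is None:
        rng = np.random.default_rng(0)
    noise = rng.standard_normal(features.shape)  # mean 0, variance 1
    return (1.0 - weight) * features + weight * noise

feats = np.arange(6.0).reshape(1, 6)
fused = fuse_with_noise(feats, weight=0.25)
print(fused.shape)  # same shape as the input features
```

Because the perturbation is applied to the representation rather than the raw pixels, it survives the pooling layers that would otherwise smooth out input-space noise, which is the robustness argument the description makes.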
Example 5 classifier Classification
The finally obtained vehicle feature is a feature map of size [C, H, W]. It is passed through a spatial pooling layer, which averages the features over the [H, W] dimensions and aggregates the feature map into a feature vector with C values; this feature vector representation is then input into a fully connected layer for classification to obtain the final classification result.
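A minimal NumPy sketch of this pooling-plus-classification step: spatially average a [C, H, W] feature map into a C-vector, apply a fully connected layer (here a plain matrix multiply with illustrative weights), and take the arg-max class:

```python
import numpy as np

def classify(feat_map, w_fc, b_fc):
    """Average a [C, H, W] feature map over H and W, apply a fully
    connected layer, and return the predicted class index."""
    pooled = feat_map.mean(axis=(1, 2))   # [C, H, W] -> [C]
    logits = w_fc @ pooled + b_fc         # fully connected layer
    return int(np.argmax(logits))

fm = np.ones((2, 4, 4))                        # toy 2-channel feature map
w = np.array([[1.0, 0.0], [0.0, 2.0]])         # illustrative weights
b = np.zeros(2)
print(classify(fm, w, b))
```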
Example 6 network parameter training
The parameters in the convolutional neural network and in the classifier are learned by back propagation and gradient descent.
In the training stage, the image is normalized to 256 × 256 and input into the proposed method. The image features obtained by the convolutional neural network are fed into the triplet loss function to obtain the triplet loss; the loss is then propagated back to every position requiring parameter learning by the back propagation method, and the parameters are modified according to the obtained gradients. After the convolutional neural network training is finished, the training data are input into the convolutional neural network to obtain image representations, which are used to train the classifier.
In training the classifier, the loss between the predicted result and the given standard result is calculated using the cross-entropy loss:

L = -\frac{1}{N} \sum_{i=1}^{N} \sum_{c=1}^{M} y_{ic} \log(p_{ic})

where N is the total number of pictures, M is the number of categories, p_{ic} is the predicted probability that sample i belongs to category c, and y_{ic} indicates whether the ground-truth category of sample i is c (1 if so, 0 otherwise).
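A worked NumPy version of the cross-entropy above; the small `eps` guard against log(0) is an added implementation detail, not part of the formula:

```python
import numpy as np

def cross_entropy(p, y, eps=1e-12):
    """Mean cross-entropy over N samples and M classes.
    p[i, c]: predicted probability that sample i is class c.
    y[i, c]: 1 if c is the ground-truth class of sample i, else 0."""
    return float(-np.mean(np.sum(y * np.log(p + eps), axis=1)))

p = np.array([[0.5, 0.5]])   # maximally uncertain prediction
y = np.array([[1.0, 0.0]])   # true class is 0
print(cross_entropy(p, y))   # log(2) ~ 0.693
```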
Stochastic gradient descent (SGD) is used as the optimizer, and the parameter update can be expressed as:

\theta \leftarrow \theta - \alpha \nabla_{\theta} L

where \theta is the parameter to be updated and \alpha is the learning rate.
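The SGD update can be written directly as a one-step function; the concrete values below are only a worked example of the formula:

```python
import numpy as np

def sgd_step(theta, grad, lr=0.01):
    """One stochastic-gradient-descent update: theta <- theta - lr * grad."""
    return theta - lr * grad

theta = np.array([1.0, -2.0])
grad = np.array([0.5, 0.5])
print(sgd_step(theta, grad, lr=0.1))  # each parameter moves against its gradient
```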
The invention provides a brand training method for vehicle post-shot images based on a triplet loss function and adversarial-noise fusion.
The improvement achieved by this scheme is shown below:
Model | Accuracy
---|---
ResNet50 | 89.7%
Method of the invention | 92.1%
The present invention also provides a computer-readable storage medium having a computer program stored thereon, wherein the program, when executed by a processor, performs the steps of the brand training method for vehicle post-shot images.
The invention also provides computer equipment comprising a memory, a processor and a computer program stored on the memory and capable of running on the processor, wherein the processor realizes the steps of the brand training method of the vehicle post-shooting image when executing the program.
The above-mentioned embodiments are provided to further explain the objects, technical solutions and advantages of the present invention in detail, and it should be understood that the above-mentioned embodiments are only examples of the present invention and are not intended to limit the scope of the present invention. Any modification, equivalent replacement, improvement and the like made without departing from the spirit and scope of the invention are also within the protection scope of the invention.
Claims (10)
1. A brand training method for a vehicle post-shot image is characterized by comprising the following steps:
s1, training a convolutional neural network feature extraction model;
the specific step S1 includes: s1.1, acquiring a plurality of images to form an input element, and selecting a triple;
s1.2, extracting basic features of a first input image through a convolutional neural network model to obtain a basic representation;
s1.3, calculating the corresponding loss according to a triplet loss function;
s1.4, updating parameters in the convolutional neural network model through a back propagation algorithm to obtain a final convolutional neural network feature extraction model;
s2, training a classifier;
s2.1, extracting the features of a second input image by using the final feature extraction model;
s2.2, generating corresponding noise data with an adversarial noise generator, and fusing it with the features extracted in step S2.1 through a fusion strategy;
s2.3, inputting the fused features into a classifier to obtain a classification result;
s2.4, performing loss calculation on the classification result and a preset correct class;
and S2.5, transmitting the loss calculated in the step S2.4 to parameters of the classifier through a back propagation algorithm, and optimizing the parameters of the classifier through a gradient descent method to obtain a final recognition model.
2. The vehicle post-shot image brand training method of claim 1, wherein said triplet is composed of anchor point data x_a, positive example data x_p, and negative example data x_n.
3. The method for brand training of post-capture images of vehicles according to claim 1, wherein in step S1.2, extracting the basic features comprises adjusting the size of the image to 256 × 256 by linear interpolation, inputting the image into a convolutional neural network model, performing feature aggregation on the output of the model through a convolutional network layer, and taking the output of the layer as an extracted image representation, wherein the representation is a one-dimensional vector with 256 elements.
4. The brand training method for post-shot images of vehicles according to claim 2, wherein in step S1.3, the triplet loss function is:

L = \sum_{i=1}^{N} \max\left( \|f(x_a^i) - f(x_p^i)\|_2^2 - \|f(x_a^i) - f(x_n^i)\|_2^2 + \alpha,\ 0 \right)

where f(\cdot) denotes the convolutional neural network used for feature extraction and \|\cdot\|_2 denotes the L2 distance metric. For the N triplets, the differences between the anchor-positive and anchor-negative distances are computed and summed to obtain the final loss; minimizing it requires the network to make the representation of the positive data as consistent as possible with that of the anchor data, and the representation of the negative data as far from it as possible, so that for all triplets:

\|f(x_a) - f(x_p)\|_2^2 + \alpha < \|f(x_a) - f(x_n)\|_2^2
5. The brand training method for post-shot images of vehicles according to claim 1, wherein in step S2.2, generating the corresponding noise data with the adversarial noise generator and fusing it with the features extracted in step S2.1 through a fusion strategy comprises the steps of:
s2.2.1, generating noise data with a mean value of 0 and a variance of 1 based on a Gaussian function;
s2.2.2, transforming the image using an automatic image enhancement method trained on ImageNet;
s2.2.3, performing linear weighted fusion of the noise data and the features extracted in step S2.1.
6. The method for brand training of post-capture images of vehicles according to claim 5, wherein in said step S2.2.2, said transformation comprises rotation, channel change or brightness adjustment.
7. The method for brand training of post-shot images of vehicles according to claim 1, wherein in step S2.3, the fused feature is a feature map of size [C, H, W]; it is passed through a spatial pooling layer, which averages the features over the [H, W] dimensions and aggregates the feature map into a feature vector with C values, and this feature vector representation is input into a fully connected layer for classification to obtain the final classification result.
8. The brand training method for post-shot images of vehicles according to claim 7, wherein in step S2.5, stochastic gradient descent (SGD) is selected as the optimizer, and the parameter update can be expressed as:

\theta \leftarrow \theta - \alpha \nabla_{\theta} L

where \theta is the parameter to be updated and \alpha is the learning rate.
9. A computer-readable storage medium, on which a computer program is stored, which program, when being executed by a processor, is adapted to carry out the steps of the method for brand training by post-capture images of vehicles as claimed in any one of claims 1 to 8.
10. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor when executing the program performs the steps of the vehicle post-shot image brand training method of any one of claims 1-8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110322535.0A CN113076840A (en) | 2021-03-25 | 2021-03-25 | Vehicle post-shot image brand training method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113076840A true CN113076840A (en) | 2021-07-06 |
Family
ID=76610306
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110322535.0A Pending CN113076840A (en) | 2021-03-25 | 2021-03-25 | Vehicle post-shot image brand training method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113076840A (en) |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106650721A * | 2016-12-28 | 2017-05-10 | 吴晓军 | Industrial character recognition method based on a convolutional neural network |
CN107679078A * | 2017-08-29 | 2018-02-09 | 银江股份有限公司 | Deep-learning-based method and system for fast vehicle retrieval in checkpoint images |
CN108229532A * | 2017-10-30 | 2018-06-29 | 北京市商汤科技开发有限公司 | Image recognition method, apparatus and electronic device |
CN109685121A * | 2018-12-11 | 2019-04-26 | 中国科学院苏州纳米技术与纳米仿生研究所 | Training method for an image encryption algorithm, image retrieval method, and computer device |
CN109784366A * | 2018-12-07 | 2019-05-21 | 北京飞搜科技有限公司 | Fine-grained classification method and apparatus for target objects, and electronic device |
CN112396027A * | 2020-12-01 | 2021-02-23 | 北京交通大学 | Vehicle re-identification method based on a graph convolutional neural network |
CN112396003A * | 2020-11-20 | 2021-02-23 | 平安科技(深圳)有限公司 | Model training method, recognition method, apparatus, device and storage medium |
CN112529161A * | 2020-12-10 | 2021-03-19 | 北京百度网讯科技有限公司 | Training method for a generative adversarial network, and face image translation method and device |
Similar Documents
Publication | Title |
---|---|
CN109740419B | Attention-LSTM network-based video behavior identification method |
CN107563372B | License plate positioning method based on the deep-learning SSD framework |
Wu et al. | A method of vehicle classification using models and neural networks |
CN110110689B | Pedestrian re-identification method |
CN110390308B | Video behavior identification method based on a spatio-temporal adversarial generative network |
CN111462192A | Spatio-temporal two-stream fusion convolutional neural network dynamic obstacle avoidance method for a sidewalk sweeping robot |
CN110909741A | Vehicle re-identification method based on background segmentation |
Nasaruddin et al. | A lightweight moving vehicle classification system through attention-based method and deep learning |
CN116152768A | Intelligent driving early warning system and method based on road condition identification |
CN111461181A | Vehicle fine-grained classification method and device |
CN115861981A | Driver fatigue behavior detection method and system based on video posture invariance |
CN113177528B | License plate recognition method and system based on a network model trained with a multi-task learning strategy |
CN111401143A | Pedestrian tracking system and method |
CN114926796A | Bend detection method based on a novel mixed attention module |
Yang et al. | Dangerous driving behavior recognition based on improved YOLOv5 and OpenPose |
CN104200202B | Upper-body human detection method based on a cumulative perceptron |
CN106650814B | Outdoor road adaptive classifier generation method based on vehicle-mounted monocular vision |
CN113591928A | Vehicle re-identification method and system based on multi-view and convolutional attention module |
CN113160117A | Three-dimensional point cloud target detection method for autonomous driving scenes |
CN116631190A | Intelligent traffic monitoring system and method |
CN113076840A | Vehicle post-shot image brand training method |
CN113887536B | Multi-stage efficient crowd density estimation method based on high-level semantic guidance |
CN111428567A | Pedestrian tracking system and method based on affine multi-task regression |
Pan et al. | A hybrid deep learning algorithm for license plate detection and recognition in vehicle-to-vehicle communications |
CN112487927B | Method and system for indoor scene recognition based on object-associated attention |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||