CN114842343A - ViT-based aerial image identification method - Google Patents

ViT-based aerial image identification method

Info

Publication number
CN114842343A
CN114842343A (application CN202210541111.8A)
Authority
CN
China
Prior art keywords
image
model
aerial image
vit
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210541111.8A
Other languages
Chinese (zh)
Inventor
熊盛武
赵怡晨
陈亚雄
路雄博
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan University of Technology WUT
Original Assignee
Wuhan University of Technology WUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan University of Technology WUT filed Critical Wuhan University of Technology WUT
Priority to CN202210541111.8A priority Critical patent/CN114842343A/en
Publication of CN114842343A publication Critical patent/CN114842343A/en
Pending legal-status Critical Current


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/10 Terrestrial scenes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/42 Global feature extraction by analysis of the whole pattern, e.g. using frequency domain transformations or autocorrelation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Abstract

The invention discloses a ViT-based aerial image identification method comprising the following steps: S1, acquiring an aerial image data set and constructing a training set, a validation set and a test set; S2, expanding the data volume of the training set; S3, constructing a ViT-based aerial image recognition model; S4, inputting the expanded training set into the recognition model, performing discriminative label smoothing on the labels corresponding to the images, training the model with a cross-entropy loss function and a discriminative contrastive loss function, updating the recognition model through backpropagation, and selecting the optimal aerial image recognition model; S5, testing the recognition performance of the model on the test set. By applying discriminative label smoothing to the image labels while supervising training with both the cross-entropy and the discriminative contrastive loss, the invention obtains a ViT-based aerial image recognition model with stronger feature-learning capability, offering advantages such as a high recognition rate and strong extensibility.

Description

ViT-based aerial image identification method
Technical Field
The invention relates to the technical field of machine learning algorithms and image processing, and in particular to a ViT-based aerial image identification method.
Background
Aerial image recognition refers to identifying the category to which a given aerial image belongs. With the increasing maturity of aviation technology and the continual improvement of aerial image resolution, aerial images play an increasingly important role in people's daily lives. Tasks such as natural disaster detection, urban planning, resource exploration and thematic mapping cannot be carried out without aerial image recognition, so accurate recognition of aerial images has important value.
Although the overall volume of aerial imagery is large, the data sets usable for model training are few in number and low in quality, labeled data sets are rare, and noisy samples and hard samples are common. In addition, most aerial images are taken from a top-down view and are characterized by a wide imaging range, large scale variation, and sparse targets within a scene. Compared with natural images, aerial image recognition is therefore difficult because of the small amount of data and the complex backgrounds.
At present, most solutions establish targeted lightweight deep learning algorithms that do not extend to more diverse aerial images, and so have limitations. In addition, most methods supervise the model only with a cross-entropy loss that learns label information, without considering the internal information of the aerial images.
Disclosure of Invention
Aiming at the defects in the background art, the invention provides a ViT-based aerial image recognition method that exploits the advantages of ViT (Vision Transformer) in capturing long-range dependencies and its dynamic, adaptive modeling capability, uses ViT as the feature encoder of an image to capture salient semantic features, and improves upon ViT so that the limited aerial image data can be fully used for training while avoiding overfitting the noise in the images.
In order to achieve this purpose, the invention designs a ViT-based aerial image recognition method, characterized by comprising the following steps:
S1) acquiring an aerial image data set to obtain the required original aerial images $x_i$ and their corresponding category labels $y_i$, and dividing them proportionally into a training set, a validation set and a test set, used subsequently for training, validating and evaluating the model respectively; the training set is denoted $D=\{(x_i,y_i)\}_{i=1}^{B}$, where $B$ is the number of training images;
S2) performing online data enhancement on the training set images so that each image in the training set generates $M$ different enhanced images; the expanded training set contains $B\cdot M$ images and is denoted $\hat{D}=\{(\hat{x}_i,\hat{y}_i)\}_{i=1}^{B\cdot M}$;
S3) constructing an aerial image recognition model based on ViT;
S4) inputting the expanded training set $\hat{D}$ into the ViT-based aerial image recognition model, performing discriminative label smoothing on the labels corresponding to the images, training the model with a cross-entropy loss function and a discriminative contrastive loss function, updating the recognition model through backpropagation, and selecting the optimal aerial image recognition model with the validation set of step S1);
S5) testing the recognition performance of the aerial image recognition model with the test set of step S1) to obtain the final model recognition accuracy; when the model recognition accuracy reaches a set threshold, images to be recognized are input into the aerial image recognition model for recognition; otherwise, return to step S3) until the model recognition accuracy reaches the set threshold.
Preferably, in step S2) the input image is randomly cropped to 224 × 224 pixels and then randomly horizontally flipped; the image is then enhanced with an image enhancement strategy, and the expanded training set is finally obtained and denoted $\hat{D}$.
Preferably, the image enhancement strategy in step S2) includes one or more of the following operations in combination: normalizing the image; applying random color distortion followed by Gaussian blur; automatic augmentation; randomly selecting one enhancement operation each time and then randomly determining its magnitude before enhancing the image; and randomly erasing a rectangular region from the image without changing its original label.
Preferably, the ViT-based aerial image recognition model of step S3) consists of an encoder F(·), a classification head G(·), and a projection head P(·) used only during the training phase:
the encoder F(·) is a ViT pre-trained on a large data set, used to learn and encode the global features of an image; a training image $\hat{x}_i$ is input into the encoder F(·), and the first token of the encoder output is taken as the global feature representation $h_i$ of $\hat{x}_i$;
the classification head G(·) is an MLP layer with the structure "fully connected layer FC - activation function Tanh - fully connected layer FC"; the number of its output neurons equals the total number of aerial image classes in the current data set;
the projection head P(·) is used only during the training phase of the model; it maps the encoded global feature $h_i$ into the latent space where the contrastive loss is applied, with the structure "fully connected layer FC - activation function ReLU - fully connected layer FC".
Preferably, in step S4) performing discriminative label smoothing on the labels corresponding to the images means smoothing each label according to the discrete probability values output by the model and the current training phase, and then using the smoothed label to compute the cross-entropy loss:

$L_{CE} = -\sum_{k=1}^{K}\Big[(1-\gamma(s))\,q_i^k + \frac{\gamma(s)}{K}\Big]\log p_i^k$

where $L_{CE}$ is the cross-entropy loss value and $K$ the total number of categories in the aerial image data set; $q_i^k$ is the initial label probability distribution of the $i$-th sample, equal to 1 for the correct label class and 0 otherwise; $p_i^k$ is the discrete probability output by the model, i.e. the predicted probability of the $i$-th sample for class $k$; and $\gamma(s)$ is a smoothing variable.
Preferably, the smoothing variable $\gamma(s)$ comprises two smoothing variables $\gamma_{hard}(s)$ and $\gamma_{simple}(s)$, which control the respective smoothing weights of hard samples and simple samples at different training stages:

$\gamma_{hard}(s) = \gamma_{min} + (\gamma_{max}-\gamma_{min})\, f_N(s/I)$

$\gamma_{simple}(s) = (\gamma_{hard}(s) + \gamma_{bias})\cdot 0.5^{(1+s/I)}$

where $s \in \{1,\dots,I\}$ is the current training iteration and $I$ the total number of iterations; $\gamma_{max}$ and $\gamma_{min}$ are the maximum and minimum smoothing weights for hard samples; $\gamma_{bias}$ is the offset between the hard-sample and simple-sample smoothing weights; and $f_N(\cdot)$ is a smooth interpolation function:

$f_N(x) = x^{N+1}\sum_{n=0}^{N}\binom{N+n}{n}\binom{2N+1}{N-n}(-x)^n$

where $\binom{N+n}{n}$ denotes the number of combinations, i.e. the total number of ways to choose $n$ elements from $N+n$ elements, and $N$ controls the rate of smoothing.
Preferably, when dividing the $i$-th sample $\hat{x}_i$ into hard or simple samples according to the $K$ class probabilities $p_i$ output by the model: if the maximum probability is greater than 0.8 and the second-largest is less than 0.2, the sample is considered simple; otherwise it is classified as hard. The corresponding smoothing variable is then selected and the cross-entropy loss value computed.
Preferably, when the cross-entropy loss function and the discriminative contrastive loss function are used to train the model in step S4), the total loss value $L$ is calculated as:

$L = L_{CE} + \beta \cdot L_{DCL}$

where $L_{CE}$ is the cross-entropy loss function, $L_{DCL}$ the discriminative contrastive loss function, and $\beta$ a weight coefficient adjusting the importance of the discriminative contrastive loss function.
The expression of the discriminative contrastive loss function is as follows:

$L_{DCL} = -\frac{1}{B\cdot M}\sum_{i=1}^{B\cdot M}\frac{1}{|S_i|+|C_i|}\Big[\sum_{j\in S_i}\log P^{S}_{i,j} + \sum_{j\in C_i}\mathbb{1}(z_i\cdot z_j \geq \varepsilon)\log P^{C}_{i,j}\Big]$

$P^{S}_{i,j} = P^{C}_{i,j} = \frac{\exp(z_i\cdot z_j/\tau)}{\sum_{a\neq i}\exp(z_i\cdot z_a/\tau)}$

where $B\cdot M$ is the total number of training set samples and $\mathbb{1}(\cdot)$ is an indicator function equal to 1 if and only if its condition is true; among the samples of the same class as $\hat{x}_i$, $S_i$ denotes the set of samples enhanced from the same image and $C_i$ denotes the other cases; $P^{C}_{i,j}$ denotes the dot-product ratio between $\hat{x}_i$ and a same-class sample $\hat{x}_j$ obtained from a different image enhancement, and $P^{S}_{i,j}$ the dot-product ratio with a same-class sample enhanced from the same image; $\tau > 0$ is a temperature parameter and $\varepsilon$ is a similarity threshold with $1 \geq \varepsilon > 0$.
The invention also provides ViT-based aerial image recognition computer equipment, which comprises a memory, a processor and program instructions stored in the memory and executable by the processor, wherein the processor executes the program instructions to realize the steps of the method.
The invention further provides a computer-readable storage medium, which stores a computer program, wherein the computer program is used for realizing the ViT-based aerial image recognition method when being executed by a processor.
The invention has the following beneficial effects:
1. High recognition rate: aiming at the small amount of trainable data in aerial image recognition, which easily causes deep learning algorithms to overfit, the invention adopts discriminative label smoothing to encourage the model to learn sufficiently good feature information while preventing it from overfitting the distribution of noisy data.
2. Strong extensibility: the ViT-based aerial image recognition method is highly general in principle; with suitable training data selected according to actual needs, it can be applied to different types of aerial image recognition tasks.
3. Reasonable feature structure: the invention designs a discriminative label smoothing term and a discriminative supervised contrastive loss to learn a more compact and reasonable feature structure, thereby training a ViT-based aerial image recognition model with a stronger ability to capture salient features and making aerial image recognition more accurate.
Drawings
FIG. 1 is an overall flow chart of the ViT-based aerial image identification method according to the invention;
FIG. 2 is a diagram illustrating a random enhancement module according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a model for aerial image event recognition according to an embodiment of the present invention.
Detailed Description
The invention will be further described below with reference to the drawings and embodiments to explain its objects, technical solutions and advantages in detail. It should be understood that the specific embodiments described here are intended to be illustrative only and not limiting. In addition, the technical features in the embodiments of the invention described below may be combined with each other as long as they do not conflict.
In this embodiment, event recognition in aerial images is taken as the scenario to describe the ViT-based aerial image identification method of the invention in detail.
As shown in FIG. 1, the ViT-based aerial image identification method is applied to an event recognition task in aerial images; the detailed steps are as follows:
step S1: acquiring an event identification data set in the aerial image to obtain an aerial image x i And its corresponding event label y i In this embodiment, an event recognition data set in the ERA aerial image is selected, the data set includes 2864 sample images of 25 event categories, the training set and the test set which have been divided are directly used, the training set and the verification set are randomly divided from the original training set according to a ratio of 9:1, and the training set is recorded as the training set
Figure BDA0003648391670000061
B is the number of images in the training set.
Step S2: a random enhancement module is constructed to expand the data volume of the training set, and the training set images of step S1 are input into it for online data enhancement. In the random enhancement module, the input image is first randomly cropped to 224 × 224 pixels and then randomly horizontally flipped; then six image enhancement strategies common in current vision tasks are available: (1) BaseAugment (image normalization only); (2) SimAugment (random color distortion and Gaussian blur in order, possibly followed by a sparse image warping operation); (3) AutoAugment (automatic augmentation); (4) RandAugment (random augmentation); (5) TrivialAugment (randomly selecting one enhancement operation each time, then randomly determining its magnitude and enhancing the image); (6) RandomErasing (randomly erasing a rectangular region from the image without changing its original label). That is, given a training image, $M$ ($0 \leq M \leq 6$) strategies randomly selected from the six are used to enhance it; the expanded training set is finally obtained and denoted $\hat{D}=\{(\hat{x}_i,\hat{y}_i)\}_{i=1}^{B\cdot M}$. In this embodiment $M = 4$, as shown in FIG. 2.
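The strategy selection and the RandomErasing operation of the random enhancement module can be sketched as follows (a minimal illustration, not the patent's code: the six strategies are represented only by name, and only a bare-bones RandomErasing is actually implemented):

```python
import random

# The six candidate strategies named in the embodiment; real implementations
# (AutoAugment, RandAugment, ...) would be callables, represented here by name.
STRATEGIES = ["BaseAugment", "SimAugment", "AutoAugment",
              "RandAugment", "TrivialAugment", "RandomErasing"]

def pick_strategies(m, rng=random):
    """Randomly choose M distinct strategies (0 <= M <= 6) for one image."""
    if not 0 <= m <= len(STRATEGIES):
        raise ValueError("M must be between 0 and 6")
    return rng.sample(STRATEGIES, m)

def random_erase(img, top, left, h, w, fill=0):
    """Erase an h x w rectangle from a 2D grayscale image (list of lists)
    without touching the label: a minimal RandomErasing sketch."""
    out = [row[:] for row in img]
    for r in range(top, top + h):
        for c in range(left, left + w):
            out[r][c] = fill
    return out
```

With M = 4 as in this embodiment, `pick_strategies(4)` returns four distinct strategy names per image; the original image is left untouched because `random_erase` copies it first.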
Step S3: a ViT-based aerial image recognition model is constructed; its structure is shown in FIG. 3. The model consists of an encoder F(·), a classification head G(·), and a projection head P(·) used only during the training phase.
The encoder F(·) is a ViT pre-trained on the ImageNet data set, used to learn and encode global image features. Specifically, F(·) comprises two parts, a linear layer and a Transformer encoder: the linear layer embeds the image into a representation; the Transformer encoder consists of multi-head self-attention layers and multi-layer perceptron blocks and learns the global features of the image. LayerNorm is applied before each block and a residual connection after each block. A training image $\hat{x}_i$ is input into F(·), and the first token of the last Transformer encoder layer is taken as the global feature representation $h_i$ of $\hat{x}_i$. Then $h_i$ is input into the classification head and the projection head to compute the total loss value.
The classification head G(·) is an MLP layer with the structure "fully connected layer FC - activation function Tanh - fully connected layer FC"; the number of its output neurons equals the total number of aerial image classes in the current data set, 25 in this embodiment.
The projection head P(·) is used only during the training phase of the model; it maps the encoded representation $h_i$ into the latent space where the contrastive loss is applied, with the structure "fully connected layer FC - activation function ReLU - fully connected layer FC" and 128 output neurons.
Step S4: the expanded training set $\hat{D}$ of step S2 is input into the recognition model constructed in step S3; discriminative label smoothing is applied to the labels corresponding to the images, the model is trained with the cross-entropy loss function and the discriminative contrastive loss function and updated by a backpropagation algorithm, and the model with the best recognition accuracy on the validation set of step S1 is selected as the final trained recognition model.
Performing discriminative label smoothing on the labels corresponding to the images means smoothing each label according to the discrete probability values output by the model and the current training phase, and then using the smoothed label to compute the cross-entropy loss:

$L_{CE} = -\sum_{k=1}^{K}\Big[(1-\gamma(s))\,q_i^k + \frac{\gamma(s)}{K}\Big]\log p_i^k$

where $K$ is the total number of categories in the aerial image data set; $q_i^k$ is the initial label probability distribution of the $i$-th sample, equal to 1 for the correct label class and 0 otherwise; $p_i^k$ is the discrete probability output by the model, i.e. the predicted probability of the $i$-th sample for class $k$.
Obtaining annotated aerial images is typically more costly than building a natural image data set, so aerial image data sets are generally small, which easily leads to the model overfitting the training data. Traditional label smoothing can relieve overfitting to a certain extent, but risks underfitting when the data set is small. A smoothing variable $\gamma(s)$ is therefore proposed to control the smoothing weight, assigning different values as the training phase changes. Specifically, $\gamma(s)$ comprises two smoothing variables $\gamma_{hard}(s)$ and $\gamma_{simple}(s)$, which control the respective smoothing weights of hard and simple samples at different training stages:

$\gamma_{hard}(s) = \gamma_{min} + (\gamma_{max}-\gamma_{min})\, f_N(s/I)$

$\gamma_{simple}(s) = (\gamma_{hard}(s) + \gamma_{bias})\cdot 0.5^{(1+s/I)}$

where $s \in \{1,\dots,I\}$ is the current training iteration and $I$ the total number of iterations; $\gamma_{max}$ and $\gamma_{min}$ are the maximum and minimum smoothing weights for hard samples; $\gamma_{bias}$ is the offset between the hard-sample and simple-sample smoothing weights; and $f_N(\cdot)$ is a smooth interpolation function:

$f_N(x) = x^{N+1}\sum_{n=0}^{N}\binom{N+n}{n}\binom{2N+1}{N-n}(-x)^n$

where $\binom{N+n}{n}$ denotes the number of combinations, i.e. the total number of ways to choose $n$ elements from $N+n$ elements irrespective of order; $N$ controls the rate of smoothing and is set to 1 in this embodiment.
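Reading the combination-based interpolation as the generalized smoothstep (a plausible assumption; the original formula survives only as an equation image), it can be implemented directly with `math.comb`:

```python
from math import comb

def smooth_interp(x, N=1):
    """Generalized smoothstep S_N(x) = x^(N+1) * sum_n C(N+n, n) * C(2N+1, N-n) * (-x)^n.
    Assumed form of the patent's smooth interpolation function; N controls how
    flat the curve is near 0 and 1 (N = 1 in this embodiment, giving 3x^2 - 2x^3)."""
    return x ** (N + 1) * sum(
        comb(N + n, n) * comb(2 * N + 1, N - n) * (-x) ** n
        for n in range(N + 1)
    )
```

Under this reading the function rises smoothly from 0 at the start of training ($s/I \to 0$) to 1 at the end ($s/I \to 1$), which is what lets $\gamma_{hard}(s)$ interpolate between $\gamma_{min}$ and $\gamma_{max}$.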
When dividing the $i$-th sample $\hat{x}_i$ into hard or simple samples according to the $K$ class probabilities $p_i$ output by the model: if the maximum probability is greater than 0.8 and the second-largest is less than 0.2, the sample is considered simple; otherwise it is classified as hard. The corresponding smoothing variable is then selected and the cross-entropy loss value computed.
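The hard/simple split and the smoothed cross-entropy can be sketched as follows. The 0.8/0.2 thresholds come from the embodiment; the smoothing form $(1-\gamma)q_k + \gamma/K$ is the standard label-smoothing formula and is an assumption here, since the patent's equation survives only as an image:

```python
from math import log

def is_simple(probs):
    """Simple sample: max probability > 0.8 and second-largest < 0.2."""
    top, second = sorted(probs, reverse=True)[:2]
    return top > 0.8 and second < 0.2

def smoothed_ce(probs, label, gamma):
    """Cross entropy against a label smoothed by gamma (assumed standard form:
    q_k -> (1 - gamma) * q_k + gamma / K)."""
    K = len(probs)
    return -sum(
        ((1 - gamma) * (1.0 if k == label else 0.0) + gamma / K) * log(probs[k])
        for k in range(K)
    )

def discriminative_ce(probs, label, gamma_hard, gamma_simple):
    """Pick the smoothing weight by sample difficulty, then compute the loss."""
    gamma = gamma_simple if is_simple(probs) else gamma_hard
    return smoothed_ce(probs, label, gamma)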
The model is trained with the cross-entropy loss function and the discriminative contrastive loss function simultaneously, and the total loss value $L$ is calculated as:

$L = L_{CE} + \beta \cdot L_{DCL}$

where $L_{CE}$ is the cross-entropy loss function described above, $L_{DCL}$ the discriminative contrastive loss function, and $\beta$ a weight coefficient adjusting the importance of the discriminative contrastive loss function.
The expression of the discriminative contrastive loss function is as follows:

$L_{DCL} = -\frac{1}{B\cdot M}\sum_{i=1}^{B\cdot M}\frac{1}{|S_i|+|C_i|}\Big[\sum_{j\in S_i}\log P^{S}_{i,j} + \sum_{j\in C_i}\mathbb{1}(z_i\cdot z_j \geq \varepsilon)\log P^{C}_{i,j}\Big]$

$P^{S}_{i,j} = P^{C}_{i,j} = \frac{\exp(z_i\cdot z_j/\tau)}{\sum_{a\neq i}\exp(z_i\cdot z_a/\tau)}$

Because aerial images exhibit larger intra-class variation and inter-class similarity than natural images, even same-class samples differ to some degree, and these differences are further amplified by random enhancement; the discriminative contrastive loss is therefore proposed to further distinguish whether same-class images were enhanced from the same source image. Specifically, $B\cdot M$ is the total number of training set samples; $\mathbb{1}(\cdot)$ is an indicator function equal to 1 if and only if its condition is true; among the samples of the same class as $\hat{x}_i$, $S_i$ denotes the set of samples enhanced from the same image, and $C_i$ denotes the other cases. $P^{C}_{i,j}$ denotes the dot-product ratio between $\hat{x}_i$ and a same-class sample $\hat{x}_j$ obtained from a different image enhancement, and $P^{S}_{i,j}$ the dot-product ratio with a same-class sample enhanced from the same image. $\tau > 0$ is a temperature parameter and $\varepsilon$ ($1 \geq \varepsilon > 0$) is a similarity threshold.
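A supervised-contrastive-style sketch of the described structure follows: same-image positives ($S_i$) always contribute, while same-class positives from different images ($C_i$) are gated by the cosine-similarity threshold $\varepsilon$. The exact weighting in the patent is not recoverable from the equation images, so the gating rule and normalization here are assumptions:

```python
import numpy as np

def discriminative_contrastive_loss(z, labels, img_ids, tau=0.1, eps=0.3):
    """Sketch of a discriminative contrastive loss (assumed structure).
    z: (n, d) projected features; labels: class ids; img_ids: source-image ids."""
    z = z / np.linalg.norm(z, axis=1, keepdims=True)  # cosine similarities
    sim = z @ z.T
    n = len(z)
    loss, count = 0.0, 0
    for i in range(n):
        mask = np.arange(n) != i
        # softmax-normalized dot-product ratios P_ij over all a != i
        p = np.exp(sim[i][mask] / tau)
        p /= p.sum()
        for idx, j in enumerate(np.arange(n)[mask]):
            if labels[j] != labels[i]:
                continue  # only same-class samples are positives
            same_image = img_ids[j] == img_ids[i]
            # S_i always counts; C_i counts only above the similarity threshold
            if same_image or sim[i, j] >= eps:
                loss -= np.log(p[idx])
                count += 1
    return loss / max(count, 1)

# Illustrative batch: 8 samples, 2 classes, two enhanced views per source image.
rng = np.random.default_rng(1)
z_ex = rng.normal(size=(8, 16))
labels_ex = [0, 0, 0, 0, 1, 1, 1, 1]
img_ids_ex = [0, 0, 1, 1, 2, 2, 3, 3]
val = discriminative_contrastive_loss(z_ex, labels_ex, img_ids_ex)
```

Every anchor keeps at least its same-image partner as a positive, so the loss is always defined even when the $\varepsilon$ gate rejects all cross-image pairs.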
Step S5: the test set images of step S1 are input into the trained recognition model, and the predicted classes output by the model are compared with the true classes to obtain the final recognition accuracy. When the model recognition accuracy reaches a set threshold, images to be recognized are input into the aerial image recognition model for recognition; otherwise, return to step S3 until the model recognition accuracy reaches the set threshold.
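Step S5 reduces to a simple accuracy check against a deployment threshold; a sketch (the 0.9 threshold is illustrative, as the patent leaves the value unspecified):

```python
def recognition_accuracy(predictions, ground_truth):
    """Compare the model's predicted classes with the true classes on the test set."""
    correct = sum(p == t for p, t in zip(predictions, ground_truth))
    return correct / len(ground_truth)

def model_ready(accuracy, threshold=0.9):
    """Deploy when accuracy reaches the set threshold; otherwise return to step S3
    and retrain (threshold value is an assumption)."""
    return accuracy >= threshold
```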
Based on the method, the invention also provides ViT-based aerial image recognition computer equipment which comprises a memory, a processor and program instructions stored in the memory and executable by the processor, wherein the processor executes the program instructions to realize the steps in the ViT-based aerial image recognition method.
The invention further provides a computer-readable storage medium storing a computer program, wherein the computer program is used for implementing the ViT-based aerial image recognition method when being executed by a processor.
Details not described in this specification are well known to those skilled in the art.
It will be appreciated by those skilled in the art that modifications and variations may be made in light of the above teachings, or that the methods provided by the invention may be applied to similar aerial image recognition tasks; all such modifications and variations are intended to fall within the scope of the appended claims.

Claims (10)

1. A ViT-based aerial image identification method, characterized by comprising the following steps:
S1) acquiring an aerial image data set to obtain the required original aerial images $x_i$ and their corresponding category labels $y_i$, and dividing them proportionally into a training set, a validation set and a test set, used subsequently for training, validating and evaluating the model respectively; the training set is denoted $D=\{(x_i,y_i)\}_{i=1}^{B}$, where $B$ is the number of training images;
S2) performing online data enhancement on the training set images so that each image in the training set generates $M$ different enhanced images; the expanded training set contains $B\cdot M$ images and is denoted $\hat{D}=\{(\hat{x}_i,\hat{y}_i)\}_{i=1}^{B\cdot M}$;
S3) constructing an aerial image recognition model based on ViT;
S4) inputting the expanded training set $\hat{D}$ into the ViT-based aerial image recognition model, performing discriminative label smoothing on the labels corresponding to the images, training the model with a cross-entropy loss function and a discriminative contrastive loss function, updating the recognition model through backpropagation, and selecting the optimal aerial image recognition model with the validation set of step S1);
S5) testing the recognition performance of the aerial image recognition model with the test set of step S1) to obtain the final model recognition accuracy; when the model recognition accuracy reaches a set threshold, images to be recognized are input into the aerial image recognition model for recognition; otherwise, return to step S3) until the model recognition accuracy reaches the set threshold.
2. The ViT-based aerial image identification method as claimed in claim 1, characterized in that: in step S2) the input image is randomly cropped to 224 × 224 pixels and then randomly horizontally flipped; the image is then enhanced with the image enhancement strategy, and the expanded training set is finally obtained and denoted $\hat{D}$.
3. The ViT-based aerial image identification method as claimed in claim 2, characterized in that: the image enhancement strategy in step S2) includes one or more of the following operations: normalizing the image; applying random color distortion followed by Gaussian blur; automatic augmentation; randomly selecting one enhancement operation each time and then randomly determining its magnitude before enhancing the image; and randomly erasing a rectangular region from the image without changing its original label.
4. The aerial image recognition method based on ViT of claim 1, wherein: the ViT-based aerial image recognition model in step S3) is composed of an encoder F(·), a classification head G(·), and a projection head P(·) used only in the training phase:
the encoder F(·) is a pre-trained ViT, used for learning and encoding the global features of the image; the training samples
Figure FDA0003648391660000021
are fed into the feature encoder F(·), and the first token output by the encoder is taken as the global feature representation h_i of
Figure FDA0003648391660000022
;
the classification head G(·) is an MLP layer with the structure fully connected layer FC-activation function Tanh-fully connected layer FC, where the number of output neurons of the MLP layer equals the total number of aerial image categories in the current dataset;
the projection head P(·) is used only in the training phase of the model; its role is to map the encoded global feature h_i into the latent space where the contrastive loss is applied, and its structure is fully connected layer FC-activation function ReLU-fully connected layer FC.
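The two MLP heads of claim 4 can be sketched as plain NumPy functions on top of a stubbed encoder output. The hidden size (768), projection size (128), and class count (45) are assumptions for illustration, not values from the patent.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM, PROJ, K = 768, 128, 45   # hidden size, projection size, class count: assumed

# Classification head G: FC -> Tanh -> FC, output size = number of classes K.
W1 = rng.standard_normal((DIM, DIM)) * 0.02
W2 = rng.standard_normal((DIM, K)) * 0.02
def classify(h):
    return np.tanh(h @ W1) @ W2

# Projection head P (training only): FC -> ReLU -> FC, maps h_i into the
# latent space where the contrastive loss is applied.
V1 = rng.standard_normal((DIM, DIM)) * 0.02
V2 = rng.standard_normal((DIM, PROJ)) * 0.02
def project(h):
    return np.maximum(h @ V1, 0.0) @ V2

h = rng.standard_normal(DIM)   # stand-in for the encoder's first (class) token h_i
logits = classify(h)           # used for prediction at train and test time
z = project(h)                 # used only for the contrastive loss during training
```

At test time only F(·) and G(·) are kept; P(·) is discarded, which is why it appears solely in the training-phase description.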
5. The aerial image recognition method based on ViT of claim 1, wherein: in step S4), performing discriminative label smoothing on the label corresponding to the image means smoothing the label according to the discrete probability distribution output by the model and the current training phase, and then using the smoothed label to calculate the cross entropy loss function value; the standard form of this expression is:
L_CE = -Σ_{k=1}^{K} ((1 - γ·(s)) · q_k^(i) + γ·(s)/K) · log p_k^(i)
where L_CE is the cross entropy loss function value and K is the total number of categories in the aerial image dataset; q_k^(i) is the initial label probability distribution of the i-th sample, i.e., 1 for the correct label class and 0 otherwise; p_k^(i) is the discrete probability distribution output by the model, referring to the model's predicted probability for the i-th sample on the k-th class; and γ·(s) is a smoothing variable.
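Claim 5's smoothed cross-entropy can be computed directly. The sketch below uses the standard label-smoothing form with a scalar γ, an assumption since the patent's exact expression survives only as a formula image; the symbols (p, K, γ) follow the definitions above.

```python
import numpy as np

def smoothed_cross_entropy(p, k_true, gamma):
    """Cross-entropy against a label smoothed by gamma.

    p      : model's discrete probability distribution over the K classes
    k_true : index of the correct class
    gamma  : smoothing variable (chosen per sample difficulty and phase s)
    """
    K = p.shape[0]
    q = np.full(K, gamma / K)      # spread gamma uniformly over all classes
    q[k_true] += 1.0 - gamma       # remaining probability mass on the true class
    return -np.sum(q * np.log(p))

p = np.array([0.7, 0.2, 0.1])
loss_sharp = smoothed_cross_entropy(p, 0, 0.0)   # gamma = 0: plain cross-entropy
loss_smooth = smoothed_cross_entropy(p, 0, 0.1)  # smoothed label raises the loss
```

With γ = 0 this reduces to the ordinary one-hot cross-entropy, which is the consistency check on the reconstruction.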
6. The aerial image recognition method based on ViT of claim 5, wherein: the smoothing variable γ·(s) comprises two smoothing variables, γ_hard(s) and γ_simple(s), used to control the respective smoothing weights of difficult samples and simple samples in different training phases; their expressions are as follows:
Figure FDA0003648391660000026
γ_simple(s) = (γ_hard(s) + γ_bias) · 0.5^(1+s/I)
where s is the iteration number of the current training and I is the total number of iterations; γ_max is the maximum value of the smoothing weight corresponding to difficult samples, and γ_min is the minimum value; γ_bias is the offset between the difficult-sample and simple-sample smoothing weights; and
Figure FDA0003648391660000031
refers to a smooth interpolation function, whose expression is as follows:
f(x) = x^(N+1) · Σ_{n=0}^{N} C(N+n, n) · (1-x)^n
where C(N+n, n) is the number of combinations, i.e., the total number of ways to take n elements from N+n elements, and N is used to control the rate of smoothing.
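The claim describes the smooth interpolation via binomial coefficients C(N+n, n), which matches the generalized smoothstep function; the sketch below is a hedged reconstruction of that interpolation together with the γ_simple schedule quoted above (the exact γ_hard formula is only a formula image in the source and is not reconstructed here).

```python
from math import comb

def smoothstep(x, N=1):
    """Generalized smoothstep: x**(N+1) * sum_{n=0}^{N} C(N+n, n) * (1-x)**n.

    C(N+n, n) counts the ways to choose n elements from N+n elements, and N
    controls how gradual the smoothing is (N=1 gives 3x^2 - 2x^3).
    """
    return x ** (N + 1) * sum(comb(N + n, n) * (1 - x) ** n for n in range(N + 1))

def gamma_simple(gamma_hard, gamma_bias, s, I):
    """Simple-sample smoothing weight: (gamma_hard + gamma_bias) * 0.5**(1+s/I)."""
    return (gamma_hard + gamma_bias) * 0.5 ** (1 + s / I)
```

The 0.5^(1+s/I) factor decays the simple-sample smoothing weight as training progresses, so easy samples receive sharper labels late in training.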
7. The aerial image recognition method based on ViT of claim 6, wherein: when dividing the i-th sample
Figure FDA0003648391660000033
into difficult or simple samples according to the probabilities over the K classes output by the model
Figure FDA0003648391660000034
: when the maximum value is greater than 0.8 and the second largest value is less than 0.2, the sample is considered a simple sample; otherwise, it is divided into the difficult samples; the corresponding smoothing variable is then selected accordingly, and the cross entropy loss function value is calculated.
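Claim 7's division rule is a simple test on the model's output distribution; the thresholds 0.8 and 0.2 come from the claim, and the function name is illustrative.

```python
import numpy as np

def is_simple_sample(p, hi=0.8, lo=0.2):
    """Claim 7's rule: a sample is 'simple' when the largest predicted
    probability exceeds 0.8 and the second largest is below 0.2;
    otherwise it is treated as a difficult sample."""
    top2 = np.sort(p)[-2:]          # [second largest, largest]
    return bool(top2[1] > hi and top2[0] < lo)

simple = is_simple_sample(np.array([0.85, 0.10, 0.05]))  # confident prediction
hard = is_simple_sample(np.array([0.50, 0.40, 0.10]))    # ambiguous prediction
```

The flag then selects γ_simple(s) or γ_hard(s) as the smoothing variable when computing the cross-entropy loss for that sample.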
8. The aerial image recognition method based on ViT of claim 6, wherein: in step S4), when the cross entropy loss function and the discriminative contrast loss function are used simultaneously to train the model, the total loss value L is calculated according to the following formula:
L = L_CE + β · L_DCL
where L_CE is the cross entropy loss function, L_DCL is the discriminative contrast loss function, and β is a weight coefficient for adjusting the importance of the discriminative contrast loss function.
The expression of the discriminative contrast loss function is as follows:
Figure FDA0003648391660000035
Figure FDA0003648391660000036
where M is the total number of training set samples;
Figure FDA0003648391660000037
is an indicator function, equal to 1 if and only if the input condition is true; among samples
Figure FDA0003648391660000038
belonging to the same class, S_i represents the set of samples enhanced from the same image, and C_i represents the remaining cases;
Figure FDA0003648391660000039
represents the ratio of the dot product between the sample
Figure FDA00036483916600000310
and same-class samples obtained from different image enhancements
Figure FDA0003648391660000041
;
Figure FDA0003648391660000042
represents the ratio of the dot product between the sample
Figure FDA0003648391660000043
and same-class samples enhanced from the same image
Figure FDA0003648391660000044
is the temperature parameter, ε is the similarity threshold, and 1 ≥ ε > 0.
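The exact discriminative contrast loss survives only as formula images, so the sketch below substitutes a generic supervised-contrastive-style term over projected features: it treats all same-class pairs alike, whereas the patent's L_DCL additionally distinguishes same-image from same-class positives and applies the similarity threshold ε. The total-loss combination L = L_CE + β · L_DCL follows the claim.

```python
import numpy as np

def supcon_like_loss(z, labels, tau=0.1):
    """A supervised-contrastive-style stand-in for L_DCL over projected
    features z of shape (B, D); tau is the temperature parameter."""
    z = z / np.linalg.norm(z, axis=1, keepdims=True)  # cosine similarities
    sim = z @ z.T / tau
    B = z.shape[0]
    total = 0.0
    for i in range(B):
        others = [j for j in range(B) if j != i]
        pos = [j for j in others if labels[j] == labels[i]]
        if not pos:
            continue
        denom = np.sum(np.exp(sim[i, others]))
        total += -np.mean([np.log(np.exp(sim[i, j]) / denom) for j in pos])
    return total / B

rng = np.random.default_rng(0)
z = rng.standard_normal((8, 16))                 # projected features from P(.)
labels = np.array([0, 0, 1, 1, 2, 2, 3, 3])      # two views per class
L_DCL = supcon_like_loss(z, labels)
L_CE = 0.42                                      # placeholder cross-entropy value
beta = 0.5                                       # weight of the contrastive term
L_total = L_CE + beta * L_DCL                    # claim 8's combination
```

β trades off classification sharpness against the clustering pressure of the contrastive term in the projection space.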
9. A ViT-based aerial image recognition computer device, comprising a memory, a processor, and program instructions stored in the memory for execution by the processor, wherein the processor, when executing the program instructions, implements the steps of the method of any one of claims 1-8.
10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the method of any one of claims 1 to 8.
CN202210541111.8A 2022-05-17 2022-05-17 ViT-based aerial image identification method Pending CN114842343A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210541111.8A CN114842343A (en) 2022-05-17 2022-05-17 ViT-based aerial image identification method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210541111.8A CN114842343A (en) 2022-05-17 2022-05-17 ViT-based aerial image identification method

Publications (1)

Publication Number Publication Date
CN114842343A true CN114842343A (en) 2022-08-02

Family

ID=82569586

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210541111.8A Pending CN114842343A (en) 2022-05-17 2022-05-17 ViT-based aerial image identification method

Country Status (1)

Country Link
CN (1) CN114842343A (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115394381A (en) * 2022-08-24 2022-11-25 哈尔滨理工大学 High-entropy alloy hardness prediction method and device based on machine learning and two-step data expansion
CN115394381B (en) * 2022-08-24 2023-08-22 哈尔滨理工大学 High-entropy alloy hardness prediction method and device based on machine learning and two-step data expansion
CN115396242A (en) * 2022-10-31 2022-11-25 江西神舟信息安全评估中心有限公司 Data identification method and network security vulnerability detection method
CN116758360A (en) * 2023-08-21 2023-09-15 江西省国土空间调查规划研究院 Land space use management method and system thereof
CN116758360B (en) * 2023-08-21 2023-10-20 江西省国土空间调查规划研究院 Land space use management method and system thereof
CN117173122A (en) * 2023-09-01 2023-12-05 中国农业科学院农业信息研究所 Lightweight ViT-based image leaf density determination method and device
CN117173122B (en) * 2023-09-01 2024-02-13 中国农业科学院农业信息研究所 Lightweight ViT-based image leaf density determination method and device

Similar Documents

Publication Publication Date Title
US20200285896A1 (en) Method for person re-identification based on deep model with multi-loss fusion training strategy
CN109949317B (en) Semi-supervised image example segmentation method based on gradual confrontation learning
CN111754596B (en) Editing model generation method, device, equipment and medium for editing face image
CN109583501B (en) Method, device, equipment and medium for generating image classification and classification recognition model
CN114842343A (en) ViT-based aerial image identification method
CN108647583B (en) Face recognition algorithm training method based on multi-target learning
WO2020114378A1 (en) Video watermark identification method and apparatus, device, and storage medium
EP3690741A2 (en) Method for automatically evaluating labeling reliability of training images for use in deep learning network to analyze images, and reliability-evaluating device using the same
JP2020123330A (en) Method for acquiring sample image for label acceptance inspection from among auto-labeled images utilized for neural network learning, and sample image acquisition device utilizing the same
CN111582397A (en) CNN-RNN image emotion analysis method based on attention mechanism
CN113592007B (en) Knowledge distillation-based bad picture identification system and method, computer and storage medium
CN112560710B (en) Method for constructing finger vein recognition system and finger vein recognition system
JP7139749B2 (en) Image recognition learning device, image recognition device, method, and program
CN114332578A (en) Image anomaly detection model training method, image anomaly detection method and device
CN111539456B (en) Target identification method and device
CN112749737A (en) Image classification method and device, electronic equipment and storage medium
CN111444816A (en) Multi-scale dense pedestrian detection method based on fast RCNN
CN114419379A (en) System and method for improving fairness of deep learning model based on antagonistic disturbance
CN114332075A (en) Rapid structural defect identification and classification method based on lightweight deep learning model
CN109101984B (en) Image identification method and device based on convolutional neural network
CN111126155B (en) Pedestrian re-identification method for generating countermeasure network based on semantic constraint
JPWO2019215904A1 (en) Predictive model creation device, predictive model creation method, and predictive model creation program
TWI803243B (en) Method for expanding images, computer device and storage medium
CN112507912A (en) Method and device for identifying illegal picture
CN116912920B (en) Expression recognition method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination