CN114842343A - ViT-based aerial image identification method - Google Patents

ViT-based aerial image identification method

Info

Publication number
CN114842343A
CN114842343A (application CN202210541111.8A)
Authority
CN
China
Prior art keywords
image
model
aerial image
vit
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210541111.8A
Other languages
Chinese (zh)
Inventor
熊盛武
赵怡晨
陈亚雄
路雄博
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan University of Technology WUT
Original Assignee
Wuhan University of Technology WUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan University of Technology WUT filed Critical Wuhan University of Technology WUT
Priority to CN202210541111.8A priority Critical patent/CN114842343A/en
Publication of CN114842343A publication Critical patent/CN114842343A/en
Pending legal-status Critical Current


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/10 Terrestrial scenes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/42 Global feature extraction by analysis of the whole pattern, e.g. using frequency domain transformations or autocorrelation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Abstract

The invention discloses a ViT-based aerial image identification method comprising the following steps: S1, acquiring an aerial image data set and constructing a training set, a validation set and a test set; S2, expanding the data volume of the training set; S3, constructing a ViT-based aerial image recognition model; S4, inputting the expanded training set into the recognition model, performing discriminative label smoothing on the labels corresponding to the images, training the model with a cross-entropy loss function and a discriminative contrastive loss function, updating the recognition model through backpropagation, and selecting the optimal aerial image recognition model; S5, testing the recognition performance of the model on the test set. By applying discriminative label smoothing to the image labels while supervising training with both the cross-entropy and the discriminative contrastive loss, the invention obtains a ViT-based aerial image recognition model with stronger feature-learning capability, offering advantages such as a high recognition rate and strong extensibility.

Description

ViT-based aerial image identification method
Technical Field
The invention relates to the technical field of machine learning algorithms and image processing, and in particular to a ViT-based aerial image identification method.
Background
Aerial image recognition refers to identifying the category to which a given aerial image belongs. With the increasing maturity of aviation technology and the continual improvement of aerial image resolution, aerial images play an increasingly important role in people's daily lives. Tasks such as natural disaster detection, urban planning, resource exploration and thematic mapping cannot be carried out without aerial image recognition, so accurate recognition of aerial images has important value.
Although the overall volume of aerial imagery is large, the data sets usable for model training are few in number and low in quality, labeled data sets are rare, and noisy samples and hard samples are common. In addition, most aerial images are taken from a top-down view and are characterized by a wide imaging range, large scale variation, and sparse targets within a scene. Compared with natural images, aerial image recognition is therefore difficult because of the small amount of data and the complex backgrounds.
At present, most solutions establish targeted lightweight deep learning algorithms that do not extend to more diverse aerial images, and so have limitations. In addition, most methods supervise the model only with a cross-entropy loss that learns label information, without considering the internal information of the aerial images.
Disclosure of Invention
Aiming at the defects in the background art, the invention provides a ViT-based aerial image recognition method that exploits the advantages of ViT (Vision Transformer) in capturing long-range dependencies and its dynamic, adaptive modeling capability, uses ViT as the feature encoder of an image to capture salient semantic features, and improves upon ViT so that the limited aerial image data can be fully used for training while avoiding overfitting the noise in the images.
In order to achieve this purpose, the invention designs a ViT-based aerial image recognition method, characterized by comprising the following steps:
S1) acquiring an aerial image data set to obtain the required original aerial images $x_i$ and their corresponding category labels $y_i$, and dividing them proportionally into a training set, a validation set and a test set, used subsequently for training, validating and evaluating the model respectively; the training set is denoted $D=\{(x_i,y_i)\}_{i=1}^{B}$, where $B$ is the number of training images;
S2) performing online data enhancement on the training set images so that each image in the training set generates $M$ different enhanced images; the expanded training set contains $B\cdot M$ images and is denoted $\hat{D}=\{(\hat{x}_i,\hat{y}_i)\}_{i=1}^{B\cdot M}$;
S3) constructing an aerial image recognition model based on ViT;
S4) inputting the expanded training set $\hat{D}$ into the ViT-based aerial image recognition model, performing discriminative label smoothing on the labels corresponding to the images, training the model with a cross-entropy loss function and a discriminative contrastive loss function, updating the recognition model through backpropagation, and selecting the optimal aerial image recognition model with the validation set of step S1);
S5) testing the recognition performance of the aerial image recognition model with the test set of step S1) to obtain the final model recognition accuracy; when the model recognition accuracy reaches a set threshold, images to be recognized are input into the aerial image recognition model for recognition; otherwise, return to step S3) until the model recognition accuracy reaches the set threshold.
Preferably, in step S2) the input image is randomly cropped to 224 × 224 pixels and then randomly horizontally flipped; the image is then enhanced with an image enhancement strategy, and the expanded training set is finally obtained and denoted $\hat{D}$.
Preferably, the image enhancement strategy in step S2) includes one or more of the following operations in combination: normalizing the image; applying random color distortion followed by Gaussian blur; automatic augmentation; randomly selecting one enhancement operation each time and then randomly determining its magnitude before enhancing the image; and randomly erasing a rectangular region from the image without changing its original label.
Preferably, the ViT-based aerial image recognition model of step S3) consists of an encoder F(·), a classification head G(·), and a projection head P(·) used only during the training phase:
the encoder F(·) is a ViT pre-trained on a large data set, used to learn and encode the global features of an image; a training image $\hat{x}_i$ is input into the encoder F(·), and the first token of the encoder output is taken as the global feature representation $h_i$ of $\hat{x}_i$;
the classification head G(·) is an MLP layer with the structure "fully connected layer FC - activation function Tanh - fully connected layer FC"; the number of its output neurons equals the total number of aerial image classes in the current data set;
the projection head P(·) is used only during the training phase of the model; it maps the encoded global feature $h_i$ into the latent space where the contrastive loss is applied, with the structure "fully connected layer FC - activation function ReLU - fully connected layer FC".
Preferably, in step S4) performing discriminative label smoothing on the labels corresponding to the images means smoothing each label according to the discrete probability values output by the model and the current training phase, and then using the smoothed label to compute the cross-entropy loss:

$L_{CE} = -\sum_{k=1}^{K}\Big[(1-\gamma(s))\,q_i^k + \frac{\gamma(s)}{K}\Big]\log p_i^k$

where $L_{CE}$ is the cross-entropy loss value and $K$ the total number of categories in the aerial image data set; $q_i^k$ is the initial label probability distribution of the $i$-th sample, equal to 1 for the correct label class and 0 otherwise; $p_i^k$ is the discrete probability output by the model, i.e. the predicted probability of the $i$-th sample for class $k$; and $\gamma(s)$ is a smoothing variable.
Preferably, the smoothing variable $\gamma(s)$ comprises two smoothing variables $\gamma_{hard}(s)$ and $\gamma_{simple}(s)$, which control the respective smoothing weights of hard samples and simple samples at different training stages:

$\gamma_{hard}(s) = \gamma_{min} + (\gamma_{max}-\gamma_{min})\, f_N(s/I)$

$\gamma_{simple}(s) = (\gamma_{hard}(s) + \gamma_{bias})\cdot 0.5^{(1+s/I)}$

where $s \in \{1,\dots,I\}$ is the current training iteration and $I$ the total number of iterations; $\gamma_{max}$ and $\gamma_{min}$ are the maximum and minimum smoothing weights for hard samples; $\gamma_{bias}$ is the offset between the hard-sample and simple-sample smoothing weights; and $f_N(\cdot)$ is a smooth interpolation function:

$f_N(x) = x^{N+1}\sum_{n=0}^{N}\binom{N+n}{n}\binom{2N+1}{N-n}(-x)^n$

where $\binom{N+n}{n}$ denotes the number of combinations, i.e. the total number of ways to choose $n$ elements from $N+n$ elements, and $N$ controls the rate of smoothing.
Preferably, when dividing the $i$-th sample $\hat{x}_i$ into hard or simple samples according to the $K$ class probabilities $p_i$ output by the model: if the maximum probability is greater than 0.8 and the second-largest is less than 0.2, the sample is considered simple; otherwise it is classified as hard. The corresponding smoothing variable is then selected and the cross-entropy loss value computed.
Preferably, when the cross-entropy loss function and the discriminative contrastive loss function are used to train the model in step S4), the total loss value $L$ is calculated as:

$L = L_{CE} + \beta \cdot L_{DCL}$

where $L_{CE}$ is the cross-entropy loss function, $L_{DCL}$ the discriminative contrastive loss function, and $\beta$ a weight coefficient adjusting the importance of the discriminative contrastive loss function.
The expression of the discriminative contrastive loss function is as follows:

$L_{DCL} = -\frac{1}{B\cdot M}\sum_{i=1}^{B\cdot M}\frac{1}{|S_i|+|C_i|}\Big[\sum_{j\in S_i}\log P^{S}_{i,j} + \sum_{j\in C_i}\mathbb{1}(z_i\cdot z_j \geq \varepsilon)\log P^{C}_{i,j}\Big]$

$P^{S}_{i,j} = P^{C}_{i,j} = \frac{\exp(z_i\cdot z_j/\tau)}{\sum_{a\neq i}\exp(z_i\cdot z_a/\tau)}$

where $B\cdot M$ is the total number of training set samples and $\mathbb{1}(\cdot)$ is an indicator function equal to 1 if and only if its condition is true; among the samples of the same class as $\hat{x}_i$, $S_i$ denotes the set of samples enhanced from the same image and $C_i$ denotes the other cases; $P^{C}_{i,j}$ denotes the dot-product ratio between $\hat{x}_i$ and a same-class sample $\hat{x}_j$ obtained from a different image enhancement, and $P^{S}_{i,j}$ the dot-product ratio with a same-class sample enhanced from the same image; $\tau > 0$ is a temperature parameter and $\varepsilon$ is a similarity threshold with $1 \geq \varepsilon > 0$.
The invention also provides ViT-based aerial image recognition computer equipment, which comprises a memory, a processor and program instructions stored in the memory and executable by the processor, wherein the processor executes the program instructions to realize the steps of the method.
The invention further provides a computer-readable storage medium, which stores a computer program, wherein the computer program is used for realizing the ViT-based aerial image recognition method when being executed by a processor.
The invention has the following beneficial effects:
1. High recognition rate: aiming at the small amount of trainable data in aerial image recognition, which easily causes deep learning algorithms to overfit, the invention adopts discriminative label smoothing to encourage the model to learn sufficiently good feature information while preventing it from overfitting the distribution of noisy data.
2. Strong extensibility: the ViT-based aerial image recognition method is highly general in principle; with suitable training data selected according to actual needs, it can be applied to different types of aerial image recognition tasks.
3. Reasonable feature structure: the invention designs a discriminative label smoothing term and a discriminative supervised contrastive loss to learn a more compact and reasonable feature structure, thereby training a ViT-based aerial image recognition model with a stronger ability to capture salient features and making aerial image recognition more accurate.
Drawings
FIG. 1 is an overall flow chart of the ViT-based aerial image identification method according to the invention;
FIG. 2 is a diagram illustrating a random enhancement module according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a model for aerial image event recognition according to an embodiment of the present invention.
Detailed Description
The invention will be further described below with reference to the drawings and embodiments to explain its objects, technical solutions and advantages in detail. It should be understood that the specific embodiments described here are intended to be illustrative only and not limiting. In addition, the technical features in the embodiments of the invention described below may be combined with each other as long as they do not conflict.
In this embodiment, event recognition in aerial images is taken as the scenario to describe the ViT-based aerial image identification method of the invention in detail.
As shown in FIG. 1, the ViT-based aerial image identification method is applied to an event recognition task in aerial images; the detailed steps are as follows:
step S1: acquiring an event identification data set in the aerial image to obtain an aerial image x i And its corresponding event label y i In this embodiment, an event recognition data set in the ERA aerial image is selected, the data set includes 2864 sample images of 25 event categories, the training set and the test set which have been divided are directly used, the training set and the verification set are randomly divided from the original training set according to a ratio of 9:1, and the training set is recorded as the training set
Figure BDA0003648391670000061
B is the number of images in the training set.
Step S2: a random enhancement module is constructed to expand the data volume of the training set, and the training set images of step S1 are input into it for online data enhancement. In the random enhancement module, the input image is first randomly cropped to 224 × 224 pixels and then randomly horizontally flipped; then six image enhancement strategies common in current vision tasks are available: (1) BaseAugment (image normalization only); (2) SimAugment (random color distortion and Gaussian blur in order, possibly followed by a sparse image warping operation); (3) AutoAugment (automatic augmentation); (4) RandAugment (random augmentation); (5) TrivialAugment (randomly selecting one enhancement operation each time, then randomly determining its magnitude and enhancing the image); (6) RandomErasing (randomly erasing a rectangular region from the image without changing its original label). That is, given a training image, $M$ ($0 \leq M \leq 6$) strategies randomly selected from the six are used to enhance it; the expanded training set is finally obtained and denoted $\hat{D}=\{(\hat{x}_i,\hat{y}_i)\}_{i=1}^{B\cdot M}$. In this embodiment $M = 4$, as shown in FIG. 2.
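The strategy selection and the RandomErasing operation of the random enhancement module can be sketched as follows (a minimal illustration, not the patent's code: the six strategies are represented only by name, and only a bare-bones RandomErasing is actually implemented):

```python
import random

# The six candidate strategies named in the embodiment; real implementations
# (AutoAugment, RandAugment, ...) would be callables, represented here by name.
STRATEGIES = ["BaseAugment", "SimAugment", "AutoAugment",
              "RandAugment", "TrivialAugment", "RandomErasing"]

def pick_strategies(m, rng=random):
    """Randomly choose M distinct strategies (0 <= M <= 6) for one image."""
    if not 0 <= m <= len(STRATEGIES):
        raise ValueError("M must be between 0 and 6")
    return rng.sample(STRATEGIES, m)

def random_erase(img, top, left, h, w, fill=0):
    """Erase an h x w rectangle from a 2D grayscale image (list of lists)
    without touching the label: a minimal RandomErasing sketch."""
    out = [row[:] for row in img]
    for r in range(top, top + h):
        for c in range(left, left + w):
            out[r][c] = fill
    return out
```

With M = 4 as in this embodiment, `pick_strategies(4)` returns four distinct strategy names per image; the original image is left untouched because `random_erase` copies it first.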
Step S3: a ViT-based aerial image recognition model is constructed; its structure is shown in FIG. 3. The model consists of an encoder F(·), a classification head G(·), and a projection head P(·) used only during the training phase.
The encoder F(·) is a ViT pre-trained on the ImageNet data set, used to learn and encode global image features. Specifically, F(·) comprises two parts, a linear layer and a Transformer encoder: the linear layer embeds the image into a representation; the Transformer encoder consists of multi-head self-attention layers and multi-layer perceptron blocks and learns the global features of the image. LayerNorm is applied before each block and a residual connection after each block. A training image $\hat{x}_i$ is input into F(·), and the first token of the last Transformer encoder layer is taken as the global feature representation $h_i$ of $\hat{x}_i$. Then $h_i$ is input into the classification head and the projection head to compute the total loss value.
The classification head G(·) is an MLP layer with the structure "fully connected layer FC - activation function Tanh - fully connected layer FC"; the number of its output neurons equals the total number of aerial image classes in the current data set, 25 in this embodiment.
The projection head P(·) is used only during the training phase of the model; it maps the encoded representation $h_i$ into the latent space where the contrastive loss is applied, with the structure "fully connected layer FC - activation function ReLU - fully connected layer FC" and 128 output neurons.
Step S4: the expanded training set $\hat{D}$ of step S2 is input into the recognition model constructed in step S3; discriminative label smoothing is applied to the labels corresponding to the images, the model is trained with the cross-entropy loss function and the discriminative contrastive loss function and updated by a backpropagation algorithm, and the model with the best recognition accuracy on the validation set of step S1 is selected as the final trained recognition model.
Performing discriminative label smoothing on the labels corresponding to the images means smoothing each label according to the discrete probability values output by the model and the current training phase, and then using the smoothed label to compute the cross-entropy loss:

$L_{CE} = -\sum_{k=1}^{K}\Big[(1-\gamma(s))\,q_i^k + \frac{\gamma(s)}{K}\Big]\log p_i^k$

where $K$ is the total number of categories in the aerial image data set; $q_i^k$ is the initial label probability distribution of the $i$-th sample, equal to 1 for the correct label class and 0 otherwise; $p_i^k$ is the discrete probability output by the model, i.e. the predicted probability of the $i$-th sample for class $k$.
Obtaining annotated aerial images is typically more costly than building a natural image data set, so aerial image data sets are generally small, which easily leads to the model overfitting the training data. Traditional label smoothing can relieve overfitting to a certain extent, but risks underfitting when the data set is small. A smoothing variable $\gamma(s)$ is therefore proposed to control the smoothing weight, assigning different values as the training phase changes. Specifically, $\gamma(s)$ comprises two smoothing variables $\gamma_{hard}(s)$ and $\gamma_{simple}(s)$, which control the respective smoothing weights of hard and simple samples at different training stages:

$\gamma_{hard}(s) = \gamma_{min} + (\gamma_{max}-\gamma_{min})\, f_N(s/I)$

$\gamma_{simple}(s) = (\gamma_{hard}(s) + \gamma_{bias})\cdot 0.5^{(1+s/I)}$

where $s \in \{1,\dots,I\}$ is the current training iteration and $I$ the total number of iterations; $\gamma_{max}$ and $\gamma_{min}$ are the maximum and minimum smoothing weights for hard samples; $\gamma_{bias}$ is the offset between the hard-sample and simple-sample smoothing weights; and $f_N(\cdot)$ is a smooth interpolation function:

$f_N(x) = x^{N+1}\sum_{n=0}^{N}\binom{N+n}{n}\binom{2N+1}{N-n}(-x)^n$

where $\binom{N+n}{n}$ denotes the number of combinations, i.e. the total number of ways to choose $n$ elements from $N+n$ elements irrespective of order; $N$ controls the rate of smoothing and is set to 1 in this embodiment.
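Reading the combination-based interpolation as the generalized smoothstep (a plausible assumption; the original formula survives only as an equation image), it can be implemented directly with `math.comb`:

```python
from math import comb

def smooth_interp(x, N=1):
    """Generalized smoothstep S_N(x) = x^(N+1) * sum_n C(N+n, n) * C(2N+1, N-n) * (-x)^n.
    Assumed form of the patent's smooth interpolation function; N controls how
    flat the curve is near 0 and 1 (N = 1 in this embodiment, giving 3x^2 - 2x^3)."""
    return x ** (N + 1) * sum(
        comb(N + n, n) * comb(2 * N + 1, N - n) * (-x) ** n
        for n in range(N + 1)
    )
```

Under this reading the function rises smoothly from 0 at the start of training ($s/I \to 0$) to 1 at the end ($s/I \to 1$), which is what lets $\gamma_{hard}(s)$ interpolate between $\gamma_{min}$ and $\gamma_{max}$.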
When dividing the $i$-th sample $\hat{x}_i$ into hard or simple samples according to the $K$ class probabilities $p_i$ output by the model: if the maximum probability is greater than 0.8 and the second-largest is less than 0.2, the sample is considered simple; otherwise it is classified as hard. The corresponding smoothing variable is then selected and the cross-entropy loss value computed.
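The hard/simple split and the smoothed cross-entropy can be sketched as follows. The 0.8/0.2 thresholds come from the embodiment; the smoothing form $(1-\gamma)q_k + \gamma/K$ is the standard label-smoothing formula and is an assumption here, since the patent's equation survives only as an image:

```python
from math import log

def is_simple(probs):
    """Simple sample: max probability > 0.8 and second-largest < 0.2."""
    top, second = sorted(probs, reverse=True)[:2]
    return top > 0.8 and second < 0.2

def smoothed_ce(probs, label, gamma):
    """Cross entropy against a label smoothed by gamma (assumed standard form:
    q_k -> (1 - gamma) * q_k + gamma / K)."""
    K = len(probs)
    return -sum(
        ((1 - gamma) * (1.0 if k == label else 0.0) + gamma / K) * log(probs[k])
        for k in range(K)
    )

def discriminative_ce(probs, label, gamma_hard, gamma_simple):
    """Pick the smoothing weight by sample difficulty, then compute the loss."""
    gamma = gamma_simple if is_simple(probs) else gamma_hard
    return smoothed_ce(probs, label, gamma)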
The model is trained with the cross-entropy loss function and the discriminative contrastive loss function simultaneously, and the total loss value $L$ is calculated as:

$L = L_{CE} + \beta \cdot L_{DCL}$

where $L_{CE}$ is the cross-entropy loss function described above, $L_{DCL}$ the discriminative contrastive loss function, and $\beta$ a weight coefficient adjusting the importance of the discriminative contrastive loss function.
The expression of the discriminative contrastive loss function is as follows:

$L_{DCL} = -\frac{1}{B\cdot M}\sum_{i=1}^{B\cdot M}\frac{1}{|S_i|+|C_i|}\Big[\sum_{j\in S_i}\log P^{S}_{i,j} + \sum_{j\in C_i}\mathbb{1}(z_i\cdot z_j \geq \varepsilon)\log P^{C}_{i,j}\Big]$

$P^{S}_{i,j} = P^{C}_{i,j} = \frac{\exp(z_i\cdot z_j/\tau)}{\sum_{a\neq i}\exp(z_i\cdot z_a/\tau)}$

Because aerial images exhibit larger intra-class variation and inter-class similarity than natural images, even same-class samples differ to some degree, and these differences are further amplified by random enhancement; the discriminative contrastive loss is therefore proposed to further distinguish whether same-class images were enhanced from the same source image. Specifically, $B\cdot M$ is the total number of training set samples; $\mathbb{1}(\cdot)$ is an indicator function equal to 1 if and only if its condition is true; among the samples of the same class as $\hat{x}_i$, $S_i$ denotes the set of samples enhanced from the same image, and $C_i$ denotes the other cases. $P^{C}_{i,j}$ denotes the dot-product ratio between $\hat{x}_i$ and a same-class sample $\hat{x}_j$ obtained from a different image enhancement, and $P^{S}_{i,j}$ the dot-product ratio with a same-class sample enhanced from the same image. $\tau > 0$ is a temperature parameter and $\varepsilon$ ($1 \geq \varepsilon > 0$) is a similarity threshold.
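A supervised-contrastive-style sketch of the described structure follows: same-image positives ($S_i$) always contribute, while same-class positives from different images ($C_i$) are gated by the cosine-similarity threshold $\varepsilon$. The exact weighting in the patent is not recoverable from the equation images, so the gating rule and normalization here are assumptions:

```python
import numpy as np

def discriminative_contrastive_loss(z, labels, img_ids, tau=0.1, eps=0.3):
    """Sketch of a discriminative contrastive loss (assumed structure).
    z: (n, d) projected features; labels: class ids; img_ids: source-image ids."""
    z = z / np.linalg.norm(z, axis=1, keepdims=True)  # cosine similarities
    sim = z @ z.T
    n = len(z)
    loss, count = 0.0, 0
    for i in range(n):
        mask = np.arange(n) != i
        # softmax-normalized dot-product ratios P_ij over all a != i
        p = np.exp(sim[i][mask] / tau)
        p /= p.sum()
        for idx, j in enumerate(np.arange(n)[mask]):
            if labels[j] != labels[i]:
                continue  # only same-class samples are positives
            same_image = img_ids[j] == img_ids[i]
            # S_i always counts; C_i counts only above the similarity threshold
            if same_image or sim[i, j] >= eps:
                loss -= np.log(p[idx])
                count += 1
    return loss / max(count, 1)

# Illustrative batch: 8 samples, 2 classes, two enhanced views per source image.
rng = np.random.default_rng(1)
z_ex = rng.normal(size=(8, 16))
labels_ex = [0, 0, 0, 0, 1, 1, 1, 1]
img_ids_ex = [0, 0, 1, 1, 2, 2, 3, 3]
val = discriminative_contrastive_loss(z_ex, labels_ex, img_ids_ex)
```

Every anchor keeps at least its same-image partner as a positive, so the loss is always defined even when the $\varepsilon$ gate rejects all cross-image pairs.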
Step S5: the test set images of step S1 are input into the trained recognition model, and the predicted classes output by the model are compared with the true classes to obtain the final recognition accuracy. When the model recognition accuracy reaches a set threshold, images to be recognized are input into the aerial image recognition model for recognition; otherwise, return to step S3 until the model recognition accuracy reaches the set threshold.
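Step S5 reduces to a simple accuracy check against a deployment threshold; a sketch (the 0.9 threshold is illustrative, as the patent leaves the value unspecified):

```python
def recognition_accuracy(predictions, ground_truth):
    """Compare the model's predicted classes with the true classes on the test set."""
    correct = sum(p == t for p, t in zip(predictions, ground_truth))
    return correct / len(ground_truth)

def model_ready(accuracy, threshold=0.9):
    """Deploy when accuracy reaches the set threshold; otherwise return to step S3
    and retrain (threshold value is an assumption)."""
    return accuracy >= threshold
```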
Based on the method, the invention also provides ViT-based aerial image recognition computer equipment which comprises a memory, a processor and program instructions stored in the memory and executable by the processor, wherein the processor executes the program instructions to realize the steps in the ViT-based aerial image recognition method.
The invention further provides a computer-readable storage medium storing a computer program, wherein the computer program is used for implementing the ViT-based aerial image recognition method when being executed by a processor.
Details not described in this specification are well known to those skilled in the art.
It will be appreciated by those skilled in the art that modifications and variations may be made in light of the above teachings, or that the methods provided by the invention may be applied to similar aerial image recognition tasks; all such modifications and variations are intended to fall within the scope of the appended claims.

Claims (10)

1. A ViT-based aerial image identification method, characterized by comprising the following steps:
S1) acquiring an aerial image data set to obtain the required original aerial images $x_i$ and their corresponding category labels $y_i$, and dividing them proportionally into a training set, a validation set and a test set, used subsequently for training, validating and evaluating the model respectively; the training set is denoted $D=\{(x_i,y_i)\}_{i=1}^{B}$, where $B$ is the number of training images;
S2) performing online data enhancement on the training set images so that each image in the training set generates $M$ different enhanced images; the expanded training set contains $B\cdot M$ images and is denoted $\hat{D}=\{(\hat{x}_i,\hat{y}_i)\}_{i=1}^{B\cdot M}$;
S3) constructing an aerial image recognition model based on ViT;
S4) inputting the expanded training set $\hat{D}$ into the ViT-based aerial image recognition model, performing discriminative label smoothing on the labels corresponding to the images, training the model with a cross-entropy loss function and a discriminative contrastive loss function, updating the recognition model through backpropagation, and selecting the optimal aerial image recognition model with the validation set of step S1);
S5) testing the recognition performance of the aerial image recognition model with the test set of step S1) to obtain the final model recognition accuracy; when the model recognition accuracy reaches a set threshold, images to be recognized are input into the aerial image recognition model for recognition; otherwise, return to step S3) until the model recognition accuracy reaches the set threshold.
2. The ViT-based aerial image identification method as claimed in claim 1, characterized in that: in step S2) the input image is randomly cropped to 224 × 224 pixels and then randomly horizontally flipped; the image is then enhanced with the image enhancement strategy, and the expanded training set is finally obtained and denoted $\hat{D}$.
3. The ViT-based aerial image identification method as claimed in claim 2, characterized in that: the image enhancement strategy in step S2) includes one or more of the following operations: normalizing the image; applying random color distortion followed by Gaussian blur; automatic augmentation; randomly selecting one enhancement operation each time and then randomly determining its magnitude before enhancing the image; and randomly erasing a rectangular region from the image without changing its original label.
4. The aerial image recognition method based on ViT of claim 1, wherein: the ViT-based aerial image recognition model in step S3) is composed of an encoder F(·), a classification head G(·), and a projection head P(·) used only in the training phase:
the encoder F(·) is a pre-trained ViT, used for learning and encoding the global features of the image; the training samples
Figure FDA0003648391660000021
are fed into the feature encoder F(·), and the first token output by the encoder is taken as the global feature representation h_i of
Figure FDA0003648391660000022
;
the classification head G(·) is an MLP layer with the structure fully connected layer FC-activation function Tanh-fully connected layer FC, where the number of output neurons of the MLP layer equals the total number of aerial image categories in the current dataset;
the projection head P(·) is used only in the training phase of the model; its role is to map the encoded global feature h_i into the latent space where the contrastive loss is applied, and its structure is fully connected layer FC-activation function ReLU-fully connected layer FC.
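The two MLP heads of claim 4 can be sketched as plain NumPy functions on top of a stubbed encoder output. The hidden size (768), projection size (128), and class count (45) are assumptions for illustration, not values from the patent.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM, PROJ, K = 768, 128, 45   # hidden size, projection size, class count: assumed

# Classification head G: FC -> Tanh -> FC, output size = number of classes K.
W1 = rng.standard_normal((DIM, DIM)) * 0.02
W2 = rng.standard_normal((DIM, K)) * 0.02
def classify(h):
    return np.tanh(h @ W1) @ W2

# Projection head P (training only): FC -> ReLU -> FC, maps h_i into the
# latent space where the contrastive loss is applied.
V1 = rng.standard_normal((DIM, DIM)) * 0.02
V2 = rng.standard_normal((DIM, PROJ)) * 0.02
def project(h):
    return np.maximum(h @ V1, 0.0) @ V2

h = rng.standard_normal(DIM)   # stand-in for the encoder's first (class) token h_i
logits = classify(h)           # used for prediction at train and test time
z = project(h)                 # used only for the contrastive loss during training
```

At test time only F(·) and G(·) are kept; P(·) is discarded, which is why it appears solely in the training-phase description.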
5. The aerial image recognition method based on ViT of claim 1, wherein: in step S4), performing discriminative label smoothing on the label corresponding to the image means smoothing the label according to the discrete probability distribution output by the model and the current training phase, and then using the smoothed label to calculate the cross entropy loss function value; the standard form of this expression is:
L_CE = -Σ_{k=1}^{K} ((1 - γ·(s)) · q_k^(i) + γ·(s)/K) · log p_k^(i)
where L_CE is the cross entropy loss function value and K is the total number of categories in the aerial image dataset; q_k^(i) is the initial label probability distribution of the i-th sample, i.e., 1 for the correct label class and 0 otherwise; p_k^(i) is the discrete probability distribution output by the model, referring to the model's predicted probability for the i-th sample on the k-th class; and γ·(s) is a smoothing variable.
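Claim 5's smoothed cross-entropy can be computed directly. The sketch below uses the standard label-smoothing form with a scalar γ, an assumption since the patent's exact expression survives only as a formula image; the symbols (p, K, γ) follow the definitions above.

```python
import numpy as np

def smoothed_cross_entropy(p, k_true, gamma):
    """Cross-entropy against a label smoothed by gamma.

    p      : model's discrete probability distribution over the K classes
    k_true : index of the correct class
    gamma  : smoothing variable (chosen per sample difficulty and phase s)
    """
    K = p.shape[0]
    q = np.full(K, gamma / K)      # spread gamma uniformly over all classes
    q[k_true] += 1.0 - gamma       # remaining probability mass on the true class
    return -np.sum(q * np.log(p))

p = np.array([0.7, 0.2, 0.1])
loss_sharp = smoothed_cross_entropy(p, 0, 0.0)   # gamma = 0: plain cross-entropy
loss_smooth = smoothed_cross_entropy(p, 0, 0.1)  # smoothed label raises the loss
```

With γ = 0 this reduces to the ordinary one-hot cross-entropy, which is the consistency check on the reconstruction.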
6. The aerial image recognition method based on ViT of claim 5, wherein: the smoothing variable γ·(s) comprises two smoothing variables, γ_hard(s) and γ_simple(s), used to control the respective smoothing weights of difficult samples and simple samples in different training phases; their expressions are as follows:
Figure FDA0003648391660000026
γ_simple(s) = (γ_hard(s) + γ_bias) · 0.5^(1+s/I)
where s is the iteration number of the current training and I is the total number of iterations; γ_max is the maximum value of the smoothing weight corresponding to difficult samples, and γ_min is the minimum value; γ_bias is the offset between the difficult-sample and simple-sample smoothing weights; and
Figure FDA0003648391660000031
refers to a smooth interpolation function, whose expression is as follows:
f(x) = x^(N+1) · Σ_{n=0}^{N} C(N+n, n) · (1-x)^n
where C(N+n, n) is the number of combinations, i.e., the total number of ways to take n elements from N+n elements, and N is used to control the rate of smoothing.
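The claim describes the smooth interpolation via binomial coefficients C(N+n, n), which matches the generalized smoothstep function; the sketch below is a hedged reconstruction of that interpolation together with the γ_simple schedule quoted above (the exact γ_hard formula is only a formula image in the source and is not reconstructed here).

```python
from math import comb

def smoothstep(x, N=1):
    """Generalized smoothstep: x**(N+1) * sum_{n=0}^{N} C(N+n, n) * (1-x)**n.

    C(N+n, n) counts the ways to choose n elements from N+n elements, and N
    controls how gradual the smoothing is (N=1 gives 3x^2 - 2x^3).
    """
    return x ** (N + 1) * sum(comb(N + n, n) * (1 - x) ** n for n in range(N + 1))

def gamma_simple(gamma_hard, gamma_bias, s, I):
    """Simple-sample smoothing weight: (gamma_hard + gamma_bias) * 0.5**(1+s/I)."""
    return (gamma_hard + gamma_bias) * 0.5 ** (1 + s / I)
```

The 0.5^(1+s/I) factor decays the simple-sample smoothing weight as training progresses, so easy samples receive sharper labels late in training.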
7. The aerial image recognition method based on ViT of claim 6, wherein: when dividing the i-th sample
Figure FDA0003648391660000033
into difficult or simple samples according to the probabilities over the K classes output by the model
Figure FDA0003648391660000034
: when the maximum value is greater than 0.8 and the second largest value is less than 0.2, the sample is considered a simple sample; otherwise, it is divided into the difficult samples; the corresponding smoothing variable is then selected accordingly, and the cross entropy loss function value is calculated.
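Claim 7's division rule is a simple test on the model's output distribution; the thresholds 0.8 and 0.2 come from the claim, and the function name is illustrative.

```python
import numpy as np

def is_simple_sample(p, hi=0.8, lo=0.2):
    """Claim 7's rule: a sample is 'simple' when the largest predicted
    probability exceeds 0.8 and the second largest is below 0.2;
    otherwise it is treated as a difficult sample."""
    top2 = np.sort(p)[-2:]          # [second largest, largest]
    return bool(top2[1] > hi and top2[0] < lo)

simple = is_simple_sample(np.array([0.85, 0.10, 0.05]))  # confident prediction
hard = is_simple_sample(np.array([0.50, 0.40, 0.10]))    # ambiguous prediction
```

The flag then selects γ_simple(s) or γ_hard(s) as the smoothing variable when computing the cross-entropy loss for that sample.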
8. The aerial image recognition method based on ViT of claim 6, wherein: in step S4), when the cross entropy loss function and the discriminative contrast loss function are used simultaneously to train the model, the total loss value L is calculated according to the following formula:
L = L_CE + β · L_DCL
where L_CE is the cross entropy loss function, L_DCL is the discriminative contrast loss function, and β is a weight coefficient for adjusting the importance of the discriminative contrast loss function.
The expression of the discriminative contrast loss function is as follows:
Figure FDA0003648391660000035
Figure FDA0003648391660000036
where M is the total number of training set samples;
Figure FDA0003648391660000037
is an indicator function, equal to 1 if and only if the input condition is true; among samples
Figure FDA0003648391660000038
belonging to the same class, S_i represents the set of samples enhanced from the same image, and C_i represents the remaining cases;
Figure FDA0003648391660000039
represents the ratio of the dot product between the sample
Figure FDA00036483916600000310
and same-class samples obtained from different image enhancements
Figure FDA0003648391660000041
;
Figure FDA0003648391660000042
represents the ratio of the dot product between the sample
Figure FDA0003648391660000043
and same-class samples enhanced from the same image
Figure FDA0003648391660000044
is the temperature parameter, ε is the similarity threshold, and 1 ≥ ε > 0.
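The exact discriminative contrast loss survives only as formula images, so the sketch below substitutes a generic supervised-contrastive-style term over projected features: it treats all same-class pairs alike, whereas the patent's L_DCL additionally distinguishes same-image from same-class positives and applies the similarity threshold ε. The total-loss combination L = L_CE + β · L_DCL follows the claim.

```python
import numpy as np

def supcon_like_loss(z, labels, tau=0.1):
    """A supervised-contrastive-style stand-in for L_DCL over projected
    features z of shape (B, D); tau is the temperature parameter."""
    z = z / np.linalg.norm(z, axis=1, keepdims=True)  # cosine similarities
    sim = z @ z.T / tau
    B = z.shape[0]
    total = 0.0
    for i in range(B):
        others = [j for j in range(B) if j != i]
        pos = [j for j in others if labels[j] == labels[i]]
        if not pos:
            continue
        denom = np.sum(np.exp(sim[i, others]))
        total += -np.mean([np.log(np.exp(sim[i, j]) / denom) for j in pos])
    return total / B

rng = np.random.default_rng(0)
z = rng.standard_normal((8, 16))                 # projected features from P(.)
labels = np.array([0, 0, 1, 1, 2, 2, 3, 3])      # two views per class
L_DCL = supcon_like_loss(z, labels)
L_CE = 0.42                                      # placeholder cross-entropy value
beta = 0.5                                       # weight of the contrastive term
L_total = L_CE + beta * L_DCL                    # claim 8's combination
```

β trades off classification sharpness against the clustering pressure of the contrastive term in the projection space.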
9. A ViT-based aerial image recognition computer device, comprising a memory, a processor, and program instructions stored in the memory for execution by the processor, wherein the processor, when executing the program instructions, implements the steps of the method of any one of claims 1-8.
10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the method of any one of claims 1 to 8.
CN202210541111.8A 2022-05-17 2022-05-17 ViT-based aerial image identification method Pending CN114842343A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210541111.8A CN114842343A (en) 2022-05-17 2022-05-17 ViT-based aerial image identification method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210541111.8A CN114842343A (en) 2022-05-17 2022-05-17 ViT-based aerial image identification method

Publications (1)

Publication Number Publication Date
CN114842343A true CN114842343A (en) 2022-08-02

Family

ID=82569586

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210541111.8A Pending CN114842343A (en) 2022-05-17 2022-05-17 ViT-based aerial image identification method

Country Status (1)

Country Link
CN (1) CN114842343A (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115394381A (en) * 2022-08-24 2022-11-25 哈尔滨理工大学 High-entropy alloy hardness prediction method and device based on machine learning and two-step data expansion
CN115394381B (en) * 2022-08-24 2023-08-22 哈尔滨理工大学 High-entropy alloy hardness prediction method and device based on machine learning and two-step data expansion
CN115396242A (en) * 2022-10-31 2022-11-25 江西神舟信息安全评估中心有限公司 Data identification method and network security vulnerability detection method
CN116758360A (en) * 2023-08-21 2023-09-15 江西省国土空间调查规划研究院 Land space use management method and system thereof
CN116758360B (en) * 2023-08-21 2023-10-20 江西省国土空间调查规划研究院 Land space use management method and system thereof
CN117173122A (en) * 2023-09-01 2023-12-05 中国农业科学院农业信息研究所 Lightweight ViT-based image leaf density determination method and device
CN117173122B (en) * 2023-09-01 2024-02-13 中国农业科学院农业信息研究所 Lightweight ViT-based image leaf density determination method and device

Similar Documents

Publication Publication Date Title
US20200285896A1 (en) Method for person re-identification based on deep model with multi-loss fusion training strategy
CN109949317B (en) Semi-supervised image example segmentation method based on gradual confrontation learning
CN111754596B (en) Editing model generation method, device, equipment and medium for editing face image
CN109583501B (en) Method, device, equipment and medium for generating image classification and classification recognition model
CN114842343A (en) ViT-based aerial image identification method
CN108647583B (en) Face recognition algorithm training method based on multi-target learning
WO2020114378A1 (en) Video watermark identification method and apparatus, device, and storage medium
EP3690741A2 (en) Method for automatically evaluating labeling reliability of training images for use in deep learning network to analyze images, and reliability-evaluating device using the same
JP2020123330A (en) Method for acquiring sample image for label acceptance inspection from among auto-labeled images utilized for neural network learning, and sample image acquisition device utilizing the same
CN111582397A (en) CNN-RNN image emotion analysis method based on attention mechanism
CN113592007B (en) Knowledge distillation-based bad picture identification system and method, computer and storage medium
CN112560710B (en) Method for constructing finger vein recognition system and finger vein recognition system
JP7139749B2 (en) Image recognition learning device, image recognition device, method, and program
CN114332578A (en) Image anomaly detection model training method, image anomaly detection method and device
CN111539456B (en) Target identification method and device
CN112749737A (en) Image classification method and device, electronic equipment and storage medium
CN111444816A (en) Multi-scale dense pedestrian detection method based on fast RCNN
CN114419379A (en) System and method for improving fairness of deep learning model based on antagonistic disturbance
CN114332075A (en) Rapid structural defect identification and classification method based on lightweight deep learning model
CN109101984B (en) Image identification method and device based on convolutional neural network
CN111126155B (en) Pedestrian re-identification method for generating countermeasure network based on semantic constraint
JPWO2019215904A1 (en) Predictive model creation device, predictive model creation method, and predictive model creation program
TWI803243B (en) Method for expanding images, computer device and storage medium
CN112507912A (en) Method and device for identifying illegal picture
CN116912920B (en) Expression recognition method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination