CN115690479A - Remote sensing image classification method and system based on convolution Transformer - Google Patents


Info

Publication number
CN115690479A
Authority
CN
China
Prior art keywords
convolution
layer
feature
remote sensing
transformer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210562253.2A
Other languages
Chinese (zh)
Inventor
陈辉
张甜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Anhui University of Science and Technology
Original Assignee
Anhui University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Anhui University of Science and Technology
Priority to CN202210562253.2A
Publication of CN115690479A
Legal status: Pending

Landscapes

  • Image Analysis (AREA)

Abstract

The invention provides a remote sensing image classification method and system based on a convolutional Transformer. The method first extracts local features with a lightweight convolutional neural network; the obtained local features are then fed into a hybrid network that fuses a CNN (convolutional neural network) with a multi-head self-attention mechanism to strengthen extraction of the image's global features; transfer learning is introduced during training to accelerate convergence; and classification prediction is finally performed on the resulting feature output. Compared with other common image classification methods, the method efficiently extracts both local feature information and long-range global dependency information from remote sensing images while reducing the number of parameters and the computational cost. It addresses the technical problems of growing self-attention time complexity, high computational cost, low classification accuracy, and poor robustness.

Description

Remote sensing image classification method and system based on convolution Transformer
Technical Field
The invention relates to the field of remote sensing image classification, and in particular to a remote sensing image classification method and system based on a convolutional Transformer.
Background
Remote sensing images have a complex overall structure and rich texture features. Although remote sensing image classification methods based on convolutional neural networks (CNNs) can capture rich local information, their limited receptive field prevents them from establishing long-range dependencies over the global information, which lowers the accuracy of remote sensing image classification.
Aerial remote sensing is a non-contact technology that collects information using satellites, aircraft, or unmanned aerial vehicles (UAVs) as carriers, and is mainly applied in geological survey, environmental monitoring, crop forecasting, resource exploration, and similar fields. With the development of artificial intelligence and satellite sensors, remote sensing has overcome numerous challenges and entered a new stage of providing a variety of survey information accurately and efficiently, and the resolution of remote sensing images has improved markedly in the spatial, spectral, and temporal domains. For example, DigitalGlobe's WorldView-2 satellite provides panchromatic images at 0.5 m resolution and multispectral images at 1.8 m resolution; the CMOS sensor on China's OHS hyperspectral satellite offers 256 spectral bands covering 400-1000 nm at a spatial resolution of 10 m, bringing faster and more accurate satellite services to the world. The UAV is a principal carrier for collecting remote sensing information and is widely used in geological disaster monitoring, ocean and island surveying and mapping, emergency rescue, and related fields. Compared with satellites, UAVs are light and portable, less complex, cheaper to develop, and easy to deploy for monitoring and mapping; compared with aircraft, UAVs are less constrained by temperature and weather, can fly whenever conditions permit, improve temporal resolution, and reduce image blur. In recent years, methods related to convolutional neural networks (CNNs) have been proposed for classifying UAV images; for example, Liu proposed combining a CNN with object-based image analysis (OBIA) and using multi-view data for land cover classification, and Bazi proposed a dual-branch neural network to assign multiple-level labels to drone images.
For the problem of extracting useful information from remote sensing images, scene classification is currently one of the most widely studied areas. Its main goal is to acquire an image, recognize the correct semantic label in it, and thereby judge the scene, achieving the purpose of classification. Scene classification has many important application areas, such as land management, post-fire forest reconstruction, and urban planning. Early scene classification relied mainly on hand-crafted features, such as SIFT, GIST, and histograms of oriented gradients; these methods achieve good results in some simple scene classification tasks, but their limitations become increasingly obvious as the complexity and number of scene categories grow. Researchers therefore proposed traditional image feature modeling methods such as the bag of words (BoW), LSA, and the vector of locally aggregated descriptors (VLAD).
Compared with traditional image classification methods, deep learning approaches such as neural networks and autoencoders have achieved remarkable results in many application areas, including remote sensing image classification, and the CNN in particular has surpassed other traditional methods in many applications. CNNs offer end-to-end detection and reduce the number of parameters to be trained, lowering the spatial complexity of the network, while multi-channel input removes the need to rearrange feature information and shortens training time. On this basis, methods such as recurrent neural networks (RNNs), generative adversarial networks (GANs), graph convolutional networks (GCNs), and long short-term memory (LSTM) have also been introduced. For example, the existing invention patent with application number CN202111368092.5, "a method for improving the robustness of remote sensing image classification networks based on self-supervised learning," exploits the large amount of unlabeled data in the remote sensing field and mines image information through a twin network, effectively improving model robustness: the twin network extracts features from clean samples and adversarial samples simultaneously to obtain feature vectors, and model training is completed by contrastive learning that pulls the feature vectors of clean and adversarial samples closer. The existing invention patent with application number CN202111193355.3, "a remote sensing scene image classification method based on a self-compensating convolutional neural network," in step one acquires a hyperspectral image dataset and the corresponding label vector dataset; in step two establishes a self-compensating convolutional neural network; in step three inputs the hyperspectral image dataset and the corresponding label vector dataset into the established network and performs iterative optimization to obtain the optimal self-compensating convolutional neural network; and in step four inputs the hyperspectral image to be classified into the optimal network to predict the classification result. Neither discloses the technical solution or the specific technical features of the present application. Girshick et al. proposed a new object detection method that uses deep convolutional networks for classification, improving training speed and detection accuracy over previous network methods. Bi et al. treated aerial scene classification as a multi-instance learning problem and proposed a multi-instance densely connected convolutional network (MIDC-Net) with fewer parameters that effectively preserves features at different levels. Yu et al. combined GANs with attention and proposed a new attention generative adversarial network (Attention-GANs), greatly improving aerial scene classification performance. Xue et al. used three popular CNNs as feature extractors to obtain deep features from images and fused them to realize remote sensing scene classification. Yu et al. fused the features of two pre-trained convolutional neural networks with a two-stream technique to classify high-resolution aerial scenes, significantly improving classification accuracy.
In recent years, a new deep learning method, the Transformer, has been proposed alongside the CNN and has become widely popular in computer vision. The Transformer is a network that relies mainly on self-attention to establish long-range global dependencies between input and output features, and its parallelized output yields more effective results. It is currently the most advanced sequence encoder, is widely applied in natural language processing, and has a clear effect there. Inspired by this, some researchers have tried to apply Transformers to images. Bello et al. replaced some convolutional layers with self-attention to enhance the feature extraction ability of CNNs and thereby improve image classification performance; however, the large size of images causes the self-attention time complexity to increase, making the computational cost high. Wang et al. proposed a new end-to-end attention recurrent convolutional network (ARCNet) that improves classification performance by selectively attending to certain key regions or positions. Dosovitskiy et al. no longer combined with or replaced parts of a CNN, but instead fed a sequence of embedded image patches into a Transformer, applying the Transformer directly to the image classification task. Wu et al. used a Transformer on top of a CNN: the CNN first extracts the image's feature map, the feature map is then provided to the Transformer, and visual tokens are finally used to enhance image prediction.
The foregoing prior art differs significantly from the present application in specific implementation content and technical features. Because remote sensing images have a complex overall structure and rich texture features, CNN-based remote sensing image classification methods can capture rich local information but model the global information of the image poorly, which lowers the accuracy of remote sensing image classification.
In summary, the prior art suffers from growing self-attention time complexity, high computational cost, low classification accuracy, and poor robustness.
Disclosure of Invention
The technical problem to be solved by the invention is how to overcome the growing self-attention time complexity, high computational cost, low classification accuracy, and poor robustness of the prior art.
The invention adopts the following technical scheme to solve the above technical problems. A remote sensing image classification method based on a convolutional Transformer comprises the following steps:
S1, performing a first-layer convolution on a feature map of a preset size to obtain a first-layer convolution feature map of a preset size;
S2, inputting the first-layer convolution feature map into an L-Conv module and processing it with no fewer than 2 L-Conv modules to obtain an L-Conv-processed feature map, wherein each L-Conv module comprises a convolutional position encoding (CPE) layer and a local feature extraction (LFE) layer: the CPE layer obtains the absolute position information of the features in the image through depthwise convolution, and the LFE layer processes the first-layer convolution feature map with dimension-reducing depthwise separable convolution before it enters the second layer; the second-layer convolution has a stride of 2 and is applied to the L-Conv-processed feature map to obtain a second-layer convolution-processed feature map, which is further processed by no fewer than 2 L-Conv modules to obtain a second-layer convolution feature map;
S3, inputting the second-layer convolution feature map into a Transformer module, wherein the Transformer module comprises a convolutional position encoding (CPE) layer and a global feature extraction (GFE) layer: the CPE layer encodes the position information of the features through depthwise convolution, and the GFE layer uses a multi-head self-attention layer to model the long-range global information of the deep image features; the second-layer convolution feature map is processed successively by a preset number of Transformer modules to obtain a lightweight convolution feature map, which is average-pooled and passed through a preset fully connected layer to output the final prediction result.
In this method, a lightweight convolutional neural network replaces a plain CNN to extract the local features of the image, so that channels and regions are separated during convolution and the computation and parameter counts are reduced; the obtained local features are then fed into a hybrid network that fuses a CNN with a multi-head self-attention mechanism to strengthen extraction of the image's global features; finally, the resulting feature output is used for image classification. The lightweight convolutional Transformer (L-CT) can capture local texture information of image features in the shallow layers and global semantic information in the deep layers, combining the advantages of CNNs and Transformers, so the network is lightweight and efficient and the robustness of the algorithm is improved.
In a more specific aspect, the convolutional position encoding (CPE) layer position-encodes the image features with depthwise convolution according to the following logic:

CPE(X_in) = DWConv(X_in)

where X_in ∈ R^{H×W×C}, H denotes the height of the feature map, W its width, C the number of input channels, and DWConv the depthwise convolution.
To capture the positional relationships among feature vectors, the invention introduces position encoding into the L-CT; the position encoding can learn the temporal-dimension information and ordering relationships of the image features, and the absolute position information of the features can be accurately captured during feature extraction.
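For illustration only, a minimal PyTorch sketch of the convolutional position encoding defined above follows; it assumes a 3 × 3 depthwise convolution with stride 1 (as in the detailed embodiment), and all names and tensor sizes are illustrative rather than taken from the patent.

```python
import torch
import torch.nn as nn

class ConvPositionEncoding(nn.Module):
    """Depthwise 3x3 convolution used as convolutional position encoding (CPE)."""
    def __init__(self, channels: int):
        super().__init__()
        # groups=channels makes the convolution depthwise (one filter per channel)
        self.dwconv = nn.Conv2d(channels, channels, kernel_size=3,
                                stride=1, padding=1, groups=channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W) feature map; CPE(X_in) = DWConv(X_in)
        return self.dwconv(x)

# usage: a 56x56x64 stage-1 feature map
pe = ConvPositionEncoding(64)
feat = torch.randn(1, 64, 56, 56)
print(pe(feat).shape)  # torch.Size([1, 64, 56, 56])
```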
In a more specific technical solution, the step S2 includes:

S21, given an input image X;

S22, extracting 5 feature maps {f_1, f_2, f_3, f_4, f_5} from the input image X with a feature extractor f_ec;

S23, inputting the extracted feature maps into a position encoding module (PEM) and processing them with bilinear interpolation to obtain feature maps of the same spatial dimensions;

S24, splicing the same-spatial-dimension feature maps to obtain a spliced feature map F_cat, and performing k convolution operations on the spliced feature map F_cat to generate a position map M_pos.
The present invention uses the popular convolutional position encoding (CPE) to obtain the position information of image features so as to adapt to inputs of different resolutions. CPE position-encodes the image features with depthwise convolution and thus helps each feature determine its absolute position; compared with a conventional convolution operation, it has a lower parameter count and computational cost and a smaller memory footprint. Thanks to the shared parameters and locality of convolution, CPE overcomes permutation invariance and is friendly to any input length.
In a more specific technical solution, the step S24 includes:

S241, processing the same-spatial-dimension feature maps according to the following logic to obtain the spliced feature map F_cat:

F_cat = Concat(f_1', f_2', f_3', f_4', f_5')

S242, performing k convolution operations on the spliced feature map F_cat to generate the position map M_pos:

M_pos = τ_pos(F_cat)

where τ_pos denotes the transfer function realized by the k successive convolutions.
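The flow of steps S21-S24 can be sketched as follows; this is a minimal illustration assuming bilinear interpolation to a common spatial size followed by a single convolution (k = 1), and the channel widths, class names, and number of intermediate convolutions are assumptions, not values given in the patent.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PositionEncodingModule(nn.Module):
    """Aligns multi-scale feature maps, concatenates them, and produces a position map."""
    def __init__(self, in_channels: int, out_channels: int = 1, k: int = 1):
        super().__init__()
        # tau_pos: k successive convolutions over the concatenated maps (k = 1 here)
        layers, c = [], in_channels
        for _ in range(k):
            layers.append(nn.Conv2d(c, out_channels, kernel_size=3, padding=1))
            c = out_channels
        self.tau_pos = nn.Sequential(*layers)

    def forward(self, feature_maps):
        # resize every feature map to the spatial size of the first one (bilinear)
        h, w = feature_maps[0].shape[-2:]
        aligned = [F.interpolate(f, size=(h, w), mode="bilinear", align_corners=False)
                   for f in feature_maps]
        f_cat = torch.cat(aligned, dim=1)   # spliced feature map F_cat
        return self.tau_pos(f_cat)          # position map M_pos

# illustrative: five feature maps from a backbone f_ec at different resolutions
feats = [torch.randn(1, c, s, s) for c, s in [(64, 56), (128, 28), (256, 14), (512, 7), (512, 7)]]
pem = PositionEncodingModule(in_channels=sum(f.shape[1] for f in feats))
print(pem(feats).shape)  # torch.Size([1, 1, 56, 56])
```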
In a more specific technical solution, the local feature extraction (LFE) layer extracts the local texture information of image features with depthwise separable convolution, which comprises a depthwise convolution and a pointwise convolution: the pointwise convolution uses 1 × 1 kernels to linearly combine the feature maps on different channels and output the result; the position-encoded feature map is convolved, the pointwise convolution linearly combines features at different depths, and channels and regions are separated during the convolution of the feature map.
Performing a convolution operation on the position-encoded feature map and linearly combining features at different depths with the pointwise convolution separates channels and regions during convolution; because the parameter count and computation of the depthwise separable convolution are lower than those of a standard convolution, the method's parameter count is reduced and its running speed is improved.
In a more specific technical scheme, a 1 × 1 pointwise convolution is added before the depthwise separable convolution to raise the dimension.
In a more specific solution, a residual function and a normalization structure are added to each output of the L-Conv module and the Transformer module, respectively.
Adding a 1 × 1 pointwise convolution to raise the dimension before the depthwise separable convolution lets the depthwise convolution capture rich semantic information in a high-dimensional space; in addition, the invention adds a residual function and a normalization structure to each output of the L-Conv module and the Transformer module, respectively, preventing network degradation and vanishing gradients.
In a more specific technical solution, the global feature extraction (GFE) layer adopts a Transformer network structure comprising a multi-head self-attention layer and a feed-forward network layer; a regularization layer and a residual connection layer are added before and after each sub-layer, and the feed-forward network layer performs its linear transformations with the GELU activation function.
By adding the regularization layer and residual connection layer before and after each sub-layer, the invention avoids vanishing gradients and network degradation; the GELU activation function is used for the linear transformation in the feed-forward network layer, and the stochastic regularization idea embedded in GELU strengthens the generalization ability of the method and optimizes the classification of remote sensing images.
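A minimal sketch of such a Transformer block is given below, assuming pre-layer normalization, a GELU feed-forward network with a 4× hidden expansion, and PyTorch's built-in multi-head attention; these implementation choices are assumptions for illustration and are not prescribed by the patent.

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """Multi-head self-attention + GELU feed-forward, each with LayerNorm and a residual."""
    def __init__(self, dim: int, num_heads: int, mlp_ratio: int = 4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(
            nn.Linear(dim, mlp_ratio * dim),
            nn.GELU(),                       # GELU activation in the feed-forward layer
            nn.Linear(mlp_ratio * dim, dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, N, C) sequence of flattened image features
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]  # residual around attention
        x = x + self.ffn(self.norm2(x))                    # residual around feed-forward
        return x

block = TransformerBlock(dim=320, num_heads=8)             # illustrative sizes
print(block(torch.randn(1, 49, 320)).shape)                # torch.Size([1, 49, 320])
```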
In a more specific technical solution, the step S3 includes:

S31, performing cross-transfer learning on the RSSCN7 and AID remote sensing image datasets to obtain pre-trained weight coefficients, and multiplying the weight coefficients W^Q, W^K, W^V with x_i, i ∈ (1,2,3,4,...,n), to obtain the feature vectors q_i, k_i, v_i, i ∈ (1,2,3,4,...,n):

q_i = x_i W^Q
k_i = x_i W^K
v_i = x_i W^V

where x_i, i ∈ (1,2,3,4,...,n), is the feature map input at the bottom layer and W^Q, W^K, W^V are parameter matrices trained by the method;

S32, performing a dot-product operation between the feature vector q_i, i ∈ (1,2,3,4,...,n), and the feature vector k_j, j ∈ (1,2,3,4,...,n), according to the following logic to obtain the vector dot product a_ij, i, j ∈ (1,2,3,4,...,n):

a_ij = (q_i · k_j) / √(d_z)

where d_z denotes the dimension of q and k and prevents the value from exploding as the vectors grow;

S33, processing the vector dot products a_ij, i, j ∈ (1,2,3,4,...,n), with a Softmax function to obtain the Softmax result â_ij:

â_ij = exp(a_ij) / Σ_j exp(a_ij)

S34, multiplying the Softmax results â_ij with the feature vectors v_j, j ∈ (1,2,3,4,...,n), at the corresponding positions according to the following logic to obtain the output vector z = (z_1, z_2, z_3, z_4, ..., z_n):

z_i = Σ_j â_ij v_j

S35, if the feature vectors q_i, k_i, v_i, i ∈ (1,2,3,4,...,n), are regarded as one head of data, multiplying no fewer than 2 groups of weight coefficients W^Q, W^K, W^V with x_i, i ∈ (1,2,3,4,...,n), to obtain multi-head feature vectors q_i, k_i, v_i, i ∈ (1,2,3,4,...,n);

S36, splicing the multi-head feature vectors q_i, k_i, v_i, i ∈ (1,2,3,4,...,n), according to the following logic to obtain the multi-head self-attention result and inputting it into the fully connected layer for a linear operation, thereby obtaining the final multi-head self-attention value:

MultiHead(Q, K, V) = Concat(h_1, h_2, ..., h_m) W^O
h_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)

where Q, K, V are the query vector, key vector, and value vector, respectively, W^O is the output matrix obtained from the attention calculation, h_i is the i-th head, and m is the number of heads;

S37, performing normalization and residual processing on the final multi-head self-attention value to obtain the final prediction result.
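A minimal sketch of the multi-head self-attention computation in steps S31-S36 follows. The feature dimension, head count, and token count are illustrative, and splitting a single projection into heads by reshaping is used here as the usual implementation convenience equivalent to using multiple groups of W^Q, W^K, W^V.

```python
import torch
import torch.nn as nn

class MultiHeadSelfAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads            # d_z per head
        self.w_q = nn.Linear(dim, dim, bias=False)  # W^Q
        self.w_k = nn.Linear(dim, dim, bias=False)  # W^K
        self.w_v = nn.Linear(dim, dim, bias=False)  # W^V
        self.w_o = nn.Linear(dim, dim, bias=False)  # W^O

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, n, C) -- n feature vectors x_i from the bottom layer
        b, n, c = x.shape
        # q_i = x_i W^Q, k_i = x_i W^K, v_i = x_i W^V, then split into heads
        q = self.w_q(x).view(b, n, self.num_heads, self.head_dim).transpose(1, 2)
        k = self.w_k(x).view(b, n, self.num_heads, self.head_dim).transpose(1, 2)
        v = self.w_v(x).view(b, n, self.num_heads, self.head_dim).transpose(1, 2)
        # a_ij = (q_i . k_j) / sqrt(d_z), then Softmax over j
        attn = (q @ k.transpose(-2, -1)) / self.head_dim ** 0.5
        attn = attn.softmax(dim=-1)
        # z_i = sum_j softmax(a_ij) v_j; concatenate heads and project with W^O
        z = (attn @ v).transpose(1, 2).reshape(b, n, c)
        return self.w_o(z)

# usage: 49 tokens of dimension 320 (e.g. a 7x7 deep feature map), 8 heads -- illustrative sizes
mhsa = MultiHeadSelfAttention(dim=320, num_heads=8)
print(mhsa(torch.randn(1, 49, 320)).shape)  # torch.Size([1, 49, 320])
```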
The invention introduces transfer learning into the training of the method, replacing the randomly initialized weights in the network with weights trained in advance, which accelerates convergence and shortens training time without affecting classification accuracy. Fine-tuning the parameters in the network is the most common learning strategy in transfer learning applications; it shortens the time spent on training the network and saves a great deal of computing resources, time, and space.
The multi-head self-attention adopted by the invention is the core idea of the Transformer and is used to compute the correlation between feature vectors, so that different self-attention heads focus on spatial information at different levels and long-range global dependencies are established between feature vectors, improving the feature extraction ability of the method and the accuracy of remote sensing image classification.
The invention introduces a residual structure and transfer learning during training to prevent network degradation and accelerate model convergence. The entire multi-head self-attention module normalizes the computed result and introduces a residual module to prevent the degradation problem of the network.
In a more specific technical scheme, a remote sensing image classification system based on a convolutional Transformer comprises:
a convolution module, used to perform the first-layer convolution on a feature map of a preset size to obtain a first-layer convolution feature map of a preset size;
an L-Conv module, used to receive the first-layer convolution feature map and process it with no fewer than 2 L-Conv modules to obtain an L-Conv-processed feature map, wherein each L-Conv module comprises a convolutional position encoding (CPE) layer and a local feature extraction (LFE) layer: the CPE layer obtains the absolute position information of the features in the image through depthwise convolution, and the LFE layer processes the first-layer convolution feature map with dimension-reducing depthwise separable convolution before it enters the second layer; the second-layer convolution has a stride of 2 and is applied to the L-Conv-processed feature map to obtain a second-layer convolution-processed feature map, which is further processed by no fewer than 2 L-Conv modules to obtain a second-layer convolution feature map; the L-Conv module is connected with the first-layer convolution module;
a Transformer module, used to receive the second-layer convolution feature map, wherein the Transformer module comprises a convolutional position encoding (CPE) layer and a global feature extraction (GFE) layer: the CPE layer encodes the position information of the features through depthwise convolution, and the GFE layer uses a multi-head self-attention layer to model the long-range global information of the deep image features; the second-layer convolution feature map is processed successively by a preset number of Transformer modules to obtain a lightweight convolution feature map, which is average-pooled and passed through a preset fully connected layer to output the final prediction result; the Transformer module is connected with the L-Conv module.
Compared with the prior art, the invention has the following advantages:
according to the method, the light-weight convolutional neural network is used for replacing CNN to extract the local features of the image, so that the separation of a channel and a region is realized in the convolution process, and the operation amount and the parameter quantity are reduced; secondly, inputting the obtained local features into a mixed network of a CNN (content-centric network) fused multi-head self-attention mechanism to enhance the extraction capability of the global features of the image; and finally, carrying out image classification on the obtained feature output. The L-CT (light weighted conditional transform, L-CT) can not only acquire local texture information of image features in a shallow layer, but also capture global semantic information of the image features in a deep layer, and has the advantages of CNN and a Transformer, so that the network is light and efficient, and the robustness of the algorithm is improved. Compared with other common image classification methods, the method can efficiently extract the local characteristic information and the long-distance global dependency information of various images including the remote sensing image while reducing the parameter number and the calculation cost, and can obtain higher image classification accuracy, thereby verifying the light weight, high efficiency and feasibility of the method.
To capture the positional relationships among feature vectors, the invention introduces position encoding into the L-CT; the position encoding can learn the temporal-dimension information and ordering relationships of the image features, and the absolute position information of the features can be accurately captured during feature extraction.
The present invention employs the popular convolutional position encoding (CPE) to obtain the position information of image features so as to adapt to inputs of different resolutions. CPE position-encodes the image features with depthwise convolution and thus helps each feature determine its absolute position; compared with a conventional convolution operation, it has a lower parameter count and computational cost and a smaller memory footprint. Thanks to the shared parameters and locality of convolution, CPE overcomes permutation invariance and is friendly to any input length.
Performing a convolution operation on the position-encoded feature map and linearly combining features at different depths with the pointwise convolution separates channels and regions during convolution; because the parameter count and computation of the depthwise separable convolution are lower than those of a standard convolution, the method's parameter count and computation are reduced and its running speed is improved.
Adding a 1 × 1 pointwise convolution to raise the dimension before the depthwise separable convolution lets the depthwise convolution capture rich semantic information in a high-dimensional space; in addition, the invention adds a residual function and a normalization structure to each output of the L-Conv module and the Transformer module, respectively, preventing network degradation and vanishing gradients.
By adding the regularization layer and residual connection layer before and after each sub-layer, the invention avoids vanishing gradients and network degradation; the GELU activation function is used for the linear transformation in the feed-forward network layer, and the stochastic regularization idea embedded in GELU strengthens the generalization ability of the method and optimizes the classification of remote sensing images.
The invention introduces transfer learning into the training of the method, replacing the randomly initialized weights in the network with weights trained in advance, which accelerates convergence and shortens training time without affecting classification accuracy. Fine-tuning the parameters in the network is the most common learning strategy in transfer learning applications; it shortens the time spent on training the network and saves a great deal of computing resources, time, and space.
The multi-head self-attention adopted by the invention is the core idea of the Transformer and is used to compute the correlation between feature vectors, so that different self-attention heads focus on spatial information at different levels and long-range global dependencies are established between feature vectors, improving the feature extraction ability of the method and the accuracy of remote sensing image classification.
The invention introduces a residual structure and transfer learning during training to prevent network degradation and accelerate model convergence; the entire multi-head self-attention module normalizes the computed result and introduces a residual module to prevent network degradation. The invention thus solves the prior-art problems of growing self-attention time complexity, high computational cost, low classification accuracy, and poor robustness.
Drawings
FIG. 1 is a schematic diagram of an overall L-CT network structure according to embodiment 1 of the present invention;
FIG. 2 is a schematic diagram of a network structure of convolutional position coding according to embodiment 2 of the present invention;
FIG. 3 is a schematic diagram of a deep separable convolutional network structure according to embodiment 2 of the present invention;
FIG. 4 is a schematic diagram of a self-attention network structure according to embodiment 2 of the present invention;
FIG. 5 is a schematic diagram of a multi-head self-attention network structure in embodiment 2 of the present invention;
FIG. 6 is a schematic diagram of a remote sensing image in the RSSCN7 data set according to embodiment 3 of the present invention;
fig. 7 is a schematic diagram of a remote sensing image in an AID dataset according to embodiment 3 of the present invention;
FIG. 8 is a graph of Loss versus Iteration time for example 3 of the present invention;
FIG. 9 is a graph showing the variation of Accuracy with Iteration time in embodiment 3 of the present invention;
FIG. 10 is a first comparison graph of the number of parameters and the computation of different classification methods on the RSSCN7 dataset according to embodiment 3 of the present invention;
FIG. 11 is a second comparison graph of the number of parameters and the computation of different classification methods on the RSSCN7 dataset according to embodiment 3 of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the embodiments of the present invention, and it is obvious that the described embodiments are some embodiments of the present invention, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Example 1
As shown in fig. 1, this embodiment describes the remote sensing image classification method based on a lightweight convolutional Transformer; the classification objects to which the method applies include, but are not limited to, remote sensing images. The main network structure of the method is as follows:
The lightweight convolutional Transformer (L-CT) is mainly composed of local convolution (L-Conv) modules and Transformer modules. The overall network structure of the L-CT is shown in FIG. 1.
The L-CT consists essentially of 4 layers, each of which contains a conventional convolution that changes the number of channels and the pixel size of the feature map. The convolution kernel of the first layer is 4 × 4 with stride 4 and 64 channels; convolving a 224 × 224 × 3 feature map with this first layer yields a 56 × 56 × 64 feature map, which is then input into an L-Conv module composed of a convolutional position encoding (CPE) layer and a local feature extraction (LFE) layer, as shown in fig. 1. The CPE layer uses depthwise convolution to obtain the absolute position information of the features in the image, and the LFE layer uses depthwise separable convolution to reduce the dimension of the feature map. After passing through 3 consecutive L-Conv modules, the feature map enters the second layer, whose convolution stride is 2; the feature map after the second-layer convolution is 28 × 28 × 128. The feature map then passes through 4 L-Conv modules in the same way and enters the third and fourth layers, both with convolution stride 2, and the convolved feature map is input into a Transformer module composed of a convolutional position encoding (CPE) layer and a global feature extraction (GFE) layer, as shown in fig. 1. The CPE layer encodes the position information of the features through depthwise convolution, and the GFE layer models the long-range global information of the deep image features by introducing multi-head self-attention on top of the CNN. After passing through 8 and 3 Transformer modules, respectively, the feature map finally enters an average pooling layer and a fully connected layer to output the final prediction result, as summarized in the sketch below. In addition, a residual structure and transfer learning are introduced during training to prevent network degradation and accelerate model convergence. The L-CT can capture local texture information of image features in the shallow layers and global semantic information in the deep layers, combining the advantages of CNNs and Transformers, so the network is lightweight and efficient.
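The stage layout described above can be summarized in a small configuration sketch that derives the spatial size after each stage. The channel widths of stages 3 and 4 are not stated in this paragraph and are marked below as assumptions for illustration only.

```python
# Stage layout of the L-CT backbone as described above (input 224x224x3).
# Stage-3/4 channel widths are assumptions, not values from the patent text.
stages = [
    # (block type, number of blocks, conv stride, output channels)
    ("L-Conv",      3, 4,  64),   # 4x4 conv, stride 4 -> 56x56x64
    ("L-Conv",      4, 2, 128),   # stride 2            -> 28x28x128
    ("Transformer", 8, 2, 256),   # stride 2 (channels assumed)
    ("Transformer", 3, 2, 512),   # stride 2 (channels assumed)
]

size = 224
for block, depth, stride, channels in stages:
    size //= stride
    print(f"{block:<11} x{depth}: {size}x{size}x{channels}")
# 56x56x64 -> 28x28x128 -> 14x14x256 -> 7x7x512, followed by
# global average pooling and a fully connected classification layer
```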
Example 2
Convolutional position encoding (CPE) layer:
As shown in FIG. 2, remote sensing images have a complex overall structure and rich texture features, so a great deal of position information exists among the feature vectors; to capture the positional relationships of the feature vectors, the invention introduces position encoding into the L-CT. The position encoding can learn the temporal-dimension information and ordering relationships of the image features and can accurately capture the absolute position information of the features during feature extraction. Traditional position encodings include absolute position encoding, learned position vectors, relative position representations, and so on, but they lack a certain flexibility. To adapt to inputs of different resolutions, the invention employs the popular convolutional position encoding (CPE) to obtain the position information of image features. CPE position-encodes the image features with depthwise convolution and thus helps each feature determine its absolute position; compared with a conventional convolution operation, its parameter count and computational cost are low and its memory footprint is reduced. Thanks to the shared parameters and locality of convolution, CPE overcomes permutation invariance and is friendly to any input length, so all features can progressively encode their own position information in a unified way by querying context information. The convolutional position encoding formula is as follows:

CPE(X_in) = DWConv(X_in)    (1)

where X_in ∈ R^{H×W×C}, H denotes the height of the feature map, W its width, and C the number of input channels; DWConv denotes depthwise convolution, with the convolution kernel size set to 3 × 3 and the stride set to 1. The network structure of convolutional position encoding is shown in fig. 2. The image feature position information is generated as follows: given an input image X, a feature extractor f_ec extracts 5 feature maps, recorded as {f_1, f_2, f_3, f_4, f_5}. The extracted feature maps are input into a position encoding module (PEM); a bilinear interpolation operation first brings them to the same spatial dimensions, and the same-spatial-dimension feature maps are then spliced to obtain F_cat. A series of k convolution operations (i.e., the transfer function τ_pos) then generates the position map M_pos. The specific formulas are as follows:

F_cat = Concat(f_1', f_2', f_3', f_4', f_5')    (2)

M_pos = τ_pos(F_cat)    (3)

where the parameters of τ_pos are trainable weights.
Local feature extraction (LFE) layer:
As shown in fig. 3, the LFE layer mainly mirrors the linear bottleneck structure of MobileNetV2. The linear bottleneck adopts a depthwise separable convolution to extract the local texture information of image features; the depthwise separable convolution comprises a depthwise convolution (DWConv) and a pointwise convolution (PWConv), where the pointwise convolution uses 1 × 1 kernels to linearly combine the feature maps on different channels and output the result. The position-encoded feature map is convolved, and the pointwise convolution linearly combines features at different depths, so channels and regions are separated during convolution, which reduces the method's parameter count and improves its running speed. The network structure of the depthwise separable convolution is shown in fig. 3. To let the depthwise convolution capture rich semantic information in a high-dimensional space, the invention adds a 1 × 1 pointwise convolution before the depthwise separable convolution to raise the dimension; in addition, to prevent network degradation and vanishing gradients, the invention adds a residual function and a normalization structure to each output of the L-Conv module and the Transformer module, respectively, as shown in fig. 1.
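A minimal sketch of such an LFE block follows, arranged as a MobileNetV2-style linear bottleneck: a 1 × 1 pointwise expansion, a 3 × 3 depthwise convolution, a 1 × 1 pointwise projection, and a residual connection. The expansion factor, normalization layers, and activation function are assumptions for illustration, not values specified in the patent.

```python
import torch
import torch.nn as nn

class LocalFeatureExtraction(nn.Module):
    """Depthwise separable convolution with a 1x1 expansion (linear bottleneck sketch)."""
    def __init__(self, channels: int, expansion: int = 4):
        super().__init__()
        hidden = channels * expansion
        self.block = nn.Sequential(
            nn.Conv2d(channels, hidden, kernel_size=1),        # pointwise expand
            nn.BatchNorm2d(hidden),
            nn.GELU(),
            nn.Conv2d(hidden, hidden, kernel_size=3, padding=1,
                      groups=hidden),                          # depthwise convolution
            nn.BatchNorm2d(hidden),
            nn.GELU(),
            nn.Conv2d(hidden, channels, kernel_size=1),        # pointwise project
            nn.BatchNorm2d(channels),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.block(x)   # residual connection to prevent network degradation

x = torch.randn(1, 64, 56, 56)
print(LocalFeatureExtraction(64)(x).shape)  # torch.Size([1, 64, 56, 56])
```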
To demonstrate the lightweight advantage of the depthwise separable convolutional network, assume that the depthwise convolution kernel size is D_K × D_K, the pointwise convolution size is 1 × 1, the number of input channels is C, the number of output channels is N, and the feature map size is D_F × D_F. The parameter counts and computation of the depthwise separable convolution are then expressed as:

N_DW = D_K · D_K · C    (4)
N_PW = C · N    (5)
C_DW = D_K · D_K · (D_F − D_K + 1) · (D_F − D_K + 1) · C    (6)
C_PW = D_F · D_F · C · N    (7)
N_DPW = N_DW + N_PW    (8)
C_DPW = C_DW + C_PW    (9)

where N_DPW and C_DPW are the parameter count and computation of the depthwise separable convolution, N_DW and C_DW those of the depthwise convolution, and N_PW and C_PW those of the pointwise convolution. To verify that the parameters and computation of a depthwise separable convolution are less than those of a standard convolution, assume that the kernel size of the standard convolution is D_K × D_K, the number of input channels is C, the number of output channels is N, and the feature map size is D_F × D_F. The parameter count and computation of the standard convolution are then expressed as:

N_Std = N · C · D_K · D_K    (10)
C_Std = D_K · D_K · (D_F − D_K + 1) · (D_F − D_K + 1) · C · N    (11)

m = N_DPW / N_Std = 1/N + 1/D_K²,  n = C_DPW / C_Std    (12)

where N_Std and C_Std are the parameter count and computation of the standard convolution, and m and n are the ratios of the parameters and of the computation of the depthwise separable convolution to those of the standard convolution. The number of output channels N is usually set fairly large, so the parameters and computation of the depthwise separable convolution shrink to roughly 1/D_K² of those of the standard convolution, achieving the effect of reducing the parameter count and the amount of computation.
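A short numeric check of the formulas above is given below, assuming a 3 × 3 kernel (D_K = 3), 64 input channels, 128 output channels, and a 56 × 56 feature map; these example values are illustrative only.

```python
# Parameter and computation counts from equations (4)-(11) for one example setting.
D_K, C, N, D_F = 3, 64, 128, 56           # kernel size, in/out channels, feature-map size

n_dw = D_K * D_K * C                      # depthwise parameters
n_pw = C * N                              # pointwise parameters
c_dw = D_K * D_K * (D_F - D_K + 1) ** 2 * C
c_pw = D_F * D_F * C * N
n_dpw, c_dpw = n_dw + n_pw, c_dw + c_pw   # depthwise separable totals

n_std = N * C * D_K * D_K                 # standard convolution
c_std = D_K * D_K * (D_F - D_K + 1) ** 2 * C * N

print(f"params:      {n_dpw} vs {n_std}  "
      f"(ratio {n_dpw / n_std:.3f}; 1/N + 1/D_K^2 = {1 / N + 1 / D_K ** 2:.3f})")
print(f"computation: {c_dpw} vs {c_std}  (ratio {c_dpw / c_std:.3f})")
```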
Global feature extraction (GFE) layer:
As shown in fig. 4, the GFE layer mainly adopts a Transformer network structure. The Transformer comprises two sub-layers, a multi-head self-attention layer and a feed-forward network layer, and a regularization layer and a residual connection layer are added before and after each sub-layer to avoid vanishing gradients and network degradation. Multi-head self-attention is the core idea of the Transformer; it mainly consists of several self-attention heads connected together and is used to compute the correlation between feature vectors, so that different self-attention heads focus on spatial information at different levels and long-range global dependencies are established between the feature vectors, improving the feature extraction ability of the method. Each self-attention head computes the similarity between features by means of the q, k, v vectors.
In FIG. 4, x_i, i ∈ (1,2,3,4,...,n), is the feature map of the bottom input, and W^Q, W^K, W^V are parameter matrices trained by the method. First, the weight coefficients W^Q, W^K, W^V are each multiplied with x_i, i ∈ (1,2,3,4,...,n), to obtain q_i, k_i, v_i, i ∈ (1,2,3,4,...,n):

q_i = x_i W^Q    (13)
k_i = x_i W^K    (14)
v_i = x_i W^V    (15)

Next, a dot-product operation between q_i, i ∈ (1,2,3,4,...,n), and k_j, j ∈ (1,2,3,4,...,n), gives a_ij, i, j ∈ (1,2,3,4,...,n):

a_ij = (q_i · k_j) / √(d_z)    (16)

where d_z denotes the dimension of q and k, preventing the value from exploding as the vectors grow.
Then the Softmax function processes a_ij, i, j ∈ (1,2,3,4,...,n), so that the resulting â_ij all lie between 0 and 1; the formula is as follows:

â_ij = exp(a_ij) / Σ_j exp(a_ij)    (17)

Finally, each â_ij is multiplied with the v_j, j ∈ (1,2,3,4,...,n), at the corresponding position to obtain the output vector z = (z_1, z_2, z_3, z_4, ..., z_n); z_i is computed as follows:

z_i = Σ_j â_ij v_j    (18)

As shown in fig. 5, multi-head self-attention is obtained by introducing multiple heads on top of self-attention; its network structure is shown in fig. 5. If the q_i, k_i, v_i, i ∈ (1,2,3,4,...,n), obtained above are regarded as one "head" as a whole, then "multi-head" means that multiple groups of W^Q, W^K, W^V are multiplied with x_i, i ∈ (1,2,3,4,...,n), to obtain multiple groups of q_i, k_i, v_i, i ∈ (1,2,3,4,...,n). Since the computation of each head is the same, the obtained heads are spliced and the result is input into a fully connected layer for a linear operation, giving the final multi-head self-attention value. The formulas are as follows:

MultiHead(Q, K, V) = Concat(h_1, h_2, ..., h_m) W^O    (19)

h_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)    (20)

where Q, K, V are the query vector, key vector, and value vector, respectively, W^O is the output matrix obtained from the attention calculation, h_i is the i-th head, and m denotes the number of heads. The entire multi-head self-attention module normalizes the computed result and introduces a residual module to prevent the degradation problem of the network.
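As a quick numeric check of the single-head computation in formulas (16)-(18), the following sketch evaluates the scaled dot products, the Softmax weights, and the output vectors for two feature vectors; the values are chosen arbitrarily for illustration.

```python
import torch

# Two feature vectors with q, k, v already computed (d_z = 2, arbitrary values).
q = torch.tensor([[1.0, 0.0], [0.0, 1.0]])
k = torch.tensor([[1.0, 1.0], [0.0, 2.0]])
v = torch.tensor([[1.0, 2.0], [3.0, 4.0]])

a = q @ k.T / (2 ** 0.5)   # eq. (16): scaled dot products a_ij
w = a.softmax(dim=-1)      # eq. (17): each row sums to 1, values lie in (0, 1)
z = w @ v                  # eq. (18): z_i = sum_j w_ij * v_j
print(a, w, z, sep="\n")
```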
Transfer learning:
training a neural network completely is often a tedious process, not only does it require enough data sets to speed up the training and converge the network, but it also consumes significant computational resources and training time, and even causes an overfitting phenomenon. Based on this, in the application of the remote sensing image, the invention introduces the transfer learning in the method training process, replaces the weight value initialized randomly in the network with the weight value trained in advance, accelerates the convergence speed and reduces the training time. The method for finely adjusting the parameters in the network is the most common learning strategy in the transfer learning application, can shorten the time spent by the network method training, and saves a large amount of computing resources, time and space cost.
Transfer learning has found widespread and successful applications in the field of computer vision. Recent studies have shown that the feature mobility increases as the difference between the target data and the training data of the pre-training method decreases, demonstrating that even if there is a significant domain difference between the two, the effect is better than before. When a transfer learning strategy is applied in a real scene, the problems that whether the pre-training method can meet the requirements of a network framework and how to perform parameter fine adjustment when the pre-training method cannot meet the requirements need to be considered, and the like, are required to be considered, so that the method achieves a better effect. Therefore, the invention uses the similar task as a pre-training method to respectively perform cross migration learning on the RSSCN7 and AID remote sensing image data sets, so as to shorten the training time under the condition of not influencing the classification accuracy.
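The sketch below illustrates only the weight-replacement idea: load a checkpoint pre-trained on one dataset, drop the classification head whose size differs between AID (30 classes) and RSSCN7 (7 classes), and fine-tune with a small learning rate. The stand-in model, the checkpoint file name, and the "head" prefix are hypothetical; the patent does not provide an implementation.

```python
import os
import torch
import torch.nn as nn

# Placeholder standing in for the L-CT network described above (7 RSSCN7 classes);
# the real backbone is not reproduced here.
model = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(3, 7))

# Hypothetical checkpoint pre-trained on the AID dataset (file name is illustrative).
ckpt = "lct_pretrained_aid.pth"
if os.path.exists(ckpt):
    state = torch.load(ckpt, map_location="cpu")
    # Drop the classification head, whose size differs between AID (30) and RSSCN7 (7).
    state = {k: v for k, v in state.items() if not k.startswith("head.")}
    model.load_state_dict(state, strict=False)   # transfer the remaining weights

# Fine-tune all parameters with a small learning rate, as in the experiments below.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
```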
Example 3
Experimental analysis:
Experimental data:
As shown in fig. 6 and fig. 7, the RSSCN7 and AID remote sensing image datasets are used as experimental data. The RSSCN7 dataset contains 2800 remote sensing images divided into 7 typical scene categories, namely grassland, forest, farmland, parking lot, residential area, industrial area, and river/lake; each scene category contains 400 remote sensing images of size 400 × 400. The remote sensing image data in the RSSCN7 dataset are shown in fig. 6. The AID dataset contains 10000 remote sensing images divided into 30 scene categories such as airport, beach, commercial area, forest, and bridge; each scene category contains 220 to 420 remote sensing images of size 600 × 600. The remote sensing image data in the AID dataset are shown in fig. 7.
Because the scene categories in the RSSCN7 and AID remote sensing image datasets are inconsistent (7 for RSSCN7 and 30 for AID), the two datasets are used separately for parameter training and accuracy testing of the network. Each dataset is split in a 7:3 ratio, i.e., 70% of the data is randomly drawn as the training set and 30% as the validation set used to verify the network's classification accuracy.
In the experiments, classification methods such as ResNet50, VGG16, and ViT are selected as comparison baselines to verify the feasibility and efficiency of the remote sensing image classification method based on the lightweight convolutional Transformer. Because the size of a remote sensing image imposes certain restrictions when it is input into some neural networks and affects classification accuracy, image preprocessing is required before the images are input into the network. After preprocessing, the resolution of the remote sensing images in the RSSCN7 and AID datasets is 224 × 224.
The invention mainly carries out three groups of experiments: 1) loading the parameters trained by the method and testing the remote sensing image classification accuracy on the selected validation data of the RSSCN7 and AID datasets; 2) varying the batch size and testing the effect of different batch sizes on the remote sensing image classification results; 3) testing the parameter count and training speed for remote sensing image classification.
Setting experimental environment and hyper-parameters:
the invention selects a pytorech frame for experiment, uses python3.8 for programming, uses AMD Ryzen 9 3900X as a CPU in hardware equipment, has the memory size of 32GB, uses GeForce RTX 2080Ti as a GPU model, has the display memory size of 11GB, and has Ubuntu18.04 as a system environment. The method optimization is carried out by using an Adam optimizer in the training process, and the same learning rate is used for each parameter in the model, wherein the learning rate is set to be 0.0001, the iteration number is set to be 100, the Batch _ size is set to be 32, and the image resolution is set to be 224 multiplied by 224.
And (3) analyzing an experimental result:
the RSSCN7 and AID remote sensing image data sets are used as the data sets of the experiment. And respectively dividing each scene category in the two data sets into a training set and a verification set according to the proportion of 7:3 after image preprocessing, firstly using 70% of the training set to carry out method parameter training, and then using the remaining 30% of the verification set to test the network classification accuracy.
Comparing experimental results of different classification methods on a remote sensing data set:
in order to verify the feasibility and the effectiveness of the method on the remote sensing image data set, classification methods such as ResNet50, VGG16 and ViT are selected as comparison experiments. The experimental results of different classification methods on RSSCN7 and AID remote sensing image data sets are shown in table 1. As can be seen from Table 1, the method of the invention has obvious effect compared with other image classification methods, achieves the highest classification accuracy on RSSCN7 and AID remote sensing image data sets, respectively 98.21% and 96.90%, and respectively improves the highest classification accuracy by 1.31% and 0.94% compared with other classification methods. ViT has relatively low classification accuracy on RSSCN7 and AID remote sensing image data sets, because ViT has a relatively obvious classification effect on a large data set, and has a relatively poor classification effect and relatively low accuracy on a small data set. The ResNet50 classification method and the VGG16 mainly utilize the convolution layer to extract rich local information, but have poor long-distance global modeling capability on remote sensing images, and cannot extract complete and rich image information, so that the classification effect is not ideal. The method of the invention inputs the local features of the remote sensing image extracted by the lightweight convolutional neural network into the mixed network of the CNN fused multi-head self-attention mechanism, so that the local features and the global features of the remote sensing image can be effectively extracted at the same time, and higher classification accuracy is realized.
TABLE 1 comparison of results on remote sensing image datasets for different classification methods
Effect of different batch sizes on the remote sensing image classification results:
To further verify that the method is more robust than other image classification methods, additional experiments examine the effect of different batch sizes on the classification of the remote sensing image datasets for the different classification methods. The batch sizes selected in the experiments are 4, 8, 16, and 32, and the datasets are the same as in the classification accuracy test. The effect of different batch sizes on the classification of the remote sensing image datasets for the different methods is shown in Tables 2, 3, 4, and 5.
The results in the tables show that for input batch sizes of 4, 8, 16, and 32, the method of the invention achieves clearly better results than the other image classification methods, reaching the highest classification accuracy on the RSSCN7 and AID datasets: 97.38% and 96.10%, 97.02% and 96.63%, 97.97% and 96.66%, and 98.21% and 96.90%, respectively, which is 3.69% and 3.24%, 0.48% and 1.07%, 1.55% and 0.76%, and 1.31% and 0.94% higher than the best accuracy of the other image classification methods, further verifying the strong robustness of the method. ResNet50 and VGG16 can effectively obtain local features and reduce local redundancy through convolution, but their limited receptive field prevents them from capturing global dependencies; ViT can capture the long-range dependence of features through self-attention, but it only works well on large image datasets and is not friendly to small and medium-sized ones. In comparison, the method of the invention introduces a lightweight convolutional neural network into the shallow network to effectively obtain the local features of remote sensing images and introduces multi-head self-attention on top of the CNN in the deep network to enhance extraction of their global features, so it can both extract local features and model global information over long distances; it is therefore more robust and more accurate than the other image classification methods.
As can be seen from the data in the tables, when the input batch size is 32 the classification performance of the method on the RSSCN7 and AID remote sensing image data sets is best, at 98.21% and 96.90% respectively. During training, an overly small input batch size makes the training time too long and causes gradient oscillation, which is not conducive to the convergence of the model parameters.
TABLE 2 comparison of the classification results of different classification methods for a batch size of 4
TABLE 3 comparison of the classification results of different classification methods for a batch size of 8
TABLE 4 comparison of the classification results of different classification methods when the batch size is 16
TABLE 5 comparison of the classification results of different classification methods for a batch size of 32
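For readers who wish to reproduce the batch-size comparison summarized in Tables 2-5, a minimal sketch of such a sweep is given below. It assumes PyTorch and uses synthetic tensors and a tiny placeholder classifier in place of the RSSCN7/AID images and the network of the invention; it only illustrates the experimental protocol of retraining with batch sizes 4, 8, 16 and 32.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Synthetic stand-in data; the real experiments use RSSCN7 / AID images.
train_set = TensorDataset(torch.randn(256, 3, 64, 64), torch.randint(0, 7, (256,)))
test_set = TensorDataset(torch.randn(64, 3, 64, 64), torch.randint(0, 7, (64,)))

def build_model():
    # Tiny placeholder classifier; the L-CT network described above would go here.
    return nn.Sequential(
        nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 7),
    )

def run(batch_size, epochs=2):
    model = build_model()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()
    loader = DataLoader(train_set, batch_size=batch_size, shuffle=True)
    for _ in range(epochs):
        for x, y in loader:
            optimizer.zero_grad()
            loss_fn(model(x), y).backward()
            optimizer.step()
    # Evaluate with a fixed batch size so only the training batch size varies.
    model.eval()
    correct = 0
    with torch.no_grad():
        for x, y in DataLoader(test_set, batch_size=32):
            correct += (model(x).argmax(dim=1) == y).sum().item()
    return correct / len(test_set)

for bs in (4, 8, 16, 32):
    print(f"training batch size {bs:>2}: test accuracy = {run(bs):.3f}")
```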
Testing the parameter count and training speed for remote sensing image classification:
As shown in fig. 8 and fig. 9, in this experiment the training speed refers to the speed at which the parameters of the method are fitted when training with the training set data; the curves of the Loss value and the Accuracy against the number of iterations during training of the method of the invention are shown in fig. 8 and fig. 9 respectively. As can be seen from fig. 8, the convergence speed of the method of the invention on the RSSCN7 remote sensing image data set is relatively high: at about the 7th iteration the model begins to converge gradually, the oscillation becomes weaker and weaker, and the loss finally stabilizes at about 2.0. The convergence speed on the AID remote sensing image data set is lower: at about the 22nd iteration the model begins to converge, and the loss finally stabilizes at about 6.3.
As shown in fig. 10 and fig. 11, training and testing are performed with the remote sensing data sets, and the parameter count and computation cost of ResNet50, VGG16, ViT and the method of the invention are compared on the RSSCN7 remote sensing image data set. As can be seen from the figures, the parameter count of the method of the invention is 22M and its computation cost is 3.6 GFLOPs, which is 3.56M fewer parameters than the smallest of the other classification methods; its computation speed is essentially the same as that of ResNet50 and is greatly improved compared with the other classification methods. The method effectively reduces the parameter count, improves the computation speed and reduces the space complexity.
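The parameter count and a rough inference speed of any candidate network can be checked with a few lines of PyTorch. The sketch below uses a torchvision ResNet-50 purely as a stand-in model, so the printed numbers are not those reported above; the same check applies to any nn.Module, including a re-implementation of the network described here.

```python
import time
import torch
import torchvision

# Stand-in backbone; replace with the network under test.
model = torchvision.models.resnet50().eval()

n_params = sum(p.numel() for p in model.parameters())
print(f"parameters: {n_params / 1e6:.2f} M")

x = torch.randn(1, 3, 224, 224)
with torch.no_grad():
    for _ in range(3):                       # warm-up iterations
        model(x)
    start = time.perf_counter()
    for _ in range(10):
        model(x)
    elapsed = (time.perf_counter() - start) / 10
print(f"mean forward-pass time: {elapsed * 1000:.1f} ms")
```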
The invention provides a remote sensing image classification method based on a lightweight convolutional Transformer. The method first extracts local features with a lightweight convolutional neural network, then feeds the obtained local features into a hybrid network in which CNN is fused with a multi-head self-attention mechanism to enhance the extraction of global image features; transfer learning is then introduced in the training process to accelerate convergence; finally, classification prediction is carried out on the obtained feature output. Experiments on the RSSCN7 and AID remote sensing image data sets show that the accuracy of the remote sensing image classification method based on the lightweight convolutional Transformer reaches 98.21% and 96.90% respectively.
In summary, the invention first uses a lightweight convolutional neural network in place of a standard CNN to extract the local features of the image, so that channels and regions are separated during convolution and the computation and parameter counts are reduced; secondly, the obtained local features are fed into a hybrid network in which CNN is fused with a multi-head self-attention mechanism to enhance the extraction of global image features; finally, image classification is carried out on the obtained feature output. The L-CT (Lightweight Convolutional Transformer) can acquire local texture information of the image features in the shallow layers and capture global semantic information of the image features in the deep layers. It combines the advantages of CNN and Transformer, making the network lightweight and efficient and improving the robustness of the algorithm.
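A structural sketch of such a two-stage network is given below, assuming PyTorch. The channel widths, block counts, normalization placement and attention configuration are illustrative assumptions, not the exact configuration of the invention; the sketch only shows how a shallow lightweight-convolution stage can be combined with a deeper stage that adds multi-head self-attention, followed by average pooling and a fully connected classifier.

```python
import torch
import torch.nn as nn

class LConvBlock(nn.Module):
    """Shallow-stage block: depthwise-conv position encoding plus a depthwise-separable local feature extractor."""
    def __init__(self, dim, expansion=4):
        super().__init__()
        self.cpe = nn.Conv2d(dim, dim, 3, padding=1, groups=dim)      # convolutional position encoding
        hidden = dim * expansion
        self.lfe = nn.Sequential(
            nn.Conv2d(dim, hidden, 1), nn.GELU(),                     # 1x1 point-wise expansion
            nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden),   # depthwise convolution
            nn.Conv2d(hidden, dim, 1),                                # 1x1 point-wise projection
        )
        self.norm = nn.BatchNorm2d(dim)

    def forward(self, x):
        x = x + self.cpe(x)                   # inject absolute position information
        return self.norm(x + self.lfe(x))     # residual + normalization

class TransformerBlock(nn.Module):
    """Deep-stage block: position encoding, multi-head self-attention and a GELU feed-forward network."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.cpe = nn.Conv2d(dim, dim, 3, padding=1, groups=dim)
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, dim * 4), nn.GELU(), nn.Linear(dim * 4, dim))

    def forward(self, x):
        x = x + self.cpe(x)
        b, c, h, w = x.shape
        t = x.flatten(2).transpose(1, 2)      # (B, H*W, C) token sequence
        q = self.norm1(t)
        t = t + self.attn(q, q, q)[0]         # residual multi-head self-attention
        t = t + self.ffn(self.norm2(t))       # residual feed-forward sub-layer
        return t.transpose(1, 2).reshape(b, c, h, w)

class LCT(nn.Module):
    """Shallow lightweight-convolution stage followed by a deep CNN + self-attention stage."""
    def __init__(self, num_classes=7, dims=(64, 128)):
        super().__init__()
        self.stem = nn.Conv2d(3, dims[0], 3, stride=2, padding=1)
        self.stage1 = nn.Sequential(*[LConvBlock(dims[0]) for _ in range(2)])
        self.down = nn.Conv2d(dims[0], dims[1], 3, stride=2, padding=1)   # stride-2 transition
        self.stage2 = nn.Sequential(*[TransformerBlock(dims[1]) for _ in range(2)])
        self.head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                  nn.Linear(dims[1], num_classes))

    def forward(self, x):
        return self.head(self.stage2(self.down(self.stage1(self.stem(x)))))

print(LCT()(torch.randn(1, 3, 64, 64)).shape)   # torch.Size([1, 7])
```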
In order to capture the positional relationships between feature vectors, the invention introduces position encoding into the L-CT. The position encoding can learn the temporal-dimension information and the order relationships of the image features, so that the absolute position information of the features can be captured accurately during feature extraction.
The present invention employs the popular convolutional position encoding (CPE) to obtain the position information of the image features, in order to adapt to inputs of different resolutions. CPE encodes the position of the image features with a depthwise convolution, which helps each feature determine its absolute position; compared with a conventional convolution operation, CPE has a lower parameter count and computation cost and occupies less memory. Owing to the shared parameters and locality of convolution, CPE overcomes permutation invariance and is friendly to inputs of any length.
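Taking the formula CPE(X) = DWConv(X) of claim 2 at face value, a minimal sketch of convolutional position encoding is a single depthwise 3×3 convolution whose output is added back to the input; the channel count and the residual addition below are illustrative assumptions.

```python
import torch
import torch.nn as nn

# CPE(X) = DWConv(X): a depthwise 3x3 convolution whose zero-padding leaks
# absolute position information into each feature, added back as a residual.
x = torch.randn(1, 64, 56, 56)                       # (B, C, H, W) feature map
dwconv = nn.Conv2d(64, 64, kernel_size=3, padding=1, groups=64)
x = x + dwconv(x)                                    # position-encoded features
print(x.shape)                                       # torch.Size([1, 64, 56, 56])
```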
The position-encoded feature map is then convolved, and features at different depths are linearly combined by point-wise convolution, so that channels and regions are separated during the convolution of the feature map. The depthwise separable convolution has fewer parameters and less computation than a standard convolution, which reduces the parameter count of the method and increases its running speed.
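The parameter saving can be checked directly. The sketch below compares a standard 3×3 convolution with a depthwise-plus-point-wise factorization for an illustrative 128-channel layer; the channel count is an assumption, not a figure from the invention.

```python
import torch.nn as nn

c_in, c_out, k = 128, 128, 3

standard = nn.Conv2d(c_in, c_out, k, padding=1)
separable = nn.Sequential(
    nn.Conv2d(c_in, c_in, k, padding=1, groups=c_in),  # depthwise: spatial filtering per channel
    nn.Conv2d(c_in, c_out, 1),                          # point-wise: linear combination across channels
)

count = lambda m: sum(p.numel() for p in m.parameters())
print("standard  :", count(standard))    # 128*128*9 + 128 = 147,584
print("separable :", count(separable))   # (128*9 + 128) + (128*128 + 128) = 17,792
```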
Before the depthwise separable convolution, a 1 × 1 point-wise convolution is added to raise the dimension, so that the depthwise convolution can capture rich semantic information in a high-dimensional space; in addition, the invention adds a residual function and a normalization structure to each output of the L-Conv module and the Transformer module respectively, preventing network degradation and vanishing gradients.
In the invention, a regularization layer and a residual connection layer are added before and after each sub-layer to avoid vanishing gradients and network degradation, and the feed-forward network layer performs its linear transformations with the GELU activation function; the idea of stochastic regularization embodied in GELU enhances the generalization capability of the method and improves the classification of remote sensing images.
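A minimal sketch of such a feed-forward sub-layer, assuming a pre-normalization arrangement (the exact placement of the regularization layer in the invention may differ), is:

```python
import torch
import torch.nn as nn

class FeedForwardSubLayer(nn.Module):
    """Feed-forward sub-layer with layer normalization, GELU activation and a residual connection."""
    def __init__(self, dim, hidden_ratio=4):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(
            nn.Linear(dim, dim * hidden_ratio),
            nn.GELU(),                        # smooth activation with a stochastic-regularization interpretation
            nn.Linear(dim * hidden_ratio, dim),
        )

    def forward(self, x):                     # x: (B, N, dim) token sequence
        return x + self.ffn(self.norm(x))     # residual connection around the normalized sub-layer

print(FeedForwardSubLayer(128)(torch.randn(2, 196, 128)).shape)   # torch.Size([2, 196, 128])
```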
The invention introduces transfer learning into the training process, using weights trained in advance in place of the randomly initialized weights of the network, which accelerates convergence and reduces training time without affecting classification accuracy. Fine-tuning the parameters of a network in this way is the most common learning strategy in transfer learning applications; it shortens the time spent on training and saves a large amount of computing resources, time and space cost.
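A minimal sketch of this fine-tuning strategy in PyTorch is shown below; a torchvision ResNet-50 with ImageNet weights stands in for the pretrained network purely to illustrate the pattern of loading transferred weights, replacing the classification head and fine-tuning only part of the parameters.

```python
import torch
import torch.nn as nn
from torchvision import models

# Start from pretrained weights instead of random initialization (stand-in backbone).
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
model.fc = nn.Linear(model.fc.in_features, 7)          # new head, e.g. 7 RSSCN7 scene classes

# Fine-tune only the deeper layers and the new head; earlier layers keep the transferred weights.
for name, param in model.named_parameters():
    param.requires_grad = name.startswith(("layer4", "fc"))

optimizer = torch.optim.AdamW((p for p in model.parameters() if p.requires_grad), lr=1e-4)
```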
The multi-head self-attention adopted by the invention is the core idea of the Transformer and is used to compute the degree of correlation between feature vectors, so that different self-attention heads focus on spatial information at different levels and long-distance global dependencies are established between feature vectors, which improves the feature extraction capability of the method and the accuracy of remote sensing image classification.
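A compact sketch of multi-head self-attention following this description (q_i = x_i W_Q, k_i = x_i W_K, v_i = x_i W_V, scaled dot products, Softmax weighting, concatenation of heads and an output projection W^O) is given below; the token count and dimensions are illustrative.

```python
import math
import torch
import torch.nn as nn

class MultiHeadSelfAttention(nn.Module):
    """Minimal multi-head self-attention: q = xW_Q, k = xW_K, v = xW_V per head."""
    def __init__(self, dim, heads):
        super().__init__()
        assert dim % heads == 0
        self.heads, self.d = heads, dim // heads
        self.W_q = nn.Linear(dim, dim, bias=False)
        self.W_k = nn.Linear(dim, dim, bias=False)
        self.W_v = nn.Linear(dim, dim, bias=False)
        self.W_o = nn.Linear(dim, dim, bias=False)       # output matrix W^O

    def forward(self, x):                                # x: (B, N, dim) token sequence
        B, N, _ = x.shape
        split = lambda t: t.view(B, N, self.heads, self.d).transpose(1, 2)   # (B, h, N, d)
        q, k, v = split(self.W_q(x)), split(self.W_k(x)), split(self.W_v(x))
        a = q @ k.transpose(-2, -1) / math.sqrt(self.d)  # scaled dot products a_ij
        z = a.softmax(dim=-1) @ v                        # Softmax-weighted sum of value vectors
        z = z.transpose(1, 2).reshape(B, N, -1)          # concatenate the heads
        return self.W_o(z)

tokens = torch.randn(2, 196, 128)                        # e.g. a 14x14 feature map flattened to tokens
print(MultiHeadSelfAttention(128, 4)(tokens).shape)      # torch.Size([2, 196, 128])
```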
The invention introduces a residual structure and transfer learning into the training process to prevent network degradation and accelerate the convergence of the model. The result computed by the whole multi-head self-attention module is normalized, and a residual module is introduced to prevent network degradation. The invention thereby solves the technical problems in the prior art of increased self-attention time complexity, high computation cost, low classification accuracy and low robustness.
The above examples are only intended to illustrate the technical solution of the present invention, and not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not depart from the spirit and scope of the corresponding technical solutions.

Claims (10)

1. A remote sensing image classification method based on convolution Transformer, characterized by comprising the following steps:
S1, performing first-layer convolution on a feature map with a preset size to obtain a first-layer convolution feature map with the preset size;
S2, inputting the first-layer convolution feature map into an L-Conv module and processing it with not less than 2 L-Conv modules to obtain an L-Conv-processed feature map, wherein the L-Conv module comprises a convolutional position encoding CPE layer and a local feature extraction LFE layer, the convolutional position encoding CPE layer obtains the absolute position information of the features in the image through depthwise convolution, and the local feature extraction LFE layer processes the first-layer convolution feature map through dimensionality reduction with depthwise separable convolution; the result then enters a second layer whose convolution stride is 2, second-layer convolution processing is performed on the L-Conv-processed feature map to obtain a second-layer convolution-processed feature map, and the second-layer convolution-processed feature map is further processed by not less than 2 L-Conv modules to obtain a second-layer convolution feature map;
S3, inputting the second-layer convolution feature map into a Transformer module, wherein the Transformer module comprises a lightweight convolutional position encoding CPE layer and a global feature extraction GFE layer, the lightweight convolutional position encoding CPE layer encodes the position information of the features through depthwise convolution, and the global feature extraction GFE layer uses a multi-head self-attention layer to model the long-distance global information of the deep image features; the second-layer convolution feature map is further processed by a preset number of Transformer modules to obtain a lightweight convolution feature map, the lightweight convolution feature map is average-pooled, and the final prediction result is output after processing by a preset fully connected layer.
2. The remote sensing image classification method based on convolution Transformer according to claim 1, characterized in that the convolutional position encoding (CPE) layer performs position encoding on the image features by using depthwise convolution according to the following logic:
CPE(X_in) = DWConv(X_in)
wherein X_in ∈ R^(H×W×C), H denotes the height of the feature map, W denotes the width of the feature map, C denotes the number of input channels, and DWConv denotes depthwise convolution.
3. The remote sensing image classification method based on convolution Transformer as claimed in claim 1, wherein the step S2 includes:
S21, giving an input image;
S22, extracting 5 feature maps from the input image through a feature extractor f_ec;
S23, inputting the extracted feature maps into a position encoding module (PEM), and processing them by bilinear interpolation to obtain feature maps of the same spatial dimension;
S24, splicing the feature maps of the same spatial dimension to obtain a spliced feature map, and performing a k convolution operation on the spliced feature map to generate a position map.
4. The remote sensing image classification method based on convolution Transformer as claimed in claim 3, wherein the step S24 includes:
S241, splicing the feature maps of the same spatial dimension, thereby obtaining the spliced feature map;
S242, performing a k convolution operation on the spliced feature map to generate the position map.
5. The remote sensing image classification method based on convolution Transformer, characterized in that the local feature extraction LFE layer adopts depthwise separable convolution to extract the local texture information of the image features, wherein the depthwise separable convolution comprises a depthwise convolution and a point-wise convolution, the point-wise convolution uses a 1 × 1 convolution kernel to linearly combine and output the feature maps on different channels, the position-encoded feature map is convolved, the point-wise convolution is used to linearly combine features at different depths, and channels and regions are separated in the convolution process of the feature map.
6. The remote sensing image classification method based on convolution Transformer according to claim 5, characterized in that a 1 × 1 point-wise convolution is added to raise the dimension before the depthwise separable convolution is performed.
7. The remote sensing image classification method based on convolution Transformer as claimed in claim 1, characterized in that a residual function and a normalization structure are added to each output of the L-Conv module and the Transformer module respectively.
8. The remote sensing image classification method based on convolution Transformer as claimed in claim 1, characterized in that Global Feature Extraction (GFE) layer adopts Transformer network structure, and the Transformer includes: the multi-head self-attention layer and the feedforward network layer are respectively added with a regularization layer and a residual connecting layer before and after each sub-layer, and the feedforward network layer carries out linear transformation by using a GELU activation function.
9. The remote sensing image classification method based on convolution Transformer as claimed in claim 1, wherein the step S3 includes:
S31, performing cross transfer learning on the RSSCN7 and AID remote sensing image data sets respectively to obtain pre-trained weight coefficients, and multiplying the weight coefficients W_Q, W_K, W_V respectively with x_i, i ∈ (1,2,3,4,...,n), according to the following logic to obtain feature vectors q_i, k_i, v_i, i ∈ (1,2,3,4,...,n):
q_i = x_i W_Q
k_i = x_i W_K
v_i = x_i W_V
wherein x_i, i ∈ (1,2,3,4,...,n), is the feature map input at the bottom layer, and W_Q, W_K, W_V are parameter matrices trained by the method;
S32, performing a dot-product operation on the feature vector q_i, i ∈ (1,2,3,4,...,n), and the feature vector k_j, j ∈ (1,2,3,4,...,n), according to the following logic to obtain the vector dot product a_ij, i, j ∈ (1,2,3,4,...,n):
a_ij = (q_i · k_j) / √(d_z)
wherein d_z represents the dimension of q and k, and the scaling prevents the values from exploding as the vector dimension grows;
S33, applying the Softmax function to the vector dot products a_ij, i, j ∈ (1,2,3,4,...,n), to obtain the Softmax processing result â_ij:
â_ij = softmax(a_ij) = exp(a_ij) / Σ_j exp(a_ij)
S34, multiplying the Softmax processing result â_ij by the feature vector v_j, j ∈ (1,2,3,4,...,n), corresponding to each position and summing, according to the following logic, to obtain the output vector z = (z_1, z_2, z_3, z_4, ..., z_n):
z_i = Σ_j â_ij v_j
S35, if the feature vectors q_i, k_i, v_i, i ∈ (1,2,3,4,...,n), are regarded as the data of one head, multiplying not less than 2 groups of weight coefficients W_Q, W_K, W_V with x_i, i ∈ (1,2,3,4,...,n), to obtain multi-head feature vectors q_i, k_i, v_i, i ∈ (1,2,3,4,...,n);
S36, splicing the multi-head feature vectors q_i, k_i, v_i, i ∈ (1,2,3,4,...,n), according to the following logic to obtain the multi-head self-attention result, and inputting it into the fully connected layer for a linear operation, thereby obtaining the final multi-head self-attention value:
MultiHead(Q, K, V) = Concat(h_1, h_2, ..., h_m) W^O
wherein Q, K, V are the query vector, key vector and value vector respectively, W^O is the output matrix obtained in computing attention, h_i, i ∈ (1,2,...,m), denotes the output of the i-th attention head, and m is the number of heads;
and S37, carrying out normalization processing and residual processing on the final multi-head self-attention value to obtain the final prediction result.
10. A system for classifying remote sensing images based on convolution Transformer, the system comprising:
a convolution module, used for performing first-layer convolution on a feature map with a preset size to obtain a first-layer convolution feature map with the preset size;
an L-Conv module, configured to receive the first-layer convolution feature map and process it with not less than 2 L-Conv modules to obtain an L-Conv-processed feature map, wherein the L-Conv module comprises a convolutional position encoding CPE layer and a local feature extraction LFE layer, the convolutional position encoding CPE layer obtains the absolute position information of the features in the image through depthwise convolution, and the local feature extraction LFE layer processes the first-layer convolution feature map through dimensionality reduction with depthwise separable convolution; the result then enters a second layer whose convolution stride is 2, second-layer convolution processing is performed on the L-Conv-processed feature map to obtain a second-layer convolution-processed feature map, and the second-layer convolution-processed feature map is further processed by not less than 2 L-Conv modules to obtain a second-layer convolution feature map, the L-Conv module being connected with the first-layer convolution module;
a Transformer module, configured to receive the second-layer convolution feature map, wherein the Transformer module comprises a lightweight convolutional position encoding CPE layer and a global feature extraction GFE layer, the lightweight convolutional position encoding CPE layer encodes the position information of the features through depthwise convolution, and the global feature extraction GFE layer uses a multi-head self-attention layer to model the long-distance global information of the deep image features; the second-layer convolution feature map is further processed by a preset number of Transformer modules to obtain a lightweight convolution feature map, the lightweight convolution feature map is average-pooled and processed by a preset fully connected layer to output the final prediction result, the Transformer module being connected with the L-Conv module.
CN202210562253.2A 2022-05-23 2022-05-23 Remote sensing image classification method and system based on convolution Transformer Pending CN115690479A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210562253.2A CN115690479A (en) 2022-05-23 2022-05-23 Remote sensing image classification method and system based on convolution Transformer

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210562253.2A CN115690479A (en) 2022-05-23 2022-05-23 Remote sensing image classification method and system based on convolution Transformer

Publications (1)

Publication Number Publication Date
CN115690479A true CN115690479A (en) 2023-02-03

Family

ID=85060442

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210562253.2A Pending CN115690479A (en) 2022-05-23 2022-05-23 Remote sensing image classification method and system based on convolution Transformer

Country Status (1)

Country Link
CN (1) CN115690479A (en)

Cited By (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116129228A (en) * 2023-04-19 2023-05-16 中国科学技术大学 Training method of image matching model, image matching method and device thereof
CN116704363A (en) * 2023-05-22 2023-09-05 中国地质大学(武汉) Deep learning model, land coverage classification method and device
CN116704363B (en) * 2023-05-22 2024-01-26 中国地质大学(武汉) Land coverage classification method and device based on deep learning model
CN116758353B (en) * 2023-06-20 2024-01-23 大连理工大学 Remote sensing image target classification method based on domain specific information filtering
CN116469132A (en) * 2023-06-20 2023-07-21 济南瑞泉电子有限公司 Fall detection method, system, equipment and medium based on double-flow feature extraction
CN116469132B (en) * 2023-06-20 2023-09-05 济南瑞泉电子有限公司 Fall detection method, system, equipment and medium based on double-flow feature extraction
CN116758353A (en) * 2023-06-20 2023-09-15 大连理工大学 Remote sensing image target classification method based on domain specific information filtering
CN116721313A (en) * 2023-06-21 2023-09-08 广东工业大学 RGB image hyperspectral reconstruction method and system based on unsupervised learning
CN116721313B (en) * 2023-06-21 2024-10-11 广东工业大学 RGB image hyperspectral reconstruction method and system based on unsupervised learning
CN116824407A (en) * 2023-06-21 2023-09-29 深圳市华赛睿飞智能科技有限公司 Target detection method, device and equipment based on patrol robot
CN116758038A (en) * 2023-06-25 2023-09-15 深圳市眼科医院(深圳市眼病防治研究所) Infant retina disease information identification method and system based on training network
CN116721302A (en) * 2023-08-10 2023-09-08 成都信息工程大学 Ice and snow crystal particle image classification method based on lightweight network
CN116721302B (en) * 2023-08-10 2024-01-12 成都信息工程大学 Ice and snow crystal particle image classification method based on lightweight network
CN117173854A (en) * 2023-09-13 2023-12-05 西安博深安全科技股份有限公司 Coal mine open fire early warning method and system based on deep learning
CN117173854B (en) * 2023-09-13 2024-04-05 西安博深安全科技股份有限公司 Coal mine open fire early warning method and system based on deep learning
CN117132606B (en) * 2023-10-24 2024-01-09 四川大学 Segmentation method for lung lesion image
CN117132606A (en) * 2023-10-24 2023-11-28 四川大学 Segmentation method for lung lesion image
CN117436452B (en) * 2023-12-15 2024-02-23 西南石油大学 Financial entity identification method integrating context awareness and multi-level features
CN117436452A (en) * 2023-12-15 2024-01-23 西南石油大学 Financial entity identification method integrating context awareness and multi-level features
CN118038192A (en) * 2024-04-11 2024-05-14 康亲(厦门)控股有限公司 Method, device, equipment and storage medium for classifying convolution tongue images of few-sample images
CN118279907A (en) * 2024-06-03 2024-07-02 菏泽学院 Chinese herbal medicine image recognition system based on Transformer and CNN
CN118279907B (en) * 2024-06-03 2024-08-09 菏泽学院 Chinese herbal medicine image recognition system based on Transformer and CNN

Similar Documents

Publication Publication Date Title
CN115690479A (en) Remote sensing image classification method and system based on convolution Transformer
Jia et al. A semisupervised Siamese network for hyperspectral image classification
Wu et al. UIU-Net: U-Net in U-Net for infrared small object detection
CN108537742B (en) Remote sensing image panchromatic sharpening method based on generation countermeasure network
CN113936339B (en) Fighting identification method and device based on double-channel cross attention mechanism
Song et al. A survey of remote sensing image classification based on CNNs
CN110135366B (en) Shielded pedestrian re-identification method based on multi-scale generation countermeasure network
Othman et al. Domain adaptation network for cross-scene classification
Yu et al. Convolutional neural networks for water body extraction from Landsat imagery
Dong et al. Local information-enhanced graph-transformer for hyperspectral image change detection with limited training samples
Girisha et al. Uvid-net: Enhanced semantic segmentation of uav aerial videos by embedding temporal information
Combinido et al. A convolutional neural network approach for estimating tropical cyclone intensity using satellite-based infrared images
Komorowski et al. Minkloc++: lidar and monocular image fusion for place recognition
CN112347888B (en) Remote sensing image scene classification method based on bi-directional feature iterative fusion
Xu et al. Robust self-ensembling network for hyperspectral image classification
CN113705580B (en) Hyperspectral image classification method based on deep migration learning
CN114187450A (en) Remote sensing image semantic segmentation method based on deep learning
CN114511735A (en) Hyperspectral image classification method and system of cascade empty spectral feature fusion and kernel extreme learning machine
CN117475216A (en) Hyperspectral and laser radar data fusion classification method based on AGLT network
CN115546640A (en) Cloud detection method and device for remote sensing image, electronic equipment and storage medium
CN115578574B (en) Three-dimensional point cloud completion method based on deep learning and topology perception
Patil et al. Semantic segmentation of satellite images using modified U-Net
Hu et al. Detection of Tea Leaf Blight in Low-Resolution UAV Remote Sensing Images
CN114463340B (en) Agile remote sensing image semantic segmentation method guided by edge information
CN118314353B (en) Remote sensing image segmentation method based on double-branch multi-scale feature fusion

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination