CN117197663A - Multi-layer fusion picture classification method and system based on a long-distance dependency mechanism
Abstract
The invention discloses a multi-layer fusion picture classification method and system based on a long-distance dependency mechanism. The method comprises the following steps: constructing a multi-scale pyramid feature extraction model with multi-kernel grouped convolution and extracting multi-scale features; applying a long-distance attention mechanism that extracts long-range context information using the step-size-selected multi-head attention mechanism, dynamically learned position information, and multi-layer perceptron of a vision transformer; unifying the scales of the intermediate features of different levels with convolution and pooling; fusing the intermediate features of different levels with a vision transformer attention mechanism; taking the intermediate fused features as guidance and using a distillation loss to guide the learning of the global features; and performing the final picture classification from the learned global features. The method effectively addresses classification problems with high inter-class similarity, large intra-class difference, and varying target scales, and improves both the accuracy and the speed of small-sample picture classification.
Description
Technical Field
The invention relates to remote sensing scene classification technology, in particular to a multi-layer fusion picture classification method and system based on a long-distance dependency mechanism.
Background
Remote sensing image scene classification refers to inferring semantic labels from the content of a remote sensing scene, and has been a research hotspot and important topic in the field of remote sensing image interpretation in recent years. It plays a vital role in fields such as land resource planning, urban planning, and traffic control. Remote sensing scene images mostly originate from satellite photography, and with the continuous progress of remote sensing imaging technology, the images often contain complex and diverse terrain and land features. This causes the classification problems of high inter-class similarity between remote sensing images, large intra-class variability, and large scale variation.
In recent years, deep learning methods have developed rapidly in the field of remote sensing image scene classification, and feature representations have become increasingly rich. Deep feature extraction with neural networks has gradually replaced early hand-crafted design, aiming to extract richer feature representations and to explore feature fusion and enhancement. Xiaoqiang Lu et al. [Xiaoqiang Lu et al., "A feature aggregation convolutional neural network for remote sensing scene classification," IEEE Trans. Geosci. Remote Sens., vol. 57, no. 10, pp. 7894-7906, 2019] found that features at different layers of a convolutional neural network contain different spatial and semantic information, and that aggregating intermediate features (convolutional features and fully connected layer features) significantly improves remote sensing classification accuracy. Dan Cuiping et al. [A multi-branch feature fusion remote sensing scene image classification method based on attention mechanism: CN202110192358.9 [P]. 2021-05-28] proposed a strategy for fusing multi-branch features: after extracting feature information from multiple branches, the branch information is re-extracted with an attention mechanism, and the multiple feature segments are finally fused. With the advance of deep learning, the vision transformer has emerged, achieving outstanding results in natural language processing and being widely applied to visual tasks. Hao Saiyuan et al. [A dual-stream Swin Transformer remote sensing scene classification method: CN202210372827.X [P]. 2022-08-23] extract original features and edge features through two Swin Transformers with the same structure and fuse the two, improving classification performance.
The above methods consider the complexity of remote sensing image scene classification from different angles: the first two enhance the feature representation capability of convolutional networks through feature fusion, attention mechanisms, and the like, while the last captures long-distance dependencies with a vision transformer. However, the advantages of the two paradigms have not been well combined, and long-distance feature capture alone cannot remedy the convolutional neural network's limited ability to capture global features.
Disclosure of Invention
The invention aims to provide a multi-layer fusion picture classification method and system based on a long-distance dependency mechanism, which make full use of the rich semantic features of the intermediate layers and use long-distance dependencies to handle the classification problems of high inter-class similarity, large intra-class difference, and large scale difference in image classification, with excellent classification performance.
The technical solution for realizing the purpose of the invention is as follows: in a first aspect, the present invention provides a multi-layer fusion picture classification method based on a long-distance dependency mechanism, including the following steps:
firstly, constructing a data preprocessing module by utilizing downsampling and layer normalization to realize image preprocessing;
secondly, grouping the feature maps with multi-kernel grouped convolution blocks, each group independently adopting a different convolution kernel, to construct a multi-scale pyramid feature extraction module and realize multi-scale feature extraction;
thirdly, taking the intermediate three-layer features extracted by the feature extraction module in the second step as input to a long-distance attention mechanism module, and learning the long-distance dependency features of each level using the step-size-selected multi-head attention mechanism, dynamic position learning, and multi-layer perceptron of a vision transformer;
fourthly, constructing a scale normalization module with convolution and pooling operations, and performing scale normalization on the intermediate-layer long-distance dependency feature maps to unify the features to the same size;
fifthly, using the multi-layer feature fusion module to compute the similarity relationship between the first-layer and second-layer same-size features with the vision transformer attention mechanism principle, and adding the result to the third-layer features to obtain the fused features;
sixthly, the classification module sequentially applies pooling, L2 normalization, a fully connected layer, and a Softmax classifier to the fused features to obtain the class scores, and sequentially applies pooling, a fully connected layer, and a Softmax classifier to the global features to obtain their class scores; the relationship between the global features and the fused features is computed through a knowledge distillation loss, and the loss is updated to guide the learning of the global features and obtain the final classification.
Further, in the first step, a data preprocessing module comprising downsampling and layer normalization operations is adopted to implement data preprocessing. The specific process is as follows:
first, an initial preprocessing operation is performed on an H×W×N image, where H is the picture height, W the picture width, and N the number of channels; a convolution layer with kernel size 7×7, C output channels, and stride 2, followed by layer normalization, realizes a 2× downsampling operation and adjusts the image to H/2×W/2×C, where C is the feature dimension.
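As a concrete illustration, this preprocessing stem can be sketched in PyTorch as follows (a minimal sketch; the module name, the channels-last LayerNorm placement, and the default sizes are assumptions, not taken from the patent):

```python
import torch
import torch.nn as nn

class PreprocessStem(nn.Module):
    """7x7 convolution with stride 2 (2x downsampling) + layer normalization."""
    def __init__(self, in_channels: int = 3, dim: int = 64):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, dim, kernel_size=7, stride=2, padding=3)
        self.norm = nn.LayerNorm(dim)   # normalizes over the channel dimension

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.conv(x)                # B x C x H/2 x W/2
        x = x.permute(0, 2, 3, 1)       # channels last for LayerNorm
        x = self.norm(x)
        return x.permute(0, 3, 1, 2)    # back to B x C x H/2 x W/2

# e.g. a 224x224x3 image becomes 112x112x64:
# PreprocessStem()(torch.randn(1, 3, 224, 224)).shape -> (1, 64, 112, 112)
```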
Further, the features preprocessed in the first step are taken as input, a multi-scale pyramid structure is constructed with multi-kernel grouped convolution blocks, and multi-scale intermediate features are extracted. The specific process is as follows:
(1) One multi-kernel grouped convolution block is composed as follows:
first, a dimension-adjusting convolution layer with kernel size 1×1, stride 1, and zero padding, followed by a normalization layer and activation function; the feature maps are then split into groups, each group passing through its own convolution kernel with stride 2, after which the group outputs are concatenated and passed through a normalization layer and activation function; a 1×1 convolution layer then restores the channels to 4 times the input dimension, again followed by a normalization layer and activation function; finally, the original input tensor, downsampled by a pooling layer and adjusted to 4 times the input dimension by a 1×1 convolution and normalization layer, is added as a shortcut to give the final output;
(2) Four stages of multi-kernel grouped convolution blocks are stacked in total. In the first stage the block is stacked 3 times, with 4 groups {G1, G2, G3, G4} and kernel sizes {K1×K1, K2×K2, K3×K3, K4×K4}; the output size is H/4×W/4×4C. The second stage starts with one dimension-adjusting convolution layer (dimension 2C) and stacks the block 4 times, with 3 groups {G1, G2, G3} and kernel sizes {K1×K1, K2×K2, K3×K3}; the output size is H/8×W/8×8C. The third stage starts with one dimension-adjusting convolution layer (dimension 4C) and stacks the block 6 times, with 2 groups {G1, G2} and kernel sizes {K1×K1, K2×K2}; the output size is H/16×W/16×16C. The fourth stage starts with one dimension-adjusting convolution layer (dimension 8C) and stacks the block 3 times, with 1 group G1 and kernel size 3×3; the output size is H/32×W/32×32C.
The outputs of the first, second, and third stages serve as inputs to the long-distance attention mechanism module, while the fourth-stage output enters the classification module as the global feature. A sketch of one such block is given below.
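A minimal PyTorch sketch of one multi-kernel grouped convolution block follows. It reads the text as splitting the channels into equal groups with one kernel size per group; the normalization/activation choices, the pooling type in the shortcut, and the convention that only the first block of a stage uses stride 2 are assumptions:

```python
import torch
import torch.nn as nn

class MultiKernelGroupedBlock(nn.Module):
    def __init__(self, dim: int, kernel_sizes=(3, 5, 7, 9),
                 stride: int = 2, expansion: int = 4):
        super().__init__()
        g = len(kernel_sizes)
        assert dim % g == 0, "channels must split evenly across groups"
        # 1x1 dimension-adjusting convolution + normalization + activation
        self.reduce = nn.Sequential(
            nn.Conv2d(dim, dim, 1), nn.BatchNorm2d(dim), nn.ReLU(inplace=True))
        # one convolution branch per channel group, each with its own kernel
        self.branches = nn.ModuleList([
            nn.Conv2d(dim // g, dim // g, k, stride=stride, padding=k // 2)
            for k in kernel_sizes])
        self.post = nn.Sequential(nn.BatchNorm2d(dim), nn.ReLU(inplace=True))
        # 1x1 convolution restoring 4x the input dimension
        self.expand = nn.Sequential(
            nn.Conv2d(dim, expansion * dim, 1),
            nn.BatchNorm2d(expansion * dim), nn.ReLU(inplace=True))
        # shortcut: pooling downsample + 1x1 convolution + normalization
        self.shortcut = nn.Sequential(
            nn.AvgPool2d(stride, stride) if stride > 1 else nn.Identity(),
            nn.Conv2d(dim, expansion * dim, 1), nn.BatchNorm2d(expansion * dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = self.reduce(x)
        chunks = torch.chunk(y, len(self.branches), dim=1)   # group the maps
        y = torch.cat([b(c) for b, c in zip(self.branches, chunks)], dim=1)
        y = self.expand(self.post(y))
        return y + self.shortcut(x)                          # residual sum
```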
Further, the features extracted by the first, second, and third stages of the multi-scale pyramid feature extraction module of the second step are taken as input, and the multi-head attention and multi-layer perceptron of the vision transformer are used to learn the long-distance dependency features of each level. The specific process is as follows:
(1) The H/4×W/4×4C feature map is flattened and normalized to (H/4·W/4)×4C; the tokens are then selected according to the step size S to obtain H·W/(16·S²)×S²×4C, and the fully connected layer of the multi-head attention plus a reshape give (H/4·W/4)/S²×n×S²×4C/n, where n is the number of attention heads; next, tensors of sizes (H/4·W/4)/S²×n×S²×4C/n and (H/4·W/4)/S²×n×4C/n×S² are multiplied, giving a feature of size (H/4·W/4)/S²×n×S²×S²;
(2) Meanwhile, dynamic position information is learned over a window [1-S, S]; the (H/4·W/4)/S²×n×S²×S² and (H/4·W/4)/S²×n×S²×4C/n tensors are then multiplied and reshaped to size (H/4·W/4)×4C;
(3) The result is added to the initially flattened features and fed into a multi-layer perceptron, passing sequentially through a fully connected layer, activation layer, dropout layer, fully connected layer, and dropout layer, giving a size of (H/4·W/4)×4C;
(4) Finally, the (H/4·W/4)×4C output is reshaped to H/4×W/4×4C and added to the stage-one output features H/4×W/4×4C, giving the final output H/4×W/4×4C;
(5) The second and third levels are processed identically to the first; see the sketch below.
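The shape bookkeeping above is easier to follow in code. Below is a hedged PyTorch sketch of one long-distance attention block: the grouping rule (tokens spaced H/S apart share a group, so each group of S² tokens spans the whole map) is inferred from the stated tensor sizes, and the dynamic position term over the [1-S, S] window is omitted for brevity:

```python
import torch
import torch.nn as nn

class LongDistanceAttention(nn.Module):
    def __init__(self, dim: int, heads: int = 4, step: int = 4, p_drop: float = 0.1):
        super().__init__()
        self.heads, self.step = heads, step
        self.norm = nn.LayerNorm(dim)
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)
        self.mlp = nn.Sequential(                 # FC, act, drop, FC, drop
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Dropout(p_drop),
            nn.Linear(4 * dim, dim), nn.Dropout(p_drop))

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: B x H x W x C
        b, h, w, c = x.shape
        s, n = self.step, self.heads
        g = (h // s) * (w // s)                   # number of token groups
        tokens = x.reshape(b, h * w, c)
        # select by step: positions h//s (and w//s) apart land in one group,
        # so each group of s*s tokens is spread across the whole feature map
        sel = x.reshape(b, s, h // s, s, w // s, c)
        sel = sel.permute(0, 2, 4, 1, 3, 5).reshape(b, g, s * s, c)
        q, k, v = self.qkv(self.norm(sel)).chunk(3, dim=-1)
        q = q.reshape(b, g, s * s, n, c // n).transpose(2, 3) * (c // n) ** -0.5
        k = k.reshape(b, g, s * s, n, c // n).transpose(2, 3)
        v = v.reshape(b, g, s * s, n, c // n).transpose(2, 3)
        attn = (q @ k.transpose(-2, -1)).softmax(dim=-1)   # B x g x n x s^2 x s^2
        out = (attn @ v).transpose(2, 3).reshape(b, g, s * s, c)
        # undo the step selection, then residual + MLP as in steps (3)-(4)
        out = out.reshape(b, h // s, w // s, s, s, c)
        out = out.permute(0, 3, 1, 4, 2, 5).reshape(b, h * w, c)
        out = tokens + self.proj(out)
        out = out + self.mlp(out)
        return out.reshape(b, h, w, c)
```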
Further, the long-distance dependency features extracted in the third step are taken as input, a scale normalization module is constructed with convolution and pooling operations, and the long-distance dependency feature maps of the first three intermediate stages are scale-normalized and unified to the same size. The specific process is as follows:
(1) First, 1×1 convolutions uniformly adjust the feature dimensions of the three stages to 8C, giving sizes H/4×W/4×8C, H/8×W/8×8C, and H/16×W/16×8C respectively;
(2) The H/4×W/4×8C and H/8×W/8×8C features are then pooled with pooling kernels of 4×4 and 2×2 respectively, unifying all features to size H/16×W/16×8C.
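A short sketch of this scale normalization, assuming average pooling (the patent does not name the pooling type) and the 8C = 512 width used later in the embodiment:

```python
import torch.nn as nn

class ScaleNormalize(nn.Module):
    """Align three pyramid levels to the H/16 x W/16 x 8C shape."""
    def __init__(self, dims=(256, 512, 1024), out_dim=512):
        super().__init__()
        self.align = nn.ModuleList([nn.Conv2d(d, out_dim, 1) for d in dims])
        self.pool = nn.ModuleList([nn.AvgPool2d(4, 4),     # H/4  -> H/16
                                   nn.AvgPool2d(2, 2),     # H/8  -> H/16
                                   nn.Identity()])         # H/16 stays
    def forward(self, feats):   # feats: [stage-1, stage-2, stage-3] maps
        return [p(a(f)) for a, p, f in zip(self.align, self.pool, feats)]
```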
Further, in the fifth step, the multi-layer feature fusion module is mainly composed of flattening, reshaping, fully connected layers, and tensor multiplication and addition. It computes the similarity relationship between the first-layer and second-layer same-size features using the vision transformer principle and adds the result to the third-layer features to obtain the fused features. The specific process is as follows:
(1) The scale-normalized H/16×W/16×8C features of the first and second stages are flattened to (H/16·W/16)×8C;
(2) The flattened first-stage features are passed through the fully connected layer of the multi-head attention and reshaped into a tensor q1 of size (H/16·W/16)×n×8C/n, where n is the number of attention heads; the flattened second-stage features are passed through two fully connected layers of the multi-head attention and reshaped into tensors k1 and v1 of size (H/16·W/16)×n×8C/n;
(3) q1 and k1 are tensor-multiplied, passed through a Softmax activation function and a dropout layer, then multiplied with v1; a fully connected layer, a dropout layer, and a reshape adjust the result to H/16×W/16×8C;
(4) Finally, this is added to the third-stage H/16×W/16×8C features to obtain the final fused feature H/16×W/16×8C.
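A sketch of this fusion step, with level 1 supplying the query and level 2 the key/value, and the level-3 features added at the end (the head count, dropout rate, and scaling factor are assumptions):

```python
import torch
import torch.nn as nn

class FusionAttention(nn.Module):
    """Level-1 features query level-2 features; result is added to level 3."""
    def __init__(self, dim: int = 512, heads: int = 1, p_drop: float = 0.1):
        super().__init__()
        self.heads = heads
        self.q = nn.Linear(dim, dim)
        self.kv = nn.Linear(dim, dim * 2)
        self.proj = nn.Linear(dim, dim)
        self.drop = nn.Dropout(p_drop)

    def forward(self, f1, f2, f3):          # each: B x C x H x W
        b, c, h, w = f1.shape
        n = self.heads
        t1 = f1.flatten(2).transpose(1, 2)  # B x HW x C
        t2 = f2.flatten(2).transpose(1, 2)
        q = self.q(t1).reshape(b, -1, n, c // n).transpose(1, 2)
        k, v = self.kv(t2).chunk(2, dim=-1)
        k = k.reshape(b, -1, n, c // n).transpose(1, 2)
        v = v.reshape(b, -1, n, c // n).transpose(1, 2)
        attn = ((q @ k.transpose(-2, -1)) * (c // n) ** -0.5).softmax(dim=-1)
        out = (self.drop(attn) @ v).transpose(1, 2).reshape(b, h * w, c)
        out = self.drop(self.proj(out)).transpose(1, 2).reshape(b, c, h, w)
        return out + f3                     # add the third-level features
```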
Further, the classification module sequentially applies pooling, L2 normalization, a fully connected layer, and a Softmax classifier to the features fused in the fifth step to obtain the class scores, and sequentially applies pooling, a fully connected layer, and a Softmax classifier to the global features from the fourth stage of the second step to obtain their class scores; the relationship between the global features and the fused features is computed through a knowledge distillation loss, and the loss is updated to guide the learning of the global features and obtain the final classification. The specific process is as follows:
(1) Taking the output of the fifth step as input, a 1×1 pooling operation, L2 normalization, and a fully connected layer are applied in sequence to obtain the network output logits_t; a Softmax classifier then produces the class scores S_tsoft at the parameter T = t, computed as
p_i = exp(z_i / T) / Σ_{j=1}^{Cn} exp(z_j / T)
where z_i is the logit of the i-th class, Cn is the total number of output classes, T is the temperature parameter, and p_i is the probability of class i;
(2) Taking the output of the second step as input, a 1×1 pooling operation and a fully connected layer are applied in sequence to obtain the network output logits_s; a Softmax classifier then produces the class scores S_ssoft and S_shard at the parameters T = t and T = 1 respectively;
(3) The KL divergence between S_tsoft and S_ssoft gives loss_KD:
loss_KD = KL(U ‖ V) = Σ_i U(i) · log(U(i) / V(i))
where U and V are the two probability distributions;
(4) The total loss function is
loss = (1 - λ) × S_shard + λ × T × loss_KD, where λ and T are parameters;
the loss is updated to guide the learning of the global features and obtain the final classification.
In a second aspect, the present invention provides a multi-layer fusion picture classification system based on a long-distance dependency mechanism, comprising:
the data preprocessing module, constructed with downsampling and layer normalization, which realizes image preprocessing;
the multi-scale pyramid feature extraction module, which groups the feature maps with multi-kernel grouped convolution blocks, each group independently adopting a different convolution kernel, to realize multi-scale feature extraction;
the long-distance attention mechanism module, which learns the long-distance dependency features of each level using the long-distance-window-selected multi-head attention mechanism, dynamic position learning, and multi-layer perceptron of the vision transformer;
the scale normalization module, constructed with convolution and pooling operations, which scale-normalizes the intermediate-layer long-distance dependency feature maps and unifies the features to the same size;
the multi-layer feature fusion module, mainly composed of fully connected layers, which computes the similarity relationship between the first-layer and second-layer same-size features with the vision transformer attention mechanism principle and adds the result to the third-layer features to obtain the fused features;
and the classification module, which takes the fused features as a guide, computes the relationship between the global features and the fused features through a knowledge distillation loss, updates the loss to guide the learning of the global features, and performs the final classification of the pictures.
In a third aspect, the present invention provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the method of the first aspect.
In a fourth aspect, the invention provides a computer program product comprising a computer program which, when executed by a processor, implements the steps of the method of the first aspect.
Compared with the prior art, the remarkable features of the invention are: (1) a multi-scale pyramid network is constructed with multi-kernel grouped convolution and features are extracted with a pre-trained model, so the extracted features are richer; (2) the long-distance attention mechanism module acquires the long-distance dependency relationships of all levels using the long-distance step-size-selected multi-head attention mechanism, dynamic position learning, and multi-layer perceptron of the vision transformer; (3) the dependency relationships among different layers are computed with the vision transformer attention mechanism principle to obtain the fused features; (4) with the fused features as a guide, the relationship between the global features and the fused features is computed through a knowledge distillation loss, and the loss is updated to guide the learning of the global features and realize the final classification of the pictures. Combining the long-distance dependency capturing capability of the vision transformer with the rich semantic information of the intermediate layers of the convolutional neural network improves classification accuracy.
The invention is described in further detail below with reference to the accompanying drawings.
Drawings
FIG. 1 is a block diagram of the method of the present invention.
FIG. 2 is a structural diagram of the multi-kernel grouped convolution block.
FIG. 3 is a structural diagram of the dynamic position information learning.
FIG. 4 is a structural diagram of the multi-layer perceptron.
FIG. 5 is the classification confusion matrix of the method of the present invention on the AID (30-class) dataset with a 50% training split.
FIG. 6 is the classification confusion matrix of the method of the present invention on the NWPU-RESISC45 dataset with a 20% training split.
Detailed Description
Compared with the prior art, the proposed multi-layer fusion picture classification method and system based on a long-distance dependency mechanism use a pre-trained multi-scale pyramid convolution network as the feature extractor and combine the long-distance-window-selected multi-head attention mechanism, dynamic position learning, and multi-layer perceptron of a vision transformer to extract the long-distance semantic relationships of the intermediate layers. The vision transformer then fuses the intermediate-layer features, and the fused features guide the learning of the global features through a distillation loss, after which the final classification is performed, improving classification accuracy.
The implementation of the present invention will be described in detail with reference to FIG. 1.
The multi-layer fusion picture classification method based on the long-distance dependency mechanism comprises the following steps:
In the first step, a data preprocessing module comprising cropping, flipping, standardization, downsampling, and layer normalization operations is adopted to implement data preprocessing. The specific process is as follows:
(1) The original image is data-augmented: it is cut to 224×224, flipped horizontally at random with probability 0.5, and then standardized to a normal distribution with mean = [0.485, 0.456, 0.406] and std = [0.229, 0.224, 0.225]. The standardization formula is:
input[channel] = (input[channel] - mean[channel]) / std[channel]
For the 224×224×3 images, a convolution layer with kernel size 7×7, 64 output channels, and stride 2, followed by layer normalization, realizes a 2× downsampling operation and adjusts the images to 112×112×64, where 64 is the feature dimension;
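In torchvision, the augmentation part of this step could look like the following (a sketch; whether the 224×224 cut is a random crop or a resize is not specified, so the crop flavor here is an assumption):

```python
import torchvision.transforms as T

preprocess = T.Compose([
    T.RandomCrop(224),                 # cut to 224 x 224 (crop flavor assumed)
    T.RandomHorizontalFlip(p=0.5),     # random horizontal flip, probability 0.5
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406],   # input = (input - mean) / std,
                std=[0.229, 0.224, 0.225]),   # applied per channel
])
```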
In the second step, the features preprocessed in the first step are taken as input, a multi-scale pyramid structure is constructed with multi-kernel grouped convolution blocks, and multi-scale intermediate features are extracted. The specific process is described with reference to FIG. 2:
(1) One multi-kernel grouped convolution block is composed as follows:
first, a dimension-adjusting convolution layer with kernel size 1×1, stride 1, and zero padding, followed by a normalization layer and activation function; the feature maps are then split into groups, each group passing through its own convolution kernel with stride 2, after which the group outputs are concatenated and passed through a normalization layer and activation function; a 1×1 convolution layer then restores the channels to 4 times the input dimension, again followed by a normalization layer and activation function; finally, the original input tensor, downsampled by a pooling layer and adjusted to 4 times the input dimension by a 1×1 convolution and normalization layer, is added as a shortcut to give the final output;
(2) Four stages of multi-kernel grouped convolution blocks are stacked. In the first stage the block is stacked 3 times, with 4 groups {1, 4, 8, 16} and kernel sizes {3×3, 5×5, 7×7, 9×9}; the output size is 56×56×256. The second stage starts with one dimension-adjusting convolution layer (dimension 128) and stacks the block 4 times, with 3 groups {1, 4, 8} and kernel sizes {3×3, 5×5, 7×7}; the output size is 28×28×512. The third stage starts with one dimension-adjusting convolution layer (dimension 256) and stacks the block 6 times, with 2 groups {1, 4} and kernel sizes {3×3, 5×5}; the output size is 14×14×1024. The fourth stage starts with one dimension-adjusting convolution layer (dimension 512) and stacks the block 3 times, with 1 group {1} and kernel size 3×3; the output size is 7×7×2048.
The outputs of the first, second, and third stages serve as inputs to the long-distance attention mechanism module, while the fourth-stage output enters the classification module as the global feature.
In the third step, the features extracted by the first, second, and third stages of the multi-scale pyramid feature extraction module of the second step are taken as input, and the multi-head attention and multi-layer perceptron of the vision transformer are used to learn the long-distance dependency features of each level. The specific process is as follows:
(1) The 56×56×256 feature map extracted by the first stage of the multi-scale pyramid feature extraction module is flattened and normalized to 3136×256, then reshaped by selecting the feature map according to a window step size of 4 to obtain 196×16×256; the fully connected layer of the multi-head attention and a reshape then give 196×4×16×64, where 4 is the number of attention heads. Next, the 196×4×16×64 tensor is multiplied with a 196×4×64×16 tensor, giving a feature of size 196×4×16×16;
(2) Meanwhile, dynamic position information is learned over a [-3, 4] window; the 196×4×16×16 and 196×4×16×64 tensors are multiplied again and reshaped to size 3136×256; the structure of the dynamic position information learning is shown in FIG. 3;
(3) The result is added to the initially flattened features and fed into a multi-layer perceptron, passing sequentially through a fully connected layer, activation layer, dropout layer, fully connected layer, and dropout layer, giving a size of 3136×256;
(4) Finally, 3136×256 is reshaped to 56×56×256 and added to the first-stage output features 56×56×256, giving the output 56×56×256;
(5) The 28×28×512 feature map extracted by the second stage is flattened and normalized to 784×512, then reshaped by selecting the feature map according to a step size of 4 to obtain 49×16×512; the fully connected layer of the multi-head attention and a reshape give 49×4×16×128, where 4 is the number of attention heads. Next, the 49×4×16×128 tensor is multiplied with a 49×4×128×16 tensor, giving a feature of size 49×4×16×16;
(6) Meanwhile, dynamic position information is learned over a [-3, 4] window; the 49×4×16×16 and 49×4×16×128 tensors are multiplied again and reshaped to size 784×512;
(7) The result is added to the initially flattened features and fed into the multi-layer perceptron shown in FIG. 4, passing sequentially through a fully connected layer, activation layer, dropout layer, fully connected layer, and dropout layer, giving a size of 784×512;
(8) Finally, 784×512 is reshaped to 28×28×512 and added to the second-stage output features 28×28×512, giving the output 28×28×512;
(9) The 14×14×1024 feature map extracted by the third stage is flattened and normalized to 196×1024, then reshaped by selecting the feature map according to a window step size of 7 to obtain 4×49×1024; the fully connected layer of the multi-head attention and a reshape give 4×4×49×256, where 4 is the number of attention heads. Next, the 4×4×49×256 tensor is multiplied with a 4×4×256×49 tensor, giving a feature of size 4×4×49×49;
(10) Meanwhile, dynamic position information is learned over a [-6, 7] window; the 4×4×49×49 and 4×4×49×256 tensors are multiplied again and reshaped to size 196×1024;
(11) The result is added to the initially flattened features and fed into the multi-layer perceptron, passing sequentially through a fully connected layer, activation layer, dropout layer, fully connected layer, and dropout layer, giving a size of 196×1024;
(12) Finally, 196×1024 is reshaped to 14×14×1024 and added to the third-stage output features 14×14×1024, giving the output 14×14×1024.
In the fourth step, the long-distance dependency features extracted in the third step are taken as input, a scale normalization module is constructed with convolution and pooling operations, and the long-distance dependency feature maps of the first three intermediate stages are scale-normalized and unified to the same size. The specific process is as follows:
(1) First, 1×1 convolutions uniformly adjust the feature dimensions of the three stages to 512, giving sizes 56×56×512, 28×28×512, and 14×14×512 respectively;
(2) The 56×56×512 and 28×28×512 features are then pooled with pooling kernels of 4×4 and 2×2 respectively, unifying all features to size 14×14×512.
In the fifth step, the multi-layer feature fusion module, mainly composed of flattening, reshaping, fully connected layers, and tensor multiplication and addition, computes the similarity relationship between the first-layer and second-layer same-size features using the vision transformer principle and adds the result to the third-layer features to obtain the fused features. The specific process is as follows:
(1) The scale-normalized 14×14×512 features of the first and second stages are flattened to 196×512;
(2) The flattened first-stage features are passed through the fully connected layer of the multi-head attention and reshaped into a tensor q1 of size 196×1×256, where 1 is the number of attention heads; the flattened second-stage features are passed through two fully connected layers of the multi-head attention and reshaped into two tensors k1 and v1 of size 196×1×256;
(3) q1 and k1 are tensor-multiplied, passed through a Softmax activation function and a dropout layer, then multiplied with v1; a fully connected layer, a dropout layer, and a reshape adjust the result to 14×14×512;
(4) Finally, this is added to the third-stage 14×14×512 features to obtain the final fused features of size 14×14×512.
In the sixth step, the classification module sequentially applies pooling, L2 normalization, a fully connected layer, and a Softmax classifier to the features fused in the fifth step to obtain the class scores, and sequentially applies pooling, a fully connected layer, and a Softmax classifier to the global features from the fourth stage of the second step to obtain their class scores; the relationship between the global features and the fused features is computed through a knowledge distillation loss, and the loss is updated to guide the learning of the global features and obtain the final classification. The specific process is as follows:
(1) Taking the output of the fifth step as input, a 1×1 pooling operation, L2 normalization, and a fully connected layer are applied in sequence to obtain the network output logits_t; a Softmax classifier then produces the class scores S_tsoft at the parameter T = 15, computed as
p_i = exp(z_i / T) / Σ_{j=1}^{Cn} exp(z_j / T)
where z_i is the logit of the i-th class, Cn is the total number of output classes, T is the temperature parameter, and p_i is the probability of class i;
(2) Taking the output of the second step as input, a 1×1 pooling operation and a fully connected layer are applied in sequence to obtain the network output logits_s; a Softmax classifier then produces the class scores S_ssoft and S_shard at the parameters T = 15 and T = 1 respectively;
(3) The KL divergence between S_tsoft and S_ssoft gives loss_KD:
loss_KD = KL(U ‖ V) = Σ_i U(i) · log(U(i) / V(i))
where U and V are the two probability distributions;
(4) The total loss function is
loss = (1 - λ) × S_shard + λ × T × loss_KD
where the λ parameter is set to 0.2 and the T parameter is set to 15; the loss is updated to guide the learning of the global features and obtain the final classification.
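The whole loss computation fits in a few lines. Below is a sketch with the embodiment's T = 15 and λ = 0.2; taking the hard term as cross-entropy on the T = 1 scores is an interpretation, and the single factor of T on the distillation term follows the formula as written (classical knowledge distillation would use T²):

```python
import torch
import torch.nn.functional as F

def total_loss(logits_s: torch.Tensor,   # global-branch logits (student)
               logits_t: torch.Tensor,   # fused-branch logits (teacher)
               labels: torch.Tensor, T: float = 15.0, lam: float = 0.2):
    s_soft = F.log_softmax(logits_s / T, dim=1)    # S_ssoft at temperature T
    t_soft = F.softmax(logits_t / T, dim=1)        # S_tsoft at temperature T
    loss_kd = F.kl_div(s_soft, t_soft, reduction="batchmean")
    loss_hard = F.cross_entropy(logits_s, labels)  # hard scores at T = 1
    return (1 - lam) * loss_hard + lam * T * loss_kd
```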
Based on the same conception, the invention also provides a multi-layer fusion picture classification system based on a long-distance dependency mechanism, comprising:
the data preprocessing module, constructed with downsampling and layer normalization, which realizes image preprocessing;
the multi-scale pyramid feature extraction module, which groups the feature maps with multi-kernel grouped convolution blocks, each group independently adopting a different convolution kernel, to realize multi-scale feature extraction;
the long-distance attention mechanism module, which learns the long-distance dependency features of each level using the long-distance-window-selected multi-head attention mechanism, dynamic position learning, and multi-layer perceptron of the vision transformer;
the scale normalization module, constructed with convolution and pooling operations, which scale-normalizes the intermediate-layer long-distance dependency feature maps and unifies the features to the same size;
the multi-layer feature fusion module, mainly composed of fully connected layers, which computes the similarity relationship between the first-layer and second-layer same-size features with the vision transformer attention mechanism principle and adds the result to the third-layer features to obtain the fused features;
and the classification module, which takes the fused features as a guide, computes the relationship between the global features and the fused features through a knowledge distillation loss, updates the loss to guide the learning of the global features, and performs the final classification of the pictures.
The specific implementation manner of each module corresponds to the first to sixth steps of the picture classification method, and is not described herein.
Further, the present invention also provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the aforementioned picture classification method.
Further, the present invention also provides a computer program product comprising a computer program which, when executed by a processor, implements the steps of the aforementioned picture classification method.
The invention adopts a long-distance dependency mechanism, fully extracts the rich long-distance dependency features of the intermediate layers of the convolutional neural network, fuses them with the vision transformer attention mechanism, and effectively guides the global features to learn context. A pre-trained network is used for feature extraction, strengthening feature learning capability. The method can effectively address the problem of insufficient global feature extraction and improves both the accuracy and the speed of picture classification.
The effect of the invention can be further illustrated by the following simulation experiment:
simulation conditions
Two sets of optical remote sensing image data are used in the simulation experiments: the AID dataset and the NWPU-RESISC45 dataset. The AID dataset was released by Wuhan University and Huazhong University of Science and Technology; it contains 30 scene classes with approximately 220-420 images per class and 10,000 images in total, and a 50% training proportion is used. The NWPU-RESISC45 dataset was created by Northwestern Polytechnical University; it contains 31,500 images covering 45 scene classes, with 700 images per class. The data cover more than 100 countries and regions worldwide, at a larger scale. The spatial resolution of most scene categories varies from 0.2 to 30 m, except for islands, lakes, mountains, etc., where the spatial resolution is lower. The dataset also reflects the influence of natural conditions such as weather, season, and illumination, and exhibits rich image variation in background, occlusion, and other aspects. With a 20% training proportion, the training and test sets contain 6,300 and 25,200 images respectively, each 256×256 pixels. In the experiments, all images of the AID and NWPU-RESISC45 datasets are resized to 224×224. Both sets of experiments take the overall classification accuracy as the evaluation index. The comparison methods comprise basic convolutional neural networks such as GoogLeNet, VGG_16, and PyConvResNet, the multi-scale feature aggregation method SAFF (self-attention-based deep feature fusion), the capsule network (CapsNet), the attention-consistent network (ACNet), the vision transformer (ViT), and the pyramid vision transformer (PVT).
In the experiments, an Adam optimizer is adopted for the feature learning network, with an initial learning rate of 0.001 divided by 10 after 60 epochs, and momentum and weight decay of 0.9 and 1e-4 respectively. The network is trained for 50 epochs on AID and 50 epochs on NWPU-RESISC45; the other network hyper-parameter configurations are summarized in Table 1. All simulation experiments are run under the Linux operating system with Python 3.8 + PyTorch 1.8 + CUDA 11.2.
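A sketch of that training setup (build_model, train_loader, and total_loss are hypothetical pieces from the earlier sketches, and mapping "momentum 0.9" to Adam's beta1 is an assumption):

```python
import torch

model = build_model()            # hypothetical: the full network sketched above
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3,
                             betas=(0.9, 0.999), weight_decay=1e-4)
# divide the learning rate by 10 after 60 epochs
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=60, gamma=0.1)

for epoch in range(50):          # 50 epochs on AID / NWPU-RESISC45
    for images, labels in train_loader:
        logits_s, logits_t = model(images)    # global and fused branches
        loss = total_loss(logits_s, logits_t, labels, T=15.0, lam=0.2)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    scheduler.step()
```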
Table 1. Network hyper-parameter configuration
Simulation experiment result analysis
Tables 2-3 show the classification accuracy (%) of the simulation experiments performed on NWPU-RESISC45 and AID datasets by the method of the present invention.
Table 2. Classification results on the AID dataset by different methods
Table 3. Classification results on the NWPU-RESISC45 dataset by different methods
From the experimental results we find that the method of the invention clearly improves the classification accuracy on both datasets. On the 50% split of the AID dataset, the classification accuracy of the method is 97.40 ± 0.19%, and the resulting classification confusion matrix is shown in FIG. 5. On the 20% split of the AID dataset, the classification accuracy of the method is 95.25 ± 0.13%. Compared with other methods, the method achieves a larger breakthrough in the accuracy of the holiday resort class, whose targets are more complex; this benefits from the long-distance attention mechanism module, which extracts long-distance dependency information more comprehensively. On the 20% split of the NWPU-RESISC45 dataset, the average accuracy of the method is 94.65 ± 0.16%, and the resulting classification confusion matrix is shown in FIG. 6. On the 10% split of the NWPU-RESISC45 dataset, the average accuracy of the method is 92.93 ± 0.12%; better classification results are obtained than with the other methods, mainly thanks to the multi-layer feature fusion by means of the vision transformer attention mechanism and to using the fused features to guide the global feature extraction. These results fully show that the method can effectively learn the deep feature information of remote sensing images, improve the classification accuracy of remote sensing scene images, and deliver higher classification performance.
Claims (10)
1. The multi-layer fusion picture classification method based on the long-distance dependency mechanism is characterized by comprising the following steps of:
firstly, constructing a data preprocessing module by utilizing downsampling and layer normalization to realize image preprocessing;
secondly, grouping the feature maps with multi-kernel grouped convolution blocks, each group independently adopting a different convolution kernel, to construct a multi-scale pyramid feature extraction module and realize multi-scale feature extraction;
thirdly, taking the intermediate three-layer features extracted by the feature extraction module in the second step as input to a long-distance attention mechanism module, and learning the long-distance dependency features of each level using the step-size-selected multi-head attention mechanism, dynamic position learning, and multi-layer perceptron of a vision transformer;
fourthly, constructing a scale normalization module with convolution and pooling operations, and performing scale normalization on the intermediate-layer long-distance dependency feature maps to unify the features to the same size;
fifthly, using the multi-layer feature fusion module to compute the similarity relationship between the first-layer and second-layer same-size features with the vision transformer attention mechanism principle, and adding the result to the third-layer features to obtain the fused features;
sixthly, the classification module sequentially applies pooling, L2 normalization, a fully connected layer, and a Softmax classifier to the fused features to obtain the class scores, and sequentially applies pooling, a fully connected layer, and a Softmax classifier to the global features to obtain their class scores; the relationship between the global features and the fused features is computed through a knowledge distillation loss, and the loss is updated to guide the learning of the global features and obtain the final classification.
2. The multi-layer fusion picture classification method based on a long-distance dependency mechanism according to claim 1, wherein the first step adopts a data preprocessing module comprising downsampling and layer normalization operations to implement data preprocessing, the specific process being:
first, an initial preprocessing operation is performed on an H×W×N image, where H is the picture height, W the picture width, and N the number of channels; a convolution layer with kernel size 7×7, C output channels, and stride 2, followed by layer normalization, realizes a 2× downsampling operation and adjusts the image to H/2×W/2×C, where C is the feature dimension.
3. The multi-layer fusion picture classification method based on a long-distance dependency mechanism according to claim 2, wherein the features preprocessed in the first step are taken as input, a multi-scale pyramid structure is constructed with multi-kernel grouped convolution blocks, and multi-scale intermediate features are extracted, the specific process being:
(1) One multi-kernel grouped convolution block is composed as follows:
first, a dimension-adjusting convolution layer with kernel size 1×1, stride 1, and zero padding, followed by a normalization layer and activation function; the feature maps are then split into groups, each group passing through its own convolution kernel with stride 2, after which the group outputs are concatenated and passed through a normalization layer and activation function; a 1×1 convolution layer then restores the channels to 4 times the input dimension, again followed by a normalization layer and activation function; finally, the original input tensor, downsampled by a pooling layer and adjusted to 4 times the input dimension by a 1×1 convolution and normalization layer, is added as a shortcut to give the final output;
(2) Four stages of multi-kernel grouped convolution blocks are stacked in total. In the first stage the block is stacked 3 times, with 4 groups {G1, G2, G3, G4} and kernel sizes {K1×K1, K2×K2, K3×K3, K4×K4}; the output size is H/4×W/4×4C. The second stage starts with one dimension-adjusting convolution layer (dimension 2C) and stacks the block 4 times, with 3 groups {G1, G2, G3} and kernel sizes {K1×K1, K2×K2, K3×K3}; the output size is H/8×W/8×8C. The third stage starts with one dimension-adjusting convolution layer (dimension 4C) and stacks the block 6 times, with 2 groups {G1, G2} and kernel sizes {K1×K1, K2×K2}; the output size is H/16×W/16×16C. The fourth stage starts with one dimension-adjusting convolution layer (dimension 8C) and stacks the block 3 times, with 1 group G1 and kernel size 3×3; the output size is H/32×W/32×32C.
The outputs of the first, second, and third stages serve as inputs to the long-distance attention mechanism module, while the fourth-stage output enters the classification module as the global feature.
4. The multi-layer fusion picture classification method based on a long-distance dependency mechanism according to claim 3, wherein the features extracted by the first, second, and third stages of the multi-scale pyramid feature extraction module of the second step are taken as input, and the multi-head attention and multi-layer perceptron of the vision transformer are used to learn the long-distance dependency features of each level, the specific process being:
(1) The H/4×W/4×4C feature map is flattened and normalized to (H/4·W/4)×4C; the tokens are then selected according to the step size S to obtain H·W/(16·S²)×S²×4C, and the fully connected layer of the multi-head attention plus a reshape give (H/4·W/4)/S²×n×S²×4C/n, where n is the number of attention heads; next, tensors of sizes (H/4·W/4)/S²×n×S²×4C/n and (H/4·W/4)/S²×n×4C/n×S² are multiplied, giving a feature of size (H/4·W/4)/S²×n×S²×S²;
(2) Meanwhile, dynamic position information is learned over a window [1-S, S]; the (H/4·W/4)/S²×n×S²×S² and (H/4·W/4)/S²×n×S²×4C/n tensors are then multiplied and reshaped to size (H/4·W/4)×4C;
(3) The result is added to the initially flattened features and fed into a multi-layer perceptron, passing sequentially through a fully connected layer, activation layer, dropout layer, fully connected layer, and dropout layer, giving a size of (H/4·W/4)×4C;
(4) Finally, the (H/4·W/4)×4C output is reshaped to H/4×W/4×4C and added to the stage-one output features H/4×W/4×4C, giving the final output H/4×W/4×4C;
(5) The second and third levels are processed identically to the first.
5. The multi-layer fusion picture classification method based on a long-distance dependency mechanism according to claim 4, wherein the long-distance dependency features extracted in the third step are taken as input, a scale normalization module is constructed with convolution and pooling operations, and the long-distance dependency feature maps of the first three intermediate stages are scale-normalized and unified to the same size, the specific process being:
(1) First, 1×1 convolutions uniformly adjust the feature dimensions of the three stages to 8C, giving sizes H/4×W/4×8C, H/8×W/8×8C, and H/16×W/16×8C respectively;
(2) The H/4×W/4×8C and H/8×W/8×8C features are then pooled with pooling kernels of 4×4 and 2×2 respectively, unifying all features to size H/16×W/16×8C.
6. The method for classifying multi-layer fusion pictures based on long-distance dependency mechanism according to claim 5, wherein the fifth step, the multi-layer feature fusion module is mainly composed of flattening processing, size deformation, multiplication and addition of full-connection layers and tensors, and the method comprises the steps of calculating the similarity relationship between the first layer and the second layer of the same-size features by utilizing the principle of a visual transducer, adding the similarity relationship with the third layer of features, and obtaining fusion features, wherein the specific process is as follows:
(1) Flattening the H/16 XW/16X 8C features of the first stage and the second stage after the scale specification to obtain (H/16.W/16) X8C;
(2) Then the flattened features of the first stage are sent into a full connection layer with multi-head attention and deformed in size to obtain tensor q with the size of (H/16.W/16) multiplied by n multiplied by 8C/n 1 N is the number of multi-head attentions; the flattened features of the second stage are sent to multi-head attention, and two full connection and size deformation are carried out to obtain tensor k with the size of (H/16.W/16) multiplied by n multiplied by 8C/n 1 、v 1 ;
(3) Will q 1 And k is equal to 1 Tensor multiplication is performed, then the function is activated through Softmax, the layer is lost, and then v is used for the layer 1 Multiplying, once through the full connection layer, the discarding layer and the size deformation to be adjusted to H/16 XW/16X 8C;
(4) Finally, the third-stage H/16 × W/16 × 8C features are added to obtain the final fusion feature of size H/16 × W/16 × 8C; a sketch of this fusion follows.
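Steps (1)-(4) amount to one cross-layer attention in which the first stage supplies the query and the second stage the key and value. The sketch below is one possible reading, not the patent's code: the head count n, the drop rate and the name `MultiLayerFusion` are assumptions, and the third-stage features are taken pre-flattened to (H/16 · W/16) × 8C rather than reshaped afterwards.

```python
import torch.nn as nn

class MultiLayerFusion(nn.Module):
    """Cross-layer attention fusion of three same-size stages (hypothetical)."""
    def __init__(self, dim, num_heads, p_drop=0.1):
        super().__init__()
        self.n = num_heads
        self.q = nn.Linear(dim, dim)            # one full connection for the query
        self.kv = nn.Linear(dim, dim * 2)       # two full connections for key and value
        self.proj = nn.Linear(dim, dim)
        self.drop = nn.Dropout(p_drop)

    def forward(self, f1, f2, f3):              # each: B x (H/16*W/16) x 8C
        B, L, C = f1.shape
        n = self.n
        q1 = self.q(f1).reshape(B, L, n, C // n).transpose(1, 2)
        kv = self.kv(f2).reshape(B, L, 2, n, C // n).permute(2, 0, 3, 1, 4)
        k1, v1 = kv[0], kv[1]
        attn = (q1 @ k1.transpose(-2, -1)) * (C // n) ** -0.5
        attn = self.drop(attn.softmax(dim=-1))  # Softmax activation, then drop layer
        out = (attn @ v1).transpose(1, 2).reshape(B, L, C)
        out = self.drop(self.proj(out))         # full connection and drop layer
        return out + f3                         # add the third-stage features
```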
7. The multi-layer fusion picture classification method based on the long-distance dependency mechanism according to claim 6, wherein the classification module sequentially applies pooling, L2 regularization, a full-connection layer and a Softmax classifier to the features fused in the fifth step to obtain the class scores, and sequentially applies pooling, a full-connection layer and a Softmax classifier to the fourth-stage global features from the second step to obtain the class scores; through the knowledge distillation loss, the KL-divergence relation between the global features and the aggregated features is calculated and the loss is updated, guiding the global features to learn and yielding the final classification, the specific process being as follows:
(1) Taking the output of the fifth step as input, a 1×1 pooling operation, L2 regularization and a full-connection layer are applied in sequence to obtain the network output logits_t; a Softmax classifier is then attached, and under the parameter T = t the class scores S_tsoft are obtained by the following formula:

p_i = exp(z_i / T) / Σ_{j=1..Cn} exp(z_j / T)

wherein z_i is the logit corresponding to the i-th class, Cn is the total number of output classes, T is the temperature parameter, and p_i is the probability of the i-th class;
(2) Taking the output of the second step as input, a 1×1 pooling operation and a full-connection layer are applied in sequence to obtain the network output logits_s; a Softmax classifier is then attached, and the class scores S_ssoft and S_shard are obtained under the parameters T = t and T = 1 respectively;
(3) S_tsoft and S_ssoft are subjected to a KL-divergence calculation to obtain loss_KD by the following formula:

KL(U ‖ V) = Σ_i U(i) × log(U(i) / V(i))

wherein U and V are the two probability distributions (here S_tsoft and S_ssoft);
(4) The total loss function is calculated as

loss = (1 - λ) × S_shard + λ × t × loss_KD

wherein λ and t are parameters; the loss is updated accordingly, guiding the global features to learn and yielding the final classification. A sketch of this loss follows.
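The loss of steps (1)-(4) is ordinary temperature-softmax knowledge distillation. The sketch below reads S_shard as the hard-label cross-entropy on the student (global-feature) branch, which is the usual interpretation; the default values of T and λ are assumptions, and the KD term is scaled by t exactly as the claim writes it, not by t² as in classic distillation.

```python
import torch.nn.functional as F

def distill_loss(logits_s, logits_t, labels, T=4.0, lam=0.5):
    """Total loss of claim 7; logits_t from the fused branch, logits_s from the global branch."""
    # S_shard: hard-label loss on the student branch (T = 1)
    s_hard = F.cross_entropy(logits_s, labels)
    # S_ssoft / S_tsoft: softened distributions at temperature T
    log_u = F.log_softmax(logits_s / T, dim=1)
    v = F.softmax(logits_t / T, dim=1)
    # loss_KD: KL(teacher || student) between the softened distributions
    loss_kd = F.kl_div(log_u, v, reduction="batchmean")
    return (1 - lam) * s_hard + lam * T * loss_kd
```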
8. A multi-layer fusion picture classification system based on a long-distance dependency mechanism, comprising:
the data preprocessing module, which is constructed using downsampling and layer normalization to realize image preprocessing;
the multi-scale pyramid feature extraction module, which groups the feature maps using multi-core grouped convolution blocks, each group independently adopting different convolution kernels to realize multi-scale feature extraction;
the long-distance attention mechanism module, which learns the long-distance dependent features of each level using the long-distance step-size-selected multi-head attention, dynamic position learning and multi-layer perceptron of the vision transformer;
the scale normalization module, which is constructed using convolution and pooling operations and performs scale normalization on the intermediate-layer long-distance dependent feature maps, unifying the features to the same size;
the multi-layer feature fusion module, which is mainly composed of full-connection layers and, using the attention mechanism principle of the vision transformer, calculates the similarity relation between the first-layer and second-layer same-size features and adds it to the third-layer features to obtain the fusion features;
and the classification module, which takes the aggregated features as guidance, calculates the KL-divergence relation between the global features and the fusion features through the knowledge distillation loss, updates the loss and guides the global features to learn, thereby producing the final classification of the picture; a compositional sketch of these modules follows.
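Read together, the six modules chain as in the compositional sketch below; every class and argument name here is our shorthand for the corresponding module of claim 8, not code from the patent, and the flattening/reshaping glue between modules is omitted.

```python
import torch.nn as nn

class MultiLayerFusionClassifier(nn.Module):
    """Hypothetical composition of the six modules of claim 8."""
    def __init__(self, preproc, pyramid, attentions, scale_norm, fusion, head):
        super().__init__()
        self.preproc = preproc                       # downsampling + layer normalization
        self.pyramid = pyramid                       # multi-core grouped-convolution stages
        self.attentions = nn.ModuleList(attentions)  # one long-distance attention per stage
        self.scale_norm = scale_norm                 # unify stages 1-3 to one size
        self.fusion = fusion                         # cross-layer attention fusion
        self.head = head                             # distillation-guided classifier

    def forward(self, img):
        x = self.preproc(img)
        stages = self.pyramid(x)                     # list of per-stage feature maps
        stages = [att(s) for att, s in zip(self.attentions, stages)]
        f1, f2, f3 = self.scale_norm(*stages[:3])
        fused = self.fusion(f1, f2, f3)
        return self.head(fused, stages[-1])          # logits from fused and global branches
```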
9. A computer readable storage medium, on which a computer program is stored, characterized in that the program, when being executed by a processor, implements the steps of the method according to any of claims 1-7.
10. A computer program product comprising a computer program, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any of claims 1-7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311120442.5A CN117197663A (en) | 2023-08-31 | 2023-08-31 | Multi-layer fusion picture classification method and system based on long-distance dependency mechanism |
Publications (1)
Publication Number | Publication Date |
---|---|
CN117197663A (en) | 2023-12-08 |
Family
ID=89004558
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202311120442.5A Pending CN117197663A (en) | 2023-08-31 | 2023-08-31 | Multi-layer fusion picture classification method and system based on long-distance dependency mechanism |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117197663A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117952969A (en) * | 2024-03-26 | 2024-04-30 | 济南大学 | Endometrial cancer analysis method and system based on selective attention |
Similar Documents
Publication | Title
---|---
CN113011499B | Hyperspectral remote sensing image classification method based on double-attention machine system
CN112766087A | Optical remote sensing image ship detection method based on knowledge distillation
CN114419449B | Self-attention multi-scale feature fusion remote sensing image semantic segmentation method
CN114241422B | Student classroom behavior detection method based on ESRGAN and improved YOLOv s
CN113436227A | Twin network target tracking method based on inverted residual error
CN117197663A | Multi-layer fusion picture classification method and system based on long-distance dependency mechanism
CN113920581A | Method for recognizing motion in video by using space-time convolution attention network
CN114821058A | Image semantic segmentation method and device, electronic equipment and storage medium
CN113505719A | Gait recognition model compression system and method based on local-integral joint knowledge distillation algorithm
CN114898359B | Litchi plant diseases and insect pests detection method based on improvement EFFICIENTDET
CN113436198A | Remote sensing image semantic segmentation method for collaborative image super-resolution reconstruction
CN116363361A | Automatic driving method based on real-time semantic segmentation network
CN115965819A | Lightweight pest identification method based on Transformer structure
CN116740516A | Target detection method and system based on multi-scale fusion feature extraction
CN115272670A | SAR image ship instance segmentation method based on mask attention interaction
CN113313721B | Real-time semantic segmentation method based on multi-scale structure
CN114463732A | Scene text detection method and device based on knowledge distillation
CN113902753A | Image semantic segmentation method and system based on dual-channel and self-attention mechanism
CN116704187A | Real-time semantic segmentation method, system and storage medium for semantic alignment
Yin et al. | M2F2-RCNN: Multi-functional faster RCNN based on multi-scale feature fusion for region search in remote sensing images
CN113902904B | Lightweight network architecture system
CN114373080B | Hyperspectral classification method of lightweight hybrid convolution model based on global reasoning
CN115587628A | Deep convolutional neural network lightweight method
CN115620064A | Point cloud down-sampling classification method and system based on convolutional neural network
CN112990336B | Deep three-dimensional point cloud classification network construction method based on competitive attention fusion
Legal Events
Code | Title
---|---
PB01 | Publication
SE01 | Entry into force of request for substantive examination