CN116206210A - NAS-Swin-based remote sensing image agricultural greenhouse extraction method - Google Patents

NAS-Swin-based remote sensing image agricultural greenhouse extraction method

Info

Publication number
CN116206210A
Authority
CN
China
Prior art keywords
agricultural greenhouse
swin
layer
remote sensing
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211569653.2A
Other languages
Chinese (zh)
Inventor
佟威剑
贾淑涵
赵泉华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Liaoning Technical University
Original Assignee
Liaoning Technical University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Liaoning Technical University filed Critical Liaoning Technical University
Priority to CN202211569653.2A priority Critical patent/CN116206210A/en
Publication of CN116206210A publication Critical patent/CN116206210A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00: Scenes; Scene-specific elements
    • G06V20/10: Terrestrial scenes
    • G06V20/13: Satellite images
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00: Arrangements for image or video recognition or understanding
    • G06V10/20: Image preprocessing
    • G06V10/25: Determination of region of interest [ROI] or a volume of interest [VOI]
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00: Arrangements for image or video recognition or understanding
    • G06V10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764: Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G06V10/765: Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects using rules for classification or partitioning the feature space
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00: Arrangements for image or video recognition or understanding
    • G06V10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77: Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774: Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00: Arrangements for image or video recognition or understanding
    • G06V10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77: Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80: Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806: Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00: Arrangements for image or video recognition or understanding
    • G06V10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82: Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A: TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A40/00: Adaptation technologies in agriculture, forestry, livestock or agroalimentary production
    • Y02A40/10: Adaptation technologies in agriculture, forestry, livestock or agroalimentary production in agriculture
    • Y02A40/25: Greenhouse technology, e.g. cooling systems therefor

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Remote Sensing (AREA)
  • Astronomy & Astrophysics (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The invention designs a remote sensing image agricultural greenhouse extraction method based on NAS-Swin, belonging to the field of remote sensing images. First, the position information of agricultural greenhouse ground objects is annotated in the acquired satellite image data, and the annotated data are divided into a training set and a test set; the training set is input into a parameter-improved Swin-Transformer neural network module for training to obtain remote sensing features of the agricultural greenhouse; the remote sensing features of the agricultural greenhouse are input into a NAS-FPN network, a targeted feature pyramid is established for the agricultural greenhouses in the Landsat images, and the remote sensing features are fused; the fused remote sensing features are input into an RPN (Region Proposal Network) improved with the Approx strategy, and the position information of the agricultural greenhouse is extracted and marked, yielding the NAS-Swin neural network framework; finally the agricultural greenhouse information is obtained. The invention constructs a NAS-Swin neural network framework better suited to remote sensing agricultural greenhouse extraction, and overcomes the difficulty traditional agricultural greenhouse extraction methods have with large-scale image data.

Description

NAS-Swin-based remote sensing image agricultural greenhouse extraction method
Technical Field
The invention belongs to the field of remote sensing images, and particularly relates to a remote sensing image agricultural greenhouse extraction method based on NAS-Swin.
Background
As an emerging agricultural facility, the agricultural greenhouse is inexpensive and highly effective against insect pests and disasters; it reduces the influence of climate on cultivation, effectively raises crop yields, and relieves supply shortages. It extends the vegetable supply period in northern regions, improves land utilization, and meets the year-round demand for agricultural products. With its low investment, high return and high benefit, the agricultural greenhouse is now widely used in livestock breeding, forestry seedling cultivation, fruit tree cultivation, vegetable production and other fields; it has become an important industry within agriculture, carries very high social and economic benefits, and serves as an important index for measuring agricultural modernization. At the same time it brings problems, such as greenhouse construction that cannot be detected and controlled in time, greenhouse distribution that cannot be effectively planned, and illegal occupation of greenhouse land. Timely and accurate acquisition of information such as the geographical distribution and coverage area of agricultural greenhouses is therefore of great significance for agricultural yield evaluation, farmers' production planning and national agricultural planning. The traditional way of acquiring greenhouse distribution information still relies on manual field surveys or interviews, which demand large amounts of manpower and material resources, take a long time and are highly subjective, and the resulting statistics often suffer from omissions, double counting and miscounting.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a remote sensing image agricultural greenhouse extraction method based on NAS-Swin, which can better and more efficiently extract the agricultural greenhouse information in the remote sensing image.
The remote sensing image agricultural greenhouse extraction method based on NAS-Swin specifically comprises the following steps:
step 1: obtaining Landsat satellite data and Sentinel-2 satellite image data, and then performing image preprocessing on them; the position information of agricultural greenhouse ground objects in the image data is annotated with labelme, and the annotated data are divided into a training set and a test set at a ratio of 4:1;
step 2: inputting the training set obtained in the step 1 into a parameter-improved Swin-Transformer neural network module for training to obtain remote sensing features of the agricultural greenhouse;
the Swin-Transformer neural network module consists of 4 parts; the input image is first divided into a set of non-overlapping image blocks (patches) by a patch partition layer, where the size of each image block is 4x4 and the corresponding feature dimension of each image block is 48; in the first part, a linear embedding layer first projects the divided feature dimension, and the result is sent into a Swin Transformer Block module for computation; the second to the fourth parts are identical in structure: the set of image blocks output by the previous layer is merged by a patch merging layer over a 2x2 neighbourhood, so the merged image block size is four times that of the previous layer and the feature dimension also becomes four times that of the previous layer; a linear embedding layer then reduces this feature dimension to half, after which the result is sent into a Swin Transformer Block module to compute the self-attention of the image blocks;
the Swin Transformer Block module first normalizes the input feature Z1 from the preceding patch partition/merging layer through a Layer-Norm layer; feature learning is then performed by window-based multi-head self-attention (W-MSA), and a residual operation gives Z2; a Layer-Norm layer, an MLP layer and a residual operation then give Z3; Z3 goes through a Layer-Norm layer and shifted-window-based multi-head self-attention (SW-MSA) for feature learning, a residual operation gives Z4, and a further Layer-Norm layer, MLP and residual give Z5, the output-layer feature; the residual connection formulas in the module are as follows,
Z2=W-MSA(Layer-Norm(Z1))+Z1
Z3=MLP(Layer-Norm(Z2))+Z2
Z4=SW-MSA(Layer-Norm(Z3))+Z3
Z5=MLP(Layer-Norm(Z4))+Z4
wherein Z1 is the input feature, Z2 and Z3 are the output features of the W-MSA and MLP modules respectively, and Z4 and Z5 are the output features of the SW-MSA and MLP modules respectively; W-MSA and SW-MSA denote window multi-head self-attention and shifted (sliding) window multi-head self-attention, and information interaction between different image blocks is carried out by shifting the windows; to reduce the amount of computation, the Swin Transformer Block performs a shift-and-merge operation on the SW-MSA windows;
the calculation of self-attention is,
Attention(Q, K, V) = SoftMax(QK^T/√d + B)V
wherein Q, K and V correspond to the query, key and value of the self-attention mechanism respectively, and Q, K, V ∈ R^(m×d), a real matrix of m rows and d columns; d is the dimension of the query and key, and m is the number of patches in a window; the relative position values between different patches lie in the range [-M+1, M-1], where M is the arithmetic square root of the number of patches, so the relative position bias is parameterized by a matrix B, B ∈ R^((2M-1)×(2M-1)); the SoftMax function performs normalization and, taking the i-th patch as an example, is defined as follows:
SoftMax(x_i) = e^(x_i) / Σ_{j=1}^{m} e^(x_j)
wherein m is the number of patches;
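As a concrete illustration of the formula above, the following is a minimal PyTorch-style sketch of window self-attention with a relative position bias; the module name, tensor shapes and default hyper-parameters (window size 7, 4 heads) are illustrative assumptions rather than the exact implementation of the invention.

```python
# Minimal sketch (assumption): window self-attention with a learnable relative position
# bias B, following Attention(Q, K, V) = SoftMax(QK^T / sqrt(d) + B) V described above.
import torch
import torch.nn as nn

class WindowAttention(nn.Module):
    def __init__(self, dim, window_size=7, num_heads=4):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads, self.head_dim = num_heads, dim // num_heads
        self.scale = self.head_dim ** -0.5            # 1 / sqrt(d)
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)
        M = window_size
        # B is parameterized by a (2M-1) x (2M-1) table, one bias value per head
        self.bias_table = nn.Parameter(torch.zeros((2 * M - 1) ** 2, num_heads))
        ys = torch.arange(M).repeat_interleave(M)     # row index of each of the M*M patches
        xs = torch.arange(M).repeat(M)                # column index of each patch
        coords = torch.stack([ys, xs])                # 2 x M^2
        rel = coords[:, :, None] - coords[:, None, :] + (M - 1)  # relative positions in [0, 2M-2]
        self.register_buffer("bias_index", rel[0] * (2 * M - 1) + rel[1])  # M^2 x M^2

    def forward(self, x):                             # x: (num_windows * batch, m, dim), m = M^2
        B, m, dim = x.shape
        qkv = self.qkv(x).reshape(B, m, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)          # each: (B, heads, m, head_dim)
        attn = (q @ k.transpose(-2, -1)) * self.scale # QK^T / sqrt(d)
        bias = self.bias_table[self.bias_index.reshape(-1)]
        bias = bias.reshape(m, m, -1).permute(2, 0, 1)            # (heads, m, m)
        attn = torch.softmax(attn + bias.unsqueeze(0), dim=-1)    # add B, normalize with SoftMax
        out = (attn @ v).transpose(1, 2).reshape(B, m, dim)
        return self.proj(out)
```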
aiming at the agricultural greenhouse extraction problem, the moving window size and the downsampling ratios of the Swin-Transformer neural network module are improved: the moving window size is changed from 12×12 to 7×7, and the downsampling ratios are changed from 8, 16, 32, 64 to 4, 8, 16, 32;
step 3: inputting the remote sensing features of the agricultural greenhouse extracted in the step 2 into a NAS-FPN network, establishing a targeted feature pyramid aiming at the agricultural greenhouse in the Landsat image, and fusing the remote sensing features;
the targeted feature pyramid is constructed by searching the NAS-FPN search space stacked 7 times, and the feature pyramid with the highest feature-fusion accuracy is selected by comparison;
step 4: inputting the fused remote sensing features obtained in the step 3 into an RPN (Region Proposal Network) improved with the Approx strategy to obtain pre-selected anchor box information;
the Approx strategy is used to improve the positive/negative sample assignment of the RPN network; 9 pre-selected boxes are generated at each feature point of the fused remote sensing feature map obtained in the step 3, and the Approx strategy is applied to the 9 pre-selected boxes at each feature point: the IoU values of the pre-selected boxes are computed and compared, and the maximum IoU value at each feature point is kept; the remote sensing feature map is passed through 3×3 and 1×1 convolutions and then split into two branches, one branch uses a softmax classifier to make a binary decision on whether a target is present in the candidate box (the pre-selected box is kept if a target is present and removed otherwise), and the other branch performs bounding-box regression; the bounding boxes are shifted according to the predicted offsets, whether each box is background is determined, the bboxes are then sorted by probability, the bbox with the highest probability score is kept, and the remaining bboxes are selected or rejected by soft non-maximum suppression; the area enclosed by the finally remaining bboxes is the ROI region;
step 5: extracting and marking the position information of the agricultural greenhouse based on the fusion characteristics of the step 3 and the pre-selected anchor frame information of the step 4;
step 6: calculating corresponding loss functions according to the agricultural greenhouse information obtained by prediction in the step 5 based on the sample marking information obtained in the step 1, wherein different loss functions are used according to different network modules; obtaining a NAS-Swin neural network framework;
Selection of the different loss functions: the overall loss function L_Total of the invention consists of two parts, the RPN network loss function L_RPN and the ROIAlign loss function L_ROIAlign, namely:
L_Total = L_RPN + L_ROIAlign
wherein L_RPN contains the object classification loss L_cls within the anchor boxes and the bounding-box position loss L_reg of the anchor boxes:
L_RPN = (1/N_cls) Σ_i L_cls(p_i, p_i*) + λ (1/N_reg) Σ_i p_i* L_reg(t_i, t_i*)
wherein p_i represents the probability that the i-th anchor box is predicted as the true label, p_i* is 1 for a positive sample and 0 for a negative sample, t_i represents the bounding-box regression parameters of the i-th anchor box, t_i* represents the ground-truth box corresponding to the i-th anchor box, N_cls is the total number of positive and negative anchor samples used to train the RPN network, N_reg represents the number of anchor-box positions, and λ is a balancing weight;
L_cls(p_i, p_i*) represents the classification loss, namely:
L_cls(p_i, p_i*) = -[p_i* log(p_i) + (1 - p_i*) log(1 - p_i)]
L_reg(t_i, t_i*) represents the bounding-box localization regression loss, namely:
L_reg(t_i, t_i*) = Smooth_L1(t_i - t_i*), where Smooth_L1(x) = 0.5x^2 if |x| < 1, and |x| - 0.5 otherwise;
the bounding-box localization regression loss and classification loss of ROIAlign are the same as those of the RPN network, and the mask loss adopts a binary cross-entropy loss function, namely:
L_mask = -(1/N) Σ_i [D_i* log(D_i) + (1 - D_i*) log(1 - D_i)]
wherein N is the number of pixels, D_i is the predicted probability that the i-th pixel belongs to the target pixel, and D_i* is the probability that the i-th pixel belongs to a true target pixel;
step 7: the training set obtained in the step 1 is input into the NAS-Swin neural network for training, and the test set is then input into the trained NAS-Swin neural network to finally obtain the agricultural greenhouse information.
The invention has the beneficial effects that:
according to the invention, the characteristic extraction is performed by using the Swin-transformer backbone network with the parameters adjusted aiming at the problem of extracting the agricultural greenhouse, and compared with the unmodified Swin-transformer backbone network, the method has a better effect. Through overlapping NAS-FPN search space for 7 times, a feature fusion strategy with better extraction effect aiming at the remote sensing agricultural greenhouse is found, and the problem of feature fusion of the traditional feature pyramid is solved. The invention constructs a better NAS-Swin neural network framework aiming at remote sensing agricultural greenhouse extraction, and overcomes the defect that large-scale image data are difficult to extract by the traditional agricultural greenhouse extraction method.
Drawings
FIG. 1 is a structural architecture diagram of an agricultural greenhouse extraction based on NAS-Swin in an embodiment of the invention;
FIG. 2 is a schematic diagram of a Swin-Transformer backbone network according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of the structure of an embodiment Swin-Transformer block of the present invention;
FIG. 4 is a schematic diagram of an embodiment of an MLP network according to the present invention;
FIG. 5 is a schematic diagram of window multi-headed self-attention and sliding window multi-headed self-attention in accordance with an embodiment of the present invention;
FIG. 6 is a schematic diagram of a sliding window merging scheme according to an embodiment of the present invention;
FIG. 7 is a schematic diagram showing the operation of the building blocks of the NAS-FPN according to the embodiment of the present invention;
fig. 8 is a schematic view of the feature pyramid searched for agricultural greenhouse extraction according to an embodiment of the present invention.
Detailed Description
The invention is further described below with reference to the drawings and examples;
a remote sensing image agricultural greenhouse extraction method based on NAS-Swin is shown in figure 1. The method specifically comprises the following steps:
step 1: downloading satellite images from the Geospatial Data Cloud to obtain Landsat satellite data and Sentinel-2 satellite image data, and then performing image preprocessing on them; the position information of agricultural greenhouse ground objects in the image data is annotated with labelme, and the annotated data are divided into a training set and a test set at a ratio of 4:1;
in this embodiment, Landsat remote sensing images of a city acquired over 21 years, from 2000 to 2021, are collected as experimental data and annotated; the ratio of training images to test images is 4:1.
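A minimal sketch of the 4:1 split used in this embodiment is given below; the directory name, file extension and random seed are illustrative assumptions.

```python
# Sketch (assumption): split the labelme-annotated images 4:1 into training and test sets.
import random
from pathlib import Path

def split_dataset(image_dir, train_ratio=0.8, seed=0):
    images = sorted(Path(image_dir).glob("*.tif"))   # annotated Landsat/Sentinel-2 tiles
    random.Random(seed).shuffle(images)
    n_train = int(len(images) * train_ratio)         # 4:1 -> 80% of the images for training
    return images[:n_train], images[n_train:]

train_set, test_set = split_dataset("data/greenhouse_images")
```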
step 2: inputting the training set obtained in the step 1 into a parameter-improved Swin-Transformer neural network module for training to obtain remote sensing features of the agricultural greenhouse; the Swin Transformer Block uses residual connections to prevent problems such as gradient vanishing or gradient explosion caused by the increase of network depth;
the Swin-Transformer neural network module is shown in figure 2 and consists of 4 parts; the input image is first divided into a set of non-overlapping image blocks (patches) by a patch partition layer, where the size of each image block is 4x4 and the corresponding feature dimension of each image block is 48; in the first part, a linear embedding layer first projects the divided feature dimension, and the result is sent into a Swin Transformer Block module for computation; the second to the fourth parts are identical in structure: the set of image blocks output by the previous layer is merged by a patch merging layer over a 2x2 neighbourhood, so the merged image block size is four times that of the previous layer and the feature dimension also becomes four times that of the previous layer; a linear embedding layer then reduces this feature dimension to half, after which the result is sent into a Swin Transformer Block module to compute the self-attention of the image blocks;
as shown in FIG. 3, the Swin Transformer Block module first normalizes the input feature Z1 from the preceding patch partition/merging layer through a Layer-Norm layer; feature learning is then performed by window-based multi-head self-attention (W-MSA), and a residual operation gives Z2; a Layer-Norm layer, an MLP layer and a residual operation then give Z3; the network structure of the MLP is shown in fig. 4; Z3 goes through a Layer-Norm layer and shifted-window-based multi-head self-attention (SW-MSA) for feature learning, a residual operation gives Z4, and a further Layer-Norm layer, MLP and residual give Z5, the output-layer feature; the residual connection formulas in the module are as follows,
Z2=W-MSA(Layer-Norm(Z1))+Z1
Z3=MLP(Layer-Norm(Z2))+Z2
Z4=SW-MSA(Layer-Norm(Z3))+Z3
Z5=MLP(Layer-Norm(Z4))+Z4
wherein Z1 is the input feature, Z2 and Z3 are the output features of the W-MSA and MLP modules respectively, and Z4 and Z5 are the output features of the SW-MSA and MLP modules respectively; W-MSA and SW-MSA denote window multi-head self-attention and shifted (sliding) window multi-head self-attention, and information interaction between different image blocks is carried out by shifting the windows; the specific operation is shown in fig. 5. Each black-outlined block in fig. 5 represents one image block (patch), an adjacent 4×4 group of blocks forms a fixed window, and the self-attention computation is carried out within the red box. FIG. 5 (a) is the W-MSA, and the SW-MSA of FIG. 5 (b) is obtained by shifting the windows. To reduce the amount of computation, the Swin Transformer Block performs a shift-and-merge operation on the SW-MSA windows; the specific operation is shown in fig. 6.
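The residual structure Z1 to Z5 above maps directly onto code. The following is a minimal PyTorch sketch under the assumption that the W-MSA and SW-MSA attention modules (including window partitioning and the shift-and-merge step) are supplied from outside; the MLP expansion ratio of 4 is an illustrative choice, not a value taken from the invention.

```python
# Sketch (assumption): one Swin Transformer block pair implementing
# Z2 = W-MSA(LN(Z1)) + Z1, Z3 = MLP(LN(Z2)) + Z2,
# Z4 = SW-MSA(LN(Z3)) + Z3, Z5 = MLP(LN(Z4)) + Z4.
import torch.nn as nn

class SwinBlockPair(nn.Module):
    def __init__(self, dim, w_msa, sw_msa, mlp_ratio=4):
        super().__init__()
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.norm3, self.norm4 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.w_msa, self.sw_msa = w_msa, sw_msa          # window / shifted-window attention modules
        self.mlp1 = nn.Sequential(nn.Linear(dim, mlp_ratio * dim), nn.GELU(),
                                  nn.Linear(mlp_ratio * dim, dim))
        self.mlp2 = nn.Sequential(nn.Linear(dim, mlp_ratio * dim), nn.GELU(),
                                  nn.Linear(mlp_ratio * dim, dim))

    def forward(self, z1):
        z2 = self.w_msa(self.norm1(z1)) + z1    # Z2 = W-MSA(Layer-Norm(Z1)) + Z1
        z3 = self.mlp1(self.norm2(z2)) + z2     # Z3 = MLP(Layer-Norm(Z2)) + Z2
        z4 = self.sw_msa(self.norm3(z3)) + z3   # Z4 = SW-MSA(Layer-Norm(Z3)) + Z3
        z5 = self.mlp2(self.norm4(z4)) + z4     # Z5 = MLP(Layer-Norm(Z4)) + Z4
        return z5
```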
The calculation of self-attention is,
Attention(Q, K, V) = SoftMax(QK^T/√d + B)V
wherein Q, K and V correspond to the query, key and value of the self-attention mechanism respectively, and Q, K, V ∈ R^(m×d), a real matrix of m rows and d columns; d is the dimension of the query and key, and m is the number of patches in a window; the relative position values between different patches lie in the range [-M+1, M-1], where M is the arithmetic square root of the number of patches, so the relative position bias is parameterized by a matrix B, B ∈ R^((2M-1)×(2M-1)); the SoftMax function performs normalization and, taking the i-th patch as an example, is defined as follows:
SoftMax(x_i) = e^(x_i) / Σ_{j=1}^{m} e^(x_j)
wherein m is the number of patches;
aiming at the agricultural greenhouse extraction problem, the moving window size and the downsampling ratios of the Swin-Transformer neural network module are improved: the moving window size is changed from 12×12 to 7×7, and the downsampling ratios are changed from 8, 16, 32, 64 to 4, 8, 16, 32; with these parameter improvements, the Swin-Transformer backbone network extracts information on agricultural greenhouses in remote sensing images more effectively.
Step 3: inputting the remote sensing features of the agricultural greenhouse extracted in the step 2 into a NAS-FPN network, establishing a targeted feature pyramid aiming at the agricultural greenhouse in the Landsat image, and fusing the remote sensing features;
the targeted feature pyramid is constructed by searching the NAS-FPN search space stacked 7 times, and the feature pyramid with the highest feature-fusion accuracy is selected by comparison;
the NAS-FPN consists of a number of repeated merging cells, whose structure is shown in figure 7; the detection flow can be divided into 4 steps: first, feature maps of different scales, {C1, C2, C3, C4, C5}, are selected, with strides {8, 16, 32, 64, 128}, where C1, C2 and C3 are the 3 feature layers extracted by the network, and C4 and C5 are obtained by downsampling the C3 feature layer with strides 2 and 4 respectively; second, candidate layer features are constructed from these 5 feature layers, and 2 different feature layers are selected from the candidates for feature fusion; then, feature fusion is performed with operations from the operation pool, the output of the previous merging cell is taken as the input of the next merging cell, and the merging-cell operation is repeated until the threshold is reached; finally, the output feature layers, denoted {P1, P2, P3, P4, P5}, are generated in turn, and their resolutions correspond to the 5 input candidate feature layers respectively. The 2 feature-layer fusion operations in the operation pool are sum and global pooling. For sum, the smaller of the 2 feature maps is bilinearly interpolated to the size of the larger one and the two feature maps are added pixel by pixel. For global pooling, an attention feature is computed from the smaller, higher-level semantic feature map by averaging and a sigmoid and multiplied with the larger, lower-level feature map; the smaller higher-level feature map is then bilinearly interpolated to the size of the larger feature map and added pixel by pixel, so the fused feature map has the same size as the larger feature map. The feature fusion strategy searched for the agricultural greenhouse is shown in fig. 8.
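The two fusion operations of the merging cell described above can be sketched as follows; it is assumed for illustration that the two input feature maps have the same number of channels, and only the fusion step itself (not the searched cell wiring) is shown.

```python
# Sketch (assumption): the two feature-fusion operations of the merging cells,
# "sum" and "global pooling", as described above.
import torch
import torch.nn.functional as F

def fuse_sum(small, large):
    # Upsample the smaller (higher-level) map to the larger map's size, then add pixel by pixel.
    small_up = F.interpolate(small, size=large.shape[-2:], mode="bilinear", align_corners=False)
    return small_up + large

def fuse_global_pooling(small, large):
    # Average the smaller, higher-level semantic map and squash it with a sigmoid to get a
    # per-channel attention feature, multiply it onto the larger low-level map, then
    # upsample the smaller map and add it pixel by pixel; the result keeps the larger size.
    attn = torch.sigmoid(small.mean(dim=(-2, -1), keepdim=True))   # (B, C, 1, 1)
    small_up = F.interpolate(small, size=large.shape[-2:], mode="bilinear", align_corners=False)
    return large * attn + small_up
```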
Step 4: inputting the fused remote sensing features obtained in the step 3 into an RPN (Region Proposal Network) improved with the Approx strategy to obtain pre-selected anchor box information;
the Approx strategy is used to improve the positive/negative sample assignment of the RPN network; 9 pre-selected boxes are generated at each feature point of the fused remote sensing feature map obtained in the step 3, and the Approx strategy is applied to the 9 pre-selected boxes at each feature point: the IoU values of the pre-selected boxes are computed and compared, and the maximum IoU value at each feature point is kept; the remote sensing feature map is passed through 3×3 and 1×1 convolutions and then split into two branches, one branch uses a softmax classifier to make a binary decision on whether a target is present in the candidate box (the pre-selected box is kept if a target is present and removed otherwise), and the other branch performs Bounding-Box (bbox) regression; the bounding boxes are shifted according to the predicted offsets, whether each box is background is determined, the bboxes are then sorted by probability, the bbox with the highest probability score is kept, and the remaining bboxes are selected or rejected by soft non-maximum suppression; the area enclosed by the finally remaining bboxes is the ROI (Region of Interest) region;
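The soft non-maximum suppression step used to select and reject the remaining bboxes can be sketched as follows; this is a linear-decay variant, and the IoU and score thresholds are illustrative assumptions.

```python
# Sketch (assumption): soft non-maximum suppression (linear decay) over RPN proposals.
import numpy as np

def iou(box, boxes):
    # box: (4,), boxes: (N, 4) in (x1, y1, x2, y2) format; returns IoU of box with each row.
    x1 = np.maximum(box[0], boxes[:, 0]); y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2]); y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter + 1e-9)

def soft_nms(boxes, scores, iou_thr=0.5, score_thr=0.05):
    boxes, scores = boxes.copy(), scores.copy()
    keep = []
    while len(scores) > 0:
        i = scores.argmax()                         # keep the bbox with the highest score
        keep.append(boxes[i])
        decay = np.where(iou(boxes[i], boxes) > iou_thr, 1.0 - iou(boxes[i], boxes), 1.0)
        scores = scores * decay                     # soften overlapping scores instead of removing
        mask = scores > score_thr
        mask[i] = False
        boxes, scores = boxes[mask], scores[mask]
    return np.stack(keep) if keep else np.zeros((0, 4))
```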
for the sample assignment, the invention improves the positive/negative sample distribution of the RPN through the Approx strategy. The core idea of Approx is as follows: with the 9 anchor settings at each position of the original RetinaNet, the IoU between the 9 anchors and the ground truth (gt) is calculated, the highest of the 9 IoU values at each position is selected by a max operation, and that IoU value is used in the subsequent MaxIoUAssigner calculation; in this way, which positions on each feature map are positive samples can be obtained.
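A sketch of the Approx idea described above: for every feature-map position, the IoU between its 9 anchors and the ground-truth boxes is computed and only the per-position maximum is kept for the subsequent MaxIoU assignment. The helper names and array shapes are illustrative assumptions.

```python
# Sketch (assumption): Approx-style anchor handling - keep, for each location, only the best
# IoU among its 9 anchors, which then feeds the MaxIoU positive/negative assignment.
import numpy as np

def pairwise_iou(a, b):
    # a: (N, 4), b: (G, 4), boxes as (x1, y1, x2, y2); returns an (N, G) IoU matrix.
    x1 = np.maximum(a[:, None, 0], b[None, :, 0]); y1 = np.maximum(a[:, None, 1], b[None, :, 1])
    x2 = np.minimum(a[:, None, 2], b[None, :, 2]); y2 = np.minimum(a[:, None, 3], b[None, :, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (a[:, 2] - a[:, 0]) * (a[:, 3] - a[:, 1])
    area_b = (b[:, 2] - b[:, 0]) * (b[:, 3] - b[:, 1])
    return inter / (area_a[:, None] + area_b[None, :] - inter + 1e-9)

def approx_max_iou(anchors, gt_boxes, anchors_per_loc=9):
    # anchors: (num_locations * 9, 4), ordered location by location; gt_boxes: (G, 4)
    ious = pairwise_iou(anchors, gt_boxes)                       # (num_anchors, G)
    ious = ious.reshape(-1, anchors_per_loc, gt_boxes.shape[0])  # (locations, 9, G)
    return ious.max(axis=1)        # (locations, G): best of the 9 anchors at each location
```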
Step 5: extracting and marking the position information of the agricultural greenhouse based on the fusion characteristics of the step 3 and the pre-selected anchor frame information of the step 4;
step 6: calculating corresponding loss functions according to the agricultural greenhouse information obtained by prediction in the step 5 based on the sample marking information obtained in the step 1, wherein different loss functions are used according to different network modules; obtaining a NAS-Swin neural network framework;
Selection of the different loss functions: the overall loss function L_Total of the invention consists of two parts, the RPN network loss function L_RPN and the ROIAlign loss function L_ROIAlign, namely:
L_Total = L_RPN + L_ROIAlign
wherein L_RPN contains the object classification loss L_cls within the anchor boxes and the bounding-box position loss L_reg of the anchor boxes:
L_RPN = (1/N_cls) Σ_i L_cls(p_i, p_i*) + λ (1/N_reg) Σ_i p_i* L_reg(t_i, t_i*)
wherein p_i represents the probability that the i-th anchor box is predicted as the true label, p_i* is 1 for a positive sample and 0 for a negative sample, t_i represents the bounding-box regression parameters of the i-th anchor box, t_i* represents the ground-truth box corresponding to the i-th anchor box, N_cls is the total number of positive and negative anchor samples used to train the RPN network, N_reg represents the number of anchor-box positions, and λ is a balancing weight;
L_cls(p_i, p_i*) represents the classification loss, namely:
L_cls(p_i, p_i*) = -[p_i* log(p_i) + (1 - p_i*) log(1 - p_i)]
L_reg(t_i, t_i*) represents the bounding-box localization regression loss, expressed by the Smooth L1 loss, namely:
L_reg(t_i, t_i*) = Smooth_L1(t_i - t_i*), where Smooth_L1(x) = 0.5x^2 if |x| < 1, and |x| - 0.5 otherwise;
the bounding-box localization regression loss and classification loss of ROIAlign are the same as those of the RPN network, and the mask loss adopts a binary cross-entropy loss function (Binary Cross Entropy, BCE), namely:
L_mask = -(1/N) Σ_i [D_i* log(D_i) + (1 - D_i*) log(1 - D_i)]
wherein N is the number of pixels, D_i is the predicted probability that the i-th pixel belongs to the target pixel, and D_i* is the probability that the i-th pixel belongs to a true target pixel;
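The loss terms above can be sketched in PyTorch as follows; the reduction choices and the λ balancing weight are illustrative assumptions, not the exact implementation of the invention.

```python
# Sketch (assumption): RPN loss (binary classification + Smooth L1 regression) and the
# binary cross-entropy mask loss described above.
import torch
import torch.nn.functional as F

def rpn_loss(p, p_star, t, t_star, lam=1.0):
    # p: predicted object probabilities, p_star: 1/0 positive/negative labels,
    # t, t_star: predicted / ground-truth box regression parameters (positive samples only)
    l_cls = F.binary_cross_entropy(p, p_star.float())                           # L_cls
    l_reg = F.smooth_l1_loss(t, t_star) if t.numel() > 0 else p.new_zeros(())   # L_reg
    return l_cls + lam * l_reg

def mask_loss(d_pred, d_star):
    # Binary cross-entropy over per-pixel target probabilities D_i and labels D_i*
    return F.binary_cross_entropy(d_pred, d_star.float())
```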
step 7: the training set is input into the NAS-Swin neural network for training, with a maximum of 30 iterations and a learning rate of 0.00001. The test set obtained in the step 1 is then input into the trained NAS-Swin neural network to obtain the agricultural greenhouse information.
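A minimal training and evaluation loop under the stated settings (at most 30 iterations, learning rate 1e-5) is sketched below; the model, data loaders, optimizer choice and compute_loss function are placeholders standing in for the components described above, not part of the original disclosure.

```python
# Sketch (assumption): train the NAS-Swin network for at most 30 epochs at lr = 1e-5,
# then run inference on the test set.
import torch

def train_and_test(model, train_loader, test_loader, compute_loss,
                   max_epochs=30, lr=1e-5, device="cuda"):
    model = model.to(device)
    optim = torch.optim.Adam(model.parameters(), lr=lr)
    for epoch in range(max_epochs):
        model.train()
        for images, targets in train_loader:
            optim.zero_grad()
            loss = compute_loss(model(images.to(device)), targets)
            loss.backward()
            optim.step()
    model.eval()
    with torch.no_grad():
        return [model(images.to(device)) for images, _ in test_loader]
```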
Agricultural greenhouse extraction is carried out on the Landsat remote sensing images, and the results are compared with the BoxInst and SVM algorithms. The method disclosed by the invention achieves higher greenhouse extraction accuracy, as shown in Table 1:
Table 1: Swin Transformer (this method), SVM and BoxInst greenhouse extraction accuracy results

Claims (5)

1. the remote sensing image agricultural greenhouse extraction method based on NAS-Swin is characterized by comprising the following steps of:
step 1: obtaining Landsat satellite data and Sentinel-2 satellite image data, and then performing image preprocessing on them; the position information of agricultural greenhouse ground objects in the image data is annotated with labelme, and the annotated data are divided into a training set and a test set at a ratio of 4:1;
step 2: inputting the training set obtained in the step 1 into a parameter-improved Swin-Transformer neural network module for training to obtain remote sensing features of the agricultural greenhouse;
step 3: inputting the remote sensing features of the agricultural greenhouse extracted in the step 2 into a NAS-FPN network, establishing a targeted feature pyramid aiming at the agricultural greenhouse in the Landsat image, and fusing the remote sensing features;
the targeted feature pyramid is constructed by searching the NAS-FPN search space stacked 7 times, and the feature pyramid with the highest feature-fusion accuracy is selected by comparison;
step 4: inputting the fused remote sensing features obtained in the step 3 into an RPN (Region Proposal Network) improved with the Approx strategy to obtain pre-selected anchor box information;
step 5: extracting and marking the position information of the agricultural greenhouse based on the fusion characteristics of the step 3 and the pre-selected anchor frame information of the step 4;
step 6: calculating corresponding loss functions according to the agricultural greenhouse information obtained by prediction in the step 5 based on the sample marking information obtained in the step 1, wherein different loss functions are used according to different network modules; obtaining a NAS-Swin neural network framework;
step 7: and (3) inputting the training set obtained in the step (1) into a NAS-Swin neural network for training, and then inputting the testing set into the trained NAS-Swin neural network to finally obtain the information of the agricultural greenhouse.
2. The method for extracting the agricultural greenhouse from the remote sensing image based on NAS-Swin as set forth in claim 1, wherein the Swin-Transformer neural network module in the step 2 consists of 4 parts; each part is a similar repeating unit: the image is first divided into a set of non-overlapping image blocks by a patch partition layer, where the size of each image block is 4x4 and the corresponding feature dimension of each image block is 48; in the first part, a linear embedding layer first projects the divided feature dimension, and the result is computed in a Swin Transformer Block module; the second to the fourth parts are identical in structure: the set of image blocks input from the previous layer is merged by a patch merging layer over a 2x2 neighbourhood, so the merged image block size is four times that of the previous layer and the feature dimension also increases to four times that of the previous layer; a linear embedding layer then reduces the feature dimension to half, after which the self-attention of the image blocks is calculated by a Swin Transformer Block module.
3. The remote sensing image agricultural greenhouse extraction method based on NAS-Swin as set forth in claim 2, wherein the Swin Transformer Block module first normalizes the input feature Z1 from the preceding patch partition/merging layer through a Layer-Norm layer; feature learning is then performed by window-based multi-head self-attention (W-MSA), and a residual operation gives Z2; a Layer-Norm layer, an MLP layer and a residual operation then give Z3; Z3 goes through a Layer-Norm layer and shifted-window-based multi-head self-attention (SW-MSA) for feature learning, a residual operation gives Z4, and a further Layer-Norm layer, MLP and residual give Z5, the output-layer feature; the residual connection formulas in the module are as follows,
Z2=W-MSA(Layer-Norm(Z1))+Z1
Z3=MLP(Layer-Norm(Z2))+Z2
Z4=SW-MSA(Layer-Norm(Z3))+Z3
Z5=MLP(Layer-Norm(Z4))+Z4
wherein Z1 is the input feature, Z2 and Z3 are the output features of the W-MSA and MLP modules respectively, and Z4 and Z5 are the output features of the SW-MSA and MLP modules respectively; W-MSA and SW-MSA denote window multi-head self-attention and shifted (sliding) window multi-head self-attention, and information interaction between different image blocks is carried out by shifting the windows; to reduce the amount of computation, the Swin Transformer Block performs a shift-and-merge operation on the SW-MSA windows;
the calculation of self-attention is,
Attention(Q, K, V) = SoftMax(QK^T/√d + B)V
wherein Q, K and V correspond to the query, key and value of the self-attention mechanism respectively, and Q, K, V ∈ R^(m×d), a real matrix of m rows and d columns; d is the dimension of the query and key, and m is the number of patches in a window; the relative position values between different patches lie in the range [-M+1, M-1], where M is the arithmetic square root of the number of patches, so the relative position bias is parameterized by a matrix B, B ∈ R^((2M-1)×(2M-1)); the SoftMax function performs normalization and, taking the i-th patch as an example, is defined as follows:
SoftMax(x_i) = e^(x_i) / Σ_{j=1}^{m} e^(x_j)
wherein m is the number of patches;
aiming at the agricultural greenhouse extraction problem, the moving window size and the downsampling ratios of the Swin-Transformer neural network module are improved: the moving window size is changed from 12×12 to 7×7, and the downsampling ratios are changed from 8, 16, 32, 64 to 4, 8, 16, 32.
4. The remote sensing image agricultural greenhouse extraction method based on NAS-Swin according to claim 1, wherein the Approx strategy in the step 4 is used to improve the positive/negative sample assignment of the RPN network; 9 pre-selected boxes are generated at each feature point of the fused remote sensing feature map obtained in the step 3, and the Approx strategy is applied to the 9 pre-selected boxes at each feature point: the IoU values of the pre-selected boxes are computed and compared, and the maximum IoU value at each feature point is kept; the remote sensing feature map is passed through 3×3 and 1×1 convolutions and then split into two branches, one branch uses a softmax classifier to make a binary decision on whether a target is present in the candidate box (the pre-selected box is kept if a target is present and removed otherwise), and the other branch performs bounding-box regression; the bounding boxes are shifted according to the predicted offsets, whether each box is background is determined, the bboxes are then sorted by probability, the bbox with the highest probability score is kept, and the remaining bboxes are selected or rejected by soft non-maximum suppression; the area enclosed by the finally remaining bboxes is the ROI region.
5. The method for extracting the agricultural greenhouse from the remote sensing image based on NAS-Swin as set forth in claim 1, wherein in the selection of the different loss functions in the step 6, the overall loss function L_Total consists of two parts, the RPN network loss function L_RPN and the ROIAlign loss function L_ROIAlign, namely:
L_Total = L_RPN + L_ROIAlign
wherein L_RPN contains the object classification loss L_cls within the anchor boxes and the bounding-box position loss L_reg of the anchor boxes:
L_RPN = (1/N_cls) Σ_i L_cls(p_i, p_i*) + λ (1/N_reg) Σ_i p_i* L_reg(t_i, t_i*)
wherein p_i represents the probability that the i-th anchor box is predicted as the true label, p_i* is 1 for a positive sample and 0 for a negative sample, t_i represents the bounding-box regression parameters of the i-th anchor box, t_i* represents the ground-truth box corresponding to the i-th anchor box, N_cls is the total number of positive and negative anchor samples used to train the RPN network, N_reg represents the number of anchor-box positions, and λ is a balancing weight;
L_cls(p_i, p_i*) represents the classification loss, namely:
L_cls(p_i, p_i*) = -[p_i* log(p_i) + (1 - p_i*) log(1 - p_i)]
L_reg(t_i, t_i*) represents the bounding-box localization regression loss, namely:
L_reg(t_i, t_i*) = Smooth_L1(t_i - t_i*), where Smooth_L1(x) = 0.5x^2 if |x| < 1, and |x| - 0.5 otherwise;
the bounding-box localization regression loss and classification loss of ROIAlign are the same as those of the RPN network, and the mask loss adopts a binary cross-entropy loss function, namely:
L_mask = -(1/N) Σ_i [D_i* log(D_i) + (1 - D_i*) log(1 - D_i)]
wherein N is the number of pixels, D_i is the predicted probability that the i-th pixel belongs to the target pixel, and D_i* is the probability that the i-th pixel belongs to a true target pixel.
CN202211569653.2A 2022-12-08 2022-12-08 NAS-Swin-based remote sensing image agricultural greenhouse extraction method Pending CN116206210A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211569653.2A CN116206210A (en) 2022-12-08 2022-12-08 NAS-Swin-based remote sensing image agricultural greenhouse extraction method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211569653.2A CN116206210A (en) 2022-12-08 2022-12-08 NAS-Swin-based remote sensing image agricultural greenhouse extraction method

Publications (1)

Publication Number Publication Date
CN116206210A true CN116206210A (en) 2023-06-02

Family

ID=86518098

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211569653.2A Pending CN116206210A (en) 2022-12-08 2022-12-08 NAS-Swin-based remote sensing image agricultural greenhouse extraction method

Country Status (1)

Country Link
CN (1) CN116206210A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117274823A (en) * 2023-11-21 2023-12-22 Chengdu University of Technology Visual Transformer landslide identification method based on DEM feature enhancement
CN117274823B (en) * 2023-11-21 2024-01-26 Chengdu University of Technology Visual Transformer landslide identification method based on DEM feature enhancement

Similar Documents

Publication Publication Date Title
Li et al. Automatic organ-level point cloud segmentation of maize shoots by integrating high-throughput data acquisition and deep learning
Zhang et al. Deep learning-based automatic recognition network of agricultural machinery images
CN112749627A (en) Method and device for dynamically monitoring tobacco based on multi-source remote sensing image
CN111028255A (en) Farmland area pre-screening method and device based on prior information and deep learning
CN115481368B (en) Vegetation coverage estimation method based on full remote sensing machine learning
CN113160150B (en) AI (Artificial intelligence) detection method and device for invasion of foreign matters in wire mesh
CN112766155A (en) Deep learning-based mariculture area extraction method
CN115527123B (en) Land cover remote sensing monitoring method based on multisource feature fusion
Wang et al. Deep segmentation and classification of complex crops using multi-feature satellite imagery
CN110705449A (en) Land utilization change remote sensing monitoring analysis method
CN115452759B (en) River and lake health index evaluation method and system based on satellite remote sensing data
CN115880487A (en) Forest laser point cloud branch and leaf separation method based on deep learning method
CN113435254A (en) Sentinel second image-based farmland deep learning extraction method
CN115457396A (en) Surface target ground object detection method based on remote sensing image
CN117197668A (en) Crop lodging level prediction method and system based on deep learning
CN116206210A (en) NAS-Swin-based remote sensing image agricultural greenhouse extraction method
Cheng et al. Multi-scale Feature Fusion and Transformer Network for urban green space segmentation from high-resolution remote sensing images
Cai et al. Learning spectral-spatial representations from VHR images for fine-scale crop type mapping: A case study of rice-crayfish field extraction in South China
Yan et al. High-resolution mapping of paddy rice fields from unmanned airborne vehicle images using enhanced-TransUnet
CN114387446A (en) Automatic water body extraction method for high-resolution remote sensing image
CN116188993A (en) Remote sensing image cultivated land block segmentation method based on multitask learning
Meedeniya et al. Prediction of paddy cultivation using deep learning on land cover variation for sustainable agriculture
CN115527108A (en) Method for rapidly identifying water and soil loss artificial disturbance plots based on multi-temporal Sentinel-2
CN113205543A (en) Laser radar point cloud trunk extraction method based on machine learning
Wei et al. Multispectral remote sensing and DANet model improve the precision of urban park vegetation detection: an empirical study in Jinhai Park, Shanghai

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination