CN117253044B - A method for farmland remote sensing image segmentation based on semi-supervised interactive learning - Google Patents

A method for farmland remote sensing image segmentation based on semi-supervised interactive learning

Info

Publication number
CN117253044B
CN117253044B (application CN202311334268.4A; earlier publication CN117253044A)
Authority
CN
China
Prior art keywords: image, loss, CNN, images, loss function
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202311334268.4A
Other languages
Chinese (zh)
Other versions
CN117253044A (en)
Inventor
文思鉴
王永梅
王芃力
张友华
吴雷
吴海涛
轩亚恒
郑雪瑞
张世豪
潘海瑞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Anhui Agricultural University AHAU
Original Assignee
Anhui Agricultural University AHAU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Anhui Agricultural University AHAU filed Critical Anhui Agricultural University AHAU
Priority to CN202311334268.4A
Publication of CN117253044A
Application granted
Publication of CN117253044B
Legal status: Active


Classifications

    • G06V 10/26 Segmentation of patterns in the image field; cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; detection of occlusion
    • G06N 3/0455 Auto-encoder networks; encoder-decoder networks
    • G06N 3/0464 Convolutional networks [CNN, ConvNet]
    • G06N 3/0895 Weakly supervised learning, e.g. semi-supervised or self-supervised learning
    • G06V 10/40 Extraction of image or video features
    • G06V 10/765 Classification using rules for classification or partitioning the feature space
    • G06V 10/806 Fusion of extracted features at the sensor, preprocessing, feature extraction or classification level
    • G06V 10/82 Image or video recognition or understanding using neural networks
    • G06V 20/188 Terrestrial scenes; vegetation


Abstract

The invention belongs to the technical field of agricultural image analysis and provides a farmland remote sensing image segmentation method based on semi-supervised interactive learning. First, a CNN and a Transformer cooperate through interactive learning: self-supervised training on unlabeled data lets them exchange the local and global features of pixels, reducing the need for annotated data while avoiding the potential shortcomings of either existing approach. Second, a directional contrast loss function is introduced into the CNN and fully supervised training is performed on the labeled data, ensuring that identical identity features remain consistent across different scenes and thereby improving the generalization capability and robustness of the model.

Description

Farmland remote sensing image segmentation method based on semi-supervised interactive learning
Technical Field
The invention belongs to the technical field of agricultural image analysis, and particularly relates to a farmland remote sensing image segmentation method based on semi-supervised interactive learning.
Background
Farmland remote sensing image segmentation is an important task whose goal is to classify farmland remote sensing images at the pixel level, thereby improving the efficiency of agricultural land production and management.
Conventional deep-learning-based farmland remote sensing image segmentation methods generally require a large amount of annotated data for training, but annotated data is costly to acquire and this requirement is often hard to meet in practical applications. Semi-supervised learning is therefore one of the effective approaches to this problem.
Current semi-supervised learning methods train with a small amount of labeled data and a large amount of unlabeled data to improve model performance. In addition, because such models contain a large number of parameters, overfitting occurs easily: the model performs well on the training set but poorly on the test set. The generalization capability of a farmland remote sensing image segmentation model is therefore an important consideration when it is applied to real scenes.
Existing frameworks for improving the generalization capability and robustness of semi-supervised agricultural image segmentation algorithms fall into two main categories: agricultural image segmentation methods based on convolutional neural networks (CNNs) and methods based on Transformers. The former extract features in image space through convolution operations; their drawback is that CNNs use local receptive fields and progressively reduce image resolution from lower to higher layers through convolution and pooling, and this local-receptive-field limitation can cause loss of detail and of global context information in the image, particularly for fine-grained segmentation of large-scale farmland areas. The latter model global relationships in sequence space through self-attention; their drawback is that the Transformer aims to model the dependency between each pixel and every other pixel through global context information and is limited when handling local features. In a farmland remote sensing image, different crops or land types may appear at different scales, and some fine feature details require finer perception; a Transformer may fail to capture these details accurately when processing features at different scales, which reduces the accuracy and robustness of the segmentation result.
Disclosure of Invention
The embodiments of the invention aim to provide a farmland remote sensing image segmentation method based on semi-supervised interactive learning. First, the CNN and the Transformer cooperate through interactive learning: self-supervised training on unlabeled data lets them exchange the local and global features of pixels, which reduces the need for annotated data while effectively avoiding the potential shortcomings of the two existing approaches. Second, a directional contrast loss function is introduced into the CNN and fully supervised training is performed on the labeled data to keep identical identity features consistent across different scenes, thereby improving the generalization capability and robustness of the model.
In view of the above, the invention provides a farmland remote sensing image segmentation method based on semi-supervised interactive learning, which comprises the following steps:
Step S10: m input images divided with labels And N images without labels
Step S20: train the CNN and the Transformer separately using the labeled image data;
Step S30: apply weak-enhancement processing (Gaussian filtering and brightness adjustment) to the unlabeled images and randomly crop, from each image, two new images with an overlapping region, x_U1 = {x_11, x_21, ..., x_N1} and x_U2 = {x_12, x_22, ..., x_N2}; at the same time, project the pixels of the unlabeled images between the encoder and decoder of the CNN and introduce a directional contrast loss function to keep identical identity features consistent across different scenes; use the Transformer prediction as a pseudo label to compute the context-aware consistency loss, and use the CNN prediction as a pseudo label for the Transformer prediction to compute the consistency regularization loss;
Step S40: use the trained CNN model as the backbone network to segment the test-set images and evaluate the accuracy of the results.
As a further limitation of the technical solution of the present invention, the step of performing the weak enhancement processing of gaussian filtering and brightness adjustment on the unlabeled image, and randomly cropping two new images with overlapping areas in the same image includes:
Step S31: apply Gaussian filtering to reduce noise and detail in the unlabeled image by taking a weighted average over the neighborhood of each pixel; for each pixel (x, y), filtering is performed with a Gaussian kernel of size k, and the filtered pixel value is:
I(x, y) = Σ ( G(x', y') * I(x', y') )   (1)
where I(x', y') is a pixel value in the neighborhood and G(x', y') is the weight of the Gaussian kernel;
finally, the brightness of the image is adjusted;
step S32: for a given weakly enhanced unlabeled image, randomly selecting the size and position of a cropping window;
the cropping window is then shifted upward and to the left by a certain distance, yielding two new images x_u1 and x_u2 with an overlapping region that are used to train the model;
Step S33: a bicubic interpolation algorithm scales every image x to 513 px * 513 px, and a bilinear interpolation algorithm scales the corresponding label y to the same size, so that the input images meet the input specification of the DeepLab v3+ network.
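For reference, the following is a minimal sketch of the weak-enhancement pipeline of steps S31-S33, assuming 8-bit NumPy/OpenCV images; the kernel size, brightness factor, crop size and shift distance are illustrative placeholders rather than values fixed by the invention (only the 513 px target size and the interpolation modes come from the text).

```python
import numpy as np
import cv2


def weak_enhance(image: np.ndarray, kernel_size: int = 5,
                 brightness: float = 1.1) -> np.ndarray:
    """Step S31: Gaussian filtering followed by a simple brightness adjustment
    (assumes an 8-bit image)."""
    blurred = cv2.GaussianBlur(image, (kernel_size, kernel_size), 0)
    return np.clip(blurred.astype(np.float32) * brightness, 0, 255).astype(np.uint8)


def overlapping_crops(image: np.ndarray, crop: int = 321, shift: int = 64):
    """Step S32: place one crop window at random, then shift it up and to the
    left so that the two crops x_u1, x_u2 share an overlapping region
    (assumes the image is larger than crop + shift in both dimensions)."""
    h, w = image.shape[:2]
    y = np.random.randint(shift, h - crop + 1)
    x = np.random.randint(shift, w - crop + 1)
    crop1 = image[y:y + crop, x:x + crop]
    crop2 = image[y - shift:y - shift + crop, x - shift:x - shift + crop]
    return crop1, crop2


def resize_pair(image: np.ndarray, label: np.ndarray, size: int = 513):
    """Step S33: bicubic for images, bilinear for labels, to match the
    DeepLab v3+ input size."""
    img = cv2.resize(image, (size, size), interpolation=cv2.INTER_CUBIC)
    lab = cv2.resize(label, (size, size), interpolation=cv2.INTER_LINEAR)
    return img, lab
```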
As a further limitation of the technical solution of the present invention, the training process of the tagged image data includes:
use the labeled image matrix x_L = {x_1, x_2, ..., x_M} and labels y_L = {y_1, y_2, ..., y_M} to train the two backbone network models, CNN and Transformer, and compute the supervised loss against the true labels.
As a further limitation of the technical scheme of the invention, computing the loss function against the true labels comprises the following steps:
the labeled data x_L is fed into the CNN to obtain the prediction probability of each pixel and into the Transformer to obtain its prediction probability, and the loss between these predictions and the corresponding ground-truth values is then computed as follows:
compared with the true label y_l, the loss function of the CNN branch is given by formula (2):
(2)
and the loss function of the Transformer branch is given by formula (3):
(3)
where σ denotes the ReLU activation function and l_FL denotes the Focal Loss, whose expression is shown in formula (4):
l_FL = -α_l (1 - p_l)^γ log(p_l)   (4)
where α_l and γ are hyperparameters, set here to α_l = 0.25 and γ = 2;
the total loss of the supervised learning model is computed as shown in formula (5):
(5)
where y_l = 1 when l is the true label and y_l = 0 otherwise, and p_l is a real number between 0 and 1 denoting the probability that the image belongs to the class annotated in the label.
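For reference, a minimal PyTorch sketch of the supervised losses of this step. The exact forms of formulas (2), (3) and (5) are not reproduced in the text above, so combining a pixel-wise cross-entropy term with the Focal Loss of formula (4) is an assumption; α_l = 0.25 and γ = 2 follow the text.

```python
import torch
import torch.nn.functional as F


def focal_loss(logits: torch.Tensor, target: torch.Tensor,
               alpha: float = 0.25, gamma: float = 2.0) -> torch.Tensor:
    """Focal Loss in the spirit of formula (4): down-weights easy pixels."""
    ce = F.cross_entropy(logits, target, reduction="none")  # -log p_t per pixel
    pt = torch.exp(-ce)                                     # p_t
    return (alpha * (1.0 - pt) ** gamma * ce).mean()


def supervised_loss(cnn_logits: torch.Tensor, trans_logits: torch.Tensor,
                    target: torch.Tensor) -> torch.Tensor:
    """Assumed combination for formula (5): each branch contributes a
    cross-entropy term plus a Focal Loss term against the true labels."""
    l_cnn = F.cross_entropy(cnn_logits, target) + focal_loss(cnn_logits, target)
    l_trans = F.cross_entropy(trans_logits, target) + focal_loss(trans_logits, target)
    return l_cnn + l_trans
```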
As a further limitation of the technical solution of the present invention, the training process of the label-free image data includes:
the two groups of randomly cropped, weakly enhanced unlabeled images are fed into the CNN network to obtain predictions that serve as pseudo labels for the Transformer model's predictions, from which the consistency regularization loss is computed;
predictions obtained through the Transformer network serve as pseudo labels for the intermediate projection, from which the context-aware consistency loss is computed; this guarantees the interactive transfer of local image information and global context information during training, so that the model fully learns the consistency regularization capability.
As a further limitation of the technical scheme of the invention, during training on the unlabeled images, the two groups of weakly enhanced unlabeled input images x_u1 and x_u2 are passed through the CNN model framework to produce two groups of predicted values; likewise, two groups of predicted values are produced through the Transformer model framework:
(6)
where the two functions denote the CNN network model and the Transformer network model, respectively;
each pseudo label is computed as shown in formula (7):
(7)
where argmax(p) denotes the label at which the predicted probability value p is maximal;
the CNN prediction is used as the pseudo label for the Transformer, and the consistency regularization loss is computed as shown in formula (8):
(8)
where σ denotes the ReLU activation function and l_dice denotes the Dice loss function;
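For reference, a minimal sketch of the cross pseudo-labelling of formulas (7) and (8), assuming softmax outputs from both branches; since formula (8) is not reproduced above, the soft-Dice formulation used here is a common variant and an assumption.

```python
import torch
import torch.nn.functional as F


def dice_loss(prob: torch.Tensor, target_onehot: torch.Tensor,
              eps: float = 1e-6) -> torch.Tensor:
    """Soft Dice loss between predicted probabilities and a one-hot target."""
    dims = (0, 2, 3)                                   # batch and spatial dims
    inter = (prob * target_onehot).sum(dims)
    union = prob.sum(dims) + target_onehot.sum(dims)
    return 1.0 - ((2.0 * inter + eps) / (union + eps)).mean()


def consistency_regularization(cnn_logits: torch.Tensor,
                               trans_logits: torch.Tensor) -> torch.Tensor:
    """Formulas (7)-(8) in spirit: the argmax of the CNN prediction serves as
    the pseudo label for the Transformer prediction."""
    num_classes = cnn_logits.shape[1]
    pseudo = cnn_logits.argmax(dim=1).detach()         # pseudo label, formula (7)
    pseudo_onehot = F.one_hot(pseudo, num_classes).permute(0, 3, 1, 2).float()
    trans_prob = trans_logits.softmax(dim=1)
    return dice_loss(trans_prob, pseudo_onehot)
```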
x_u1 and x_u2 are passed through the DeepLab v3+ encoder to obtain the feature maps M_u1 and M_u2, which are then projected by a nonlinear projector into M_o1 and M_o2; a directional contrast loss function encourages the overlapping region x_o to align, under different backgrounds, toward the contrastive features with higher confidence, so that the two views finally stay consistent;
for the i-th unlabeled image, the directional contrast loss is computed as in formulas (9)-(11):
(9)
(10)
(11)
where N denotes the number of spatial positions in the overlapping feature region; r computes feature similarity; h and w denote a two-dimensional spatial position; M_u denotes the set of negative image samples, m denotes a negative sample in M_u, and C(*) denotes the classifier;
the prediction of the Transformer is used as the pseudo label to compute the consistency loss after the context-aware constraint, as in formula (12):
(12)
where l_dice denotes the Dice loss function and C(*) denotes the classifier.
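For reference, a heavily simplified sketch in the spirit of the directional contrast loss of formulas (9)-(11): features of the overlapping region are pulled toward the corresponding features of the other crop only where that crop's classifier confidence is higher, with the remaining positions acting as negatives. The InfoNCE-style form and the temperature are assumptions, since the exact formulas are not reproduced above; the context-aware consistency loss of formula (12) would additionally compare the classifier output on the projected features with the Transformer pseudo labels.

```python
import torch
import torch.nn.functional as F


def directional_contrast_loss(feat1: torch.Tensor, feat2: torch.Tensor,
                              conf1: torch.Tensor, conf2: torch.Tensor,
                              temperature: float = 0.1) -> torch.Tensor:
    """feat1/feat2: [N, D] projected features of the overlapping region x_o from
    the two crops; conf1/conf2: [N] classifier confidences at the same positions."""
    feat1 = F.normalize(feat1, dim=1)
    feat2 = F.normalize(feat2, dim=1)
    # positives: matching positions; negatives: every other position of feat2
    logits = feat1 @ feat2.t() / temperature            # [N, N] cosine similarities
    targets = torch.arange(feat1.shape[0], device=feat1.device)
    per_position = F.cross_entropy(logits, targets, reduction="none")
    # directional weighting: only pull toward the view with higher confidence
    direction = (conf2 > conf1).float()
    return (direction * per_position).sum() / direction.sum().clamp(min=1.0)
```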
As a further limitation of the technical scheme of the invention, the overall loss is minimized: during training, the parameters of the classification network are first initialized, then forward and backward propagation are performed with the training data, the gradients of the loss function are computed, and the network parameters are updated with an optimization algorithm such as gradient descent until the total loss function reaches a preset convergence condition;
the total loss is a linear combination of the total supervised loss, the consistency regularization loss, the directional contrast loss and the context-aware consistency loss, computed as in formula (13):
(13)
where λ and λ_w are weight factors that control the proportion of the directional contrast loss and the consistency regularization loss, respectively, in the total loss function.
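For reference, a minimal sketch of the optimisation described above, assuming the four loss terms are computed per batch as in the earlier sketches; the SGD optimiser, the learning rate and the weight factors λ and λ_w used here are placeholder choices.

```python
import torch


def fit(model_params, compute_losses, data_loader, epochs: int = 50,
        lam: float = 0.1, lam_w: float = 1.0, lr: float = 1e-3):
    """Minimise the total loss of formula (13): supervised loss plus the weighted
    directional contrast and consistency regularization losses plus the
    context-aware consistency loss."""
    optimizer = torch.optim.SGD(model_params, lr=lr, momentum=0.9)
    for _ in range(epochs):
        for batch in data_loader:
            l_sup, l_cr, l_dc, l_cac = compute_losses(batch)
            loss = l_sup + lam * l_dc + lam_w * l_cr + l_cac   # formula (13)
            optimizer.zero_grad()
            loss.backward()     # backward propagation of the total loss
            optimizer.step()    # gradient-descent parameter update
    return optimizer
```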
As a further limitation of the technical scheme of the present invention, step S40 specifically includes model testing: for a given test-set farmland remote sensing image, the CNN serves as the backbone network model to extract features, the probability of the class to which each pixel of the target image belongs is output by the segmentation, and a threshold is set to mark each pixel as segmentation target or background.
As a further limitation of the present invention, the step of marking a pixel as segmentation target or background with the set threshold includes: the segmentation model finds the maximum gray value P_max and minimum gray value P_min of the image and sets the initial threshold T_0 = (P_max + P_min)/2; according to T(k), k = 0, 1, 2, ..., the image is divided into foreground and background and their average gray values H_1 and H_2 are computed; the new threshold is T(k+1) = (H_1 + H_2)/2, and the iteration continues until T(k) = T(k+1), which yields the threshold; a prediction probability above the threshold is marked as foreground and one below it as background, producing the final segmentation mask.
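For reference, a minimal sketch of the iterative threshold selection described above, assuming prob is the per-pixel foreground probability map produced by the CNN backbone; the convergence tolerance is an added practical detail.

```python
import numpy as np


def iterative_threshold_mask(prob: np.ndarray, tol: float = 1e-4) -> np.ndarray:
    """Iterate T(k+1) = (H1 + H2) / 2 until the threshold stabilises, then mark
    pixels above the threshold as foreground and the rest as background."""
    t = (prob.max() + prob.min()) / 2.0        # initial threshold T0
    while True:
        fg, bg = prob[prob > t], prob[prob <= t]
        h1 = fg.mean() if fg.size else t       # mean value of the foreground
        h2 = bg.mean() if bg.size else t       # mean value of the background
        t_new = (h1 + h2) / 2.0
        if abs(t_new - t) < tol:               # T(k) == T(k+1) up to tolerance
            break
        t = t_new
    return (prob > t).astype(np.uint8)         # final segmentation mask
```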
Compared with the prior art, the farmland remote sensing image segmentation method based on semi-supervised interactive learning has the following beneficial effects:
First, a direction-aware consistency constraint module is inserted into the interactive learning network of the CNN and the Transformer. The CNN excels at image processing and can effectively extract the spatial features of farmland remote sensing images, capturing local and global features through convolution and pooling operations; the Transformer excels in natural language processing and handles sequence data and long-range dependencies well. Combining the CNN's strength in spatial feature extraction with the Transformer's strength in long-range dependency modeling allows the features in farmland remote sensing images to be extracted and modeled better, improving segmentation accuracy.
Second, the Transformer uses a self-attention mechanism in the encoder-decoder framework and can effectively capture the contextual information in the image, which is very important for accurate farmland remote sensing image segmentation; because crops and background in the image often have wide spatial correlation, the relationship between pixels can be modeled better, improving segmentation accuracy. On the other hand, farmland remote sensing images usually have high resolution, detail information needs to be accurately recovered during segmentation, and a conventional CNN decoder has high computation and memory demands on high-resolution images; by introducing the Transformer layer between the encoder and the decoder, the invention can gradually restore the image resolution and reduce the consumption of computation and memory resources, thereby processing high-resolution images more effectively.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the following description will briefly introduce the drawings that are needed in the embodiments or the description of the prior art, and it is obvious that the drawings in the following description are only some embodiments of the present invention.
FIG. 1 is a system architecture diagram of a farmland remote sensing image segmentation method based on semi-supervised interactive learning;
FIG. 2 is a flow chart of an implementation of a farmland remote sensing image segmentation method based on semi-supervised interactive learning;
FIG. 3 is a sub-flow of a farmland remote sensing image segmentation method based on semi-supervised interactive learning;
FIG. 4 is a block diagram of a farmland remote sensing image segmentation system provided by the invention;
Fig. 5 is a block diagram of a computer device according to the present invention.
Detailed Description
The present application will be further described with reference to the accompanying drawings and detailed description, wherein it is to be understood that, on the premise of no conflict, the following embodiments or technical features may be arbitrarily combined to form new embodiments.
In order to make the objects, technical solutions and advantages of the present application more apparent, the following embodiments of the present application will be described in further detail with reference to the accompanying drawings. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.
It should be noted that, in the embodiments of the present invention, all the expressions "first" and "second" are used to distinguish two non-identical entities with the same name or non-identical parameters, and it is noted that the "first" and "second" are only used for convenience of expression, and should not be construed as limiting the embodiments of the present invention. Furthermore, the terms "comprise" and "have," and any variations thereof, are intended to cover a non-exclusive inclusion, such as a process, method, system, article, or other step or unit that comprises a list of steps or units.
The following description of the embodiments of the present application will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are some, but not all embodiments of the application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
The flow diagrams depicted in the figures are merely illustrative and not necessarily all of the elements and operations/steps are included or performed in the order described. For example, some operations/steps may be further divided, combined, or partially combined, so that the order of actual execution may be changed according to actual situations.
Some embodiments of the present application are described in detail below with reference to the accompanying drawings. The following embodiments and features of the embodiments may be combined with each other without conflict.
At present, existing frameworks for improving the generalization capability and robustness of semi-supervised agricultural image segmentation algorithms fall into two main categories: agricultural image segmentation methods based on convolutional neural networks (CNNs) and methods based on Transformers. The former extract features in image space through convolution operations; their drawback is that CNNs use local receptive fields and progressively reduce image resolution from lower to higher layers through convolution and pooling, and this local-receptive-field limitation can cause loss of detail and of global context information in the image, particularly for fine-grained segmentation of large-scale farmland areas. The latter model global relationships in sequence space through self-attention; their drawback is that the Transformer aims to model the dependency between each pixel and every other pixel through global context information and is limited when handling local features. In a farmland remote sensing image, different crops or land types may appear at different scales, and some fine feature details require finer perception; a Transformer may fail to capture these details accurately when processing features at different scales, which reduces the accuracy and robustness of the segmentation result.
In order to solve these problems, the invention designs a farmland remote sensing image segmentation method based on semi-supervised interactive learning. First, the CNN and the Transformer cooperate through interactive learning: self-supervised training on unlabeled data lets them exchange the local and global features of pixels, which reduces the need for annotated data while effectively avoiding the potential shortcomings of the two existing approaches. Second, a directional contrast loss function is introduced into the CNN and fully supervised training is performed on the labeled data to keep identical identity features consistent across different scenes, thereby improving the generalization capability and robustness of the model.
Specific implementations of the invention are described in detail below in connection with specific embodiments.
Example 1
FIG. 1 illustrates an exemplary system architecture for implementing a semi-supervised interactive learning based farmland remote sensing image segmentation method.
FIG. 2 shows the implementation flow of the farmland remote sensing image segmentation method based on semi-supervised interactive learning;
As shown in fig. 1 and fig. 2, in an embodiment of the present invention, a farmland remote sensing image segmentation method based on semi-supervised interactive learning includes the following steps:
Step S10: divide the data into M labeled input images x_L = {x_1, x_2, ..., x_M} and N unlabeled images x_U = {x_1, x_2, ..., x_N};
Step S20: train the CNN and the Transformer separately using the labeled image data;
Step S30: apply weak-enhancement processing (Gaussian filtering and brightness adjustment) to the unlabeled images and randomly crop two new images with an overlapping region from the same image, i.e. each of the N unlabeled images is randomly cropped into two groups of new images x_U1 and x_U2 with an overlapping region, and all images are scaled to a unified size; at the same time, the pixels of the unlabeled images are projected between the encoder and decoder of the CNN and a directional contrast loss function is introduced to keep identical identity features consistent across different scenes; the Transformer prediction is used as a pseudo label to compute the context-aware consistency loss, and the CNN prediction is used as a pseudo label for the Transformer prediction to compute the consistency regularization loss.
Step S40: use the trained CNN model as the backbone network to segment the test-set images and evaluate the accuracy of the results.
Further, as shown in fig. 3, in the step S30, the step of performing weak enhancement processing of gaussian filtering and brightness adjustment on the unlabeled image, and randomly cropping two new images with overlapping areas in the same image includes:
Step S31: apply Gaussian filtering to reduce noise and detail in the unlabeled image by taking a weighted average over the neighborhood of each pixel; for each pixel (x, y), filtering is performed with a Gaussian kernel of size k, and the filtered pixel value is:
I(x, y) = Σ ( G(x', y') * I(x', y') )   (1)
where I(x', y') is a pixel value in the neighborhood and G(x', y') is the weight of the Gaussian kernel;
finally, the brightness of the image is adjusted;
Step S32: for a given weakly enhanced unlabeled image, randomly select the size and position of a cropping window;
the cropping window is then shifted upward and to the left by a certain distance, yielding two new images x_u1 and x_u2 with an overlapping region that are used to train the model;
Step S33: a bicubic interpolation algorithm scales every image x to 513 px * 513 px, and a bilinear interpolation algorithm scales the corresponding label y to the same size, so that the input images meet the input specification of the DeepLab v3+ network.
Further, in an embodiment of the present invention, the training process of the tagged image data includes:
Use the labeled image matrix x_L = {x_1, x_2, ..., x_M} and labels y_L = {y_1, y_2, ..., y_M} to train the two backbone network models, CNN and Transformer, and compute the supervised loss against the true labels.
Computing the loss function against the true labels comprises the following steps:
the labeled data x_L is fed into the CNN to obtain the prediction probability of each pixel and into the Transformer to obtain its prediction probability, and the loss between these predictions and the corresponding ground-truth values is then computed as follows:
compared with the true label y_l, the loss function of the CNN branch is given by formula (2):
(2)
and the loss function of the Transformer branch is given by formula (3):
(3)
where σ denotes the ReLU activation function and l_FL denotes the Focal Loss, whose expression is shown in formula (4):
l_FL = -α_l (1 - p_l)^γ log(p_l)   (4)
where α_l and γ are hyperparameters, set here to α_l = 0.25 and γ = 2;
the total loss of the supervised learning model is computed as shown in formula (5):
(5)
where y_l = 1 when l is the true label and y_l = 0 otherwise, and p_l is a real number between 0 and 1 denoting the probability that the image belongs to the class annotated in the label.
Further, in an embodiment of the present invention, the training process of the label-free image data includes:
the two groups of randomly cropped, weakly enhanced unlabeled images are fed into the CNN network to obtain predictions that serve as pseudo labels for the Transformer model's predictions, from which the consistency regularization loss is computed;
predictions obtained through the Transformer network serve as pseudo labels for the intermediate projection, from which the context-aware consistency loss is computed; this guarantees the interactive transfer of local image information and global context information during training, so that the model fully learns the consistency regularization capability.
Further, in the embodiment of the invention, during training on the unlabeled images the two backbone network models focus on learning local features and global features respectively, information interaction is used to transfer feature knowledge so that their weaknesses complement each other, and a direction-aware consistency constraint is introduced to ensure that the CNN module retains good robustness and generalization capability even with only a small amount of data; specifically:
for the two groups of weakly enhanced unlabeled input images x_u1 and x_u2, two groups of predicted values are generated through the CNN model framework; similarly, two groups of predicted values are generated through the Transformer model framework:
(6)
where the two functions denote the CNN network model and the Transformer network model, respectively;
the CNN is good at capturing local characteristics and spatial correlation in image processing, extracting the local structure of the image through convolution operations with local receptive fields;
the Transformer is better suited to modeling global dependencies and long-range relations, establishing global information interaction across the whole input sequence through its self-attention mechanism;
these predictions therefore have essentially different properties at the output level, and the pseudo labels are computed as shown in formula (7):
(7)
where argmax(p) denotes the label at which the predicted probability value p is maximal;
the CNN prediction is used as the pseudo label for the Transformer, and the consistency regularization loss is computed as shown in formula (8):
(8)
where σ denotes the ReLU activation function and l_dice denotes the Dice loss function;
x_u1 and x_u2 are passed through the DeepLab v3+ encoder to obtain the feature maps M_u1 and M_u2, which are then projected by a nonlinear projector into M_o1 and M_o2; a directional contrast loss function encourages the overlapping region x_o to align, under different backgrounds, toward the contrastive features with higher confidence, so that the two views finally stay consistent;
for the i-th unlabeled image, the directional contrast loss is computed as in formulas (9)-(11):
(9)
(10)
(11)
where N denotes the number of spatial positions in the overlapping feature region; r computes feature similarity; h and w denote a two-dimensional spatial position; M_u denotes the set of negative image samples, m denotes a negative sample in M_u, and C(*) denotes the classifier;
the prediction of the Transformer is used as the pseudo label to compute the consistency loss after the context-aware constraint, as in formula (12):
(12)
where l_dice denotes the Dice loss function and C(*) denotes the classifier.
Further, in the embodiment of the present invention, the overall loss is minimized: during training, the parameters of the classification network are first initialized, then forward and backward propagation are performed with the training data, the gradients of the loss function are computed, and the network parameters are updated with an optimization algorithm such as gradient descent until the total loss function reaches a preset convergence condition;
by iteratively updating the network parameters, the classification network is expected to learn a suitable feature representation that minimizes the difference between the predictions and the true labels;
the total loss is a linear combination of the total supervised loss, the consistency regularization loss, the directional contrast loss and the context-aware consistency loss, computed as in formula (13):
(13)
where λ and λ_w are weight factors that control the proportion of the directional contrast loss and the consistency regularization loss, respectively, in the total loss function.
As a further limitation of the technical scheme of the present invention, step S40 specifically includes model testing: for a given test-set farmland remote sensing image, the CNN serves as the backbone network model to extract features, the probability of the class to which each pixel of the target image belongs is output by the segmentation, and a threshold is set to mark each pixel as segmentation target or background.
As a further limitation of the present invention, the step of marking a pixel as segmentation target or background with the set threshold includes: the segmentation model finds the maximum gray value P_max and minimum gray value P_min of the image and sets the initial threshold T_0 = (P_max + P_min)/2; according to T(k), k = 0, 1, 2, ..., the image is divided into foreground and background and their average gray values H_1 and H_2 are computed; the new threshold is T(k+1) = (H_1 + H_2)/2, and the iteration continues until T(k) = T(k+1), which yields the threshold; a prediction probability above the threshold is marked as foreground and one below it as background, producing the final segmentation mask.
In summary, the invention inserts a direction-aware consistency constraint module into the interactive learning network of the CNN and the Transformer. The CNN excels at image processing and can effectively extract the spatial features of farmland remote sensing images, capturing local and global features through convolution and pooling operations; the Transformer excels in natural language processing and handles sequence data and long-range dependencies well. Combining the CNN's strength in spatial feature extraction with the Transformer's strength in long-range dependency modeling allows the features in farmland remote sensing images to be extracted and modeled better, improving segmentation accuracy.
In addition, the Transformer uses a self-attention mechanism in the encoder-decoder framework and can effectively capture contextual information in the image, which is very important for accurate farmland remote sensing image segmentation; because crops and background in the image often have wide spatial correlation, the relationship between pixels can be modeled better, improving segmentation accuracy.
On the other hand, farmland remote sensing images usually have high resolution, detail information needs to be accurately recovered during segmentation, and a conventional CNN decoder has high computation and memory demands on high-resolution images; by introducing the Transformer layer between the encoder and the decoder, the invention can gradually restore the image resolution and reduce the consumption of computation and memory resources, thereby processing high-resolution images more effectively.
Example 2
As shown in fig. 4, in an exemplary embodiment provided by the present disclosure, the present invention further provides a farmland remote sensing image segmentation system, the farmland remote sensing image segmentation system 50 includes:
a preprocessing module 51 for dividing the data into M labeled input images x_L = {x_1, x_2, ..., x_M} and N unlabeled images x_U = {x_1, x_2, ..., x_N};
a first training module 52 for training the CNN and the Transformer separately using the labeled image data;
a second training module 53 for applying weak-enhancement processing (Gaussian filtering and brightness adjustment) to the unlabeled images and randomly cropping two new images with an overlapping region from the same image, i.e. each of the N unlabeled images is randomly cropped into two groups of new images x_U1 and x_U2 with an overlapping region, and all images are scaled to a unified size; at the same time, the pixels of the unlabeled images are projected between the encoder and decoder of the CNN and a directional contrast loss function is introduced to keep identical identity features consistent across different scenes; the Transformer prediction is used as a pseudo label to compute the context-aware consistency loss, and the CNN prediction is used as a pseudo label for the Transformer prediction to compute the consistency regularization loss;
a model test module 54 for segmenting the test-set images with the trained CNN model as the backbone network and evaluating the accuracy of the results.
Example 3
As shown in fig. 5, in an embodiment of the present invention, the present invention further provides a computer device.
The computer device 60 comprises a memory 61, a processor 62 and computer-readable instructions stored in the memory 61 and executable on the processor 62; when executing the computer-readable instructions, the processor 62 implements the farmland remote sensing image segmentation method based on semi-supervised interactive learning as provided by Embodiment 1.
The farmland remote sensing image segmentation method based on semi-supervised interactive learning comprises the following steps:
Step S10: divide the data into M labeled input images x_L = {x_1, x_2, ..., x_M} and N unlabeled images x_U = {x_1, x_2, ..., x_N};
Step S20: train the CNN and the Transformer separately using the labeled image data;
Step S30: apply weak-enhancement processing (Gaussian filtering and brightness adjustment) to the unlabeled images and randomly crop two new images with an overlapping region from the same image, i.e. each of the N unlabeled images is randomly cropped into two groups of new images x_U1 and x_U2 with an overlapping region, and all images are scaled to a unified size; at the same time, the pixels of the unlabeled images are projected between the encoder and decoder of the CNN and a directional contrast loss function is introduced to keep identical identity features consistent across different scenes; the Transformer prediction is used as a pseudo label to compute the context-aware consistency loss, and the CNN prediction is used as a pseudo label for the Transformer prediction to compute the consistency regularization loss.
Step S40: use the trained CNN model as the backbone network to segment the test-set images and evaluate the accuracy of the results.
In addition, the device 60 according to the embodiment of the present invention may further have a communication interface 63 for receiving a control command.
Example 4
In an exemplary embodiment provided by the present disclosure, a computer-readable storage medium is also provided.
Specifically, in an exemplary embodiment of the present disclosure, the storage medium stores computer-readable instructions that, when executed by one or more processors, cause the one or more processors to perform the farmland remote sensing image segmentation method based on semi-supervised interactive learning as provided by Embodiment 1.
The farmland remote sensing image segmentation method based on semi-supervised interactive learning comprises the following steps:
Step S10: divide the data into M labeled input images x_L = {x_1, x_2, ..., x_M} and N unlabeled images x_U = {x_1, x_2, ..., x_N};
Step S20: train the CNN and the Transformer separately using the labeled image data;
Step S30: apply weak-enhancement processing (Gaussian filtering and brightness adjustment) to the unlabeled images and randomly crop two new images with an overlapping region from the same image, i.e. each of the N unlabeled images is randomly cropped into two groups of new images x_U1 and x_U2 with an overlapping region, and all images are scaled to a unified size; at the same time, the pixels of the unlabeled images are projected between the encoder and decoder of the CNN and a directional contrast loss function is introduced to keep identical identity features consistent across different scenes; the Transformer prediction is used as a pseudo label to compute the context-aware consistency loss, and the CNN prediction is used as a pseudo label for the Transformer prediction to compute the consistency regularization loss.
Step S40: use the trained CNN model as the backbone network to segment the test-set images and evaluate the accuracy of the results.
In various embodiments of the present invention, it should be understood that the size of the sequence numbers of the processes does not mean that the execution sequence of the processes is necessarily sequential, and the execution sequence of the processes should be determined by the functions and internal logic thereof, and should not constitute any limitation on the implementation process of the embodiments of the present invention.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the embodiment.
In addition, each functional unit in the embodiments of the present invention may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer-accessible memory. Based on this understanding, the technical solution of the present invention, or a part contributing to the prior art or all or part of the technical solution, may be embodied in the form of a software product stored in a memory, comprising several requests for a computer device (which may be a personal computer, a server or a network device, etc., in particular may be a processor in a computer device) to execute some or all of the steps of the method according to the embodiments of the present invention.
Those of ordinary skill in the art will appreciate that some or all of the steps of the various methods of the described embodiments may be implemented by hardware associated with a program that may be stored in a computer-readable storage medium, including Read-Only Memory (ROM), Random Access Memory (RAM), Programmable Read-Only Memory (PROM), Erasable Programmable Read-Only Memory (EPROM), One-Time Programmable Read-Only Memory (OTPROM), Electrically Erasable Programmable Read-Only Memory (EEPROM), Compact Disc Read-Only Memory (CD-ROM) or other optical disc memory, magnetic disk memory, tape memory, or any other medium capable of being used to carry or store data.
The farmland remote sensing image segmentation method based on semi-supervised interactive learning disclosed by the embodiment of the invention is described in detail, and specific examples are applied to explain the principle and the implementation mode of the invention, and the description of the above examples is only used for helping to understand the method and the core idea of the invention; meanwhile, as those skilled in the art will have variations in the specific embodiments and application scope in accordance with the ideas of the present invention, the present description should not be construed as limiting the present invention in view of the above.

Claims (3)

1.一种基于半监督交互学习的农田遥感图像分割方法,其特征在于,所述的农田遥感图像分割方法包括如下步骤:1. A farmland remote sensing image segmentation method based on semi-supervised interactive learning, characterized in that the farmland remote sensing image segmentation method comprises the following steps: 步骤S10:划分有标签的M个输入图像xL={x1,x2,...,xM}和无标签的N个图像xU={x1,x2,...,xN};Step S10: Divide into M labeled input images x L ={x 1 , x 2 , ..., x M } and N unlabeled images x U ={x 1 , x 2 , ..., x N }; 步骤S20:使用有标签的图像数据分别训练CNN和Transformer;Step S20: Use the labeled image data to train CNN and Transformer respectively; 步骤S30:对无标签的图像进行高斯滤波和亮度调整的弱增强处理,在同一图像中随机裁剪出具有重叠区域的两张新图像xU1={x11,x21,...,xN1},xU2={x12,x22,...,xN2},同时把无标签图像的像素投影到CNN的编码器和解码器之间,引入方向性对比损失函数,保证图片中具有相同身份特征在不同场景下的一致性,利用Transformer预测结果作为伪标签计算上下文感知一致性损失;利用CNN预测结果作为Transformer预测结果的伪标签计算一致性正则化损失;Step S30: Perform weak enhancement processing of Gaussian filtering and brightness adjustment on the unlabeled image, randomly crop two new images x U1 = {x 11 , x 21 , ..., x N1 }, x U2 = {x 12 , x 22 , ..., x N2 } with overlapping areas from the same image, and project the pixels of the unlabeled image between the encoder and decoder of CNN, introduce a directional contrast loss function to ensure the consistency of the same identity features in different scenes in the picture, use the Transformer prediction result as the pseudo label to calculate the context-aware consistency loss; use the CNN prediction result as the pseudo label of the Transformer prediction result to calculate the consistency regularization loss; 步骤S40:由训练好的CNN模型作为主干网络,对测试集图像进行分割,并评估结果的准确性;Step S40: Using the trained CNN model as the backbone network, the test set images are segmented and the accuracy of the results is evaluated; 对无标签的图像进行高斯滤波和亮度调整的弱增强处理,在同一图像中随机裁剪出具有重叠区域的两张新图像的步骤包括:The steps of performing weak enhancement processing of Gaussian filtering and brightness adjustment on the unlabeled image and randomly cropping two new images with overlapping areas from the same image include: 步骤S31:应用高斯滤波来减少无标签图像中的噪声和细节,对每个像素周围的邻域进行加权平均;对于每个像素(x,y),使用大小为k的高斯核进行滤波,滤波后的像素值为:Step S31: Apply Gaussian filtering to reduce noise and details in the unlabeled image, and perform weighted averaging on the neighborhood around each pixel; for each pixel (x, y), use a Gaussian kernel of size k to filter, and the pixel value after filtering is: I(x,y)=∑(G(x′,y′)*I(x′,y′)) (1)I(x, y) = ∑(G(x′, y′) * I(x′, y′)) (1) 其中,I(x′,y′)是邻域内的像素值,G(x′,y′)是高斯核的权重值;Among them, I(x′, y′) is the pixel value in the neighborhood, and G(x′, y′) is the weight value of the Gaussian kernel; 最后调整图像亮度;Finally, adjust the image brightness; 步骤S32:对于给定的弱增强后的无标签图像,随机选择一个裁剪窗口的大小和位置;Step S32: for a given weakly enhanced unlabeled image, randomly select a size and position of a cropping window; 将该裁剪窗口分别向上和向左移动一定的距离,得到两个具有重叠区域的新图像xu1、xu2来训练模型;Move the cropping window upward and leftward by a certain distance to obtain two new images x u1 and x u2 with overlapping areas to train the model; 步骤S33:使用双三次插值算法将所有图像χ都缩放成尺寸为513px*513px的图像,并使用双线插值算法将对应标签y缩放成相同尺寸,使得输入图像符合DeepLab v3+网络的输入规格;Step S33: Use the bicubic interpolation algorithm to scale all images x to images of size 513px*513px, and use the bilinear interpolation algorithm to scale the corresponding labels y to the same size, so that the input image meets the input specifications of the DeepLab v3+ network; 所述有标签图像数据的训练过程包括:The training process of the labeled image data includes: 使用有标签图像编码矩阵xL={χ1,χ2,...,χM}及标签yL={y1,y2,...,yM}分别训练CNN和Transformer两大骨干网络模型,并计算与真实标签的损失函数 Use the labeled image encoding matrix x L = {χ 1 , χ 2 , 
..., χ M } and the label y L = {y 1 , y 2 , ..., y M } to train the CNN and Transformer backbone network models respectively, and calculate the loss function with the real label 计算与真实标签的损失函数的步骤包括:Calculate the loss function with the true label The steps include: 将有标签的数据χL输入CNN得到每个像素点对应的预测概率输入Transformer得到预测概率/>计算与对应真实值之间的损失函数/> Input the labeled data χ L into CNN to obtain the predicted probability corresponding to each pixel Input Transformer to get predicted probability/> Calculate the loss function between the corresponding true value/> 所述计算与对应真实值之间的损失函数的过程如下:The loss function between the calculation and the corresponding true value The process is as follows: 与真实标签yl做对比,CNN部分的损失函数如公式(2)所示:Compared with the true label y l , the loss function of the CNN part As shown in formula (2): Transformer部分的损失函数如公式(3)所示:Loss function of Transformer part As shown in formula (3): 其中σ代表ReLU激活函数,lFL代表Focal Loss,表达式如公式(4)所示:Where σ represents the ReLU activation function, l FL represents the Focal Loss, and the expression is shown in formula (4): 其中αl和γ为超参数,这里设置为αl=0.25,γ=2;Where α l and γ are hyper parameters, which are set as α l = 0.25 and γ = 2 here; 监督学习模型总的损失计算如公式(5)所示:The total loss calculation of the supervised learning model is shown in formula (5): 其中当l为真实标签时,yl=1,反之yl=0是范围在0到1之间实数,表示图像属于标签中所标注类别的概率;When l is the true label, y l = 1 , otherwise y l = 0 ; It is a real number between 0 and 1, indicating the probability that the image belongs to the category marked in the label; 所述无标签图像数据的训练过程包括:The training process of the unlabeled image data includes: 通过输入随机裁剪的两组弱增强无标签图像,经过CNN网络框架得到预测结果并作为Transformer模型预测的伪标签,计算一致性正则化损失 By inputting two sets of randomly cropped weakly enhanced unlabeled images, the prediction results are obtained through the CNN network framework and used as pseudo labels predicted by the Transformer model to calculate the consistency regularization loss. 经过Transformer网络框架得到预测结果并作为中间投影的伪标签,来计算上下文感知一致性损失 The prediction results are obtained through the Transformer network framework and used as pseudo labels for intermediate projections to calculate the context-aware consistency loss. 
在无标签图像训练过程中,对于输入的两组经过弱增强后的无标签图像xu1,xu2,经过CNN模型框架生成两组预测值同理,经过Transformer模型框架生成两组预测值 In the unlabeled image training process, for the two sets of weakly enhanced unlabeled images x u1 and x u2 , the CNN model framework generates two sets of prediction values Similarly, two sets of prediction values are generated through the Transformer model framework 其中,表示CNN网络模型,/>表示Transformer网络模型;in, Represents the CNN network model, /> Represents the Transformer network model; 伪标签的计算方法如公式(7)所示:Pseudo Labels The calculation method of is shown in formula (7): 其中,argmax(p)表示使得预测概率值p达到最大时对应的标签;Among them, argmax(p) represents the label corresponding to the maximum predicted probability value p; CNN的预测结果作为Transformer的伪标签,一致性正则化损失的计算方法如公式(8)所示:CNN prediction results as Transformer pseudo labels, consistency regularization loss The calculation method of is shown in formula (8): 其中,σ代表ReLU激活函数,ldice代表Dice损失函数;Among them, σ represents the ReLU activation function, l dice represents the Dice loss function; xu1,xu2通过DeepLab v3+的编码器Encoder得到特征图Mu1和Mu2,然后被非线性投影仪投影为Mo1和Mo2,使用方向性对比损失函数,鼓励重叠区域xo在不同背景下向置信度高的对比特征对齐,最终保持一致;x u1 , x u2 are passed through the DeepLab v3+ encoder to obtain feature maps M u1 and M u2 , and then are projected by the nonlinear projector Projected into Mo1 and Mo2 , using directional contrast loss function, the overlapping region xo is encouraged to align to the contrast features with high confidence in different backgrounds and finally remain consistent; 对于第i个无标签图像,方向性对比损失的计算公式如下:For the i-th unlabeled image, the directional contrast loss The calculation formula is as follows: 其中,N表示重叠的特征区域的空间位置数;r计算特征相似度;h,w表示二维空间位置;Mu表示图像负样本集,m表示Mu中的负样本,C(*)表示分类器;Where N represents the number of spatial positions of overlapping feature regions; r calculates feature similarity; h, w represent two-dimensional spatial positions; Mu represents the image negative sample set, m represents the negative sample in Mu , and C(*) represents the classifier; 用Transformer的预测结果作为伪标签,计算上下文感知约束后的一致性损失函数公式如下:Use the Transformer prediction results as pseudo labels and calculate the consistency loss function after context-aware constraints The formula is as follows: 其中,ldice代表Dice损失函数;C(*)表示分类器;Where, l dice represents the Dice loss function; C(*) represents the classifier; 还包括最小化整体损失,在训练过程中,初始化分类网络的参数,然后使用训练数据进行前向传播和反向传播,计算损失函数的梯度,并利用梯度下降优化算法来更新网络参数,直到总损失函数达到预设的收敛条件;It also includes minimizing the overall loss. During the training process, the parameters of the classification network are initialized, and then the training data is used for forward propagation and back propagation, the gradient of the loss function is calculated, and the gradient descent optimization algorithm is used to update the network parameters until the total loss function Reach the preset convergence condition; 所述总损失为监督学习模型总损失/>一致性正则化损失/>方向性对比损失/>上下文感知一致性损失/>的线性组合,其中,所述总损失/>的计算公式如下:The total loss is the total loss of the supervised learning model/> Consistency Regularization Loss/> Directional contrast loss/> Context-aware consistency loss/> A linear combination of , where the total loss/> The calculation formula is as follows: 其中,λ和λw是权重因子,目的是控制方向性对比损失和一致性正则化损失/>在总损失函数/>中的占比。Among them, λ and λ w are weight factors, the purpose is to control the directional contrast loss and consistency regularization loss/> In the total loss function/> The proportion of . 2.根据权利要求1所述的基于半监督交互学习的农田遥感图像分割方法,其特征在于,所述步骤S40具体包括模型测试,对于给定测试集农田遥感图像,以CNN作为主干网络模型提取特征,分割输出目标图像每个像素点所属类别概率,并设定阈值将该像素点标记为分割目标或背景。2. 
2. The farmland remote sensing image segmentation method based on semi-supervised interactive learning according to claim 1, characterized in that step S40 specifically includes model testing: for a given test set of farmland remote sensing images, the CNN is used as the backbone network model to extract features, the segmentation outputs the probability of the class to which each pixel of the target image belongs, and a threshold is set to mark each pixel as segmentation target or background.

3. The farmland remote sensing image segmentation method based on semi-supervised interactive learning according to claim 2, characterized in that the step of setting a threshold to mark a pixel as segmentation target or background includes: the segmentation model computes the maximum grey value Pmax and the minimum grey value Pmin of the image and sets the initial threshold T0 = (Pmax + Pmin)/2; according to T(k), k = 0, 1, 2, ..., the image is split into foreground and background and their average grey values H1 and H2 are computed, giving the new threshold T(k+1) = (H1 + H2)/2; this is iterated until T(k) = T(k+1), at which point the result is the threshold. Pixels whose predicted probability is greater than the threshold are marked as foreground and those below it as background, yielding the final segmentation mask (a sketch of this iteration is given below).
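Claim 3 describes a classic iterative mean threshold: start from the midpoint of the extreme values, split the pixels into foreground and background, and reset the threshold to the mean of the two class averages until it stops changing. A minimal NumPy sketch under that reading follows; applying the iteration directly to the predicted probability map, rather than to raw grey values, is one interpretation of how the claim combines the grey-value iteration with the probability comparison.

```python
import numpy as np

def iterative_threshold(values, tol=1e-6, max_iter=100):
    """Claim 3 style threshold selection: T0 = (Pmax + Pmin) / 2, then repeatedly
    set the threshold to the mean of the foreground and background averages."""
    t = (values.max() + values.min()) / 2.0
    for _ in range(max_iter):
        fg, bg = values[values > t], values[values <= t]
        if fg.size == 0 or bg.size == 0:   # degenerate split, keep current threshold
            break
        t_new = (fg.mean() + bg.mean()) / 2.0
        if abs(t_new - t) < tol:           # T(k) == T(k+1): converged
            return t_new
        t = t_new
    return t

def segmentation_mask(prob_map):
    """Mark pixels whose predicted probability exceeds the converged threshold
    as foreground (farmland) and the rest as background."""
    t = iterative_threshold(prob_map)
    return (prob_map > t).astype(np.uint8)
```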
CN202311334268.4A 2023-10-16 2023-10-16 A method for farmland remote sensing image segmentation based on semi-supervised interactive learning Active CN117253044B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311334268.4A CN117253044B (en) 2023-10-16 2023-10-16 A method for farmland remote sensing image segmentation based on semi-supervised interactive learning

Publications (2)

Publication Number Publication Date
CN117253044A (en) 2023-12-19
CN117253044B (en) 2024-05-24

Family

ID=89134963

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311334268.4A Active CN117253044B (en) 2023-10-16 2023-10-16 A method for farmland remote sensing image segmentation based on semi-supervised interactive learning

Country Status (1)

Country Link
CN (1) CN117253044B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117437426B (en) * 2023-12-21 2024-09-10 苏州元瞰科技有限公司 Semi-supervised semantic segmentation method for high-density representative prototype guidance
CN118155284A (en) * 2024-03-20 2024-06-07 飞虎互动科技(北京)有限公司 Signature action detection method, signature action detection device, electronic equipment and readable storage medium

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111507343A (en) * 2019-01-30 2020-08-07 广州市百果园信息技术有限公司 Training of semantic segmentation network and image processing method and device thereof
CN113469283A (en) * 2021-07-23 2021-10-01 山东力聚机器人科技股份有限公司 Image classification method, and training method and device of image classification model
CN114943831A (en) * 2022-07-25 2022-08-26 安徽农业大学 Knowledge distillation-based mobile terminal pest target detection method and mobile terminal equipment
WO2023024920A1 (en) * 2021-08-24 2023-03-02 华为云计算技术有限公司 Model training method and system, cluster, and medium
CN116051574A (en) * 2022-12-28 2023-05-02 河南大学 A semi-supervised segmentation model construction and image analysis method, device and system
CN116258730A (en) * 2023-05-16 2023-06-13 先进计算与关键软件(信创)海河实验室 A Semi-supervised Medical Image Segmentation Method Based on Consistency Loss Function
CN116258695A (en) * 2023-02-03 2023-06-13 浙江大学 Semi-supervised medical image segmentation method based on interaction of Transformer and CNN
CN116402838A (en) * 2023-06-08 2023-07-07 吉林大学 Semi-supervised image segmentation method and system for intracranial hemorrhage

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
S-RPN: Sampling-balanced region proposal network for small crop pest detection; Rujing Wang et al.; Computers and Electronics in Agriculture; 2021-08-31; Vol. 187; pp. 1-11 *
Semantic segmentation of complex scenes fusing ASPP-Attention and context; Yang Xin; Yu Chongchong; Wang Xin; Chen Xiuxin; Computer Simulation; 2020-09-15 (No. 09); pp. 209-213 *

Also Published As

Publication number Publication date
CN117253044A (en) 2023-12-19

Similar Documents

Publication Publication Date Title
US11176381B2 (en) Video object segmentation by reference-guided mask propagation
CN117253044B (en) A method for farmland remote sensing image segmentation based on semi-supervised interactive learning
WO2020238560A1 (en) Video target tracking method and apparatus, computer device and storage medium
US10019652B2 (en) Generating a virtual world to assess real-world video analysis performance
CN112634296B (en) RGB-D image semantic segmentation method and terminal for gate mechanism guided edge information distillation
CN113344932B (en) A Semi-Supervised Single-Object Video Segmentation Method
CN112395951B (en) Complex scene-oriented domain-adaptive traffic target detection and identification method
CN110163188B (en) Video processing and method, device and equipment for embedding target object in video
KR102321998B1 (en) Method and system for estimating position and direction of image
CN116453121B (en) Training method and device for lane line recognition model
CN107918776A (en) A kind of plan for land method, system and electronic equipment based on machine vision
CN114787828B (en) Inference or training of artificial intelligence neural networks using imagers with intentionally controlled distortion
CN113744280B (en) Image processing method, device, equipment and medium
CN115862119A (en) Human face age estimation method and device based on attention mechanism
KR20240159462A (en) Method for determining pose of target object in query image and electronic device performing same method
CN118781502A (en) A vehicle detection data enhancement method for UAV remote sensing images
CN117693768A (en) Optimization methods and devices for semantic segmentation models
CN116935332A (en) Fishing boat target detection and tracking method based on dynamic video
CN115577768A (en) Semi-supervised model training method and device
CN118096800B (en) Training method, device, equipment and medium for small sample semantic segmentation model
CN117911900A (en) A method and system for detecting obstacles and targets of substation inspection drones
CN111144422A (en) Positioning identification method and system for aircraft component
CN119068080A (en) Method, electronic device and computer program product for generating an image
CN116309618A (en) Artificial intelligence-based hip joint image segmentation method and system
CN116310304A (en) A water area image segmentation method and its segmentation model training method and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant