CN114022777A - Sample manufacturing method and device for ground feature elements of remote sensing images - Google Patents


Info

Publication number
CN114022777A
CN114022777A
Authority
CN
China
Prior art keywords
data
training
sample
image
positive
Prior art date
Legal status
Pending
Application number
CN202111222961.3A
Other languages
Chinese (zh)
Inventor
李敏
尤江彬
隋正伟
李俊杰
苏文博
胡国庆
杨易鑫
Current Assignee
China Center for Resource Satellite Data and Applications CRESDA
Original Assignee
China Center for Resource Satellite Data and Applications CRESDA
Priority date
Filing date
Publication date
Application filed by China Center for Resource Satellite Data and Applications CRESDA filed Critical China Center for Resource Satellite Data and Applications CRESDA
Priority to CN202111222961.3A
Publication of CN114022777A
Legal status: Pending

Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 — Pattern recognition
    • G06F 18/20 — Analysing
    • G06F 18/24 — Classification techniques
    • G06F 18/241 — Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06N — COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 — Computing arrangements based on biological models
    • G06N 3/02 — Neural networks
    • G06N 3/04 — Architecture, e.g. interconnection topology
    • G06N 3/045 — Combinations of networks
    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06N — COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 — Computing arrangements based on biological models
    • G06N 3/02 — Neural networks
    • G06N 3/08 — Learning methods
    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06N — COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 5/00 — Computing arrangements using knowledge-based models
    • G06N 5/04 — Inference or reasoning models

Abstract

The invention discloses a method and a device for making labelled samples of ground-feature elements in remote sensing images. The method comprises the following steps: pre-training a backbone network on a large open-source data set to obtain the model's pre-training weights; combining the positive/negative click encoding data with the sample image and inputting the combined data into the backbone network to extract a feature image; training the segmentation sub-network on the feature image and the sample image's labelling information to complete one round of model training; processing the training data set with the trained model and outputting a predicted mask; automatically generating simulated differential positive/negative click encoding data from the real sample labelling mask and the predicted mask, and updating the encodings; iteratively training the model N times to obtain the target model; and inputting the image data to be labelled into the model, combined with interactive click encoding, to obtain high-precision labelled sample data. The method can reduce the cost of producing deep-learning sample data and alleviates the shortage of labelled data for remote sensing image semantic segmentation.

Description

Sample manufacturing method and device for ground feature elements of remote sensing images
Technical Field
The invention relates to the technical field of remote sensing image processing, in particular to a method and a device for manufacturing a sample of a ground feature element of a remote sensing image.
Background
In recent years, with the rapid development of earth observation technology, the data of multi-source high-resolution satellite remote sensing images is greatly increased, abundant data resources are provided for realizing rapid and effective ground feature identification, and meanwhile, higher requirements are provided for image processing capacity.
Because remote sensing images have special spectral and spatial characteristics, general-purpose image segmentation algorithms struggle to achieve ideal results on remote sensing data. As one of the three basic tasks of computer vision, semantic segmentation classifies and localizes targets at the pixel level, and has been widely applied in fields such as medical image analysis, autonomous driving, and robotics. As models have improved, deep-learning semantic segmentation, which leverages the automatic learning and feature-extraction capabilities of convolutional neural networks, has shown increasingly clear advantages in remote sensing and plays an important role in the intelligent extraction of surface-feature elements such as buildings, roads, and water bodies.
The combination of semantic segmentation and remote sensing tasks offers a new opportunity for fast, accurate acquisition of surface-element information. However, large-sample, data-driven deep learning models have enormous parameter scales, and everything from model construction to parameter optimization is closely tied to the data, so such models depend far more strongly on sample quality and quantity than traditional methods do. Although there are many ways to acquire data in the remote sensing field, data processing and label annotation still consume large amounts of manpower and material resources. How to rapidly produce samples at scale is one of the urgent problems in remote sensing target extraction.
At present, sample preparation that assists deep learning in an interactive manner is receiving growing attention and has achieved preliminary success in target segmentation of natural images. However, remote sensing images differ substantially from natural images: on one hand, their acquisition is heavily affected by factors such as illumination and weather; on the other hand, they contain more target types against more complex backgrounds. Models trained on natural-image data sets are therefore difficult to apply directly to remote sensing images. Effectively solving the problem of producing labelled sample data for high-resolution remote sensing images is a crucial factor in advancing deep-learning semantic segmentation in the remote sensing field.
Disclosure of Invention
The technical problem solved by the invention is to overcome the defects of the prior art and provide a method for making samples of ground-feature elements of remote sensing images.
The technical solution of the invention is as follows:
in a first aspect, an embodiment of the present invention provides a method for making a sample of a ground feature element of a remote sensing image, including:
preprocessing the obtained high-resolution remote sensing image to obtain a training data set;
pre-training a backbone network of the model based on a large open source data set to obtain pre-training weights of the model;
combining the marking information of the training sample image, randomly generating positive and negative click data and coding the positive and negative click data to obtain positive and negative click coded data;
combining the positive and negative click encoding data with sample images in a training data set, and inputting the pre-trained backbone network to extract a characteristic image;
inputting the feature image and the sample image labeling information in the training data set into a segmentation sub-network of a model for training to obtain a model finished by one-time training;
processing the training data set based on the trained model, and outputting a predicted mask;
automatically generating simulated differential positive and negative click encoding data according to the real sample marking information of the training data set and the predicted mask;
updating original positive and negative click encoding data based on the differential positive and negative click encoding data, and iteratively training the model which is trained once for N times to obtain a trained target model; n is a positive integer;
and processing the image data to be labeled based on the target model, and combining interactive click coding to obtain high-precision labeling sample data.
Optionally, the preprocessing the acquired high-resolution remote sensing image to obtain a training data set includes:
acquiring a high-resolution remote sensing image;
preprocessing the remote sensing image to generate an image slice with RGB three bands;
and manufacturing a real sample labeling mask according to the RGB three-band image slice to obtain the training data set.
Optionally, the making a real sample labeling mask according to the RGB three-band image slice to obtain the training data set includes:
and drawing a vector boundary of each ground object target in the RGB three-band image slice, carrying out class labeling and grid binarization processing on the vector boundary, and taking the obtained mask grid file as a training labeling sample.
Optionally, the randomly generating positive and negative click data and encoding the positive and negative click data to obtain positive and negative click encoded data includes:
generating positive and negative click data based on a random sampling mode;
and coding the positive and negative click data based on a coding mode of a space form to generate the positive and negative click coded data.
Optionally, the processing the image data to be labeled based on the target model, and obtaining high-precision labeling sample data by combining with an interactive click code includes:
inputting a preset number of image data to be marked into the target model;
processing the image data to be marked based on the target model to obtain sample image data with mask marks;
and based on an interactive click mode, carrying out positive and negative click coding on the sample image with the mask mark, and automatically carrying out regional mask correction to obtain a high-precision sample mark data set.
In a second aspect, an embodiment of the present invention provides an apparatus for preparing a sample of a feature element of a remote sensing image, including:
the training data set acquisition module is used for preprocessing the acquired high-resolution remote sensing image to obtain a training data set;
the pre-training model weight acquisition module is used for pre-training the backbone network of the model based on the large open-source data set to obtain the pre-training weight of the model;
the positive and negative coded data acquisition module is used for randomly generating positive and negative click data and coding the positive and negative click data by combining the marking information of the training sample image to obtain positive and negative click coded data;
the characteristic image extraction module is used for combining the positive and negative click encoding data with a sample image in a training data set and inputting the combined sample image into a backbone network after pre-training to extract a characteristic image;
the single training model acquisition module is used for inputting the segmentation sub-network of the model to train based on the characteristic image and the sample image labeling information in the training data set to obtain a model finished by one-time training;
a predicted mask output module, configured to process the training data set based on the trained model, and output a predicted mask;
the differential coding data generation module is used for automatically generating simulated differential positive and negative click coding data according to the real sample marking information of the training data set and the predicted mask;
the target model obtaining module is used for updating original positive and negative click encoding data based on the differential positive and negative click encoding data, and iteratively training the model which is trained once for N times to obtain a trained target model; n is a positive integer;
and the sample labeling data acquisition module is used for processing the image data to be labeled based on the target model and obtaining high-precision labeling sample data by combining interactive click codes.
Optionally, the training data set obtaining module includes:
the remote sensing image acquisition unit is used for acquiring a high-resolution remote sensing image;
the image slice generating unit is used for preprocessing the remote sensing image and generating an image slice with RGB three wave bands;
and the training data set acquisition unit is used for manufacturing a real sample marking mask according to the RGB three-band image slice to obtain the training data set.
Optionally, the training data set obtaining unit includes:
and the training labeling sample acquisition subunit is used for drawing the vector boundary of each ground feature target in the RGB three-band image slice, performing class labeling and grid binarization processing on the vector boundary, and taking the obtained mask grid file as a training labeling sample.
Optionally, the positive and negative coded data obtaining module includes:
the positive and negative click data generation unit is used for generating positive and negative click data based on a random sampling mode;
and the positive and negative click coding data generating unit is used for coding the positive and negative click data based on a coding mode of a space form to generate the positive and negative click coding data.
Optionally, the model training sample obtaining module includes:
the sample image data input unit is used for inputting a preset number of sample image data to be labeled into the target model;
the marked image data primary acquisition unit is used for processing the sample image data to be marked based on the target model to obtain sample image data with mask marks;
and the sample marking information optimizing unit is used for carrying out positive and negative click coding on the sample image with the mask mark based on an interactive click mode, and automatically carrying out regional mask correction to obtain a high-precision sample marking data set.
Compared with the prior art, the invention has the advantages that:
the sample preparation scheme of the remote sensing image surface feature element provided by the embodiment of the invention can reduce the cost for preparing deep learning data, effectively solves the problem of lacking of semantic segmentation sample label data of the remote sensing image, can be suitable for various surface feature types, and has good adaptability to domestic data.
Drawings
Fig. 1 is a flowchart illustrating steps of a method for manufacturing a sample of a ground feature element of a remote sensing image according to an embodiment of the present invention;
fig. 2 is a flowchart of a sample manufacturing method for a remote sensing image surface feature element based on an iterative training and interactive click segmentation model according to an embodiment of the present invention;
fig. 3 is a schematic structural diagram of a sample creation apparatus for remote sensing image surface feature elements according to an embodiment of the present invention.
Detailed Description
Example one
Referring to fig. 1, a flowchart illustrating steps of a sample manufacturing method for a remote sensing image surface feature element according to an embodiment of the present invention is shown, and as shown in fig. 1, the sample manufacturing method may include the following steps:
step 101: and preprocessing the obtained high-resolution remote sensing image to obtain a training data set.
The sample production flow may be understood in conjunction with fig. 2. In the embodiment of the present invention, when preparing a batch of model training samples, a high-resolution remote sensing image may first be obtained, and a training data set is made from the preprocessed image slices, as described in detail in the following specific implementation.
In a specific implementation manner of the embodiment of the present invention, the step 101 may include:
substep A1: acquiring a high-resolution remote sensing image;
substep A2: preprocessing the remote sensing image to generate an image slice with RGB three bands;
substep A3: and manufacturing a real sample labeling mask according to the RGB three-band image slice to obtain the training data set.
In the embodiment of the present invention, a high-resolution remote sensing image is one whose resolution exceeds a set resolution threshold; the specific threshold value may be determined by business requirements and is not limited by this embodiment. The high-resolution image may come from optical satellite remote sensing data with visible-light/near-infrared sensors, such as the United States' WorldView-3 or China's Gaofen-2, or from aerial remote sensing data acquired by an ordinary optical camera, such as unmanned aerial vehicle imagery. These data then undergo preprocessing operations, including geometric correction, fusion of panchromatic and multispectral images, and band selection, to meet the spatial and spectral requirements of the model. In addition, to improve training efficiency, regions of interest containing typical targets are selected in the image and cut to a uniform size and format, finally yielding a series of RGB three-band image slices that can be input directly into the model. Samples for deep-learning semantic segmentation are then made with ArcGIS or other tools: the vector boundary of each ground-feature target is drawn and labelled with its class. The generated vector label data is rasterized and binarized, and the resulting raster mask file serves as the training labelling sample.
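The slicing and label-binarization steps above can be sketched in a few lines of numpy. This is a minimal illustration, not the patent's implementation: the tile size, the choice to drop edge remainders, and the function names are assumptions.

```python
import numpy as np

def make_tiles(image, tile=256):
    """Cut an (H, W, 3) RGB array into uniform square tiles.
    Edge remainders are simply dropped here, one common choice."""
    h, w = image.shape[:2]
    return [image[y:y + tile, x:x + tile]
            for y in range(0, h - tile + 1, tile)
            for x in range(0, w - tile + 1, tile)]

def binarize_labels(label_raster, target_class):
    """Turn a rasterized class-label grid into the 0/1 training mask
    (1 = the feature class of interest, 0 = background)."""
    return (label_raster == target_class).astype(np.uint8)
```

In practice the rasterization itself would be done with GIS tooling (e.g. ArcGIS or GDAL); the sketch only shows the binarization of an already-rasterized label grid.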
After the training data set is obtained, step 102 is performed.
Step 102: and pre-training the backbone network of the model based on the large open source data set to obtain the pre-training weight of the model.
In this embodiment, the backbone network of the model may be pre-trained on a large open-source data set to obtain the model's pre-training weights. Specifically, the neural network structure of the invention is broadly similar to a semantic segmentation network and comprises a backbone network, a segmentation sub-network, a detection head, and so on. The backbone network is a deep convolutional neural network consisting mainly of convolutional, pooling, and fully connected layers. The number of output channels of a convolutional layer equals the number of convolution kernels, so convolution can both extract feature information from the image and transform the channel dimensionality of the data, for example converting 3-band RGB image data into 64 channels. Common semantic segmentation backbones include the ResNet and EfficientNet families. Before training, the backbone is usually pre-trained to speed up the training process. A common pre-training data set in natural-image target detection is ImageNet; the invention instead adopts the open-source UC Merced Land-Use dataset, which contains remote sensing images of 21 scene classes and can be used to pre-train the backbone network.
Step 103: and combining the marking information of the training sample image, randomly generating positive and negative click data and coding the positive and negative click data to obtain positive and negative click coded data.
In the primary training, the method disclosed by the invention automatically generates a certain number of positive and negative clicks by combining the sample marking mask information of the image and adopting a random sampling mode. The positive click and the negative click are represented in the image in the form of coordinate points, and certain conditions need to be met:
a. The N_pos positive points are generated within the mask region of the target object and satisfy: 1) the distance between any two positive points is greater than a preset threshold d_step; and 2) the distance between any positive point and the mask boundary is greater than a preset threshold d_margin.
b. The N_neg negative points surround the outer boundary of the mask covering the target object, keeping the distance between negative clicks as large as possible.
To feed positive and negative clicks into the convolutional network, they must first be spatially encoded. Encoding clicks through a distance transform is the most common approach; however, adding or moving a point then changes the encoding globally, and large changes in the encoding pattern can confuse the network. The invention therefore represents each positive and negative click as a disc of fixed radius, so that adding or moving a point changes the encoding only locally.
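A minimal numpy sketch of this click-sampling and disc-encoding step follows. It is an illustration under stated simplifications: the d_margin boundary test is approximated with a square neighbourhood, d_step is enforced by rejection sampling, and negative-click sampling around the mask's outer boundary is omitted.

```python
import numpy as np

def encode_clicks(shape, clicks, radius=5):
    """Rasterize clicks as filled discs on one channel; the disc encoding
    keeps the effect of adding or moving a click local, unlike a distance
    transform whose encoding changes globally."""
    yy, xx = np.mgrid[0:shape[0], 0:shape[1]]
    chan = np.zeros(shape, dtype=np.float32)
    for cy, cx in clicks:
        chan[(yy - cy) ** 2 + (xx - cx) ** 2 <= radius ** 2] = 1.0
    return chan

def sample_positive_clicks(mask, n_pos, d_step, d_margin, rng):
    """Randomly sample positive clicks inside the target mask, at least
    d_margin from the mask boundary (square-neighbourhood approximation)
    and more than d_step from each other (rejection sampling)."""
    h, w = mask.shape
    pad = np.pad(mask, d_margin)
    interior = np.ones((h, w), dtype=bool)
    for dy in range(2 * d_margin + 1):          # a pixel is "interior" if its whole
        for dx in range(2 * d_margin + 1):      # neighbourhood lies inside the mask
            interior &= pad[dy:dy + h, dx:dx + w].astype(bool)
    cand = np.argwhere(interior)
    rng.shuffle(cand)
    clicks = []
    for y, x in cand:
        if all((y - cy) ** 2 + (x - cx) ** 2 > d_step ** 2 for cy, cx in clicks):
            clicks.append((int(y), int(x)))
            if len(clicks) == n_pos:
                break
    return clicks
```

The encoded channels would then be stacked with the RGB slice as extra model input.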
Step 104: and combining the positive and negative click encoding data with a sample image in a training data set, and inputting the combined sample image into the backbone network after pre-training to extract a characteristic image.
After the positive and negative click encoding data are obtained, the positive and negative click encoding data and the sample image in the training data set can be combined and input to the backbone network after pre-training so as to extract the characteristic image.
In practical application, the input to a semantic segmentation model is usually an RGB three-band image; to introduce the positive/negative click encoding data, the dimensions of this additional input must be matched to the backbone network.
In the invention, a convolution block is introduced, and the tensor size of the output of the convolution block is completely the same as that of the first convolution block in the backbone network. This output tensor is then summed pixel by pixel with the output data of the sample image in the first convolutional layer of the backbone network. Typically, the output tensor data are 64 channels.
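The fusion step above can be sketched with a naive numpy convolution: a separate convolution maps the click channels to the same tensor size as the backbone's first convolution, and the two outputs are summed pixel by pixel. The channel count is reduced from 64 to 8 purely to keep the demo fast; the weights are random stand-ins.

```python
import numpy as np

def conv_same(x, w):
    """Naive 'same'-padded 2-D convolution; x: (C_in, H, W), w: (C_out, C_in, k, k)."""
    c_out, c_in, k, _ = w.shape
    p = k // 2
    xp = np.pad(x, ((0, 0), (p, p), (p, p)))
    _, h, wd = x.shape
    out = np.zeros((c_out, h, wd), dtype=np.float32)
    for o in range(c_out):
        for i in range(c_in):
            for dy in range(k):
                for dx in range(k):
                    out[o] += w[o, i, dy, dx] * xp[i, dy:dy + h, dx:dx + wd]
    return out

rng = np.random.default_rng(1)
C_OUT = 8                                   # the patent's backbone uses 64 channels
rgb = rng.standard_normal((3, 32, 32))      # RGB image slice
clicks = rng.standard_normal((2, 32, 32))   # positive + negative click channels
w_img = rng.standard_normal((C_OUT, 3, 3, 3)) * 0.1
w_click = rng.standard_normal((C_OUT, 2, 3, 3)) * 0.1

# the extra convolution block produces a tensor the exact size of the
# backbone's first convolution output, so the two can be summed pixel by pixel
fused = conv_same(rgb, w_img) + conv_same(clicks, w_click)
```

In a real implementation both convolutions would be trainable layers of the network (e.g. framework `Conv2d` blocks); the summed tensor then flows through the rest of the backbone unchanged.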
Step 105: and inputting the feature image and the sample image labeling information in the training data set into a segmentation sub-network of the model for training to obtain a model finished by one-time training.
After the feature image is extracted, a segmentation sub-network of the model may be trained based on the feature image and the label information of the sample image in the training data set to obtain a model completed by one-time training, and specifically, a series of hyper-parameters of the model training may be set according to the structure of the segmentation sub-network, including a learning rate, a learning rate update coefficient, an update step length, an iteration number, a batch size, a weight attenuation coefficient, and the like. In addition, Adam or SGD, etc. also need to be selected as the training optimizer. After the acquired feature images are input into the segmentation sub-network, a predicted mask can be generated initially at the head of the network. And then, comparing the real sample labeling mask with the predicted mask, and performing prediction optimization based on the loss function. When the loss function reaches the convergence state, the model training process is completed.
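The comparison of the real labelling mask against the predicted mask is driven by a loss function. The patent does not name a specific loss, so the binary cross-entropy shown here is only an assumed common choice for binary segmentation masks.

```python
import numpy as np

def bce_loss(pred, target, eps=1e-7):
    """Mean per-pixel binary cross-entropy between predicted mask
    probabilities and the real sample labelling mask. (An assumed loss:
    the patent only says prediction is optimized against a loss function
    until convergence.)"""
    p = np.clip(pred, eps, 1.0 - eps)
    return float(-np.mean(target * np.log(p) + (1 - target) * np.log(1 - p)))
```

Training would then repeat forward pass, loss evaluation, and optimizer step (Adam or SGD, per the text) until the loss converges.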
Step 106: and processing the training data set based on the trained model, and outputting a predicted mask.
After the model which is trained once is obtained, the training data set can be processed based on the model to output the predicted mask, specifically, the image in the training data set can be used as the data to be inferred, the model which is trained in the previous step is input, and the predicted mask and the category confidence corresponding to the image can be output.
Step 107: and automatically generating simulated differential positive and negative click coding data according to the real sample marking information of the training data set and the predicted mask.
To evaluate and analyze the predicted mask, it is compared with the real sample labelling information of the training data set. The preliminary predicted mask inevitably contains missed detections and false detections, so positive/negative click encoding data can be generated by iterative sampling, in the following way:
a. and comparing the predicted mask with a mask marked by the real sample to obtain an error part (including missing detection and false detection) in the prediction result.
b. The erroneous part is divided into different clusters using connected components in units of pixels.
c. A morphological erosion operation is applied to each cluster to obtain a new difference region. The largest cluster is determined by its number of difference pixels, and a click is generated within that cluster's new difference region. If a click was previously generated in this cluster, then, to keep multiple clicks from lying too close together, the sampled point must maximize its Euclidean distance both to the eroded cluster's boundary and to the previous click; if there is no previous click, the center point of the cluster's eroded difference region is taken as the click position.
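Steps a-c can be sketched with plain numpy and a BFS connected-components pass. Two simplifications relative to the patent are flagged in the code: the erosion step is skipped, and the cluster centroid stands in for the maximum-distance point.

```python
import numpy as np
from collections import deque

def connected_components(err):
    """Label 4-connected clusters of a boolean error map (missed + false pixels)."""
    h, w = err.shape
    labels = np.zeros((h, w), dtype=np.int32)
    n = 0
    for sy in range(h):
        for sx in range(w):
            if err[sy, sx] and labels[sy, sx] == 0:
                n += 1
                labels[sy, sx] = n
                q = deque([(sy, sx)])
                while q:
                    y, x = q.popleft()
                    for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                        if 0 <= ny < h and 0 <= nx < w and err[ny, nx] and labels[ny, nx] == 0:
                            labels[ny, nx] = n
                            q.append((ny, nx))
    return labels, n

def next_click(pred, truth):
    """Pick the next corrective click from the largest error cluster.
    Simplifications vs. the patent: the morphological erosion is skipped,
    and the cluster centroid (valid for convex clusters) stands in for
    the point of maximum Euclidean distance to boundary/previous clicks."""
    labels, n = connected_components(pred != truth)
    if n == 0:
        return None
    sizes = [(labels == i).sum() for i in range(1, n + 1)]
    big = 1 + int(np.argmax(sizes))
    ys, xs = np.nonzero(labels == big)
    cy, cx = int(round(ys.mean())), int(round(xs.mean()))
    # missed detection -> positive click; false detection -> negative click
    return (cy, cx), bool(truth[cy, cx] == 1)
```

A production version would use a library connected-components routine and a distance transform to pick the true interior-most point.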
Step 108: and updating original positive and negative click encoding data based on the differential positive and negative click encoding data, and iteratively training the model which is trained once for N times to obtain a trained target model, wherein N is a positive integer.
During iterative training, both the output segmentation mask and the click-encoded data can provide additional prior information to improve prediction quality. In the invention, the two channels of positive/negative click encodings are first stacked with the predicted binarized mask data (the mask channel may be empty before any inference has been run). A convolution block is then introduced whose output tensor size exactly matches that of the first convolution block in the backbone network, and the result is summed pixel by pixel and fed into the backbone network. At this stage the backbone no longer needs pre-training and can continue directly from the weights of the previous iteration. The training process is executed in a loop, continuously updating the predicted mask and the positive/negative click encodings, until iterative training reaches the preset number N; the resulting model is then output and saved.
Step 109: and processing the image data to be labeled based on the target model, and combining interactive click coding to obtain high-precision labeling sample data.
According to the sample image preparation method above, the image to be labelled is processed into image slices of suitable size. These are input into the iteratively trained segmentation model, which preliminarily produces mask data with target labels.
To improve sample quality, the mask data can be post-processed with an interactive tool: positive and negative clicks mark errors in the model's automatic prediction, which are then corrected. For example, placing a positive click in a target region the model missed causes the model to automatically relabel the region as target and update the mask, while placing a negative click in a falsely detected region deletes the mask there. Compared with traditional sample production that relies entirely on manual drawing, this approach is fast, accurate, and efficient.
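The interactive fix-up can be illustrated with a deliberately simplified mask update. In the patent the model re-infers the clicked region; this sketch only shows the direct mask edit that a click implies, with disc radius and function name as assumptions.

```python
import numpy as np

def apply_click(mask, click, positive, radius=3):
    """Minimal stand-in for the interactive correction step: a positive
    click paints the surrounding disc into the mask (missed target), a
    negative click erases it (false detection). The real model would
    re-run inference around the click instead of painting directly."""
    h, w = mask.shape
    yy, xx = np.mgrid[0:h, 0:w]
    disc = (yy - click[0]) ** 2 + (xx - click[1]) ** 2 <= radius ** 2
    out = mask.copy()
    out[disc] = 1 if positive else 0
    return out
```

Chaining such corrections (positive clicks on missed regions, negative clicks on false detections) until the annotator is satisfied yields the final high-precision labelled sample.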
Example two
Referring to fig. 3, a schematic structural diagram of a sample preparation apparatus for remote sensing image surface feature elements according to an embodiment of the present invention is shown, and as shown in fig. 3, the sample preparation apparatus may include the following modules:
a training data set acquisition module 310, configured to preprocess the acquired high-resolution remote sensing image to obtain a training data set;
a pre-training model weight obtaining module 320, configured to pre-train a backbone network of a model based on a large open-source data set to obtain a pre-training weight of the model;
the positive and negative coded data acquisition module 330 is configured to randomly generate positive and negative click data and code the positive and negative click data in combination with the labeling information of the training sample image to obtain positive and negative click coded data;
the feature image extraction module 340 is configured to combine the positive and negative click encoding data with a sample image in a training data set, and input the combined sample image to a backbone network after pre-training to extract a feature image;
a single training model obtaining module 350, configured to input the segmentation sub-network of the model based on the feature image and the sample image labeling information in the training data set to perform training, so as to obtain a model completed by one training;
a predicted mask output module 360, configured to process the training data set based on the trained model, and output a predicted mask;
a differential encoding data generating module 370, configured to automatically generate simulated differential positive and negative click encoding data according to the real sample labeling information of the training data set and the predicted mask;
the target model obtaining module 380 is configured to update original positive and negative click encoding data based on the differential positive and negative click encoding data, and iteratively train the model completed by one training for N times to obtain a trained target model; n is a positive integer;
and a sample labeling data acquisition module 390, configured to process image data to be labeled based on the target model, and obtain high-precision labeling sample data by combining interactive click coding.
Optionally, the training data set obtaining module includes:
the remote sensing image acquisition unit is used for acquiring a high-resolution remote sensing image;
the image slice generating unit is used for preprocessing the remote sensing image to generate image slices with three RGB bands;
and the training data set acquisition unit is used for manufacturing a real sample marking mask according to the RGB three-band image slice to obtain the training data set.
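The slicing performed by the image slice generating unit can be illustrated with a short NumPy sketch; the tile size and the choice to drop short edge tiles are illustrative assumptions, not details specified by the patent:

```python
import numpy as np

def slice_into_tiles(image, tile=256, stride=256):
    """Split an H x W x C image array into fixed-size tiles.

    Edge regions smaller than `tile` are skipped here for brevity;
    padding them out to a full tile is another common choice.
    """
    h, w = image.shape[:2]
    tiles = []
    for y in range(0, h - tile + 1, stride):
        for x in range(0, w - tile + 1, stride):
            tiles.append(image[y:y + tile, x:x + tile])
    return tiles

# A 512 x 512 three-band (RGB) image yields four 256 x 256 tiles.
rgb = np.zeros((512, 512, 3), dtype=np.uint8)
tiles = slice_into_tiles(rgb)
```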
Optionally, the training data set obtaining unit includes:
and the training labeling sample acquisition subunit is used for drawing the vector boundary of each ground feature target in the RGB three-band image slice, performing class labeling and raster binarization processing on the vector boundary, and taking the obtained mask raster file as a training labeling sample.
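The vector-to-mask step above (drawing a boundary polygon, then binarizing it into a raster) can be sketched with an even-odd ray-casting fill. Real pipelines typically delegate this to a GIS rasterizer such as GDAL, so this pure-NumPy version is only illustrative:

```python
import numpy as np

def rasterize_polygon(vertices, height, width):
    """Burn a polygon (list of (x, y) vertices) into a binary uint8 mask
    using even-odd ray casting, evaluated at pixel centres."""
    ys, xs = np.mgrid[0:height, 0:width]
    px = xs + 0.5   # pixel-centre coordinates
    py = ys + 0.5
    inside = np.zeros((height, width), dtype=bool)
    n = len(vertices)
    for i in range(n):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % n]
        # A horizontal ray from each pixel centre crosses this edge when
        # the edge straddles the centre's y; toggle membership each time.
        crosses = (y1 <= py) != (y2 <= py)
        with np.errstate(divide="ignore", invalid="ignore"):
            x_at = x1 + (py - y1) * (x2 - x1) / (y2 - y1)
            inside ^= crosses & (px < x_at)
    return inside.astype(np.uint8)

# A square from (2, 2) to (6, 6) on an 8 x 8 grid covers 16 pixels.
mask = rasterize_polygon([(2, 2), (6, 2), (6, 6), (2, 6)], 8, 8)
```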
Optionally, the positive and negative coded data obtaining module includes:
the positive and negative click data generation unit is used for generating positive and negative click data based on a random sampling mode;
and the positive and negative click coding data generating unit is used for encoding the positive and negative click data using a spatial encoding scheme to generate the positive and negative click coded data.
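The spatial encoding mentioned above is commonly realised as per-click binary disk maps stacked as extra input channels alongside the RGB image. The disk radius here is an illustrative assumption; Gaussian or distance-transform encodings are equally common in interactive segmentation:

```python
import numpy as np

def encode_clicks(clicks, height, width, radius=3):
    """Encode a list of (row, col) clicks as a binary disk map:
    every pixel within `radius` of a click is set to 1."""
    ys, xs = np.mgrid[0:height, 0:width]
    chan = np.zeros((height, width), dtype=np.float32)
    for r, c in clicks:
        chan[(ys - r) ** 2 + (xs - c) ** 2 <= radius ** 2] = 1.0
    return chan

# One positive click, no negative clicks; the two maps become the
# extra channels concatenated to the RGB slice before the backbone.
pos = encode_clicks([(5, 5)], 11, 11, radius=2)
neg = encode_clicks([], 11, 11, radius=2)
stacked = np.stack([pos, neg])
```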
Optionally, the model training sample obtaining module includes:
the sample image data input unit is used for inputting a preset number of sample image data to be labeled into the target model;
the marked image data primary acquisition unit is used for processing the sample image data to be marked based on the target model to obtain sample image data with mask marks;
and the sample marking information optimizing unit is used for carrying out positive and negative click coding on the sample image with the mask mark based on an interactive click mode, and automatically carrying out regional mask correction to obtain a high-precision sample marking data set.
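Both the differential click simulation of module 370 and the interactive correction above reduce to picking clicks from the disagreement between the labeled mask and the prediction. A minimal sketch, which takes the first mismatching pixel rather than the usual interior point farthest from the error-region boundary, might look like:

```python
import numpy as np

def next_corrective_click(gt_mask, pred_mask):
    """Pick the next simulated click from the mask disagreement:
    a positive click on a missed pixel (false negative) if any exist,
    otherwise a negative click on a spurious pixel (false positive).
    Returns ("positive"|"negative", (row, col)) or None when the
    prediction already matches the labels."""
    false_neg = (gt_mask == 1) & (pred_mask == 0)
    false_pos = (gt_mask == 0) & (pred_mask == 1)
    if false_neg.any():
        r, c = np.argwhere(false_neg)[0]
        return "positive", (int(r), int(c))
    if false_pos.any():
        r, c = np.argwhere(false_pos)[0]
        return "negative", (int(r), int(c))
    return None

gt = np.zeros((4, 4), dtype=np.uint8)
gt[1:3, 1:3] = 1
pred = np.zeros((4, 4), dtype=np.uint8)  # model missed the object entirely
click = next_corrective_click(gt, pred)
```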
The detailed description set forth herein is intended to give those skilled in the art a more complete understanding of the present application, and is not intended to limit the application in any way. It will therefore be appreciated by those skilled in the art that modifications or equivalent substitutions may still be made to the present application; all technical solutions and modifications thereof that do not depart from the spirit and technical essence of the present application are intended to fall within the scope of protection of this patent application.
Those skilled in the art will appreciate that those matters not described in detail in the present specification are well known in the art.

Claims (10)

1. A method for manufacturing a sample of a ground feature element of a remote sensing image is characterized by comprising the following steps:
preprocessing the obtained high-resolution remote sensing image to obtain a training data set;
pre-training a backbone network of the model based on a large open source data set to obtain pre-training weights of the model;
combining the marking information of the training sample image, randomly generating positive and negative click data and coding the positive and negative click data to obtain positive and negative click coded data;
combining the positive and negative click encoding data with sample images in a training data set, and inputting the pre-trained backbone network to extract a characteristic image;
inputting the feature image and the sample image labeling information in the training data set into a segmentation sub-network of the model for training to obtain a once-trained model;
processing the training data set based on the trained model, and outputting a predicted mask;
automatically generating simulated differential positive and negative click encoding data according to the real sample marking information of the training data set and the predicted mask;
updating the original positive and negative click encoding data based on the differential positive and negative click encoding data, and iteratively training the once-trained model N times to obtain a trained target model; N is a positive integer;
and processing the image data to be labeled based on the target model, and combining interactive click coding to obtain high-precision labeling sample data.
2. The method of claim 1, wherein preprocessing the acquired high resolution remote sensing image to obtain a training data set comprises:
acquiring a high-resolution remote sensing image;
preprocessing the remote sensing image to generate an image slice with RGB three bands;
and manufacturing a real sample labeling mask according to the RGB three-band image slice to obtain the training data set.
3. The method as claimed in claim 2, wherein said manufacturing a real sample labeling mask from said RGB three-band image slices to obtain said training data set comprises:
drawing a vector boundary of each ground object target in the RGB three-band image slice, performing class labeling and raster binarization processing on the vector boundary, and taking the obtained mask raster file as a training labeling sample.
4. The method of claim 1, wherein randomly generating and encoding positive and negative click data to obtain positive and negative click encoded data comprises:
generating positive and negative click data based on a random sampling mode;
and encoding the positive and negative click data using a spatial encoding scheme to generate the positive and negative click encoded data.
5. The method of claim 1, wherein the processing the image data to be labeled based on the target model and combining with an interactive click code to obtain high-precision labeling sample data comprises:
inputting a preset number of image data to be marked into the target model;
processing the image data to be marked based on the target model to obtain sample image data with mask marks;
and based on an interactive click mode, carrying out positive and negative click coding on the sample image with the mask mark, and automatically carrying out regional mask correction to obtain a high-precision sample mark data set.
6. A sample preparation device for remote sensing image surface feature elements is characterized by comprising:
the training data set acquisition module is used for preprocessing the acquired high-resolution remote sensing image to obtain a training data set;
the pre-training model weight acquisition module is used for pre-training the backbone network of the model based on the large open-source data set to obtain the pre-training weight of the model;
the positive and negative coded data acquisition module is used for randomly generating positive and negative click data and coding the positive and negative click data by combining the marking information of the training sample image to obtain positive and negative click coded data;
the characteristic image extraction module is used for combining the positive and negative click encoding data with a sample image in a training data set and inputting the combined sample image into a backbone network after pre-training to extract a characteristic image;
the single training model acquisition module is used for inputting the feature image and the sample image labeling information in the training data set into the segmentation sub-network of the model for training to obtain a once-trained model;
a predicted mask output module, configured to process the training data set based on the trained model, and output a predicted mask;
the differential coding data generation module is used for automatically generating simulated differential positive and negative click coding data according to the real sample marking information of the training data set and the predicted mask;
the target model obtaining module is used for updating the original positive and negative click encoding data based on the differential positive and negative click encoding data, and iteratively training the once-trained model N times to obtain a trained target model; N is a positive integer;
and the sample labeling data acquisition module is used for processing the image data to be labeled based on the target model and obtaining high-precision labeling sample data by combining interactive click codes.
7. The apparatus of claim 6, wherein the training data set acquisition module comprises:
the remote sensing image acquisition unit is used for acquiring a high-resolution remote sensing image;
the image slice generating unit is used for preprocessing the remote sensing image to generate image slices with three RGB bands;
and the training data set acquisition unit is used for manufacturing a real sample marking mask according to the RGB three-band image slice to obtain the training data set.
8. The apparatus of claim 7, wherein the training data set acquisition unit comprises:
and the training labeling sample acquisition subunit is used for drawing the vector boundary of each ground feature target in the RGB three-band image slice, performing class labeling and raster binarization processing on the vector boundary, and taking the obtained mask raster file as a training labeling sample.
9. The apparatus of claim 6, wherein the positive and negative coded data acquisition module comprises:
the positive and negative click data generation unit is used for generating positive and negative click data based on a random sampling mode;
and the positive and negative click coding data generating unit is used for encoding the positive and negative click data using a spatial encoding scheme to generate the positive and negative click coded data.
10. The apparatus of claim 6, wherein the model training sample acquisition module comprises:
the sample image data input unit is used for inputting a preset number of sample image data to be labeled into the target model;
the marked image data primary acquisition unit is used for processing the sample image data to be marked based on the target model to obtain sample image data with mask marks;
and the sample marking information optimizing unit is used for carrying out positive and negative click coding on the sample image with the mask mark based on an interactive click mode, and automatically carrying out regional mask correction to obtain a high-precision sample marking data set.
CN202111222961.3A 2021-10-20 2021-10-20 Sample manufacturing method and device for ground feature elements of remote sensing images Pending CN114022777A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111222961.3A CN114022777A (en) 2021-10-20 2021-10-20 Sample manufacturing method and device for ground feature elements of remote sensing images

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111222961.3A CN114022777A (en) 2021-10-20 2021-10-20 Sample manufacturing method and device for ground feature elements of remote sensing images

Publications (1)

Publication Number Publication Date
CN114022777A true CN114022777A (en) 2022-02-08

Family

ID=80056942

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111222961.3A Pending CN114022777A (en) 2021-10-20 2021-10-20 Sample manufacturing method and device for ground feature elements of remote sensing images

Country Status (1)

Country Link
CN (1) CN114022777A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117333781A (en) * 2023-11-15 2024-01-02 自然资源部国土卫星遥感应用中心 Intelligent extraction method, device, equipment and medium for black soil erosion trench satellite remote sensing
CN117788982A (en) * 2024-02-26 2024-03-29 中国铁路设计集团有限公司 Large-scale deep learning data set manufacturing method based on railway engineering topography result

Similar Documents

Publication Publication Date Title
CN111767801B (en) Remote sensing image water area automatic extraction method and system based on deep learning
CN111986099B (en) Tillage monitoring method and system based on convolutional neural network with residual error correction fused
CN112308860B (en) Earth observation image semantic segmentation method based on self-supervision learning
CN109543630B (en) Remote sensing image woodland extraction method and system based on deep learning, storage medium and electronic equipment
CN110543878A (en) pointer instrument reading identification method based on neural network
CN111626947B (en) Map vectorization sample enhancement method and system based on generation of countermeasure network
CN108428220B (en) Automatic geometric correction method for ocean island reef area of remote sensing image of geostationary orbit satellite sequence
CN112906822B (en) Human activity recognition fusion method and system for ecological protection red line
CN114022777A (en) Sample manufacturing method and device for ground feature elements of remote sensing images
CN112766155A (en) Deep learning-based mariculture area extraction method
CN114187450A (en) Remote sensing image semantic segmentation method based on deep learning
CN108647568B (en) Grassland degradation automatic extraction method based on full convolution neural network
CN111259900A (en) Semantic segmentation method for satellite remote sensing image
CN109961105B (en) High-resolution remote sensing image classification method based on multitask deep learning
CN114283162A (en) Real scene image segmentation method based on contrast self-supervision learning
CN112001293A (en) Remote sensing image ground object classification method combining multi-scale information and coding and decoding network
CN112949407A (en) Remote sensing image building vectorization method based on deep learning and point set optimization
CN111104850B (en) Remote sensing image building automatic extraction method and system based on residual error network
CN113610035A (en) Rice tillering stage weed segmentation and identification method based on improved coding and decoding network
CN113077438B (en) Cell nucleus region extraction method and imaging method for multi-cell nucleus color image
CN113033386B (en) High-resolution remote sensing image-based transmission line channel hidden danger identification method and system
CN113936214A (en) Karst wetland vegetation community classification method based on fusion of aerospace remote sensing images
CN116543165B (en) Remote sensing image fruit tree segmentation method based on dual-channel composite depth network
CN116630610A (en) ROI region extraction method based on semantic segmentation model and conditional random field
CN112861869A (en) Sorghum lodging image segmentation method based on lightweight convolutional neural network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination