CN113177956A - Semantic segmentation method for unmanned aerial vehicle remote sensing image - Google Patents

Semantic segmentation method for unmanned aerial vehicle remote sensing image

Info

Publication number
CN113177956A
CN113177956A (application CN202110508833.9A)
Authority
CN
China
Prior art keywords
image
remote sensing
semantic segmentation
unmanned aerial vehicle
Prior art date 2021-05-11
Legal status
Pending
Application number
CN202110508833.9A
Other languages
Chinese (zh)
Inventor
于扬鸿
车明亮
杨帆
周雨航
Current Assignee
Nantong University
Original Assignee
Nantong University
Priority date 2021-05-11
Filing date 2021-05-11
Publication date 2021-07-27
Application filed by Nantong University
Priority to CN202110508833.9A
Publication of CN113177956A
Status: Pending

Classifications

    • G06T 7/11 Region-based segmentation
    • G06F 18/2415 Classification techniques based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06N 3/045 Combinations of networks
    • G06N 3/08 Learning methods
    • G06T 5/40 Image enhancement or restoration using histogram techniques
    • G06T 5/50 Image enhancement or restoration using two or more images, e.g. averaging or subtraction
    • G06T 5/70 Denoising; Smoothing
    • G06T 5/73 Deblurring; Sharpening
    • G06T 2207/10032 Satellite or aerial image; Remote sensing
    • G06T 2207/20016 Hierarchical, coarse-to-fine, multiscale or multiresolution image processing; Pyramid transform
    • G06T 2207/20032 Median filtering
    • G06T 2207/20048 Transform domain processing
    • G06T 2207/20081 Training; Learning
    • G06T 2207/20084 Artificial neural networks [ANN]
    • G06T 2207/20192 Edge enhancement; Edge preservation
    • G06T 2207/20221 Image fusion; Image merging

Abstract

The invention discloses a semantic segmentation method for unmanned aerial vehicle remote sensing images. The image is first divided into blocks, and semantic segmentation is then performed block by block, which reduces the volume of remote sensing image data read at any one time and lowers the risk of memory overflow during segmentation. The invention designs and implements a zoom device that extracts target images over different spatial ranges and, by constructing a featurized image pyramid, retains the most complete key feature information in the target images, so that the most accurate classification predictions can be obtained and high per-pixel classification accuracy is ensured. In implementation, the image slices are processed with multi-process parallel semantic segmentation to reduce total running time. The classifier uses a lightweight convolutional neural network, which minimizes model size without reducing image classification accuracy and reduces the memory and disk space consumed by the method in application.

Description

Semantic segmentation method for unmanned aerial vehicle remote sensing image
Technical Field
The invention relates to the field of image semantic segmentation, in particular to a semantic segmentation method for unmanned aerial vehicle remote sensing images.
Background
Image semantic segmentation is an important means of computer interpretation of remote sensing images. Semantic segmentation determines the category of each pixel in an image, thereby generating a land cover/land use classification map. In current land use classification surveys, low-altitude unmanned aerial vehicle remote sensing imagery greatly improves working efficiency owing to its ease of use, low acquisition cost, and high spatial resolution. However, compared with conventional medium- and high-altitude remote sensing (aerial and satellite) images, unmanned aerial vehicle remote sensing images contain more complex ground object information, which greatly limits the application of traditional segmentation methods such as support vector machine, neural network, decision tree, and expert system classification. Against the background of the rapid development of artificial intelligence, deep learning can learn image feature representations directly from massive image data and thus solve more computer vision tasks such as image classification, image recognition, and image segmentation. Deep learning has therefore increasingly become a new method for ground object classification in remote sensing images.
At present, remote sensing image classification methods based on deep learning mainly build on classical or state-of-the-art image segmentation models. The fully convolutional network (FCN) is the most basic framework for image segmentation. The FCN replaces all fully connected layers at the end of a convolutional neural network with convolutional layers to obtain feature maps from low to high dimensions. Pixel-by-pixel classification prediction is then performed on the feature maps of different dimensions, the feature maps are upsampled to the size of the original image, and the prediction results are finally fused. Since the FCN mainly relies on a deep network to extract features and classify, it is insensitive to small objects and inaccurate on segmentation details. The U-Net model is an improvement and extension of the FCN, following the same idea of semantic segmentation: convolutional and pooling layers extract features, and deconvolution layers restore the image size. The U-Net structure uses a symmetric contracting-expanding path: the contracting path captures context and extracts image features layer by layer, while the expanding path precisely localizes and restores the positional information of the image. Experiments show that the U-Net model can obtain more accurate classification results from fewer training samples. However, the model is usually used for binary semantic segmentation, and multi-class segmentation requires additional modification of the model structure. The SegNet model is similar to the FCN, with the fully connected layers removed. Its core structure comprises an encoder network, a decoder network, and a pixel-wise classification layer. The encoder uses the first 13 convolutional layers of the VGG-16 network and extracts high-dimensional feature maps by downsampling. Each encoder layer corresponds to a decoder layer that upsamples the low-resolution feature maps to full input resolution for pixel classification. Validation shows that its training speed and segmentation accuracy are better than those of the FCN, but it classifies pixels independently without considering the spatial relations among them, so the segmentation results exhibit a blocky effect. To address this, the DeepLab model introduces a fully connected conditional random field (CRF) at the last layer of the convolutional network to recover target boundary details and achieve accurate localization. Building on semantic segmentation, Mask R-CNN raises the level of the segmentation task to instance segmentation. Mask R-CNN follows the idea of Faster R-CNN: a ResNet residual network extracts features, and a mask prediction branch is added. The framework still adopts a two-stage strategy: a Region Proposal Network (RPN) first generates regions of interest, and each region is then classified and localized while a binary mask is computed. This gives Mask R-CNN high segmentation accuracy, while its Feature Pyramid Network (FPN) strategy supports multi-scale detection.
Although these segmentation models achieve good results on image test datasets, segmentation accuracy and running efficiency still face great challenges on unmanned aerial vehicle remote sensing images. First, the ground objects are relatively complex, target scales vary widely, and training samples are insufficient, so classification accuracy is low. Second, the data volume is large, while deep networks give the segmentation models many modules and large weight files, so reading image data, loading weights, and predicting classification results are inefficient. Finally, the segmentation models suffer from information attenuation when extracting target features and perform poorly when segmenting small ground objects. Current image segmentation models mainly extract features via convolutional layers, using a single feature map, a pyramid feature hierarchy, or a feature pyramid network. Although these three methods can extract high-dimensional abstract image features, all of them lose part or most of the original key features of the image. In contrast, the featurized image pyramid has strong semantics at all levels and can retain nearly all image feature information; it is widely used in algorithms ranked highly in the ImageNet and COCO detection challenges, but it consumes more time and places higher demands on computation and memory. These problems restrict the application of deep learning semantic segmentation models to unmanned aerial vehicle remote sensing image classification and urgently need to be solved.
Disclosure of Invention
The purpose of the invention is as follows: the invention aims to solve the problems of low precision and low efficiency of existing image semantic segmentation algorithms in unmanned aerial vehicle remote sensing image classification, and provides a high-precision semantic segmentation method for unmanned aerial vehicle remote sensing images. The method achieves a good segmentation effect, greatly improves the classification accuracy of remote sensing images, and accurately segments high-resolution unmanned aerial vehicle remote sensing images.
The technical scheme is as follows: the semantic segmentation method for unmanned aerial vehicle remote sensing images according to the invention comprises the following steps:
(1) Image preprocessing: performing image enhancement on the unmanned aerial vehicle remote sensing image, including image denoising, image sharpening, image equalization, and the like;
(2) Initializing parameters: initializing the parameters involved in the method, such as the image block size s, the number of image slices n, the zoom window radius r, the zoom level (i.e., focal length) f, and the like;
(3) Image blocking and slicing: dividing the original image into blocks according to a preset block size and temporarily storing them by block number, with m1 pixels overlapping between two adjacent image blocks to preserve complete boundary semantics; creating an image block queue according to the block numbers and performing semantic segmentation on each image block in dequeue order; equally dividing the image block to be processed according to the preset number of image slices, with m2 pixels overlapping between two adjacent image slices to preserve complete boundary semantics.
(4) Constructing an image segmentation process pool: constructing a process pool according to the configured number of CPU cores, with the sub-processes in the pool handling tasks in an asynchronous, non-blocking manner; sending each image slice obtained above to a sub-process and creating parallel image segmentation tasks.
(5) Traversing the pixels with skipping and extracting target sub-images with the zoom device: taking the image slice coordinate system as reference, sliding pixel by pixel and extracting a target sub-image with the initial focal length of the zoom device; when the output probability of the classified target sub-image is below a threshold, zooming up one level according to the zoom level to enlarge the range of the extracted target sub-image; after the target sub-image is classified, the zoom device restores the initial focal length, skips to the next pixel according to the span, and extracts the next target sub-image.
(6) Classification by the classifier: classifying the target sub-images extracted from the image slice with the classifier, outputting the classification result of the central pixel, and completing semantic segmentation within the image slice;
(7) Image slice merging and boundary fusion: merging all semantically segmented image slices and fusing the overlapping boundaries.
(8) Image block merging and boundary fusion: merging the semantically segmented image blocks according to their block numbers and fusing the overlapping boundaries.
(9) Post-processing: post-processing the semantic segmentation image obtained by the above steps to obtain a more accurate remote sensing image classification result.
Preferably, in step (1), image noise (such as salt-and-pepper noise) is removed by median and mean filtering with a 3 × 3 convolution kernel, i.e., a noise pixel is replaced by the median or mean of the intensity values in its neighborhood. Image sharpening, used to highlight ground object boundaries, mainly employs the Laplacian operator on 4- and 8-neighborhoods. Image equalization, used to keep the brightness of image regions consistent and improve the clarity of some regions, mainly employs global histogram equalization.
Preferably, in step (2), the parameters to be initialized mainly include: the image block size s, the number of overlapping pixels between image blocks m1, the number of image slices n, the number of overlapping pixels between image slices m2, the number of CPU cores in the process pool, the zoom level (i.e., focal length) f and translation coefficient c of the zoom device, and the classification probability threshold thr.
Preferably, in step (3), the image block number has the format: sequence number_row index_column index. The sequence number is the unique identifier of a block and is calculated from the original image size and the image block size; the row index and column index are obtained from the number of blocks along the height and the number of blocks along the width, respectively.
bid = N_row * N_col = f_ceil(H/s) * f_ceil(W/s)
where bid is the total number of image blocks, N_row is the number of blocks along the height, N_col is the number of blocks along the width, f_ceil() is the ceiling function, H and W are the pixel height and width of the original image, respectively, and s is the image block size.
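For illustration, a minimal Python sketch of this block-numbering scheme; the function name block_grid and the return format are our assumptions, not taken from the patent:

```python
import math

def block_grid(H, W, s):
    """Compute the block count and 'sequence_row_column' identifiers for an
    H x W image split into s x s blocks (sketch of the numbering scheme)."""
    n_row = math.ceil(H / s)   # number of blocks along the height
    n_col = math.ceil(W / s)   # number of blocks along the width
    bid = n_row * n_col        # total number of image blocks
    ids = [f"{i * n_col + j}_{i}_{j}"  # sequence number _ row index _ column index
           for i in range(n_row) for j in range(n_col)]
    return bid, ids

# Example: a 1000 x 1500 pixel image with 512-pixel blocks gives 2 x 3 = 6 blocks.
count, ids = block_grid(1000, 1500, 512)
print(count, ids[0])  # 6 '0_0_0'
```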
Preferably, in step (5), the zoom device mimics the automatic zoom function of a camera lens, and the zoom window size r and the zoom level f have the following relationship:
r=R(f)=2*f+c
where r is the window size, f is the zoom level, and c is the translation coefficient. The smaller f is, the smaller the spatial range of ground objects captured by the zoom device, and vice versa.
Whether the zoom device zooms is related to the classification probability of the currently extracted feature image and is expressed by the following formula:
Z(p) = 1 (zoom in), if p < thr; Z(p) = 0 (do not zoom), if p ≥ thr
in the formula, p is the maximum classification probability, and thr is the probability threshold.
The span by which the zoom device skips from the current pixel to the next is related to the zoom level and is expressed as:
k = K_ceil(f/2)
where k is the span and K_ceil() is the ceiling function.
In general, the zoom device does not activate the zoom mechanism: it calculates the window size from the initial zoom level and extracts the target image centered on the current pixel for subsequent classification. However, when the classification probability is below the corresponding threshold, the zoom device starts the zoom mechanism and gradually enlarges the range of the target image until the classification probability meets the classification requirement. Repeating this process yields a series of featurized image pyramids. Because these images retain the most complete key feature information, the most accurate classification values can be obtained.
Preferably, in step (6), the classifier that determines the class of the target sub-image mainly uses a lightweight convolutional neural network composed of a series of convolutional layers (compression and expansion layers), pooling layers, normalization layers, activation function layers, and fully connected layers. Before the classifier is used, sample regions of interest can be selected in advance from the remote sensing image for classification training, with the learning rate adjusted dynamically in steps (Multistep) during training.
Preferably, in the step (7), the image overlap boundary fusion method mainly uses a median filter with a convolution kernel of 5 × 5.
Preferably, in the step (8), the image overlap boundary fusion method mainly uses a median filter with a convolution kernel of 5 × 5.
Preferably, in step (9), the finally generated semantic segmentation image may contain a small number of false pixels, such as fragmented patches and island pixels. To remove these false pixels effectively, this step mainly uses erosion and dilation operations.
The invention has the beneficial effects that:
the method can achieve a good segmentation effect, greatly improves the classification precision of the remote sensing image, and achieves the purpose of accurately segmenting the high-resolution unmanned aerial vehicle remote sensing image.
1) Aiming at the large spatial scale of unmanned aerial vehicle remote sensing images, the invention first divides the image into blocks and then performs semantic segmentation block by block, which reduces the volume of remote sensing image data read at any one time and lowers the risk of memory overflow during segmentation.
2) The invention designs and implements a zoom device that extracts target images over different spatial ranges and, by constructing a featurized image pyramid, retains the most complete key feature information in the target images, so that the most accurate classification predictions can be obtained and high per-pixel classification accuracy is ensured.
3) The invention processes the image slices with multi-process parallel semantic segmentation to reduce total running time, and the skip traversal of the zoom device further shortens the run time. This ensures the timeliness of remote sensing image semantic segmentation.
4) The classifier uses a lightweight convolutional neural network, which minimizes model size without reducing image classification accuracy and reduces the memory and disk space consumed by the method in application.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings used in the embodiments are briefly described below. It is obvious that the following drawings depict only some embodiments of the present invention, and that those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic flow chart of the semantic segmentation method for unmanned aerial vehicle remote sensing images.
Fig. 2 is a schematic diagram of image blocking and slicing according to an embodiment of the present invention.
Fig. 3 is a schematic diagram of a zoom device structure and a constructed image pyramid according to an embodiment of the invention.
Fig. 4 is a schematic diagram of a classifier according to an embodiment of the present invention.
Fig. 5 is a diagram of a semantic segmentation effect of an unmanned aerial vehicle remote sensing image in the embodiment of the present invention.
Detailed Description
The technical solution of the present invention will be further described in detail below with reference to the embodiments and the accompanying drawings. Specifically, the invention relates to a semantic segmentation method for unmanned aerial vehicle remote sensing images whose specific steps, as shown in fig. 1, are as follows:
(1) Image preprocessing: enhancing the unmanned aerial vehicle remote sensing image to reduce noise and highlight details, mainly including image denoising, image sharpening, and image equalization.
In the above step, image noise (such as salt-and-pepper noise) is mainly removed by median and mean filtering with a 3 × 3 convolution kernel, i.e., a noise pixel is replaced by the median or mean of the intensity values in its neighborhood. Image sharpening, used to highlight ground object boundaries, mainly employs the Laplacian operator on 4- and 8-neighborhoods. Image equalization, used to keep the brightness of image regions consistent and improve the clarity of some regions, mainly employs global histogram equalization.
In the above step, the processing procedure of the Laplacian operator is as follows: an edge enhancement operator first highlights local edges in the image, and edge points are then tracked step by step from locations of high edge strength along two different directions until the two tracks meet and form a closed contour.
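As a hedged illustration of step (1), the following Python/OpenCV sketch chains the 3 × 3 median and mean filtering, Laplacian sharpening, and global histogram equalization described above; the exact order and parameters are illustrative assumptions:

```python
import cv2
import numpy as np

def preprocess(img_bgr):
    """Denoise, sharpen, and equalize a UAV remote sensing image (sketch)."""
    # Denoising: 3x3 median filter for salt-and-pepper noise, then a 3x3
    # mean (box) filter; each noise pixel is replaced from its neighborhood.
    img = cv2.medianBlur(img_bgr, 3)
    img = cv2.blur(img, (3, 3))
    # Sharpening: subtract the Laplacian (8-neighborhood kernel) to highlight
    # ground-object boundaries: g = f - laplacian(f).
    lap = cv2.Laplacian(img, cv2.CV_16S, ksize=3)
    img = cv2.convertScaleAbs(img.astype(np.int16) - lap)
    # Equalization: global histogram equalization on the luminance channel
    # keeps region brightness consistent and improves local clarity.
    ycrcb = cv2.cvtColor(img, cv2.COLOR_BGR2YCrCb)
    ycrcb[:, :, 0] = cv2.equalizeHist(ycrcb[:, :, 0])
    return cv2.cvtColor(ycrcb, cv2.COLOR_YCrCb2BGR)
```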
(2) Initializing parameters: initializing the parameters involved in the method, mainly including: the image block size s, the number of overlapping pixels between image blocks m1, the number of image slices n, the number of overlapping pixels between image slices m2, the number of CPU cores in the process pool, the zoom level (i.e., focal length) f and translation coefficient c of the zoom device, and the classification probability threshold thr.
(3) Image blocking and slicing: dividing the original image into blocks according to a preset block size and temporarily storing them by block number, with m1 pixels overlapping between two adjacent image blocks to preserve complete boundary semantics; creating an image block queue according to the block numbers and performing semantic segmentation on each image block in dequeue order; equally dividing the image block to be processed according to the preset number of image slices, with m2 pixels overlapping between two adjacent image slices to preserve complete boundary semantics. As shown in fig. 2.
In the above step, the image block number has the format: sequence number_row index_column index. The sequence number is the unique identifier of a block and is calculated from the original image size and the image block size; the row index and column index are obtained from the number of blocks along the height and the number of blocks along the width, respectively.
bid = N_row * N_col = f_ceil(H/s) * f_ceil(W/s)
where bid is the total number of image blocks, N_row is the number of blocks along the height, N_col is the number of blocks along the width, f_ceil() is the ceiling function, H and W are the pixel height and width of the original image, respectively, and s is the image block size.
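A hedged sketch of the overlapping division follows; extending each block m pixels back toward its predecessor is one way to realize the m1/m2-pixel overlap, and the border clamping is our assumption:

```python
import math

def split_with_overlap(img, s, m):
    """Split an image into s x s blocks (or equally sized slices) that
    overlap adjacent pieces by m pixels; returns (identifier, view) pairs."""
    H, W = img.shape[:2]
    n_row, n_col = math.ceil(H / s), math.ceil(W / s)
    pieces = []
    for i in range(n_row):
        for j in range(n_col):
            y0 = max(0, i * s - m)          # reach m pixels into the neighbor
            x0 = max(0, j * s - m)
            y1 = min(H, (i + 1) * s)
            x1 = min(W, (j + 1) * s)
            pieces.append((f"{i * n_col + j}_{i}_{j}", img[y0:y1, x0:x1]))
    return pieces
```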
(4) Constructing an image segmentation process pool: constructing a process pool according to the configured number of CPU cores, with the sub-processes handling tasks in an asynchronous, non-blocking manner; sending each image slice obtained above to a sub-process and creating and executing parallel image segmentation tasks. As shown in fig. 3.
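A hedged sketch of the asynchronous, non-blocking process pool using Python's standard multiprocessing module; segment_slice is a placeholder for the zoom traversal and classification of steps (5) and (6):

```python
from multiprocessing import Pool, cpu_count

def segment_slice(task):
    """Worker: semantic segmentation of one image slice (placeholder body)."""
    slice_id, slice_img = task
    label_map = ...  # steps (5)-(6): zoom traversal + classifier over the slice
    return slice_id, label_map

def segment_block_parallel(slices, n_cores=None):
    """Dispatch each slice to a sub-process with apply_async (asynchronous,
    non-blocking) and collect the per-slice label maps when all tasks finish.
    Call from under `if __name__ == "__main__":` on spawn-based platforms."""
    with Pool(processes=n_cores or cpu_count()) as pool:
        tasks = [pool.apply_async(segment_slice, ((sid, img),))
                 for sid, img in slices]
        return dict(t.get() for t in tasks)
```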
(5) Traversing the pixels with skipping and extracting target sub-images with the zoom device: taking the image slice coordinate system as reference, sliding pixel by pixel and extracting a target sub-image with the initial focal length of the zoom device; when the output probability of the classified target sub-image is below a threshold, zooming up one level according to the zoom level to enlarge the range of the extracted target sub-image; after the target sub-image is classified, the zoom device restores the initial focal length, skips to the next pixel according to the span, and extracts the next target sub-image.
In the above step, the zoom device mimics the automatic zoom function of a camera lens, and the zoom window size r and the zoom level f have the following relationship:
r=R(f)=2*f+c
where r is the window size, f is the zoom level, and c is the translation coefficient. The smaller f is, the smaller the spatial range of ground objects captured by the zoom device, and vice versa.
Whether the zoom device zooms is related to the classification probability of the currently extracted feature image and is expressed by the following formula:
Z(p) = 1 (zoom in), if p < thr; Z(p) = 0 (do not zoom), if p ≥ thr
in the formula, p is the maximum classification probability, and thr is the probability threshold.
The span by which the zoom device skips from the current pixel to the next is related to the zoom level and is expressed as:
k = K_ceil(f/2)
where k is the span and K_ceil() is the ceiling function.
In general, the zoom device does not activate the zoom mechanism: it calculates the window size from the initial zoom level and extracts the target image centered on the current pixel for subsequent classification. However, when the classification probability is below the corresponding threshold, the zoom device starts the zoom mechanism and gradually enlarges the range of the target image until the classification probability meets the classification requirement. Repeating this process yields a series of featurized image pyramids. Because these images retain the most complete key feature information, the most accurate classification values can be obtained.
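The zoom behavior can be summarized in a hedged Python sketch; classify stands in for the classifier of step (6) and returns (label, maximum probability), while the zoom cap f_max and the window clamping are our assumptions:

```python
import math

def zoom_traverse(slice_img, f0, c, thr, f_max, classify):
    """Skip-traverse a slice, enlarging the zoom window until the classifier
    is confident, and label the central pixel of each window (sketch).
    Assumes the initial zoom level f0 >= 1."""
    H, W = slice_img.shape[:2]
    k = math.ceil(f0 / 2)                    # span k = K_ceil(f/2)
    labels = {}
    for y in range(0, H, k):                 # skip traversal by span k
        for x in range(0, W, k):
            f = f0                           # restore the initial focal length
            while True:
                r = 2 * f + c                # window size r = R(f) = 2*f + c
                win = slice_img[max(0, y - r):y + r + 1,
                                max(0, x - r):x + r + 1]
                label, p = classify(win)
                if p >= thr or f >= f_max:   # confident, or zoom cap reached
                    break
                f += 1                       # zoom up: wider spatial context
            labels[(y, x)] = label           # class of the central pixel
    return labels
```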
(6) Classification by the classifier: classifying the target sub-images extracted from the image slice with the classifier, outputting the classification result of the central pixel, and completing semantic segmentation within the image slice;
In the above step, the classifier that determines the class of the target sub-image mainly adopts a lightweight convolutional neural network. In this embodiment, SqueezeNet is used (although not limited thereto), composed of a series of convolutional layers (compression and expansion layers), pooling layers, normalization layers, activation function layers, and fully connected layers. The core module of the SqueezeNet model is the fire module, consisting of a squeeze layer (1 × 1 convolutions) and an expand layer (1 × 1 and 3 × 3 convolutions), which allows the convolutional neural network to maintain comparable accuracy on a limited parameter budget, as shown in fig. 4. The input image size of the SqueezeNet model is preset to 32 × 32 pixels, but is not limited thereto; each layer keeps its original number of channels; the number of categories depends on the semantic segmentation classes of the remote sensing image; the learning rate is initially set to 0.001 and is dynamically adjusted in steps (Multistep) during training, but is not limited thereto. Before the SqueezeNet model is used, sample regions of interest can be selected in advance from the remote sensing image for classification training. In addition, classical convolutional neural network frameworks such as VGGNet, ResNet, and GoogLeNet can also be used as the classifier in this method.
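A hedged PyTorch sketch of a SqueezeNet-style classifier for 32 × 32 target sub-images; the channel widths and depth are illustrative, since the patent fixes only the input size, learning rate, and module style:

```python
import torch
import torch.nn as nn

class Fire(nn.Module):
    """SqueezeNet fire module: 1x1 squeeze, then parallel 1x1/3x3 expand."""
    def __init__(self, in_ch, squeeze_ch, expand_ch):
        super().__init__()
        self.squeeze = nn.Sequential(nn.Conv2d(in_ch, squeeze_ch, 1),
                                     nn.ReLU(inplace=True))
        self.expand1 = nn.Conv2d(squeeze_ch, expand_ch, 1)
        self.expand3 = nn.Conv2d(squeeze_ch, expand_ch, 3, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        x = self.squeeze(x)
        return self.relu(torch.cat([self.expand1(x), self.expand3(x)], dim=1))

class TinySqueezeNet(nn.Module):
    """Minimal classifier for 32x32 target sub-images (illustrative widths)."""
    def __init__(self, n_classes):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(inplace=True),  # 16x16
            Fire(64, 16, 64),                        # -> 128 channels
            nn.MaxPool2d(2),                         # 8x8
            Fire(128, 32, 128),                      # -> 256 channels
            nn.MaxPool2d(2),                         # 4x4
        )
        self.classifier = nn.Sequential(
            nn.Conv2d(256, n_classes, 1),            # per-class score maps
            nn.AdaptiveAvgPool2d(1), nn.Flatten())   # global average pooling

    def forward(self, x):
        return self.classifier(self.features(x))    # logits; softmax gives p
```

The step-wise learning-rate adjustment could correspond to torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[...], gamma=0.1) starting from the initial rate of 0.001.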
(7) Image slice merging and boundary fusion: merging all semantically segmented image slices and fusing the overlapping boundaries.
In the above step, the overlapping boundaries are mainly fused with a median filter with a 5 × 5 convolution kernel, but the method is not limited thereto.
(8) Image block merging and boundary fusion: merging the semantically segmented image blocks according to their block numbers and fusing the overlapping boundaries.
In the above step, the overlapping boundaries are mainly fused with a median filter with a 5 × 5 convolution kernel, but the method is not limited thereto.
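A hedged sketch of the overlap-boundary fusion: after pasting the labeled pieces into a mosaic, a 5 × 5 median filter is applied over a narrow band around each seam; the band width and seam bookkeeping are our assumptions:

```python
import cv2
import numpy as np

def fuse_boundaries(label_map, seams, band=4):
    """Smooth a 5x5-median-filtered band around each seam of the merged
    label map. `seams` lists (axis, position): axis 0 = row, 1 = column."""
    smoothed = cv2.medianBlur(label_map.astype(np.uint8), 5)
    fused = label_map.astype(np.uint8).copy()
    for axis, pos in seams:
        lo, hi = max(0, pos - band), pos + band
        if axis == 0:
            fused[lo:hi, :] = smoothed[lo:hi, :]   # horizontal seam row band
        else:
            fused[:, lo:hi] = smoothed[:, lo:hi]   # vertical seam column band
    return fused
```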
(9) Post-processing: the finally generated semantic segmentation image may contain a small number of false pixels, such as fragmented patches and island pixels. To remove these false pixels effectively and obtain a more accurate remote sensing image classification result, post-processing is required, mainly using erosion and dilation operations. The effect of the final classified remote sensing image is shown in fig. 5.
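A hedged sketch of the erosion-and-dilation post-processing, applied per class mask with OpenCV; the 3 × 3 structuring element and the open-then-close order are illustrative choices:

```python
import cv2
import numpy as np

def clean_label_map(label_map, n_classes, ksize=3):
    """Remove island pixels and fragmented patches: morphological opening
    (erode then dilate) drops specks, closing (dilate then erode) fills
    small holes; each class mask is cleaned and the map reassembled."""
    kernel = np.ones((ksize, ksize), np.uint8)
    cleaned = np.zeros_like(label_map)
    for c in range(n_classes):
        mask = (label_map == c).astype(np.uint8)
        mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)
        mask = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, kernel)
        cleaned[mask == 1] = c   # later classes overwrite on rare overlaps
    return cleaned
```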
The above description covers only the preferred embodiments of the present invention. It should be noted that those skilled in the art can make various modifications and adaptations without departing from the principles of the invention, and these are intended to fall within the scope of the invention.

Claims (10)

1. A semantic segmentation method for unmanned aerial vehicle remote sensing images is characterized by comprising the following steps:
step 1: preprocessing the unmanned aerial vehicle remote sensing image;
step 2: initializing the related parameters;
step 3: partitioning, numbering, and slicing the preprocessed unmanned aerial vehicle remote sensing image;
step 4: constructing an image segmentation process pool, in which the sub-processes handle tasks in an asynchronous, non-blocking manner; sending each image slice obtained in step 3 to a sub-process and creating parallel image segmentation tasks;
step 5: traversing the pixels in each image slice with skipping and extracting target sub-images with the zoom device;
step 6: classifying the target sub-images extracted from the image slice with a classifier and outputting the classification result of the central pixel, completing semantic segmentation within the image slice;
step 7: merging all semantically segmented image slices and fusing the overlapping boundaries;
step 8: merging the semantically segmented image blocks according to their block numbers and fusing the overlapping boundaries;
step 9: post-processing the semantic segmentation image obtained by the above steps to obtain a more accurate remote sensing image classification result.
2. The semantic segmentation method for the unmanned aerial vehicle remote sensing image according to claim 1, wherein in step 1, the preprocessing comprises image denoising, image sharpening, and image equalization; the image is filtered with median and mean filters with a 3 × 3 convolution kernel; the image sharpening operation adopts the Laplacian operator on 4- and 8-neighborhoods; the image equalization operation adopts global histogram equalization.
3. The semantic segmentation method for the unmanned aerial vehicle remote sensing image according to claim 1, wherein in step 2, the parameters to be initialized comprise: the image block size s, the number of overlapping pixels between image blocks m1, the number of image slices n, the number of overlapping pixels between image slices m2, the number of CPU cores in the process pool, the zoom level f and translation coefficient c of the zoom device, and the classification probability threshold thr.
4. The semantic segmentation method for the unmanned aerial vehicle remote sensing image according to claim 1, wherein step 3 specifically comprises: dividing the original image into blocks according to a preset block size and temporarily storing them by block number, with m1 pixels overlapping between two adjacent image blocks; creating an image block queue according to the block numbers and performing semantic segmentation on each image block in dequeue order; equally dividing the image block to be processed according to the preset number of image slices, with m2 pixels overlapping between two adjacent image slices.
5. The semantic segmentation method for the unmanned aerial vehicle remote sensing image according to claim 4, wherein the image block numbering format is: sequence number_row index_column index, where the sequence number is the unique identifier of a block, calculated from the original image size and the image block size, and the row index and column index are obtained from the number of blocks along the height and the number of blocks along the width, respectively,
bid = N_row * N_col = f_ceil(H/s) * f_ceil(W/s)
where bid is the total number of image blocks, N_row is the number of blocks along the height, N_col is the number of blocks along the width, f_ceil() is the ceiling function, H and W are the pixel height and width of the original image, respectively, and s is the image block size.
6. The semantic segmentation method for the unmanned aerial vehicle remote sensing image according to claim 1, wherein step 5 specifically comprises: taking the image slice coordinate system as reference, sliding pixel by pixel with the initial focal length of the zoom device to extract a target sub-image; when the output probability of the classified target sub-image is below a threshold, zooming up one level according to the zoom level to enlarge the range of the extracted target sub-image; after the target sub-image is classified, the zoom device restores the initial focal length, skips to the next pixel according to the span, and extracts the next target sub-image.
7. The semantic segmentation method for the unmanned aerial vehicle remote sensing image according to claim 1, wherein in step 6, a lightweight convolutional neural network is adopted as the classifier for determining the class of the target sub-image, composed of a series of convolutional layers, pooling layers, normalization layers, activation function layers, and fully connected layers; sample regions of interest are selected in advance on the remote sensing image for classification training before the classifier is used, with the learning rate dynamically adjusted in steps during training.
8. The semantic segmentation method for the unmanned aerial vehicle remote sensing image according to claim 1, wherein in step 7, the image overlapping boundary fusion method adopts a median filter with a convolution kernel of 5 x 5.
9. The semantic segmentation method for the unmanned aerial vehicle remote sensing image according to claim 1, wherein in step 8, the image overlapping boundary fusion method adopts a median filter with a convolution kernel of 5 x 5.
10. The semantic segmentation method for unmanned aerial vehicle remote sensing images according to claim 1, wherein in step 9, the post-processing comprises erosion and dilation operations.
CN202110508833.9A 2021-05-11 2021-05-11 Semantic segmentation method for unmanned aerial vehicle remote sensing image Pending CN113177956A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110508833.9A CN113177956A (en) 2021-05-11 2021-05-11 Semantic segmentation method for unmanned aerial vehicle remote sensing image

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110508833.9A CN113177956A (en) 2021-05-11 2021-05-11 Semantic segmentation method for unmanned aerial vehicle remote sensing image

Publications (1)

Publication Number Publication Date
CN113177956A true CN113177956A (en) 2021-07-27

Family

ID=76928825

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110508833.9A Pending CN113177956A (en) 2021-05-11 2021-05-11 Semantic segmentation method for unmanned aerial vehicle remote sensing image

Country Status (1)

Country Link
CN (1) CN113177956A (en)



Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103868499A (en) * 2014-02-28 2014-06-18 北京空间机电研究所 Intelligent optical remote sensing system
CN108369635A (en) * 2015-11-08 2018-08-03 阿格洛英公司 The method with analysis is obtained for aerial image
CN107610141A (en) * 2017-09-05 2018-01-19 华南理工大学 A kind of remote sensing images semantic segmentation method based on deep learning
CN110689544A (en) * 2019-09-06 2020-01-14 哈尔滨工程大学 Method for segmenting delicate target of remote sensing image
CN110675408A (en) * 2019-09-19 2020-01-10 成都数之联科技有限公司 High-resolution image building extraction method and system based on deep learning
CN112084923A (en) * 2020-09-01 2020-12-15 西安电子科技大学 Semantic segmentation method for remote sensing image, storage medium and computing device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Sergio Bernabé et al.: "A new parallel tool for classification of remotely sensed imagery", Computers & Geosciences, pages 208-218 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113946538A (en) * 2021-09-23 2022-01-18 南京大学 Convolutional layer fusion storage device and method based on line cache mechanism
CN113946538B (en) * 2021-09-23 2024-04-12 南京大学 Convolutional layer fusion storage device and method based on line caching mechanism
CN114445632A (en) * 2022-02-08 2022-05-06 支付宝(杭州)信息技术有限公司 Picture processing method and device


Legal Events

Date Code Title Description
PB01: Publication
SE01: Entry into force of request for substantive examination