CN113361662B - Urban rail transit remote sensing image data processing system and method - Google Patents

Urban rail transit remote sensing image data processing system and method

Info

Publication number
CN113361662B
CN113361662B (application number CN202110831395.XA)
Authority
CN
China
Prior art keywords
frame
feature
prediction
unit
remote sensing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110831395.XA
Other languages
Chinese (zh)
Other versions
CN113361662A (en)
Inventor
张开婷
李俊
周立荣
蔺陆洲
贾蔡
祝宏
邓平科
杨军
马长斗
张迪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Quantutong Position Network Co ltd
Original Assignee
Quantutong Position Network Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Quantutong Position Network Co ltd filed Critical Quantutong Position Network Co ltd
Priority to CN202110831395.XA priority Critical patent/CN113361662B/en
Publication of CN113361662A publication Critical patent/CN113361662A/en
Application granted granted Critical
Publication of CN113361662B publication Critical patent/CN113361662B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02ATECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A30/00Adapting or protecting infrastructure or their operation
    • Y02A30/60Planning or developing urban green infrastructure

Abstract

The invention relates to the technical field of urban rail transit remote sensing image data processing, and in particular to a system and method for processing urban rail transit remote sensing image data. The system comprises a remote sensing image feature extraction module, a region recommendation module and an object prediction semantic segmentation module. By fusing a convolutional neural network with a semantic segmentation algorithm, the invention realizes objectified extraction of buildings and thereby addresses the low accuracy and low speed of existing deep-learning-based building extraction from remote sensing images.

Description

Urban rail transit remote sensing image data processing system and method
Technical Field
The invention relates to the technical field of urban rail transit remote sensing image data processing, in particular to a system and a method for processing urban rail transit remote sensing image data.
Background
Existing methods for extracting buildings from remote sensing images with deep learning mainly use convolutional neural networks (CNNs). A CNN is an improvement on the fully connected neural network: instead of feeding raw image pixels to the input nodes, it feeds features obtained by image convolution and pooling, which reduces the number of input nodes and the overall network size and makes the architecture well suited to processing two-dimensional image data.
The existing approach of extracting buildings from remote sensing images with a convolutional neural network is mainly divided into two stages. In the first stage the model is trained: typical buildings are cropped from remote sensing images as training data, convolutional, pooling and fully connected layers are designed, and the network parameters are trained with the training data through forward propagation with error calculation and backward propagation with regression convergence, so that the model learns building characteristics. In the second stage data are predicted: a remote sensing image is input, a window of the designed size slides over it, the image inside the sliding window is fed to the convolutional neural network, and forward propagation predicts whether a building is present; detected buildings are marked on the original remote sensing image, thereby realizing building identification.
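As a rough illustration of this sliding-window prediction stage, the following Python sketch (written with PyTorch purely for illustration) scans an image with a fixed-size window and records the windows whose predicted building score passes a threshold; the window size, stride, threshold and the trained classifier are placeholder assumptions, not values taken from this disclosure.

# Illustrative sliding-window prediction over a remote sensing image.
# "classifier" stands for any trained CNN that maps one window to a building
# probability; the 64-pixel window, 32-pixel stride and 0.5 threshold are
# assumptions used only to make the sketch concrete.
import torch

def sliding_window_predict(image, classifier, window=64, stride=32, threshold=0.5):
    """image: float tensor of shape (3, H, W); returns a list of (x, y, w, h) hits."""
    _, height, width = image.shape
    hits = []
    for top in range(0, height - window + 1, stride):
        for left in range(0, width - window + 1, stride):
            patch = image[:, top:top + window, left:left + window]
            score = classifier(patch.unsqueeze(0))      # forward propagation
            if float(score) > threshold:                # window predicted to contain a building
                hits.append((left, top, window, window))
    return hits

Each hit is then marked on the original remote sensing image. Note that every window requires its own forward pass, which is the kind of repeated convolution that the shared-convolution design described later in this disclosure avoids.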
The prior-art method closest to the present invention is the R-CNN algorithm (Region-CNN), which is divided into four modules. The region proposal module (selective search) generates recommended regions of interest from the information in the image, proposing roughly 1000-2000 potential building region boxes per picture. The feature extraction module extracts features from each potential building region image with the classic convolutional neural network AlexNet. The extracted features are then sent to a linear classifier: a support vector machine (SVM) classifies the high-dimensional features extracted by the convolutional neural network, scores the features of each potential building region, and filters them with a threshold (0.5). Building regions that meet the requirement are sent to the bounding-box correction regression module, which regresses four sets of bounding-box parameters to locate the rectangular outer frame of the building precisely. The R-CNN model is trained in the same way as other convolutional neural networks: a building training data set is constructed and the network parameters are adjusted according to the forward-propagation error.
However, the R-CNN algorithm has the following disadvantages:
1. R-CNN only accepts single-scale remote sensing pictures as input; because remote sensing data differ in resolution and image quality, the existing method cannot compare the influence of multi-source, multi-size remote sensing images on the model;
2. R-CNN can extract buildings from remote sensing images with relatively high precision, but the preselected frames are produced by the slow selective search algorithm and the convolution network is computed repeatedly, so model training is slow and memory consumption is large;
3. the four R-CNN modules are independent of one another and connected in series, so they cannot run in parallel and computing resources are difficult to utilize fully;
4. because the classifier is a support vector machine, a classifier must be trained for each object class to be detected, which makes the training process complex;
5. instance segmentation is not fused in, so R-CNN can only extract the outer frame of a building and cannot complete instance segmentation of building objects.
Disclosure of Invention
To solve the problems in the prior art, the invention provides a system and method for processing urban rail transit remote sensing image data that fuse a convolutional neural network with a semantic segmentation algorithm to realize objectified building extraction, thereby addressing the low accuracy and low speed of existing deep-learning-based building extraction from remote sensing images.
The technical scheme adopted by the invention is as follows:
a processing system of urban rail transit remote sensing image data, which comprises a remote sensing image feature extraction module, a region recommendation module and an object prediction semantic segmentation module,
the remote sensing image feature extraction module is used for extracting trunk features of the urban rail transit remote sensing image and constructing a feature pyramid;
the region recommendation module is used for extracting the features of the feature pyramid with shared convolution to generate suggestion frames;
the object prediction semantic segmentation module is used for generating a prediction frame after convolving the local features intercepted by the suggestion frame, and for intercepting the local features of the prediction frame from the shared feature map according to the prediction frame to generate an object mask.
The remote sensing image feature extraction module comprises a trunk feature extraction unit, a feature pyramid construction unit and a feature pyramid unit, wherein,
the trunk feature extraction unit is used for extracting features of different levels of the remote sensing image with a multi-layer residual convolutional neural network, generating four levels of feature maps, each with 256 channels;
the feature pyramid construction unit is used for starting convolution and up-sampling from the feature map of the lowest dimension and superposing it on the feature map one level higher in dimension, constructing five layers of feature maps that form the shared feature pyramid unit.
The region recommendation module comprises a region recommendation convolutional network unit, an object prediction unit, a frame adjustment unit and a suggestion frame generation unit, wherein,
the region recommendation convolutional network unit is used for extracting the features of the feature pyramid with one layer of shared convolution;
the object prediction unit is used for predicting, with two dedicated convolutions, whether the sliding window at each feature point contains an object and the adjustment parameters relative to the sliding window, and for filtering the predicted feature points by a threshold on whether their sliding windows contain an object; a sliding window that passes the threshold is a preselected frame;
the frame adjustment unit adjusts the preselected frame window with the frame adjustment parameters, so that the suggestion frame generation unit generates the suggestion frames.
The object prediction semantic segmentation module comprises a feature interception unit, an object classification prediction unit, a frame adjustment prediction unit, a mask feature extraction unit and a mask prediction unit, wherein,
the feature interception unit is used for interpolating the local features intercepted by the suggestion frame into 7×7 feature maps through bilinear interpolation;
the object classification prediction unit is used for predicting, after two layers of convolution, the classification result and the frame adjustment parameters with two separate fully connected networks;
the frame adjustment prediction unit is used for adjusting, with the frame adjustment parameters, the frame of an object whose classification result is higher than the threshold into a prediction frame;
the mask feature extraction unit is used for intercepting the local features of the prediction frame from the shared feature map;
the mask prediction unit is used for unifying the feature map size to 14×14 with bilinear interpolation and, after two layers of convolution and one layer of deconvolution, predicting the object mask by interpolation.
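To make the division of labour between the three modules easier to follow, the following Python sketch (PyTorch-style, for illustration only) shows how they could be chained at inference time. The three callable arguments, the 0.5 score threshold and the returned tensor layouts are assumptions introduced here for illustration; they are not names or values defined in this disclosure.

# Illustrative composition of the three modules described above.  The three
# callables are hypothetical placeholders, each assumed to implement the
# behaviour of the corresponding module (feature extraction, region
# recommendation, object prediction with semantic segmentation).
import torch

def detect_buildings(image, feature_extractor, region_proposer, box_mask_head,
                     score_threshold=0.5):
    """image: float tensor of shape (1, 3, H, W)."""
    # Module 1: trunk features and the shared five-level feature pyramid
    pyramid = feature_extractor(image)            # list of feature maps P2..P6
    # Module 2: suggestion frames predicted from the shared pyramid
    proposals = region_proposer(pyramid)          # (N, 4) boxes in image coordinates
    # Module 3: classification, frame adjustment and mask prediction per frame
    scores, boxes, masks = box_mask_head(pyramid, proposals)
    keep = scores > score_threshold               # keep frames classified as buildings
    return boxes[keep], masks[keep]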
A processing method of urban rail transit remote sensing image data comprises the following steps:
A. inputting a high-resolution remote sensing image, extracting trunk features with a feature interception network, and constructing a feature pyramid;
B. inputting the trunk features into the region recommendation network to predict suggestion frames that may contain buildings;
C. intercepting the trunk features according to the suggestion frame, inputting them into a fully connected network to predict the object type, adjusting the suggestion frame with the synchronously predicted frame adjustment parameters to obtain a prediction frame, intercepting a local feature map from the trunk features according to the prediction frame, and inputting it into the convolution network of the semantic segmentation module to predict the building mask;
D. marking the obtained object prediction frames and their building masks on the image.
The step A specifically comprises the following steps:
A1, first extracting features of different levels of the remote sensing image with a multi-layer residual convolutional neural network, generating four levels of feature maps with 256 channels;
A2, starting convolution and up-sampling from the feature map of the lowest dimension and superposing it on the feature map one level higher in dimension, constructing a five-layer feature map as the shared feature pyramid.
The step B specifically comprises the following steps:
Extracting the feature points of the feature pyramid with one layer of shared convolution; predicting, with two dedicated convolutions, whether the sliding window at each feature point contains an object and the adjustment parameters relative to the sliding window; filtering the predicted feature points by a threshold on whether their sliding windows contain an object, a sliding window that passes the threshold being a preselected frame; and adjusting the preselected frame window with the frame adjustment parameters to obtain the suggestion frame.
The step C specifically comprises the following steps:
C1, interpolating the local features intercepted by the suggestion frame into a 7×7 feature map with bilinear interpolation, and, after two layers of convolution, predicting the classification result and the frame adjustment parameters with two separate fully connected networks;
C2, adjusting the frame of an object whose classification result is higher than the threshold into a prediction frame with the frame adjustment parameters;
C3, intercepting the local features of the prediction frame from the shared feature map, unifying the feature map size to 14×14 with bilinear interpolation, and, after two layers of convolution and one layer of deconvolution, predicting the object mask by interpolation.
The technical scheme provided by the invention has the beneficial effects that:
(1) Multi-source, multi-size remote sensing images are fused to compare the training effect of the model, so that the algorithm model has better robustness and generalization when dealing with multi-source remote sensing images;
(2) A shared convolution layer is adopted, which greatly accelerates training and application of the algorithm: each picture needs to be convolved only once, and no separate convolution calculation is needed for each candidate frame;
(3) The candidate regions are determined with a convolutional neural network, so the preselected target frames can be learned from training data, improving the extraction efficiency of the preselected frames;
(4) A multi-task fully connected network is adopted, with type prediction and corner offset sharing the same set of parameters, which greatly improves processing speed while also improving prediction accuracy;
(5) A semantic segmentation module is added to the convolutional neural network object classification, so that buildings are identified as objects while their detailed outlines are predicted at the same time.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a flow chart of a method for processing urban rail transit remote sensing image data.
Fig. 2 is a block diagram of a remote sensing image feature extraction module of a processing system for urban rail transit remote sensing image data according to the present invention.
Fig. 3 is a block diagram of a regional recommendation module of a processing system for urban rail transit remote sensing image data according to the present invention.
Fig. 4 is a structural block diagram of an object prediction semantic segmentation module of the urban rail transit remote sensing image data processing system.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the embodiments of the present invention will be described in further detail with reference to the accompanying drawings.
Example 1
As shown in fig. 1, a processing method of urban rail transit remote sensing image data comprises the following steps:
A. inputting a high-resolution remote sensing image, extracting trunk features with a feature interception network, and constructing a feature pyramid;
B. inputting the trunk features into the region recommendation network to predict suggestion frames that may contain buildings;
C. intercepting the trunk features according to the suggestion frame, inputting them into a fully connected network to predict the object type, adjusting the suggestion frame with the synchronously predicted frame adjustment parameters to obtain a prediction frame, intercepting a local feature map from the trunk features according to the prediction frame, and inputting it into the convolution network of the semantic segmentation module to predict the building mask;
D. marking the obtained object prediction frames and their building masks on the image.
The step A specifically comprises the following steps:
A1, first extracting features of different levels of the remote sensing image with a multi-layer residual convolutional neural network, generating four levels of feature maps with 256 channels;
A2, starting convolution and up-sampling from the feature map of the lowest dimension and superposing it on the feature map one level higher in dimension, constructing a five-layer feature map as the shared feature pyramid.
The step B specifically comprises the following steps:
Extracting the feature points of the feature pyramid with one layer of shared convolution; predicting, with two dedicated convolutions, whether the sliding window at each feature point contains an object and the adjustment parameters relative to the sliding window; filtering the predicted feature points by a threshold on whether their sliding windows contain an object, a sliding window that passes the threshold being a preselected frame; and adjusting the preselected frame window with the frame adjustment parameters to obtain the suggestion frame.
The step C specifically comprises the following steps:
C1, interpolating the local features intercepted by the suggestion frame into a 7×7 feature map with bilinear interpolation, and, after two layers of convolution, predicting the classification result and the frame adjustment parameters with two separate fully connected networks;
C2, adjusting the frame of an object whose classification result is higher than the threshold into a prediction frame with the frame adjustment parameters;
C3, intercepting the local features of the prediction frame from the shared feature map, unifying the feature map size to 14×14 with bilinear interpolation, and, after two layers of convolution and one layer of deconvolution, predicting the object mask by interpolation.
A specific example is given below for illustration:
the model input is an RGB three-channel high-resolution remote sensing image slice, and the dimension is as follows: 1024×1024×3 (length×width×channel).
(1) The image first enters the feature extraction module, whose purpose is to obtain the trunk features of the image (the module structure is shown in fig. 2). Four layers of trunk features are first obtained through the trunk feature extraction convolution network (with dimensions 256×256×256, 128×128×256, 64×64×256 and 32×32×256 respectively); the four layers of trunk features are then input into the pyramid construction convolution network to obtain five layers of feature maps (P2, P3, P4, P5 and P6, with dimensions 256×256×256, 128×128×256, 64×64×256, 32×32×256 and 16×16×256 respectively), which are used as the trunk features for later processing.
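A minimal PyTorch sketch of this pyramid construction is given below. It assumes a ResNet-style backbone whose four stages already output 256-channel maps C2-C5 at strides 4, 8, 16 and 32 (256×256, 128×128, 64×64 and 32×32 for a 1024×1024 input, matching the dimensions above); the 3×3 smoothing convolutions, nearest-neighbour upsampling and the max-pooling used to derive P6 from P5 are common feature-pyramid choices assumed here, since the disclosure only fixes the five output levels.

# Sketch of the pyramid-construction network of step (1): start from the
# smallest backbone map, repeatedly upsample and add the next larger map,
# and smooth each merged level; P6 is a downsampled copy of P5 (assumption).
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeaturePyramid(nn.Module):
    def __init__(self, channels=256):
        super().__init__()
        self.smooth = nn.ModuleList([nn.Conv2d(channels, channels, 3, padding=1)
                                     for _ in range(4)])
        self.pool = nn.MaxPool2d(kernel_size=1, stride=2)   # P5 -> P6

    def forward(self, c2, c3, c4, c5):
        p5 = self.smooth[3](c5)
        p4 = self.smooth[2](c4 + F.interpolate(p5, scale_factor=2, mode="nearest"))
        p3 = self.smooth[1](c3 + F.interpolate(p4, scale_factor=2, mode="nearest"))
        p2 = self.smooth[0](c2 + F.interpolate(p3, scale_factor=2, mode="nearest"))
        return [p2, p3, p4, p5, self.pool(p5)]

if __name__ == "__main__":
    fpn = FeaturePyramid()
    c2, c3, c4, c5 = (torch.zeros(1, 256, s, s) for s in (256, 128, 64, 32))
    for p in fpn(c2, c3, c4, c5):
        print(tuple(p.shape))   # (1,256,256,256) ... (1,256,16,16)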
(2) The region recommendation module is then entered; this module aims to predict suggestion frames for building regions on the picture from the trunk features (the module structure is shown in fig. 3). The five trunk feature maps P2-P6 are input into the region recommendation convolution network in turn. Taking P6 (dimension: 16×16×256) as an example to illustrate the operation of the region recommendation network: the length and width of P6 are 16×16, which in effect divides the original image (length and width: 1024×1024) into 16×16 small regions, each 64×64 in length and width, so each point on feature map P6 corresponds to a 64×64 region of the original image. The first task of the region recommendation convolution network is to predict whether the original-image region corresponding to each point on the feature map contains a building; for feature map P6 the output of the first task is 16×16 values, one per feature point, each in the range 0-1, where a value greater than 0.5 means the corresponding original-image region contains a building and otherwise the region is ignored. Because the original-image region corresponding to each feature point is fixed, while a building does not necessarily fall exactly inside that region and may be offset, the second task of the region recommendation convolution network is to predict how the corresponding original-image region of each point should be adjusted so that it exactly contains the building; for feature map P6 the output of the second task is 16×16×4, i.e. four predicted values per point, namely the adjustment parameters for the upper-left and lower-right corners of the corresponding original-image region. The regions of the feature points that contain buildings are adjusted with these parameters, giving the suggestion frames of feature map P6. To allow for differences in building size, each point of P6 corresponds to a 64×64 region of the original image while each point of P2 corresponds to a 4×4 region, so P6 can be regarded as predicting suggestion frames for large buildings and P2 for small buildings. The five layers of feature maps are input into the region recommendation convolution network in turn to obtain the suggestion frames of all five layers.
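The two prediction tasks of the region recommendation convolution network can be sketched per pyramid level as follows (PyTorch, illustrative only). The output shapes match the P6 example above, namely one objectness value and four corner-adjustment parameters per feature point; the 3×3 shared convolution, the 1×1 prediction convolutions, the ReLU and the sigmoid are assumptions, since the text fixes only the outputs.

# Sketch of the region recommendation head of step (2): one shared
# convolution, then one branch predicting a building score per feature point
# and one branch predicting the four corner adjustments.
import torch
import torch.nn as nn

class RegionRecommendationHead(nn.Module):
    def __init__(self, channels=256):
        super().__init__()
        self.shared = nn.Conv2d(channels, channels, 3, padding=1)  # shared convolution
        self.objectness = nn.Conv2d(channels, 1, 1)  # building / no building per point
        self.deltas = nn.Conv2d(channels, 4, 1)      # upper-left / lower-right adjustments

    def forward(self, feature_map):
        x = torch.relu(self.shared(feature_map))
        return torch.sigmoid(self.objectness(x)), self.deltas(x)

if __name__ == "__main__":
    head = RegionRecommendationHead()
    p6 = torch.zeros(1, 256, 16, 16)            # deepest pyramid level
    scores, deltas = head(p6)
    print(scores.shape, deltas.shape)           # (1,1,16,16) and (1,4,16,16)
    # feature points with score > 0.5 become preselected frames; their 64x64
    # image regions are then shifted by the four predicted parameters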
(3) The object prediction module is then entered to predict whether the interior of each suggestion frame really contains a building, together with the adjustment parameters for the corners of the suggestion frame (the module structure is shown in fig. 4). The previous module only roughly predicts whether each point of a feature map contains a building, and that result is not accurate, so this module intercepts the local features of the original-image region represented by each suggestion frame and uses fully connected neural networks for accurate prediction. A fully connected network requires a unified input dimension, but the suggestion frames differ in size and therefore in the number of feature points of their local features (for example, of two suggestion frames, one intercepted local feature may be 8×9 in length and width and the other 11×6; the first then has 72 points and the second 66, so their dimensions are not unified and cannot be input into a fully connected network). Bilinear interpolation is therefore used to obtain a 7×7×256 local feature map for every suggestion frame, which is input into two fully connected networks. The first fully connected network contains one hidden layer of 1024 nodes and outputs the classification result: a predicted value greater than 0.5 means the suggestion frame contains a building, otherwise the suggestion frame is discarded. The second fully connected network also contains one hidden layer of 1024 nodes and outputs 4 values that predict the adjustment parameters of the upper-left and lower-right corners of the suggestion frame; the corners of the suggestion frames that contain a building are then adjusted with the predicted parameters to obtain the prediction frames, which are the result of the object prediction module.
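The object prediction module of step (3) can be sketched as follows (PyTorch, illustrative only). The 7×7 resampling and the 1024-node hidden layers follow the description above; the use of torchvision's roi_align for the bilinear crop, the single sigmoid classification output and the ReLU activations are assumptions.

# Sketch of the object prediction head of step (3): resample each suggestion
# frame's local features to 7x7x256 with bilinear interpolation, then run two
# small fully connected networks, one for the building score and one for the
# corner adjustments.
import torch
import torch.nn as nn
from torchvision.ops import roi_align

class ObjectPredictionHead(nn.Module):
    def __init__(self, channels=256, pool=7, hidden=1024):
        super().__init__()
        in_features = channels * pool * pool
        self.cls = nn.Sequential(nn.Linear(in_features, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1), nn.Sigmoid())  # building score
        self.box = nn.Sequential(nn.Linear(in_features, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 4))                # corner adjustments

    def forward(self, feature_map, boxes, stride):
        # boxes: (N, 4) suggestion frames in image coordinates for one image
        rois = roi_align(feature_map, [boxes], output_size=7,
                         spatial_scale=1.0 / stride, aligned=True)
        rois = rois.flatten(start_dim=1)          # (N, 256*7*7)
        return self.cls(rois).squeeze(1), self.box(rois)

if __name__ == "__main__":
    head = ObjectPredictionHead()
    p2 = torch.zeros(1, 256, 256, 256)            # stride-4 pyramid level
    proposals = torch.tensor([[100.0, 120.0, 220.0, 260.0]])
    scores, deltas = head(p2, proposals, stride=4)
    print(scores.shape, deltas.shape)             # (1,) and (1, 4)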
(4) Finally, the semantic segmentation module is entered; its purpose is to obtain the mask of the building inside each prediction frame (the module structure is shown in fig. 4). The prediction frames obtained by the object prediction module in step (3) are used to intercept the corresponding regions of the trunk feature map as the local features of each prediction frame. Bilinear interpolation unifies these local features to dimension 14×14×256; deconvolution then doubles the length and width to 28×28 and convolution reduces the number of channels to 1, so the output image of the semantic segmentation convolution network has size 28×28×1, whose last-dimension values are only 0 and 1, a value of 1 indicating that the pixel belongs to the building. Finally the output is interpolated back to the size of the local features of the prediction frame as the mask result of the building. This completes the whole flow of the algorithm: all prediction frames constitute the building object detection result, and the semantic segmentation result of each prediction frame is the building semantic segmentation result.
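The semantic segmentation head of step (4) can be sketched as follows (PyTorch, illustrative only). The 14×14×256 input, the two convolutions, the single stride-2 deconvolution to 28×28 and the one-channel output follow the description above; the kernel sizes, the ReLU activations, the sigmoid with a 0.5 binarisation threshold and the use of roi_align for the bilinear resampling are assumptions.

# Sketch of the semantic segmentation head of step (4): crop the prediction
# frame's features, unify them to 14x14x256, apply two convolutions and one
# deconvolution, reduce to one channel, and rescale the 28x28 output to the
# size of the prediction frame to obtain the building mask.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.ops import roi_align

class MaskHead(nn.Module):
    def __init__(self, channels=256):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.deconv = nn.ConvTranspose2d(channels, channels, 2, stride=2)  # 14 -> 28
        self.out = nn.Conv2d(channels, 1, 1)                               # one channel

    def forward(self, feature_map, boxes, stride):
        x = roi_align(feature_map, [boxes], output_size=14,
                      spatial_scale=1.0 / stride, aligned=True)  # (N,256,14,14)
        x = torch.relu(self.conv1(x))
        x = torch.relu(self.conv2(x))
        x = torch.relu(self.deconv(x))                           # (N,256,28,28)
        return torch.sigmoid(self.out(x))                        # (N,1,28,28)

if __name__ == "__main__":
    head = MaskHead()
    p2 = torch.zeros(1, 256, 256, 256)
    frame = torch.tensor([[100.0, 120.0, 220.0, 260.0]])         # x1, y1, x2, y2
    mask28 = head(p2, frame, stride=4)
    # rescale to the prediction-frame size (here 140 x 120) and binarise
    mask = F.interpolate(mask28, size=(140, 120), mode="bilinear",
                         align_corners=False) > 0.5
    print(mask28.shape, mask.shape)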
Example 2
The embodiment provides a processing system of urban rail transit remote sensing image data, which comprises a remote sensing image feature extraction module, a region recommendation module and an object prediction semantic segmentation module,
the remote sensing image feature extraction module is used for extracting trunk features of the urban rail transit remote sensing image and constructing a feature pyramid;
the region recommendation module is used for extracting the features of the feature pyramid with shared convolution to generate suggestion frames;
the object prediction semantic segmentation module is used for generating a prediction frame after convolving the local features intercepted by the suggestion frame, and for intercepting the local features of the prediction frame from the shared feature map according to the prediction frame to generate an object mask.
As shown in fig. 2, the remote sensing image feature extraction module comprises a trunk feature extraction unit, a feature pyramid construction unit and a feature pyramid unit, wherein,
the trunk feature extraction unit is used for extracting features of different levels of the remote sensing image with a multi-layer residual convolutional neural network, generating four levels of feature maps, each with 256 channels;
the feature pyramid construction unit is used for starting convolution and up-sampling from the feature map of the lowest dimension and superposing it on the feature map one level higher in dimension, constructing five layers of feature maps that form the shared feature pyramid unit.
As shown in fig. 3, the region recommendation module comprises a region recommendation convolutional network unit, an object prediction unit, a frame adjustment unit and a suggestion frame generation unit, wherein,
the region recommendation convolutional network unit is used for extracting the features of the feature pyramid with one layer of shared convolution;
the object prediction unit is used for predicting, with two dedicated convolutions, whether the sliding window at each feature point contains an object and the adjustment parameters relative to the sliding window, and for filtering the predicted feature points by a threshold on whether their sliding windows contain an object; a sliding window that passes the threshold is a preselected frame;
the frame adjustment unit adjusts the preselected frame window with the frame adjustment parameters, so that the suggestion frame generation unit generates the suggestion frames.
As shown in fig. 4, the object prediction semantic segmentation module comprises a feature interception unit, an object classification prediction unit, a frame adjustment prediction unit, a mask feature extraction unit and a mask prediction unit, wherein,
the feature interception unit is used for interpolating the local features intercepted by the suggestion frame into 7×7 feature maps through bilinear interpolation;
the object classification prediction unit is used for predicting, after two layers of convolution, the classification result and the frame adjustment parameters with two separate fully connected networks;
the frame adjustment prediction unit is used for adjusting, with the frame adjustment parameters, the frame of an object whose classification result is higher than the threshold into a prediction frame;
the mask feature extraction unit is used for intercepting the local features of the prediction frame from the shared feature map;
the mask prediction unit is used for unifying the feature map size to 14×14 with bilinear interpolation and, after two layers of convolution and one layer of deconvolution, predicting the object mask by interpolation.
The foregoing describes only preferred embodiments of the invention and is not intended to limit the invention; any modification, equivalent replacement or improvement made within the spirit and scope of the invention is intended to be included within the protection scope of the invention.

Claims (2)

1. A processing system of urban rail transit remote sensing image data, comprising a remote sensing image feature extraction module, a region recommendation module and an object prediction semantic segmentation module, characterized in that:
the remote sensing image feature extraction module is used for extracting trunk features of the urban rail transit remote sensing image and constructing a feature pyramid;
the region recommendation module is used for extracting the features of the feature pyramid with shared convolution to generate suggestion frames;
the object prediction semantic segmentation module is used for generating a prediction frame after convolving the local features intercepted by the suggestion frame, and for intercepting the local features of the prediction frame from the shared feature map according to the prediction frame to generate an object mask;
the remote sensing image feature extraction module comprises a trunk feature extraction unit, a feature pyramid construction unit and a feature pyramid unit, wherein,
the trunk feature extraction unit is used for extracting features of different levels of the remote sensing image with a multi-layer residual convolutional neural network, generating four levels of feature maps, each with 256 channels;
the feature pyramid construction unit is used for starting convolution and up-sampling from the feature map of the lowest dimension and superposing it on the feature map one level higher in dimension, constructing five layers of feature maps that form the shared feature pyramid unit;
the region recommendation module comprises a region recommendation convolutional network unit, an object prediction unit, a frame adjustment unit and a suggestion frame generation unit, wherein,
the region recommendation convolutional network unit is used for extracting the features of the feature pyramid with one layer of shared convolution;
the object prediction unit is used for predicting, with two dedicated convolutions, whether the sliding window at each feature point contains an object and the adjustment parameters relative to the sliding window, and for filtering the predicted feature points by a threshold on whether their sliding windows contain an object; a sliding window that passes the threshold is a preselected frame;
the frame adjustment unit adjusts the preselected frame window with the frame adjustment parameters, so that the suggestion frame generation unit generates the suggestion frames;
the object prediction semantic segmentation module comprises a feature interception unit, an object classification prediction unit, a frame adjustment prediction unit, a mask feature extraction unit and a mask prediction unit, wherein,
the feature interception unit is used for interpolating the local features intercepted by the suggestion frame into 7×7 feature maps through bilinear interpolation;
the object classification prediction unit is used for predicting, after two layers of convolution, the classification result and the frame adjustment parameters with two separate fully connected networks;
the frame adjustment prediction unit is used for adjusting, with the frame adjustment parameters, the frame of an object whose classification result is higher than the threshold into a prediction frame;
the mask feature extraction unit is used for intercepting the local features of the prediction frame from the shared feature map;
the mask prediction unit is used for unifying the feature map size to 14×14 with bilinear interpolation and, after two layers of convolution and one layer of deconvolution, predicting the object mask by interpolation.
2. A processing method of urban rail transit remote sensing image data, comprising the following steps:
A. inputting a high-resolution remote sensing image, extracting trunk features with a feature interception network, and constructing a feature pyramid;
B. inputting the trunk features into the region recommendation network to predict suggestion frames that may contain buildings;
C. intercepting the trunk features according to the suggestion frame, inputting them into a fully connected network to predict the object type, adjusting the suggestion frame with the synchronously predicted frame adjustment parameters to obtain a prediction frame, intercepting a local feature map from the trunk features according to the prediction frame, and inputting it into the convolution network of the semantic segmentation module to predict the building mask;
D. marking the obtained object prediction frames and their building masks on the image;
the step A specifically comprises the following steps:
A1, first extracting features of different levels of the remote sensing image with a multi-layer residual convolutional neural network, generating four levels of feature maps with 256 channels;
A2, starting convolution and up-sampling from the feature map of the lowest dimension and superposing it on the feature map one level higher in dimension, constructing a five-layer feature map as the shared feature pyramid;
the step B specifically comprises the following steps:
extracting the feature points of the feature pyramid with one layer of shared convolution; predicting, with two dedicated convolutions, whether the sliding window at each feature point contains an object and the adjustment parameters relative to the sliding window; filtering the predicted feature points by a threshold on whether their sliding windows contain an object, a sliding window that passes the threshold being a preselected frame; and adjusting the preselected frame window with the frame adjustment parameters to obtain the suggestion frame;
the step C specifically comprises the following steps:
C1, interpolating the local features intercepted by the suggestion frame into a 7×7 feature map with bilinear interpolation, and, after two layers of convolution, predicting the classification result and the frame adjustment parameters with two separate fully connected networks;
C2, adjusting the frame of an object whose classification result is higher than the threshold into a prediction frame with the frame adjustment parameters;
C3, intercepting the local features of the prediction frame from the shared feature map, unifying the feature map size to 14×14 with bilinear interpolation, and, after two layers of convolution and one layer of deconvolution, predicting the object mask by interpolation.
CN202110831395.XA 2021-07-22 2021-07-22 Urban rail transit remote sensing image data processing system and method Active CN113361662B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110831395.XA CN113361662B (en) 2021-07-22 2021-07-22 Urban rail transit remote sensing image data processing system and method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110831395.XA CN113361662B (en) 2021-07-22 2021-07-22 Urban rail transit remote sensing image data processing system and method

Publications (2)

Publication Number Publication Date
CN113361662A CN113361662A (en) 2021-09-07
CN113361662B true CN113361662B (en) 2023-08-29

Family

ID=77540092

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110831395.XA Active CN113361662B (en) 2021-07-22 2021-07-22 Urban rail transit remote sensing image data processing system and method

Country Status (1)

Country Link
CN (1) CN113361662B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114118124B (en) * 2021-09-29 2023-09-12 北京百度网讯科技有限公司 Image detection method and device


Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110298298B (en) * 2019-06-26 2022-03-08 北京市商汤科技开发有限公司 Target detection and target detection network training method, device and equipment

Patent Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106709465A (en) * 2016-12-29 2017-05-24 武汉大学 Polarization SAR image road extraction method based on conditional random field
WO2018214195A1 (en) * 2017-05-25 2018-11-29 中国矿业大学 Remote sensing imaging bridge detection method based on convolutional neural network
US10671878B1 (en) * 2019-01-11 2020-06-02 Capital One Services, Llc Systems and methods for text localization and recognition in an image of a document
WO2020215236A1 (en) * 2019-04-24 2020-10-29 哈尔滨工业大学(深圳) Image semantic segmentation method and system
CN110263705A (en) * 2019-06-19 2019-09-20 上海交通大学 Towards two phase of remote sensing technology field high-resolution remote sensing image change detecting method
CN110570353A (en) * 2019-08-27 2019-12-13 天津大学 Dense connection generation countermeasure network single image super-resolution reconstruction method
CN110675408A (en) * 2019-09-19 2020-01-10 成都数之联科技有限公司 High-resolution image building extraction method and system based on deep learning
CN110909642A (en) * 2019-11-13 2020-03-24 南京理工大学 Remote sensing image target detection method based on multi-scale semantic feature fusion
CN111462124A (en) * 2020-03-31 2020-07-28 武汉卓目科技有限公司 Remote sensing satellite cloud detection method based on Deep L abV3+
CN111553303A (en) * 2020-05-07 2020-08-18 武汉大势智慧科技有限公司 Remote sensing ortho image dense building extraction method based on convolutional neural network
CN112101189A (en) * 2020-09-11 2020-12-18 北京航空航天大学 SAR image target detection method and test platform based on attention mechanism
CN112183432A (en) * 2020-10-12 2021-01-05 中国科学院空天信息创新研究院 Building area extraction method and system based on medium-resolution SAR image

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Li Xiang, Research on Building Extraction Technology from High-Resolution Remote Sensing Images Based on Deep Learning, China Doctoral Dissertations Full-text Database, Basic Sciences, No. 6, pp. A008-33 *

Also Published As

Publication number Publication date
CN113361662A (en) 2021-09-07

Similar Documents

Publication Publication Date Title
CN108985181B (en) End-to-end face labeling method based on detection segmentation
CN111428586B (en) Three-dimensional human body posture estimation method based on feature fusion and sample enhancement
CN109902600B (en) Road area detection method
CN112733919B (en) Image semantic segmentation method and system based on void convolution and multi-scale and multi-branch
CN110084299B (en) Target detection method and device based on multi-head fusion attention
CN112364855A (en) Video target detection method and system based on multi-scale feature fusion
CN114821665A (en) Urban pedestrian flow small target detection method based on convolutional neural network
CN113591617B (en) Deep learning-based water surface small target detection and classification method
CN110852199A (en) Foreground extraction method based on double-frame coding and decoding model
CN114820579A (en) Semantic segmentation based image composite defect detection method and system
CN113361662B (en) Urban rail transit remote sensing image data processing system and method
CN112084859A (en) Building segmentation method based on dense boundary block and attention mechanism
CN111353544A (en) Improved Mixed Pooling-Yolov 3-based target detection method
CN114359245A (en) Method for detecting surface defects of products in industrial scene
Li et al. Fusing taxi trajectories and RS images to build road map via DCNN
CN110956119A (en) Accurate and rapid target detection method in image
CN115861281A (en) Anchor-frame-free surface defect detection method based on multi-scale features
CN112508099A (en) Method and device for detecting target in real time
CN115457043A (en) Image segmentation network based on overlapped self-attention deformer framework U-shaped network
CN115100652A (en) Electronic map automatic generation method based on high-resolution remote sensing image
CN113408550B (en) Intelligent weighing management system based on image processing
CN116596966A (en) Segmentation and tracking method based on attention and feature fusion
CN116912485A (en) Scene semantic segmentation method based on feature fusion of thermal image and visible light image
CN113920479A (en) Target detection network construction method, target detection device and electronic equipment
CN115830592A (en) Overlapping cervical cell segmentation method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant