CN115272882A - Discrete building detection method and system based on remote sensing image - Google Patents


Info

Publication number: CN115272882A
Application number: CN202210925795.1A
Authority: CN (China)
Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis)
Prior art keywords: building, scene, remote sensing, target, sensing image
Other languages: Chinese (zh)
Inventors: 董传胜, 李国华, 解加粉, 丁仕军, 刘强, 陈建忠, 孙如瑶
Original and current assignee: Shandong Provincial Institute of Land Surveying and Mapping (the listed assignees may be inaccurate; Google has not performed a legal analysis)
Application filed by Shandong Provincial Institute of Land Surveying and Mapping
Priority to CN202210925795.1A
Publication of CN115272882A

Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06N 3/08 — Computing arrangements based on biological models; neural networks; learning methods
    • G06V 10/25 — Image preprocessing; determination of region of interest [ROI] or a volume of interest [VOI]
    • G06V 10/26 — Image preprocessing; segmentation of patterns in the image field; cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; detection of occlusion
    • G06V 10/764 — Recognition or understanding using pattern recognition or machine learning; classification, e.g. of video objects
    • G06V 10/774 — Processing image or video features in feature spaces; generating sets of training patterns; bootstrap methods, e.g. bagging or boosting
    • G06V 10/82 — Recognition or understanding using pattern recognition or machine learning; using neural networks
    • G06V 20/13 — Scenes; terrestrial scenes; satellite images
    • G06V 20/176 — Scenes; terrestrial scenes; urban or other man-made structures

Abstract

The invention relates to a discrete building detection method and system based on remote sensing images, comprising the following steps: acquiring a remote sensing image; obtaining features of a target building in the remote sensing image with a first model, classifying the image blocks corresponding to those features to obtain instance category scores, and aggregating the instance category scores into scene category scores, thereby obtaining the area where the scene of the target building is located, i.e. an initial candidate frame containing the building scene; and, with a second model, segmenting the target building instances inside the candidate frame containing building-scene information to obtain the mask, category, and bounding-box information of the target building. Because the building scene is first located in the remote sensing image and the mask, category, and bounding box are then extracted only from candidate frames carrying scene information, target recognition is performed on top of false-alarm elimination; this effectively alleviates the problem of numerous false-alarm targets during detection and improves accuracy.

Description

Discrete building detection method and system based on remote sensing image
Technical Field
The invention relates to the technical field of building detection, in particular to a remote sensing image-based discrete building detection method and system.
Background
The statements in this section merely provide background information related to the present disclosure and may not necessarily constitute prior art.
From the mask, category, and bounding box of a discrete building (structure), one can judge whether it is an illegal construction, curb illegal occupation of cultivated land for housing, and protect the cultivated-land red line. Because discrete buildings vary in size and shape, conventional extraction is usually done interactively by human operators, which is labor-intensive, prone to omissions and errors, inefficient, and of low quality.
With increasingly abundant high-resolution remote sensing imagery, finer and more detailed information can be obtained, and such imagery has become a primary data source. However, occlusion caused by the satellite viewing angle, the problems of "different objects with the same spectrum" and "the same object with different spectra", and small building targets make it difficult to extract and detect ground-feature information quickly and accurately.
Disclosure of Invention
To solve the technical problems in the background art, the invention provides a discrete building detection method and system based on remote sensing images. The scene in which a building (structure) sits is first located in the remote sensing image; the mask, category, and bounding box of the building (structure) are then extracted from candidate frames carrying scene information. The target building (structure) is thus identified after false-alarm targets have been eliminated, which effectively alleviates the problem of numerous false alarms during detection and improves accuracy.
In order to achieve the purpose, the invention adopts the following technical scheme:
the invention provides a discrete building detection method based on remote sensing images, which comprises the following steps:
acquiring a remote sensing image;
obtaining features of the target building in the remote sensing image with a first model, classifying the image blocks corresponding to those features to obtain instance category scores, and aggregating the instance category scores into scene category scores, thereby obtaining the area where the scene of the target building is located, i.e. an initial candidate frame containing the building scene;
and, with a second model, segmenting the target building instances inside the candidate frame containing building-scene information to obtain the mask, category, and bounding-box information of the target building.
The first model comprises an instance classifier and an aggregation function. The instance classifier classifies instances using the extracted features; the aggregation function receives the instance labels and integrates the instance-level predicted category labels into a scene-level prediction, i.e. the final category label.
The instance classifier uses at least one of the convolutional network models Alex-Net, VGG-Net, or ResNet-50/101.
The example classifier has three maximum pooling layers and three fully connected layers.
The aggregation function aggregates the instance category scores into a scene category score; during aggregation, attention is focused on the parts of the image that are key to classification according to the magnitude of the instance feature weights, and the scene category score is a weighted average of the instance category scores.
The importance of each instance is evaluated by its attention weight W_{i,j}, computed as:

W_{i,j} = exp(tanh(a·h_{i,j} + b))

where a and b are parameters of the attention network mechanism and h_{i,j} is the output feature of the instance.
The second model extracts a feature map of the building from the candidate frame containing scene information, sets a fixed number of regions of interest at each pixel position of the feature map, and feeds them into a region proposal network for binary classification and coordinate regression to obtain refined regions of interest; these are mapped onto the last convolutional feature map of the feature-extraction network to obtain fixed-size feature maps, and the regions of interest are then classified.
A second aspect of the present invention provides a system for implementing the above method, comprising:
a data acquisition module configured to: acquiring a remote sensing image;
a scene classification module configured to: obtain features of the target building in the remote sensing image with a first model, classify the image blocks corresponding to those features to obtain instance category scores, and aggregate the instance category scores into scene category scores, thereby obtaining the area where the scene of the target building is located, i.e. an initial candidate frame containing the building scene;
an extraction detection module configured to: with a second model, segment the target building instances inside the candidate frame containing building-scene information to obtain the mask, category, and bounding-box information of the target building.
A third aspect of the invention provides a computer-readable storage medium.
A computer-readable storage medium, on which a computer program is stored which, when executed by a processor, implements the steps of the remote-sensing-image-based discrete building detection method described above.
A fourth aspect of the invention provides a computer apparatus.
A computer device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, implements the steps of the remote-sensing-image-based discrete building detection method described above.
Compared with the prior art, the above technical schemes have the following beneficial effects:
1. The building scene is first located in the remote sensing image, and the mask, category, and bounding box of the building are extracted only from candidate frames carrying scene information, so the target building is identified after false-alarm targets have been eliminated; this effectively alleviates the problem of numerous false alarms during detection and improves accuracy.
2. Although the detection algorithm runs in two stages, scene classification followed by instance segmentation, and therefore adds a scene-classification stage compared with conventional target detection, instance segmentation is performed directly inside the candidate boxes produced by scene classification. This avoids the work of detecting and extracting over large target-free areas, so detection is more efficient than the conventional approach.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, are included to provide a further understanding of the invention, and are included to illustrate an exemplary embodiment of the invention and not to limit the invention.
FIG. 1 is a schematic flow diagram of a discrete building detection method provided by one or more embodiments of the invention;
FIG. 2 is an example level method flow diagram in a discrete building detection method provided by one or more embodiments of the invention;
FIG. 3 is a schematic diagram of a multi-example convolutional network classification framework structure in a discrete building detection method according to one or more embodiments of the present invention;
FIG. 4 is a schematic diagram of an Alex-Net network structure in a discrete building detection method according to one or more embodiments of the present invention;
FIGS. 5 (a)-(d) are partial sample illustrations of the sample library in a discrete building detection method provided by one or more embodiments of the invention;
FIGS. 6 (a)-(d) are original images, semantic maps, and attention example maps of samples in a discrete building detection method provided by one or more embodiments of the present invention;
FIG. 7 is a sample of sparse building detection in alpine regions in a discrete building detection method according to one or more embodiments of the present invention;
FIG. 8 is a diagram illustrating the classification of Mask R-CNN network structures in a discrete building detection method according to one or more embodiments of the present invention;
FIG. 9 is a partial sample of a discrete building detection method in a data set provided by one or more embodiments of the invention;
fig. 10 (a) - (b) are graphs of test results of a discrete building detection method provided by one or more embodiments of the present invention in an accuracy evaluation process.
Detailed Description
The invention is further described with reference to the following figures and examples.
It is to be understood that the following detailed description is exemplary and is intended to provide further explanation of the invention as claimed. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of exemplary embodiments according to the invention. As used herein, the singular forms "a", "an", and "the" are intended to include the plural forms as well, and it should be understood that when the terms "comprises" and/or "comprising" are used in this specification, they specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof, unless the context clearly indicates otherwise.
As described in the background art, occlusion caused by the satellite viewing angle, the problems of "different objects with the same spectrum" and "the same object with different spectra", and small building targets make it difficult to extract and detect ground-feature information quickly and accurately. The prior art therefore struggles to extract building information efficiently and accurately from high-resolution remote sensing images.
Therefore, the following embodiments provide a discrete building detection method and system based on remote sensing images. First, the attention mechanism of a multi-instance convolutional neural network (MI-CNN) is improved, and the improved MI-CNN is used to recognize "discrete" building scenes: a building (structure) scene is defined as a combination of several instance semantics such as target, vegetation, and mountain land; scene-level category supervision is applied at the instance level by the MI-CNN to judge the semantics of local scenes; and the instances are fused by a multi-instance aggregation function. This realizes coarse recognition of discrete building (structure) scenes in remote sensing images and yields initial candidate frames for the discrete buildings (structures). Finally, a Mask R-CNN instance segmentation model performs accurate target localization and instance segmentation of buildings (structures) inside the initially selected candidate frames, refining the coarse result into building outlines and completing the detection and extraction of discrete buildings (structures).
The first embodiment is as follows:
as shown in fig. 1 to 10, a method for detecting a discrete building based on a remote sensing image includes the following steps:
acquiring a remote sensing image;
obtaining the characteristics of a target building in a remote sensing image based on a first model, classifying the image blocks corresponding to the characteristics to obtain example category scores, and aggregating the example category scores to form scene category scores to obtain the area where the scene of the target building is located, namely an initial candidate frame containing the scene of the building;
and based on the second model, segmenting the target building instance in the candidate frame containing the building target scene information to obtain the mask, the category and the positioning frame information of the target building.
Specifically, the method comprises the following steps:
step one, inputting the image to be detected into the constructed MI + Mask R-CNN neural network, which is divided into an MI-CNN scene-recognition stage and a Mask R-CNN instance-segmentation stage;
step two, invoking the trained MI-CNN model to recognize discrete building (structure) scenes, using a sliding-window detection method. If MI-CNN detects a discrete building (structure) scene, the trained Mask R-CNN model then performs detection and extraction of the discrete building within the range of the candidate frame, yielding the category, mask file, and bounding box of the target ground feature; otherwise execution for that window ends, and the next sliding window is examined in turn.
In this embodiment, a multi-instance convolutional neural network (MI-CNN) scene-classification algorithm and a Mask R-CNN instance-segmentation algorithm are improved to better extract discrete building (structure) targets, as follows:
(1) A discrete-building scene-recognition and instance-segmentation data set of high-resolution multi-source remote sensing images (Gaofen-2 satellite imagery and unmanned aerial vehicle imagery) is constructed, providing a reliable sample source for building (structure) extraction from high-resolution remote sensing images.
(2) MI-CNN scene classification: in view of the scattered distribution and varied forms of discrete buildings (structures), the attention-learning mechanism of the MI-CNN model is improved, and the improved MI-CNN performs scene classification of buildings (structures). This highlights the importance of key instances in scene classification, weakens the influence of irrelevant instances, improves the scene-classification result, and finally yields the initial candidate frames of the discrete buildings (structures).
(3) Mask R-CNN instance segmentation: to address the high false-alarm rate, the Mask R-CNN algorithm performs target localization and instance segmentation of building (structure) targets within the initial candidate boxes, obtaining building outlines and improving recognition accuracy.
Using the MI + Mask R-CNN model to extract and detect discrete buildings (structures) on remote sensing images can effectively improve recognition accuracy and efficiency, and accurately acquire the mask, category, and position information of the discrete buildings (structures).
The basic idea of multi-instance convolutional neural network scene classification: multi-instance learning (MIL) was first used to predict drug activity and has gradually been applied to image retrieval, face detection, image classification, and other fields. In the classification process, each instance corresponds to a category label. MIL maps instances to class labels through an instance classifier, as in formula (1-1).
Y_i = h(x_i)    (1-1)
The basic assumption of multi-instance learning is that each bag (sample) consists of multiple instances; a bag is "negative" if it contains no positive instance, and "positive" otherwise, as in formula (1-2):

Y_i = 0 if every instance label y_{i,j} = 0; Y_i = 1 otherwise    (1-2)
The bag's label is produced by a multi-instance aggregation function, which associates the instance-level labels with the bag-level label to obtain the label of the bag. Since the true instance-level labels are unknown, the multi-instance convolutional neural network (MI-CNN) adopts an instance-level method for building scene classification.
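The standard MIL assumption of formulas (1-1)/(1-2) — a bag is positive if and only if at least one of its instances is positive — reduces to a one-line aggregation. The threshold classifier below is a toy stand-in for the CNN instance classifier, introduced only for illustration:

```python
def bag_label(instances, instance_classifier):
    """Formula (1-2): the bag is positive iff any instance is positive."""
    return int(any(instance_classifier(x) for x in instances))

# Toy instance classifier h(x) of formula (1-1): a simple threshold.
h = lambda x: x > 0.5

positive_bag = [0.1, 0.2, 0.9]   # one positive instance -> positive bag
negative_bag = [0.1, 0.2, 0.3]   # no positive instance  -> negative bag
```

In the patent's setting a "bag" is a scene image and the "instances" are its image blocks, so scene-level supervision can train the instance-level classifier.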
In the instance-level method, instance category labels are obtained by an instance classifier h, and instance categories are aggregated into bag categories by the MIL aggregation function. The instance labels can be obtained by jointly optimizing the aggregation function and bag classification; the flow is shown in FIG. 2.
as shown in fig. 3, MI-CNN can be considered as a scene classification framework. The MI-CNN is used for extracting and classifying the characteristics of local image blocks in a scene, and then aggregating example categories into scene categories through a collection function to obtain a predicted classification result.
The MI-CNN mainly comprises two parts of an example classifier and an MIL sink function. The example classifier discriminates and classifies examples through convolution features extracted by the CNN; the MIL aggregation function is responsible for receiving example labels and integrating the example predicted category labels into a prediction result of a scene, namely a final category label.
Various CNN models can be used in the MI-CNN example classifier to extract example features, and the MI-CNN classification framework can use convolution network models (such as Alex-Net, VGG-Net and Re sNet 50/101) to perform feature extraction, so that the MI-CNN example classifier has good expandability.
The multi-instance convolutional neural network adopted in this embodiment is the MI-AlexNet model, i.e. the instance classifier in the framework is AlexNet. The AlexNet structure comprises five convolutional layers (with three max-pooling layers) and three fully connected layers; MI-AlexNet discards the fully connected layers and one max-pooling layer, using the convolutional layers alone as the backbone network. FIG. 4 shows the Alex-Net network structure, and Table 1 lists the MI-AlexNet network-structure parameters.
TABLE 1 MI-AlexNet network architecture parameters

Network layer   Convolution kernel size   Stride   Output dimension
conv1           11×11                     4        96
pool1           3×3                       2        -
conv2           5×5                       1        256
pool2           3×3                       2        -
conv3           3×3                       1        384
conv4           3×3                       1        384
conv5           3×3                       1        256
conv6           1×1                       1        256
conv7           1×1                       1        256
conv8           1×1                       1        C
Instance features are extracted at conv5; each feature corresponds to an image block in the scene — in FIG. 3, the instance feature x_{i,j} corresponds to the image block in the white frame of the original image. Three 1×1 convolutional layers (conv6, conv7, conv8) then classify the image blocks, finally yielding an H×W×C instance class-score map (C is the number of classes).
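The 1×1 convolution stage amounts to an independent linear classifier applied at every spatial position of the feature map. A minimal NumPy sketch (the weights are random and the dimensions illustrative, not the trained conv6–conv8 parameters):

```python
import numpy as np

def conv1x1(features, weights, bias):
    """A 1x1 convolution: the same linear map applied at each position (i, j).
    features: (H, W, D) instance features; weights: (D, C); bias: (C,)."""
    return features @ weights + bias  # -> (H, W, C) class-score map

rng = np.random.default_rng(0)
H, W, D, C = 6, 6, 256, 5            # 5 classes, matching the data set below
feat = rng.normal(size=(H, W, D))    # stand-in for the conv5 output
score_map = conv1x1(feat, rng.normal(size=(D, C)), np.zeros(C))
instance_classes = score_map.argmax(axis=-1)   # per-block predicted class
```

Each position of `score_map` is one instance's category scores; the aggregation function described next turns this H×W×C map into a single scene-level score.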
The MI aggregation function aggregates the instance category scores into a scene category score. The aggregation step uses an attention mechanism that mimics human recognition to fuse information across image blocks, focusing attention on the parts of the image that are key to classification according to the instance feature weights. The scene category score is a weighted average of the instance category scores:

S_c = Σ_{i,j} W_{i,j} · s_{i,j,c}

where s_{i,j,c} is the score of instance (i,j) for class c, and W_{i,j} is the instance's attention weight, computed from the conv7 output feature h_{i,j} of the instance-classification network by a single-layer attention network:
multi-instance convolutional neural network scene classification, i.e., using W i,j The importance of the examples is evaluated by the attention weight value in the embodiment, and the attention network is improved in the embodiment, so that the importance evaluation of different examples is more consistent with the situation of the discrete building scene, and the most relevant examples in the discrete building scene category are highlighted.
The original attention weight calculation mode is as follows: a is a i =exp(sigmoid(Wh i +b)),
The present embodiment proposes to improve the way of attention weight calculation as:
Figure BDA0003779504210000111
a, b are attention network machinesThe importance of the example block in the classification of the scene category is adjusted through the adaptive learning of the mechanism, namely the determination of the attention weight value through the tanh function and the h i,j The function is adjusted to more highlight the attention weight value of the key example in the scene classification, so that the influence of the irrelevant example in the scene classification is weakened, and the accuracy of the scene classification result is improved.
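The original weighting exp(sigmoid(·)) and the improved exp(tanh(·)) variant can be compared side by side. In this sketch a and b are random stand-ins for the learned attention parameters, and the softmax-style normalization is an assumption made so the result is a proper weighted average, as the aggregation formula requires:

```python
import numpy as np

def attention_weights(h, a, b, squash):
    """Per-instance attention weights, normalized so they sum to 1."""
    scores = np.exp(squash(h @ a + b))
    return scores / scores.sum()

rng = np.random.default_rng(1)
h = rng.normal(size=(9, 16))          # 9 instance features (e.g. a 3x3 block grid)
a, b = rng.normal(size=16), 0.0

w_sigmoid = attention_weights(h, a, b, lambda z: 1 / (1 + np.exp(-z)))  # original
w_tanh = attention_weights(h, a, b, np.tanh)                            # improved
```

Because tanh ranges over (-1, 1) rather than sigmoid's (0, 1), the pre-exponential scores can spread further apart, which is one way to read the claim that key instances are emphasized and irrelevant ones suppressed.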
The importance of local semantic features quantitatively expresses the contribution of local semantics to the global scene semantics. Evaluating this importance directly influences the scene-classification result; it is the most critical part of the scene-classification process and the core of the present improvement.
Building scene extraction route based on a multi-instance convolutional neural network
(1) Constructing a remote sensing image building sample set. The building data set produced in this embodiment follows the UCM data set and contains 5 ground-feature classes (buildings, airplanes, windmills, greenhouses, and parking lots), as follows:
Images are cropped into data-set samples of 256×256 pixels in jpg format. The data set is labeled by placing images of different ground-feature types into different folders named directly after the feature type. In code, the label of each image is determined from the folder order, i.e. converted directly into 0, 1, 2, 3, …; 5 ground-feature classes are established, and the sample library contains 1000 images. Some samples, "airplane", "building", "windmill", and "parking lot", are shown in FIGS. 5 (a)-(d).
(2) Data enhancement. This embodiment counters the overfitting phenomenon with data augmentation, i.e. artificially increasing the amount and diversity of training samples by random cropping, up-down and left-right flipping, random adjustment, added noise, and similar operations on the image data.
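The augmentation operations listed are each a small array operation. A NumPy sketch under the 256×256 sample size stated above; the crop size and noise level are illustrative assumptions:

```python
import numpy as np

def augment(img, rng, crop=224):
    """One random augmentation pass: random crop, flips, additive Gaussian noise."""
    h, w = img.shape[:2]
    y = rng.integers(0, h - crop + 1)
    x = rng.integers(0, w - crop + 1)
    out = img[y:y + crop, x:x + crop]
    if rng.random() < 0.5:
        out = out[:, ::-1]            # left-right flip
    if rng.random() < 0.5:
        out = out[::-1, :]            # up-down flip
    return out + rng.normal(0.0, 0.01, out.shape)   # mild noise

rng = np.random.default_rng(2)
sample = rng.random((256, 256))       # stand-in for a 256x256 sample image
aug = augment(sample, rng)
```

Applying such transforms on the fly effectively multiplies the training set without storing extra images.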
(3) Model building. This embodiment selects the MI-AlexNet network as the scene-classification model: it contains 8 convolutional layers, of which the first 7 perform feature extraction; the last layer produces instance category scores, which a multi-instance aggregation function turns into the category of the input image.
(4) Model training. The data set is divided into an 80% training set and a 20% test set, and the improved MI-CNN model is trained on the training set with an initial learning rate of 0.0001 and a batch size of 32; the learning rate is multiplied by 0.1 every 30 epochs, and training terminates after 100 iterations.
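The stated schedule — initial learning rate 1e-4, multiplied by 0.1 every 30 epochs, stopping at 100 — is a standard step decay. This sketch merely restates those hyperparameters:

```python
def step_lr(epoch, base_lr=1e-4, gamma=0.1, step=30):
    """Learning rate for a given epoch under the step-decay schedule."""
    return base_lr * gamma ** (epoch // step)

# Training terminates after 100 iterations (epochs 0..99).
schedule = [step_lr(e) for e in range(100)]
```

The same schedule is what a StepLR-style scheduler in a deep-learning framework would produce with step size 30 and gamma 0.1.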
(5) Accuracy evaluation. Overall accuracy is the ratio of correctly predicted images to the total number of images on the test set, and represents the overall classification accuracy well. To ensure reliable experimental results and reduce randomness, this embodiment runs 5 repeated trials, each with a randomly constructed training and test set, and finally reports the mean and standard deviation of the results.
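Overall accuracy and the mean/standard deviation over the 5 repeated trials can be computed as below. The five trial accuracies here are made-up placeholder values, not the patent's reported results:

```python
import numpy as np

def overall_accuracy(pred, truth):
    """Fraction of test images whose predicted class matches the true label."""
    pred, truth = np.asarray(pred), np.asarray(truth)
    return float((pred == truth).mean())

acc = overall_accuracy([0, 1, 2, 2, 4], [0, 1, 2, 3, 4])   # 4 of 5 correct

trials = np.array([0.91, 0.93, 0.92, 0.90, 0.94])  # placeholder 5-trial accuracies
mean, std = trials.mean(), trials.std(ddof=1)       # sample standard deviation
```

A smaller standard deviation across the repeated trials indicates a more stable model, which is exactly the comparison made in Table 2.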
First, the baseline multi-instance convolutional neural network is run on the building data set and its extraction performance is compared with the improved MI-AlexNet model, beginning with overall accuracy and standard deviation; the results are shown in Table 2.
TABLE 2 MI-AlexNet vs. modified MI-AlexNet Overall accuracy
In overall accuracy, the improved MI-AlexNet method raises building-extraction accuracy by 0.58% over the baseline MI-AlexNet, i.e. the improved MI-AlexNet performs better in building-scene classification. In standard deviation, the algorithm of this embodiment is smaller than MI-AlexNet's. The standard deviation measures the dispersion of a set of data about its mean; the smaller it is, the closer the values lie to the mean and the more stable the model. The improved MI-AlexNet model is thus both more accurate overall and more stable on building scenes.
The semantics and attention weights produced while the improved MI-AlexNet extracts building scenes are visualized. The attention weight reflects the importance of semantics within a scene. FIGS. 6 (a)-(d) show the original sample, the semantic map, and the attention example map, where the attention map is a heat map of attention weights (in a color rendering, redder means a larger weight): the larger a region's weight, the greater the contribution of its features to the scene-classification result, so the category label of a local scene can be indirectly associated with the global scene, addressing the difficulty of recognizing discrete buildings. As FIGS. 6 (a)-(d) show, in a discrete small-target building scene containing several semantics, the convolutional network obtains the correct building semantics (in the semantic map's color rendering, the yellow regions are building semantics). The attention mechanism in the improved MI-AlexNet, through adaptive learning, assigns larger importance responses to key targets; the attention-weight map shows that the sparse building parts of the scene receive larger weights, i.e. higher importance. The aggregation function then links local semantics with the overall semantics, highlighting the key targets and improving recognition accuracy for discrete small-target buildings.
To check the effect of the model, this embodiment applies the trained model to target detection of discrete buildings (structures) in a mountainous area; the detection results of the trained MI-CNN model are shown in fig. 7.
Mask R-CNN instance segmentation
Mask R-CNN adds a fully convolutional segmentation sub-network after the base feature network of Faster R-CNN, extending the original two tasks (classification + regression) to three tasks (classification + regression + segmentation). Mask R-CNN is a two-stage framework: the first stage scans the image and generates proposals (regions likely to contain an object); the second stage classifies the proposals and generates bounding boxes and masks.
FIG. 8 shows the network structure of Mask R-CNN; the marked lower half is the part modified relative to Faster R-CNN. The overall process is as follows:
1) Input the preprocessed original image;
2) Feed the input image into the feature extraction network to obtain a Feature Map;
3) Set a fixed number of ROIs (regions of interest) at each pixel position of the feature map, then feed them into the RPN (region proposal network) for binary classification (foreground/background) and coordinate regression to obtain refined ROIs, generating N ROIs per image;
4) Map the ROIs onto the last convolutional feature map of the feature extraction network;
5) Generate a fixed-size feature map for each ROI through the ROI Align layer;
6) Finally, perform multi-class classification and candidate-box regression on the ROI regions, and introduce an FCN semantic segmentation branch to generate the Mask, completing the segmentation task.
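The steps above can be sketched structurally. The stub functions below are illustrative stand-ins for the real backbone, RPN and ROI Align layers, not the embodiment's actual network; all sizes (stride 16, 4 proposals, 2×2 pooled maps) are assumptions:

```python
import numpy as np

def backbone_stub(image, stride=16):
    """Stand-in feature extractor: downsample by the network stride (step 2)."""
    return image[::stride, ::stride]

def rpn_stub(feature_map, n_proposals=4):
    """Stand-in RPN: return n refined ROIs as (y, x, h, w) on the feature map (step 3)."""
    fh, fw = feature_map.shape[:2]
    rng = np.random.default_rng(1)
    return [(rng.integers(0, fh - 4), rng.integers(0, fw - 4), 4, 4)
            for _ in range(n_proposals)]

def roi_align_stub(feature_map, roi, out_size=2):
    """Stand-in ROI Align: crop the ROI and pool it to a fixed size (steps 4-5)."""
    y, x, h, w = roi
    crop = feature_map[y:y + h, x:x + w]
    s = h // out_size
    return crop.reshape(out_size, s, out_size, s).mean(axis=(1, 3))

image = np.random.default_rng(0).random((256, 256))   # step 1: preprocessed tile
fmap = backbone_stub(image)                           # 16 x 16 feature map
fixed = [roi_align_stub(fmap, r) for r in rpn_stub(fmap)]
# step 6 would classify each fixed-size map, regress its box and predict a mask
```

The key property the sketch demonstrates is that every ROI, whatever its location, ends up as a feature map of identical size, which is what allows the shared classification, regression and mask heads to follow.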
The model is trained with a multi-task loss function whose value is continuously reduced by learning until a global optimum is reached. The three terms of the loss in formula (2-3) are the classification error, the bounding-box error and the segmentation error, respectively.
L = L_cls + L_box + L_mask    (2-3)
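The multi-task loss of formula (2-3) can be sketched with per-term forms: cross-entropy for classification, smooth-L1 for boxes and per-pixel binary cross-entropy for masks. These concrete forms are the conventional Mask R-CNN choices and are assumptions here, as are all shapes:

```python
import numpy as np

def multitask_loss(p_cls, y_cls, box_pred, box_true, m_pred, m_true):
    """Formula (2-3): L = L_cls + L_box + L_mask (illustrative per-term forms).

    p_cls: (N, C) predicted class probabilities; y_cls: (N,) integer labels;
    boxes: (N, 4); masks: predicted probabilities / binary targets in [0, 1].
    """
    eps = 1e-12
    # classification error: cross-entropy of the true-class probability
    l_cls = -np.mean(np.log(p_cls[np.arange(len(y_cls)), y_cls] + eps))
    # bounding-box error: smooth-L1 on the coordinate differences
    d = np.abs(box_pred - box_true)
    l_box = np.mean(np.where(d < 1, 0.5 * d ** 2, d - 0.5))
    # segmentation error: per-pixel binary cross-entropy on the mask
    l_mask = -np.mean(m_true * np.log(m_pred + eps)
                      + (1 - m_true) * np.log(1 - m_pred + eps))
    return l_cls + l_box + l_mask

rng = np.random.default_rng(0)
p = np.full((2, 3), 1 / 3)                 # uniform class probabilities
loss = multitask_loss(p, np.array([0, 1]),
                      rng.random((2, 4)), rng.random((2, 4)),
                      rng.random((2, 8, 8)) * 0.98 + 0.01,
                      (rng.random((2, 8, 8)) > 0.5).astype(float))
```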
When extracting ROIs, the Mask R-CNN algorithm uses an RPN, which accepts input images of any size and outputs the regions that may contain targets. To reduce computation, Mask R-CNN extracts ROIs through shared convolution: the image fed into the RPN is the feature map output by the last convolutional layer.
The Mask R-CNN instance segmentation model comprises the following parts:
(1) Production of discrete building (structure) instance segmentation data sets.
The instance segmentation data set production comprises three steps: data preprocessing, data cutting and data labeling. The training set accounts for 80% of the entire data set and the validation set for the remaining 20%.
Combining the characteristics of buildings, the instance segmentation data set ultimately contains 5 sample types: buildings, greenhouses, windmills, parking lots and airports. The spatial resolution is 1 m, the image size is 256 × 256, and there are 1500 images in total. To enhance the generalization capability of the model, this embodiment rotates each image by 90°, 180° and 270° and applies the same transformation to the label data. Part of the data set is shown in fig. 9.
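The rotation augmentation above can be sketched as follows; rotating the image and its label mask with the same transform keeps the annotations aligned (tile size and the example footprint are illustrative):

```python
import numpy as np

def rotation_augment(image, label):
    """Return the tile plus its 90-, 180- and 270-degree rotations,
    rotating image and label mask identically so annotations stay aligned."""
    return [(np.rot90(image, k), np.rot90(label, k)) for k in range(4)]

rng = np.random.default_rng(0)
img = rng.integers(0, 255, size=(256, 256, 3), dtype=np.uint8)
lbl = np.zeros((256, 256), dtype=np.uint8)
lbl[10:20, 30:60] = 1                     # a labeled building footprint
pairs = rotation_augment(img, lbl)        # 4 training samples per tile
```

Each original tile yields four samples, and because rotation only permutes pixels, the labeled area in each rotated mask is unchanged.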
(2) And (5) establishing a Mask R-CNN model.
The Mask R-CNN model builds its backbone network from ResNet-50 with a Feature Pyramid Network (FPN). The specific procedures have been described above.
(3) And (5) training a Mask R-CNN model.
Training starts from the pre-trained model mask_rcnn_coco.h5, with the initial learning rate set to 0.001, the learning decay rate to 0.9 and the batch size to 16; all stages of the model are trained.
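Under the assumption that the learning decay rate is applied exponentially once per epoch (the description does not state the decay schedule explicitly), the learning-rate schedule implied by these hyper-parameters is:

```python
# Hyper-parameters from the description; the exponential form of the decay
# is an assumption about how the learning decay rate is applied.
initial_lr, decay, batch_size = 0.001, 0.9, 16

def lr_at_epoch(epoch):
    """Learning rate after `epoch` decay steps: lr = initial_lr * decay**epoch."""
    return initial_lr * decay ** epoch

schedule = [round(lr_at_epoch(e), 6) for e in range(4)]
print(schedule)   # first four epochs of the decayed schedule
```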
(4) Evaluation of accuracy
To ensure the reliability of the experimental results and reduce randomness, 5 repeated experiments are set up, each using a randomly divided training set and test set. The average of the experimental results is taken as the final result, with average precision as the evaluation index of the algorithm; the average precision reaches 92%. The detection results are shown in the test result figure.
This embodiment applies the MI-CNN plus Mask R-CNN discrete building (structure) extraction and detection method to a 6000 × 6000 remote sensing image, as follows:
Step one, input the image to be detected into the built MI + Mask R-CNN neural network; the MI + Mask R-CNN is divided into an MI-CNN scene recognition stage and a Mask R-CNN instance segmentation stage.
Step two, call the trained MI-CNN model to perform discrete building (structure) scene identification, using a sliding-window detection method. If the MI-CNN model identifies a discrete building scene, a candidate frame containing that scene is obtained, and the trained Mask R-CNN model then performs instance segmentation for detection and extraction of the discrete buildings within the candidate frame, yielding the category, mask file and positioning frame of the target ground objects. If the MI-CNN model detects no discrete building (structure) scene in the window, that window is skipped, and the next sliding window is detected in turn.
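A minimal sketch of this sliding-window, two-stage procedure follows; `scene_model` and `segment_model` are hypothetical stand-ins for the trained MI-CNN and Mask R-CNN models, and the window/stride values are illustrative:

```python
import numpy as np

def detect_large_image(image, scene_model, segment_model, window=256, stride=256):
    """Two-stage sketch: scene screening per sliding window, then instance
    segmentation only inside windows classified as building scenes.

    scene_model(tile) -> bool, and segment_model(tile) -> list of
    (mask, category, (x, y, w, h)); both are assumed interfaces.
    """
    H, W = image.shape[:2]
    detections = []
    for y in range(0, H - window + 1, stride):
        for x in range(0, W - window + 1, stride):
            tile = image[y:y + window, x:x + window]
            if not scene_model(tile):          # no building scene: skip window
                continue
            for mask, cat, (bx, by, bw, bh) in segment_model(tile):
                # shift the box back to large-image coordinates
                detections.append((mask, cat, (bx + x, by + y, bw, bh)))
    return detections

img = np.zeros((512, 512))
hits = detect_large_image(
    img,
    scene_model=lambda t: t.mean() >= 0,                        # always positive
    segment_model=lambda t: [(None, "building", (10, 20, 30, 40))])
```

Skipping segmentation in negative windows is exactly where the running-time advantage reported below comes from: the expensive instance-segmentation stage only runs inside candidate frames.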
To effectively evaluate the discrete building (structure) extraction effect of the method, the performance of the algorithm is evaluated with the precision, recall and F-measure indexes.
Precision is the percentage of correctly detected buildings among all detected buildings. It represents the accuracy of prediction among positive results and to some extent reflects the extraction performance of a detection algorithm, but precision alone cannot fully evaluate an algorithm. Recall is the percentage of correctly detected buildings among all buildings that should be detected; it is a measure of coverage and reflects the algorithm's ability to retrieve all target samples. Precision and recall are the most important model evaluation indexes, but neither is intuitive on its own, so in the field of target detection they are often combined into the F-measure, a weighted harmonic mean of precision and recall that is commonly used to evaluate classification models. This embodiment uses the F1 index to evaluate model performance; its calculation is given in formula (4-5).
F1 = 2PR / (P + R)    (4-5)
Where P is precision and R is recall.
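The three indexes can be computed directly from detection counts; the counts below are made-up values for illustration only:

```python
def precision_recall_f1(tp, fp, fn):
    """P = TP/(TP+FP), R = TP/(TP+FN), F1 = 2PR/(P+R) -- formula (4-5)."""
    p = tp / (tp + fp)          # correct detections among all detections
    r = tp / (tp + fn)          # correct detections among all true buildings
    return p, r, 2 * p * r / (p + r)

p, r, f1 = precision_recall_f1(tp=90, fp=10, fn=30)
print(p, r, round(f1, 4))
```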
The judgment criterion of the experimental results of this example is a classification confidence (the probability of being assigned to a class) greater than 0.5, combined with non-maximum suppression to obtain the final detection result. Table 3 compares the detection results of the method of this embodiment and the Mask R-CNN method on the Gaofen-2 experimental remote sensing image in terms of precision, F1 value and running time.
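A minimal sketch of this decision rule follows: keep detections whose confidence exceeds 0.5, then greedily suppress overlapping boxes. The IoU suppression threshold of 0.5 and the box format (x1, y1, x2, y2) are assumptions:

```python
def nms(boxes, scores, conf_thresh=0.5, iou_thresh=0.5):
    """Confidence filtering followed by greedy non-maximum suppression.

    boxes: list of (x1, y1, x2, y2); scores: matching confidences.
    Returns the indices of the kept detections.
    """
    def iou(a, b):
        ax1, ay1, ax2, ay2 = a
        bx1, by1, bx2, by2 = b
        ix = max(0, min(ax2, bx2) - max(ax1, bx1))
        iy = max(0, min(ay2, by2) - max(ay1, by1))
        inter = ix * iy
        union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
        return inter / union if union else 0.0

    # keep only confident detections, highest score first
    order = sorted((i for i, s in enumerate(scores) if s > conf_thresh),
                   key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_thresh for j in keep):
            keep.append(i)
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30), (40, 40, 50, 50)]
scores = [0.9, 0.8, 0.7, 0.4]
keep = nms(boxes, scores)   # second box overlaps the first; last box is below 0.5
```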
TABLE 3 comparison of different evaluation indexes of experimental images by two experimental methods
(Table 3 is reproduced as an image in the original publication.)
As Table 3 shows, the precision of the method of this embodiment improves markedly, from 82.75% to 90.15%. This is mainly because the method first extracts discrete building (structure) scenes with the improved MI-CNN algorithm to obtain candidate frames; selecting candidate frames this way effectively excludes false-alarm targets, so that discrete building (structure) targets are accurately identified within the candidate frames. This effectively alleviates the problem of numerous false-alarm targets in target detection and indirectly improves precision.
In terms of recall, the two methods are the same, i.e. their ability to find correct targets is identical: since both use the Mask R-CNN algorithm for instance segmentation, their ability to detect correct samples is approximately equal. A larger F1 value indicates better model performance, as it jointly evaluates precision and recall; the F1 values show that the algorithm model of this embodiment outperforms the conventional Mask R-CNN target detection algorithm.
In terms of running time, the algorithm of this embodiment runs faster and more efficiently, a clear advantage. Although it passes through two stages, MI-CNN scene classification and Mask R-CNN instance segmentation, and thus adds a scene classification stage compared with plain target detection, it performs Mask R-CNN instance segmentation directly within the candidate frames produced by MI-CNN scene classification. This greatly reduces the workload of detecting and extracting large non-target areas; and since extracting candidate frames through MI-CNN scene classification is very efficient compared with Mask R-CNN detection, the two-stage method is still faster overall. The MI + Mask R-CNN model is therefore superior or equal to the Mask R-CNN model on every evaluation index: it effectively reduces false-alarm targets, improves the precision and efficiency of target detection, and addresses these problems in target detection.
Example two:
the embodiment provides a system for implementing the method, which includes:
a data acquisition module configured to: acquiring a remote sensing image;
a scene classification module configured to: obtaining the characteristics of a target building in the remote sensing image based on a first model, classifying according to image blocks corresponding to the characteristics to obtain example category scores, aggregating the example category scores to form scene category scores, and obtaining the area where the scene of the target building is located, namely an initial candidate frame containing the scene of the building;
an extraction detection module configured to: and extracting a feature map of the building from the candidate frame containing the target scene information of the building based on the second model to obtain the mask, the category and the positioning frame information of the target building.
The scene of the building is obtained from the remote sensing image, and the mask, category and positioning frame of the building are extracted from the candidate frame carrying the scene information, so the target building can be identified while false-alarm targets are excluded; this effectively alleviates the problem of numerous false alarms in the detection process and indirectly improves accuracy.
Although the detection algorithm passes through two stages, scene classification and instance segmentation, and thus adds a scene classification stage compared with conventional target detection, instance segmentation is performed directly within the candidate frames from scene classification. This effectively reduces the workload of detecting and extracting large target-free areas and makes detection more efficient than the conventional approach.
Example three:
The present embodiment provides a computer-readable storage medium, on which a computer program is stored, and the program, when executed by a processor, implements the steps in a remote sensing image-based discrete building detection method as set forth in the first embodiment.
In the remote sensing image-based discrete building detection method executed by the computer program of this embodiment, the scene where the building (structure) is located is obtained from the remote sensing image, and the mask, category and position of the building (structure) are then extracted from the candidate frame carrying the scene information, so the target building (structure) can be identified while false-alarm targets are excluded; this effectively alleviates the problem of numerous false alarms in the detection process and indirectly improves accuracy.
Example four:
The embodiment provides a computer device, which includes a memory, a processor and a computer program stored in the memory and capable of running on the processor, wherein the processor executes the program to implement the steps in the remote sensing image-based discrete building detection method as set forth in the above embodiment.
The method executed by the processor of this embodiment obtains the scene of the building from the remote sensing image and extracts the mask, category and positioning frame of the building from the candidate frame carrying the scene information, so the target building can be identified while false-alarm targets are excluded; this effectively alleviates the problem of numerous false alarms in the detection process and indirectly improves accuracy.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of a hardware embodiment, a software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a read-only memory or a random access memory.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. A discrete building detection method based on remote sensing images is characterized in that: the method comprises the following steps:
acquiring a remote sensing image;
obtaining the characteristics of a target building in the remote sensing image based on a first model, classifying according to image blocks corresponding to the characteristics to obtain example category scores, aggregating the example category scores to form scene category scores, and obtaining the area where the scene of the target building is located, namely an initial candidate frame containing the scene of the building;
and based on the second model, segmenting the target building instance in the candidate frame containing the building target scene information to obtain the mask, the category and the positioning frame information of the target building.
2. The remote sensing image-based discrete building detection method of claim 1, wherein: the first model comprises an example classifier and a collection function, wherein the example classifier classifies examples through the extracted features; and the aggregation function receives the example labels, integrates the example predicted category labels into a prediction result of the scene, and forms a final category label.
3. The remote sensing image-based discrete building detection method of claim 2, wherein: the example classifier includes at least one of the convolutional network models Alex-Net, VGG-Net or ResNet 50/101.
4. The remote sensing image-based discrete building detection method of claim 2, wherein: the example classifier has three maximum pooling layers and three fully connected layers.
5. The remote sensing image-based discrete building detection method of claim 2, wherein: the aggregation function aggregates the example category scores into a scene category score, the aggregation process focuses attention on a key portion of the image classification according to the magnitude of the example feature weights, and the scene category score is a weighted average of the example categories.
6. The remote sensing image-based discrete building detection method of claim 5, wherein: the importance of an example is evaluated according to its attention weight value W_i,j, specifically:
the attention weight calculation method is as follows:
W_i,j = exp(a^T tanh(b·h_i,j)) / Σ_j exp(a^T tanh(b·h_i,j))
where a and b are parameters of the attention network mechanism, and h_i,j is the output feature.
7. The remote sensing image-based discrete building detection method of claim 1, wherein: and the second model extracts a feature map of the building from a candidate frame containing building target scene information, sets a fixed number of interest areas for each pixel position of the feature map, sends the interest areas into an area generation network for secondary classification and coordinate regression to obtain a refined interest area, maps the interest areas to the last layer of convolution feature map of the feature extraction network to obtain a feature map with a fixed size, and classifies the interest areas.
8. A discrete building detecting system based on remote sensing image is characterized in that: the method comprises the following steps:
a data acquisition module configured to: acquiring a remote sensing image to be detected;
a scene classification module configured to: obtaining the characteristics of a target building in a remote sensing image based on a first model, classifying the image blocks corresponding to the characteristics to obtain example category scores, and aggregating the example category scores to form scene category scores to obtain the area where the scene of the target building is located, namely an initial candidate frame containing the scene of the building;
an extraction detection module configured to: and based on the second model, segmenting the target building instance in the candidate frame containing the building target scene information to obtain the mask, the category and the positioning frame information of the target building.
9. A computer-readable storage medium, on which a computer program is stored, which program, when being executed by a processor, carries out the steps of a method for remote sensing image-based discrete building detection as claimed in any one of claims 1 to 7.
10. A computer device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor when executing the program performs the steps of a method for remote sensing image based discrete building detection as claimed in any one of claims 1-7.
CN202210925795.1A 2022-08-03 2022-08-03 Discrete building detection method and system based on remote sensing image Pending CN115272882A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210925795.1A CN115272882A (en) 2022-08-03 2022-08-03 Discrete building detection method and system based on remote sensing image

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210925795.1A CN115272882A (en) 2022-08-03 2022-08-03 Discrete building detection method and system based on remote sensing image

Publications (1)

Publication Number Publication Date
CN115272882A true CN115272882A (en) 2022-11-01

Family

ID=83746815

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210925795.1A Pending CN115272882A (en) 2022-08-03 2022-08-03 Discrete building detection method and system based on remote sensing image

Country Status (1)

Country Link
CN (1) CN115272882A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116486265A (en) * 2023-04-26 2023-07-25 北京卫星信息工程研究所 Airplane fine granularity identification method based on target segmentation and graph classification
CN116486265B (en) * 2023-04-26 2023-12-19 北京卫星信息工程研究所 Airplane fine granularity identification method based on target segmentation and graph classification


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination