CN110826478A - Aerial photography illegal building identification method based on adversarial network - Google Patents

Aerial photography illegal building identification method based on adversarial network

Info

Publication number
CN110826478A
CN110826478A
Authority
CN
China
Prior art keywords
network
target
detector
feature
transformation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911063291.8A
Other languages
Chinese (zh)
Inventor
宫法明
徐晨曦
李昕
杨天濠
刘芳华
袁咪咪
唐昱润
司朋举
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China University of Petroleum East China
Original Assignee
China University of Petroleum East China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China University of Petroleum East China filed Critical China University of Petroleum East China
Priority to CN201911063291.8A priority Critical patent/CN110826478A/en
Publication of CN110826478A publication Critical patent/CN110826478A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/10Terrestrial scenes
    • G06V20/176Urban or other man-made structures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07Target detection


Abstract

The invention relates to an aerial photography illegal building identification method based on an adversarial network, belonging to the technical field of target detection and recognition in deep learning. The method addresses two problems: training samples of ground targets deformed by oblique aerial photography are scarce, and a target detector with poor robustness has difficulty recognizing such deformed targets. Whereas a typical adversarial network is trained so that its generator produces realistic images, our network performs the opposite task: through adversarial competition, we train a detector that is more robust to deformation. Adversarial learning improves the ability to recognize deformed samples and raises detection accuracy, yielding a substantial performance improvement over a standard Fast-RCNN network.

Description

Aerial photography illegal building identification method based on adversarial network
Technical Field
The invention belongs to the field of computer graphics and image processing, and particularly relates to an aerial photography illegal building identification method based on an adversarial network.
Background
Urban construction is accelerating day by day, which creates many problems for city planning and management: profit-driven development produces large amounts of illegal construction, such as unauthorized occupation of land at the urban-rural fringe, demolition and construction that damages cultural relics, and building without planning approval. Such illegal construction harms urban planning and development in many ways. For illegal-building detection, traditional aerial survey methods (such as satellite remote sensing and conventional aerial remote sensing) suffer from high information acquisition costs, long data acquisition cycles, and a lack of operational flexibility, and are unsuited to short-term, high-frequency dynamic urban monitoring. The unmanned aerial vehicle (UAV), by contrast, offers short data acquisition cycles, flexible capture of high-resolution imagery, freely scheduled flight times, relatively low purchase and operating costs, and easily operated and maintained equipment, and is gradually becoming the preferred platform for illegal-building detection.
With the continued development of image processing and deep learning technology, training a target detector within a deep learning framework to recognize targets in aerial video is a meaningful line of research. In the early stage of model training, we found that oblique high-altitude shooting (used by operators to improve working efficiency) causes targets on the ground to appear deformed. To learn a target detector invariant to such deformation, our current solution is data driven: collect a large-scale dataset containing target instances under varied conditions. In practice, however, no dataset can cover every case, which makes deformed targets difficult to recognize.
Disclosure of Invention
In order to solve these problems, the invention provides an aerial photography illegal building identification method based on an adversarial network, which uses an adversarial learning strategy to improve the detector's recognition accuracy on deformed targets.
The method comprises the following specific steps:
S1, screen the videos shot by the unmanned aerial vehicle in operation, split the target video into frames, manually annotate the targets, and moderately reduce the image resolution to improve computation speed;
S2, feed the training set to the convolutional network of Fast-RCNN and train a basic target detector, which serves as the discriminator in the adversarial network;
S3, use a deformation network whose key idea is to deform the target features so as to make the detector's recognition difficult; we adopt an STN (spatial transformer network), which contains three parts: a localization network, a grid generator, and a sampler;
S4, given the input feature map, the STN estimates the deformation to apply (e.g., rotation angle, translation distance, and affine transformation); these parameters are used as inputs to the grid generator and sampler on the feature map, and the output is the deformed feature map;
S5, feed the deformed feature map to the discriminator, i.e., the pre-trained basic target detector, for classification; by competing with the deformation network and overcoming the obstacles it creates, Fast-RCNN learns to handle deformation robustly;
S6, the discriminator feeds its recognition result back to the generator and the generator's parameters are updated, so that the generator produces more useful deformed feature maps;
S7, a more robust target detector is obtained through this adversarial learning.
The technical solution of the invention is further characterized as follows:
For step S2, the convolutional network of Fast-RCNN takes the whole image as input and produces a convolutional feature map as output. Since the operations are mainly convolution and max pooling, the spatial size of the output feature map varies with the size of the input image. Given the feature map, the RoI pooling layer is used to project candidate regions onto the feature space; it crops and resizes each target candidate region to generate a fixed-size feature vector. These feature vectors are then passed through fully connected layers, whose outputs are the class probabilities (including a background class) and the bounding-box coordinates of each target. For training, SoftMax Loss and Regression Loss are applied to these two outputs respectively and back-propagated to perform end-to-end learning. The total loss function is as follows:
L(p, u, t^u, v) = L_cls(p, u) + λ[u ≥ 1] L_loc(t^u, v)    (1)
In formula (1), the total loss is the weighted sum of a classification loss and a regression (localization) loss: p is the discrete probability distribution predicted for each RoI; u is the true class label; t^u is the predicted regression offset for the true class; v is the ground-truth bounding-box regression target; λ is a hyperparameter, fixed before training, that weights regression against classification;
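As an illustration, the multi-task loss of formula (1) can be sketched in PyTorch. This is a minimal sketch under assumptions the patent does not state: the function and argument names are invented, box regression is treated as class-agnostic, and smooth L1 is assumed for the localization term (as in the original Fast-RCNN):

```python
import torch
import torch.nn.functional as F

def fast_rcnn_loss(cls_scores, bbox_deltas, labels, bbox_targets, lam=1.0):
    """Multi-task loss of formula (1): L_cls(p, u) + lam * [u >= 1] * L_loc.

    Assumptions: class-agnostic box regression and a smooth-L1 L_loc."""
    # L_cls: cross-entropy between predicted class scores p and true labels u
    cls_loss = F.cross_entropy(cls_scores, labels)
    # [u >= 1]: the localization term counts only for non-background RoIs
    fg = labels >= 1
    if fg.any():
        loc_loss = F.smooth_l1_loss(bbox_deltas[fg], bbox_targets[fg])
    else:
        loc_loss = cls_scores.new_tensor(0.0)
    return cls_loss + lam * loc_loss
```

Background RoIs (u = 0) contribute only to the classification term, exactly as the indicator [u ≥ 1] prescribes.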
For step S4, the operating mechanism of the spatial transformer adopted in the present invention can be divided into three parts: the localization network, the grid generator, and the sampler. The localization network regresses the transformation parameters θ: it takes the feature map as input and outputs spatial transformation parameters through a series of hidden layers. θ can take various forms; if a 2D affine transformation is to be realized, θ is a 6-dimensional (2×3) output, and in general the size of θ depends on the type of transformation. The grid generator constructs a sampling grid from the predicted transformation parameters, computing by matrix operations, for each coordinate in the transformed target feature map V, the corresponding sampling position in the original feature map U, i.e., it generates T(G); this correspondence can be written as expression (2), and the affine transformation of the picture can be written as expression (3). The sampler takes the sampling grid and the input feature map as input and produces the transformed feature map as output;
$$\begin{pmatrix} x_i^s \\ y_i^s \end{pmatrix} = T_\theta(G_i) = A_\theta \begin{pmatrix} x_i^t \\ y_i^t \\ 1 \end{pmatrix} \qquad (2)$$

In formula (2), (x_i^t, y_i^t) are the coordinates of each pixel in the transformed target feature map V, (x_i^s, y_i^s) are the corresponding sampling coordinates in the original feature map U, T_θ is the two-dimensional affine transformation function, and the matrix A_θ is the parameter matrix output by the localization network;

$$\begin{pmatrix} x_i^s \\ y_i^s \end{pmatrix} = \begin{bmatrix} \theta_{11} & \theta_{12} & \theta_{13} \\ \theta_{21} & \theta_{22} & \theta_{23} \end{bmatrix} \begin{pmatrix} x_i^t \\ y_i^t \\ 1 \end{pmatrix} \qquad (3)$$

In formula (3), the coefficient matrix θ contains the affine transformation coefficients.
For step S5, the challenge network must learn features that are predicted to be misdetected by the detector, and we train this challenge network by the following loss function:
L = -L_softmax(F_c(A(X)), C)    (4)
In formula (4), A(X) denotes the STN applied to X, X is the feature computed on the image, F_c is the output giving class probabilities, and C is the true class of X. If the features generated by the adversarial network are easily classified by the detector, the STN receives a high loss; conversely, if the detector finds the adversarially deformed features hard to classify, the detector's loss is high and the STN's loss is low.
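A minimal sketch of the loss in formula (4), assuming PyTorch and a detector head that outputs class logits (the function name and the use of cross-entropy for L_softmax are our assumptions, not the patent's):

```python
import torch
import torch.nn.functional as F

def adversarial_stn_loss(detector_logits, true_labels):
    """Formula (4): L = -L_softmax(F_c(A(X)), C).

    detector_logits stands for F_c(A(X)), the detector's class scores on
    STN-deformed features; negating the cross-entropy rewards the STN when
    the detector's loss on the true class C is high."""
    return -F.cross_entropy(detector_logits, true_labels)
```

Since cross-entropy is non-negative, this adversarial loss is always non-positive and is minimized when the detector is maximally confused.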
For step S7, our network is built on a space transformation network, which was proposed to deform features to make classification easier in the original work, while our network is doing the opposite task, by competing with our Fast-RCNN network, we can train a better detector robust to deformation.
The invention has the beneficial effects that:
(1) The invention provides an automatic detection and recognition method for identifying illegal buildings from UAV aerial photography, which improves working efficiency and frees up manpower.
(2) Aiming at the deformation inherent in UAV photography, the invention trains with deformed samples generated by an adversarial network, remedying the scarcity of deformed samples in the dataset and thereby obtaining a robust detector.
Drawings
FIG. 1 is a flow chart of the aerial photography illegal building identification method based on an adversarial network according to the present invention.
FIG. 2 is a network structure diagram of the aerial photography illegal building identification method based on an adversarial network.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
As shown in FIG. 1, which is an implementation flow chart of the present invention, the invention provides an aerial photography illegal building identification method based on an adversarial network; the method includes:
S1, screen the videos shot by the unmanned aerial vehicle in operation, split the target video into frames, manually annotate the targets, and moderately reduce the image resolution to improve computation speed;
S2, feed the training set to the convolutional network of Fast-RCNN and train a basic target detector, which serves as the discriminator in the adversarial network. The convolutional network of Fast-RCNN takes the whole image as input and produces a convolutional feature map as output. Since the operations are mainly convolution and max pooling, the spatial size of the output feature map varies with the input image size. Given the feature map, the RoI pooling layer is used to project candidate regions onto the feature space; it crops and resizes each target candidate region to generate a fixed-size feature vector. These feature vectors are then passed through fully connected layers, whose outputs are the class probabilities (including a background class) and the bounding-box coordinates of each target. For training, SoftMax Loss and Regression Loss are applied to the two outputs respectively and back-propagated to perform end-to-end learning. The overall loss function is as follows:
L(p, u, t^u, v) = L_cls(p, u) + λ[u ≥ 1] L_loc(t^u, v)    (5)
In formula (5), the total loss is the weighted sum of a classification loss and a regression (localization) loss: p is the discrete probability distribution predicted for each RoI; u is the true class label; t^u is the predicted regression offset for the true class; v is the ground-truth bounding-box regression target; λ is a hyperparameter, fixed before training, that weights regression against classification;
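The RoI pooling operation described in step S2 (crop each candidate region and resize it to a fixed-size grid) can be sketched for a single region as follows; this is an illustrative simplification using adaptive max pooling on integer-aligned crops, not the patent's implementation:

```python
import torch
import torch.nn.functional as F

def roi_pool_single(feature, roi, output_size=(7, 7)):
    """Crop one candidate region (x1, y1, x2, y2), given as integer
    coordinates on the feature map, and max-pool the crop to a fixed
    output size, so every RoI yields a vector of the same length."""
    x1, y1, x2, y2 = roi
    crop = feature[:, :, y1:y2 + 1, x1:x2 + 1]
    return F.adaptive_max_pool2d(crop, output_size)
```

Regions of different sizes, e.g. (2, 2, 20, 20) and (5, 5, 30, 30), both come out as 7×7 grids per channel, which is what lets the subsequent fully connected layers have a fixed input size.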
S3, use a deformation network whose key idea is to deform the target features so as to make the detector's recognition difficult. We adopt the STN spatial transformer network; the operating mechanism of a spatial transformer can be divided into three parts: the localization network, the grid generator, and the sampler;
S4, given the input feature map, the STN estimates the deformation to apply (e.g., rotation angle, translation distance, and affine transformation). These parameters are used as inputs to the grid generator and sampler on the feature map, and the output is the deformed feature map. The localization network regresses the transformation parameters θ: it takes the feature map as input and outputs spatial transformation parameters through a series of hidden layers. θ can take various forms; if a 2D affine transformation is to be realized, θ is a 6-dimensional (2×3) output, and in general the size of θ depends on the type of transformation. The grid generator constructs a sampling grid from the predicted transformation parameters, computing by matrix operations, for each coordinate in the transformed target feature map V, the corresponding sampling position in the original feature map U, i.e., it generates T(G); this correspondence can be written as expression (6), and the affine transformation of the picture can be written as expression (7). The sampler takes the sampling grid and the input feature map as input and produces the transformed feature map as output.
$$\begin{pmatrix} x_i^s \\ y_i^s \end{pmatrix} = T_\theta(G_i) = A_\theta \begin{pmatrix} x_i^t \\ y_i^t \\ 1 \end{pmatrix} \qquad (6)$$

In formula (6), (x_i^t, y_i^t) are the coordinates of each pixel in the transformed target feature map V, (x_i^s, y_i^s) are the corresponding sampling coordinates in the original feature map U, T_θ is the two-dimensional affine transformation function, and the matrix A_θ is the parameter matrix output by the localization network.

$$\begin{pmatrix} x_i^s \\ y_i^s \end{pmatrix} = \begin{bmatrix} \theta_{11} & \theta_{12} & \theta_{13} \\ \theta_{21} & \theta_{22} & \theta_{23} \end{bmatrix} \begin{pmatrix} x_i^t \\ y_i^t \\ 1 \end{pmatrix} \qquad (7)$$

In formula (7), the coefficient matrix θ contains the affine transformation coefficients.
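Steps S3-S4 and the affine transformation of formulas (6)-(7) can be sketched as a minimal spatial transformer in PyTorch; the localization-network layer sizes are illustrative assumptions, and PyTorch's affine_grid and grid_sample play the roles of the grid generator T(G) and the sampler:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class STN(nn.Module):
    """Minimal spatial transformer. The localization net regresses the six
    parameters of the 2x3 affine matrix A_theta; F.affine_grid builds the
    sampling grid T(G) of the grid generator, and F.grid_sample is the
    sampler that produces the transformed feature map."""

    def __init__(self, channels):
        super().__init__()
        self.loc = nn.Sequential(
            nn.AdaptiveAvgPool2d(4),              # shrink the feature map U
            nn.Flatten(),
            nn.Linear(channels * 16, 32), nn.ReLU(),
            nn.Linear(32, 6),                     # theta: 6 affine parameters
        )
        # Initialize at the identity transform so training starts stably.
        self.loc[-1].weight.data.zero_()
        self.loc[-1].bias.data.copy_(torch.tensor([1., 0., 0., 0., 1., 0.]))

    def forward(self, u):
        theta = self.loc(u).view(-1, 2, 3)                          # A_theta
        grid = F.affine_grid(theta, u.size(), align_corners=False)  # T(G)
        return F.grid_sample(u, grid, align_corners=False)          # sampler
```

With the identity initialization, the module initially passes features through unchanged; during adversarial training the regressed θ drifts toward transforms that confuse the detector.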
S5, feed the deformed feature map to the discriminator, i.e., the pre-trained basic target detector, for classification; by competing with the deformation network and overcoming the obstacles it creates, Fast-RCNN learns to handle deformation in a robust way. The adversarial network must learn to generate features that the detector will misclassify. We train this adversarial network with the following loss function:
L = -L_softmax(F_c(A(X)), C)    (8)
In formula (8), A(X) denotes the STN applied to X, X is the feature computed on the image, F_c is the output giving class probabilities, and C is the true class of X. If the features generated by the adversarial network are easily classified by the detector, the STN receives a high loss. Conversely, if the detector finds the adversarially deformed features hard to classify, the detector's loss is high and the STN's loss is low.
S6, the discriminator feeds its recognition result back to the generator and the generator's parameters are updated, so that the generator produces more useful deformed feature maps, as shown in FIG. 2.
S7, a more robust target detector is obtained through this adversarial learning. As shown in FIG. 2, our network is built on a spatial transformer network. In the original work, the STN was proposed to deform features so as to make classification easier; our network performs the opposite task. By competing against our Fast-RCNN network, we can train a better detector that is robust to deformation.
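The alternating updates of steps S4-S6 can be sketched as one training step. All module and optimizer names here are our assumptions, and the detector is reduced to a classification head for brevity:

```python
import torch
import torch.nn.functional as F

def adversarial_step(stn, detector_head, feats, labels, opt_stn, opt_det):
    """One alternating update for steps S4-S6.

    First the STN (generator) is updated to *maximize* the detector's
    classification loss on deformed features, per formula (8); then the
    detector (discriminator) is updated on the freshly deformed features."""
    # Generator update: negative cross-entropy, as in formula (8).
    opt_stn.zero_grad()
    gen_loss = -F.cross_entropy(detector_head(stn(feats)), labels)
    gen_loss.backward()
    opt_stn.step()
    # Discriminator update on features deformed by the updated STN.
    opt_det.zero_grad()
    with torch.no_grad():
        warped = stn(feats)              # regenerate after the STN update
    det_loss = F.cross_entropy(detector_head(warped), labels)
    det_loss.backward()
    opt_det.step()
    return det_loss.item()
```

Detaching the deformed features (torch.no_grad) during the discriminator update keeps each player optimizing only its own parameters, the usual convention in adversarial training.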
The invention discloses an aerial photography illegal building identification method based on an adversarial network. During training, the adversarial network creates examples with different deformations that are difficult for the original target detector to recognize. The resulting model is thereby invariant to deformation, solving the low recognition accuracy on deformed ground targets caused by the scarcity of deformed samples in the training set.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims (5)

1. An aerial photography illegal building identification method based on an adversarial network, characterized by comprising the following specific steps:
S1, screen the videos shot by the unmanned aerial vehicle in operation, split the target video into frames, manually annotate the targets, and moderately reduce the image resolution to improve computation speed;
S2, feed the training set to the convolutional network of Fast-RCNN and train a basic target detector, which serves as the discriminator in the adversarial network;
S3, use a deformation network whose key idea is to deform the target features so as to make the detector's recognition difficult; we adopt an STN spatial transformer network, which contains three parts: a localization network, a grid generator, and a sampler;
S4, given the input feature map, the STN estimates the deformation to apply; these parameters are used as inputs to the grid generator and sampler on the feature map, and the output is the deformed feature map;
S5, feed the deformed feature map to the discriminator, i.e., the pre-trained basic target detector, for classification; by competing with the deformation network and overcoming the obstacles it creates, Fast-RCNN learns to handle deformation robustly;
S6, the discriminator feeds its recognition result back to the generator and the generator's parameters are updated, so that the generator produces more useful deformed feature maps;
S7, a more robust target detector is obtained through this adversarial learning.
2. The method for identifying illegal buildings in aerial photography based on an adversarial network as claimed in claim 1, wherein, for step S2, the convolutional network of Fast-RCNN takes the whole image as input and produces a convolutional feature map as output. Since the operations are mainly convolution and max pooling, the spatial size of the output feature map varies with the input image size. Given the feature map, the RoI pooling layer is used to project candidate regions onto the feature space; it crops and resizes each target candidate region to generate a fixed-size feature vector. These feature vectors are then passed through fully connected layers, whose outputs are the class probabilities (including a background class) and the bounding-box coordinates of each target. For training, SoftMax Loss and Regression Loss are applied to these two outputs respectively and back-propagated to perform end-to-end learning. The total loss function is as follows:
L(p, u, t^u, v) = L_cls(p, u) + λ[u ≥ 1] L_loc(t^u, v)    (1)
In formula (1), the total loss is the weighted sum of a classification loss and a regression (localization) loss: p is the discrete probability distribution predicted for each RoI; u is the true class label; t^u is the predicted regression offset for the true class; v is the ground-truth bounding-box regression target; λ is a hyperparameter, fixed before training, that weights regression against classification.
3. The method for identifying illegal buildings in aerial photography based on an adversarial network as claimed in claim 1, wherein, for step S4, the operating mechanism of the spatial transformer adopted by the invention can be divided into three parts: the localization network, the grid generator, and the sampler. The localization network regresses the transformation parameters θ: it takes the feature map as input and outputs spatial transformation parameters through a series of hidden layers. θ can take various forms; if a 2D affine transformation is to be realized, θ is a 6-dimensional (2×3) output, and in general the size of θ depends on the type of transformation. The grid generator constructs a sampling grid from the predicted transformation parameters, computing by matrix operations, for each coordinate in the transformed target feature map V, the corresponding sampling position in the original feature map U, i.e., it generates T(G); this correspondence can be written as expression (2), and the affine transformation of the picture can be written as expression (3). The sampler takes the sampling grid and the input feature map as input and produces the transformed feature map as output;
$$\begin{pmatrix} x_i^s \\ y_i^s \end{pmatrix} = T_\theta(G_i) = A_\theta \begin{pmatrix} x_i^t \\ y_i^t \\ 1 \end{pmatrix} \qquad (2)$$

In formula (2), (x_i^t, y_i^t) are the coordinates of each pixel in the transformed target feature map V, (x_i^s, y_i^s) are the corresponding sampling coordinates in the original feature map U, T_θ is the two-dimensional affine transformation function, and the matrix A_θ is the parameter matrix output by the localization network;

$$\begin{pmatrix} x_i^s \\ y_i^s \end{pmatrix} = \begin{bmatrix} \theta_{11} & \theta_{12} & \theta_{13} \\ \theta_{21} & \theta_{22} & \theta_{23} \end{bmatrix} \begin{pmatrix} x_i^t \\ y_i^t \\ 1 \end{pmatrix} \qquad (3)$$

In formula (3), the coefficient matrix θ contains the affine transformation coefficients.
4. The method for identifying illegal buildings in aerial photography based on an adversarial network as claimed in claim 1, wherein, for step S5, the STN adopted in the present invention must learn to generate features that the detector will misclassify; we train this adversarial network with the following loss function:
L = -L_softmax(F_c(A(X)), C)    (4)
In formula (4), A(X) denotes the STN applied to X, X is the feature computed on the image, F_c is the output giving class probabilities, and C is the true class of X. If the features generated by the adversarial network are easily classified by the detector, the STN receives a high loss; conversely, if the detector finds the adversarially deformed features hard to classify, the detector's loss is high and the STN's loss is low.
5. The method for identifying illegal buildings in aerial photography based on an adversarial network as claimed in claim 1, wherein, for step S7, the network of the invention is built on a spatial transformer network. In the original work, the STN was proposed to deform features so as to make classification easier; our network performs the opposite task. By competing with our Fast-RCNN network, we can train a better detector that is robust to deformation.
CN201911063291.8A 2019-11-01 2019-11-01 Aerial photography illegal building identification method based on countermeasure network Pending CN110826478A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911063291.8A CN110826478A (en) 2019-11-01 2019-11-01 Aerial photography illegal building identification method based on countermeasure network

Publications (1)

Publication Number Publication Date
CN110826478A true CN110826478A (en) 2020-02-21

Family

ID=69552311

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911063291.8A Pending CN110826478A (en) 2019-11-01 2019-11-01 Aerial photography illegal building identification method based on countermeasure network

Country Status (1)

Country Link
CN (1) CN110826478A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111353548A (en) * 2020-03-11 2020-06-30 中国人民解放军军事科学院国防科技创新研究院 Robust feature deep learning method based on confrontation space transformation network
CN113420749A (en) * 2021-05-25 2021-09-21 广州铁路职业技术学院(广州铁路机械学校) Container number positioning and identifying method


Legal Events

Date Code Title Description
PB01 Publication
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20200221