CN118015469B - Urban and rural junction illegal building detection method and system

Info

Publication number: CN118015469B
Application number: CN202410276667.8A
Authority: CN (China)
Prior art keywords: feature, module, convolution, modules, input
Legal status: Active (assumed by Google Patents; not a legal conclusion)
Other languages: Chinese (zh)
Other versions: CN118015469A (en)
Inventors: 李作进, 曹亚男, 贺学乐, 聂玲, 青美伊, 吴昭, 蔡俊锋
Current and original assignee: Chongqing University of Science and Technology
Application filed by Chongqing University of Science and Technology
Priority to CN202410276667.8A
Publication of CN118015469A, followed by grant and publication of CN118015469B

Classifications

    • G06V20/176 Scenes; terrestrial scenes; urban or other man-made structures
    • G06N3/0464 Convolutional networks [CNN, ConvNet]
    • G06N3/048 Activation functions
    • G06N3/08 Learning methods
    • G06V10/42 Global feature extraction by analysis of the whole pattern, e.g. using frequency domain transformations or autocorrelation
    • G06V10/454 Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
    • G06V10/774 Generating sets of training patterns; bootstrap methods, e.g. bagging or boosting
    • G06V10/776 Validation; performance evaluation
    • G06V10/806 Fusion of extracted features at the sensor, preprocessing, feature extraction or classification level
    • G06V10/82 Image or video recognition or understanding using neural networks

Abstract

The invention relates to the technical field of artificial intelligence and discloses a method and a system for detecting illegal buildings at the urban and rural junction. To address the RetinaNet algorithm's tendency to miss small-target objects at the edges of images in the data set, attention modules are added to the Stage1 and Stage2 modules (stages) of the ResNet feature extraction network, and a parallel dilated convolution module is added to the Stage3 module to perform multi-scale feature fusion and obtain global feature information, improving the algorithm's detection accuracy on urban and rural illegal buildings. To avoid gradient explosion during training, the activation function in the network is replaced with GELU. The feasibility and effectiveness of the method and system are verified by experiments.

Description

Urban and rural junction illegal building detection method and system
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a method and a system for detecting urban and rural junction illegal buildings.
Background
The complex geographic environment of the urban and rural junction gives rise to diverse building forms; the buildings are mainly small-property-right houses and are densely distributed, so illegal building types are varied and hard to discover. Construction is a continuous process: if violations can be found and stopped in time during construction, disputes can be prevented and citizens' property losses reduced. Besides reports from the public and field inspection by management personnel, methods for checking illegal buildings include combining satellite remote sensing, unmanned aerial vehicle (UAV) aerial photography, or fixed-point surveillance video with deep learning.
Satellite remote sensing has a long monitoring period and poor timeliness, is restricted by weather, and suffers from poor image quality, so its detection accuracy is low. With the development of UAV aerial photography, urban illegal building detection based on UAVs has gradually become mainstream; however, it requires large investments of manpower and material resources and, being weather-restricted, cannot monitor a specific area in real time. Compared with satellite remote sensing and UAV aerial photography, fixed-point surveillance video is less affected by weather, has low monitoring cost and supports real-time monitoring, making it more suitable for real-time monitoring of illegal buildings in urban and rural junction areas.
RetinaNet is a one-stage detector proposed by Tsung-Yi Lin et al. in 2017 to verify the effectiveness of Focal Loss. The RetinaNet network structure is shown in fig. 1. As the figure shows, the RetinaNet algorithm mainly comprises a backbone network and classification and regression sub-networks, where the backbone consists of a ResNet feature extraction network and an FPN (Feature Pyramid Network, comprising the five layers P1-P5) feature fusion network. The ResNet feature extraction network is usually ResNet50 or ResNet101, with C1-C5 as the output feature layers of its stages; the C3-C5 output features are fused by the FPN for multi-scale feature extraction. The feature maps are then fed to the predictor for class judgment and position regression; the feature map output by the classification sub-network has size W×H×KA, where K is the number of predicted classes, A = 9 is the number of anchor boxes generated per feature point, and W and H are the width and height of the input image.
By dynamically adjusting parameters, the RetinaNet algorithm makes the model focus more on hard-to-classify samples and improves the detection accuracy for under-represented classes of urban and rural illegal buildings, but it still misses small-target objects at the edges of images in the data set.
Disclosure of Invention
The invention provides a method and a system for detecting illegal buildings at the urban and rural junction, which solve the technical problem of how to improve the detection accuracy for illegal buildings at the urban and rural junction.
In order to solve the above technical problem, the invention provides a method for detecting illegal buildings at the urban and rural junction, comprising the following steps:
S1, improving the RetinaNet model to construct an illegal building detection model;
The illegal building detection model comprises a backbone network and a classification regression sub-network; the backbone network comprises a ResNet feature extraction network and a feature fusion network. The ResNet feature extraction network comprises five sequentially connected modules, Stage0, Stage1, Stage2, Stage3 and Stage4; the output features of the Stage2, Stage3 and Stage4 modules are fed into the feature fusion network for multi-scale feature fusion, and the fused feature map is fed into the classification regression sub-network for class judgment and position regression;
the Stage0 module preprocesses the input image and feeds it, after a max pooling layer, into the Stage1 module; the Stage1 module comprises one first BlockA module, N1 BlockB modules and one attention module, N1 ≥ 2; the Stage2 module comprises one second BlockA module, N2 BlockB modules and one attention module, N2 ≥ 2; the Stage3 module comprises one second BlockA module, N3 BlockB modules and one parallel dilated convolution module, N3 ≥ 2; the Stage4 module comprises one second BlockA module and N4 BlockB modules, N4 ≥ 2;
the first BlockA module and the second BlockA module share the BlockA structure but use different convolution strides; a BlockA module downsamples the input feature map according to its convolution stride to change the feature map scale; a BlockB module has convolution stride 1 and changes only the number of channels, leaving the width and height of the input feature map unchanged;
the attention module encodes and retains positional information of the features along the horizontal and vertical directions respectively;
the parallel dilated convolution module extracts multi-scale features using several parallel dilated convolutions;
S2, constructing a data set;
S3, training and verifying the illegal building detection model with the data set;
S4, deploying the illegal building detection model obtained by training.
Further, the parallel dilated convolution module performs the following operations:
first, the C×H×W Input feature has its channel count changed by a 1×1 convolution; BN normalization alone yields feature I, while BN normalization plus a nonlinear activation function yields, in parallel, feature A of a first scale, feature B of a second scale and feature C of a third scale;
then, a first dilated convolution module convolves feature A, followed by BN normalization and a nonlinear activation function, giving feature D; a second dilated convolution module convolves feature B in the same way, giving feature E; a third dilated convolution module convolves feature C in the same way, giving feature F;
then, a Concat operation along the depth direction combines the multi-scale features D, E and F into the multi-scale fusion feature G;
finally, the multi-scale fusion feature G and the feature I are added, and a nonlinear activation function produces the C×H×W Output feature.
Further, the first, second and third dilated convolution modules all use 3×3 convolution kernels with stride 1; their padding values are 1, 2 and 5 respectively.
Further, the attention module performs the following operations:
first, the C×H×W Input feature is average-pooled along the horizontal and vertical directions to obtain output features of size C×H×1 and C×1×W respectively, where C, H and W denote the number of channels, the height and the width;
then, the horizontal C×H×1 output feature is dimension-transformed and concatenated (Concat) with the vertical C×1×W output feature to obtain a combined feature of size C×1×(W+H);
then, the C×1×(W+H) combined feature is convolved, BN-normalized and passed through a nonlinear activation function to obtain an output feature of size C/r×1×(W+H), where r is the reduction ratio;
then, the C/r×1×(W+H) output feature is transformed back along the spatial dimension, by convolution, into C×H×1 and C×1×W features, and a sigmoid activation yields the corresponding attention weights g_h and g_w;
finally, the C×H×W Input feature is multiplied by the attention weights g_h and g_w to obtain the C×H×W Output feature.
Further, the BlockA module performs the following operations:
first, the Input feature undergoes a 1×1 convolution with stride 1, BN normalization and a nonlinear activation function, then a 3×3 convolution with stride S and BN normalization, giving a first intermediate feature; meanwhile, the Input feature undergoes a 1×1 convolution with stride S, BN normalization and a nonlinear activation function, giving a second intermediate feature;
then, the first intermediate feature and the second intermediate feature are added, and a nonlinear activation function produces the Output feature.
Further, in the first BlockA module the stride S is set to 1 (S1 = 1); in the second BlockA module it is set to 2 (S2 = 2).
Further, the BlockB module performs the following operations:
first, the Input feature undergoes a 1×1 convolution with stride 1, BN normalization and a nonlinear activation function, then a 3×3 convolution with stride 1, BN normalization and a nonlinear activation function, then a 1×1 convolution with stride 1 and BN normalization, giving a first intermediate feature;
then, the first intermediate feature and the Input feature are added, and a nonlinear activation function produces the Output feature.
Further, the nonlinear activation functions used in the ResNet feature extraction network are all GELU activation functions; N1 = 2, N2 = 3, N3 = 5, N4 = 2.
Further, step S2 specifically comprises the steps of:
S21, acquiring real-time images of urban and rural illegal buildings captured by high-definition surveillance cameras, cutting each real-time image at fixed positions into several images of the same size, removing images containing no detection target information, and constructing an urban and rural illegal building image data set;
S22, resizing network-downloaded images to match the images cut in step S21, and adding them to the urban and rural illegal building image data set to obtain the original data set;
S23, expanding and annotating the original data set to obtain a data set containing 9 classes of features: bricks, steel bars, people, color-steel buildings, house frames, brick structures, farms, adobe houses and abandoned houses.
The invention also provides a system for detecting illegal buildings at the urban and rural junction, comprising an illegal building detection model construction unit, a data set construction unit, an illegal building detection model training unit and an illegal building detection model application unit, which execute steps S1, S2, S3 and S4 of the above method respectively.
According to the urban and rural junction illegal building detection method and system provided by the invention, to enhance the algorithm's perception of small targets in complex scenes, attention modules are added to the Stage1 and Stage2 modules (stages) of the ResNet feature extraction network, and a parallel dilated convolution module is added to the Stage3 module to perform multi-scale feature fusion and obtain global feature information, improving the algorithm's detection accuracy on urban and rural illegal buildings; to avoid gradient explosion during training, the activation function in the network is replaced with GELU. Experiments show that the improved illegal building detection model reaches an average precision of 93.28% on urban and rural illegal buildings while using 16,247,552 fewer parameters, verifying the feasibility and effectiveness of the method and system.
Drawings
FIG. 1 is a structure diagram of the RetinaNet network described in the background of the invention;
FIG. 2 is a block diagram of an improved ResNet network provided by an embodiment of the present invention;
FIG. 3 is a block diagram of an attention module provided by an embodiment of the present invention;
FIG. 4 is a structure diagram of the parallel dilated convolution module provided by an embodiment of the present invention;
FIG. 5 is a block diagram of BlockA modules provided in an embodiment of the present invention;
FIG. 6 is a block diagram of BlockB modules provided in an embodiment of the present invention;
FIG. 7 is an exemplary view of image cutting provided by an embodiment of the present invention;
FIG. 8 is an exemplary diagram of 9 kinds of images in a dataset provided by an embodiment of the present invention;
FIG. 9 is a cluster diagram of urban and rural junction violation building labeling frames provided by an embodiment of the invention;
FIG. 10 is a graph comparing training-loss convergence using the ReLU and GELU activation functions according to an embodiment of the present invention;
FIG. 11 is a comparison of model mAP in training provided by an embodiment of the invention;
FIG. 12 is a comparison of the detection results of the original RetinaNet model and the improved RetinaNet model on a first example picture, provided by an embodiment of the present invention;
FIG. 13 is the same comparison on a second example picture;
FIG. 14 is the same comparison on a third example picture;
FIG. 15 is the same comparison on a fourth example picture.
Detailed Description
The following examples are given for the purpose of illustration only and are not to be construed as limiting the invention; the drawings are for reference and description only and do not limit the scope of the invention, since many variations are possible without departing from its spirit and scope.
The RetinaNet algorithm uses Focal Loss to re-weight classified samples so that hard-to-classify samples stand out in the loss function, handling class imbalance better. Focal Loss is computed as follows:

FL(p_t) = -α_t·(1 - p_t)^γ·log(p_t) (1)

In Eq. (1), α_t ∈ [0,1] is a weighting factor, (1 - p_t)^γ is a modulation factor, p_t is the model's predicted probability for a sample, and γ ∈ [0,5] is a focusing parameter. A p_t close to 1 means the sample is classified accurately and the corresponding loss is small; a small p_t marks a hard-to-classify sample. Increasing γ enlarges the loss of hard samples, so the model pays more attention to them, improving model performance.
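As a concrete illustration, the following is a minimal sketch of Eq. (1) in PyTorch, assuming sigmoid outputs and binary targets; the defaults alpha = 0.25 and gamma = 2 are the common RetinaNet settings, not values stated in this text.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t), per Eq. (1)."""
    p = torch.sigmoid(logits)
    # binary cross-entropy supplies the -log(p_t) term
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = p * targets + (1 - p) * (1 - targets)              # probability of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)  # class-dependent weighting factor
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()
```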
By dynamically adjusting parameters, the RetinaNet algorithm makes the model focus more on hard-to-classify samples and improves the detection accuracy for under-represented classes of urban and rural illegal buildings, but it still misses small-target objects at the edges of images in the data set. To address this problem, an embodiment of the invention provides a method for detecting illegal buildings at the urban and rural junction, comprising the following steps:
S1, improving the RetinaNet model to construct an illegal building detection model;
S2, constructing a data set;
S3, training and verifying the illegal building detection model with the data set;
S4, deploying the illegal building detection model obtained by training.
(1) Step S1
The RetinaNet model of fig. 1 is improved to obtain the illegal building detection model whose structure is shown in fig. 2. The model comprises a backbone network and a classification regression sub-network; the backbone network comprises a ResNet feature extraction network and a feature fusion network. The ResNet feature extraction network comprises five sequentially connected modules (or five stages), Stage0, Stage1, Stage2, Stage3 and Stage4; the output features of Stage2, Stage3 and Stage4 are fed into the feature fusion network for multi-scale feature fusion, and the resulting fused feature map is fed into the classification regression sub-network for class judgment and position regression.
In the method, an attention module is added to Stage1 and to Stage2, and a parallel dilated convolution module is added to Stage3, to improve the algorithm's detection accuracy on urban and rural illegal buildings. Stage0 preprocesses the input image, and its output features pass through a max pooling layer before entering Stage1; apart from the added attention modules and parallel dilated convolution module, each of Stage1 to Stage4 comprises only one BlockA module and several BlockB modules.
The Stage0 module preprocesses the Input image with a convolution (kernel size K = 7×7, stride s = 2, padding 3), BN normalization and the nonlinear activation function GELU, then feeds it through a max pooling layer (MaxPooling, kernel 3×3, stride 2) into the Stage1 module. The Stage1 module comprises one first BlockA module, N1 BlockB modules and one attention module, N1 ≥ 2. The Stage2 module comprises one second BlockA module, N2 BlockB modules and one attention module, N2 ≥ 2. The Stage3 module comprises one second BlockA module, N3 BlockB modules and one parallel dilated convolution module, N3 ≥ 2. The Stage4 module comprises one second BlockA module and N4 BlockB modules, N4 ≥ 2. In this embodiment, N1 = 2, N2 = 3, N3 = 5, N4 = 2. In addition, in the Stage4 module, the feature map output by the last BlockB module passes through an average pooling layer (AveragePooling) and a fully connected layer (FC) to produce the final output feature.
The first BlockA module and the second BlockA module share the BlockA structure but use different convolution strides. Stage1 does not need to change the size of the input feature map, so the stride S in its BlockA is set to 1; in Stage2 to Stage4 the stride is set to 2, halving the width and height of the input features. BlockA and BlockB are two different residual structures: the BlockA residual block downsamples the input feature map according to its convolution stride to change the feature map scale, while the BlockB residual block has convolution stride 1 and changes only the channel count, leaving the width and height of the input feature map unchanged.
In the RetinaNet network architecture shown in fig. 1, the ResNet feature extraction network of the invention is modified from ResNet50. The output feature layers of Stage0 through Stage4 correspond to the C1-C5 layers, of which C1 and C2 are the main shallow feature layers. The RetinaNet algorithm starts multi-scale feature fusion from the C3 layer to reduce the algorithm's parameter count. The data set used here contains many small objects to be detected; because small targets occupy little of the image and their features are not obvious, the relevant semantic information is easily lost during network training, making it hard for the model to localize small targets accurately. To improve the RetinaNet algorithm's detection accuracy on small-target illegal buildings, the invention adds attention modules to Stage1 and Stage2 of the ResNet feature extraction network, making the network model focus more on the shallow C2 feature information and enhancing the algorithm's perception of small targets. A further layer of parallel dilated convolution is added to Stage3 to fuse the attention module's output features at multiple scales and obtain global context information, further improving the algorithm's detection accuracy.
A network structure diagram of the attention module is shown in fig. 3. The attention module encodes and retains positional information of the features along the horizontal (X) and vertical (Y) directions. The input features are average-pooled along the two directions to obtain the respective output features. After dimension transformation, the horizontal and vertical output features are combined with a Concat operation, merging the width and height features. The combined features are then convolved, BN-normalized and passed through a nonlinear activation function (GELU) to obtain output features of dimension C/r×1×(W+H), where the reduction ratio r reduces the algorithm's complexity (r = 16 here). The features are then transformed back along the spatial dimension into C×H×1 and C×1×W by convolution, and a sigmoid activation yields the corresponding attention weights g_h and g_w. The Input features are multiplied by the weights g_h and g_w to obtain the final output features.
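A minimal PyTorch sketch of this horizontal/vertical attention block follows; the class name, the channel floor of 8 and the exact tensor layout are illustrative assumptions, while the pooling, Concat, C/r bottleneck (r = 16) and the sigmoid weights g_h and g_w follow the description above.

```python
import torch
import torch.nn as nn

class CoordAttention(nn.Module):
    """Horizontal/vertical attention: pool, concat, C/r bottleneck, re-weight."""
    def __init__(self, channels, r=16):
        super().__init__()
        mid = max(8, channels // r)                      # C/r width (floor of 8 assumed)
        self.conv1 = nn.Conv2d(channels, mid, kernel_size=1)
        self.bn = nn.BatchNorm2d(mid)
        self.act = nn.GELU()
        self.conv_h = nn.Conv2d(mid, channels, kernel_size=1)
        self.conv_w = nn.Conv2d(mid, channels, kernel_size=1)

    def forward(self, x):                                # x: (B, C, H, W)
        b, c, h, w = x.shape
        x_h = x.mean(dim=3, keepdim=True)                # (B, C, H, 1), pooled along W
        x_w = x.mean(dim=2, keepdim=True)                # (B, C, 1, W), pooled along H
        y = torch.cat([x_h.permute(0, 1, 3, 2), x_w], dim=3)   # (B, C, 1, H+W)
        y = self.act(self.bn(self.conv1(y)))             # (B, C/r, 1, H+W)
        y_h, y_w = torch.split(y, [h, w], dim=3)
        g_h = torch.sigmoid(self.conv_h(y_h.permute(0, 1, 3, 2)))  # (B, C, H, 1)
        g_w = torch.sigmoid(self.conv_w(y_w))                      # (B, C, 1, W)
        return x * g_h * g_w                             # re-weight the input feature
```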
The network structure of the parallel dilated convolution module is shown in fig. 4. The input feature's channel count is changed by a 1×1 convolution; BN normalization alone yields feature I, while BN normalization plus a nonlinear activation function (GELU) yields features A, B and C. Dilated convolutions are applied to A, B and C respectively, each followed by BN normalization and GELU, producing the multi-scale output features D, E and F. A Concat operation along the depth direction combines D, E and F into the multi-scale fusion feature G, and an add operation on G and feature I produces the final Output feature.
The module uses 3 parallel dilated convolutions, all with 3×3 kernels and stride 1, and with padding values 1, 2 and 5 respectively. Compared with ordinary convolution, dilated convolution enlarges the receptive field but can lose local information; following the hybrid dilated convolution (HDC) principle, the dilation rates R are set to 1, 2 and 5.
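The sketch below shows one way to realize the module in PyTorch under the stated kernels, strides and dilation/padding rates (1, 2, 5); the per-branch channel width and the final 1×1 fusion convolution that restores C channels before the add are assumptions, since the text does not fix the branch widths.

```python
import torch
import torch.nn as nn

class ParallelDilatedConv(nn.Module):
    """Three parallel 3x3 dilated branches (rates 1, 2, 5), Concat, fuse, add to I."""
    def __init__(self, channels, branch_channels=None):
        super().__init__()
        bc = branch_channels or channels // 4            # branch width: an assumption

        def proj(out_ch, act):                           # 1x1 conv + BN (+ GELU)
            layers = [nn.Conv2d(channels, out_ch, 1), nn.BatchNorm2d(out_ch)]
            if act:
                layers.append(nn.GELU())
            return nn.Sequential(*layers)

        def branch(d):                                   # 3x3, stride 1, padding = dilation
            return nn.Sequential(
                nn.Conv2d(bc, bc, 3, stride=1, padding=d, dilation=d),
                nn.BatchNorm2d(bc), nn.GELU())

        self.proj_i = proj(channels, act=False)          # feature I: BN only
        self.proj_a, self.proj_b, self.proj_c = proj(bc, True), proj(bc, True), proj(bc, True)
        self.b1, self.b2, self.b3 = branch(1), branch(2), branch(5)
        self.fuse = nn.Conv2d(3 * bc, channels, 1)       # restore C channels before the add
        self.act = nn.GELU()

    def forward(self, x):
        i = self.proj_i(x)
        d, e, f = self.b1(self.proj_a(x)), self.b2(self.proj_b(x)), self.b3(self.proj_c(x))
        g = self.fuse(torch.cat([d, e, f], dim=1))       # multi-scale fusion feature G
        return self.act(g + i)                           # add, then nonlinear activation
```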
The network structure of the BlockA module is shown in fig. 5; the BlockA module performs the following operations:
first, the Input feature undergoes a 1×1 convolution with stride 1, BN normalization and a nonlinear activation function, then a 3×3 convolution with stride S and BN normalization, giving a first intermediate feature; meanwhile, the Input feature undergoes a 1×1 convolution with stride S, BN normalization and a nonlinear activation function, giving a second intermediate feature;
then, the first intermediate feature and the second intermediate feature are added, and a nonlinear activation function produces the Output feature.
In the first BlockA module (Stage1) the stride S is set to 1; in the second BlockA module (Stage2, Stage3, Stage4) it is set to 2.
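A sketch of BlockA under this reading (the same stride S applied to the 3×3 convolution and the 1×1 projection shortcut) is given below; the translation is ambiguous on whether the shortcut branch carries an activation before the add, and this sketch follows the common ResNet convention of BN only.

```python
import torch.nn as nn

class BlockA(nn.Module):
    """Downsampling residual block: main 1x1(s=1) -> 3x3(s=S); shortcut 1x1(s=S)."""
    def __init__(self, in_ch, out_ch, stride):
        super().__init__()
        self.main = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 1, stride=1), nn.BatchNorm2d(out_ch), nn.GELU(),
            nn.Conv2d(out_ch, out_ch, 3, stride=stride, padding=1), nn.BatchNorm2d(out_ch))
        self.shortcut = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 1, stride=stride), nn.BatchNorm2d(out_ch))
        self.act = nn.GELU()

    def forward(self, x):
        return self.act(self.main(x) + self.shortcut(x))  # add, then activation
```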
The network structure of the BlockB module is shown in fig. 6; the BlockB module performs the following operations:
first, the Input feature undergoes a 1×1 convolution with stride 1, BN normalization and a nonlinear activation function, then a 3×3 convolution with stride 1, BN normalization and a nonlinear activation function, then a 1×1 convolution with stride 1 and BN normalization, giving a first intermediate feature;
then, the first intermediate feature and the Input feature are added, and a nonlinear activation function produces the Output feature.
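A corresponding sketch of BlockB, a stride-1 bottleneck with an identity shortcut; the bottleneck width (channels // 4) is an assumption borrowed from standard ResNet practice.

```python
import torch.nn as nn

class BlockB(nn.Module):
    """Stride-1 bottleneck residual block with an identity shortcut."""
    def __init__(self, channels, mid_channels=None):
        super().__init__()
        mid = mid_channels or channels // 4
        self.main = nn.Sequential(
            nn.Conv2d(channels, mid, 1), nn.BatchNorm2d(mid), nn.GELU(),
            nn.Conv2d(mid, mid, 3, padding=1), nn.BatchNorm2d(mid), nn.GELU(),
            nn.Conv2d(mid, channels, 1), nn.BatchNorm2d(channels))
        self.act = nn.GELU()

    def forward(self, x):
        return self.act(self.main(x) + x)                # identity shortcut, then activation
```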
(2) Step S2
The step S2 specifically comprises the steps of:
S21, acquiring real-time images of urban and rural illegal buildings captured by high-definition surveillance cameras, cutting each real-time image at fixed positions into several images of the same size, removing images containing no detection target information, and constructing an urban and rural illegal building image data set;
S22, resizing network-downloaded images to match the images cut in step S21, and adding them to the urban and rural illegal building image data set to obtain the original data set;
S23, expanding and annotating the original data set to obtain a data set containing 9 classes of features: bricks, steel bars, people, color-steel buildings, house frames, brick structures, farms, adobe houses and abandoned houses.
Since no rural illegal building data set is publicly available at present, most of the experimental data are real-time images captured by high-definition surveillance cameras mounted on 16 signal towers on the periphery of a city; some scene pictures are high-altitude images downloaded from the web. The images acquired in real time are 1920×1080. The input image size must be kept consistent when training the network model, and to enlarge the proportion of relevant illegal building features in each image, every 1920×1080 image is cut at fixed positions into 8 images of 640×540; images containing no detection target information are removed, producing the urban and rural illegal building image data set. Fig. 7 shows the cutting scheme: cutting two rows by three columns yields 6 pictures, and 2 further pictures are cut from the center. In fig. 7, the text in the upper left corner denotes preset point 3, the upper right corner the timestamp 2023-04-02 12:46:59, and the lower left corner the camera model, IP PTZ Camera.
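A sketch of the fixed-position cutting, assuming the 6 grid crops tile the frame in two rows by three columns and the 2 extra crops sit at the horizontal centre of each row; the exact centre offsets are not given in the text.

```python
from PIL import Image

def cut_image(path):
    """Cut a 1920x1080 frame into 8 fixed-position 640x540 crops."""
    img = Image.open(path)
    crops = []
    for row in range(2):                     # 2 rows x 3 columns = 6 grid crops
        for col in range(3):
            x, y = col * 640, row * 540
            crops.append(img.crop((x, y, x + 640, y + 540)))
    for row in range(2):                     # 2 extra centre crops (position assumed)
        x = (1920 - 640) // 2
        crops.append(img.crop((x, row * 540, x + 640, (row + 1) * 540)))
    return crops
```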
The network-downloaded images are resized to 640×540 and added to the urban and rural illegal building image data set. The constructed original data set contains 526 images of detection targets in total; it is expanded to 3156 images by translation, noise addition, random brightness adjustment, random rotation and random point addition to improve the algorithm's accuracy. The data set contains 9 classes of features: ZK (brick), GJ (steel bar), Person (person), CGJZ (color-steel building), FWKJ (house frame), ZTJG (brick structure), YZC (farm), TPF (adobe house) and FQFW (abandoned house); the per-class sample counts are shown in Table 1, and examples of each class are shown in fig. 8.
Table 1: Data set classes and sample counts per class
Cluster analysis of the labeling boxes of the urban and rural illegal building data set shows that it contains a large number of small targets to be detected. The cluster map of the labeling boxes is shown in fig. 9.
Table 1 shows that the sample counts differ greatly between classes, i.e. the data set suffers from class imbalance. OHEM (online hard example mining) and Focal Loss are the methods generally adopted in deep learning to address class imbalance. OHEM dynamically selects hard samples by computing the loss of each picture's regions of interest, making the model focus on hard samples and improving model performance. OHEM can mitigate sample imbalance, but it also increases computational complexity and slows model training, raises the risk of overfitting, and, by making the model focus on hard samples, neglects some easy samples and degrades the model's overall effect. Focal Loss addresses class imbalance from the perspective of classification difficulty; it is therefore adopted here to handle the imbalanced data set.
(3) Step S3
The experiment-related configuration is shown in Table 2.
Table 2: Experimental environment configuration
The mean average precision mAP is used as the model's main evaluation metric, with the parameter count as an auxiliary metric. The formulas related to mAP are as follows:

P = TP / (TP + FP) (2)
R = TP / (TP + FN) (3)
AP = ∫₀¹ P(R) dR (4)
mAP = (1/n)·Σ APᵢ (5)

where P is the precision, R the recall, TP the number of correctly classified positive samples, FP the number of negative samples identified as positive, and FN the number of positive samples identified as negative; AP is the average precision, i.e. the area under the curve with precision P on the Y axis and recall R on the X axis; n is the number of sample classes; and mAP is the mean AP over the 9 illegal building classes.
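For illustration, Eqs. (2)-(5) can be computed as below; the TP/FP/FN counting and the sampling of the precision-recall curve are assumed to come from the surrounding evaluation code.

```python
def precision_recall(tp, fp, fn):
    p = tp / (tp + fp)          # Eq. (2)
    r = tp / (tp + fn)          # Eq. (3)
    return p, r

def average_precision(precision, recall):
    # Eq. (4): trapezoidal area under P(R), points sorted by increasing recall
    return sum((recall[i] - recall[i - 1]) * (precision[i] + precision[i - 1]) / 2.0
               for i in range(1, len(recall)))

def mean_average_precision(ap_per_class):
    return sum(ap_per_class) / len(ap_per_class)   # Eq. (5), n = 9 classes here
```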
Training is set to 300 rounds, the input image size is 640×540, each batch processes 6 images, and the initial learning rate is 0.0001. The Adam optimizer is used to optimize the network parameters with the momentum parameter set to 0.9; the learning-rate decay uses a cosine descent schedule, and the weight parameters are saved every 10 training rounds. The original ResNet network uses the ReLU activation function, whose formula is:

ReLU(x) = max(0, x) (6)
The improved ResNet network model deepens the network; during the experiments, training the network with the ReLU activation function occasionally produced gradient explosion after 100 iterations, causing model training to fail. To address this, the activation function in the model is replaced with GELU (the Gaussian error linear unit), whose standard definition is:

GELU(x) = x·Φ(x) ≈ 0.5x·(1 + tanh[√(2/π)·(x + 0.044715x³)]) (7)

where Φ(x) is the cumulative distribution function of the standard normal distribution. Compared with ReLU, the GELU activation introduces a smooth nonlinear transformation, prevents weight parameters from going un-updated when the input is less than 0, and improves the model's generalization ability and convergence speed. Fig. 10 compares the loss convergence of the network trained with each of the two activation functions; it shows clearly that replacing ReLU with GELU accelerates model convergence and gives a smoother loss curve. Experiments prove that changing the activation function alleviates the model's gradient explosion problem but does not by itself improve detection accuracy.
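A minimal sketch of the activation swap on an existing backbone; the recursive replacement helper is an illustrative assumption, not code from the patent.

```python
import torch.nn as nn
import torchvision

def replace_relu_with_gelu(module: nn.Module) -> None:
    # recursively swap every ReLU submodule for GELU in place
    for name, child in module.named_children():
        if isinstance(child, nn.ReLU):
            setattr(module, name, nn.GELU())
        else:
            replace_relu_with_gelu(child)

backbone = torchvision.models.resnet50()
replace_relu_with_gelu(backbone)
```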
Adding extra layers to the backbone feature extraction network increases model complexity. Fig. 11 shows how mAP changes during training when different modules are added to the ResNet network. The average precision with the added feature extraction modules improves over the original model; because the pre-trained weights no longer match the changed network, the improved model needs more training rounds than the original to reach a stable state.
To verify the effect of the model improvements, the network models were trained with the same training parameters and verified on the test set; the results are shown in Table 3:
Table 3: Ablation experiments
According to Table 3, with ResNet50 as the baseline network in the comparison experiments, the improved network model raises the detection accuracy for rural illegal buildings over the original model. Adding the attention modules to Stage1 and Stage2 of the ResNet network raises precision by 1.46%; adding the parallel dilated convolution module to Stage3 raises model precision by 1.24%; adding the attention modules to Stage1 and Stage2 together with the parallel dilated convolution module in Stage3 raises model accuracy by 2.19%. When the RetinaNet algorithm uses ResNet50 as the feature extraction network, the model has 5.55×10⁷ parameters; the three modified ResNet variants used as feature extraction networks have 3.955×10⁷, 3.918×10⁷ and 3.924×10⁷ parameters respectively. Compared with the original model, the final improved RetinaNet network model has 16,247,552 fewer parameters.
Table 4 compares the detection results of different algorithms. The SSD algorithm has the fewest parameters but lower recognition accuracy on rural illegal buildings. YOLO-V3, a newer network structure published after RetinaNet, has more parameters and lower detection accuracy than the RetinaNet algorithm on this task. The improved RetinaNet algorithm raises the detection accuracy on rural and urban illegal buildings while reducing the algorithm's parameter count.
Table 4: Comparison of the detection results of different algorithms
Figs. 12, 13, 14 and 15 compare the detection results of the original RetinaNet model and the improved RetinaNet model on four example pictures; in each figure, the left side shows the detection result of the original RetinaNet model and the right side that of the improved model. The comparison shows that the improved model detects small-target objects at the image edges with higher precision and a lower miss rate.
(4) Step S4
After model training and testing are completed, the model can be deployed directly on any deployable device to detect illegal buildings in input images.
Corresponding to the method, an embodiment of the invention also provides a system for detecting illegal buildings at the urban and rural junction, comprising an illegal building detection model construction unit, a data set construction unit, an illegal building detection model training unit and an illegal building detection model application unit, used to execute steps S1, S2, S3 and S4 of the method respectively. Each of the four units can operate independently of the others.
In summary, in the method and system provided by the embodiment of the invention, attention modules are added to Stage1 and Stage2 of the ResNet network to enhance the algorithm's perception of small-target features; a parallel dilated convolution module is added to Stage3 to obtain multi-scale fused features; and the activation function of the ResNet network is replaced with GELU, accelerating model convergence and alleviating gradient explosion during training. The improved RetinaNet algorithm reaches an average precision of 93.28% on urban and rural illegal buildings, 2.19% higher than the original model, while using 16,247,552 fewer parameters and missing fewer small targets, verifying the effectiveness of the model improvements.
The above examples are preferred embodiments of the present invention, but the embodiments are not limited thereto; any other change, modification, substitution, combination or simplification made without departing from the spirit and principle of the present invention is an equivalent replacement and is included in the protection scope of the present invention.

Claims (8)

1. A method for detecting illegal buildings at the urban and rural junction, characterized by comprising the following steps:
S1, improving the RetinaNet model to construct an illegal building detection model;
the illegal building detection model comprises a backbone network and a classification regression sub-network, the backbone network comprising a ResNet feature extraction network and a feature fusion network; the ResNet feature extraction network comprises five sequentially connected modules, Stage0, Stage1, Stage2, Stage3 and Stage4, wherein the output features of the Stage2, Stage3 and Stage4 modules are fed into the feature fusion network for multi-scale feature fusion, and the fused feature map is fed into the classification regression sub-network for class judgment and position regression;
the Stage0 module preprocesses the input image and feeds it, after a max pooling layer, into the Stage1 module; the Stage1 module comprises one first BlockA module, N1 BlockB modules and one attention module, N1 ≥ 2; the Stage2 module comprises one second BlockA module, N2 BlockB modules and one attention module, N2 ≥ 2; the Stage3 module comprises one second BlockA module, N3 BlockB modules and one parallel dilated convolution module, N3 ≥ 2; the Stage4 module comprises one second BlockA module and N4 BlockB modules, N4 ≥ 2;
the first BlockA module and the second BlockA module share the BlockA structure but use different convolution strides; a BlockA module downsamples the input feature map according to its convolution stride to change the feature map scale; a BlockB module has convolution stride 1 and changes only the number of channels, leaving the width and height of the input feature map unchanged;
the attention module encodes and retains positional information of the features along the horizontal and vertical directions respectively; the attention module performs the following operations:
first, the C×H×W Input feature is average-pooled along the horizontal and vertical directions to obtain output features of size C×H×1 and C×1×W respectively, where C, H and W denote the number of channels, the height and the width;
then, the horizontal C×H×1 output feature is dimension-transformed and concatenated (Concat) with the vertical C×1×W output feature to obtain a combined feature of size C×1×(W+H);
then, the C×1×(W+H) combined feature is convolved, BN-normalized and passed through a nonlinear activation function to obtain an output feature of size C/r×1×(W+H), where r is the reduction ratio;
then, the C/r×1×(W+H) output feature is transformed back along the spatial dimension, by convolution, into C×H×1 and C×1×W features, and a sigmoid activation yields the corresponding attention weights g_h and g_w;
finally, the C×H×W Input feature is multiplied by the attention weights g_h and g_w to obtain the C×H×W Output feature;
the parallel dilated convolution module extracts multi-scale features using several parallel dilated convolutions;
the parallel dilated convolution module performs the following operations:
first, the C×H×W Input feature has its channel count changed by a 1×1 convolution; BN normalization alone yields feature I, while BN normalization plus a nonlinear activation function yields, in parallel, feature A of a first scale, feature B of a second scale and feature C of a third scale;
then, a first dilated convolution module convolves feature A, followed by BN normalization and a nonlinear activation function, giving feature D; a second dilated convolution module convolves feature B in the same way, giving feature E; a third dilated convolution module convolves feature C in the same way, giving feature F;
then, a Concat operation along the depth direction combines the multi-scale features D, E and F into the multi-scale fusion feature G;
finally, the multi-scale fusion feature G and the feature I are added, and a nonlinear activation function produces the C×H×W Output feature;
S2, constructing a data set;
S3, training and verifying the illegal building detection model with the data set;
S4, deploying the illegal building detection model obtained by training.
2. The method for detecting illegal buildings at the urban and rural junction according to claim 1, wherein the first, second and third dilated convolution modules all use 3×3 convolution kernels with stride 1, and their padding values are 1, 2 and 5 respectively.
3. The method for detecting illegal buildings at the urban and rural junction according to claim 1, wherein the BlockA module performs the following operations:
first, the Input feature undergoes a 1×1 convolution with stride 1, BN normalization and a nonlinear activation function, then a 3×3 convolution with stride S and BN normalization, giving a first intermediate feature; meanwhile, the Input feature undergoes a 1×1 convolution with stride S, BN normalization and a nonlinear activation function, giving a second intermediate feature;
then, the first intermediate feature and the second intermediate feature are added, and a nonlinear activation function produces the Output feature.
4. The method for detecting illegal buildings at the urban and rural junction according to claim 3, wherein in the first BlockA module the stride S is set to 1 (S1 = 1), and in the second BlockA module it is set to 2 (S2 = 2).
5. The method for detecting illegal buildings at the urban and rural junction according to claim 1, wherein the BlockB module performs the following operations:
first, the Input feature undergoes a 1×1 convolution with stride 1, BN normalization and a nonlinear activation function, then a 3×3 convolution with stride 1, BN normalization and a nonlinear activation function, then a 1×1 convolution with stride 1 and BN normalization, giving a first intermediate feature;
then, the first intermediate feature and the Input feature are added, and a nonlinear activation function produces the Output feature.
6. The method for detecting illegal buildings at the urban and rural junction according to any one of claims 1 to 5, wherein the nonlinear activation functions used in the ResNet feature extraction network are all GELU activation functions; N1 = 2, N2 = 3, N3 = 5, N4 = 2.
7. The method for detecting illegal buildings at the urban and rural junction according to claim 1, wherein step S2 specifically comprises the steps of:
S21, acquiring real-time images of urban and rural illegal buildings captured by high-definition surveillance cameras, cutting each real-time image at fixed positions into several images of the same size, removing images containing no detection target information, and constructing an urban and rural illegal building image data set;
S22, resizing network-downloaded images to match the images cut in step S21, and adding them to the urban and rural illegal building image data set to obtain the original data set;
S23, expanding and annotating the original data set to obtain a data set containing 9 classes of features: bricks, steel bars, people, color-steel buildings, house frames, brick structures, farms, adobe houses and abandoned houses.
8. A system for detecting illegal buildings at the urban and rural junction, characterized by comprising an illegal building detection model construction unit, a data set construction unit, an illegal building detection model training unit and an illegal building detection model application unit, used respectively to execute steps S1, S2, S3 and S4 of the method for detecting illegal buildings at the urban and rural junction according to any one of claims 1 to 7.
CN202410276667.8A, filed 2024-03-12 (priority date 2024-03-12): Urban and rural junction illegal building detection method and system. Granted as CN118015469B (Active).

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410276667.8A CN118015469B (en) 2024-03-12 2024-03-12 Urban and rural junction illegal building detection method and system

Publications (2)

Publication Number Publication Date
CN118015469A CN118015469A (en) 2024-05-10
CN118015469B true CN118015469B (en) 2024-09-10

Family

ID=90945280


Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114565860A (en) * 2022-03-01 2022-05-31 安徽大学 Multi-dimensional reinforcement learning synthetic aperture radar image target detection method
CN115527095A (en) * 2022-10-29 2022-12-27 西安电子科技大学 Multi-scale target detection method based on combined recursive feature pyramid

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114283345A (en) * 2021-12-30 2022-04-05 武汉大学 Small sample city remote sensing image information extraction method based on meta-learning and attention
CN117475296A (en) * 2023-08-16 2024-01-30 中国铁塔股份有限公司重庆市分公司 Real-time monitoring method and system for rural illegal building


Also Published As

Publication number Publication date
CN118015469A (en) 2024-05-10


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant