CN115223017B - Multi-scale feature fusion bridge detection method based on depth separable convolution - Google Patents

Multi-scale feature fusion bridge detection method based on depth separable convolution

Info

Publication number
CN115223017B
Authority
CN
China
Prior art keywords
convolution
bridge
feature
detection
scale
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210610157.0A
Other languages
Chinese (zh)
Other versions
CN115223017A (en)
Inventor
黄亮
孙宇
赵俊三
唐伯惠
陈国坤
李小祥
裘木兰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Kunming University of Science and Technology
Original Assignee
Kunming University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Kunming University of Science and Technology filed Critical Kunming University of Science and Technology
Priority to CN202210610157.0A priority Critical patent/CN115223017B/en
Publication of CN115223017A publication Critical patent/CN115223017A/en
Application granted granted Critical
Publication of CN115223017B publication Critical patent/CN115223017B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Multimedia (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a multi-scale feature fusion bridge detection method based on depth separable convolution. First, a trunk feature extraction network is constructed with depth separable convolution to extract bridge features; second, multi-branch parallel cavity (dilated) convolution is applied to the last feature layer to obtain multi-scale receptive fields, so that bridges of different scales are matched better and multi-scale bridge features are extracted; then, bridge detail and semantic information at different depths are fully utilized, and three effective bridge feature layers of different levels are fused across levels through a multi-scale feature pyramid; finally, the bridge detection results are tested and the accuracy is evaluated. With the invention, the mAP reaches 94.26% and the FPS reaches 60.04, leading most mainstream target detection networks in both precision and speed, and the method can be integrated into a mobile terminal to perform high-precision, rapid bridge detection tasks; the network parameters are greatly reduced, the computational cost is lowered, the running speed of the network is improved, and the detection capability for multi-scale bridges is enhanced.

Description

Multi-scale feature fusion bridge detection method based on depth separable convolution
Technical Field
The invention belongs to the technical field of bridge detection, and particularly relates to a multi-scale feature fusion bridge detection method based on depth separable convolution.
Background
Bridges are important transportation facilities and key transportation junctions between land and water. As urbanization accelerates, bridges play an increasingly important role in urban planning and construction and have become an indispensable part of it. As large man-made ground objects, bridges are among the objects that change most readily in a geographic database, and keeping them updated and maintained guides urban planning and construction. Detecting bridges automatically with image processing technology can increase detection speed and improve detection accuracy, and has broad development prospects and important research significance in both military and civil applications. High spatial resolution remote sensing images (High Spatial Resolution Remote Sensing Images, HSRRSIs) play important roles in industry, agriculture, the military, the economy and other fields and have become an important data source for target detection. However, affected by environmental factors and imaging conditions, bridge targets in HSRRSIs have widely varying backgrounds and obvious differences in shape, so it is difficult to distinguish bridges in different HSRRSIs using a unified set of features; meanwhile, different bridges in the same HSRRSI can differ greatly in size, which easily leads to an imbalance of positive and negative samples and further increases the difficulty of bridge detection. Therefore, research on high-precision and rapid detection of bridge targets is of important significance.
Currently, bridge detection methods can be broadly divided into the following two categories. 1) Bridge detection methods based on traditional techniques. These mainly rely on manually selected features followed by sliding-window detection. In real scenes, however, feature extraction is difficult because it depends on certain specific features (such as water bodies and shorelines), or imaging-condition limitations and subjective human factors cause false detections and missed detections of bridges. For example, Fan Lisheng et al. proposed a cross-entropy-based feature extraction and river-region target detection method that can be used for bridge detection in river regions, but its feature parameters depend too heavily on water-body characteristics under specific conditions, so its detection robustness for bridges in HSRRSIs with different backgrounds is poor. Jiangmei et al. proposed a multi-source remote sensing image fusion method for automatic bridge-target detection that can effectively detect bridges in complex large-scale scenes, but it requires fusing near-infrared, panchromatic and SAR images; acquiring such different data for the same area is difficult and laborious, so efficient automatic bridge detection is hard to achieve. G. Sithole et al. proposed a method for detecting bridges in laser scanning data that uses topology information to identify bridge seed points and a threshold on those seed points to detect individual bridges; it can effectively detect bridges of different shapes, but a poorly chosen threshold may mistakenly classify river banks as bridges and affect positioning accuracy. Chaudhuri et al. proposed a method for detecting water bridges from multispectral images in which the image is classified into water, concrete and background, but it cannot solve the imbalance of positive and negative samples, and classification errors occur when the bridge target is very small or the image contains noise. Huang Yong et al. proposed a scene-semantic SAR image bridge detection algorithm that can effectively suppress coherent speckle noise and reduce missed and false detections of bridges, but it is only effective for bridges over water. 2) Bridge detection methods based on deep learning. For example, L. Chen et al. proposed a bridge detection network based on multi-resolution balance and an attention mechanism that can effectively solve the bridge detection problem in SAR images, but the model is complex and the accuracy is low. Zhou Xing proposed an optical remote sensing image bridge detection method based on a dual-attention mechanism that effectively addresses low target detection accuracy under complex backgrounds, but its detection speed still needs improvement.
At present, target detection with deep learning has become a research hotspot, but reports on bridge detection remain relatively few. Deep-learning-based target detection methods fall mainly into two types. One type is region-proposal-based target detection, also known as two-stage algorithms, such as R-CNN and Fast R-CNN; two-stage algorithms have higher accuracy, but the candidate-region extraction process is complex, computationally expensive and slow. The other type is regression-based target detection, also known as one-stage algorithms, such as SSD and YOLO; SSD is fast but detects small targets poorly, whereas the YOLO algorithm extracts features more comprehensively and offers both high accuracy and high detection speed. Nevertheless, the above methods often cannot achieve both speed and accuracy at once.
Therefore, in order to solve the above-mentioned problem, a multi-scale feature fusion bridge detection method based on depth separable convolution is proposed herein.
Disclosure of Invention
To solve the above technical problems, the invention provides a multi-scale feature fusion bridge detection method based on depth separable convolution. First, a backbone feature extraction network is built with depth separable convolution to extract bridge features; second, multi-branch parallel cavity convolution is applied to the last feature layer to obtain multi-scale receptive fields, so that bridges of different scales are matched better and multi-scale bridge features are extracted; then, bridge detail and semantic information at different depths are fully utilized, and three effective bridge feature layers of different levels are fused across levels through a multi-scale feature pyramid; finally, the bridge detection results are tested and the accuracy is evaluated. The mAP reaches 94.26% and the FPS reaches 60.04, leading most mainstream target detection networks in both precision and speed, and the method can be integrated into a mobile terminal to perform high-precision, rapid bridge detection tasks.
In order to achieve the technical effects, the invention is realized by the following technical scheme: a multi-scale feature fusion bridge detection method based on depth separable convolution is characterized by comprising the following steps:
step1: constructing a bridge feature extraction network by utilizing depth separable convolution, reducing network parameters and compressing a network model;
step2: applying multi-branch parallel cavity convolution to enlarge receptive fields on the final layer of bridge feature map, and further extracting features of bridges with different scales;
step3: the multi-scale feature fusion pyramid is utilized to realize cross-level bridge feature map fusion, and the details and semantic information of different feature maps of the bridge are fully utilized;
step4: and outputting a bridge detection result through the detection head.
In Step1, the convolutional neural network is used as the preferred choice for extracting target features. When conventional convolution is used for bridge detection, the input feature map of each channel is convolved with the corresponding convolution kernel, and the results are summed and output. For a D_F × D_F × M bridge input image, N standard convolution kernels of size D_K × D_K × M perform the convolution operation, where M is the number of input channels and N is the number of convolution kernels, i.e. the number of output channels. With a stride of 1 and padding, the output feature map has size D_F × D_F × N and the computational cost is:

P_1 = D_F × D_F × D_K × D_K × M × N    (1)

Further, depth separable convolution decomposes the conventional convolution into two processes: layer-by-layer (depthwise) convolution and point-by-point (pointwise) convolution.

The layer-by-layer convolution is a convolution that does not cross channels: each channel of the feature map corresponds to an independent convolution kernel, each kernel acts on only one specific channel, and the number of channels of the output feature map equals that of the input feature map. For a D_F × D_F × M bridge input image, M convolution kernels are used, the convolution is computed only within each channel without mixing information between channels, and M feature maps are finally output. The computational cost of the layer-by-layer convolution is therefore:

P_2 = D_F × D_F × M × D_K × D_K    (2)

The point-by-point convolution is used for feature combination and dimension change: a 1×1 convolution traverses the features of every point and gathers the spatial information of all channels. Each point-by-point convolution layer is followed by a BN layer and a ReLU layer, which effectively increases the nonlinearity of the model and enhances its generalization capability. For the output feature map of the layer-by-layer convolution, the point-by-point convolution performs the convolution with N kernels of size 1×1×M, and the final output feature map has size D_F × D_F × N, with a computational cost of:

P_3 = D_F × D_F × M × N    (3)

The ratio of the computational cost of the depth separable convolution to that of the conventional convolution is:

(P_2 + P_3) / P_1 = (D_F × D_F × M × D_K × D_K + D_F × D_F × M × N) / (D_F × D_F × D_K × D_K × M × N) = 1/N + 1/D_K²    (4)

N is the number of output channels and is usually large, so the term 1/N is negligible. Taking D_K = 3 as an example, the computational cost of the depth separable convolution is only about 1/9 of that of the conventional convolution, which improves the operational efficiency of the model.
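A minimal numeric sketch (not part of the patent) can confirm the cost comparison in equations (1)-(4); the feature-map and kernel sizes below are hypothetical.

```python
def standard_conv_cost(df, dk, m, n):
    # Equation (1): P1 = D_F * D_F * D_K * D_K * M * N
    return df * df * dk * dk * m * n

def separable_conv_cost(df, dk, m, n):
    # Equations (2) + (3): depthwise cost plus pointwise cost
    return df * df * m * dk * dk + df * df * m * n

df, dk, m, n = 52, 3, 128, 256            # hypothetical sizes, not from the patent
ratio = separable_conv_cost(df, dk, m, n) / standard_conv_cost(df, dk, m, n)
print(round(ratio, 4), round(1 / n + 1 / dk ** 2, 4))   # both are about 0.115, i.e. 1/N + 1/D_K^2
```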
Further, the multi-branch parallel cavity convolution in Step2 enlarges the receptive field; the receptive field denotes the spatial extent of the input image that corresponds to a unit pixel on the output feature map. Cavity (dilated) convolution can obtain receptive fields of different scales, thereby addressing the relatively low detection precision of multi-scale bridges.

Cavity convolution introduces a cavity-rate parameter into conventional convolution; the cavity rate is the spacing between the units of the convolution kernel, and the cavity rate of conventional convolution is 1. The convolution kernel size after adding cavities is:
k'=n×(k-1)+1 (5)
size of receptive field after hole convolution:
r=[(n-1)×(k+1)+k]×[(n-1)×(k+1)+k] (6)
wherein k' is the convolution kernel size after the cavity is added, n is the cavity rate, and k is the conventional convolution kernel size;
In order to extract the features of the cavity part, locate the position information of bridges of different scales more accurately and further save computing resources, conventional convolutions and cavity convolutions of different scales are connected in series, drawing on the idea of the Inception structure, and these are then connected in parallel to form a group of convolution modules with an asymmetric structure, which guarantees that the feature maps output by the parallel branches have consistent dimensions.

Further, in Step2, to reduce the computational load, the three parallel branches first use 1×1 convolutions to reduce the number of channels; to meet the requirements of detecting targets of different sizes, the three parallel branches adopt convolution kernels of two sizes, 3×3, 3×3 and 5×5 respectively, with corresponding cavity rates of 1, 3 and 5. The conventional convolutions can extract the features of the cavity part, so continuous information is retained while receptive fields of different sizes are obtained; finally, the feature maps of different scales are concatenated along the channel dimension, added to a shortcut edge of the input feature map, and output.
Further, in Step3, in the multi-scale cross-layer feature pyramid structure, the trunk feature extraction network comprises six main convolution modules that yield six feature maps of different sizes; the channel numbers of the final three feature maps P1, P2 and P3 are adjusted with 1×1 convolutions to obtain feature layers P1_in, P2_in and P3_in. P3_in is upsampled and then stacked with P2_in to obtain P2_m, and P2_m is upsampled and then stacked with P1_in to obtain P1_out; P1_out is downsampled and then stacked with P2_in and P2_m to obtain P2_out, and P2_out is downsampled and then stacked with P3_in to obtain P3_out. All feature maps undergo feature fusion in these operations and all of them contribute to the multi-scale feature fusion; the input and output feature maps of the same scale are connected directly, so that richer features can be fused; finally, stacking the feature maps multiple times gives the pyramid a more powerful feature representation capability.
The beneficial effects of the invention are as follows:
(1) The backbone network is built by utilizing the depth separable convolution, so that network parameters are greatly reduced, the operation cost is reduced, and meanwhile, the running speed of the network is improved, thereby providing possibility for real-time and efficient bridge detection;
(2) Introducing multi-branch parallel cavity convolution to obtain receptive fields with different sizes, reserving detail information of small targets, and improving the detection capability of the multi-scale bridge;
(3) Utilizing a multi-scale feature pyramid to fully utilize bridge feature information in different layers of feature graphs to realize cross-layer network feature fusion;
(4) The mAP reaches 94.26% and the FPS reaches 60.04, leading most mainstream target detection networks in both precision and speed, and the method can be integrated into a mobile terminal to perform high-precision, rapid bridge detection tasks.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed for the description of the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of the present invention;
FIG. 2 is a schematic diagram of a conventional convolution;
FIG. 3 is a depth separable convolution of the present invention;
FIG. 4 is a diagram of a bridge feature extraction network architecture of the present invention;
FIG. 5 is a receptive field corresponding to objects of different dimensions;
FIG. 6 is a schematic diagram of conventional and hole convolution;
FIG. 7 is a multi-branch parallel hole convolution;
FIG. 8 is a multi-scale cross-layer feature pyramid;
FIG. 9 is a dataset sample example;
FIG. 10 is a 416 pixel by 416 pixel bridge image;
FIG. 11 is a graph showing the results of conventional bridge inspection;
FIG. 12 is a graph showing the detection results of bridges with large scale differences in the same image;
FIG. 13 is a graph showing the detection results of a multi-scale bridge in a large-format HSRRSIs;
FIG. 14 is a small scale bridge inspection result;
FIG. 15 is a high aspect ratio bridge inspection result;
fig. 16 shows the results of the cross-island bridge inspection.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Example 1
Referring to fig. 1 to 8, a multi-scale feature fusion bridge detection method based on depth separable convolution is characterized in that: firstly, constructing a bridge feature extraction network by utilizing depth separable convolution, so as to achieve the purposes of reducing network parameters and compressing a network model; secondly, multi-branch parallel cavity convolution is applied to the final layer of bridge characteristic diagram to enlarge the receptive field, and the characteristics of bridges with different scales are further extracted; then, realizing cross-level bridge feature map fusion by utilizing a multi-scale feature fusion pyramid, and fully utilizing the details and semantic information of different feature maps of the bridge; and finally outputting a bridge detection result through the detection head. The specific flow of the proposed method is shown in figure 1.
Bridge feature extraction network built with depth separable convolution
The most direct way to improve the performance of a bridge detection network is to increase its depth and width, but as the network deepens, the gradient explosion problem during back-propagation arises, the parameters the network must learn become ever larger, and this huge parameter count easily causes over-fitting, which harms bridge detection performance. Moreover, a complex network structure with numerous parameters places high demands on computer hardware and severely limits the amount of data that can be fed into the network in one batch.
Convolutional neural networks, as the preferred choice for extracting target features, have been widely used in various target detection methods. When conventional convolution is used for bridge detection, the input feature map of each channel is convolved with the corresponding convolution kernel, and the results are summed and output; this process is shown in Fig. 2. For a D_F × D_F × M bridge input image, N standard convolution kernels of size D_K × D_K × M perform the convolution operation, where M is the number of input channels and N is the number of convolution kernels, i.e. the number of output channels. With a stride of 1 and padding, the output feature map has size D_F × D_F × N and the computational cost is:

P_1 = D_F × D_F × D_K × D_K × M × N    (1)

Depth separable convolution decomposes conventional convolution into two processes, layer-by-layer (depthwise) convolution and point-by-point (pointwise) convolution, as shown in Fig. 3. The layer-by-layer convolution is a convolution that does not cross channels: each channel of the feature map corresponds to an independent convolution kernel, each kernel acts on only one specific channel, and the number of channels of the output feature map equals that of the input feature map. For a D_F × D_F × M bridge input image, M convolution kernels are used, the convolution is computed only within each channel without mixing information between channels, and M feature maps are finally output. The computational cost of the layer-by-layer convolution is therefore:

P_2 = D_F × D_F × M × D_K × D_K    (2)

The point-by-point convolution is used for feature combination and dimension change: a 1×1 convolution traverses the features of every point and gathers the spatial information of all channels, realizing cross-channel information integration and solving the problem that the channel features are separated from one another in the layer-by-layer convolution. In addition, each point-by-point convolution layer is followed by a BN layer and a ReLU layer, which effectively increases the nonlinearity of the model and enhances its generalization capability. For the output feature map of the layer-by-layer convolution, the point-by-point convolution performs the convolution with N kernels of size 1×1×M, and the final output feature map has size D_F × D_F × N, with a computational cost of:

P_3 = D_F × D_F × M × N    (3)

The ratio of the computational cost of the depth separable convolution to that of the conventional convolution is:

(P_2 + P_3) / P_1 = (D_F × D_F × M × D_K × D_K + D_F × D_F × M × N) / (D_F × D_F × D_K × D_K × M × N) = 1/N + 1/D_K²    (4)

N is the number of output channels and is usually large, so the term 1/N is negligible. Taking D_K = 3 as an example, the computational cost of the depth separable convolution is only about 1/9 of that of conventional convolution, which greatly improves the operational efficiency of the model.
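For illustration, the following PyTorch sketch (not taken from the patent; the channel sizes, the ReLU activation and the printed comparison are assumptions for this example) builds one depthwise separable convolution block and compares its parameter count with a standard convolution of the same shape.

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """Layer-by-layer (depthwise) + point-by-point (1x1) convolution,
    each followed by BN and an activation, as described above."""
    def __init__(self, in_ch, out_ch, k=3, stride=1):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, k, stride, padding=k // 2,
                                   groups=in_ch, bias=False)      # one kernel per channel
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1, bias=False)  # cross-channel mixing
        self.bn1, self.bn2 = nn.BatchNorm2d(in_ch), nn.BatchNorm2d(out_ch)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        x = self.act(self.bn1(self.depthwise(x)))
        return self.act(self.bn2(self.pointwise(x)))

# Parameter comparison against a standard convolution with the same shapes
std = nn.Conv2d(128, 256, 3, padding=1, bias=False)
sep = DepthwiseSeparableConv(128, 256)
count = lambda m: sum(p.numel() for p in m.parameters())
print(count(std), count(sep))   # 294912 vs. roughly one eighth of that
```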
To reduce the number of parameters and the amount of computation, a bridge feature extraction network combining depth separable convolution and CSPDarknet53 (Depthwise Separable Convolution-CSPDarknet53, DSC-CSPDarknet53) is proposed; it is mainly composed of a convolution-normalization-activation module (Convolution + Batch Normalization + Mish, CBM) and Cross Stage Partial modules (CSPX), as shown in Fig. 4. The CBM module consists of a convolution layer followed by batch normalization and a Mish activation function, where the convolution layer is composed of a layer-by-layer convolution and a point-by-point convolution, i.e. a depth separable convolution layer. The CSPX module splits the bridge feature map into two parts: the first part passes through a CBM module and X residual components (Res units); the second part is combined with the first directly by Concat. DSC-CSPDarknet53 integrates the gradient changes into the bridge feature map, effectively strengthening the learning capacity of the network; all convolution layers adopt depth separable convolution, which reduces the computational load while maintaining high accuracy.
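A minimal sketch of the CBM and CSPX building blocks described above is given below; it is an illustration rather than the patent's implementation, so the channel sizes, the residual-unit layout and the use of torch.nn.Mish (available in recent PyTorch versions) are assumptions.

```python
import torch
import torch.nn as nn

class CBM(nn.Module):
    """Convolution + Batch Normalization + Mish, with the convolution realised
    as a depth separable convolution (depthwise followed by pointwise)."""
    def __init__(self, in_ch, out_ch, k=3, stride=1):
        super().__init__()
        self.dw = nn.Conv2d(in_ch, in_ch, k, stride, k // 2, groups=in_ch, bias=False)
        self.pw = nn.Conv2d(in_ch, out_ch, 1, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.Mish()

    def forward(self, x):
        return self.act(self.bn(self.pw(self.dw(x))))

class ResUnit(nn.Module):
    """Residual component (Res unit) used inside CSPX."""
    def __init__(self, ch):
        super().__init__()
        self.block = nn.Sequential(CBM(ch, ch, k=1), CBM(ch, ch, k=3))

    def forward(self, x):
        return x + self.block(x)

class CSPX(nn.Module):
    """Cross Stage Partial module: one branch passes through a CBM and X
    residual units, the other is a shortcut re-joined by channel Concat."""
    def __init__(self, ch, x_units=1):
        super().__init__()
        self.branch = nn.Sequential(CBM(ch, ch), *[ResUnit(ch) for _ in range(x_units)])
        self.fuse = CBM(2 * ch, ch, k=1)

    def forward(self, x):
        return self.fuse(torch.cat([self.branch(x), x], dim=1))
```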
Multi-branch parallel cavity convolution to obtain multi-scale receptive field
The receptive field denotes the spatial extent of the input image that corresponds to a unit pixel on the output feature map. In a CNN, the size of the receptive field is directly determined by the size of the convolution kernel, and the receptive field size directly affects the detection of targets of different scales. Therefore, the single-scale receptive field produced by a single-size convolution kernel cannot satisfy the detection of multi-scale bridges in the same image. Although operations such as downsampling and pooling can effectively enlarge the receptive field, they reduce the spatial resolution, so small-scale bridge information cannot be reconstructed. Cavity (dilated) convolution, by contrast, can obtain receptive fields of different scales and thereby address the relatively low detection precision of multi-scale bridges.

A 1×1 convolution kernel produces a receptive field of size 1×1, which is suitable for detecting small bridges, as shown by the red box on the right of Fig. 5; however, a 1×1 receptive field can hardly cover a large bridge, which, as shown by the yellow box on the left of Fig. 5, can only be detected with a 7×7 receptive field. If a 7×7 receptive field is used over the whole image, a large amount of irrelevant background is included when detecting small targets, so the detail information of small bridges is lost, which is not conducive to acquiring the target features.

Cavity convolution introduces a cavity-rate parameter, namely the spacing between the units of the convolution kernel, into conventional convolution; the cavity rate of conventional convolution is 1. The left panel of Fig. 6 illustrates conventional convolution and the right panel illustrates cavity convolution. The convolution kernel size after adding cavities is:
k'=n×(k-1)+1 (5)
size of receptive field after hole convolution:
r=[(n-1)×(k+1)+k]×[(n-1)×(k+1)+k] (6)
where k' is the convolution kernel size after adding cavities, n is the cavity rate, and k is the conventional convolution kernel size. For the conventional convolution in Fig. 6, the kernel size is 3×3 and the cavity rate is 1; the resulting receptive field, shown as the blue part of the figure, is 3×3, and the pixels that participate in the calculation, i.e. those that carry weights, are the 9 red dots in the figure. For the cavity convolution in Fig. 6, the kernel size after adding cavities is 5×5 and the cavity rate is 2; the receptive field size is 7×7, yet the pixels that participate in the calculation, i.e. those that carry weights, are still only the 9 red points in the figure, the weights of all other points being 0. Cavity convolution can therefore alleviate the loss of spatial information in the feature maps caused by pooling, and can enlarge the receptive field without adding extra parameters or computation.
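The following small PyTorch check (an illustration, not from the patent) verifies equation (5) and the weight count discussed above for a 3×3 kernel with cavity rate 2.

```python
import torch
import torch.nn as nn

def dilated_kernel_size(k, n):
    # Equation (5): k' = n * (k - 1) + 1
    return n * (k - 1) + 1

# A 3x3 kernel with dilation (cavity rate) 2 covers a 5x5 window but still
# uses only 9 weights, matching the description of Fig. 6.
conv = nn.Conv2d(1, 1, kernel_size=3, dilation=2, padding=2, bias=False)
print(dilated_kernel_size(3, 2))                  # 5
print(sum(p.numel() for p in conv.parameters()))  # 9
x = torch.randn(1, 1, 13, 13)
print(conv(x).shape)                              # padding=2 keeps the 13x13 size
```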
Because the cavity part of a cavity convolution does not take part in the sampling operation, the extracted bridge information is discontinuous. In order to extract the features of the cavity part, locate the position information of bridges of different scales more accurately and further save computing resources, conventional convolutions and cavity convolutions of different scales are connected in series, drawing on the idea of the Inception structure, and these are then connected in parallel to form a group of convolution modules with an asymmetric structure, which guarantees that the feature maps output by the parallel branches have consistent dimensions. Fig. 7 is a schematic diagram of the multi-scale parallel cavity convolution.

To reduce the computational load, the three parallel branches first use 1×1 convolutions to reduce the number of channels; then, to meet the requirements of detecting targets of different sizes, the three parallel branches adopt convolution kernels of two sizes, 3×3, 3×3 and 5×5 respectively, with corresponding cavity rates of 1, 3 and 5. The conventional convolutions can extract the features of the cavity part, so continuous information is retained while receptive fields of different sizes are obtained; finally, the feature maps of different scales are concatenated along the channel dimension, added to a shortcut edge of the input feature map, and output.
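The sketch below illustrates one way such a three-branch module could look in PyTorch; the channel split, the exact serial composition of convolutions inside each branch and the final 1×1 projection are assumptions made for this example rather than details given in the patent.

```python
import torch
import torch.nn as nn

class MultiBranchDilatedBlock(nn.Module):
    """Three parallel branches: 1x1 channel reduction, then a 3x3 / 3x3 / 5x5
    kernel with cavity rate 1 / 3 / 5; outputs are concatenated and added to
    a shortcut of the input (channel split is an illustrative assumption)."""
    def __init__(self, ch):
        super().__init__()
        mid = ch // 3
        def branch(k, rate):
            pad = rate * (k - 1) // 2          # keeps the spatial size unchanged
            return nn.Sequential(
                nn.Conv2d(ch, mid, 1, bias=False),
                nn.Conv2d(mid, mid, k, padding=pad, dilation=rate, bias=False),
            )
        self.b1 = branch(3, 1)
        self.b2 = branch(3, 3)
        self.b3 = branch(5, 5)
        self.project = nn.Conv2d(3 * mid, ch, 1, bias=False)  # back to ch for the add

    def forward(self, x):
        y = torch.cat([self.b1(x), self.b2(x), self.b3(x)], dim=1)
        return self.project(y) + x             # shortcut edge of the input

x = torch.randn(1, 96, 13, 13)
print(MultiBranchDilatedBlock(96)(x).shape)    # torch.Size([1, 96, 13, 13])
```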
Multi-scale feature pyramid
Networks with good performance generally have deeper hierarchies, and feature maps at different depths within the network express features of different scales. However, their expressive power also differs with depth. For a bridge detection network, the lower-level features have higher resolution and contain more bridge position and detail information, but because they pass through fewer convolutions their semantic content is lower and they contain more noise. The high-level features have stronger bridge semantic information but low resolution and poor perception of detail. In other words, as the convolutional neural network deepens, abstract features become more and more prominent while shallow spatial information is gradually lost. Therefore, directly predicting bridge targets of different scales from feature maps at different depths of the network cannot yield good detection results; a cross-level feature pyramid must be constructed from feature maps of different depths to realize multi-scale feature fusion.
Fig. 8 is a multi-scale cross-layer feature pyramid structure diagram, where the main feature extraction network includes six main convolution modules to obtain six convolution graphs with different sizes, and the number of channels of the three final feature graphs P1, P2, and P3 is adjusted by using 1×1 convolution to obtain feature layers p1_in, p2_in, and p3_in. The P3_in is up-sampled and then stacked with the P2_in to obtain P2_m, and the P2_m is up-sampled and then stacked with the P1_in to obtain P1_out. P1_out is downsampled and then stacked with P2_in and P2_m to obtain P2_out, and P2_out is downsampled and then stacked with P3_in to obtain P3_out. All feature graphs are subjected to feature fusion in the operation, and all the feature graphs contribute to multi-scale feature fusion; secondly, a connection is directly constructed between the input feature map and the output feature map with the same scale, so that richer features can be fused; finally, stacking feature maps multiple times gives the pyramid more powerful feature representation capability.
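A compact PyTorch sketch of this fusion order is shown below; it is illustrative only: the output channel width, the nearest-neighbour upsampling, the max-pooling downsampling and the use of concatenation plus a 1×1 convolution to realise the "stacking" are assumptions, since the patent does not specify these details.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossLayerFPN(nn.Module):
    """Cross-layer fusion of P1, P2, P3 in the order described above."""
    def __init__(self, c1, c2, c3, out_ch=128):
        super().__init__()
        self.in1 = nn.Conv2d(c1, out_ch, 1)   # P1_in (largest map)
        self.in2 = nn.Conv2d(c2, out_ch, 1)   # P2_in
        self.in3 = nn.Conv2d(c3, out_ch, 1)   # P3_in (smallest map)
        self.m2   = nn.Conv2d(2 * out_ch, out_ch, 1)
        self.out1 = nn.Conv2d(2 * out_ch, out_ch, 1)
        self.out2 = nn.Conv2d(3 * out_ch, out_ch, 1)
        self.out3 = nn.Conv2d(2 * out_ch, out_ch, 1)

    def forward(self, p1, p2, p3):
        p1_in, p2_in, p3_in = self.in1(p1), self.in2(p2), self.in3(p3)
        up = lambda x: F.interpolate(x, scale_factor=2, mode="nearest")
        down = lambda x: F.max_pool2d(x, 2)
        p2_m   = self.m2(torch.cat([up(p3_in), p2_in], dim=1))
        p1_out = self.out1(torch.cat([up(p2_m), p1_in], dim=1))
        p2_out = self.out2(torch.cat([down(p1_out), p2_in, p2_m], dim=1))
        p3_out = self.out3(torch.cat([down(p2_out), p3_in], dim=1))
        return p1_out, p2_out, p3_out

# e.g. feature maps of 52x52, 26x26 and 13x13 as used by the detection heads
p1, p2, p3 = torch.randn(1, 256, 52, 52), torch.randn(1, 512, 26, 26), torch.randn(1, 1024, 13, 13)
print([o.shape for o in CrossLayerFPN(256, 512, 1024)(p1, p2, p3)])
```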
Example 2
Experimental data
The dataset is the bridge-target automatic recognition dataset of high-resolution visible-light images provided by the fourth "Zhongkexing Cup" high-resolution remote sensing image interpretation software competition. The dataset contains 2,000 remote sensing images captured by the Gaofen-2 satellite, with a resolution of 1 m after fusing the panchromatic and multispectral images; it comprises 1,686 bridge images of 668 pixels by 668 pixels and 314 of 1001 pixels by 1001 pixels, and each image contains at least one bridge target, mainly bridges over water.
Here, 6 bridge images under different conditions and scenes are selected from the test set for detection, as shown in Fig. 9. Fig. 9(a) shows a conventional bridge of 668 pixels by 668 pixels; the imaging is clear and the bridge is obvious. Fig. 9(b) contains several large bridge targets with considerable scale differences, with a size of 668 pixels by 668 pixels; the image background is simple, the imaging conditions are good and the bridge targets are obvious, but the size differences between them are large. Fig. 9(c) is a large-format remote sensing image of 1001 pixels by 1001 pixels; the background is complex and the format is large, but the image is clear overall, the bridge targets occupy few pixels and the positive and negative samples are extremely unbalanced. Fig. 9(d) is used for small-target bridge detection and measures 668 pixels by 668 pixels; the background is complex, there is some thin cloud coverage and the colour is uneven; the image contains 11 bridge targets, the largest of which is only 40 pixels by 40 pixels, so relative to the 668 pixel by 668 pixel image they can be regarded as small targets. Fig. 9(e) shows bridges with large aspect ratios, with a size of 668 pixels by 668 pixels; the target boxes contain a large amount of background information, which increases the training difficulty. Fig. 9(f) shows an island-crossing bridge of 668 pixels by 668 pixels, which spans a river containing land-like areas such as shoals or islands.
Design of experiment
The experimental environment is based on the Windows 10 operating system; the computer is configured with an Intel(R) i7-9700K CPU and an NVIDIA GeForce GTX 1070Ti graphics card with 8 GB of video memory. Training and testing were performed on the GPU with PyTorch 1.2.0. During training, the learning rate is gradually decreased. A total of 100 epochs were trained: the first 50 epochs used an initial learning rate of 1×10⁻³ with the batch size set to 16, the last 50 epochs used a learning rate of 1×10⁻⁴ with the batch size set to 8, and the IoU threshold was set to 0.5. The dataset is divided into training, validation and test sets in the ratio 8:1:1. The training set undergoes online data augmentation by image rotation, scaling, translation, cropping and splicing, and the augmented images are uniformly scaled to 416 pixels by 416 pixels before being fed into the network. The experiments use mean average precision (mAP), Precision, Recall and F1-score as objective evaluation indices, and the detection speed is evaluated in frames per second (FPS).
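The two-phase schedule above can be captured in a few lines of configuration; the structure below is only a sketch of that schedule, not the patent's training script, and the dictionary layout is an assumption.

```python
# Hypothetical configuration mirroring the schedule described above.
train_config = {
    "input_size": (416, 416),
    "split": {"train": 0.8, "val": 0.1, "test": 0.1},
    "iou_threshold": 0.5,
    "phases": [
        {"epochs": 50, "learning_rate": 1e-3, "batch_size": 16},
        {"epochs": 50, "learning_rate": 1e-4, "batch_size": 8},
    ],
}

for phase in train_config["phases"]:
    # For each phase one would rebuild the DataLoader with phase["batch_size"],
    # set the optimiser's learning rate to phase["learning_rate"], and run
    # phase["epochs"] epochs of CIoU-loss training.
    print(phase)
```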
Network training uses the CIoU loss function, defined as follows:

L_CIoU = 1 − IoU + ρ²(b, b^gt)/c² + αv

where b denotes the prediction box and b^gt the ground-truth box; ρ(b, b^gt) is the Euclidean distance between the centre points of the prediction box and the ground-truth box; and c is the diagonal length of the smallest enclosing region that contains both the prediction box and the ground-truth box.

α is a weight parameter, defined as follows:

α = v / ((1 − IoU) + v)

v is used to measure the similarity of the aspect ratios and is defined as follows:

v = (4/π²) × (arctan(w^gt/h^gt) − arctan(w/h))²

where w and h denote the width and height of the prediction box, and w^gt and h^gt denote the width and height of the ground-truth box, respectively.
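For reference, a self-contained sketch of a CIoU loss along these lines is given below; it is an illustration consistent with the definitions above, not the patent's code, and the (cx, cy, w, h) box layout is an assumption.

```python
import math
import torch

def ciou_loss(pred, target, eps=1e-7):
    """CIoU loss for boxes given as (cx, cy, w, h)."""
    # corners of prediction and ground-truth boxes
    px1, py1 = pred[..., 0] - pred[..., 2] / 2, pred[..., 1] - pred[..., 3] / 2
    px2, py2 = pred[..., 0] + pred[..., 2] / 2, pred[..., 1] + pred[..., 3] / 2
    tx1, ty1 = target[..., 0] - target[..., 2] / 2, target[..., 1] - target[..., 3] / 2
    tx2, ty2 = target[..., 0] + target[..., 2] / 2, target[..., 1] + target[..., 3] / 2

    inter = (torch.min(px2, tx2) - torch.max(px1, tx1)).clamp(0) * \
            (torch.min(py2, ty2) - torch.max(py1, ty1)).clamp(0)
    union = pred[..., 2] * pred[..., 3] + target[..., 2] * target[..., 3] - inter + eps
    iou = inter / union

    # squared centre distance and diagonal of the smallest enclosing box
    rho2 = (pred[..., 0] - target[..., 0]) ** 2 + (pred[..., 1] - target[..., 1]) ** 2
    cw = torch.max(px2, tx2) - torch.min(px1, tx1)
    ch = torch.max(py2, ty2) - torch.min(py1, ty1)
    c2 = cw ** 2 + ch ** 2 + eps

    v = (4 / math.pi ** 2) * (torch.atan(target[..., 2] / target[..., 3]) -
                              torch.atan(pred[..., 2] / pred[..., 3])) ** 2
    alpha = v / (1 - iou + v + eps)
    return 1 - iou + rho2 / c2 + alpha * v
```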
The 416 pixel by 416 pixel bridge image shown in Fig. 10 is divided into S×S cells; the cell in which the centre of the object to be detected falls is responsible for detecting that object, and several anchor boxes are used on each cell of the feature map to predict the object's bounding box. The anchor boxes are bounding boxes obtained by clustering the shapes and sizes of the bridges in the dataset, and each bounding box comprises 4 coordinate values, a confidence and C conditional class probabilities.

b_x = σ(t_x) + c_x,  b_y = σ(t_y) + c_y,  b_w = p_w × e^(t_w),  b_h = p_h × e^(t_h)

where b_x, b_y, b_w and b_h are respectively the centre position, width and height of the bounding box; c_x and c_y are respectively the normalized distances of the current cell from the upper-left corner of the picture; p_w and p_h are respectively the width and height of the anchor box; and the sigmoid function σ relates the learned parameters t_x, t_y, t_w, t_h and t_o to the anchor-box coordinates, confidence and conditional class probabilities. The bridge feature extraction network performs bridge detection on feature maps of three scales, 52×52, 26×26 and 13×13. Finally, redundant bounding boxes are removed by non-maximum suppression to obtain the final bridge targets.
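A sketch of this standard YOLO-style decoding and non-maximum suppression step is shown below; the tensor layout, the thresholds and the use of torchvision.ops.nms are assumptions made for illustration, not details given in the patent.

```python
import torch
import torchvision

def decode_and_nms(raw, anchors, stride, conf_thr=0.5, iou_thr=0.5):
    """Decode t_x, t_y, t_w, t_h, t_o on one S x S grid and apply NMS.
    raw: (S, S, A, 5 + C) network output; anchors: (A, 2) float tensor in pixels."""
    S = raw.shape[0]
    grid_y, grid_x = torch.meshgrid(torch.arange(S), torch.arange(S), indexing="ij")
    bx = (torch.sigmoid(raw[..., 0]) + grid_x[..., None]) * stride   # b_x = sigma(t_x) + c_x
    by = (torch.sigmoid(raw[..., 1]) + grid_y[..., None]) * stride
    bw = anchors[:, 0] * torch.exp(raw[..., 2])                      # b_w = p_w * e^{t_w}
    bh = anchors[:, 1] * torch.exp(raw[..., 3])
    conf = torch.sigmoid(raw[..., 4])

    boxes = torch.stack([bx - bw / 2, by - bh / 2, bx + bw / 2, by + bh / 2], dim=-1)
    boxes, conf = boxes.reshape(-1, 4), conf.reshape(-1)
    keep = conf > conf_thr
    boxes, conf = boxes[keep], conf[keep]
    kept = torchvision.ops.nms(boxes, conf, iou_thr)   # remove redundant boxes
    return boxes[kept], conf[kept]
```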
The mAP is defined as the mean of the average precision (AP) over all target classes, where the AP of a class is obtained from its precision-recall curve.

Precision refers to the proportion of correct predictions among all samples predicted as positive, and is defined as:

Precision = TP / (TP + FP)

Recall refers to the proportion detected among all positive samples, and is defined as:

Recall = TP / (TP + FN)

The F1 score is the harmonic mean of Precision and Recall, defined as:

F1 = 2 × Precision × Recall / (Precision + Recall)

where TP denotes the correctly classified positive samples, i.e. the number of bridges detected correctly; FP denotes the misclassified negative samples, i.e. the number of non-bridge targets detected as bridges; and FN denotes the misclassified positive samples, i.e. the number of bridges detected as non-bridge targets. In the mAP computation, Precision is the precision for class-C targets, N_C is the number of pictures containing class-C targets, and N is the total number of pictures in the dataset.
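The per-image counts can be turned into these metrics with a few lines; the snippet below is a sketch, and the example counts are purely hypothetical.

```python
def precision_recall_f1(tp, fp, fn, eps=1e-12):
    """Precision, Recall and F1 from the counts defined above."""
    precision = tp / (tp + fp + eps)   # correct among all predicted positives
    recall = tp / (tp + fn + eps)      # detected among all real bridges
    f1 = 2 * precision * recall / (precision + recall + eps)
    return precision, recall, f1

# e.g. 180 bridges found correctly, 12 false alarms, 8 missed (hypothetical)
print(precision_recall_f1(180, 12, 8))
```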
Experimental results and analysis
To verify the light weight of the proposed method, the model complexity of several target detection methods whose trunk feature extraction networks all belong to the Darknet series is compared, namely YOLOv3, YOLOv4, YOLO-tiny, YOLO-lite and the proposed method. Their network structures are similar, and YOLO-tiny, YOLO-lite and the proposed method are lightweight networks. Table 1 compares the model complexity of the 5 methods: compared with the non-lightweight networks YOLOv3 and YOLOv4, the parameter quantity and model size of the proposed method are greatly reduced, and compared with the lightweight networks YOLO-tiny and YOLO-lite they are second best.
Table 1 model complexity comparison
Note: the bolded numbers in the table are the optimal results.
To compare the detection precision and speed of bridge detection methods, 7 methods with excellent and recent performance are selected for comparison experiments, namely EfficientDet, RetinaNet, CenterNet, YOLOv3, YOLOv4, YOLO-tiny and YOLO-lite. Table 2 shows the results obtained by training the bridge dataset with the 7 mainstream target detection algorithms and the proposed algorithm. As the table shows, the bridge detection mAP ranges from 61.09% to 94.26% and the FPS from 12.17 to 139.07. The method with the best FPS is the lightweight network YOLO-tiny, but its mAP is only 81.65%, which hardly meets the requirement of high-precision bridge detection. The mAP of the proposed algorithm is only 0.19% higher than that of YOLOv4, but its FPS is greatly improved and is second only to YOLO-tiny. Its Precision and Recall are superior to the five methods EfficientDet, RetinaNet, YOLOv3, YOLO-tiny and YOLO-lite, and its F1-score is the best among the 7 comparison methods. Combining the results of Table 1 and Table 2, the proposed algorithm therefore maintains high-precision detection while improving the detection speed; compared with the other 7 algorithms, its comprehensive detection capability is the best.
TABLE 2 mainstream target detection algorithm result analysis for bridge dataset
Note that: the bolded numbers in the table are the optimal results of the method
To verify the bridge detection capability of the proposed method in different scenes, 6 methods are selected for bridge detection. Among them, CenterNet is the fastest detection method among the non-lightweight networks, YOLOv3 and YOLOv4 are detection methods with higher precision, and YOLO-tiny and YOLO-lite are lightweight networks. Figs. 11 to 16 show the bridge detection results under different conditions and scenes.
(1) Routine bridge detection experiment
Fig. 11 shows the conventional bridge detection results. The accuracies of YOLOv3, YOLOv4 and the proposed method all reach 1, those of YOLO-tiny and YOLO-lite reach 0.94 and 0.81 respectively, and CenterNet's accuracy is lower than that of the other 5 methods, although it still detects the bridge successfully. All 6 methods detect the conventional bridge with high precision and accurate localization, without false or missed detections, showing good bridge detection capability.
(2) Multi-scale bridge detection experiment
Fig. 12 shows the detection results for several bridges with large scale differences in the same image. CenterNet, YOLOv3, YOLO-tiny and YOLO-lite lack multi-scale target detection capability: for multi-scale targets in the same image they detect only the large, obvious bridges. YOLOv4 can detect 3 bridge targets of different scales, but its overall accuracy is not high. The proposed method copes well with the multi-scale bridge detection task and achieves high-precision detection of both large and small targets in the same image.
(3) Large-breadth multi-scale bridge detection experiment
Fig. 13 shows the detection results for multi-scale bridges in large-format HSRRSIs. CenterNet and YOLO-tiny each have two missed detections and can only detect the larger bridge at the bottom of the image, with CenterNet also showing lower accuracy. YOLOv3, YOLOv4 and YOLO-lite each detect bridge targets with high accuracy but have one missed detection, failing to detect a bridge at the bottom of the image. The proposed method detects all bridge targets accurately and with higher precision.
(4) Small-scale bridge detection experiment
Fig. 14 shows the small-scale bridge detection results. YOLO-tiny has poor detection capability for small targets, detecting only one bridge target. CenterNet correctly detects 7 bridge targets, but its accuracy is low, mostly not exceeding 0.6. YOLO-lite is able to detect 11 bridge targets, but its overall accuracy is not high, mostly not exceeding 0.9. The proposed method is comparable to YOLOv3 and YOLOv4 in detection capability and accuracy, and detects the bridge targets with high precision and without false or missed detections.
(5) Bridge detection experiment with large length-width ratio
Fig. 15 shows the large aspect-ratio bridge detection results. CenterNet correctly detects only two bridge targets, misses one, and the accuracy of the correctly detected bridges is not high. YOLOv3 successfully detects three bridges with high accuracy, but the target box for the long bridge on the right is poorly fitted and cannot frame the whole bridge. YOLOv4 detects the bridges on the left and right well, but the middle bridge is falsely detected, with one bridge target detected twice. YOLO-tiny detects the two bridge targets on the right well but shows repeated detection on the left. YOLO-lite successfully detects the 3 bridge targets, but its overall accuracy is not high, with the two bridges on the left only around 0.7. The proposed method has the best detection capability and detection effect compared with the other 5 methods.
(6) Cross-island bridge detection experiment
Fig. 16 shows the island-crossing bridge detection results. CenterNet and YOLO-lite can detect the bridge target accurately, but their accuracy needs improvement. YOLOv3 detects three bridges, but two of them are false detections; YOLOv4 behaves the same, with two false detections. YOLO-tiny also detects three bridges, with one false detection and one repeated detection. Evidently, most bridge detection methods do not delimit island-crossing bridges well, and their detection is easily disturbed by land areas in the water. The proposed method has no false or missed detections and detects the bridge target in the image with high precision.
Through the 6 groups of comparison experiments, compared with a Centernet network with equivalent speed, the detection precision of the method is obviously improved; compared with YOLOv3 and YOLOv4, the detection speed is greatly improved under the condition that the detection accuracy is slightly improved; for the YOLO-tiny and the YOLO-lite which are both lightweight networks, the detection precision is improved, and the bridge detection in complex scenes can be well coped with. In conclusion, the method has strong generalization capability for bridge detection under various complex scenes, and can lead the current mainstream target detection method in detection speed or detection precision, so that the balance and optimization of speed and precision are achieved.
Example 3
Through a plurality of groups of comparison experiments, we obtain the following findings:
(1) Faster bridge detection networks such as CenterNet, YOLO-lite and YOLO-tiny, whether lightweight or not, have lower detection accuracy than slower bridge detection networks such as YOLOv3 and YOLOv4, and most of them detect unconventional bridges poorly. Therefore, merely compressing the trunk feature extraction network, or speeding up a bridge detection network only by reducing parameters, sacrifices detection precision heavily and cannot achieve efficient, high-precision detection. The proposed bridge detection method offers both high precision and high timeliness, which verifies that building the feature extraction network with depth separable convolution can retain the feature extraction capability of the network to the greatest extent without changing the network structure or the number of effective convolution layers, and that combining the multi-scale feature fusion pyramid with multi-branch parallel cavity convolution can further strengthen the feature extraction and fusion capability;
(2) In the multi-scale bridge detection experiment, all methods successfully detect the larger bridges, but CenterNet, YOLOv3, YOLO-lite and YOLO-tiny cannot detect the small-scale bridges, whereas YOLOv4 and the proposed method successfully detect the large bridge and all the small-scale bridges in the image, i.e. they possess multi-scale bridge detection capability. Therefore, the key to multi-scale bridge detection is to improve the detection of small-scale bridges as much as possible while maintaining the detection of large-scale bridges, and obtaining multi-scale receptive fields and superimposing different receptive fields can effectively solve this problem;
(3) For bridge detection in large-format HSRRSIs, the bridge targets occupy few pixels because of the large format and complex background, i.e. the positive and negative samples are severely unbalanced. The comparison experiments show that YOLOv3, YOLOv4, YOLO-lite and the proposed method perform better than CenterNet and YOLO-tiny; the first 4 methods all contain feature fusion modules of varying degrees, so feature fusion can effectively address this problem;
(4) In the island-crossing bridge detection experiment, YOLOv3, YOLOv4 and YOLO-tiny show false and missed detections of varying degrees, and the detection accuracy of CenterNet and YOLO-lite is low. The reason is that the feature extraction capability of these bridge detection methods is weak: a CNN is limited by its receptive field during feature extraction, so it extracts only local features of the bridge and cannot extract general features by combining the surrounding features or background information of the bridge, which makes bridges divided by islands difficult to detect.
In order to detect bridges in HSRRSIs efficiently and accurately, a multi-scale feature fusion bridge detection method based on depth separable convolution is proposed. The method mainly designs three modules: a trunk feature extraction network built with depth separable convolution, multi-branch parallel cavity convolution, and a cross-level feature fusion pyramid. The method has only 10.8 million parameters and a model size of only 41.2 MB, achieving effective compression of the network; its detection speed of 60.04 FPS allows real-time bridge detection. Its average detection precision of 94.26% is higher than that of most target detection networks; it has strong bridge detection capability, can handle bridge detection tasks in multi-scale, large-format, complex-background and other scenarios, its comprehensive indices are superior to those of other bridge detection methods, and it has strong practicability. Future work will continue to optimize the trunk feature extraction network to achieve faster and more accurate bridge detection.
In the description of the present specification, the descriptions of the terms "one embodiment," "example," "specific example," and the like, mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present invention. In this specification, schematic representations of the above terms do not necessarily refer to the same embodiments or examples. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
The preferred embodiments of the invention disclosed above are intended only to assist in the explanation of the invention. The preferred embodiments are not exhaustive or to limit the invention to the precise form disclosed. Obviously, many modifications and variations are possible in light of the above teaching. The embodiments were chosen and described in order to best explain the principles of the invention and the practical application, to thereby enable others skilled in the art to best understand and utilize the invention. The invention is limited only by the claims and the full scope and equivalents thereof.

Claims (3)

1. A multi-scale feature fusion bridge detection method based on depth separable convolution is characterized by comprising the following steps:
step1: constructing a bridge feature extraction network by utilizing depth separable convolution, reducing network parameters and compressing a network model;
step2: applying multi-branch parallel cavity convolution to enlarge receptive fields on the final layer of bridge feature map, and further extracting features of bridges with different scales;
the multi-branch parallel cavity convolution in Step2 enlarges the receptive field, and the receptive field represents the spatial range of the input image corresponding to the unit pixel on the output characteristic diagram; the method of cavity convolution can obtain receptive fields with different scales, thereby solving the problem of lower detection precision of the multi-scale bridge;
the cavity convolution is to introduce a parameter of cavity rate in the conventional convolution, wherein the cavity rate is the distance between each unit in the convolution kernel, and the cavity rate of the conventional convolution is 1; convolution kernel size after addition of holes:
k'=n×(k-1)+1 (5)
size of receptive field after hole convolution:
r=[(n-1)×(k+1)+k]×[(n-1)×(k+1)+k] (6)
wherein k' is the convolution kernel size after the cavity is added, n is the cavity rate, and k is the conventional convolution kernel size;
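As a purely illustrative check of formulas (5) and (6) (the numerical values are examples introduced here, not taken from the method): a conventional 3×3 kernel (k = 3) with a cavity rate n = 3 has, by formula (5), an equivalent kernel size k' = 3 × (3 - 1) + 1 = 7, covering the area of a 7×7 kernel while using only 9 weights, and, by formula (6), a receptive field of [(3 - 1) × (3 + 1) + 3] × [(3 - 1) × (3 + 1) + 3] = 11 × 11.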
in order to extract the characteristics of the hollow part, the position information of bridges with different dimensions is positioned more accurately, the calculation resources are further saved, the conventional convolution operations with different dimensions and the hollow convolution operations are connected in series by combining the concept of an admission structure, and then the convolution operations are connected in parallel to form a group of convolution modules with asymmetric structures, and the dimension consistency of the characteristic diagram output by each parallel branch can be ensured;
in Step2, in order to reduce the calculation amount, three parallel branches firstly adopt 1×1 convolution to reduce the channel number; in order to meet the requirements of target detection with different sizes, three parallel branches respectively adopt convolution kernels with two sizes of 3×3, 3×3 and 5×5, and the corresponding void ratios are respectively 1, 3 and 5; the characteristics of the cavity part can be extracted by conventional convolution, so that not only can the continuous information be obtained, but also the receptive fields with different sizes can be obtained; finally, carrying out channel splicing on the feature images with different scales, adding the feature images with short-circuit edges of the input feature images, and outputting the feature images;
step3: the multi-scale feature fusion pyramid is utilized to realize cross-level bridge feature map fusion, and the details and semantic information of different feature maps of the bridge are fully utilized;
in Step3, the multi-scale cross-layer feature pyramid structure comprises a main feature extraction network including six main convolution modules, six convolution graphs with different sizes are obtained, and the number of channels is adjusted by using 1×1 convolution of the final three feature graphs P1, P2 and P3 to obtain feature layers p1_in, p2_in and p3_in; the P3_in is up-sampled and then stacked with the P2_in to obtain P2_m, and the P2_m is up-sampled and then stacked with the P1_in to obtain P1_out; the P1_out is subjected to downsampling and then is stacked with the P2_in and the P2_m to obtain the P2_out, and the P2_out is subjected to downsampling and then is stacked with the P3_in to obtain the P3_out; all feature graphs are subjected to feature fusion in the operation, and all the feature graphs contribute to multi-scale feature fusion; the input feature map and the output feature map with the same scale are directly connected, so that richer features can be fused; finally, the feature graphs are stacked for a plurality of times, so that the pyramid has stronger feature representation capability;
step4: and outputting a bridge detection result through the detection head.
2. The multi-scale feature fusion bridge detection method based on depth separable convolution as recited in claim 1, wherein: in Step1, the convolutional neural network is used as the best choice for extracting target features, and bridge detection is performed with the conventional convolutions that make up the convolutional neural network; first, the input feature map of each channel is convolved with the corresponding convolution kernel, and then the results are summed and output; for a bridge input image of size D_F × D_F × M, the convolution operation is performed with N standard convolution kernels of size D_K × D_K × M, wherein M is the number of input channels and N is the number of convolution kernels, i.e. the number of output channels; when the standard convolution kernels are used with a stride of 1 and padding, the output feature map has size D_F × D_F × N and the amount of computation is:
P_1 = D_F × D_F × D_K × D_K × M × N (1).
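As a purely illustrative example of formula (1) (the values are arbitrary and not taken from the method): for a 52×52 feature map with M = 256 input channels, N = 512 output channels and a 3×3 kernel (D_F = 52, D_K = 3), P_1 = 52 × 52 × 3 × 3 × 256 × 512 ≈ 3.19 × 10^9 multiply-accumulate operations.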
3. The multi-scale feature fusion bridge detection method based on depth separable convolution as recited in claim 1, wherein: the depth separable convolution improves the conventional convolution by splitting it into two processes, a layer-by-layer (depthwise) convolution and a point-by-point (pointwise) convolution;
the layer-by-layer convolution is a convolution without crossing channels, each channel of the feature map corresponds to an independent convolution kernel in the process, each convolution kernel only acts on one specific channel, and the number of channels of the output feature map is equal to that of channels of the input feature map; for D F ×D F The method comprises the steps that (1) the bridge input image of the xM is convolved by using M convolution kernels, convolution calculation is only carried out in each channel, information among the channels is not added, and finally M feature images are output; the calculation amount of the layer-by-layer convolution is thus:
P_2 = D_F × D_F × M × D_K × D_K (2)
the point-by-point convolution is used for feature combination and dimension change, the features of each point are traversed through 1X 1 convolution, and the space information of a plurality of channels is collected; each point-by-point convolution layer is followed by a BN layer and a ReLU layer, so that nonlinear change of the model is effectively increased, and generalization capability of the model is enhanced; for the output feature map of the layer-by-layer convolution, the point-by-point convolution uses N convolution kernels with the size of 1 multiplied by M to carry out convolution operation, and finally the size of the output feature map is D F ×D F X N, the calculated amount obtained is:
P_3 = D_F × D_F × M × N (3)
the ratio of the depth separable convolution to the conventional computation is:
n is the number of output channels, and is usually largerNegligible; in D k For example, =3 =>I.e. the calculated amount of the depth separable convolution is only +.>The operation efficiency of the model is improved.
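For illustration only (not part of the claimed method), the following sketch checks formulas (1)-(4) numerically and builds a matching pair of PyTorch layers; PyTorch, the variable names and the example values of D_F, M, N and D_K are all assumptions introduced here:

import torch
import torch.nn as nn

# Count multiply-accumulate operations for a standard convolution versus a
# depth separable convolution; the sizes below are arbitrary example values.
D_F, M, N, D_K = 52, 256, 512, 3

standard_macs  = D_F * D_F * D_K * D_K * M * N             # formula (1)
depthwise_macs = D_F * D_F * M * D_K * D_K                 # formula (2)
pointwise_macs = D_F * D_F * M * N                         # formula (3)
ratio = (depthwise_macs + pointwise_macs) / standard_macs  # formula (4)
print(f"ratio = {ratio:.4f}  (1/N + 1/D_K^2 = {1/N + 1/D_K**2:.4f})")

# A matching pair of PyTorch modules (BN/ReLU omitted for brevity):
standard = nn.Conv2d(M, N, D_K, padding=1, bias=False)
separable = nn.Sequential(
    nn.Conv2d(M, M, D_K, padding=1, groups=M, bias=False),  # layer-by-layer (depthwise)
    nn.Conv2d(M, N, 1, bias=False),                          # point-by-point (pointwise)
)
x = torch.randn(1, M, D_F, D_F)
assert standard(x).shape == separable(x).shape  # both give 1 x N x D_F x D_F

n_std = sum(p.numel() for p in standard.parameters())
n_sep = sum(p.numel() for p in separable.parameters())
print(f"parameters: standard={n_std}, separable={n_sep}, ratio={n_sep/n_std:.4f}")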
CN202210610157.0A 2022-05-31 2022-05-31 Multi-scale feature fusion bridge detection method based on depth separable convolution Active CN115223017B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210610157.0A CN115223017B (en) 2022-05-31 2022-05-31 Multi-scale feature fusion bridge detection method based on depth separable convolution

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210610157.0A CN115223017B (en) 2022-05-31 2022-05-31 Multi-scale feature fusion bridge detection method based on depth separable convolution

Publications (2)

Publication Number Publication Date
CN115223017A CN115223017A (en) 2022-10-21
CN115223017B true CN115223017B (en) 2023-12-19

Family

ID=83607940

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210610157.0A Active CN115223017B (en) 2022-05-31 2022-05-31 Multi-scale feature fusion bridge detection method based on depth separable convolution

Country Status (1)

Country Link
CN (1) CN115223017B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117541881B (en) * 2024-01-03 2024-04-16 广东石油化工学院 Road damage detection method and system
CN117854045A (en) * 2024-03-04 2024-04-09 东北大学 Automatic driving-oriented vehicle target detection method

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111047602A (en) * 2019-11-26 2020-04-21 中国科学院深圳先进技术研究院 Image segmentation method and device and terminal equipment
WO2021051520A1 (en) * 2019-09-18 2021-03-25 平安科技(深圳)有限公司 Image identification method, identification model training method, related device, and storage medium
CN113762209A (en) * 2021-09-22 2021-12-07 重庆邮电大学 Multi-scale parallel feature fusion road sign detection method based on YOLO
CN113971660A (en) * 2021-09-30 2022-01-25 哈尔滨工业大学 Computer vision method for bridge health diagnosis and intelligent camera system
CN114119965A (en) * 2021-11-30 2022-03-01 齐鲁工业大学 Road target detection method and system
CN114170526A (en) * 2021-11-22 2022-03-11 中国电子科技集团公司第十五研究所 Remote sensing image multi-scale target detection and identification method based on lightweight network

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021051520A1 (en) * 2019-09-18 2021-03-25 平安科技(深圳)有限公司 Image identification method, identification model training method, related device, and storage medium
CN111047602A (en) * 2019-11-26 2020-04-21 中国科学院深圳先进技术研究院 Image segmentation method and device and terminal equipment
CN113762209A (en) * 2021-09-22 2021-12-07 重庆邮电大学 Multi-scale parallel feature fusion road sign detection method based on YOLO
CN113971660A (en) * 2021-09-30 2022-01-25 哈尔滨工业大学 Computer vision method for bridge health diagnosis and intelligent camera system
CN114170526A (en) * 2021-11-22 2022-03-11 中国电子科技集团公司第十五研究所 Remote sensing image multi-scale target detection and identification method based on lightweight network
CN114119965A (en) * 2021-11-30 2022-03-01 齐鲁工业大学 Road target detection method and system

Non-Patent Citations (7)

* Cited by examiner, † Cited by third party
Title
SSD ground small-target detection model based on a parallel additional feature extraction network; Li Baoqi; He Yuyao; Qiang Wei; He Lingjiao; Acta Electronica Sinica (No. 01); full text *
Image segmentation technology for mechanical parts based on intelligent vision; Hong Qing; Song Qiao; Yang Chentao; Zhang Pei; Chang Lianli; Machine Building & Automation (No. 05); full text *
Real-time semantic segmentation algorithm based on hybrid attention; Yan Guangyu; Liu Zhengxi; Modern Computer (No. 10); full text *
Text detection in natural scenes based on a lightweight network; Sun Jingjing; Zhang Qinglin; Electronic Measurement Technology (No. 08); full text *
Application of an improved MobileNetV2 network in remote sensing image scene classification; Yang Guoliang; Li Fang; Zhu Chen; Xu Nan; Remote Sensing Information (No. 01); full text *

Also Published As

Publication number Publication date
CN115223017A (en) 2022-10-21

Similar Documents

Publication Publication Date Title
CN110188705B (en) Remote traffic sign detection and identification method suitable for vehicle-mounted system
CN111126202B (en) Optical remote sensing image target detection method based on void feature pyramid network
US20230184927A1 (en) Contextual visual-based sar target detection method and apparatus, and storage medium
CN113298818B (en) Remote sensing image building segmentation method based on attention mechanism and multi-scale features
CN111259906B (en) Method for generating remote sensing image target segmentation countermeasures under condition containing multilevel channel attention
CN111612008B (en) Image segmentation method based on convolution network
CN115223017B (en) Multi-scale feature fusion bridge detection method based on depth separable convolution
Gao et al. MLNet: Multichannel feature fusion lozenge network for land segmentation
CN112733749A (en) Real-time pedestrian detection method integrating attention mechanism
CN111079739B (en) Multi-scale attention feature detection method
CN106295613A (en) A kind of unmanned plane target localization method and system
Tian et al. Small object detection via dual inspection mechanism for UAV visual images
CN111046917B (en) Object-based enhanced target detection method based on deep neural network
CN111797841B (en) Visual saliency detection method based on depth residual error network
Liu et al. Coastline extraction method based on convolutional neural networks—A case study of Jiaozhou Bay in Qingdao, China
CN110334656A (en) Multi-source Remote Sensing Images Clean water withdraw method and device based on information source probability weight
CN114973011A (en) High-resolution remote sensing image building extraction method based on deep learning
Zhu et al. Change detection based on the combination of improved SegNet neural network and morphology
Liu et al. CAFFNet: channel attention and feature fusion network for multi-target traffic sign detection
CN113297959A (en) Target tracking method and system based on corner attention twin network
CN115187786A (en) Rotation-based CenterNet2 target detection method
CN113298817A (en) High-accuracy semantic segmentation method for remote sensing image
Zheng et al. Feature enhancement for multi-scale object detection
Cheng et al. A survey on image semantic segmentation using deep learning techniques
Gui et al. A scale transfer convolution network for small ship detection in SAR images

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant