CN115223017B - Multi-scale feature fusion bridge detection method based on depth separable convolution - Google Patents

Multi-scale feature fusion bridge detection method based on depth separable convolution

Info

Publication number
CN115223017B
Authority
CN
China
Prior art keywords
convolution
bridge
feature
detection
scale
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210610157.0A
Other languages
Chinese (zh)
Other versions
CN115223017A (en)
Inventor
黄亮
孙宇
赵俊三
唐伯惠
陈国坤
李小祥
裘木兰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Kunming University of Science and Technology
Original Assignee
Kunming University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Kunming University of Science and Technology filed Critical Kunming University of Science and Technology
Priority to CN202210610157.0A priority Critical patent/CN115223017B/en
Publication of CN115223017A publication Critical patent/CN115223017A/en
Application granted granted Critical
Publication of CN115223017B publication Critical patent/CN115223017B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Multimedia (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a multi-scale feature fusion bridge detection method based on depth separable convolution. First, a trunk feature extraction network is constructed with depth separable convolution to extract bridge features; second, multi-branch parallel cavity (dilated) convolution is applied to the last feature layer to obtain multi-scale receptive fields, so that bridges of different scales are matched better and multi-scale bridge features are extracted; then, bridge detail and semantic information at different depths are fully utilized, and three effective bridge feature layers of different levels are fused across levels through a multi-scale feature pyramid; finally, the bridge detection results are tested and the accuracy is evaluated. With the invention, the mAP reaches 94.26% and the FPS reaches 60.04, leading most mainstream target detection networks in both precision and speed, and the method can be integrated into a mobile terminal to perform high-precision, rapid bridge detection tasks; the network parameters are greatly reduced, the computational cost is lowered, the running speed of the network is improved, and the detection capability for multi-scale bridges is enhanced.

Description

Multi-scale feature fusion bridge detection method based on depth separable convolution
Technical Field
The invention belongs to the technical field of bridge detection, and particularly relates to a multi-scale feature fusion bridge detection method based on depth separable convolution.
Background
Bridges are important transportation facilities and key transportation junctions between land and water. As urbanization accelerates, bridges play an increasingly important role in urban planning and construction and have become an indispensable part of it. As large man-made ground objects, bridges are among the objects that change most readily in a geographic database, and keeping them updated and maintained guides urban planning and construction. Detecting bridges automatically with image processing technology can increase detection speed and improve detection accuracy, and has broad development prospects and important research significance in both military and civil applications. High spatial resolution remote sensing images (High Spatial Resolution Remote Sensing Images, HSRRSIs) play important roles in industry, agriculture, the military, the economy and other fields and have become an important data source for target detection. However, affected by environmental factors and imaging conditions, bridge targets in HSRRSIs have widely varying backgrounds and obvious differences in shape, so it is difficult to distinguish bridges in different HSRRSIs using a unified set of features; meanwhile, different bridges in the same HSRRSI can differ greatly in size, which easily leads to an imbalance of positive and negative samples and further increases the difficulty of bridge detection. Therefore, research on high-precision and rapid detection of bridge targets is of important significance.
Currently, bridge detection methods can be broadly divided into the following two categories. 1) Bridge detection methods based on traditional techniques. These mainly rely on manually selected features followed by sliding-window detection. In real scenes, however, feature extraction is difficult because it depends on certain specific features (such as water bodies and shorelines), or imaging-condition limitations and subjective human factors cause false detections and missed detections of bridges. For example, Fan Lisheng et al. proposed a cross-entropy-based feature extraction and river-region target detection method that can be used for bridge detection in river regions, but its feature parameters depend too heavily on water-body characteristics under specific conditions, so its detection robustness for bridges in HSRRSIs with different backgrounds is poor. Jiangmei et al. proposed a multi-source remote sensing image fusion method for automatic bridge-target detection that can effectively detect bridges in complex large-scale scenes, but it requires fusing near-infrared, panchromatic and SAR images; acquiring such different data for the same area is difficult and laborious, so efficient automatic bridge detection is hard to achieve. G. Sithole et al. proposed a method for detecting bridges in laser scanning data that uses topology information to identify bridge seed points and a threshold on those seed points to detect individual bridges; it can effectively detect bridges of different shapes, but a poorly chosen threshold may mistakenly classify river banks as bridges and affect positioning accuracy. Chaudhuri et al. proposed a method for detecting water bridges from multispectral images in which the image is classified into water, concrete and background, but it cannot solve the imbalance of positive and negative samples, and classification errors occur when the bridge target is very small or the image contains noise. Huang Yong et al. proposed a scene-semantic SAR image bridge detection algorithm that can effectively suppress coherent speckle noise and reduce missed and false detections of bridges, but it is only effective for bridges over water. 2) Bridge detection methods based on deep learning. For example, L. Chen et al. proposed a bridge detection network based on multi-resolution balance and an attention mechanism that can effectively solve the bridge detection problem in SAR images, but the model is complex and the accuracy is low. Zhou Xing proposed an optical remote sensing image bridge detection method based on a dual-attention mechanism that effectively addresses low target detection accuracy under complex backgrounds, but its detection speed still needs improvement.
At present, target detection with deep learning has become a research hotspot, but reports on bridge detection remain relatively few. Deep-learning-based target detection methods fall mainly into two types. One type is region-proposal-based target detection, also known as two-stage algorithms, such as R-CNN and Fast R-CNN; two-stage algorithms have higher accuracy, but the candidate-region extraction process is complex, computationally expensive and slow. The other type is regression-based target detection, also known as one-stage algorithms, such as SSD and YOLO; SSD is fast but detects small targets poorly, whereas the YOLO algorithm extracts features more comprehensively and offers both high accuracy and high detection speed. Nevertheless, the above methods often cannot achieve both speed and accuracy at once.
Therefore, in order to solve the above-mentioned problem, a multi-scale feature fusion bridge detection method based on depth separable convolution is proposed herein.
Disclosure of Invention
To solve the above technical problems, the invention provides a multi-scale feature fusion bridge detection method based on depth separable convolution. First, a backbone feature extraction network is built with depth separable convolution to extract bridge features; second, multi-branch parallel cavity convolution is applied to the last feature layer to obtain multi-scale receptive fields, so that bridges of different scales are matched better and multi-scale bridge features are extracted; then, bridge detail and semantic information at different depths are fully utilized, and three effective bridge feature layers of different levels are fused across levels through a multi-scale feature pyramid; finally, the bridge detection results are tested and the accuracy is evaluated. The mAP reaches 94.26% and the FPS reaches 60.04, leading most mainstream target detection networks in both precision and speed, and the method can be integrated into a mobile terminal to perform high-precision, rapid bridge detection tasks.
In order to achieve the technical effects, the invention is realized by the following technical scheme: a multi-scale feature fusion bridge detection method based on depth separable convolution is characterized by comprising the following steps:
step1: constructing a bridge feature extraction network by utilizing depth separable convolution, reducing network parameters and compressing a network model;
step2: applying multi-branch parallel cavity convolution to enlarge receptive fields on the final layer of bridge feature map, and further extracting features of bridges with different scales;
step3: the multi-scale feature fusion pyramid is utilized to realize cross-level bridge feature map fusion, and the details and semantic information of different feature maps of the bridge are fully utilized;
step4: and outputting a bridge detection result through the detection head.
In Step1, the convolutional neural network is used as the preferred choice for extracting target features. When conventional convolution is used for bridge detection, the input feature map of each channel is convolved with the corresponding convolution kernel, and the results are summed and output. For a D_F × D_F × M bridge input image, N standard convolution kernels of size D_K × D_K × M perform the convolution operation, where M is the number of input channels and N is the number of convolution kernels, i.e. the number of output channels. With a stride of 1 and padding, the output feature map has size D_F × D_F × N and the computational cost is:

P_1 = D_F × D_F × D_K × D_K × M × N    (1)

Further, depth separable convolution decomposes the conventional convolution into two processes: layer-by-layer (depthwise) convolution and point-by-point (pointwise) convolution.

The layer-by-layer convolution is a convolution that does not cross channels: each channel of the feature map corresponds to an independent convolution kernel, each kernel acts on only one specific channel, and the number of channels of the output feature map equals that of the input feature map. For a D_F × D_F × M bridge input image, M convolution kernels are used, the convolution is computed only within each channel without mixing information between channels, and M feature maps are finally output. The computational cost of the layer-by-layer convolution is therefore:

P_2 = D_F × D_F × M × D_K × D_K    (2)

The point-by-point convolution is used for feature combination and dimension change: a 1×1 convolution traverses the features of every point and gathers the spatial information of all channels. Each point-by-point convolution layer is followed by a BN layer and a ReLU layer, which effectively increases the nonlinearity of the model and enhances its generalization capability. For the output feature map of the layer-by-layer convolution, the point-by-point convolution performs the convolution with N kernels of size 1×1×M, and the final output feature map has size D_F × D_F × N, with a computational cost of:

P_3 = D_F × D_F × M × N    (3)

The ratio of the computational cost of the depth separable convolution to that of the conventional convolution is:

(P_2 + P_3) / P_1 = (D_F × D_F × M × D_K × D_K + D_F × D_F × M × N) / (D_F × D_F × D_K × D_K × M × N) = 1/N + 1/D_K²    (4)

N is the number of output channels and is usually large, so the term 1/N is negligible. Taking D_K = 3 as an example, the computational cost of the depth separable convolution is only about 1/9 of that of the conventional convolution, which improves the operational efficiency of the model.
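A minimal numeric sketch (not part of the patent) can confirm the cost comparison in equations (1)-(4); the feature-map and kernel sizes below are hypothetical.

```python
def standard_conv_cost(df, dk, m, n):
    # Equation (1): P1 = D_F * D_F * D_K * D_K * M * N
    return df * df * dk * dk * m * n

def separable_conv_cost(df, dk, m, n):
    # Equations (2) + (3): depthwise cost plus pointwise cost
    return df * df * m * dk * dk + df * df * m * n

df, dk, m, n = 52, 3, 128, 256            # hypothetical sizes, not from the patent
ratio = separable_conv_cost(df, dk, m, n) / standard_conv_cost(df, dk, m, n)
print(round(ratio, 4), round(1 / n + 1 / dk ** 2, 4))   # both are about 0.115, i.e. 1/N + 1/D_K^2
```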
Further, the multi-branch parallel cavity convolution in Step2 enlarges the receptive field; the receptive field denotes the spatial extent of the input image that corresponds to a unit pixel on the output feature map. Cavity (dilated) convolution can obtain receptive fields of different scales, thereby addressing the relatively low detection precision of multi-scale bridges.

Cavity convolution introduces a cavity-rate parameter into conventional convolution; the cavity rate is the spacing between the units of the convolution kernel, and the cavity rate of conventional convolution is 1. The convolution kernel size after adding cavities is:
k'=n×(k-1)+1 (5)
size of receptive field after hole convolution:
r=[(n-1)×(k+1)+k]×[(n-1)×(k+1)+k] (6)
wherein k' is the convolution kernel size after the cavity is added, n is the cavity rate, and k is the conventional convolution kernel size;
In order to extract the features of the cavity part, locate the position information of bridges of different scales more accurately and further save computing resources, conventional convolutions and cavity convolutions of different scales are connected in series, drawing on the idea of the Inception structure, and these are then connected in parallel to form a group of convolution modules with an asymmetric structure, which guarantees that the feature maps output by the parallel branches have consistent dimensions.

Further, in Step2, to reduce the computational load, the three parallel branches first use 1×1 convolutions to reduce the number of channels; to meet the requirements of detecting targets of different sizes, the three parallel branches adopt convolution kernels of two sizes, 3×3, 3×3 and 5×5 respectively, with corresponding cavity rates of 1, 3 and 5. The conventional convolutions can extract the features of the cavity part, so continuous information is retained while receptive fields of different sizes are obtained; finally, the feature maps of different scales are concatenated along the channel dimension, added to a shortcut edge of the input feature map, and output.
Further, in Step3, in the multi-scale cross-layer feature pyramid structure, the trunk feature extraction network comprises six main convolution modules that yield six feature maps of different sizes; the channel numbers of the final three feature maps P1, P2 and P3 are adjusted with 1×1 convolutions to obtain feature layers P1_in, P2_in and P3_in. P3_in is upsampled and then stacked with P2_in to obtain P2_m, and P2_m is upsampled and then stacked with P1_in to obtain P1_out; P1_out is downsampled and then stacked with P2_in and P2_m to obtain P2_out, and P2_out is downsampled and then stacked with P3_in to obtain P3_out. All feature maps undergo feature fusion in these operations and all of them contribute to the multi-scale feature fusion; the input and output feature maps of the same scale are connected directly, so that richer features can be fused; finally, stacking the feature maps multiple times gives the pyramid a more powerful feature representation capability.
The beneficial effects of the invention are as follows:
(1) The backbone network is built by utilizing the depth separable convolution, so that network parameters are greatly reduced, the operation cost is reduced, and meanwhile, the running speed of the network is improved, thereby providing possibility for real-time and efficient bridge detection;
(2) Introducing multi-branch parallel cavity convolution to obtain receptive fields with different sizes, reserving detail information of small targets, and improving the detection capability of the multi-scale bridge;
(3) Utilizing a multi-scale feature pyramid to fully utilize bridge feature information in different layers of feature graphs to realize cross-layer network feature fusion;
(4) The mAP reaches 94.26% and the FPS reaches 60.04, leading most mainstream target detection networks in both precision and speed, and the method can be integrated into a mobile terminal to perform high-precision, rapid bridge detection tasks.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed for the description of the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of the present invention;
FIG. 2 is a schematic diagram of a conventional convolution;
FIG. 3 is a depth separable convolution of the present invention;
FIG. 4 is a diagram of a bridge feature extraction network architecture of the present invention;
FIG. 5 is a receptive field corresponding to objects of different dimensions;
FIG. 6 is a schematic diagram of conventional and hole convolution;
FIG. 7 is a multi-branch parallel hole convolution;
FIG. 8 is a multi-scale cross-layer feature pyramid;
FIG. 9 is a dataset sample example;
FIG. 10 is a 416 pixel by 416 pixel bridge image;
FIG. 11 is a graph showing the results of conventional bridge inspection;
FIG. 12 is a graph showing the detection results of bridges with large scale differences in the same image;
FIG. 13 is a graph showing the detection results of a multi-scale bridge in a large-format HSRRSIs;
FIG. 14 is a small scale bridge inspection result;
FIG. 15 is a high aspect ratio bridge inspection result;
fig. 16 shows the results of the cross-island bridge inspection.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Example 1
Referring to fig. 1 to 8, a multi-scale feature fusion bridge detection method based on depth separable convolution is characterized in that: firstly, constructing a bridge feature extraction network by utilizing depth separable convolution, so as to achieve the purposes of reducing network parameters and compressing a network model; secondly, multi-branch parallel cavity convolution is applied to the final layer of bridge characteristic diagram to enlarge the receptive field, and the characteristics of bridges with different scales are further extracted; then, realizing cross-level bridge feature map fusion by utilizing a multi-scale feature fusion pyramid, and fully utilizing the details and semantic information of different feature maps of the bridge; and finally outputting a bridge detection result through the detection head. The specific flow of the proposed method is shown in figure 1.
Bridge feature extraction network built with depth separable convolution
The most direct way to improve the performance of a bridge detection network is to increase its depth and width, but as the network deepens, the gradient explosion problem during back-propagation arises, the parameters the network must learn become ever larger, and this huge parameter count easily causes over-fitting, which harms bridge detection performance. Moreover, a complex network structure with numerous parameters places high demands on computer hardware and severely limits the amount of data that can be fed into the network in one batch.
Convolutional neural networks, as the preferred choice for extracting target features, have been widely used in various target detection methods. When conventional convolution is used for bridge detection, the input feature map of each channel is convolved with the corresponding convolution kernel, and the results are summed and output; this process is shown in Fig. 2. For a D_F × D_F × M bridge input image, N standard convolution kernels of size D_K × D_K × M perform the convolution operation, where M is the number of input channels and N is the number of convolution kernels, i.e. the number of output channels. With a stride of 1 and padding, the output feature map has size D_F × D_F × N and the computational cost is:

P_1 = D_F × D_F × D_K × D_K × M × N    (1)

Depth separable convolution decomposes conventional convolution into two processes, layer-by-layer (depthwise) convolution and point-by-point (pointwise) convolution, as shown in Fig. 3. The layer-by-layer convolution is a convolution that does not cross channels: each channel of the feature map corresponds to an independent convolution kernel, each kernel acts on only one specific channel, and the number of channels of the output feature map equals that of the input feature map. For a D_F × D_F × M bridge input image, M convolution kernels are used, the convolution is computed only within each channel without mixing information between channels, and M feature maps are finally output. The computational cost of the layer-by-layer convolution is therefore:

P_2 = D_F × D_F × M × D_K × D_K    (2)

The point-by-point convolution is used for feature combination and dimension change: a 1×1 convolution traverses the features of every point and gathers the spatial information of all channels, realizing cross-channel information integration and solving the problem that the channel features are separated from one another in the layer-by-layer convolution. In addition, each point-by-point convolution layer is followed by a BN layer and a ReLU layer, which effectively increases the nonlinearity of the model and enhances its generalization capability. For the output feature map of the layer-by-layer convolution, the point-by-point convolution performs the convolution with N kernels of size 1×1×M, and the final output feature map has size D_F × D_F × N, with a computational cost of:

P_3 = D_F × D_F × M × N    (3)

The ratio of the computational cost of the depth separable convolution to that of the conventional convolution is:

(P_2 + P_3) / P_1 = (D_F × D_F × M × D_K × D_K + D_F × D_F × M × N) / (D_F × D_F × D_K × D_K × M × N) = 1/N + 1/D_K²    (4)

N is the number of output channels and is usually large, so the term 1/N is negligible. Taking D_K = 3 as an example, the computational cost of the depth separable convolution is only about 1/9 of that of conventional convolution, which greatly improves the operational efficiency of the model.
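For illustration, the following PyTorch sketch (not taken from the patent; the channel sizes, the ReLU activation and the printed comparison are assumptions for this example) builds one depthwise separable convolution block and compares its parameter count with a standard convolution of the same shape.

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """Layer-by-layer (depthwise) + point-by-point (1x1) convolution,
    each followed by BN and an activation, as described above."""
    def __init__(self, in_ch, out_ch, k=3, stride=1):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, k, stride, padding=k // 2,
                                   groups=in_ch, bias=False)      # one kernel per channel
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1, bias=False)  # cross-channel mixing
        self.bn1, self.bn2 = nn.BatchNorm2d(in_ch), nn.BatchNorm2d(out_ch)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        x = self.act(self.bn1(self.depthwise(x)))
        return self.act(self.bn2(self.pointwise(x)))

# Parameter comparison against a standard convolution with the same shapes
std = nn.Conv2d(128, 256, 3, padding=1, bias=False)
sep = DepthwiseSeparableConv(128, 256)
count = lambda m: sum(p.numel() for p in m.parameters())
print(count(std), count(sep))   # 294912 vs. roughly one eighth of that
```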
To reduce the number of parameters and the amount of computation, a bridge feature extraction network combining depth separable convolution and CSPDarknet53 (Depthwise Separable Convolution-CSPDarknet53, DSC-CSPDarknet53) is proposed; it is mainly composed of a convolution-normalization-activation module (Convolution + Batch Normalization + Mish, CBM) and Cross Stage Partial modules (CSPX), as shown in Fig. 4. The CBM module consists of a convolution layer followed by batch normalization and a Mish activation function, where the convolution layer is composed of a layer-by-layer convolution and a point-by-point convolution, i.e. a depth separable convolution layer. The CSPX module splits the bridge feature map into two parts: the first part passes through a CBM module and X residual components (Res units); the second part is combined with the first directly by Concat. DSC-CSPDarknet53 integrates the gradient changes into the bridge feature map, effectively strengthening the learning capacity of the network; all convolution layers adopt depth separable convolution, which reduces the computational load while maintaining high accuracy.
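A minimal sketch of the CBM and CSPX building blocks described above is given below; it is an illustration rather than the patent's implementation, so the channel sizes, the residual-unit layout and the use of torch.nn.Mish (available in recent PyTorch versions) are assumptions.

```python
import torch
import torch.nn as nn

class CBM(nn.Module):
    """Convolution + Batch Normalization + Mish, with the convolution realised
    as a depth separable convolution (depthwise followed by pointwise)."""
    def __init__(self, in_ch, out_ch, k=3, stride=1):
        super().__init__()
        self.dw = nn.Conv2d(in_ch, in_ch, k, stride, k // 2, groups=in_ch, bias=False)
        self.pw = nn.Conv2d(in_ch, out_ch, 1, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.Mish()

    def forward(self, x):
        return self.act(self.bn(self.pw(self.dw(x))))

class ResUnit(nn.Module):
    """Residual component (Res unit) used inside CSPX."""
    def __init__(self, ch):
        super().__init__()
        self.block = nn.Sequential(CBM(ch, ch, k=1), CBM(ch, ch, k=3))

    def forward(self, x):
        return x + self.block(x)

class CSPX(nn.Module):
    """Cross Stage Partial module: one branch passes through a CBM and X
    residual units, the other is a shortcut re-joined by channel Concat."""
    def __init__(self, ch, x_units=1):
        super().__init__()
        self.branch = nn.Sequential(CBM(ch, ch), *[ResUnit(ch) for _ in range(x_units)])
        self.fuse = CBM(2 * ch, ch, k=1)

    def forward(self, x):
        return self.fuse(torch.cat([self.branch(x), x], dim=1))
```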
Multi-branch parallel cavity convolution to obtain multi-scale receptive field
The receptive field denotes the spatial extent of the input image that corresponds to a unit pixel on the output feature map. In a CNN, the size of the receptive field is directly determined by the size of the convolution kernel, and the receptive field size directly affects the detection of targets of different scales. Therefore, the single-scale receptive field produced by a single-size convolution kernel cannot satisfy the detection of multi-scale bridges in the same image. Although operations such as downsampling and pooling can effectively enlarge the receptive field, they reduce the spatial resolution, so small-scale bridge information cannot be reconstructed. Cavity (dilated) convolution, by contrast, can obtain receptive fields of different scales and thereby address the relatively low detection precision of multi-scale bridges.

A 1×1 convolution kernel produces a receptive field of size 1×1, which is suitable for detecting small bridges, as shown by the red box on the right of Fig. 5; however, a 1×1 receptive field can hardly cover a large bridge, which, as shown by the yellow box on the left of Fig. 5, can only be detected with a 7×7 receptive field. If a 7×7 receptive field is used over the whole image, a large amount of irrelevant background is included when detecting small targets, so the detail information of small bridges is lost, which is not conducive to acquiring the target features.

Cavity convolution introduces a cavity-rate parameter, namely the spacing between the units of the convolution kernel, into conventional convolution; the cavity rate of conventional convolution is 1. The left panel of Fig. 6 illustrates conventional convolution and the right panel illustrates cavity convolution. The convolution kernel size after adding cavities is:
k'=n×(k-1)+1 (5)
size of receptive field after hole convolution:
r=[(n-1)×(k+1)+k]×[(n-1)×(k+1)+k] (6)
where k' is the convolution kernel size after adding cavities, n is the cavity rate, and k is the conventional convolution kernel size. For the conventional convolution in Fig. 6, the kernel size is 3×3 and the cavity rate is 1; the resulting receptive field, shown as the blue part of the figure, is 3×3, and the pixels that participate in the calculation, i.e. those that carry weights, are the 9 red dots in the figure. For the cavity convolution in Fig. 6, the kernel size after adding cavities is 5×5 and the cavity rate is 2; the receptive field size is 7×7, yet the pixels that participate in the calculation, i.e. those that carry weights, are still only the 9 red points in the figure, the weights of all other points being 0. Cavity convolution can therefore alleviate the loss of spatial information in the feature maps caused by pooling, and can enlarge the receptive field without adding extra parameters or computation.
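The following small PyTorch check (an illustration, not from the patent) verifies equation (5) and the weight count discussed above for a 3×3 kernel with cavity rate 2.

```python
import torch
import torch.nn as nn

def dilated_kernel_size(k, n):
    # Equation (5): k' = n * (k - 1) + 1
    return n * (k - 1) + 1

# A 3x3 kernel with dilation (cavity rate) 2 covers a 5x5 window but still
# uses only 9 weights, matching the description of Fig. 6.
conv = nn.Conv2d(1, 1, kernel_size=3, dilation=2, padding=2, bias=False)
print(dilated_kernel_size(3, 2))                  # 5
print(sum(p.numel() for p in conv.parameters()))  # 9
x = torch.randn(1, 1, 13, 13)
print(conv(x).shape)                              # padding=2 keeps the 13x13 size
```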
Because the cavity part of a cavity convolution does not take part in the sampling operation, the extracted bridge information is discontinuous. In order to extract the features of the cavity part, locate the position information of bridges of different scales more accurately and further save computing resources, conventional convolutions and cavity convolutions of different scales are connected in series, drawing on the idea of the Inception structure, and these are then connected in parallel to form a group of convolution modules with an asymmetric structure, which guarantees that the feature maps output by the parallel branches have consistent dimensions. Fig. 7 is a schematic diagram of the multi-scale parallel cavity convolution.

To reduce the computational load, the three parallel branches first use 1×1 convolutions to reduce the number of channels; then, to meet the requirements of detecting targets of different sizes, the three parallel branches adopt convolution kernels of two sizes, 3×3, 3×3 and 5×5 respectively, with corresponding cavity rates of 1, 3 and 5. The conventional convolutions can extract the features of the cavity part, so continuous information is retained while receptive fields of different sizes are obtained; finally, the feature maps of different scales are concatenated along the channel dimension, added to a shortcut edge of the input feature map, and output.
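The sketch below illustrates one way such a three-branch module could look in PyTorch; the channel split, the exact serial composition of convolutions inside each branch and the final 1×1 projection are assumptions made for this example rather than details given in the patent.

```python
import torch
import torch.nn as nn

class MultiBranchDilatedBlock(nn.Module):
    """Three parallel branches: 1x1 channel reduction, then a 3x3 / 3x3 / 5x5
    kernel with cavity rate 1 / 3 / 5; outputs are concatenated and added to
    a shortcut of the input (channel split is an illustrative assumption)."""
    def __init__(self, ch):
        super().__init__()
        mid = ch // 3
        def branch(k, rate):
            pad = rate * (k - 1) // 2          # keeps the spatial size unchanged
            return nn.Sequential(
                nn.Conv2d(ch, mid, 1, bias=False),
                nn.Conv2d(mid, mid, k, padding=pad, dilation=rate, bias=False),
            )
        self.b1 = branch(3, 1)
        self.b2 = branch(3, 3)
        self.b3 = branch(5, 5)
        self.project = nn.Conv2d(3 * mid, ch, 1, bias=False)  # back to ch for the add

    def forward(self, x):
        y = torch.cat([self.b1(x), self.b2(x), self.b3(x)], dim=1)
        return self.project(y) + x             # shortcut edge of the input

x = torch.randn(1, 96, 13, 13)
print(MultiBranchDilatedBlock(96)(x).shape)    # torch.Size([1, 96, 13, 13])
```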
Multi-scale feature pyramid
Networks with good performance generally have deeper hierarchies, and feature maps at different depths within the network express features of different scales. However, their expressive power also differs with depth. For a bridge detection network, the lower-level features have higher resolution and contain more bridge position and detail information, but because they pass through fewer convolutions their semantic content is lower and they contain more noise. The high-level features have stronger bridge semantic information but low resolution and poor perception of detail. In other words, as the convolutional neural network deepens, abstract features become more and more prominent while shallow spatial information is gradually lost. Therefore, directly predicting bridge targets of different scales from feature maps at different depths of the network cannot yield good detection results; a cross-level feature pyramid must be constructed from feature maps of different depths to realize multi-scale feature fusion.
Fig. 8 is a multi-scale cross-layer feature pyramid structure diagram, where the main feature extraction network includes six main convolution modules to obtain six convolution graphs with different sizes, and the number of channels of the three final feature graphs P1, P2, and P3 is adjusted by using 1×1 convolution to obtain feature layers p1_in, p2_in, and p3_in. The P3_in is up-sampled and then stacked with the P2_in to obtain P2_m, and the P2_m is up-sampled and then stacked with the P1_in to obtain P1_out. P1_out is downsampled and then stacked with P2_in and P2_m to obtain P2_out, and P2_out is downsampled and then stacked with P3_in to obtain P3_out. All feature graphs are subjected to feature fusion in the operation, and all the feature graphs contribute to multi-scale feature fusion; secondly, a connection is directly constructed between the input feature map and the output feature map with the same scale, so that richer features can be fused; finally, stacking feature maps multiple times gives the pyramid more powerful feature representation capability.
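A compact PyTorch sketch of this fusion order is shown below; it is illustrative only: the output channel width, the nearest-neighbour upsampling, the max-pooling downsampling and the use of concatenation plus a 1×1 convolution to realise the "stacking" are assumptions, since the patent does not specify these details.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossLayerFPN(nn.Module):
    """Cross-layer fusion of P1, P2, P3 in the order described above."""
    def __init__(self, c1, c2, c3, out_ch=128):
        super().__init__()
        self.in1 = nn.Conv2d(c1, out_ch, 1)   # P1_in (largest map)
        self.in2 = nn.Conv2d(c2, out_ch, 1)   # P2_in
        self.in3 = nn.Conv2d(c3, out_ch, 1)   # P3_in (smallest map)
        self.m2   = nn.Conv2d(2 * out_ch, out_ch, 1)
        self.out1 = nn.Conv2d(2 * out_ch, out_ch, 1)
        self.out2 = nn.Conv2d(3 * out_ch, out_ch, 1)
        self.out3 = nn.Conv2d(2 * out_ch, out_ch, 1)

    def forward(self, p1, p2, p3):
        p1_in, p2_in, p3_in = self.in1(p1), self.in2(p2), self.in3(p3)
        up = lambda x: F.interpolate(x, scale_factor=2, mode="nearest")
        down = lambda x: F.max_pool2d(x, 2)
        p2_m   = self.m2(torch.cat([up(p3_in), p2_in], dim=1))
        p1_out = self.out1(torch.cat([up(p2_m), p1_in], dim=1))
        p2_out = self.out2(torch.cat([down(p1_out), p2_in, p2_m], dim=1))
        p3_out = self.out3(torch.cat([down(p2_out), p3_in], dim=1))
        return p1_out, p2_out, p3_out

# e.g. feature maps of 52x52, 26x26 and 13x13 as used by the detection heads
p1, p2, p3 = torch.randn(1, 256, 52, 52), torch.randn(1, 512, 26, 26), torch.randn(1, 1024, 13, 13)
print([o.shape for o in CrossLayerFPN(256, 512, 1024)(p1, p2, p3)])
```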
Example 2
Experimental data
The dataset is the bridge-target automatic recognition dataset of high-resolution visible-light images provided by the fourth "Zhongkexing Cup" high-resolution remote sensing image interpretation software competition. The dataset contains 2,000 remote sensing images captured by the Gaofen-2 satellite, with a resolution of 1 m after fusing the panchromatic and multispectral images; it comprises 1,686 bridge images of 668 pixels by 668 pixels and 314 of 1001 pixels by 1001 pixels, and each image contains at least one bridge target, mainly bridges over water.
Here, 6 bridge images under different conditions and scenes are selected from the test set for detection, as shown in Fig. 9. Fig. 9(a) shows a conventional bridge of 668 pixels by 668 pixels; the imaging is clear and the bridge is obvious. Fig. 9(b) contains several large bridge targets with considerable scale differences, with a size of 668 pixels by 668 pixels; the image background is simple, the imaging conditions are good and the bridge targets are obvious, but the size differences between them are large. Fig. 9(c) is a large-format remote sensing image of 1001 pixels by 1001 pixels; the background is complex and the format is large, but the image is clear overall, the bridge targets occupy few pixels and the positive and negative samples are extremely unbalanced. Fig. 9(d) is used for small-target bridge detection and measures 668 pixels by 668 pixels; the background is complex, there is some thin cloud coverage and the colour is uneven; the image contains 11 bridge targets, the largest of which is only 40 pixels by 40 pixels, so relative to the 668 pixel by 668 pixel image they can be regarded as small targets. Fig. 9(e) shows bridges with large aspect ratios, with a size of 668 pixels by 668 pixels; the target boxes contain a large amount of background information, which increases the training difficulty. Fig. 9(f) shows an island-crossing bridge of 668 pixels by 668 pixels, which spans a river containing land-like areas such as shoals or islands.
Design of experiment
The experimental environment is based on the Windows 10 operating system; the computer is configured with an Intel(R) i7-9700K CPU and an NVIDIA GeForce GTX 1070Ti graphics card with 8 GB of video memory. Training and testing were performed on the GPU with PyTorch 1.2.0. During training, the learning rate is gradually decreased. A total of 100 epochs were trained: the first 50 epochs used an initial learning rate of 1×10⁻³ with the batch size set to 16, the last 50 epochs used a learning rate of 1×10⁻⁴ with the batch size set to 8, and the IoU threshold was set to 0.5. The dataset is divided into training, validation and test sets in the ratio 8:1:1. The training set undergoes online data augmentation by image rotation, scaling, translation, cropping and splicing, and the augmented images are uniformly scaled to 416 pixels by 416 pixels before being fed into the network. The experiments use mean average precision (mAP), Precision, Recall and F1-score as objective evaluation indices, and the detection speed is evaluated in frames per second (FPS).
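The two-phase schedule above can be captured in a few lines of configuration; the structure below is only a sketch of that schedule, not the patent's training script, and the dictionary layout is an assumption.

```python
# Hypothetical configuration mirroring the schedule described above.
train_config = {
    "input_size": (416, 416),
    "split": {"train": 0.8, "val": 0.1, "test": 0.1},
    "iou_threshold": 0.5,
    "phases": [
        {"epochs": 50, "learning_rate": 1e-3, "batch_size": 16},
        {"epochs": 50, "learning_rate": 1e-4, "batch_size": 8},
    ],
}

for phase in train_config["phases"]:
    # For each phase one would rebuild the DataLoader with phase["batch_size"],
    # set the optimiser's learning rate to phase["learning_rate"], and run
    # phase["epochs"] epochs of CIoU-loss training.
    print(phase)
```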
Network training uses the CIoU loss function, defined as follows:

L_CIoU = 1 − IoU + ρ²(b, b^gt)/c² + αv

where b denotes the prediction box and b^gt the ground-truth box; ρ(b, b^gt) is the Euclidean distance between the centre points of the prediction box and the ground-truth box; and c is the diagonal length of the smallest enclosing region that contains both the prediction box and the ground-truth box.

α is a weight parameter, defined as follows:

α = v / ((1 − IoU) + v)

v is used to measure the similarity of the aspect ratios and is defined as follows:

v = (4/π²) × (arctan(w^gt/h^gt) − arctan(w/h))²

where w and h denote the width and height of the prediction box, and w^gt and h^gt denote the width and height of the ground-truth box, respectively.
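For reference, a self-contained sketch of a CIoU loss along these lines is given below; it is an illustration consistent with the definitions above, not the patent's code, and the (cx, cy, w, h) box layout is an assumption.

```python
import math
import torch

def ciou_loss(pred, target, eps=1e-7):
    """CIoU loss for boxes given as (cx, cy, w, h)."""
    # corners of prediction and ground-truth boxes
    px1, py1 = pred[..., 0] - pred[..., 2] / 2, pred[..., 1] - pred[..., 3] / 2
    px2, py2 = pred[..., 0] + pred[..., 2] / 2, pred[..., 1] + pred[..., 3] / 2
    tx1, ty1 = target[..., 0] - target[..., 2] / 2, target[..., 1] - target[..., 3] / 2
    tx2, ty2 = target[..., 0] + target[..., 2] / 2, target[..., 1] + target[..., 3] / 2

    inter = (torch.min(px2, tx2) - torch.max(px1, tx1)).clamp(0) * \
            (torch.min(py2, ty2) - torch.max(py1, ty1)).clamp(0)
    union = pred[..., 2] * pred[..., 3] + target[..., 2] * target[..., 3] - inter + eps
    iou = inter / union

    # squared centre distance and diagonal of the smallest enclosing box
    rho2 = (pred[..., 0] - target[..., 0]) ** 2 + (pred[..., 1] - target[..., 1]) ** 2
    cw = torch.max(px2, tx2) - torch.min(px1, tx1)
    ch = torch.max(py2, ty2) - torch.min(py1, ty1)
    c2 = cw ** 2 + ch ** 2 + eps

    v = (4 / math.pi ** 2) * (torch.atan(target[..., 2] / target[..., 3]) -
                              torch.atan(pred[..., 2] / pred[..., 3])) ** 2
    alpha = v / (1 - iou + v + eps)
    return 1 - iou + rho2 / c2 + alpha * v
```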
The 416 pixel by 416 pixel bridge image shown in Fig. 10 is divided into S×S cells; the cell in which the centre of the object to be detected falls is responsible for detecting that object, and several anchor boxes are used on each cell of the feature map to predict the object's bounding box. The anchor boxes are bounding boxes obtained by clustering the shapes and sizes of the bridges in the dataset, and each bounding box comprises 4 coordinate values, a confidence and C conditional class probabilities.

b_x = σ(t_x) + c_x,  b_y = σ(t_y) + c_y,  b_w = p_w × e^(t_w),  b_h = p_h × e^(t_h)

where b_x, b_y, b_w and b_h are respectively the centre position, width and height of the bounding box; c_x and c_y are respectively the normalized distances of the current cell from the upper-left corner of the picture; p_w and p_h are respectively the width and height of the anchor box; and the sigmoid function σ relates the learned parameters t_x, t_y, t_w, t_h and t_o to the anchor-box coordinates, confidence and conditional class probabilities. The bridge feature extraction network performs bridge detection on feature maps of three scales, 52×52, 26×26 and 13×13. Finally, redundant bounding boxes are removed by non-maximum suppression to obtain the final bridge targets.
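A sketch of this standard YOLO-style decoding and non-maximum suppression step is shown below; the tensor layout, the thresholds and the use of torchvision.ops.nms are assumptions made for illustration, not details given in the patent.

```python
import torch
import torchvision

def decode_and_nms(raw, anchors, stride, conf_thr=0.5, iou_thr=0.5):
    """Decode t_x, t_y, t_w, t_h, t_o on one S x S grid and apply NMS.
    raw: (S, S, A, 5 + C) network output; anchors: (A, 2) float tensor in pixels."""
    S = raw.shape[0]
    grid_y, grid_x = torch.meshgrid(torch.arange(S), torch.arange(S), indexing="ij")
    bx = (torch.sigmoid(raw[..., 0]) + grid_x[..., None]) * stride   # b_x = sigma(t_x) + c_x
    by = (torch.sigmoid(raw[..., 1]) + grid_y[..., None]) * stride
    bw = anchors[:, 0] * torch.exp(raw[..., 2])                      # b_w = p_w * e^{t_w}
    bh = anchors[:, 1] * torch.exp(raw[..., 3])
    conf = torch.sigmoid(raw[..., 4])

    boxes = torch.stack([bx - bw / 2, by - bh / 2, bx + bw / 2, by + bh / 2], dim=-1)
    boxes, conf = boxes.reshape(-1, 4), conf.reshape(-1)
    keep = conf > conf_thr
    boxes, conf = boxes[keep], conf[keep]
    kept = torchvision.ops.nms(boxes, conf, iou_thr)   # remove redundant boxes
    return boxes[kept], conf[kept]
```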
The mAP is defined as the mean of the average precision (AP) over all target classes, where the AP of a class is obtained from its precision-recall curve.

Precision refers to the proportion of correct predictions among all samples predicted as positive, and is defined as:

Precision = TP / (TP + FP)

Recall refers to the proportion detected among all positive samples, and is defined as:

Recall = TP / (TP + FN)

The F1 score is the harmonic mean of Precision and Recall, defined as:

F1 = 2 × Precision × Recall / (Precision + Recall)

where TP denotes the correctly classified positive samples, i.e. the number of bridges detected correctly; FP denotes the misclassified negative samples, i.e. the number of non-bridge targets detected as bridges; and FN denotes the misclassified positive samples, i.e. the number of bridges detected as non-bridge targets. In the mAP computation, Precision is the precision for class-C targets, N_C is the number of pictures containing class-C targets, and N is the total number of pictures in the dataset.
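The per-image counts can be turned into these metrics with a few lines; the snippet below is a sketch, and the example counts are purely hypothetical.

```python
def precision_recall_f1(tp, fp, fn, eps=1e-12):
    """Precision, Recall and F1 from the counts defined above."""
    precision = tp / (tp + fp + eps)   # correct among all predicted positives
    recall = tp / (tp + fn + eps)      # detected among all real bridges
    f1 = 2 * precision * recall / (precision + recall + eps)
    return precision, recall, f1

# e.g. 180 bridges found correctly, 12 false alarms, 8 missed (hypothetical)
print(precision_recall_f1(180, 12, 8))
```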
Experimental results and analysis
To verify the light weight of the proposed method, the model complexity of several target detection methods whose trunk feature extraction networks all belong to the Darknet series is compared, namely YOLOv3, YOLOv4, YOLO-tiny, YOLO-lite and the proposed method. Their network structures are similar, and YOLO-tiny, YOLO-lite and the proposed method are lightweight networks. Table 1 compares the model complexity of the 5 methods: compared with the non-lightweight networks YOLOv3 and YOLOv4, the parameter quantity and model size of the proposed method are greatly reduced, and compared with the lightweight networks YOLO-tiny and YOLO-lite they are second best.
Table 1 model complexity comparison
Note: the bolded numbers in the table are the optimal results.
To compare the detection precision and speed of bridge detection methods, 7 methods with excellent and recent performance are selected for comparison experiments, namely EfficientDet, RetinaNet, CenterNet, YOLOv3, YOLOv4, YOLO-tiny and YOLO-lite. Table 2 shows the results obtained by training the bridge dataset with the 7 mainstream target detection algorithms and the proposed algorithm. As the table shows, the bridge detection mAP ranges from 61.09% to 94.26% and the FPS from 12.17 to 139.07. The method with the best FPS is the lightweight network YOLO-tiny, but its mAP is only 81.65%, which hardly meets the requirement of high-precision bridge detection. The mAP of the proposed algorithm is only 0.19% higher than that of YOLOv4, but its FPS is greatly improved and is second only to YOLO-tiny. Its Precision and Recall are superior to the five methods EfficientDet, RetinaNet, YOLOv3, YOLO-tiny and YOLO-lite, and its F1-score is the best among the 7 comparison methods. Combining the results of Table 1 and Table 2, the proposed algorithm therefore maintains high-precision detection while improving the detection speed; compared with the other 7 algorithms, its comprehensive detection capability is the best.
TABLE 2 mainstream target detection algorithm result analysis for bridge dataset
Note that: the bolded numbers in the table are the optimal results of the method
To verify the bridge detection capability of the proposed method in different scenes, 6 methods are selected for bridge detection. Among them, CenterNet is the fastest detection method among the non-lightweight networks, YOLOv3 and YOLOv4 are detection methods with higher precision, and YOLO-tiny and YOLO-lite are lightweight networks. Figs. 11 to 16 show the bridge detection results under different conditions and scenes.
(1) Routine bridge detection experiment
Fig. 11 shows the conventional bridge detection results. The accuracies of YOLOv3, YOLOv4 and the proposed method all reach 1, those of YOLO-tiny and YOLO-lite reach 0.94 and 0.81 respectively, and CenterNet's accuracy is lower than that of the other 5 methods, although it still detects the bridge successfully. All 6 methods detect the conventional bridge with high precision and accurate localization, without false or missed detections, showing good bridge detection capability.
(2) Multi-scale bridge detection experiment
Fig. 12 shows the detection results for several bridges with large scale differences in the same image. CenterNet, YOLOv3, YOLO-tiny and YOLO-lite lack multi-scale target detection capability: for multi-scale targets in the same image they detect only the large, obvious bridges. YOLOv4 can detect 3 bridge targets of different scales, but its overall accuracy is not high. The proposed method copes well with the multi-scale bridge detection task and achieves high-precision detection of both large and small targets in the same image.
(3) Large-breadth multi-scale bridge detection experiment
Fig. 13 shows the detection results for multi-scale bridges in large-format HSRRSIs. CenterNet and YOLO-tiny each have two missed detections and can only detect the larger bridge at the bottom of the image, with CenterNet also showing lower accuracy. YOLOv3, YOLOv4 and YOLO-lite each detect bridge targets with high accuracy but have one missed detection, failing to detect a bridge at the bottom of the image. The proposed method detects all bridge targets accurately and with higher precision.
(4) Small-scale bridge detection experiment
Fig. 14 shows the small-scale bridge detection results. YOLO-tiny has poor detection capability for small targets, detecting only one bridge target. CenterNet correctly detects 7 bridge targets, but its accuracy is low, mostly not exceeding 0.6. YOLO-lite is able to detect 11 bridge targets, but its overall accuracy is not high, mostly not exceeding 0.9. The proposed method is comparable to YOLOv3 and YOLOv4 in detection capability and accuracy, and detects the bridge targets with high precision and without false or missed detections.
(5) Bridge detection experiment with large length-width ratio
Fig. 15 shows the large aspect-ratio bridge detection results. CenterNet correctly detects only two bridge targets, misses one, and the accuracy of the correctly detected bridges is not high. YOLOv3 successfully detects three bridges with high accuracy, but the target box for the long bridge on the right is poorly fitted and cannot frame the whole bridge. YOLOv4 detects the bridges on the left and right well, but the middle bridge is falsely detected, with one bridge target detected twice. YOLO-tiny detects the two bridge targets on the right well but shows repeated detection on the left. YOLO-lite successfully detects the 3 bridge targets, but its overall accuracy is not high, with the two bridges on the left only around 0.7. The proposed method has the best detection capability and detection effect compared with the other 5 methods.
(6) Cross-island bridge detection experiment
Fig. 16 shows the island-crossing bridge detection results. CenterNet and YOLO-lite can detect the bridge target accurately, but their accuracy needs improvement. YOLOv3 detects three bridges, but two of them are false detections; YOLOv4 behaves the same, with two false detections. YOLO-tiny also detects three bridges, with one false detection and one repeated detection. Evidently, most bridge detection methods do not delimit island-crossing bridges well, and their detection is easily disturbed by land areas in the water. The proposed method has no false or missed detections and detects the bridge target in the image with high precision.
Through the 6 groups of comparison experiments, compared with a Centernet network with equivalent speed, the detection precision of the method is obviously improved; compared with YOLOv3 and YOLOv4, the detection speed is greatly improved under the condition that the detection accuracy is slightly improved; for the YOLO-tiny and the YOLO-lite which are both lightweight networks, the detection precision is improved, and the bridge detection in complex scenes can be well coped with. In conclusion, the method has strong generalization capability for bridge detection under various complex scenes, and can lead the current mainstream target detection method in detection speed or detection precision, so that the balance and optimization of speed and precision are achieved.
Example 3
Through a plurality of groups of comparison experiments, we obtain the following findings:
(1) Faster bridge detection networks such as CenterNet, YOLO-lite and YOLO-tiny, whether lightweight or not, have lower detection accuracy than slower bridge detection networks such as YOLOv3 and YOLOv4, and most of them detect unconventional bridges poorly. Therefore, merely compressing the trunk feature extraction network, or speeding up a bridge detection network only by reducing parameters, sacrifices detection precision heavily and cannot achieve efficient, high-precision detection. The proposed bridge detection method offers both high precision and high timeliness, which verifies that building the feature extraction network with depth separable convolution can retain the feature extraction capability of the network to the greatest extent without changing the network structure or the number of effective convolution layers, and that combining the multi-scale feature fusion pyramid with multi-branch parallel cavity convolution can further strengthen the feature extraction and fusion capability;
(2) In the multi-scale bridge detection experiment, all methods successfully detect the larger bridges, but CenterNet, YOLOv3, YOLO-lite and YOLO-tiny cannot detect the small-scale bridges, whereas YOLOv4 and the proposed method successfully detect the large bridge and all the small-scale bridges in the image, i.e. they possess multi-scale bridge detection capability. Therefore, the key to multi-scale bridge detection is to improve the detection of small-scale bridges as much as possible while maintaining the detection of large-scale bridges, and obtaining multi-scale receptive fields and superimposing different receptive fields can effectively solve this problem;
(3) For bridge detection in large-format HSRRSIs, the bridge targets occupy few pixels because of the large format and complex background, i.e. the positive and negative samples are severely unbalanced. The comparison experiments show that YOLOv3, YOLOv4, YOLO-lite and the proposed method perform better than CenterNet and YOLO-tiny; the first 4 methods all contain feature fusion modules of varying degrees, so feature fusion can effectively address this problem;
(4) In the island-crossing bridge detection experiment, YOLOv3, YOLOv4 and YOLO-tiny show false and missed detections of varying degrees, and the detection accuracy of CenterNet and YOLO-lite is low. The reason is that the feature extraction capability of these bridge detection methods is weak: a CNN is limited by its receptive field during feature extraction, so it extracts only local features of the bridge and cannot extract general features by combining the surrounding features or background information of the bridge, which makes bridges divided by islands difficult to detect.
In order to detect bridges in HSRRSIs efficiently and accurately, a multi-scale feature fusion bridge detection method based on depth separable convolution is proposed. The method mainly designs three modules: a trunk feature extraction network built with depth separable convolution, multi-branch parallel cavity convolution, and a cross-level feature fusion pyramid. The method has only 10.8 million parameters and a model size of only 41.2 MB, achieving effective compression of the network; its detection speed of 60.04 FPS allows real-time bridge detection. Its average detection precision of 94.26% is higher than that of most target detection networks; it has strong bridge detection capability, can handle bridge detection tasks in multi-scale, large-format, complex-background and other scenarios, its comprehensive indices are superior to those of other bridge detection methods, and it has strong practicability. Future work will continue to optimize the trunk feature extraction network to achieve faster and more accurate bridge detection.
In the description of the present specification, the descriptions of the terms "one embodiment," "example," "specific example," and the like, mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present invention. In this specification, schematic representations of the above terms do not necessarily refer to the same embodiments or examples. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
The preferred embodiments of the invention disclosed above are intended only to assist in the explanation of the invention. The preferred embodiments are not exhaustive or to limit the invention to the precise form disclosed. Obviously, many modifications and variations are possible in light of the above teaching. The embodiments were chosen and described in order to best explain the principles of the invention and the practical application, to thereby enable others skilled in the art to best understand and utilize the invention. The invention is limited only by the claims and the full scope and equivalents thereof.

Claims (3)

1. A multi-scale feature fusion bridge detection method based on depth separable convolution is characterized by comprising the following steps:
step1: constructing a bridge feature extraction network by utilizing depth separable convolution, reducing network parameters and compressing a network model;
step2: applying multi-branch parallel cavity convolution to enlarge receptive fields on the final layer of bridge feature map, and further extracting features of bridges with different scales;
the multi-branch parallel cavity convolution in Step2 enlarges the receptive field, and the receptive field represents the spatial range of the input image corresponding to the unit pixel on the output characteristic diagram; the method of cavity convolution can obtain receptive fields with different scales, thereby solving the problem of lower detection precision of the multi-scale bridge;
the cavity convolution is to introduce a parameter of cavity rate in the conventional convolution, wherein the cavity rate is the distance between each unit in the convolution kernel, and the cavity rate of the conventional convolution is 1; convolution kernel size after addition of holes:
k'=n×(k-1)+1 (5)
size of receptive field after hole convolution:
r=[(n-1)×(k+1)+k]×[(n-1)×(k+1)+k] (6)
wherein k' is the convolution kernel size after the cavity is added, n is the cavity rate, and k is the conventional convolution kernel size;
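As a purely illustrative check of formulas (5) and (6) (the numerical values are examples introduced here, not taken from the method): a conventional 3×3 kernel (k = 3) with a cavity rate n = 3 has, by formula (5), an equivalent kernel size k' = 3 × (3 - 1) + 1 = 7, covering the area of a 7×7 kernel while using only 9 weights, and, by formula (6), a receptive field of [(3 - 1) × (3 + 1) + 3] × [(3 - 1) × (3 + 1) + 3] = 11 × 11.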
in order to extract the characteristics of the hollow part, the position information of bridges with different dimensions is positioned more accurately, the calculation resources are further saved, the conventional convolution operations with different dimensions and the hollow convolution operations are connected in series by combining the concept of an admission structure, and then the convolution operations are connected in parallel to form a group of convolution modules with asymmetric structures, and the dimension consistency of the characteristic diagram output by each parallel branch can be ensured;
in Step2, in order to reduce the calculation amount, three parallel branches firstly adopt 1×1 convolution to reduce the channel number; in order to meet the requirements of target detection with different sizes, three parallel branches respectively adopt convolution kernels with two sizes of 3×3, 3×3 and 5×5, and the corresponding void ratios are respectively 1, 3 and 5; the characteristics of the cavity part can be extracted by conventional convolution, so that not only can the continuous information be obtained, but also the receptive fields with different sizes can be obtained; finally, carrying out channel splicing on the feature images with different scales, adding the feature images with short-circuit edges of the input feature images, and outputting the feature images;
step3: the multi-scale feature fusion pyramid is utilized to realize cross-level bridge feature map fusion, and the details and semantic information of different feature maps of the bridge are fully utilized;
in Step3, the multi-scale cross-layer feature pyramid structure comprises a main feature extraction network including six main convolution modules, six convolution graphs with different sizes are obtained, and the number of channels is adjusted by using 1×1 convolution of the final three feature graphs P1, P2 and P3 to obtain feature layers p1_in, p2_in and p3_in; the P3_in is up-sampled and then stacked with the P2_in to obtain P2_m, and the P2_m is up-sampled and then stacked with the P1_in to obtain P1_out; the P1_out is subjected to downsampling and then is stacked with the P2_in and the P2_m to obtain the P2_out, and the P2_out is subjected to downsampling and then is stacked with the P3_in to obtain the P3_out; all feature graphs are subjected to feature fusion in the operation, and all the feature graphs contribute to multi-scale feature fusion; the input feature map and the output feature map with the same scale are directly connected, so that richer features can be fused; finally, the feature graphs are stacked for a plurality of times, so that the pyramid has stronger feature representation capability;
step4: and outputting a bridge detection result through the detection head.
2. The multi-scale feature fusion bridge detection method based on depth separable convolution as recited in claim 1, wherein: in Step1, the convolutional neural network is used as the best choice for extracting target features, and bridge detection is performed with the conventional convolutions that make up the convolutional neural network; first, the input feature map of each channel is convolved with the corresponding convolution kernel, and then the results are summed and output; for a bridge input image of size D_F × D_F × M, the convolution operation is performed with N standard convolution kernels of size D_K × D_K × M, wherein M is the number of input channels and N is the number of convolution kernels, i.e. the number of output channels; when the standard convolution kernels are used with a stride of 1 and padding, the output feature map has size D_F × D_F × N and the amount of computation is:
P_1 = D_F × D_F × D_K × D_K × M × N (1).
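As a purely illustrative example of formula (1) (the values are arbitrary and not taken from the method): for a 52×52 feature map with M = 256 input channels, N = 512 output channels and a 3×3 kernel (D_F = 52, D_K = 3), P_1 = 52 × 52 × 3 × 3 × 256 × 512 ≈ 3.19 × 10^9 multiply-accumulate operations.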
3. The multi-scale feature fusion bridge detection method based on depth separable convolution as recited in claim 1, wherein: the depth separable convolution improves the conventional convolution by splitting it into two processes, a layer-by-layer (depthwise) convolution and a point-by-point (pointwise) convolution;
the layer-by-layer convolution is a convolution without crossing channels, each channel of the feature map corresponds to an independent convolution kernel in the process, each convolution kernel only acts on one specific channel, and the number of channels of the output feature map is equal to that of channels of the input feature map; for D F ×D F The method comprises the steps that (1) the bridge input image of the xM is convolved by using M convolution kernels, convolution calculation is only carried out in each channel, information among the channels is not added, and finally M feature images are output; the calculation amount of the layer-by-layer convolution is thus:
P_2 = D_F × D_F × M × D_K × D_K (2)
the point-by-point convolution is used for feature combination and dimension change, the features of each point are traversed through 1X 1 convolution, and the space information of a plurality of channels is collected; each point-by-point convolution layer is followed by a BN layer and a ReLU layer, so that nonlinear change of the model is effectively increased, and generalization capability of the model is enhanced; for the output feature map of the layer-by-layer convolution, the point-by-point convolution uses N convolution kernels with the size of 1 multiplied by M to carry out convolution operation, and finally the size of the output feature map is D F ×D F X N, the calculated amount obtained is:
P_3 = D_F × D_F × M × N (3)
the ratio of the depth separable convolution to the conventional computation is:
n is the number of output channels, and is usually largerNegligible; in D k For example, =3 =>I.e. the calculated amount of the depth separable convolution is only +.>The operation efficiency of the model is improved.
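For illustration only (not part of the claimed method), the following sketch checks formulas (1)-(4) numerically and builds a matching pair of PyTorch layers; PyTorch, the variable names and the example values of D_F, M, N and D_K are all assumptions introduced here:

import torch
import torch.nn as nn

# Count multiply-accumulate operations for a standard convolution versus a
# depth separable convolution; the sizes below are arbitrary example values.
D_F, M, N, D_K = 52, 256, 512, 3

standard_macs  = D_F * D_F * D_K * D_K * M * N             # formula (1)
depthwise_macs = D_F * D_F * M * D_K * D_K                 # formula (2)
pointwise_macs = D_F * D_F * M * N                         # formula (3)
ratio = (depthwise_macs + pointwise_macs) / standard_macs  # formula (4)
print(f"ratio = {ratio:.4f}  (1/N + 1/D_K^2 = {1/N + 1/D_K**2:.4f})")

# A matching pair of PyTorch modules (BN/ReLU omitted for brevity):
standard = nn.Conv2d(M, N, D_K, padding=1, bias=False)
separable = nn.Sequential(
    nn.Conv2d(M, M, D_K, padding=1, groups=M, bias=False),  # layer-by-layer (depthwise)
    nn.Conv2d(M, N, 1, bias=False),                          # point-by-point (pointwise)
)
x = torch.randn(1, M, D_F, D_F)
assert standard(x).shape == separable(x).shape  # both give 1 x N x D_F x D_F

n_std = sum(p.numel() for p in standard.parameters())
n_sep = sum(p.numel() for p in separable.parameters())
print(f"parameters: standard={n_std}, separable={n_sep}, ratio={n_sep/n_std:.4f}")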
CN202210610157.0A 2022-05-31 2022-05-31 Multi-scale feature fusion bridge detection method based on depth separable convolution Active CN115223017B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210610157.0A CN115223017B (en) 2022-05-31 2022-05-31 Multi-scale feature fusion bridge detection method based on depth separable convolution

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210610157.0A CN115223017B (en) 2022-05-31 2022-05-31 Multi-scale feature fusion bridge detection method based on depth separable convolution

Publications (2)

Publication Number Publication Date
CN115223017A CN115223017A (en) 2022-10-21
CN115223017B true CN115223017B (en) 2023-12-19

Family

ID=83607940

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210610157.0A Active CN115223017B (en) 2022-05-31 2022-05-31 Multi-scale feature fusion bridge detection method based on depth separable convolution

Country Status (1)

Country Link
CN (1) CN115223017B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117541881B (en) * 2024-01-03 2024-04-16 广东石油化工学院 Road damage detection method and system
CN117854045A (en) * 2024-03-04 2024-04-09 东北大学 Automatic driving-oriented vehicle target detection method

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111047602A (en) * 2019-11-26 2020-04-21 中国科学院深圳先进技术研究院 Image segmentation method and device and terminal equipment
WO2021051520A1 (en) * 2019-09-18 2021-03-25 平安科技(深圳)有限公司 Image identification method, identification model training method, related device, and storage medium
CN113762209A (en) * 2021-09-22 2021-12-07 重庆邮电大学 Multi-scale parallel feature fusion road sign detection method based on YOLO
CN113971660A (en) * 2021-09-30 2022-01-25 哈尔滨工业大学 Computer vision method for bridge health diagnosis and intelligent camera system
CN114119965A (en) * 2021-11-30 2022-03-01 齐鲁工业大学 Road target detection method and system
CN114170526A (en) * 2021-11-22 2022-03-11 中国电子科技集团公司第十五研究所 Remote sensing image multi-scale target detection and identification method based on lightweight network

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021051520A1 (en) * 2019-09-18 2021-03-25 平安科技(深圳)有限公司 Image identification method, identification model training method, related device, and storage medium
CN111047602A (en) * 2019-11-26 2020-04-21 中国科学院深圳先进技术研究院 Image segmentation method and device and terminal equipment
CN113762209A (en) * 2021-09-22 2021-12-07 重庆邮电大学 Multi-scale parallel feature fusion road sign detection method based on YOLO
CN113971660A (en) * 2021-09-30 2022-01-25 哈尔滨工业大学 Computer vision method for bridge health diagnosis and intelligent camera system
CN114170526A (en) * 2021-11-22 2022-03-11 中国电子科技集团公司第十五研究所 Remote sensing image multi-scale target detection and identification method based on lightweight network
CN114119965A (en) * 2021-11-30 2022-03-01 齐鲁工业大学 Road target detection method and system

Non-Patent Citations (7)

* Cited by examiner, † Cited by third party
Title
SSD ground small-target detection model based on a parallel additional feature extraction network; Li Baoqi; He Yuyao; Qiang Wei; He Lingjiao; Acta Electronica Sinica (No. 01); full text *
Image segmentation technology for mechanical parts based on intelligent vision; Hong Qing; Song Qiao; Yang Chentao; Zhang Pei; Chang Lianli; Machine Building & Automation (No. 05); full text *
Real-time semantic segmentation algorithm based on hybrid attention; Yan Guangyu; Liu Zhengxi; Modern Computer (No. 10); full text *
Text detection in natural scenes based on a lightweight network; Sun Jingjing; Zhang Qinglin; Electronic Measurement Technology (No. 08); full text *
Application of an improved MobileNetV2 network in remote sensing image scene classification; Yang Guoliang; Li Fang; Zhu Chen; Xu Nan; Remote Sensing Information (No. 01); full text *

Also Published As

Publication number Publication date
CN115223017A (en) 2022-10-21

Similar Documents

Publication Publication Date Title
CN110188705B (en) Remote traffic sign detection and identification method suitable for vehicle-mounted system
CN111126202B (en) Optical remote sensing image target detection method based on void feature pyramid network
US20230184927A1 (en) Contextual visual-based sar target detection method and apparatus, and storage medium
CN113298818B (en) Remote sensing image building segmentation method based on attention mechanism and multi-scale features
CN111259906B (en) Method for generating remote sensing image target segmentation countermeasures under condition containing multilevel channel attention
CN111612008B (en) Image segmentation method based on convolution network
CN115223017B (en) Multi-scale feature fusion bridge detection method based on depth separable convolution
Gao et al. MLNet: Multichannel feature fusion lozenge network for land segmentation
CN112733749A (en) Real-time pedestrian detection method integrating attention mechanism
CN111079739B (en) Multi-scale attention feature detection method
CN106295613A (en) A kind of unmanned plane target localization method and system
Tian et al. Small object detection via dual inspection mechanism for UAV visual images
CN111046917B (en) Object-based enhanced target detection method based on deep neural network
CN111797841B (en) Visual saliency detection method based on depth residual error network
Liu et al. Coastline extraction method based on convolutional neural networks—A case study of Jiaozhou Bay in Qingdao, China
CN110334656A (en) Multi-source Remote Sensing Images Clean water withdraw method and device based on information source probability weight
CN114973011A (en) High-resolution remote sensing image building extraction method based on deep learning
Zhu et al. Change detection based on the combination of improved SegNet neural network and morphology
Liu et al. CAFFNet: channel attention and feature fusion network for multi-target traffic sign detection
CN113297959A (en) Target tracking method and system based on corner attention twin network
CN115187786A (en) Rotation-based CenterNet2 target detection method
CN113298817A (en) High-accuracy semantic segmentation method for remote sensing image
Zheng et al. Feature enhancement for multi-scale object detection
Cheng et al. A survey on image semantic segmentation using deep learning techniques
Gui et al. A scale transfer convolution network for small ship detection in SAR images

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant