US12223632B2 - Intelligent detection method and unmanned surface vehicle for multiple type faults of near-water bridges - Google Patents

Intelligent detection method and unmanned surface vehicle for multiple type faults of near-water bridges

Info

Publication number
US12223632B2
Authority
US
United States
Prior art keywords
module
attention
max
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
US17/755,086
Other versions
US20230351573A1 (en)
Inventor
Jian Zhang
Zhili He
Shang Jiang
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southeast University
Original Assignee
Southeast University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southeast University filed Critical Southeast University
Assigned to SOUTHEAST UNIVERSITY reassignment SOUTHEAST UNIVERSITY ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS). Assignors: HE, Zhili, JIANG, Shang, ZHANG, JIAN
Publication of US20230351573A1 publication Critical patent/US20230351573A1/en
Application granted granted Critical
Publication of US12223632B2 publication Critical patent/US12223632B2/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G06T7/0002 Image analysis — inspection of images, e.g. flaw detection
    • G06T7/0004 Image analysis — industrial image inspection
    • G06T7/70 Image analysis — determining position or orientation of objects or cameras
    • G06T7/73 Image analysis — determining position or orientation using feature-based methods
    • B63B1/125 Hulls deriving lift mainly from water displacement, with rigidly interconnected multiple hulls comprising more than two hulls
    • B63B35/00 Vessels or similar floating structures specially adapted for specific purposes and not otherwise provided for
    • B63B45/04 Signalling or lighting devices intended to indicate the vessel or parts thereof
    • B63B79/40 Monitoring properties or operating parameters of vessels in operation, e.g. speed, routing or maintenance schedules
    • G01M5/0008 Investigating the elasticity of structures, e.g. deflection of bridges or aircraft wings — of bridges
    • G01M5/0033 Investigating the elasticity of structures — by determining damage, crack or wear
    • G01M5/0075 Investigating the elasticity of structures — by means of external apparatus, e.g. test benches or portable test systems
    • G01M5/0091 Investigating the elasticity of structures — by using electromagnetic excitation or detection
    • G06N3/045 Neural networks — combinations of networks
    • G06N3/048 Neural networks — activation functions
    • G06N3/08 Neural networks — learning methods
    • B63B2035/006 Unmanned surface vessels, e.g. remotely controlled
    • B63B2035/008 Unmanned surface vessels, remotely controlled
    • B63B2211/02 Applications — oceanography
    • G06T2207/10016 Image acquisition modality — video; image sequence
    • G06T2207/10024 Image acquisition modality — color image
    • G06T2207/10028 Image acquisition modality — range image; depth image; 3D point clouds
    • G06T2207/10032 Image acquisition modality — satellite or aerial image; remote sensing
    • G06T2207/20081 Special algorithmic details — training; learning
    • G06T2207/20084 Special algorithmic details — artificial neural networks [ANN]
    • G06T2207/30181 Subject of image — Earth observation
    • G06T2207/30184 Subject of image — infrastructure
    • G06T2207/30252 Subject of image — vehicle exterior; vicinity of vehicle
    • H04N23/56 Cameras or camera modules comprising electronic image sensors, provided with illuminating means

Definitions

  • the invention belongs to the field of structural fault detection in civil engineering, and in particular relates to an intelligent detection method for multi-type faults of a near-water bridge and an unmanned surface vehicle.
  • the intelligent detection method is represented by deep learning technology, which has brought revolutionary solutions to many industries, such as medicine and health, aerospace and material science.
  • the patent document with the publication number CN111862112A discloses a learning-based medical image segmentation method
  • the patent document with the publication number CN111651916A discloses a material property prediction method based on deep learning.
  • the use of deep learning techniques for intelligent detection of structural faults is attracting more and more attention.
  • researchers apply deep learning methods to the detection of different faults and different infrastructures.
  • the patent document with the publication number CN112171692A discloses a flying adsorption robot suitable for intelligent detection of bridge deflection;
  • the patent document with the publication number CN111413353A discloses an intelligent mobile comprehensive detection equipment for tunnel lining faults;
  • patent document with the publication number CN111021244A discloses an orthotropic steel bridge deck fatigue crack detection robot;
  • the patent document with the publication number CN109978847A discloses a cable-robot-based method for identifying faults of slings.
  • the current intelligent detection method is mainly based on the Anchor-based method, that is, a large number of a priori boxes need to be pre-set, that is, anchor boxes, so it is named Anchor-based method.
  • the patent document with the publication number CN111062437A discloses a bridge fault target detection model based on the Faster R-CNN model
  • the patent document with the publication number CN111310558A also discloses a road damage extraction method based on the Faster R-CNN model.
  • the patent document with the publication number CN111127399A discloses a method for detecting bridge pier faults based on the YOLOv3 model.
  • both the Faster R-CNN model and the YOLO series models are very classic Anchor-based methods.
  • the first prominent problem of Anchor-based methods is that the effect of the algorithm will be affected by the pre-set prior box. When dealing with features with complex shapes, having multiple aspect ratios and multiple sizes, the size and aspect ratio of the prior box may be too different from the target, which will reduce the recall rate of the prediction results. Therefore, in order to improve the detection accuracy, a large number of prior frames are often preset. This also brings about the second prominent problem of the Anchor-based method. A large number of a priori boxes will introduce a large number of hyperparameters and design choices, which will make the model very complex, and the computational load is large, and the computational efficiency is often not high.
  • the present invention discloses an intelligent detection method and unmanned surface vehicle for multi-type faults of near-water bridges, which are suitable for automatic and intelligent detection of faults at the bottom of small and medium bridges.
  • the proposed solution includes intelligent algorithms and hardware equipment. It can ensure the detection accuracy while taking into account the detection speed, and has a wide adaptability and applicability to complex engineering environments.
  • An intelligent detection system for detecting multiple types of faults for near-water bridges comprises a first component, a second component, and a third component.
  • the first component is an intelligent detection algorithm: CenWholeNet, an infrastructure fault target detection network based on deep learning.
  • the second component is an embedded parallel attention module PAM into the target detection network CenWholeNet, and the parallel attention module includes two sub-modules: a spatial attention sub-module and a channel attention sub-module.
  • the third component is an intelligent detection equipment assembly: an unmanned surface vehicle system based on lidar navigation, the unmanned surface vehicle includes four modules, a hull module, a video acquisition module, a lidar navigation module and a ground station module.
  • the present invention is the first application of an Anchor-free target detection algorithm in the field of structural fault detection.
  • the detection results of the traditional Anchor-based method are affected by the setting of the prior frames (that is, the anchor boxes); when such an algorithm deals with structural faults that have complex shapes, various sizes and various aspect ratios (for example, the aspect ratio of an exposed steel bar may be large, while that of flaking may be small), the size and aspect ratio of the preset a priori frames may differ greatly from the target, which causes a low recall rate of the detection results.
  • a large number of a priori frames are often preset. This introduces many hyperparameters and design choices.
  • the method disclosed by the present invention abandons the complex a priori frame setting, directly predicts key points and related vectors (i.e. width, height and other information), and composes them into a detection frame.
  • the method of the invention is simpler, more direct and effective, solves the problem fundamentally, and is more suitable for the detection of engineering structure faults with complex features.
  • the present invention proposes a novel and lightweight attention module by considering the gain effect of the attention mechanism on the expressive ability of the neural network model.
  • the experimental results show that the method described in the present invention is superior to multiple neural network models with extensive influence, and achieves a comprehensive and better effect in the two dimensions of efficiency and accuracy.
  • the disclosed attention module can also improve different neural network models by sacrificing negligible computation.
  • the present invention discloses an unmanned surface vehicle solution that does not rely on GPS signals to detect faults at the bottom of small and medium bridges. Due to design and performance constraints, current testing equipment is often of little use when inspecting the large number of small and medium-sized bridges. Taking drones as an example, their flight often requires a wide interference-free space and GPS-assisted positioning. However, in areas such as the bottom of small and medium bridges with very low clearance, urban underground culverts and sewers, the space is relatively closed, the GPS signal is often very weak, and the internal situation is very complicated; there are risks such as signal loss and collision damage when a drone flies in.
  • the present invention takes the lead in proposing a highly robust unmanned surface vehicle system suitable for fault detection in relatively closed areas.
  • the experimental results show that while improving the detection efficiency, the system can reduce the safety risk and detection difficulty of engineers and save a lot of manpower cost, has strong engineering applicability and broad application prospects.
  • the system proposed by the present invention is not only suitable for the bottom of medium and small bridges, but also has great application potential in engineering scenarios such as urban underground culverts and sewers.
  • FIG. 1 is a schematic diagram of the overall framework in accordance with the aspects of the present invention.
  • FIG. 2 is a schematic diagram of the CenWholeNet network in accordance with the aspects of the present invention.
  • FIG. 3 is a detailed diagram of the attention module PAM in accordance with the aspects of the present invention.
  • FIG. 4 is a schematic diagram of the architecture of the unmanned surface vehicle system in accordance with the aspects of the present invention.
  • FIG. 5 is a schematic diagram of the polar coordinate supplementary information in accordance with the aspects of the present invention.
  • FIG. 6 is a schematic diagram of the proposed PAM embedded in the ResNet network in accordance with the aspects of the present invention.
  • FIG. 7 is a schematic diagram of the PAM embedded in the Hourglass network in accordance with the aspects of the present invention.
  • FIG. 8 is a schematic diagram of the application of the method in accordance with the aspects of the present invention in a bridge group.
  • FIG. 9 is a schematic diagram of the real-time mapping of the unmanned surface vehicle in accordance with the aspects of the present invention.
  • FIG. 10 is a schematic diagram of the detection results of the method in accordance with the aspects of the present invention.
  • FIG. 11 is a comparison table of the detection results between the algorithm framework in the present invention and other advanced target detection algorithms.
  • FIG. 12 compares the training process of the algorithm framework of the present invention with that of other advanced target detection algorithms.
  • FIG. 1 illustrates an intelligent detection method for multi-type faults of near-water bridges.
  • the overall flow chart of the technical solution is shown in FIG. 1 , including the following components:
  • an intelligent detection algorithm CenWholeNet, an infrastructure fault target detection network based on deep learning, described and illustrated in FIG. 2 ;
  • a parallel attention module PAM embedded into the target detection network CenWholeNet, which includes two sub-modules, a spatial attention sub-module and a channel attention sub-module, as illustrated in FIG. 3 ;
  • an intelligent detection equipment assembly: an unmanned surface vehicle system based on lidar navigation; the unmanned surface vehicle includes four modules, a hull module, a video acquisition module, a lidar navigation module and a ground station module. The structural design of the unmanned surface vehicle is illustrated in FIG. 4 .
  • the infrastructure fault target detection network CenWholeNet described in the first component comprises the following steps.
  • Step 1 of the infrastructure fault target detection network CenWholeNet in the first component has the primary network
  • the method of using the primary network is as follows: giving an input image $P\in\mathbb{R}^{W\times H\times 3}$, wherein W is the width of the image, H is the height of the image, and 3 represents the number of channels of the image, that is, three RGB channels; extracting features of the input image P through the primary network, using two convolutional neural network models, the Hourglass network and the deep residual network ResNet.
  • Step 2 of the infrastructure fault target detection network CenWholeNet in the first component has the detector, the method of using the detector is as follows:
  • $H_{c,x,y}=\max_p\left[Y_p(c,x,y)\right],\quad c\in[1,C],\ x\in\left[1,\frac{W}{r}\right],\ y\in\left[1,\frac{H}{r}\right]$
  • $o_k=\left(\frac{x_k}{r}-\left\lfloor\frac{x_k}{r}\right\rfloor,\ \frac{y_k}{r}-\left\lfloor\frac{y_k}{r}\right\rfloor\right)$
  • $\theta_k=\pi-\arctan\left(\frac{y_k^2-y_k^1}{x_k^2-x_k^1}\right)$
  • Step 3 of the infrastructure fault target detection network CenWholeNet in the first component the method of outputting a result is as follows:
  • $\tilde{H}_{c,x,y}=\max_{i^2\le 1,\ j^2\le 1,\ i,j\in\mathbb{Z}}\left[\tilde{H}_{c,x+i,y+j}\right]$
  • non-maximum suppression (NMS) is not needed; a 3×3 max-pooling convolutional layer is used to extract candidate center points.
  • a method of establishing the parallel attention module in the second component is as follows.
  • attention plays a very important role in human perception.
  • when human eyes, ears and other organs acquire information, they tend to focus on more interesting targets and increase their attention, while suppressing uninteresting targets and reducing their attention.
  • the attention mechanism is inspired by this: by embedding attention modules in neural networks, the weights of feature tensors in meaningful regions are increased and the weights of areas such as meaningless backgrounds are reduced, which can improve the performance of the network.
  • the present invention discloses a lightweight, plug-and-play parallel attention module PAM, configured to improve the expressiveness of neural networks; PAM considers two dimensions of feature map attention, spatial attention and channel attention, and combines them in parallel;
  • the LIDAR-based unmanned surface vehicle of the third component comprises four modules including, a hull module, a video acquisition module, a lidar navigation module and ground station module, working together in a cooperative manner.
  • the hull module includes a trimaran and a power system; the trimaran is configured to be stable, resist level 6 wind and waves, and has an effective remote control distance of 500 meters, adaptable to engineering application scenarios; the size of the hull is 75×47×28 cm, which is convenient for transportation; the effective load of the surface vehicle is 5 kg, so that multiple scientific instruments can be installed; in addition, the unmanned surface vehicle has a constant speed cruise function, which reduces the control burden on personnel.
  • the video acquisition module is composed of a three-axis camera pan/tilt, a fixed front camera and a fill light; the three-axis camera pan/tilt supports 10× optical zoom, auto focus, photography and 60 FPS video recording, so that the video acquisition module meets the shooting requirements of faults of different scales and locations; the fixed front camera is configured to determine the hull posture; the picture is transmitted back to a ground station in real time through a wireless image transmission device, on the one hand for fault identification, and on the other hand for assisting control of the USV; a controllable LED fill light board containing 180 high-brightness LED lamp beads is installed to cope with low-light working environments such as the areas under small and medium-sized bridges; a 3D-printed pan/tilt carries the LED fill light board to meet the needs of multi-angle fill light; in addition, fixed front-view LED light beads are also installed, providing light source support for the front-view camera.
  • the lidar navigation module includes a lidar, a mini computer, a transmission system and a control system; the lidar is configured to perform 360° omnidirectional scanning; connected with the mini computer, it can perform real-time mapping of the surrounding environment of the unmanned surface vehicle; through wireless image transmission, the information of the surrounding scene is transmitted back to the ground station in real time, so as to realize lidar navigation of the unmanned surface vehicle; based on lidar navigation, the unmanned surface vehicle no longer needs GPS positioning in areas with weak GPS signals such as under bridges and in underground culverts; the wireless transmission system supports real-time transmission of 1080P video, with a maximum transmission distance of 10 kilometers; redundant transmission is used to ensure link stability and strong anti-interference; the control system consists of the wireless image transmission equipment, a Pixhawk 2.4.8 flight controller and a SKYDROID T12 receiver, and through the flight controller and receiver, the control system effectively controls the equipment on board.
  • the ground station module includes two remote controls and multiple display devices; a main remote control is used to control the unmanned surface vehicle, and a secondary remote control is used to control the surface vehicle borne scientific instruments, and the display device is used to monitor the real-time information returned by the camera and lidar; on the one hand, the display device displays the picture in real time, and on the other hand, it processes the image in real time to identify the fault; the devices cooperate with each other to realize the intelligent fault detection without a GPS signal.
  • the 3D lidar carried by the unmanned surface vehicle is combined with the SLAM algorithm, and the real-time mapping effect is shown in FIG. 9 .
  • the collected images include three types of faults: cracking, flaking and rebar exposure.
  • the pixel resolution of the fault images is 512×512.
  • the batch size during training is taken as 2.
  • the batch size during testing is taken as 1.
  • the learning rate is taken as 5×10⁻⁴.
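For clarity, the bullets above can be collected into a minimal training-configuration sketch, assuming a PyTorch-style setup; the optimizer choice (Adam) is an assumption of this sketch and is not specified in this excerpt:

```python
import torch
import torch.nn as nn

# Hypothetical training configuration collecting the values stated above.
# nn.Conv2d is only a stand-in for CenWholeNet (ResNet/Hourglass backbone).
model = nn.Conv2d(3, 3, kernel_size=3)
optimizer = torch.optim.Adam(model.parameters(), lr=5e-4)  # learning rate 5e-4
train_batch_size, test_batch_size = 2, 1                   # batch sizes above
input_resolution = (512, 512)                              # fault image resolution
```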
  • the detection result of the solution proposed by the present invention is shown in FIG. 10 ; the heat map is the visual result directly output by the network, which can provide evidence for the result of target detection.
  • the detection method disclosed in the present invention is also compared with state-of-the-art object detection models on the same dataset, including the widely influential Anchor-based object detection method Faster R-CNN, the latest model YOLOv5 of the widely used YOLO series, and the acclaimed Anchor-free method CenterNet.
  • the attention module PAM of the present invention is also compared with SENet and CBAM, the excellent and classic attention modules recognized by the deep learning community.
  • the chosen evaluation metrics are the average precision AP and average recall AR, which are commonly used in the deep learning field. They are the average values of different categories and different images.
  • the calculation process is briefly described below. First, a key concept, the intersection over union (IoU), is introduced. It is a common concept in the field of target detection; it measures the degree of overlap between the candidate box, that is, the prediction result of the model, and the ground-truth bounding box, as the ratio of intersection to union, calculated by the following formula:

$$IoU=\frac{area\left(\mathrm{Prediction\ results}\cap\mathrm{GroundTruth}\right)}{area\left(\mathrm{Prediction\ results}\cup\mathrm{GroundTruth}\right)}$$
  • given an IoU threshold, the recall rate can be calculated as the ratio of correctly detected targets (true positives) to all ground-truth targets.
  • the IoU threshold is usually divided into 10 levels, 0.50:0.05:0.95.
  • AP50 used in the example is the precision when the IoU threshold is 0.50.
  • AP75 is the precision when the IoU threshold is 0.75.
  • the average precision AP represents the average precision under the 10 IoU thresholds, that is:

$$AP=\frac{1}{10}\left(AP_{50}+AP_{55}+AP_{60}+\cdots+AP_{95}\right)$$
  • the average recall AR is the maximum recall for each image given 1, 10 and 100 detections, averaged over the categories and the 10 IoU thresholds, yielding 3 sub-indicators AR1, AR10 and AR100. Obviously, the closer the values of AP and AR are to 1, the better the detection results and the closer they are to the labels.
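As an illustration of the metrics described above, the following sketch computes the IoU of two boxes and averages precision over the 10 thresholds; `precision_at_threshold` is a hypothetical helper, and the (left, top, right, bottom) box layout is an assumption of this sketch:

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (left, top, right, bottom)."""
    iw = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    ih = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = iw * ih
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# AP averages the precision over the 10 IoU thresholds 0.50:0.05:0.95:
thresholds = [0.50 + 0.05 * i for i in range(10)]
# AP = sum(precision_at_threshold(t) for t in thresholds) / 10  # hypothetical helper
```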
  • Comparable performance can only be achieved by training the best YOLO v5 sub-version YOLO v5x for more Epochs.
  • although YOLO v5 is slightly faster in running speed, its accuracy is far inferior to the method proposed by the present invention.
  • compared with CenterNet, the running speed is the same, but the detection effect is much better.
  • Two conclusions can be drawn from the comparison at the attention module level: (1) The PAM proposed by the present invention can achieve a general and substantial enhancement effect on different deep learning models under the premise of sacrificing a small amount of computation; (2) Compared with SENet and CBAM, PAM can obtain more enhancement, which is obviously better than SENet and CBAM.
  • the comparison of the training process between different methods is shown in FIG. 12 , and the method proposed in the present invention is marked with a circle. It can be clearly seen that although the training results oscillate to different degrees, our method can generally achieve higher AP and AR than the traditional methods; that is, a better target detection effect can be obtained.
  • the specific embodiment verifies the effectiveness of the technical solution of the present invention and the applicability to complex engineering.
  • the proposed intelligent detection method is more suitable for multi-fault detection with variable slenderness ratios and complex shapes.
  • the proposed unmanned surface vehicle system also has high robustness and high practicability.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Chemical & Material Sciences (AREA)
  • Combustion & Propulsion (AREA)
  • Mechanical Engineering (AREA)
  • Ocean & Marine Engineering (AREA)
  • Aviation & Aerospace Engineering (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Quality & Reliability (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Fluid Mechanics (AREA)
  • Electromagnetism (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses an intelligent detection method for multiple types of faults of near-water bridges and an unmanned surface vehicle. The method includes an infrastructure fault target detection network CenWholeNet and a bionics-based parallel attention module PAM. CenWholeNet is a deep learning-based Anchor-free target detection network, which mainly comprises a primary network and a detector, used to automatically detect faults in acquired images with high precision. The PAM introduces an attention mechanism into the neural network, including spatial attention and channel attention, and is used to enhance the expressive power of the neural network. The unmanned surface vehicle includes a hull module, a video acquisition module, a lidar navigation module and a ground station module; it supports lidar navigation without GPS information, long-range real-time video transmission and highly robust real-time control, and is used for automated acquisition of information from the underside of bridges.

Description

CROSS-REFERENCE TO RELATED APPLICATIONS
This Application is a Section 371 National Stage of International Application No. PCT/CN2021/092393, filed on May 8, 2021, which claims priority to Chinese Patent Application No. 202110285996.5, filed on Mar. 7, 2021, the contents of which are incorporated herein by reference in their entireties.
TECHNICAL FIELD
The invention belongs to the field of structural fault detection in civil engineering, and in particular relates to an intelligent detection method for multi-type faults of a near-water bridge and an unmanned surface vehicle.
BACKGROUND
During the service lifetime of engineering structures, many faults will occur due to the influence of load and environment. Once these faults are generated, they easily accumulate and expand, thus affecting the service life and overall safety of the structure, and even endangering people's lives and property. In recent years, there have been many cases of structural damage, such as bridge collapse, due to the lack of effective inspection and maintenance. Therefore, regular inspection and maintenance of structures is essential.
Traditional infrastructure fault detection methods are mainly manual. These methods require the help of complicated tools, and have problems such as low efficiency, high labor costs, and large detection blind spots. Therefore, many researchers have recently introduced intelligent detection methods and intelligent detection equipment into the field of infrastructure fault detection. The intelligent detection method is represented by deep learning technology, which has brought revolutionary solutions to many industries, such as medicine and health, aerospace and material science. For example, the patent document with the publication number CN111862112A discloses a learning-based medical image segmentation method, and the patent document with the publication number CN111651916A discloses a material property prediction method based on deep learning. Similarly, the use of deep learning techniques for intelligent detection of structural faults is attracting more and more attention. Researchers apply deep learning methods to the detection of different faults and different infrastructures, such as concrete structure crack detection, reinforced concrete structure multi-fault detection, steel structure corrosion detection, bolt loosening detection, ancient building fault detection, and shield tunnel defect detection. However, intelligent algorithms alone are not enough; to achieve true automatic detection, intelligent detection equipment is also required. In order to meet the needs of different inspection projects, a variety of inspection robots have been proposed and applied, such as bridge inspection drones, mobile tunnel inspection vehicles, bridge deck inspection robots, and rope climbing robots. For example, the patent document with the publication number CN112171692A discloses a flying adsorption robot suitable for intelligent detection of bridge deflection; the patent document with the publication number CN111413353A discloses intelligent mobile comprehensive detection equipment for tunnel lining faults; the patent document with the publication number CN111021244A discloses an orthotropic steel bridge deck fatigue crack detection robot; and the patent document with the publication number CN109978847A discloses a cable-robot-based method for identifying faults of slings.
These methods have solved many engineering problems, but two outstanding shortcomings of the current solutions remain. (1) The current intelligent detection methods are mainly Anchor-based, that is, a large number of a priori boxes (anchor boxes) need to be pre-set, hence the name Anchor-based method. For example, the patent document with the publication number CN111062437A discloses a bridge fault target detection model based on the Faster R-CNN model, and the patent document with the publication number CN111310558A also discloses a road damage extraction method based on the Faster R-CNN model. The patent document with the publication number CN111127399A discloses a method for detecting bridge pier faults based on the YOLOv3 model. Both the Faster R-CNN model and the YOLO series models are very classic Anchor-based methods. The first prominent problem of Anchor-based methods is that the effect of the algorithm is affected by the pre-set prior boxes. When dealing with features having complex shapes, multiple aspect ratios and multiple sizes, the size and aspect ratio of the prior boxes may differ too much from the target, which reduces the recall rate of the prediction results. Therefore, in order to improve the detection accuracy, a large number of prior boxes are often preset. This brings about the second prominent problem of the Anchor-based method: a large number of a priori boxes introduces a large number of hyperparameters and design choices, which makes the model very complex, the computational load large, and the computational efficiency often low. Therefore, traditional intelligent detection methods are not well suited to structural fault detection, and the engineering community urgently needs new intelligent detection algorithms that are more efficient and concise and have wider adaptability. (2) At present, the areas where intelligent equipment can detect faults are still very limited, mainly easy-to-detect areas such as the outer surface of the structure. For example, the patent document with the publication number CN111260615A discloses a method for detecting apparent faults of bridges based on UAVs. However, a UAV system can hardly work in relatively closed spaces, such as the bottom area of a large number of small and medium bridges, where the headroom is low and the situation is complex, and both manual and intelligent detection equipment are often helpless. Taking the UAV as an example, its flight often requires a wide interference-free space as well as GPS-assisted positioning and manipulation. However, the GPS signal in the bottom area of small and medium-sized bridges with very low clearance is often very weak, and the internal situation is also very complicated; there are risks such as signal loss and collision damage when drones fly in. Some areas are also very small, may contain toxic gases, and are difficult for humans to reach, so these areas have been detection blind spots for many years. Effective detection of these areas is also the focus and difficulty of inspection projects. The engineering community urgently needs new types of intelligent detection equipment for such areas that are difficult for humans and other intelligent equipment to reach.
SUMMARY OF THE INVENTION
In order to solve the above problems, the present invention discloses an intelligent detection method and unmanned surface vehicle for multi-type faults of near-water bridges, which are suitable for automatic and intelligent detection of faults at the bottom of small and medium bridges. The proposed solution includes intelligent algorithms and hardware equipment. It can ensure the detection accuracy while taking into account the detection speed, and has a wide adaptability and applicability to complex engineering environments.
To achieve the above object, the technical scheme of the present invention is as follows.
An intelligent detection system for detecting multiple types of faults for near-water bridges, comprises a first component, a second component, and a third component. The first component is an intelligent detection algorithm: CenWholeNet, an infrastructure fault target detection network based on deep learning.
The second component is an embedded parallel attention module PAM into the target detection network CenWholeNet, and the parallel attention module includes two sub-modules: a spatial attention sub-module and a channel attention sub-module.
The third component is an intelligent detection equipment assembly: an unmanned surface vehicle system based on lidar navigation, the unmanned surface vehicle includes four modules, a hull module, a video acquisition module, a lidar navigation module and a ground station module.
Further, the infrastructure fault target detection network CenWholeNet described in the first component comprises the following steps:
    • Step 1: a primary network: using the primary network to extract features of images;
    • Step 2: a detector: converting the extracted image features, by the detector, into the tensor forms required for calculation, and optimizing the result through a loss function; and
    • Step 3: result output: the result output includes converting the tensor into a boundary box and outputting of prediction results of target detection.
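Before the detailed steps, the following sketch illustrates the tensor shapes the three steps exchange, assuming a PyTorch-style implementation (the disclosure itself does not prescribe a framework):

```python
import torch

# Illustrative shape walk-through of the CenWholeNet pipeline.
C, W, H, r = 3, 512, 512, 4                 # fault categories, input size, output stride
P = torch.rand(1, 3, H, W)                  # Step 1 input: an RGB image P

# Step 2 predicts C heatmap channels plus 6 regression channels per location:
heat  = torch.rand(1, C, H // r, W // r)    # center-point heatmap
size  = torch.rand(1, 2, H // r, W // r)    # box width/height
off   = torch.rand(1, 2, H // r, W // r)    # sub-pixel offsets
polar = torch.rand(1, 2, H // r, W // r)    # polar supplement (l/2, theta)
# Step 3 decodes [heat, size, off, polar] into bounding boxes.
```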
Wherein Step 1 of the infrastructure fault target detection network CenWholeNet in the first component has the primary network, and the method of using the primary network is as follows: giving an input image $P\in\mathbb{R}^{W\times H\times 3}$, wherein W is the width of the image, H is the height of the image, and 3 represents the number of channels of the image, that is, three RGB channels; extracting features of the input image P through the primary network;
    • using two convolutional neural network models, Hourglass network and deep residual network ResNet.
Wherein Step 2 of the infrastructure fault target detection network CenWholeNet in the first component has the detector, the method of using the detector is as follows:
converting the features extracted by the primary network into an output set consisting of 4 tensors, $[\tilde{H},\tilde{D},\tilde{O},\widetilde{Polar}]$, by the detector, as the core of CenWholeNet;
    • using $\tilde{H}\in[0,1]^{C\times\frac{W}{r}\times\frac{H}{r}}$ to represent the heat map of central key points, where C is the number of fault categories, taken as C=3 here, and r is the output step size, that is, the down-sampling ratio; the default step size is 4, and down-sampling improves the calculation efficiency;
    • defining $H\in[0,1]^{C\times\frac{W}{r}\times\frac{H}{r}}$ as the ground-truth heatmap; for category c, the ground-truth center point at location (i,j) is $p_{cij}\in\mathbb{R}^{C\times W\times H}$; first computing its down-sampled equivalent position $\hat{p}_{cxy}\in\mathbb{R}^{C\times\frac{W}{r}\times\frac{H}{r}}$, wherein $x=\lfloor i/r\rfloor$, $y=\lfloor j/r\rfloor$;
    • then, through a Gaussian kernel function, mapping $\hat{p}_{cxy}$ to a tensor $Y_p\in\mathbb{R}^{C\times\frac{W}{r}\times\frac{H}{r}}$, where $Y_p$ is defined by:

$$Y_p(c,x,y)=\exp\left(-\frac{\left(x-\hat{p}_{cxy}(x)\right)^2+\left(y-\hat{p}_{cxy}(y)\right)^2}{2\sigma_p^2}\right)$$
    • wherein $\hat{p}_{cxy}(x)$ and $\hat{p}_{cxy}(y)$ represent the center point position (x,y), and $\sigma_p=\text{gaussian\_radius}/3$, where gaussian_radius is the maximum radius of the offset of the corner points of a detection frame such that the intersection over union between the offset detection frame and the ground-truth detection frame still satisfies IoU≥t, with t=0.7 in all experiments; integrating all the corresponding $Y_p$ points gives the ground-truth heat map H:

$$H_{c,x,y}=\max_p\left[Y_p(c,x,y)\right],\quad c\in[1,C],\ x\in\left[1,\frac{W}{r}\right],\ y\in\left[1,\frac{H}{r}\right]$$
    • wherein $H_{c,x,y}$ represents the value of H at the position (c,x,y), i.e. the probability that this position is a center point; specifically, $H_{c,x,y}=1$ represents a central key point, a positive sample; conversely, $H_{c,x,y}=0$ is background, a negative sample; focal loss is used as the metric to measure the distance between $\tilde{H}$ and H, according to the following equation:

$$\mathcal{L}_{Heat}=-\frac{1}{N}\sum_{c=1}^{C}\sum_{x=1}^{W/r}\sum_{y=1}^{H/r}\begin{cases}\left(1-\tilde{H}_{c,x,y}\right)^{\alpha}\log\left(\tilde{H}_{c,x,y}\right)&\text{if }H_{c,x,y}=1\\\left(1-H_{c,x,y}\right)^{\beta}\left(\tilde{H}_{c,x,y}\right)^{\alpha}\log\left(1-\tilde{H}_{c,x,y}\right)&\text{otherwise}\end{cases}$$

    • wherein N is the total count of all central key points, and α and β are hyperparameters configured to control the weights; in all cases, α=2, β=4; by minimizing $\mathcal{L}_{Heat}$, the neural network model is configured to better predict the position of the center point of the target;
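A minimal sketch of the two operations just defined, the Gaussian splatting of ground-truth center points and the focal loss, assuming a PyTorch-style implementation:

```python
import torch

def splat_gaussian(heatmap, cx, cy, radius):
    """Write one ground-truth center (cx, cy) into a per-category heatmap slice
    with the Gaussian kernel above; H is the point-wise max over all Y_p."""
    sigma = radius / 3.0                        # sigma_p = gaussian_radius / 3
    h, w = heatmap.shape
    ys = torch.arange(h, dtype=torch.float32).view(-1, 1)
    xs = torch.arange(w, dtype=torch.float32).view(1, -1)
    g = torch.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))
    heatmap.copy_(torch.max(heatmap, g))

def focal_loss(pred, gt, alpha=2.0, beta=4.0, eps=1e-6):
    """Penalty-reduced focal loss between predicted and ground-truth heatmaps."""
    pos = gt.eq(1).float()                      # central key points (positives)
    neg = 1.0 - pos
    pos_term = ((1 - pred) ** alpha) * torch.log(pred + eps) * pos
    neg_term = ((1 - gt) ** beta) * (pred ** alpha) * torch.log(1 - pred + eps) * neg
    n = pos.sum().clamp(min=1.0)                # N = number of central key points
    return -(pos_term.sum() + neg_term.sum()) / n
```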
    • obtaining the size information W×H of a prediction box to finally determine the boundary box;
    • defining the size of the ground-truth boundary box corresponding to the kth key point $p_k$ as $d_k=(w_k,h_k)$, and integrating all $d_k$ to get the ground-truth boundary box dimension tensor $D\in\mathbb{R}^{2\times\frac{W}{r}\times\frac{H}{r}}$:

$$D=d_1\oplus d_2\oplus\cdots\oplus d_N$$

    • wherein ⊕ represents pixel-level addition; for all fault categories, the model is configured to give a predicted dimension tensor $\tilde{D}\in\mathbb{R}^{2\times\frac{W}{r}\times\frac{H}{r}}$, and smooth L1 loss is configured to measure the similarity of D and $\tilde{D}$, determined by the following equation:

$$\mathcal{L}_{D}=\frac{1}{N}\sum_{k=1}^{N}\mathrm{SmoothL1Loss}\left(\tilde{d}_k,d_k\right)=\frac{1}{N}\sum_{k=1}^{N}\begin{cases}0.5\left\|\tilde{d}_k-d_k\right\|_2^2&\text{if }\left\|\tilde{d}_k-d_k\right\|_1<1\\\left\|\tilde{d}_k-d_k\right\|_1-0.5&\text{otherwise}\end{cases}$$
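The piecewise smooth L1 loss above admits a direct sketch (PyTorch-style; the mean over keypoints realizes the 1/N factor):

```python
import torch

def smooth_l1(pred, target):
    """Smooth L1 loss per the piecewise definition above.
    pred, target: (N, 2) per-keypoint vectors, e.g. (w_k, h_k) or offsets o_k."""
    l1 = (pred - target).abs().sum(dim=1)       # ||.||_1 per keypoint
    l2sq = ((pred - target) ** 2).sum(dim=1)    # ||.||_2^2 per keypoint
    return torch.where(l1 < 1, 0.5 * l2sq, l1 - 0.5).mean()
```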
    • obtaining a rough width and height of each prediction box by minimizing $\mathcal{L}_{D}$;
    • correcting the error caused by down-sampling by introducing a position offset, because the image is scaled by a factor of r; recording the coordinates of the kth key point $p_k$ as $(x_k,y_k)$, the mapped coordinates are $(\lfloor x_k/r\rfloor,\lfloor y_k/r\rfloor)$, which gives the ground-truth offset:

$$o_k=\left(\frac{x_k}{r}-\left\lfloor\frac{x_k}{r}\right\rfloor,\ \frac{y_k}{r}-\left\lfloor\frac{y_k}{r}\right\rfloor\right)$$

    • integrating all $o_k$ to get the ground-truth offset matrix $O\in\mathbb{R}^{2\times\frac{W}{r}\times\frac{H}{r}}$:

$$O=o_1\oplus o_2\oplus\cdots\oplus o_N$$

    • wherein the 2 of the first dimension represents the offset of the key point (x,y) in the W and H directions; correspondingly, the model will give a prediction tensor $\tilde{O}\in\mathbb{R}^{2\times\frac{W}{r}\times\frac{H}{r}}$, and smooth L1 loss is used to train the offset:

$$\mathcal{L}_{Off}=\frac{1}{N}\sum_{k=1}^{N}\mathrm{SmoothL1Loss}\left(\tilde{o}_k,o_k\right)=\frac{1}{N}\sum_{k=1}^{N}\begin{cases}0.5\left\|\tilde{o}_k-o_k\right\|_2^2&\text{if }\left\|\tilde{o}_k-o_k\right\|_1<1\\\left\|\tilde{o}_k-o_k\right\|_1-0.5&\text{otherwise}\end{cases}$$
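The ground-truth offsets themselves reduce to one line (PyTorch-style sketch):

```python
import torch

def gt_offsets(centers, r=4):
    """Sub-pixel offsets o_k lost by down-sampling with stride r.
    centers: (N, 2) tensor of key-point coordinates (x_k, y_k) in input pixels."""
    scaled = centers.float() / r
    return scaled - scaled.floor()   # (x_k/r - floor(x_k/r), y_k/r - floor(y_k/r))
```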
    • introducing a new set of tensors to modify the prediction frame and improve the detection accuracy, so that the model pays more attention to the overall information of the target; specifically, taking the angle between the diagonal of the detection frame and the x-axis, together with the diagonal length of the detection frame, as the training targets; defining the coordinates of the upper left and lower right corners of the detection frame as $(x_k^1,y_k^1)$ and $(x_k^2,y_k^2)$, the diagonal length of the detection frame $l_k$ is calculated as:

$$l_k=\sqrt{\left(x_k^1-x_k^2\right)^2+\left(y_k^1-y_k^2\right)^2}$$

    • the inclination of the connecting line between the upper left and lower right corners $\theta_k$ is calculated by the following formula:

$$\theta_k=\pi-\arctan\left(\frac{y_k^2-y_k^1}{x_k^2-x_k^1}\right)$$

    • constructing a pair of complementary polar coordinates $polar_k=\left(\frac{1}{2}l_k,\ \theta_k\right)$ and further obtaining the ground-truth polar coordinate matrix $Polar\in\mathbb{R}^{2\times\frac{W}{r}\times\frac{H}{r}}$:

$$Polar=\left(\tfrac{1}{2}l_1,\theta_1\right)\oplus\left(\tfrac{1}{2}l_2,\theta_2\right)\oplus\cdots\oplus\left(\tfrac{1}{2}l_N,\theta_N\right)$$
    • the model also gives a prediction tensor $\widetilde{Polar}\in\mathbb{R}^{2\times\frac{W}{r}\times\frac{H}{r}}$, which is trained by the same L1 loss:

$$\mathcal{L}_{Polar}=\frac{1}{N}\sum_{k=1}^{N}\left\|\widetilde{polar}_k-polar_k\right\|_1$$
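A minimal sketch of the polar supplement for a single ground-truth box, following the two formulas above (plain Python; assumes $x_k^2>x_k^1$, so the slope is defined):

```python
import math

def gt_polar(x1, y1, x2, y2):
    """Half diagonal length and diagonal inclination for one detection frame.
    (x1, y1): upper-left corner; (x2, y2): lower-right corner (x2 > x1)."""
    l = math.hypot(x1 - x2, y1 - y2)                     # diagonal length l_k
    theta = math.pi - math.atan((y2 - y1) / (x2 - x1))   # inclination theta_k
    return 0.5 * l, theta                                # polar_k = (l_k / 2, theta_k)
```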
Finally, for each position, the model will predict C+6 outputs, which form the set $[\tilde{H},\tilde{D},\tilde{O},\widetilde{Polar}]$ and share the weights of the network; the overall loss function is defined by:

$$\mathcal{L}=\mathcal{L}_{Heat}+\lambda_{Off}\mathcal{L}_{Off}+\lambda_{D}\mathcal{L}_{D}+\lambda_{Polar}\mathcal{L}_{Polar}$$

wherein, in all the experiments, $\lambda_{Off}=10$, and $\lambda_{D}$ and $\lambda_{Polar}$ are both taken as 0.1.
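The weighted combination can then be sketched directly, using the λ values stated above:

```python
def total_loss(l_heat, l_off, l_d, l_polar,
               lam_off=10.0, lam_d=0.1, lam_polar=0.1):
    """L = L_Heat + lambda_Off*L_Off + lambda_D*L_D + lambda_Polar*L_Polar."""
    return l_heat + lam_off * l_off + lam_d * l_d + lam_polar * l_polar
```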
In Step 3 of the infrastructure fault target detection network CenWholeNet in the first component, the method of outputting a result is as follows:
    • outputting results by extracting possible center keypoint coordinates from the predicted heatmap tensor $\tilde{H}$, and then obtaining the predicted bounding box according to the information in the corresponding $\tilde{D}$, $\tilde{O}$ and $\widetilde{Polar}$; the greater the value of $\tilde{H}_{c,x,y}$, the more likely it is the center point; for category c, if the point $p_{cxy}$ satisfies the following formula, $p_{cxy}$ is considered a candidate center point:

$$\tilde{H}_{c,x,y}=\max_{i^2\le 1,\ j^2\le 1,\ i,j\in\mathbb{Z}}\left[\tilde{H}_{c,x+i,y+j}\right]$$
    • wherein non-maximum suppression (NMS) is not needed, and a 3×3 max-pooling convolutional layer is used to extract the candidate center points; letting the set of center points be $\tilde{P}=\{(\tilde{x}_k,\tilde{y}_k)\}_{k=1}^{N_p}$, wherein $N_p$ is the total number of selected center points; for any center point $(\tilde{x}_k,\tilde{y}_k)$, extracting the corresponding size information $(\tilde{w}_k,\tilde{h}_k)=(\tilde{D}_{1,\tilde{x}_k,\tilde{y}_k},\tilde{D}_{2,\tilde{x}_k,\tilde{y}_k})$, offset information $(\delta\tilde{x}_k,\delta\tilde{y}_k)=(\tilde{O}_{1,\tilde{x}_k,\tilde{y}_k},\tilde{O}_{2,\tilde{x}_k,\tilde{y}_k})$ and polar coordinate information $(\tilde{l}_k,\tilde{\theta}_k)=(\widetilde{Polar}_{1,\tilde{x}_k,\tilde{y}_k},\widetilde{Polar}_{2,\tilde{x}_k,\tilde{y}_k})$; first, calculating the prediction frame size correction values according to $(\tilde{l}_k,\tilde{\theta}_k)$:

$$\begin{cases}\Delta\tilde{h}_k=\tilde{l}_k\sin(\tilde{\theta}_k)\\\Delta\tilde{w}_k=-\tilde{l}_k\cos(\tilde{\theta}_k)\end{cases}$$
    • defining the specific location of the prediction box as:

$$\begin{cases}\mathrm{Top}=\tilde{y}_k+\delta\tilde{y}_k-\left(\alpha_y\cdot\frac{1}{2}\tilde{h}_k+\beta_y\cdot\Delta\tilde{h}_k\right)\\\mathrm{Bottom}=\tilde{y}_k+\delta\tilde{y}_k+\left(\alpha_y\cdot\frac{1}{2}\tilde{h}_k+\beta_y\cdot\Delta\tilde{h}_k\right)\\\mathrm{Left}=\tilde{x}_k+\delta\tilde{x}_k-\left(\alpha_x\cdot\frac{1}{2}\tilde{w}_k+\beta_x\cdot\Delta\tilde{w}_k\right)\\\mathrm{Right}=\tilde{x}_k+\delta\tilde{x}_k+\left(\alpha_x\cdot\frac{1}{2}\tilde{w}_k+\beta_x\cdot\Delta\tilde{w}_k\right)\end{cases}$$

    • wherein the bounding box resizing hyperparameters are taken as $\alpha_y=\alpha_x=0.9$ and $\beta_y=\beta_x=0.1$.
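The whole decoding step can be sketched as follows, assuming a PyTorch-style implementation with single-image tensors laid out as (channel, y, x); `top_k` is an illustrative cap on the number of candidate centers, not a value prescribed by this disclosure:

```python
import torch
import torch.nn.functional as F

def decode(heat, size, off, polar, top_k=100, ax=0.9, ay=0.9, bx=0.1, by=0.1):
    """Decode [heat (C,Hr,Wr), size/off/polar (2,Hr,Wr)] into
    (top_k, 6) boxes: [left, top, right, bottom, score, category]."""
    # A 3x3 max pooling keeps only local maxima, replacing NMS.
    hmax = F.max_pool2d(heat.unsqueeze(0), 3, stride=1, padding=1).squeeze(0)
    heat = heat * (hmax == heat).float()

    C, Hr, Wr = heat.shape
    scores, idx = heat.view(-1).topk(top_k)
    cat = idx // (Hr * Wr)                          # category index
    rem = idx % (Hr * Wr)
    ys, xs = rem // Wr, rem % Wr                    # candidate center coordinates

    w, h = size[0, ys, xs], size[1, ys, xs]         # size information
    dx, dy = off[0, ys, xs], off[1, ys, xs]         # offset information
    l, th = polar[0, ys, xs], polar[1, ys, xs]      # polar information

    dh, dw = l * torch.sin(th), -l * torch.cos(th)  # polar corrections
    half_h = ay * 0.5 * h + by * dh
    half_w = ax * 0.5 * w + bx * dw
    xf, yf = xs.float() + dx, ys.float() + dy       # offset-corrected centers
    return torch.stack([xf - half_w, yf - half_h,   # Left, Top
                        xf + half_w, yf + half_h,   # Right, Bottom
                        scores, cat.float()], dim=1)
```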
Further, a method of establishing the parallel attention module in the second component is as follows.
As we all know, attention plays a very important role in human perception. When human eyes or ears and other organs acquire information, they tend to focus on more interesting targets and improve their attention; while suppressing uninteresting targets, reduce its attention. Inspired by human attention, some researchers recently proposed a bionic idea, attention mechanism: by embedding attention modules in neural networks, increase the weight of feature tensors in meaningful regions, reducing the weights of areas such as meaningless backgrounds, which can improve the performance of the network.
The present invention discloses a lightweight, plug-and-play parallel attention module PAM, configured to improve the expressiveness of neural networks; wherein PAM considers two dimensions of feature map attention, spatial attention and channel attention, and combines them in parallel;
    • giving an input feature map as X∈ℝ^(C×W×H), wherein C, H and W denote channel, height and width, respectively; first, implementing the spatial attention sub-module transformation 𝒯1: X→Ũ∈ℝ^(C×W×H); then, implementing the channel attention sub-module transformation 𝒯2: X→Û∈ℝ^(C×W×H); finally, outputting the feature map U∈ℝ^(C×W×H); the transformations consist essentially of convolution, maximum pooling, mean pooling and the ReLU function; and the overall calculation process is as follows:
U = Ũ ⊕ Û = 𝒯1(X) ⊕ 𝒯2(X)
    • wherein ⊕ represents output pixel-level tensor addition;
    • the spatial attention sub-module is configured to emphasize “where” to improve attention, and pay attention to the locations of regions of interest (ROIs); first, maximum pooling and mean pooling operations are performed on the feature map along the channel direction to obtain two two-dimensional maps, λ1·Uavg_s∈ℝ^(1×W×H) and λ2·Umax_s∈ℝ^(1×W×H), wherein λ1 and λ2 are adjustable hyperparameters weighting the different pooling operations, taken as λ1=2, λ2=1; Uavg_s and Umax_s are calculated by the following formulas, wherein MaxPool and AvgPool represent the maximum pooling operation and the average pooling operation respectively:
Uavg_s(1,i,j) = AvgPool(X) = (1/C)·Σ_(k=1)^C X(k,i,j),  i∈[1,W], j∈[1,H]
Umax_s(1,i,j) = MaxPool(X) = max_(k∈[1,C]) X(k,i,j),  i∈[1,W], j∈[1,H]
Next, a convolution operation is introduced to generate the spatial attention weight Uspa∈ℝ^(1×W×H); the overall calculation process of the spatial attention sub-module is as follows:
𝒯1(X) = Ũ = Uspa⊗X = σ(Conv([λ1·Uavg_s, λ2·Umax_s]))⊗X
which is equivalent to:
𝒯1(X) = σ(Conv([MaxPool(X), AvgPool(X), AvgPool(X)]))⊗X
    • wherein, ⊗ represents pixel-level tensor multiplication, σ represents a sigmoid activation function, Conv represents a convolution operation, and a convolution kernel size is 3×3; and a spatial attention weight is copied along a channel axis;
    • the channel attention sub-module is configured to find the relationship of internal channels, and care about “what” is interesting in a given feature map; first, mean pooling and max pooling are performed along the width and height directions to generate two 1-dimensional vectors, λ3·Uavg_c∈ℝ^(C×1×1) and λ4·Umax_c∈ℝ^(C×1×1), wherein λ3 and λ4 are adjustable hyperparameters weighting the different pooling operations, taken as λ3=2, λ4=1; Uavg_c and Umax_c are calculated by the following formulas:
Uavg_c(k,1,1) = AvgPool(X) = (1/(W×H))·Σ_(i=1)^W Σ_(j=1)^H X(k,i,j),  k∈[1,C]
Umax_c(k,1,1) = MaxPool(X) = max_(i∈[1,W], j∈[1,H]) X(k,i,j),  k∈[1,C]
Subsequently, point-wise convolution (PConv) is introduced as a channel context aggregator to realize point-wise inter-channel interaction; in order to reduce the amount of parameters, PConv is designed in the form of an hourglass, with an attenuation ratio r; finally, the channel attention weight Ucha∈ℝ^(C×1×1) is obtained; the calculation process of this sub-module is as follows:
𝒯2(X) = Û = Ucha⊗X = σ(ΣPConv([λ3·Uavg_c, λ4·Umax_c]))⊗X
which is equivalent to:
𝒯2(X) = σ(ΣPConv2(δ(PConv1([λ3·Uavg_c, λ4·Umax_c]))))⊗X
    • wherein δ represents the ReLU activation function; the size of the convolution kernel of PConv1 is C/r×C×1×1, and the size of the convolution kernel of the inverse transform PConv2 is C×C/r×1×1; the ratio r is selected as 16, and the channel attention weight is copied along the width and height directions;
    • wherein the PAM is a plug-and-play module, which ensures strict consistency of the input tensor and output tensor at the dimension level; PAM is configured to be embedded at any position of any convolutional neural network model as a supplementary module; the method of embedding PAM into Hourglass and ResNet is as follows: for the ResNet network, the PAM is embedded in each residual block after the batch normalization layer and before the residual connection; the Hourglass network is divided into two parts, downsampling and upsampling: the downsampling part embeds the PAM between the residual blocks as a transition module, and the upsampling part embeds the PAM before the residual connection. Details are presented in the drawings.
Further, the LIDAR-based unmanned surface vehicle of the third component comprises four modules: a hull module, a video acquisition module, a lidar navigation module and a ground station module, working together in a cooperative manner.
The hull module includes a trimaran and a power system; the trimaran is configured to be stable, resist level 6 wind and waves, and has an effective remote control distance of 500 meters, adaptable to engineering application scenarios; the size of the hull is 75×47×28 cm, which is convenient for transportation; an effective load of the surface vehicle is 5 kg, and configured to be installed with multiple scientific instruments; in addition, the unmanned surface vehicle has the function of constant speed cruise, which reduces the control burden of personnel.
The video acquisition module is composed of a three-axis camera pan/tilt, a fixed front camera and a fill light; the three-axis camera pan/tilt supports 10× optical zoom, auto focus, photography and 60 FPS video recording, so the video acquisition module is configured to meet the shooting requirements of faults of different scales and locations; the fixed front camera is configured to determine the hull posture; the picture is transmitted back to a ground station in real time through a wireless image transmission device, on the one hand for fault identification, and on the other hand for assisting the control of the USV; a controllable LED fill light board containing 180 high-brightness LED lamp beads is installed to cope with small and medium-sized bridges and other low-light working environments; a 3D-printed pan/tilt carries the LED fill light board to meet the needs of multi-angle fill light; in addition, fixed front-view LED light beads are also installed, providing light source support for the front-view camera.
The lidar navigation module includes a lidar, a mini computer, a transmission system and a control system; the lidar is configured to perform 360° omnidirectional scanning; after it is connected with the mini computer, it can perform real-time mapping of the surrounding environment of the unmanned surface vehicle; through wireless image transmission, the information of the surrounding scene is transmitted back to the ground station in real time, so as to realize the lidar navigation of the unmanned surface vehicle; based on the lidar navigation, the unmanned surface vehicle no longer needs GPS positioning in areas with weak GPS signals such as under bridges and in underground culverts; the wireless transmission system supports real-time transmission of 1080P video, with a maximum transmission distance of 10 kilometers; redundant transmission is used to ensure link stability and strong anti-interference; the control system consists of the wireless image transmission equipment, a Pixhawk 2.4.8 flight control and a SKYDROID T12 receiver, and through the flight control and receiver, the control system effectively controls the equipment on board.
The ground station module includes two remote controls and multiple display devices; a main remote control is used to control the unmanned surface vehicle, and a secondary remote control is used to control the surface vehicle borne scientific instruments, and the display device is used to monitor the real-time information returned by the camera and lidar; on the one hand, the display device displays the picture in real time, and on the other hand, it processes the image in real time to identify the fault; the devices cooperate with each other to realize the intelligent fault detection without a GPS signal.
The beneficial effects of the present invention are described below.
1. In terms of the intelligent detection algorithm, the present invention is the first application of an Anchor-free target detection algorithm in the field of structural faults. The detection results of traditional Anchor-based methods are affected by the setting of the prior frames (that is, the anchor boxes). When such an algorithm deals with structural faults with complex shapes, various sizes and widely varying aspect ratios (for example, the aspect ratio of an exposed steel bar may be large, while the aspect ratio of spalling may be small), the size and aspect ratio of the preset prior frames can differ greatly from the target, which leads to a low recall rate in the detection results. In addition, in order to achieve a better detection effect, a large number of prior frames are often preset; this introduces many hyperparameters and design choices, makes the design of the model more complex, and brings a larger amount of computation. Compared with Anchor-based methods, the method disclosed by the present invention abandons the complex prior frame setting, directly predicts key points and related vectors (i.e., width, height and other information), and composes them into a detection frame. The method of the invention is simpler, more direct and effective, solves the problem fundamentally, and is more suitable for the detection of engineering structure faults with complex features. In addition, the present invention proposes a novel and lightweight attention module by considering the gain effect of the attention mechanism on the expressive ability of neural network models. The experimental results show that the method described in the present invention is superior to multiple widely influential neural network models, and achieves a comprehensively better effect in the two dimensions of efficiency and accuracy. The disclosed attention module can also improve different neural network models at the cost of negligible computation.
2. In terms of intelligent detection equipment, the present invention discloses an unmanned surface vehicle solution that does not rely on GPS signals to detect faults at the bottom of small and medium bridges. Due to the constraints of design and performance, current testing equipment is often impractical when inspecting a large number of small and medium-sized bridges. Taking drones as an example, their flight often requires a wide, interference-free space and GPS-assisted positioning. However, in areas such as the bottoms of small and medium bridges with very low clearance, urban underground culverts and sewers, the space is relatively closed, the GPS signal is often very weak, and the internal situation is very complicated; there are risks such as signal loss and collision damage when a drone flies in. Some areas are also very confined, may contain toxic gases, and are difficult for humans to reach. Therefore, the engineering community urgently needs a new type of intelligent detection equipment for areas that are difficult to inspect manually or with other intelligent equipment. The present invention takes the lead in providing a highly robust unmanned surface system suitable for fault detection in relatively closed areas. The experimental results show that, while improving the detection efficiency, the system can reduce the safety risk and detection difficulty for engineers and save considerable manpower cost, and has strong engineering applicability and broad application prospects. In addition, the system proposed by the present invention is not only suitable for the bottoms of small and medium bridges, but also has great application potential in engineering scenarios such as urban underground culverts and sewers.
DESCRIPTION OF DRAWINGS
FIG. 1 is a schematic diagram of the overall framework in accordance with the aspects of the present invention;
FIG. 2 is a schematic diagram of the CenWholeNet network in accordance with the aspects of the present invention;
FIG. 3 is a detailed diagram of the attention module PAM in accordance with the aspects of the present invention;
FIG. 4 is a schematic diagram of the architecture of the unmanned ship system in accordance with the aspects of the present invention;
FIG. 5 is a schematic diagram of the polar coordinate supplementary information in accordance with the aspects of the present invention;
FIG. 6 is a schematic diagram of the proposed PAM-embedded ResNet network in accordance with the aspects of the present invention;
FIG. 7 is a schematic diagram of the PAM-embedded Hourglass network in accordance with the aspects of the present invention;
FIG. 8 is a schematic diagram of the application of the method in accordance with the aspects of the present invention in a bridge group;
FIG. 9 is a schematic diagram of the real-time mapping of the unmanned surface vehicle in accordance with the aspects of the present invention;
FIG. 10 is a schematic diagram of the detection results of the method in accordance with the aspects of the present invention;
FIG. 11 is a comparison table of the detection results between the algorithm framework in the present invention and other advanced target detection algorithms; and
FIG. 12 is a comparison of the training process of the algorithm framework of the present invention with that of other advanced target detection algorithms.
DESCRIPTION OF THE EMBODIMENTS
The present invention will be further described below with reference to the accompanying drawings and specific embodiments. It should be understood that the following specific embodiments are only used to illustrate the present invention and not to limit the scope of the present invention. After reading the present disclosure, those skilled in the art can make modifications to the various equivalent forms of the present disclosure within the scope defined by the appended claims of the present application.
An intelligent detection method for multi-type faults of near-water bridges. The overall flow chart of the technical solution is shown in FIG. 1 , including the following components:
a first component, an intelligent detection algorithm: CenWholeNet, an infrastructure fault target detection network based on deep learning, described and illustrated in FIG. 2 ;
a second component, a parallel attention module PAM embedded into the target detection network CenWholeNet, the parallel attention module including two sub-modules, a spatial attention sub-module and a channel attention sub-module, as illustrated in FIG. 3 ;
a third component, an intelligent detection equipment assembly: an unmanned surface vehicle system based on lidar navigation, the unmanned surface vehicle includes four modules, a hull module, a video acquisition module, a lidar navigation module and a ground station module. Structural design of the unmanned surface vehicle is illustrated in FIG. 4 .
Wherein the infrastructure fault target detection network CenWholeNet described in the first component comprises the following steps.
    • Step 1: a primary network: using the primary network to extract features of images;
    • Step 2: a detector: converting the extracted image features, by the detector, into the tensor forms required for calculation, and optimizing the result through a loss function;
    • Step 3: result output: converting the tensors into boundary boxes and outputting the prediction results of target detection.
Wherein Step 1 of the infrastructure fault target detection network CenWholeNet in the first component has the primary network, the method of using the primary network is as follows:
giving an input image P∈ℝ^(W×H×3), wherein W is the width of the image, H is the height of the image, and 3 represents the number of channels of the image, that is, the three RGB channels; extracting features of the input image P through the primary network; and using two convolutional neural network models, the Hourglass network and the deep residual network ResNet.
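To make Step 1 concrete, a minimal PyTorch sketch of backbone feature extraction follows; torchvision's ResNet-18, the 512×512 input and the layer slicing are assumptions for illustration only, not the exact Hourglass/ResNet configurations of the invention (which decode back to an output stride of r=4).

import torch
import torchvision

# stand-in backbone: ResNet-18 with its classification head removed
backbone = torchvision.models.resnet18(weights=None)
feature_extractor = torch.nn.Sequential(*list(backbone.children())[:-2])

image = torch.randn(1, 3, 512, 512)   # P with W = H = 512 and 3 RGB channels
features = feature_extractor(image)   # -> torch.Size([1, 512, 16, 16])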
Wherein Step 2 of the infrastructure fault target detection network CenWholeNet in the first component has the detector, the method of using the detector is as follows:
    • converting the features extracted by the primary network into an output set consisting of 4 tensors, 𝒪=[{tilde over (H)},{tilde over (D)},Õ,{tilde over (Polar)}], by the detector, as the core of CenWholeNet;
    • using {tilde over (H)}∈[0,1]^(C×W/r×H/r) to represent the heat map of central key points, where C is the number of fault categories, taken as C=3 here, and r is the output step size, that is, the down-sampling ratio; the default step size is 4, and down-sampling improves the calculation efficiency;
    • defining H∈[0,1]^(C×W/r×H/r) as the ground-truth heatmap; for category c, the ground-truth center point of location (i,j) is pcij∈ℝ^(C×W×H); first computing its down-sampled equivalent position {circumflex over (p)}cxy∈ℝ^(C×W/r×H/r), wherein x=⌊i/r⌋, y=⌊j/r⌋; then, through a Gaussian kernel function, mapping {circumflex over (p)}cxy to the tensor Yp∈ℝ^(C×W/r×H/r), wherein Yp is defined by:
Yp(c,x,y) = exp(−[(x−{circumflex over (p)}cxy(x))² + (y−{circumflex over (p)}cxy(y))²]/(2σp²))
    • wherein {circumflex over (p)}cxy(x) and {circumflex over (p)}cxy(y) represent the center point position (x,y), and σp=gaussian_radius/3, wherein gaussian_radius is the maximum radius of the offset of the corner points of a detection frame; the maximum radius ensures that the intersection over union between the offset detection frame and the ground-truth detection frame satisfies IoU≥t, and t=0.7 is taken in all experiments; integrating all the corresponding Yp points to get the ground-truth heat map H:
Hc,x,y = maxp [Yp(c,x,y)],  c∈[1,C], x∈[1,W/r], y∈[1,H/r]
    • wherein Hc,x,y represents the value of H at the position (c,x,y), i.e., the probability that this position is a center point; specifically, Hc,x,y=1 represents a central key point, a positive sample; conversely, Hc,x,y=0 is background, a negative sample; focal loss is used as a metric to measure the distance between {tilde over (H)} and H, according to the following equation (a runnable sketch of these losses follows this step's description):
ℒ_Heat = −(1/N)·Σ_(c=1)^C Σ_(x=1)^(W/r) Σ_(y=1)^(H/r) { (1−{tilde over (H)}c,x,y)^α·log({tilde over (H)}c,x,y)  if Hc,x,y=1;  (1−Hc,x,y)^β·({tilde over (H)}c,x,y)^α·log(1−{tilde over (H)}c,x,y)  otherwise }
    • wherein N is the total count of all central key points, and α and β are hyperparameters configured to control the weights; in all cases, α=2, β=4; by minimizing ℒ_Heat, the neural network model is configured to better predict the position of the center point of the target;
    • obtaining the size information of a prediction box to finally determine the boundary box;
    • defining the size of the ground-truth boundary box corresponding to the kth key point pk to be dk=(wk,hk), and integrating all dk to get the ground-truth boundary box dimension tensor D∈ℝ^(2×W/r×H/r):
D = d1 ⊕ d2 ⊕ … ⊕ dN
    • wherein ⊕ represents pixel-level addition; for all fault categories, the model is configured to give a predicted dimension tensor {tilde over (D)}∈ℝ^(2×W/r×H/r), and smooth L1 loss is configured to measure the similarity of D and {tilde over (D)}, determined by the following equation:
ℒ_D = (1/N)·Σ_(k=1)^N SmoothL1Loss({tilde over (d)}k, dk) = (1/N)·Σ_(k=1)^N { 0.5·∥{tilde over (d)}k−dk∥₂²  if ∥{tilde over (d)}k−dk∥₁<1;  ∥{tilde over (d)}k−dk∥₁−0.5  otherwise }
    • the model obtains a rough width and height of each prediction box by minimizing ℒ_D;
    • correcting the error caused by down-sampling by introducing a position offset, because the image is scaled by r times; recording the coordinates of the kth key point pk as (xk,yk), the mapped coordinates are (⌊xk/r⌋, ⌊yk/r⌋), and the ground-truth offset is:
ok = (xk/r − ⌊xk/r⌋, yk/r − ⌊yk/r⌋)
    • integrating all ok to get the ground-truth offset matrix O∈ℝ^(2×W/r×H/r):
O = o1 ⊕ o2 ⊕ … ⊕ oN
    • wherein the 2 of the first dimension represents the offsets of the key point (x,y) in the W and H directions; correspondingly, the model will give a prediction tensor Õ∈ℝ^(2×W/r×H/r), and smooth L1 loss is used to train the offset:
ℒ_Off = (1/N)·Σ_(k=1)^N SmoothL1Loss(õk, ok) = (1/N)·Σ_(k=1)^N { 0.5·∥õk−ok∥₂²  if ∥õk−ok∥₁<1;  ∥õk−ok∥₁−0.5  otherwise }
    • introducing a new set of tensors to modify the prediction frame and improve the detection accuracy, in order to make the model pay more attention to the overall information of the target; specifically, taking the angle between the connecting line of the diagonal points of the detection frame and the x-axis, and the diagonal length of the detection frame, as the training targets, as shown in FIG. 5 ; defining the coordinates of the upper left corner and lower right corner of the detection frame to be (xk¹,yk¹) and (xk²,yk²), so the diagonal length lk of the detection frame is calculated as:
lk = √[(xk¹−xk²)² + (yk¹−yk²)²]
    • the inclination θk of the connecting line between the upper left and lower right corners is calculated by the following formula:
θk = π − arctan[(yk²−yk¹)/(xk²−xk¹)]
    • constructing a pair of complementary polar coordinates polark = (½lk, θk), and further obtaining the ground-truth polar coordinate matrix Polar∈ℝ^(2×W/r×H/r):
Polar = (½l1, θ1) ⊕ (½l2, θ2) ⊕ … ⊕ (½lN, θN)
    • the model also gives a prediction tensor {tilde over (Polar)}∈ℝ^(2×W/r×H/r); Polar and {tilde over (Polar)} are trained by the same L1 loss:
ℒ_Polar = (1/N)·Σ_(k=1)^N ∥{tilde over (polar)}k − polark∥₁
Finally, for each position, the model will predict C+6 outputs, which form the set 𝒪=[{tilde over (H)},{tilde over (D)},Õ,{tilde over (Polar)}], all sharing the weights of the network; and the loss function of the network is defined by:
ℒ = ℒ_Heat + λ_Off·ℒ_Off + λ_D·ℒ_D + λ_Polar·ℒ_Polar
wherein in all the experiments, λ_Off = 10, and λ_D and λ_Polar are both taken as 0.1.
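To make the detector targets and losses of Step 2 concrete, the following PyTorch sketch builds the ground-truth heatmap H with the Gaussian kernel and evaluates ℒ_Heat and the weighted total loss; the helper names (gaussian_heatmap, heat_loss, total_loss) and the reduction details are illustrative assumptions, not the patented implementation.

import torch

def gaussian_heatmap(C, Wr, Hr, centers, sigma_p):
    # Splat ground-truth center points (c, x, y) into H in [0,1]^(C x Wr x Hr)
    # with the Gaussian kernel Yp; overlapping Gaussians are merged with max,
    # matching Hc,x,y = maxp[Yp(c,x,y)].
    H = torch.zeros(C, Wr, Hr)
    xs = torch.arange(Wr, dtype=torch.float32).view(-1, 1)
    ys = torch.arange(Hr, dtype=torch.float32).view(1, -1)
    for c, px, py in centers:
        Yp = torch.exp(-((xs - px) ** 2 + (ys - py) ** 2) / (2 * sigma_p ** 2))
        H[c] = torch.maximum(H[c], Yp)
    return H

def heat_loss(H_pred, H_gt, alpha=2.0, beta=4.0, eps=1e-6):
    # Penalty-reduced focal loss L_Heat; N counts ground-truth center points.
    pos = H_gt.eq(1.0).float()
    N = pos.sum().clamp(min=1.0)
    pos_term = (1 - H_pred) ** alpha * torch.log(H_pred + eps) * pos
    neg_term = ((1 - H_gt) ** beta) * (H_pred ** alpha) \
        * torch.log(1 - H_pred + eps) * (1 - pos)
    return -(pos_term + neg_term).sum() / N

def total_loss(l_heat, l_off, l_d, l_polar,
               lam_off=10.0, lam_d=0.1, lam_polar=0.1):
    # Weighted sum per the text; l_off and l_d would come from smooth L1
    # terms and l_polar from an L1 term, each evaluated at the center points.
    return l_heat + lam_off * l_off + lam_d * l_d + lam_polar * l_polar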
In Step 3 of the infrastructure fault target detection network CenWholeNet in the first component, the method of outputting a result is as follows:
    • outputting results by extracting possible center keypoint coordinates from the predicted heatmap tensor {tilde over (H)}, and then obtaining a predicted bounding box according to the information in the corresponding {tilde over (D)}, Õ and {tilde over (Polar)}; wherein the greater the value of {tilde over (H)}c,x,y, the more likely it is the center point; for category c, if the point pcxy satisfies the following formula, pcxy is considered a candidate center point:
{tilde over (H)}c,x,y = max_(|i|≤1, |j|≤1) [{tilde over (H)}c,x+i,y+j]
wherein non-maximum suppression (NMS) is not needed; a 3×3 max-pooling convolutional layer is used to extract candidate center points (see the decoding sketch after this step); letting the set of center points be {tilde over (P)}={({tilde over (x)}k,{tilde over (y)}k)}, k=1…Np, wherein Np is the total number of selected center points; for any center point ({tilde over (x)}k,{tilde over (y)}k), extracting the corresponding size information ({tilde over (w)}k,{tilde over (h)}k)=({tilde over (D)}1,{tilde over (x)}k,{tilde over (y)}k, {tilde over (D)}2,{tilde over (x)}k,{tilde over (y)}k), offset information (δ{tilde over (x)}k,δ{tilde over (y)}k)=(Õ1,{tilde over (x)}k,{tilde over (y)}k, Õ2,{tilde over (x)}k,{tilde over (y)}k) and polar coordinate information ({tilde over (l)}k,{tilde over (θ)}k)=({tilde over (Polar)}1,{tilde over (x)}k,{tilde over (y)}k, {tilde over (Polar)}2,{tilde over (x)}k,{tilde over (y)}k); first, calculating the prediction frame size correction value according to ({tilde over (l)}k,{tilde over (θ)}k):
Δ{tilde over (h)}k = {tilde over (l)}k·sin({tilde over (θ)}k);  Δ{tilde over (w)}k = −{tilde over (l)}k·cos({tilde over (θ)}k)
    • defining specific location of the prediction box as
Top = {tilde over (y)}k + δ{tilde over (y)}k − (αy·½{tilde over (h)}k + βy·Δ{tilde over (h)}k);  Bottom = {tilde over (y)}k + δ{tilde over (y)}k + (αy·½{tilde over (h)}k + βy·Δ{tilde over (h)}k)
Left = {tilde over (x)}k + δ{tilde over (x)}k − (αx·½{tilde over (w)}k + βx·Δ{tilde over (w)}k);  Right = {tilde over (x)}k + δ{tilde over (x)}k + (αx·½{tilde over (w)}k + βx·Δ{tilde over (w)}k)
    • wherein the bounding box resizing hyperparameters are taken as αy=αx=0.9, βy=βx=0.1.
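The result-output procedure of Step 3 can likewise be summarized in a short PyTorch sketch for a single image; the top-K cutoff, the function name and the absence of score thresholding are illustrative assumptions, and the returned coordinates remain on the down-sampled W/r×H/r grid.

import torch
import torch.nn.functional as F

def decode_outputs(H, D, O, Polar, K=100, a=0.9, b=0.1):
    # NMS-free decoding: a 3x3 max-pooling layer keeps only heatmap values
    # equal to their local maximum, the top-K peaks become candidate centers,
    # and boxes are assembled from D, O and Polar (layouts as in the text).
    C, Wr, Hr = H.shape
    keep = (H == F.max_pool2d(H, kernel_size=3, stride=1, padding=1)).float()
    scores, idx = (H * keep).flatten().topk(K)
    cls = torch.div(idx, Wr * Hr, rounding_mode="floor")
    x = torch.div(idx % (Wr * Hr), Hr, rounding_mode="floor")
    y = idx % Hr
    boxes = []
    for k in range(K):
        w, h = D[0, x[k], y[k]], D[1, x[k], y[k]]
        dx, dy = O[0, x[k], y[k]], O[1, x[k], y[k]]
        l, theta = Polar[0, x[k], y[k]], Polar[1, x[k], y[k]]
        dh, dw = l * torch.sin(theta), -l * torch.cos(theta)  # size corrections
        cx, cy = x[k].float() + dx, y[k].float() + dy         # corrected center
        boxes.append((int(cls[k]), float(scores[k]),
                      float(cx - (a * w / 2 + b * dw)),       # Left
                      float(cy - (a * h / 2 + b * dh)),       # Top
                      float(cx + (a * w / 2 + b * dw)),       # Right
                      float(cy + (a * h / 2 + b * dh))))      # Bottom
    return boxes  # multiply coordinates by r to map back to the input image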
Further, a method of establishing the parallel attention module in the second component is as follows.
As is well known, attention plays a very important role in human perception. When human eyes, ears and other organs acquire information, they tend to focus on more interesting targets and increase the attention paid to them, while suppressing uninteresting targets and reducing their attention. Inspired by this, some researchers have recently proposed a bionic idea, the attention mechanism: by embedding attention modules in neural networks, the weights of feature tensors in meaningful regions are increased and the weights of regions such as meaningless backgrounds are reduced, which can improve the performance of the network.
The present invention discloses a lightweight, plug-and-play parallel attention module PAM, configured to improve the expressiveness of neural networks; wherein PAM considers two dimensions of feature map attention, spatial attention and channel attention, and combines them in parallel;
giving an input feature map as X∈ℝ^(C×W×H), wherein C, H and W denote channel, height and width, respectively; first, implementing the spatial attention sub-module transformation 𝒯1: X→Ũ∈ℝ^(C×W×H); then, implementing the channel attention sub-module transformation 𝒯2: X→Û∈ℝ^(C×W×H); finally, outputting the feature map U∈ℝ^(C×W×H); the transformations consist essentially of convolution, maximum pooling, mean pooling and the ReLU function; and the overall calculation process is as follows:
U = Ũ ⊕ Û = 𝒯1(X) ⊕ 𝒯2(X)
    • wherein ⊕ represents output pixel-level tensor addition;
    • the spatial attention sub-module is configured to emphasize “where” to improve attention, and pay attention to the locations of regions of interest (ROIs); first, maximum pooling and mean pooling operations are performed on the feature map along the channel direction to obtain two two-dimensional maps, λ1·Uavg_s∈ℝ^(1×W×H) and λ2·Umax_s∈ℝ^(1×W×H), wherein λ1 and λ2 are adjustable hyperparameters weighting the different pooling operations, taken as λ1=2, λ2=1; Uavg_s and Umax_s are calculated by the following formulas, wherein MaxPool and AvgPool represent the maximum pooling operation and the average pooling operation respectively:
Uavg_s(1,i,j) = AvgPool(X) = (1/C)·Σ_(k=1)^C X(k,i,j),  i∈[1,W], j∈[1,H]
Umax_s(1,i,j) = MaxPool(X) = max_(k∈[1,C]) X(k,i,j),  i∈[1,W], j∈[1,H]
Next, a convolution operation is introduced to generate the spatial attention weight Uspa∈ℝ^(1×W×H); the overall calculation process of the spatial attention sub-module is as follows:
𝒯1(X) = Ũ = Uspa⊗X = σ(Conv([λ1·Uavg_s, λ2·Umax_s]))⊗X
which is equivalent to:
𝒯1(X) = σ(Conv([MaxPool(X), AvgPool(X), AvgPool(X)]))⊗X
    • wherein, ⊗ represents pixel-level tensor multiplication, σ represents a sigmoid activation function, Conv represents a convolution operation, and a convolution kernel size is 3×3; and a spatial attention weight is copied along a channel axis;
    • the channel attention sub-module is configured to find the relationship of internal channels, and care about “what” is interesting in a given feature map; first, mean pooling and max pooling are performed along the width and height directions to generate two 1-dimensional vectors, λ3·Uavg_c∈ℝ^(C×1×1) and λ4·Umax_c∈ℝ^(C×1×1), wherein λ3 and λ4 are adjustable hyperparameters weighting the different pooling operations, taken as λ3=2, λ4=1; Uavg_c and Umax_c are calculated by the following formulas:
Uavg_c(k,1,1) = AvgPool(X) = (1/(W×H))·Σ_(i=1)^W Σ_(j=1)^H X(k,i,j),  k∈[1,C]
Umax_c(k,1,1) = MaxPool(X) = max_(i∈[1,W], j∈[1,H]) X(k,i,j),  k∈[1,C]
Subsequently, point-wise convolution (PConv) is introduced as a channel context aggregator to realize point-wise inter-channel interaction; in order to reduce the amount of parameters, PConv is designed in the form of an hourglass, with an attenuation ratio r; finally, the channel attention weight Ucha∈ℝ^(C×1×1) is obtained; the calculation process of this sub-module is as follows:
𝒯2(X) = Û = Ucha⊗X = σ(ΣPConv([λ3·Uavg_c, λ4·Umax_c]))⊗X
which is equivalent to:
𝒯2(X) = σ(ΣPConv2(δ(PConv1([λ3·Uavg_c, λ4·Umax_c]))))⊗X
    • wherein δ represents the ReLU activation function; the size of the convolution kernel of PConv1 is C/r×C×1×1, and the size of the convolution kernel of the inverse transform PConv2 is C×C/r×1×1; the ratio r is selected as 16, and the channel attention weight is copied along the width and height directions;
    • wherein the PAM is a plug-and-play module, which ensures strict consistency of the input tensor and output tensor at the dimension level; PAM is configured to be embedded at any position of any convolutional neural network model as a supplementary module; the method of embedding PAM into Hourglass and ResNet is as follows: for the ResNet network, the PAM is embedded in each residual block after the batch normalization layer and before the residual connection; the Hourglass network is divided into two parts, downsampling and upsampling: the downsampling part embeds the PAM between the residual blocks as a transition module, and the upsampling part embeds the PAM before the residual connection. Details of the embedding are illustrated in FIG. 6 and FIG. 7 .
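As a concrete illustration of the module, PAM can be sketched as a PyTorch nn.Module; the class name, the reading of Σ as summing the shared hourglass PConv over the two pooled vectors, and the exact layer ordering are assumptions consistent with the formulas above, not the exact patented implementation.

import torch
import torch.nn as nn

class PAM(nn.Module):
    # Parallel attention: a spatial branch and a channel branch computed
    # from the same input X and fused by element-wise addition; lambda1 =
    # lambda3 = 2, lambda2 = lambda4 = 1 and r = 16 follow the text.
    def __init__(self, channels, r=16):
        super().__init__()
        self.conv_spatial = nn.Conv2d(2, 1, kernel_size=3, padding=1)
        self.pconv1 = nn.Conv2d(channels, channels // r, kernel_size=1)
        self.pconv2 = nn.Conv2d(channels // r, channels, kernel_size=1)
        self.relu = nn.ReLU()

    def forward(self, x):                              # x: (N, C, W, H)
        # spatial branch: channel-wise mean/max pooling -> 3x3 conv -> sigmoid
        avg_s = 2.0 * x.mean(dim=1, keepdim=True)      # lambda1 * U_avg_s
        max_s = 1.0 * x.amax(dim=1, keepdim=True)      # lambda2 * U_max_s
        u_spa = torch.sigmoid(self.conv_spatial(torch.cat([avg_s, max_s], 1)))
        u_tilde = u_spa * x                            # weight copied along C
        # channel branch: global mean/max pooling -> shared hourglass PConv
        avg_c = 2.0 * x.mean(dim=(2, 3), keepdim=True) # lambda3 * U_avg_c
        max_c = 1.0 * x.amax(dim=(2, 3), keepdim=True) # lambda4 * U_max_c
        hourglass = lambda v: self.pconv2(self.relu(self.pconv1(v)))
        u_cha = torch.sigmoid(hourglass(avg_c) + hourglass(max_c))
        u_hat = u_cha * x                              # copied along W and H
        return u_tilde + u_hat                         # parallel fusion U

Because the output shape strictly equals the input shape, such a module could be dropped in after a residual block's batch normalization layer and before the residual connection, as in the embedding schemes of FIG. 6 and FIG. 7 .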
Further, the LIDAR-based unmanned surface vehicle of the third component comprises four modules: a hull module, a video acquisition module, a lidar navigation module and a ground station module, working together in a cooperative manner.
The hull module includes a trimaran and a power system; the trimaran is configured to be stable, resist level 6 wind and waves, and has an effective remote control distance of 500 meters, adaptable to engineering application scenarios; the size of the hull is 75×47×28 cm, which is convenient for transportation; an effective load of the surface vehicle is 5 kg, and configured to be installed with multiple scientific instruments; in addition, the unmanned surface vehicle has the function of constant speed cruise, which reduces the control burden of personnel.
The video acquisition module is composed of a three-axis camera pan/tilt, a fixed front camera and a fill light; the three-axis camera pan/tilt supports 10× optical zoom, auto focus, photography and 60 FPS video recording, so the video acquisition module is configured to meet the shooting requirements of faults of different scales and locations; the fixed front camera is configured to determine the hull posture; the picture is transmitted back to a ground station in real time through a wireless image transmission device, on the one hand for fault identification, and on the other hand for assisting the control of the USV; a controllable LED fill light board containing 180 high-brightness LED lamp beads is installed to cope with small and medium-sized bridges and other low-light working environments; a 3D-printed pan/tilt carries the LED fill light board to meet the needs of multi-angle fill light; in addition, fixed front-view LED light beads are also installed, providing light source support for the front-view camera.
The lidar navigation module includes a lidar, a mini computer, a transmission system and a control system; the lidar is configured to perform 360° omnidirectional scanning; after it is connected with the mini computer, it can perform real-time mapping of the surrounding environment of the unmanned surface vehicle; through wireless image transmission, the information of the surrounding scene is transmitted back to the ground station in real time, so as to realize the lidar navigation of the unmanned surface vehicle; based on the lidar navigation, the unmanned surface vehicle no longer needs GPS positioning in areas with weak GPS signals such as under bridges and in underground culverts; the wireless transmission system supports real-time transmission of 1080P video, with a maximum transmission distance of 10 kilometers; redundant transmission is used to ensure link stability and strong anti-interference; the control system consists of the wireless image transmission equipment, a Pixhawk 2.4.8 flight control and a SKYDROID T12 receiver, and through the flight control and receiver, the control system effectively controls the equipment on board.
The ground station module includes two remote controls and multiple display devices; a main remote control is used to control the unmanned surface vehicle, and a secondary remote control is used to control the surface vehicle borne scientific instruments, and the display device is used to monitor the real-time information returned by the camera and lidar; on the one hand, the display device displays the picture in real time, and on the other hand, it processes the image in real time to identify the fault; the devices cooperate with each other to realize the intelligent fault detection without a GPS signal.
Embodiment 1
The inventors tested the proposed technical solutions of the present invention on a water-system bridge group (for example, the Jiulong Lake water-system bridge group in Nanjing, Jiangsu Province, China), as shown in FIG. 8 . The 3D lidar carried by the unmanned surface vehicle is combined with the SLAM algorithm, and the real-time mapping effect is shown in FIG. 9 . There are 5 small and medium-sized bridges in the bridge group. The collected images include three types of faults: cracking, flaking and rebar exposure. The pixel resolution of the fault images is 512×512. Model building, training and testing were based on the PyTorch deep learning framework. The batch size during training is taken as 2, the batch size during testing is taken as 1, and the learning rate is taken as 5×10⁻⁴. The detection results of the solution proposed by the present invention are shown in FIG. 10 , and the heat map is the visual result directly output by the network, which can provide evidence for the result of target detection.
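For reference, the reported configuration can be written as a short PyTorch sketch; the dummy tensors, the stand-in model and the choice of the Adam optimizer are assumptions for illustration, since the embodiment only specifies the framework, the batch sizes and the learning rate.

import torch
from torch.utils.data import DataLoader, TensorDataset

images = torch.randn(8, 3, 512, 512)                     # dummy 512x512 images
dataset = TensorDataset(images)
train_loader = DataLoader(dataset, batch_size=2, shuffle=True)
test_loader = DataLoader(dataset, batch_size=1, shuffle=False)

model = torch.nn.Conv2d(3, 3, kernel_size=3, padding=1)  # stand-in for CenWholeNet
optimizer = torch.optim.Adam(model.parameters(), lr=5e-4)  # lr = 5x10^-4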
The detection method disclosed in the present invention is also compared with state-of-the-art object detection models on the same dataset, including Faster R-CNN, the widely influential Anchor-based object detection method; YOLOv5, the latest model of the widely used YOLO family in industry; and CenterNet, the acclaimed Anchor-free method. In addition, the attention module PAM of the present invention is also compared with SENet and CBAM, the excellent and classic attention modules recognized by the deep learning community.
The chosen evaluation metrics are the average precision AP and the average recall AR, which are commonly used in the deep learning field; they are averaged over different categories and different images. The calculation process is briefly described below. First, a key concept, the intersection over union (IoU), is introduced. It is a common concept in the field of target detection: it measures the degree of overlap between the candidate box, that is, the prediction result of the model, and the ground-truth bounding box, as the ratio of their intersection to their union, which can be calculated by the following formula:
IoU = area(Prediction ∩ GroundTruth) / area(Prediction ∪ GroundTruth)
For each prediction box, three relationships with the ground-truth bounding boxes are considered: the number of prediction boxes whose IoU with a ground-truth bounding box is greater than the specified threshold is recorded as the true positive class TP; the number of prediction boxes whose IoU with a ground-truth bounding box is less than the threshold is recorded as the false positive class FP; and the number of undetected ground-truth bounding boxes is denoted as the false negative class FN. Then the precision can be calculated as
Precision = TP/(TP+FP) = TP/(all detections)
The recall rate can be calculated as
Recall = TP/(TP+FN) = TP/(all ground truths)
Therefore, depending on the IoU threshold, different precisions can be calculated. The IoU threshold is usually divided into 10 levels, 0.50:0.05:0.95. AP50 used in this example is the precision when the IoU threshold is 0.50, AP75 is the precision when the IoU threshold is 0.75, and the average precision AP represents the average precision under the 10 IoU thresholds, that is,
AP = (1/10)·(AP50 + AP55 + AP60 + … + AP95)
This is the most important metric for measuring model detection performance. The average recall AR is the maximum recall per image given 1, 10 and 100 detections, averaged over the categories and the 10 IoU thresholds, yielding the 3 sub-indicators AR1, AR10 and AR100. Obviously, the closer the values of AP and AR are to 1, the better the detection results and the closer they are to the labels.
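The metric definitions above translate directly into a short Python sketch; the function names and the per-threshold aggregation interface are illustrative assumptions.

def iou(box_a, box_b):
    # intersection over union of boxes given as (left, top, right, bottom)
    iw = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    ih = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = iw * ih
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def precision(tp, fp):
    return tp / (tp + fp)              # TP / all detections

def recall(tp, fn):
    return tp / (tp + fn)              # TP / all ground truths

def average_precision(ap_at_threshold):
    # AP averages the precision over the 10 IoU thresholds 0.50:0.05:0.95
    thresholds = [round(0.50 + 0.05 * i, 2) for i in range(10)]
    return sum(ap_at_threshold[t] for t in thresholds) / 10.0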
The comparison of prediction results between different methods is shown in FIG. 11 , where the parameter quantity measures the “volume” of a deep learning model, and FPS (frames per second) represents the number of images processed by the algorithm in one second, that is, the running speed of the algorithm. The method proposed by the present invention is significantly better than Faster R-CNN in the two dimensions of efficiency and accuracy. Compared with the 4 sub-versions of YOLO v5 (YOLO v5s, YOLO v5m, YOLO v5l and YOLO v5x), the detection results of YOLO v5 are surprisingly poor; comparable performance can only be achieved by training the best sub-version, YOLO v5x, for more epochs. Although YOLO v5 is slightly faster in running speed, its accuracy is far inferior to the method proposed herein. Compared with the CenterNet method, the running speed is the same, but the detection effect is much better than that of CenterNet. Two conclusions can be drawn from the comparison at the attention module level: (1) the PAM proposed by the present invention can achieve a general and substantial enhancement effect on different deep learning models at the cost of a small amount of computation; (2) compared with SENet and CBAM, PAM obtains more enhancement and is obviously better than SENet and CBAM.
The comparison of the training process between different methods is shown in FIG. 12 , and the method proposed in the present invention is marked with a circle. It can be clearly seen that although the training results oscillate to different degrees, our method can generally achieve higher AP and AR than the traditional methods; that is, a better target detection effect can be obtained.
To sum up, this specific embodiment verifies the effectiveness of the technical solution of the present invention and its applicability to complex engineering. Compared with traditional deep learning methods, the proposed intelligent detection method is more suitable for multi-type fault detection with variable slenderness ratios and complex shapes. The proposed unmanned surface vehicle system also has high robustness and high practicability.
The above disclosure is only a typical embodiment of the present invention; however, the embodiments of the present invention are not limited thereto. Any equivalent modification made by a person skilled in the art after reading this disclosure shall fall within the protection scope of the present invention.

Claims (6)

The invention claimed is:
1. A method of using an intelligent detection system for detecting multiple types of faults for near-water bridges, comprising
providing the intelligent detection system, comprised of
a first component, an intelligent detection algorithm: CenWholeNet, an infrastructure fault target detection network based on deep learning, being electrically coupled to a second component;
the second component, an embedded parallel attention module PAM into the target detection network CenWholeNet, the parallel attention module includes two sub-modules: a spatial attention sub-module and a channel attention sub-module, being electrically coupled to a third component; and
the third component, an intelligent detection equipment assembly: an unmanned surface vehicle system based on lidar navigation, the unmanned surface vehicle includes four modules, a hull module, a video acquisition module, a lidar navigation module and a ground station module;
a computer readable storage medium, having stored thereon a computer program, said program arranged to:
Step 1: using a primary network to extract features of images;
Step 2: converting the extracted image features, by a detector, into tensor forms required for calculation, and optimizing a result through a loss function;
Step 3: outputting results includes converting the tensor forms into a boundary box and outputting of prediction results of target detection.
2. The method of claim 1, wherein
Step 1 of the infrastructure fault target detection network CenWholeNet in the first component having the primary network, the method of using the primary network is as follows:
giving an input image P∈ℝ^(W×H×3), wherein W is the width of the image, H is the height of the image, and 3 represents the number of channels of the image, that is, three RGB channels;
extracting features of the input image P through the primary network;
using two convolutional neural network models, Hourglass network and deep residual network ResNet.
3. The method of claim 1, wherein Step 2 of the infrastructure fault target detection network CenWholeNet in the first component having the detector, the method of using the detector is as follows:
converting the features extracted by the primary network into an output set consisting of 4 tensors 𝒪=[{tilde over (H)},{tilde over (D)},Õ,{tilde over (Polar)}], by the detector, as a core of CenWholeNet;
using {tilde over (H)}∈[0,1]^(C×W/r×H/r) to represent a heat map of a central key point, where C is a category of the fault, which is taken as C=3 here, and r is an output step size, that is, the down sampling ratio, wherein a default step size is 4 and, by down sampling, the calculation efficiency is improved;
defining H∈[0,1]^(C×W/r×H/r) as a ground-truth heatmap; for category c, the ground-truth center point of location (i, j) is pcij∈ℝ^(C×W×H);
first, computing the down-sampled equivalent position {circumflex over (p)}cxy∈ℝ^(C×W/r×H/r) of the ground-truth center point of location (i, j), wherein x=⌊i/r⌋, y=⌊j/r⌋; then, through a Gaussian kernel function, mapping {circumflex over (p)}cxy to the tensor Yp∈ℝ^(C×W/r×H/r), wherein Yp is defined by:
Yp(c,x,y) = exp(−[(x−{circumflex over (p)}cxy(x))² + (y−{circumflex over (p)}cxy(y))²]/(2σp²))
wherein {circumflex over (p)}cxy(x) and {circumflex over (p)}cxy(y) represent center point position (x,y),
σp=gaussian_radius/3; and gaussian_radius represent a maximum radius representing an offset of the corner points of a detection frame, wherein the maximum radius ensures that the intersection ratio between the offset detection frame and the ground-truth detection frame is IoU≥t, and t=0.7 is taken in all experiments; integrating all the corresponding Yp points to get the ground-truth heat map H:
Hc,x,y = maxp [Yp(c,x,y)],  c∈[1,C], x∈[1,W/r], y∈[1,H/r]
wherein, Hc,x,y represents a value of H at the position (c,x,y), a probability that this position is a center point; specifically, Hc,x,y=1 represents a central key point, a positive sample; conversely, Hc,x,y=0 is a background and the negative sample; focal loss is used as a metric to measure a distance between {tilde over (H)} and H, according to the following equation:
ℒ_Heat = −(1/N)·Σ_(c=1)^C Σ_(x=1)^(W/r) Σ_(y=1)^(H/r) { (1−{tilde over (H)}c,x,y)^α·log({tilde over (H)}c,x,y)  if Hc,x,y=1;  (1−Hc,x,y)^β·({tilde over (H)}c,x,y)^α·log(1−{tilde over (H)}c,x,y)  otherwise }
wherein N is a total count of all central key points, and α and β are hyperparameters configured to control the weights; in all cases, α=2, β=4; by minimizing ℒ_Heat, the neural network model is configured to better predict a position of a center point of the target;
obtaining a size information W×H of a prediction box to finally determine the boundary box;
defining a size of a ground-truth boundary box corresponding to the kth key point pk be dk=(wk,hk), and integrate all dk to get a ground-truth boundary box dimension tensor
D∈ℝ^(2×W/r×H/r):
D = d1 ⊕ d2 ⊕ … ⊕ dN
wherein ⊕ represents pixel-level addition; for all fault categories, the model is configured to give a predicted dimension tensor
{tilde over (D)}∈ℝ^(2×W/r×H/r), and smooth L1 loss is configured to measure D and {tilde over (D)} similarity, determined by the following equation:
ℒ_D = (1/N)·Σ_(k=1)^N SmoothL1Loss({tilde over (d)}k, dk) = (1/N)·Σ_(k=1)^N { 0.5·∥{tilde over (d)}k−dk∥₂²  if ∥{tilde over (d)}k−dk∥₁<1;  ∥{tilde over (d)}k−dk∥₁−0.5  otherwise }
obtaining a rough width and height of each prediction box by minimizing ℒ_D, by the model;
correcting an error caused by down sampling by introducing a position offset, because the image is scaled by r times; recording the coordinates of the kth key point pk as (xk,yk), then the mapped coordinates are
(⌊xk/r⌋, ⌊yk/r⌋), and then get the ground-truth offset:
ok = (xk/r − ⌊xk/r⌋, yk/r − ⌊yk/r⌋)
integrating all ok to get the ground-truth offset matrix
O∈ℝ^(2×W/r×H/r):
O = o1 ⊕ o2 ⊕ … ⊕ oN
wherein, the 2 of a first dimension represents the offset of the key point (x, y) in the W and H directions; correspondingly, the model will give a prediction tensor
Õ∈ℝ^(2×W/r×H/r), and smooth L1 loss is used to train the offset loss:
ℒ_Off = (1/N)·Σ_(k=1)^N SmoothL1Loss(õk, ok) = (1/N)·Σ_(k=1)^N { 0.5·∥õk−ok∥₂²  if ∥õk−ok∥₁<1;  ∥õk−ok∥₁−0.5  otherwise }
introducing a new set of tensors to modify the prediction frame and improve the detection accuracy, in order to make the model pay more attention to the overall information of the target; specifically, taking an angle between a connecting line of diagonal points of the detection frame and the x-axis, and the diagonal length of the detection frame as the training targets; defining coordinates of an upper left corner and lower right corner of the detection frame to be (xk¹,yk¹) and (xk²,yk²), so the diagonal length of the detection frame lk is calculated as:

lk = √[(xk¹−xk²)² + (yk¹−yk²)²]
an inclination of the connecting line between the upper left and lower right corners θk is calculated by the following formula:
θk = π − arctan[(yk²−yk¹)/(xk²−xk¹)]
constructing a pair of complementary polar coordinates polark = (½lk, θk), and further obtaining a ground-truth polar coordinate matrix Polar∈ℝ^(2×W/r×H/r):
Polar = (½l1, θ1) ⊕ (½l2, θ2) ⊕ … ⊕ (½lN, θN)
the model also gives a prediction tensor {tilde over (Polar)}∈ℝ^(2×W/r×H/r); Polar and {tilde over (Polar)} are trained by the same L1 loss:
ℒ_Polar = (1/N)·Σ_(k=1)^N ∥{tilde over (polar)}k − polark∥₁
for each position, the model will predict C+6 outputs, which form the set 𝒪=[{tilde over (H)},{tilde over (D)},Õ,{tilde over (Polar)}], all sharing the weights of the network; and the loss function of the network is defined by:

ℒ = ℒ_Heat + λ_Off·ℒ_Off + λ_D·ℒ_D + λ_Polar·ℒ_Polar
wherein in all the experiments, λ_Off = 10, and λ_D and λ_Polar are both taken as 0.1.
4. The method of claim 1, wherein Step 3 of the infrastructure fault target detection network CenWholeNet in the first component, the method of outputting result is as follows:
outputting results by extracting possible center keypoint coordinates from a predicted heatmap tensor {tilde over (H)}, and then obtaining a predicted bounding box according to the information in the corresponding {tilde over (D)}, Õ and {tilde over (Polar)}; wherein the greater the value of {tilde over (H)}c,x,y, the more likely the point pcxy is the center point; for category c, if the point pcxy satisfies the following formula, then pcxy is a candidate center point:
{tilde over (H)}c,x,y = max_(|i|≤1, |j|≤1) [{tilde over (H)}c,x+i,y+j]
wherein non-maximum suppression (NMS) is not needed; a 3×3 max-pooling convolutional layer is used to extract candidate center points; letting the set of center points be {tilde over (P)}={({tilde over (x)}k,{tilde over (y)}k)}, k=1…Np, wherein Np is the total number of selected center points; for any center point ({tilde over (x)}k,{tilde over (y)}k), extracting the corresponding size information ({tilde over (w)}k,{tilde over (h)}k)=({tilde over (D)}1,{tilde over (x)}k,{tilde over (y)}k, {tilde over (D)}2,{tilde over (x)}k,{tilde over (y)}k), offset information (δ{tilde over (x)}k,δ{tilde over (y)}k)=(Õ1,{tilde over (x)}k,{tilde over (y)}k, Õ2,{tilde over (x)}k,{tilde over (y)}k) and polar coordinate information ({tilde over (l)}k,{tilde over (θ)}k)=({tilde over (Polar)}1,{tilde over (x)}k,{tilde over (y)}k, {tilde over (Polar)}2,{tilde over (x)}k,{tilde over (y)}k); first, calculating the prediction frame size correction value according to ({tilde over (l)}k,{tilde over (θ)}k):
Δ{tilde over (h)}k = {tilde over (l)}k·sin({tilde over (θ)}k);  Δ{tilde over (w)}k = −{tilde over (l)}k·cos({tilde over (θ)}k)
defining specific location of the prediction box as
Top = {tilde over (y)}k + δ{tilde over (y)}k − (αy·½{tilde over (h)}k + βy·Δ{tilde over (h)}k);  Bottom = {tilde over (y)}k + δ{tilde over (y)}k + (αy·½{tilde over (h)}k + βy·Δ{tilde over (h)}k)
Left = {tilde over (x)}k + δ{tilde over (x)}k − (αx·½{tilde over (w)}k + βx·Δ{tilde over (w)}k);  Right = {tilde over (x)}k + δ{tilde over (x)}k + (αx·½{tilde over (w)}k + βx·Δ{tilde over (w)}k)
wherein the bounding box resizing hyperparameters are taken as αy=αx=0.9, βy=βx=0.1.
5. The method of claim 1, wherein the method of establishing the parallel attention module in the second component is as follows:
providing a lightweight, plug-and-play parallel attention module PAM, configured to improve the expressiveness of neural networks; wherein PAM considers two dimensions of feature map attention, spatial attention and channel attention, and combines them in parallel;
giving an input feature map as X∈ℝ^(C×W×H), wherein C, H and W denote channel, height and width, respectively; first, implementing the spatial attention sub-module transformation 𝒯1: X→Ũ∈ℝ^(C×W×H); then, implementing the channel attention sub-module transformation 𝒯2: X→Û∈ℝ^(C×W×H); finally, outputting the feature map U∈ℝ^(C×W×H); the transformations consist essentially of convolution, maximum pooling, mean pooling and the ReLU function; and the overall calculation process is as follows:
U = Ũ ⊕ Û = 𝒯1(X) ⊕ 𝒯2(X)
wherein ⊕ represents output pixel-level tensor addition;
wherein the spatial attention sub-module is configured to emphasize “where” to improve attention, and pay attention to the locations of regions of interest (ROIs); first, maximum pooling and mean pooling operations are performed on the feature map along the channel direction to obtain two two-dimensional maps, λ1·Uavg_s∈ℝ^(1×W×H) and λ2·Umax_s∈ℝ^(1×W×H), wherein λ1 and λ2 are adjustable hyperparameters weighting the different pooling operations, taken as λ1=2, λ2=1; Uavg_s and Umax_s are calculated by the following formulas, wherein MaxPool and AvgPool represent the maximum pooling operation and the average pooling operation respectively:
Uavg_s(1,i,j) = AvgPool(X) = (1/C)·Σ_(k=1)^C X(k,i,j),  i∈[1,W], j∈[1,H]
Umax_s(1,i,j) = MaxPool(X) = max_(k∈[1,C]) X(k,i,j),  i∈[1,W], j∈[1,H]
next, introducing a convolution operation to generate the spatial attention weight Uspa∈ℝ^(1×W×H); the overall calculation process of the spatial attention sub-module is as follows:
𝒯1(X) = Ũ = Uspa⊗X = σ(Conv([λ1·Uavg_s, λ2·Umax_s]))⊗X
which is equivalent to:
𝒯1(X) = σ(Conv([MaxPool(X), AvgPool(X), AvgPool(X)]))⊗X
wherein, ⊗ represents pixel-level tensor multiplication, σ represents a sigmoid activation function, Conv represents a convolution operation, and a convolution kernel size is 3×3; and a spatial attention weight is copied along a channel axis;
the channel attention sub-module is configured to find the relationship of internal channels, and care about “what” is interesting in a given feature map; first, mean pooling and max pooling are performed along the width and height directions to generate two 1-dimensional vectors, λ3·Uavg_c∈ℝ^(C×1×1) and λ4·Umax_c∈ℝ^(C×1×1), wherein λ3 and λ4 are adjustable hyperparameters weighting the different pooling operations, taken as λ3=2, λ4=1; Uavg_c and Umax_c are calculated by the following formulas:
Uavg_c(k,1,1) = AvgPool(X) = (1/(W×H))·Σ_(i=1)^W Σ_(j=1)^H X(k,i,j),  k∈[1,C]
Umax_c(k,1,1) = MaxPool(X) = max_(i∈[1,W], j∈[1,H]) X(k,i,j),  k∈[1,C]
subsequently, introducing point-wise convolution (PConv) as a channel context aggregator to realize point-wise inter-channel interaction; in order to reduce the amount of parameters, PConv is designed in the form of an hourglass, with an attenuation ratio r; finally, the channel attention weight Ucha∈ℝ^(C×1×1) is obtained; the calculation process of this sub-module is as follows:
𝒯2(X) = Û = Ucha⊗X = σ(ΣPConv([λ3·Uavg_c, λ4·Umax_c]))⊗X
which is equivalent to:
𝒯2(X) = σ(ΣPConv2(δ(PConv1([λ3·Uavg_c, λ4·Umax_c]))))⊗X
wherein δ represents the ReLU activation function; the size of the convolution kernel of PConv1 is C/r×C×1×1, and the size of the convolution kernel of the inverse transform PConv2 is C×C/r×1×1; the ratio r is selected as 16, and the channel attention weight is copied along the width and height directions;
wherein the PAM is a plug-and-play module, which ensures strict consistency of the input tensor and output tensor at the dimension level; PAM is configured to be embedded at any position of any convolutional neural network model as a supplementary module; the method of embedding PAM into Hourglass and ResNet includes: for the ResNet network, the PAM is embedded in each residual block after the batch normalization layer and before the residual connection; the Hourglass network is divided into two parts, downsampling and upsampling: the downsampling part embeds the PAM between the residual blocks as a transition module, and the upsampling part embeds the PAM before the residual connection.
6. The method of claim 1, wherein the LIDAR-based unmanned surface vehicle of the third component comprises
four modules: the hull module, the video acquisition module, the lidar navigation module and the ground station module, working together in a cooperative manner;
the hull module includes a trimaran and a power system; the trimaran is configured to be stable, to resist level 6 wind and waves, and to provide an effective remote control distance of 500 meters, making it adaptable to engineering application scenarios; the hull measures 75×47×28 cm, which is convenient for transportation; the effective payload of the surface vehicle is 5 kg, allowing multiple scientific instruments to be installed; in addition, the unmanned surface vehicle has a constant-speed cruise function, which reduces the control burden on personnel;
the video acquisition module is composed of a three-axis camera pan/tilt, a fixed front camera and a fill light; the three-axis camera pan/tilt supports 10× optical zoom, auto focus, photography and 60 FPS video recording, and is configured to meet the shooting requirements for faults of different scales and locations; the fixed front camera is configured to determine the hull posture; the picture is transmitted back to a ground station in real time through a wireless image transmission device, on the one hand for fault identification and on the other hand for assisting control of the USV; a controllable LED fill light board containing 180 high-brightness LED lamp beads is installed to cope with small and medium-sized bridges and other low-light working environments; a 3D-printed pan/tilt carries the LED fill light board to meet the needs of multi-angle fill light; in addition, fixed front-view LED lamp beads are also installed, providing light source support for the front-view camera;
the lidar navigation module includes a lidar, a mini computer, a transmission system and a control system; the lidar is configured to perform 360° omnidirectional scanning; once connected with the mini computer, the lidar can perform real-time mapping of the surroundings of the unmanned surface vehicle; through wireless image transmission, information on the surrounding scene is transmitted back to the ground station in real time, realizing lidar navigation of the unmanned surface vehicle; based on the lidar navigation, the unmanned surface vehicle no longer needs GPS positioning and can operate in areas with weak GPS signals, such as under bridges and in underground culverts; the wireless transmission system supports real-time transmission of 1080P video with a maximum transmission distance of 10 kilometers, and uses redundant transmission to ensure link stability and strong anti-interference; the control system consists of wireless image transmission equipment, a Pixhawk 2.4.8 flight controller and a SKYDROID T12 receiver, through which the equipment on board is effectively controlled;
the ground station module includes two remote controls and multiple display devices; the main remote control is used to control the unmanned surface vehicle, the secondary remote control is used to control the scientific instruments carried by the surface vehicle, and the display devices are used to monitor in real time the pictures and information returned by the camera and lidar; the devices cooperate with each other to realize intelligent fault detection without a GPS signal.
US17/755,086 2021-03-17 2021-05-08 Intelligent detection method and unmanned surface vehicle for multiple type faults of near-water bridges Active US12223632B2 (en)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
CN202110285996.5A CN112884760B (en) 2021-03-17 2021-03-17 Intelligent detection method for multiple types of diseases near water bridges and unmanned ship equipment
CN202110285996.5 2021-03-17
PCT/CN2021/092393 WO2022193420A1 (en) 2021-03-17 2021-05-08 Intelligent detection method for multiple types of diseases of bridge near water, and unmanned surface vessel device

Publications (2)

Publication Number Publication Date
US20230351573A1 US20230351573A1 (en) 2023-11-02
US12223632B2 true US12223632B2 (en) 2025-02-11

Family

ID=76041072

Family Applications (1)

Application Number Title Priority Date Filing Date
US17/755,086 Active US12223632B2 (en) 2021-03-17 2021-05-08 Intelligent detection method and unmanned surface vehicle for multiple type faults of near-water bridges

Country Status (3)

Country Link
US (1) US12223632B2 (en)
CN (1) CN112884760B (en)
WO (1) WO2022193420A1 (en)

Families Citing this family (81)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021246217A1 (en) * 2020-06-05 2021-12-09 コニカミノルタ株式会社 Object detection method, object detection device, and program
CN113256601B (en) * 2021-06-10 2022-09-13 北方民族大学 Pavement damage detection method and system
CN113627245B (en) * 2021-07-02 2024-01-19 武汉纺织大学 CRTS target detection method
CN113808077A (en) * 2021-08-05 2021-12-17 西人马帝言(北京)科技有限公司 A target detection method, device, equipment and storage medium
CN113989614B (en) * 2021-10-19 2024-10-15 南京航空航天大学 A post-processing and diagonal evaluation method for road damage target detection
CN113870265B (en) * 2021-12-03 2022-02-22 绵阳职业技术学院 Industrial part surface defect detection method
CN114266299A (en) * 2021-12-16 2022-04-01 京沪高速铁路股份有限公司 Method and system for detecting defects of steel structure of railway bridge based on unmanned aerial vehicle operation
CN114358054B (en) * 2021-12-16 2024-11-08 中国人民解放军战略支援部队信息工程大学 Broadband wireless communication signal detection method and system under complex environment
CN114266892B (en) * 2021-12-20 2024-11-29 江苏燕宁工程科技集团有限公司 Pavement disease recognition method and system for multi-source data deep learning
CN114332664B (en) * 2022-01-04 2025-09-16 西南大学柑桔研究所 Method and device for identifying plant diseases and insect pests, electronic equipment and storage medium
CN114663774B (en) * 2022-05-24 2022-12-02 之江实验室 Lightweight salient object detection system and method
CN114820620B (en) * 2022-06-29 2022-09-13 中冶建筑研究总院(深圳)有限公司 Bolt loosening defect detection method, system and device
CN115061113B (en) * 2022-08-19 2022-11-01 南京隼眼电子科技有限公司 Target detection model training method and device for radar and storage medium
US12423786B2 (en) * 2022-08-22 2025-09-23 Nanjing University Of Posts And Telecommunications Multi-scale fusion defogging method based on stacked hourglass network
CN115305808A (en) * 2022-08-30 2022-11-08 武汉工程大学 Integrated application method and system of multi-type bridge detection equipment based on unmanned platform
CN115661032A (en) * 2022-09-22 2023-01-31 北京工业大学 An Intelligent Detection Method for Pavement Diseases Applicable to Complex Background
CN115393655A (en) * 2022-09-28 2022-11-25 南京华苏科技有限公司 Method for detecting industrial carrier loader based on YOLOv5s network model
CN115681736A (en) * 2022-10-31 2023-02-03 中国科学院沈阳自动化研究所 Modular large-load underwater electric pan-tilt device
CN115909072B (en) * 2022-11-29 2025-09-30 中国人民解放军海军工程大学 A water column detection method for impact point based on improved YOLOv4 algorithm
CN115574785B (en) * 2022-12-12 2023-02-28 河海大学 Water conservancy project safety monitoring method and platform based on data processing
CN116008285A (en) * 2023-01-04 2023-04-25 上海市建筑科学研究院有限公司 A bridge detection system and method based on an unmanned ship
CN116597261B (en) * 2023-03-06 2025-11-11 江苏科技大学 Unmanned ship electronic image stabilization and target detection method based on space-time context fusion
CN115953408B (en) * 2023-03-15 2023-07-04 国网江西省电力有限公司电力科学研究院 YOLOv 7-based lightning arrester surface defect detection method
CN116228740B (en) * 2023-04-07 2025-11-21 河海大学 Small sample chip appearance defect detection method and detection system based on improvement YOLOv5
CN116295020B (en) * 2023-05-22 2023-08-08 山东高速工程检测有限公司 Method and device for locating bridge defects
CN116740704B (en) * 2023-06-16 2024-02-27 安徽农业大学 Method and device for monitoring the change rate of wheat leaf phenotypic parameters based on deep learning
CN117036470B (en) * 2023-06-28 2025-12-26 南京信息工程大学 A Method for Object Recognition and Pose Estimation in a Grasping Robot
CN116777895B (en) * 2023-07-05 2024-05-31 重庆大学 Concrete bridge Liang Biaoguan disease intelligent detection method based on interpretable deep learning
CN116895007B (en) * 2023-07-18 2025-08-15 西南石油大学 Small target detection method based on improvement YOLOv n
CN117036902B (en) * 2023-07-20 2025-12-30 中国人民解放军国防科技大学 A Vehicle Target Recognition Method Based on Hybrid Attention Mechanism in SAR Images
CN116958696B (en) * 2023-07-31 2025-10-03 长安大学 A method and system for classifying objects using hyperspectral remote sensing technology from unmanned aerial vehicles
CN116882459B (en) * 2023-08-02 2025-11-28 南通大学 Neural network construction method for identifying crop foliar diseases
CN117115688B (en) * 2023-08-17 2026-01-30 广东海洋大学 A Deep Learning-Based System and Method for Detecting and Counting Dead Fish in Low-Light Environments
CN117054891A (en) * 2023-10-11 2023-11-14 中煤科工(上海)新能源有限公司 Battery life prediction method and prediction device
CN117333845B (en) * 2023-11-03 2025-02-07 东北电力大学 A real-time detection method for small target traffic signs based on improved YOLOv5s
CN117541922B (en) * 2023-11-09 2024-08-06 国网宁夏电力有限公司建设分公司 SF-YOLOv-based power station roofing engineering defect detection method
CN117218329B (en) * 2023-11-09 2024-01-26 四川泓宝润业工程技术有限公司 Wellhead valve detection method and device, storage medium and electronic equipment
CN117523447A (en) * 2023-11-13 2024-02-06 常州工学院 A lightweight ship real-time video detection method based on YOLO-v5
CN117724137B (en) * 2023-11-21 2024-08-06 江苏北斗星通汽车电子有限公司 Automobile accident automatic detection system and method based on multi-mode sensor
CN117272245B (en) * 2023-11-21 2024-03-12 陕西金元新能源有限公司 Fan gear box temperature prediction method, device, equipment and medium
CN117705800A (en) * 2023-11-23 2024-03-15 华南理工大学 Mechanical arm vision bridge detection system based on guide rail sliding and control method thereof
CN117390407B (en) * 2023-12-13 2024-04-05 国网山东省电力公司济南供电公司 Fault identification method, system, medium and equipment of substation equipment
CN117830965B (en) * 2023-12-13 2025-10-21 华南理工大学 A vehicle detection method based on guiding spatial attention based on road semantic information
CN117456610B (en) * 2023-12-21 2024-04-12 浪潮软件科技有限公司 Climbing abnormal behavior detection method and system and electronic equipment
CN118034269B (en) * 2024-01-12 2024-09-24 淮阴工学院 An adaptive control method for ship intelligent maneuvering
CN117689731B (en) * 2024-02-02 2024-04-26 陕西德创数字工业智能科技有限公司 Lightweight new energy heavy-duty battery pack identification method based on improved YOLOv model
CN117710755B (en) * 2024-02-04 2024-05-03 江苏未来网络集团有限公司 A vehicle attribute recognition system and method based on deep learning
CN117727104B (en) * 2024-02-18 2024-05-07 厦门瑞为信息技术有限公司 Near-infrared living body detection device and method based on bilateral attention
CN117788471B (en) * 2024-02-27 2024-04-26 南京航空航天大学 YOLOv 5-based method for detecting and classifying aircraft skin defects
CN117805658B (en) * 2024-02-29 2024-05-10 东北大学 Data-driven electric vehicle battery remaining life prediction method
CN118014997A (en) * 2024-04-08 2024-05-10 湖南联智智能科技有限公司 A pavement disease recognition method based on improved YOLOV5
CN118050729B (en) * 2024-04-15 2024-07-09 南京信息工程大学 A radar echo time downscaling correction method based on improved U-Net
WO2025241215A1 (en) * 2024-05-24 2025-11-27 杭州电子科技大学 Improved deep learning model-based refrigeration unit fault detection method
CN118298340B (en) * 2024-06-06 2024-07-30 北京理工大学长三角研究院(嘉兴) A method for dense target detection in UAV aerial photography based on prior knowledge
CN118799622B (en) * 2024-06-14 2025-05-02 长江宜昌航道局 Channel ship based on improved YOLOv s algorithm and navigation mark detection method
CN118396071B (en) * 2024-07-01 2024-09-03 山东科技大学 A boundary-driven neural network architecture for unmanned vessel environment understanding
CN119067919A (en) * 2024-07-31 2024-12-03 沈阳工业大学 A metal surface defect detection method based on improved YOLOv5
CN118587733B (en) * 2024-08-06 2024-10-22 安徽省交通规划设计研究总院股份有限公司 A bridge structure identification and parameter extraction method for bridge PDF design drawings
CN118604006B (en) * 2024-08-09 2024-10-29 佛山大学 Building wall safety detection method and system
CN119151974B (en) * 2024-08-28 2025-04-01 国家海洋局北海预报中心((国家海洋局青岛海洋预报台)(国家海洋局青岛海洋环境监测中心站)) A method, medium and system for detecting wave height based on semantic segmentation
CN118735914B (en) * 2024-08-30 2025-01-21 宁波未知数字信息技术有限公司 Ship wall surface treatment method
CN118762301B (en) * 2024-09-06 2025-03-28 北京智弘通达科技有限公司 Efficient detection and treatment of track damage based on deep learning
CN119323665B (en) * 2024-09-25 2025-09-05 济南浪潮数据技术有限公司 Optical remote sensing image target detection method, device and medium
CN119027069B (en) * 2024-10-29 2025-02-25 中国电建集团华东勘测设计研究院有限公司 Photovoltaic project construction progress recognition method based on UAV images and neural network
CN119515817B (en) * 2024-11-05 2025-07-29 汇通建设集团股份有限公司 Strip steel surface defect detection method based on lightweight dual-enhancement network
CN119068446B (en) * 2024-11-06 2025-03-11 洛阳理工学院 Intelligent driving visual navigation method based on infrared target detection
CN119442910B (en) * 2024-11-10 2025-11-21 同济大学 Active expansion end device optimization control method based on deep learning model
CN119169207B (en) * 2024-11-21 2025-06-06 无锡车联天下信息技术有限公司 Vehicle-mounted monitoring dynamic linear radar wall visualization method based on deep learning
CN119600265B (en) * 2024-11-22 2025-10-28 广东工业大学 Light-weight fire hazard multi-target detection method in electric power environment
CN119323572B (en) * 2024-12-19 2025-07-01 国网江西省电力有限公司电力科学研究院 Improved insulator defect detection method and system based on RT-DETR
CN120070361A (en) * 2025-02-07 2025-05-30 北京科技大学 Steel surface defect detection method based on ACW-YOLO algorithm
CN119672541B (en) * 2025-02-19 2025-09-19 贵州黔通工程技术有限公司 Bridge bolt monitoring image recognition method and system based on deep learning
CN119810154B (en) * 2025-03-13 2025-06-03 湖南云箭科技有限公司 Method and system for extracting moving target velocity vector
CN119851100B (en) * 2025-03-21 2025-06-17 中国人民解放军国防科技大学 Method and device for detecting surface damage precursor of additive manufacturing device
CN120411824B (en) * 2025-04-21 2025-12-23 淮南师范学院 Unmanned aerial vehicle low altitude target location and recognition system
CN120070429B (en) * 2025-04-27 2025-07-22 华东交通大学 Method and system for detecting surface defects of new vehicle based on FamsYOLO network
CN120259290B (en) * 2025-06-04 2025-08-22 中数智科(杭州)科技有限公司 A method and system for detecting loose bolts in rail vehicles
CN120259982B (en) * 2025-06-04 2025-09-09 中国铁塔股份有限公司 Road construction detection methods, systems, equipment, media and program products
CN120384851B (en) * 2025-06-27 2025-09-05 水利部交通运输部国家能源局南京水利科学研究院 Wind power pile structure monitoring method and system
CN120472494B (en) * 2025-07-10 2025-09-16 华雁智能科技(集团)股份有限公司 Method, device, equipment and medium for identifying graphic elements of power grid plant wiring diagram
CN120874217A (en) * 2025-09-29 2025-10-31 中核四川环保工程有限责任公司 Nuclear retired building multi-mode self-adaptive modeling method and device based on scene driving

Citations (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107839845A (en) 2017-11-14 2018-03-27 江苏领安智能桥梁防护有限公司 A kind of unmanned monitoring ship
CN108288269A (en) 2018-01-24 2018-07-17 东南大学 Bridge pad disease automatic identifying method based on unmanned plane and convolutional neural networks
CN109300126A (en) 2018-09-21 2019-02-01 重庆建工集团股份有限公司 A high-precision intelligent detection method for bridge diseases based on spatial location
CN109978847A (en) 2019-03-19 2019-07-05 东南大学 Drag-line housing disease automatic identifying method based on transfer learning and drag-line robot
US10354169B1 (en) * 2017-12-22 2019-07-16 Motorola Solutions, Inc. Method, device, and system for adaptive training of machine learning models via detected in-field contextual sensor events and associated located and retrieved digital audio and/or video imaging
US20200043229A1 (en) * 2017-08-11 2020-02-06 Jing Jin Incident site investigation and management support system based on unmanned aerial vehicles
CN111021244A (en) 2019-12-31 2020-04-17 川南城际铁路有限责任公司 An intelligent orthotropic steel bridge deck fatigue crack detection robot
CN111062437A (en) 2019-12-16 2020-04-24 交通运输部公路科学研究所 An automatic target detection model for bridge structural diseases based on deep learning
CN111127399A (en) 2019-11-28 2020-05-08 东南大学 An underwater bridge pier disease identification method based on deep learning and sonar imaging
CN111260615A (en) 2020-01-13 2020-06-09 重庆交通大学 Detection method for apparent damage of UAV bridge based on fusion of laser and machine vision
CN111310558A (en) 2019-12-28 2020-06-19 北京工业大学 Pavement disease intelligent extraction method based on deep learning and image processing method
CN111413353A (en) 2020-04-03 2020-07-14 中铁隧道局集团有限公司 Tunnel lining disease comprehensive detection vehicle
US10719641B2 (en) * 2017-11-02 2020-07-21 Airworks Solutions, Inc. Methods and apparatus for automatically defining computer-aided design files using machine learning, image analytics, and/or computer vision
CN111651916A (en) 2020-05-15 2020-09-11 北京航空航天大学 A material property prediction method based on deep learning
CN111862112A (en) 2020-07-08 2020-10-30 哈尔滨工业大学(深圳) A medical image segmentation method based on deep learning and level set method
CN112171692A (en) 2020-10-15 2021-01-05 吉林大学 Intelligent detection device and method for bridge deflection
CN112465748A (en) 2020-11-10 2021-03-09 西南科技大学 Neural network based crack identification method, device, equipment and storage medium
CN112488990A (en) 2020-11-02 2021-03-12 东南大学 Bridge bearing fault identification method based on attention regularization mechanism
US11521357B1 (en) * 2020-11-03 2022-12-06 Bentley Systems, Incorporated Aerial cable detection and 3D modeling from images
US11769052B2 (en) * 2018-12-28 2023-09-26 Nvidia Corporation Distance estimation to objects and free-space boundaries in autonomous machine applications
US20240020953A1 (en) * 2022-07-15 2024-01-18 Nvidia Corporation Surround scene perception using multiple sensors for autonomous systems and applications

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110222701B (en) * 2019-06-11 2019-12-27 北京新桥技术发展有限公司 Automatic bridge disease identification method
CN111324126B (en) * 2020-03-12 2022-07-05 集美大学 Vision unmanned ship
CN111507271B (en) * 2020-04-20 2021-01-12 北京理工大学 A method for intelligent detection and identification of airborne optoelectronic video targets
CN112069868A (en) * 2020-06-28 2020-12-11 南京信息工程大学 Unmanned aerial vehicle real-time vehicle detection method based on convolutional neural network
CN112329827B (en) * 2020-10-26 2022-08-23 同济大学 Increment small sample target detection method based on meta-learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Fu, Jun, et al., "Dual Attention Network for Scene Segmentation," arXiv:1809.02983v4, Apr. 21, 2019.

Also Published As

Publication number Publication date
WO2022193420A1 (en) 2022-09-22
US20230351573A1 (en) 2023-11-02
CN112884760B (en) 2023-09-26
CN112884760A (en) 2021-06-01

Similar Documents

Publication Publication Date Title
US12223632B2 (en) Intelligent detection method and unmanned surface vehicle for multiple type faults of near-water bridges
US12313732B2 (en) Contextual visual-based SAR target detection method and apparatus, and storage medium
US20240013505A1 (en) Method, system, medium, equipment and terminal for inland vessel identification and depth estimation for smart maritime
CN111461291A (en) Long-distance pipeline inspection method based on YOLOv3 pruning network and deep learning dehazing model
CN113240688A (en) Integrated flood disaster accurate monitoring and early warning method
CN111985376A (en) A deep learning-based method for extracting ship contours from remote sensing images
CN112347895A (en) Ship remote sensing target detection method based on boundary optimization neural network
CN116343077A (en) A fire detection and early warning method based on attention mechanism and multi-scale features
CN117911760B (en) Ship detection method in multi-scale SAR images based on attention mechanism
CN116844055A (en) Lightweight SAR ship detection method and system
CN111414807A (en) A Tide Identification and Crisis Early Warning Method Based on YOLO Technology
CN116912675B (en) Underwater target detection method and system based on feature migration
CN118115893A (en) A Small Target Detection Method for Remote Sensing Images
CN118212464A (en) A context-based remote sensing image scene classification method and system
Geng et al. An efficient detector for maritime search and rescue object based on unmanned aerial vehicle images
CN110826478A (en) A method for identifying illegal construction in aerial photography based on adversarial network
CN116721359A (en) A method and system for standardized deployment and inspection of transmission lines based on multi-source data
Zhao Research on adaptive weight and frequency domain enhancement fusion method for small target detection
CN117372910A (en) A multi-scale-attention-oriented target detection method suitable for UAVs
He et al. LSIDA-YOLOV7: An optimized YOLOv7 based on local sensitive information data augmentation for sewer pipeline defect detection
Ding et al. Building detection algorithm in multi-scale remote sensing images based on attention mechanism
CN118230253B (en) Iron tower video image farmland extraction method and device based on attention mechanism
CN111898702B (en) Unmanned ship environment intelligent sensing method based on deep learning
Tan et al. VisLanding: Monocular 3D Perception for UAV Safe Landing via Depth-Normal Synergy
Liu et al. VOS-net: real-time oil spill detection in uav videos via lightweight adapter tuning

Legal Events

Date Code Title Description
FEPP Fee payment procedure

Free format text: ENTITY STATUS SET TO UNDISCOUNTED (ORIGINAL EVENT CODE: BIG.); ENTITY STATUS OF PATENT OWNER: SMALL ENTITY

AS Assignment

Owner name: SOUTHEAST UNIVERSITY, CHINA

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:ZHANG, JIAN;HE, ZHILI;JIANG, SHANG;REEL/FRAME:059891/0083

Effective date: 20220414

FEPP Fee payment procedure

Free format text: ENTITY STATUS SET TO SMALL (ORIGINAL EVENT CODE: SMAL); ENTITY STATUS OF PATENT OWNER: SMALL ENTITY

STPP Information on status: patent application and granting procedure in general

Free format text: DOCKETED NEW CASE - READY FOR EXAMINATION

STPP Information on status: patent application and granting procedure in general

Free format text: NON FINAL ACTION MAILED

STPP Information on status: patent application and granting procedure in general

Free format text: NOTICE OF ALLOWANCE MAILED -- APPLICATION RECEIVED IN OFFICE OF PUBLICATIONS

STPP Information on status: patent application and granting procedure in general

Free format text: PUBLICATIONS -- ISSUE FEE PAYMENT RECEIVED

STPP Information on status: patent application and granting procedure in general

Free format text: PUBLICATIONS -- ISSUE FEE PAYMENT VERIFIED

STCF Information on status: patent grant

Free format text: PATENTED CASE