CN116091426A - Pavement crack detection method based on coder-decoder - Google Patents
- Publication number
- CN116091426A (application CN202211700351.4A)
- Authority
- CN
- China
- Prior art keywords
- module
- network
- decoder
- output
- encoder
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Classifications
- G06T7/0002, G06T7/0004 — Image analysis; inspection of images, e.g. flaw detection; industrial image inspection
- G06N3/02, G06N3/04 — Neural networks; architecture, e.g. interconnection topology
- G06N3/08 — Neural networks; learning methods
- G06T9/002 — Image coding using neural networks
- G06V10/761 — Proximity, similarity or dissimilarity measures
- G06V10/774 — Generating sets of training patterns; bootstrap methods
- G06V10/80 — Fusion of data at the sensor, preprocessing, feature-extraction or classification level
- G06V10/82 — Image or video recognition using neural networks
- G06T2207/20081 — Training; learning
- G06T2207/20084 — Artificial neural networks [ANN]
- G06T2207/30184 — Infrastructure
- Y02T10/40 — Engine management systems
Abstract
The invention discloses a pavement crack detection method based on an encoder-decoder, which effectively detects road surface cracks by constructing a new end-to-end encoder-decoder residual network. By combining a deep supervision mechanism with a fusion loss, the model can capture fine details, making the network easier to optimize. The model directly outputs a high-quality saliency map that is close to the corresponding ground truth. The resulting saliency map uniformly highlights clearly defined defective objects while effectively filtering out background noise. The model is robust, needs no post-processing, and runs at real-time speed on a single GPU.
Description
Technical Field
The invention belongs to the field of significance detection, and particularly relates to a pavement crack detection method based on a coder-decoder.
Background
China's highway network is the largest in the world. Over the past few decades it has developed rapidly, and the road network now covers all regions of the country, making travel convenient. With the rapid growth of China's economy, the highway network has been further expanded and improved: a large number of highways have been built, pavement quality and roadside facilities have improved, and supervision of highway construction has been strengthened to ensure construction quality and safety.
Road surface cracks are a common pavement defect, usually caused by aging of road materials, climate change, vehicle overloading, construction quality problems, and the like. Cracks not only spoil the visual appearance of the road but can also compromise its safety.
On the one hand, cracks make the road surface uneven, making vehicles harder to control and increasing the risk of traffic accidents. In humid climates in particular, water pooling in cracks can cause vehicles to skid, posing a greater threat to driving safety.
On the other hand, cracks accelerate the aging and deterioration of road materials, damaging the road structure and requiring repair or reconstruction. This increases the cost of maintaining and building roads and shortens their service life.
Timely detection and repair of road cracks is therefore essential for maintaining road safety and quality: once cracks are found, they should be repaired promptly to prevent greater damage.
Common road crack detection methods include the following:
Manual detection: personnel inspect the site and record pavement cracks by hand, judging the type, position, length and other properties of each crack. This method is simple to apply but inefficient, and it is difficult to detect all cracks on a road comprehensively.
Camera-based detection: the road surface is photographed by a camera mounted on a vehicle or an unmanned aerial vehicle, and cracks are identified with a computer-aided analysis system. This can detect pavement cracks quickly but requires some equipment investment and technical expertise.
Laser scanning: the road surface is scanned by a vehicle-mounted laser scanner, and cracks are identified with a computer-aided analysis system. This detects cracks quickly and accurately but likewise requires equipment investment and technical expertise.
Machine learning: a large amount of pavement crack image data is collected and used to train a model, enabling automatic crack identification. This detects cracks quickly and accurately and needs no special equipment; an ordinary camera or mobile phone suffices for image capture.
Compared with traditional manual inspection, automatic surface defect detection offers high precision and high efficiency and is an effective way to reduce labor costs. With the rapid growth in road mileage, automatic detection of pavement cracks plays a vital role in intelligent traffic systems. Pavement crack detection systems typically involve removing non-crack images and quantitatively detecting cracks.
Greater network depth does not necessarily yield better detection accuracy. Experiments show that choosing a network architecture of appropriate depth can preserve detection accuracy while improving detection speed. Although great progress has been made in DCNN-based crack detection, how to obtain more detailed crack characteristics remains to be explored. Road crack detection faces difficulties such as fine cracks, heavy image noise, unclear boundaries and incomplete region information.
Disclosure of Invention
Aiming at the defects existing in the prior art, the invention provides a pavement crack detection method based on a coder-decoder.
The invention proposes a new end-to-end encoder-decoder residual network for saliency detection of defective objects. A multi-level channel weighted fusion module and a residual optimization module, used alternately, progressively recover predicted spatial saliency from the encoded multi-level semantic features, promoting the detection of complete defect objects while suppressing the non-salient background. Compared with existing saliency detection methods, the method accurately segments complete defect objects with well-defined boundaries and effectively filters out irrelevant background noise.
A pavement crack detection method based on a coder-decoder comprises the following steps:
step (1), acquiring a data set;
the data set adopts the public data set Crack500;
step (2), constructing a new end-to-end encoder-decoder residual network;
the new end-to-end encoder-decoder residual network comprises an encoder network and a decoder network;
the encoder network includes an input layer, four residual blocks of ResNet-34, and a bridge module.
In the decoder network, multi-level channel weighted fusion modules (MCW) and residual optimization modules (ORM) are used alternately to progressively recover the saliency information encoded in the preceding multi-scale features. The output of the bridge module passes through an MCW and then an ORM, and this alternation repeats; in total, 5 MCWs and 4 ORMs form the decoder module.
And 3, training the constructed end-to-end encoder-decoder residual network through the data set in the step 1.
And 4, finishing pavement crack detection through the trained end-to-end encoder-decoder residual error network.
Further, for the encoder network, ResNet-34 is selected as the backbone. The entire encoder network contains one input layer, the four residual blocks of ResNet-34, and one bridge module. The input layer has 64 channels with a kernel size of 3×3 and a stride of 1, and a max-pooling operation is added at the tail of the input layer to further enlarge the receptive field. The convolutional output of the input layer is fed to a batch-normalization layer to balance the scale of the features, followed by a ReLU activation function to enhance the nonlinear representation capability.
Formally, given one input image, multi-scale features are extracted at 5 levels. Each ResNet-34 residual block is embedded with a channel attention module and a spatial attention module.
An additional bridge module is designed at the end of the encoder network to further capture global context-aware information, which helps to accurately locate the region of the defective object. The bridge module comprises three 512-channel 3×3 convolution blocks with dilation rates of {1, 2, 4}; the outputs of the three blocks are concatenated and fed to a further 3×3 convolution block, whose output is the output of the bridge module. Notably, each convolution layer is followed by batch normalization and a ReLU activation function.
Further, the inputs of the MCW module are the residual-block output EN of the same level in the encoder and the output features DE of the previous decoder stage. In particular, the EN of the bottommost MCW is the output of the bridge module. The received EN and DE are first concatenated along the channel dimension, and a 1×1 convolution then restores the original number of channels. The result passes through a channel attention module and is finally added to the input DE to obtain the output OUT1 of the MCW module.
Further, the input of the ORM module is the MCW output OUT1 of the current level. A 3×3 convolution, a channel-shuffle operation, another 3×3 convolution and a 1×1 convolution are applied in turn, and the result is added to the input OUT1 to obtain the output OUT2, i.e. the output feature DE of this decoder stage. Notably, each 3×3 convolution is followed by BN and ReLU operations.
Further, the specific method in the step 3 is as follows:
The constructed end-to-end encoder-decoder residual network is trained with the data set from step 1. A deep supervision mechanism is adopted during training: the output of each stage of the decoder network, i.e. the OUT2 of each level, undergoes a 3×3 convolution that reduces the number of channels to 1, and the result is the side output of that level. Bilinear upsampling then matches the resolution of the input image, and a sigmoid activation function maps the predicted values to [0, 1]. The output of each level is supervised, and the topmost side-output saliency map is taken as the final output of the invention.
A fusion loss is constructed to supervise the training of the network so that it learns more detailed information for boundary localization and structure capture.
The fusion loss is an organic combination of three losses, BCE, IoU and SSIM:
L_fuse = α·ℓ_BCE + β·ℓ_IoU + γ·ℓ_SSIM
where ℓ_BCE, ℓ_IoU and ℓ_SSIM denote the BCE loss, IoU loss and SSIM loss, respectively. The weights α, β and γ change with the training stage: in the early stage of training, the BCE weight is amplified to speed convergence, while in the middle stage the weights of the IoU and SSIM losses are gradually increased to accelerate model refinement.
Further, the first 10 epochs are defined as the early stage, with α = 2, β = 0.5, γ = 0.5; later epochs are the middle stage, with α = 1, β = γ = 0.5 + 0.1·(epoch − 10).
Further, the BCE loss is defined as:
ℓ_BCE = −Σ_(x,y) [ G(x,y) log S(x,y) + (1 − G(x,y)) log(1 − S(x,y)) ]
where G is the ground truth and S is the predicted saliency map.
Further, the IoU loss is used to evaluate the similarity of G and S and can be defined as:
ℓ_IoU = 1 − ( Σ_(x,y) S(x,y)·G(x,y) ) / ( Σ_(x,y) [ S(x,y) + G(x,y) − S(x,y)·G(x,y) ] )
further, the structural information is captured by SSIM loss. In particular, the method comprises the steps of, andtwo blocks (size=n×n) cut out from the saliency map S and ground truth G, respectively. SSIM is defined as:
the invention has the following beneficial effects:
the invention is a new end-to-end encoder decoder residual error network, which can effectively detect cracks of the occipital layer. By combining the deep supervision mechanism and fusion loss, the model can capture fine details, so that the network is more easily optimized. The model can directly output a high-quality saliency map, which is almost close to a corresponding ground reality value. The resulting saliency map uniformly highlights clearly defined defective objects while effectively filtering out background noise. Our model is robust and does not require any post-processing, and is faster in real-time on a single GPU.
Drawings
FIG. 1 is a schematic diagram of an end-to-end encoder-decoder residual network architecture according to an embodiment of the present invention;
fig. 2 is a schematic diagram of a ResNet-34 residual block structure according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a bridge module according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a multi-level channel weighted fusion Module (MCW) according to an embodiment of the present invention;
fig. 5 is a schematic diagram of a residual optimization module (ORM) according to an embodiment of the present invention.
Detailed Description
The present invention will be further described with reference to the drawings and examples.
A pavement crack detection method based on a coder-decoder comprises the following steps:
step (1), acquiring a data set;
the data set adopts the public data set Crack500;
step (2), constructing a new end-to-end encoder-decoder residual network;
as shown in fig. 1, the new end-to-end encoder-decoder residual network includes an encoder network and a decoder network;
for the encoder network, resNet-34 is selected as the backbone network for the encoder portion. On the one hand, by using layer-jump connections (i.e. shortcut identity mapping) on a simple network (simply stacking convolutional layers) employed by VGG networks, the residual learning framework is easier to optimize. Residual structures, on the other hand, are easy to implement for deeper networks, and still have a low complexity. In this way, the model can obtain accuracy benefits from significantly increased depth, i.e., more context information is covered, due to the expansion of the acceptance field. The entire encoder network contains one input layer, four residual blocks of ResNet-34, and one bridge module. Furthermore, unlike the original ResNet-34, the input layer of the present invention has 64 channels, a kernel size of 3×3 and stride of 1, instead of a kernel size of 7×7 and stride of 2; the maximum pooling operation is then added at the end of the input layer to further expand the size of the acceptance field. The present invention performs such operations, encoding better spatial details, respectively, and capturing better filter responses before and after pooling operations. The convolved output of the input layer is input to a batch normalization layer to balance the scale of the features, followed by a ReLU activation function to enhance the nonlinear representation capability.
Formally, given one input image, multi-scale features are extracted at 5 levels. Attention mechanisms are known to learn accurate, compact features and are widely used in a variety of computer vision tasks owing to their effectiveness and efficiency. Each ResNet-34 residual block is embedded with a channel attention module and a spatial attention module (as shown in fig. 2).
Salient target detection requires complete segmentation of uniform regions, which is harder than edge detection, where simple gradient information suffices. For this purpose, the invention designs an additional bridge module at the end of the encoder network to further capture global context-aware information, which helps to accurately locate the region of the defective object. As shown in fig. 3, the bridge module comprises three 512-channel 3×3 convolution blocks with dilation rates of {1, 2, 4}; the outputs of the three blocks are concatenated and fed to a further 3×3 convolution block, whose output is the output of the bridge module. Notably, each convolution layer is followed by batch normalization and a ReLU activation function.
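A hedged PyTorch sketch of such a bridge module follows; the fused output width of 512 channels and all names are assumptions for illustration. Three parallel 512-channel 3×3 blocks with dilation rates 1, 2 and 4 run side by side, their outputs are concatenated and fused by one more 3×3 block, and every convolution is followed by BN and ReLU.

```python
import torch
import torch.nn as nn

def conv_bn_relu(in_ch, out_ch, dilation=1):
    # 3x3 convolution; padding equals dilation so spatial size is preserved
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=dilation, dilation=dilation),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True))

class Bridge(nn.Module):
    def __init__(self, ch=512):
        super().__init__()
        # three parallel 512-channel 3x3 blocks with dilation rates {1, 2, 4}
        self.branches = nn.ModuleList(conv_bn_relu(ch, ch, d) for d in (1, 2, 4))
        # concatenated branch outputs pass through one more 3x3 block
        self.fuse = conv_bn_relu(3 * ch, ch)

    def forward(self, x):
        return self.fuse(torch.cat([b(x) for b in self.branches], dim=1))

out = Bridge()(torch.randn(1, 512, 7, 7))
```

Because padding matches each dilation rate, all three branches keep the spatial size of the input, so they can be concatenated directly.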
In the decoder network, multi-level channel weighted fusion modules (MCW) and residual optimization modules (ORM) are used alternately to progressively recover the saliency information encoded in the preceding multi-scale features. The output of the bridge module passes through an MCW and then an ORM, and this alternation repeats; in total, 5 MCWs and 4 ORMs form the decoder module. Feature maps produced directly by the encoder network tend to focus on insignificant background areas, mainly because global context information is not fully considered, leading to incorrect saliency predictions. To solve this problem, the invention designs the multi-level channel weighted fusion module (MCW) to capture more effective feature areas and filter out background noise interference. After processing by the proposed MCW, the model focuses more on the region of the defect object and its edges.
As shown in fig. 4, the inputs of the MCW module are the residual-block output EN of the same level in the encoder and the output features DE of the previous decoder stage. In particular, the EN of the bottommost MCW is the output of the bridge module. The received EN and DE are first concatenated along the channel dimension, and a 1×1 convolution then restores the original number of channels. The result passes through a channel attention module and is finally added to the input DE to obtain the output OUT1 of the MCW module.
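The MCW data flow can be sketched in PyTorch as follows. The patent does not spell out the internals of the channel attention module, so a squeeze-and-excitation style gate is assumed here; channel counts and names are illustrative.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    # assumed squeeze-and-excitation form: global pool -> bottleneck -> sigmoid gate
    def __init__(self, ch, r=16):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(ch, ch // r, 1), nn.ReLU(inplace=True),
            nn.Conv2d(ch // r, ch, 1), nn.Sigmoid())

    def forward(self, x):
        return x * self.gate(x)

class MCW(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.reduce = nn.Conv2d(2 * ch, ch, 1)  # 1x1 conv restores channel count
        self.ca = ChannelAttention(ch)

    def forward(self, en, de):
        x = torch.cat([en, de], dim=1)  # concatenate encoder and decoder features
        x = self.ca(self.reduce(x))     # channel attention after reduction
        return x + de                   # residual add with the decoder input DE

en = torch.randn(1, 64, 56, 56)
de = torch.randn(1, 64, 56, 56)
out1 = MCW(64)(en, de)
```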
As shown in fig. 5, the input of the ORM module is the MCW output OUT1 of the current level. A 3×3 convolution, a channel-shuffle operation, another 3×3 convolution and a 1×1 convolution are applied in turn, and the result is added to the input OUT1 to obtain the output OUT2, i.e. the output feature DE of this decoder stage. Notably, each 3×3 convolution is followed by BN and ReLU operations.
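The ORM data flow can likewise be sketched in PyTorch; the number of shuffle groups is an assumption, since the patent only names the operation.

```python
import torch
import torch.nn as nn

def channel_shuffle(x, groups):
    # interleave channels across groups, as in ShuffleNet
    n, c, h, w = x.shape
    return x.view(n, groups, c // groups, h, w).transpose(1, 2).reshape(n, c, h, w)

class ORM(nn.Module):
    def __init__(self, ch, groups=4):
        super().__init__()
        self.groups = groups
        # each 3x3 convolution is followed by BN and ReLU, as in the text
        self.conv1 = nn.Sequential(nn.Conv2d(ch, ch, 3, padding=1),
                                   nn.BatchNorm2d(ch), nn.ReLU(inplace=True))
        self.conv2 = nn.Sequential(nn.Conv2d(ch, ch, 3, padding=1),
                                   nn.BatchNorm2d(ch), nn.ReLU(inplace=True))
        self.conv3 = nn.Conv2d(ch, ch, 1)

    def forward(self, out1):
        x = self.conv1(out1)
        x = channel_shuffle(x, self.groups)
        x = self.conv3(self.conv2(x))
        return x + out1  # OUT2 = refinement(OUT1) + OUT1

out2 = ORM(64)(torch.randn(1, 64, 28, 28))
```

The final residual addition keeps the module a pure refinement step: if the convolutions learn nothing, OUT2 falls back to OUT1.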
And 3, training the constructed end-to-end encoder-decoder residual network through the data set in the step 1.
The constructed end-to-end encoder-decoder residual network is trained with the data set from step 1. A deep supervision mechanism is adopted during training: the output of each stage of the decoder network, i.e. the OUT2 of each level, undergoes a 3×3 convolution that reduces the number of channels to 1, and the result is the side output of that level. Bilinear upsampling then matches the resolution of the input image, and a sigmoid activation function maps the predicted values to [0, 1]. The output of each level is supervised, and the topmost side-output saliency map is taken as the final output of the invention.
Since salient object detection is in essence a dense binary classification problem, its output represents the probability that each pixel belongs to a foreground object. Previous approaches therefore typically use cross entropy (commonly applied to classification tasks) as the training loss. However, this simple strategy struggles to direct the network to capture the global structural information of salient targets, resulting in ambiguous boundaries or incomplete detections. To overcome this, the invention constructs a fusion loss to supervise the training process of the network so that it learns more detailed information for boundary localization and structure capture.
The fusion loss is an organic combination of three losses, BCE, IoU and SSIM:
L_fuse = α·ℓ_BCE + β·ℓ_IoU + γ·ℓ_SSIM
where ℓ_BCE, ℓ_IoU and ℓ_SSIM denote the BCE loss, IoU loss and SSIM loss, respectively. The weights α, β and γ vary with the training stage: in the early stage of training, the BCE weight is amplified to speed convergence, while in the middle stage the weights of the IoU and SSIM losses are gradually increased to accelerate model refinement.
Specifically, the first 10 epochs of training are defined as the early stage, with α = 2, β = 0.5, γ = 0.5; later epochs are the middle stage, with α = 1, β = γ = 0.5 + 0.1·(epoch − 10).
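The two-stage weight schedule just described can be written as a small helper (a sketch; the function name is illustrative):

```python
def loss_weights(epoch):
    """Return (alpha, beta, gamma) for a 1-indexed training epoch."""
    if epoch <= 10:               # early stage: amplify BCE to speed convergence
        return 2.0, 0.5, 0.5
    w = 0.5 + 0.1 * (epoch - 10)  # middle stage: IoU/SSIM weights grow per epoch
    return 1.0, w, w

for e in (1, 10, 11, 15):
    print(e, loss_weights(e))
```

By epoch 15 the IoU and SSIM weights have grown to 1.0, matching the BCE weight, so the structural terms dominate refinement from then on.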
More specifically, BCE loss is widely used for binary classification tasks and is defined as:
ℓ_BCE = −Σ_(x,y) [ G(x,y) log S(x,y) + (1 − G(x,y)) log(1 − S(x,y)) ]
where G is the ground truth and S is the predicted saliency map.
IoU loss is used to evaluate the similarity of G and S and can be defined as:
ℓ_IoU = 1 − ( Σ_(x,y) S(x,y)·G(x,y) ) / ( Σ_(x,y) [ S(x,y) + G(x,y) − S(x,y)·G(x,y) ] )
SSIM loss was originally proposed in image quality assessment work and captures structural information. Specifically, P_S and P_G are two patches (size N×N) cropped from the saliency map S and the ground truth G, respectively. SSIM is defined as:
SSIM = ( (2 μ_S μ_G + C1)(2 σ_SG + C2) ) / ( (μ_S² + μ_G² + C1)(σ_S² + σ_G² + C2) )
where μ_S and μ_G denote the means of P_S and P_G, σ_S² and σ_G² their variances, and σ_SG their covariance; C1 and C2 are constants that keep the denominator from being zero (the invention takes C1 = C2 = 0.0001). The SSIM loss is ℓ_SSIM = 1 − SSIM.
And 4, finishing pavement crack detection through the trained end-to-end encoder-decoder residual error network.
The foregoing further describes the invention in connection with specific/preferred embodiments, and the invention is not limited to this description. It will be apparent to those skilled in the art that several alternatives or modifications can be made to the described embodiments without departing from the spirit of the invention, and these alternatives or modifications should be considered within the scope of the invention.
Portions of the invention not described in detail are within the ordinary skill of those in the art.
Claims (9)
1. The pavement crack detection method based on the coder-decoder is characterized by comprising the following steps of:
step (1), acquiring a data set;
the data set adopts the public data set Crack500;
step (2), constructing a new end-to-end encoder-decoder residual network;
the new end-to-end encoder-decoder residual network comprises an encoder network and a decoder network;
the encoder network comprises an input layer, four residual blocks of ResNet-34 and a bridge module;
in the decoder network, multi-level channel weighted fusion modules (MCW) and residual optimization modules (ORM) are alternately used to gradually recover the saliency information encoded in the preceding multi-scale features; the output of the bridge module passes through an MCW and then an ORM, and this alternation repeats; in total, 5 MCWs and 4 ORMs form the decoder module;
step 3, training the constructed end-to-end encoder-decoder residual network through the data set in the step 1;
and 4, finishing pavement crack detection through the trained end-to-end encoder-decoder residual error network.
2. The pavement crack detection method based on the coder-decoder according to claim 1, wherein for the encoder network, ResNet-34 is selected as the backbone network; the whole encoder network comprises an input layer, the four residual blocks of ResNet-34 and a bridge module; the input layer has 64 channels with a kernel size of 3×3 and a stride of 1, and a max-pooling operation is added at the tail of the input layer to further enlarge the receptive field; the convolutional output of the input layer is fed to a batch-normalization layer to balance the scale of the features, followed by a ReLU activation function to enhance the nonlinear representation capability;
formally, given an input image, multi-scale features are extracted at 5 levels; each ResNet-34 residual block is embedded with a channel attention module and a spatial attention module;
an additional bridge module is designed at the end of the encoder network to further capture global context information, which helps accurately locate defect regions; the bridge module comprises three 3×3 convolution blocks with 512 channels and dilation rates of {1, 2, 4} respectively; the outputs of the three blocks are concatenated and fed to a further 3×3 convolution block, whose result serves as the output of the bridge module; notably, each convolution layer is followed by batch normalization and a ReLU activation function.
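The bridge module described in claim 2 can be sketched in PyTorch as follows. This is a minimal illustration, not the patented implementation: the class and helper names are invented, and the padding values (set equal to the dilation rate so that spatial size is preserved) are an assumption the claim does not state.

```python
import torch
import torch.nn as nn


class BridgeModule(nn.Module):
    """Sketch of the dilated-convolution bridge at the end of the encoder.

    Three parallel 3x3 conv branches with dilation rates 1, 2 and 4 are
    applied to the input, their outputs are concatenated along the channel
    dimension, and a final 3x3 conv fuses them back to `channels` channels.
    Per the claim, each conv layer is followed by BatchNorm and ReLU.
    """

    def __init__(self, channels: int = 512):
        super().__init__()

        def conv_bn_relu(in_c: int, out_c: int, dilation: int = 1) -> nn.Sequential:
            # padding == dilation keeps the spatial size unchanged (assumed).
            return nn.Sequential(
                nn.Conv2d(in_c, out_c, kernel_size=3,
                          padding=dilation, dilation=dilation, bias=False),
                nn.BatchNorm2d(out_c),
                nn.ReLU(inplace=True),
            )

        self.branches = nn.ModuleList(
            [conv_bn_relu(channels, channels, d) for d in (1, 2, 4)]
        )
        self.fuse = conv_bn_relu(3 * channels, channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Concatenate the three dilated branches, then fuse with a 3x3 conv.
        return self.fuse(torch.cat([b(x) for b in self.branches], dim=1))
```

With a 512-channel encoder output the module maps `(N, 512, H, W)` to `(N, 512, H, W)`, so it can sit between the last ResNet-34 block and the first decoder stage without reshaping.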
3. The encoder-decoder-based pavement crack detection method according to claim 2, wherein the inputs of the MCW module are the residual block output EN of the same level in the encoder and the output feature DE of the previous-level decoder stage; in particular, the EN of the bottommost MCW module is the output of the bridge module; the received EN and DE are first concatenated along the channel dimension, and a 1×1 convolution then restores the original channel count; the result passes through a channel attention module and is finally added to the input DE to obtain the output OUT1 of the MCW module.
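A possible PyTorch sketch of the MCW fusion step in claim 3 is given below. The claim names "a channel attention module" without specifying its internals, so an SE-style squeeze-and-excitation block is assumed here; the class names and the reduction ratio are likewise illustrative.

```python
import torch
import torch.nn as nn


class ChannelAttention(nn.Module):
    """SE-style channel attention (an assumed form; the claim only says
    'channel attention module' and does not define its internals)."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        hidden = max(channels // reduction, 1)
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Sequential(
            nn.Conv2d(channels, hidden, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Reweight each channel by a learned [0, 1] factor.
        return x * self.fc(self.pool(x))


class MCWModule(nn.Module):
    """Multi-level channel weighted fusion: concatenate the encoder feature
    EN with the decoder feature DE, reduce back to the original channel
    count with a 1x1 conv, apply channel attention, then add DE residually
    to obtain OUT1 (following claim 3)."""

    def __init__(self, channels: int):
        super().__init__()
        self.reduce = nn.Conv2d(2 * channels, channels, kernel_size=1)
        self.ca = ChannelAttention(channels)

    def forward(self, en: torch.Tensor, de: torch.Tensor) -> torch.Tensor:
        x = self.reduce(torch.cat([en, de], dim=1))
        return self.ca(x) + de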
4. The encoder-decoder-based pavement crack detection method according to claim 3, wherein the input of the ORM module is the output OUT1 of the MCW module at the current level; a 3×3 convolution is applied first, followed by a channel shuffle operation, then a further 3×3 convolution and a 1×1 convolution; the result is finally added to the input OUT1 to obtain the output OUT2, namely the output feature DE of the decoder stage; BN and ReLU operations are performed after each 3×3 convolution.
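The ORM pipeline of claim 4 (3×3 conv, channel shuffle, 3×3 conv, 1×1 conv, residual add) can be sketched as below. The shuffle is implemented in the ShuffleNet style, and the group count is an assumed parameter that the claim does not specify.

```python
import torch
import torch.nn as nn


def channel_shuffle(x: torch.Tensor, groups: int) -> torch.Tensor:
    """ShuffleNet-style channel shuffle: split channels into `groups`,
    then interleave them (the group count is an assumption here)."""
    n, c, h, w = x.shape
    x = x.view(n, groups, c // groups, h, w)
    return x.transpose(1, 2).reshape(n, c, h, w)


class ORMModule(nn.Module):
    """Residual optimization module per claim 4: 3x3 conv -> channel
    shuffle -> 3x3 conv -> 1x1 conv, then a residual add with the input
    OUT1; BN and ReLU follow each 3x3 conv."""

    def __init__(self, channels: int, groups: int = 4):
        super().__init__()
        self.groups = groups
        self.conv1 = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )
        self.conv2 = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )
        self.conv3 = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, out1: torch.Tensor) -> torch.Tensor:
        x = self.conv1(out1)
        x = channel_shuffle(x, self.groups)
        x = self.conv3(self.conv2(x))
        return x + out1  # OUT2, the decoder feature DE of this level
```

The shuffle lets the 1×1/3×3 convolutions mix information across channel groups at negligible cost, which is why it sits between the two 3×3 convs.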
5. The encoder-decoder-based pavement crack detection method according to claim 4, wherein the specific procedure of step (3) is as follows:
training the constructed end-to-end encoder-decoder residual network on the data set from step (1); a deep supervision mechanism is adopted during training: a 3×3 convolution is applied to the output of each stage of the decoder network, namely the OUT2 of each level, reducing the number of channels to 1, and the result serves as the side output of that level; bilinear upsampling then restores the resolution to match the input image, and a sigmoid activation function maps the predicted values to [0, 1]; the output of every level is supervised, and the top-level side-output saliency map is taken as the final output of the invention;
a fusion loss is constructed to supervise the training process of the network so that it learns more detailed information when capturing boundary positions and structures;
the fusion loss is an organic fusion of three losses, BCE, IoU and SSIM, i.e. L = α·L_BCE + β·L_IoU + γ·L_SSIM;
wherein L_BCE, L_IoU and L_SSIM denote the BCE loss, the IoU loss and the SSIM loss, respectively; the weights α, β and γ change with the training stage: in the early stage of training the BCE weight is amplified to speed up convergence, and in the middle stage the weights of the IoU and SSIM losses are gradually increased to accelerate model refinement.
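The deep-supervision side-output head of claim 5 (3×3 convolution to one channel, bilinear upsampling to the input resolution, sigmoid) can be sketched as follows; the class name, padding, and `align_corners` setting are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SideOutput(nn.Module):
    """Deep-supervision side-output head (a sketch): a 3x3 conv reduces a
    decoder feature OUT2 to a single channel, bilinear upsampling restores
    the input resolution, and a sigmoid maps predictions to [0, 1]."""

    def __init__(self, in_channels: int):
        super().__init__()
        self.score = nn.Conv2d(in_channels, 1, kernel_size=3, padding=1)

    def forward(self, out2: torch.Tensor, input_size: tuple) -> torch.Tensor:
        s = self.score(out2)                      # (N, 1, h, w)
        s = F.interpolate(s, size=input_size,     # match input resolution
                          mode="bilinear", align_corners=False)
        return torch.sigmoid(s)                   # saliency map in [0, 1]
```

One such head per decoder level yields the per-level side outputs that are each supervised against the ground truth, with the top-level map serving as the final prediction.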
6. The encoder-decoder-based pavement crack detection method according to claim 5, wherein the first 10 epochs are defined as the early stage, with α=2, β=0.5, γ=0.5; the epochs thereafter are defined as the middle stage, with α=1, β=0.5+0.1×(epoch−10), γ=0.5+0.1×(epoch−10).
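The stage-dependent weight schedule of claim 6 can be written directly as a small helper; the function name and the 1-based epoch convention are assumptions.

```python
def loss_weights(epoch: int) -> tuple:
    """Fusion-loss weights (alpha, beta, gamma) per claim 6.

    Epochs are counted from 1 (assumed). The first 10 epochs form the
    early stage with alpha=2, beta=gamma=0.5; afterwards alpha drops to 1
    and beta, gamma grow by 0.1 for every epoch past epoch 10.
    """
    if epoch <= 10:
        return 2.0, 0.5, 0.5
    extra = 0.1 * (epoch - 10)
    return 1.0, 0.5 + extra, 0.5 + extra
```

The returned triple would weight L_BCE, L_IoU and L_SSIM respectively, so early training is dominated by BCE for fast convergence while IoU and SSIM progressively take over to refine boundaries and structure.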
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211700351.4A CN116091426A (en) | 2022-12-28 | 2022-12-28 | Pavement crack detection method based on coder-decoder |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116091426A true CN116091426A (en) | 2023-05-09 |
Family
ID=86213136
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211700351.4A Withdrawn CN116091426A (en) | 2022-12-28 | 2022-12-28 | Pavement crack detection method based on coder-decoder |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116091426A (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116596930A (en) * | 2023-07-18 | 2023-08-15 | 吉林大学 | Semi-supervised multitasking real image crack detection system and method |
CN116596930B (en) * | 2023-07-18 | 2023-09-22 | 吉林大学 | Semi-supervised multitasking real image crack detection system and method |
CN117173182A (en) * | 2023-11-03 | 2023-12-05 | 厦门微亚智能科技股份有限公司 | Defect detection method, system, equipment and medium based on coding and decoding network |
CN117173182B (en) * | 2023-11-03 | 2024-03-19 | 厦门微亚智能科技股份有限公司 | Defect detection method, system, equipment and medium based on coding and decoding network |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
WW01 | Invention patent application withdrawn after publication | ||
Application publication date: 20230509 |