CN112634292A - Asphalt pavement crack image segmentation method based on deep convolutional neural network - Google Patents

Asphalt pavement crack image segmentation method based on deep convolutional neural network Download PDF

Info

Publication number
CN112634292A
CN112634292A (application CN202110012193.2A)
Authority
CN
China
Prior art keywords
layer
crack
output
image
size
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110012193.2A
Other languages
Chinese (zh)
Other versions
CN112634292B (en)
Inventor
万海峰
李娜
孙启润
黄磊
苑兆迪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Yantai University
Original Assignee
Yantai University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Yantai University filed Critical Yantai University
Priority to CN202110012193.2A priority Critical patent/CN112634292B/en
Publication of CN112634292A publication Critical patent/CN112634292A/en
Application granted granted Critical
Publication of CN112634292B publication Critical patent/CN112634292B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • G06T7/11Region-based segmentation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)

Abstract

The invention discloses a method for segmenting asphalt pavement crack images based on a deep convolutional neural network, comprising the following steps: preparing a crack picture data set and preprocessing the pictures; determining the structure of the CrackResAttentionNet model, the loss function, and the optimizer; initializing the weight matrix with a normal distribution; reaching the predicted value of the output layer through forward propagation and correcting and updating the parameter gradients through backward propagation, thereby updating the weight matrix; and finally loading the trained CrackResAttentionNet model to predict the segmented asphalt pavement image and accurately output it. The invention fuses the outputs of the two added attention modules in proportion, placing more emphasis on position information; the output of each encoding layer is fused with the attention output and connected to the corresponding decoding layer, and the output of the previous decoding layer serves as the input to the next decoding layer. The decoding layers and their up-sampling operations can therefore make full use of spatial information and improve the segmentation precision of the image.

Description

Asphalt pavement crack image segmentation method based on deep convolutional neural network
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a method for segmenting an asphalt pavement crack image based on a deep convolutional neural network.
Background
Cracks are the initial manifestation of early damage and potential degradation of asphalt pavement, and their adverse effects on the performance and function of road engineering have become more and more pronounced with the dramatic increase in traffic volume and traffic load levels. Detecting and identifying cracks promptly and accurately when they first appear, and accurately evaluating their scale, can guide road engineering management and maintenance agencies to adopt scientific preventive-maintenance schemes for the pavement and prevent irreversible large-scale structural damage and a shortened service life. Manual inspection of asphalt pavement cracks requires a great deal of time and labor cost and is not accurate enough; crack detection from camera-captured images is markedly more precise, with better consistency and objectivity. However, pixel-level segmentation of cracks is still not accurate enough under the influence of shadows, uneven illumination, or irregular crack shapes. Segmenting asphalt pavement crack images with a deep convolutional neural network is therefore of great significance for the accurate, automatic, and intelligent detection of asphalt pavement cracks.
By integrating the deep learning and image processing technology of a deep convolutional neural network architecture with an attention mechanism, and embedding the resulting image segmentation method and equipment into pavement inspection equipment, asphalt pavement cracks can be identified accurately and intelligently; with the aid of a cloud platform, crack data can be acquired dynamically, in real time, and around the clock, improving the level and efficiency of intelligent detection, management, and maintenance decision-making for asphalt pavement cracks.
Disclosure of Invention
The invention aims to overcome the defects of the prior art by providing an image segmentation method that is simple in structure, easy to implement, fast to converge, and accurate, and that can identify asphalt pavement cracks accurately and efficiently, thereby providing a scientific method for intelligent nondestructive detection of asphalt pavement and the formulation of maintenance decision schemes.
A method for segmenting an asphalt pavement crack image based on a deep convolutional neural network comprises the following steps:
the method comprises the following steps: preparing a picture data set of the asphalt pavement cracks;
step two: preprocessing a picture; the pre-processing of the picture includes scaling a large image to a uniform size;
step three: setting the model structure of CrackResAttentionNet; the CrackResAttentionNet model adopts an encoder-decoder structure, comprising an encoder and a decoder, with an attention module added between each encoder and decoder, positioned behind the encoder and connected with the corresponding decoder;
step four: determining a loss function; the comparison was made using pixel cross entropy loss (CE), balanced pixel cross entropy loss (BCE), and Dice loss:
step 401: the pixel cross entropy loss CE is shown in equation (6) below:
$$CE = -\frac{1}{n\times n}\sum_{i=1}^{n\times n}\left[p_i\log\hat{p}_i + (1-p_i)\log(1-\hat{p}_i)\right] \tag{6}$$

where i is the pixel index, n × n is the size of the output image, $p_i$ is the true value of the sample (1 for the positive class, 0 for the negative class), and $\hat{p}_i$ is the probability that the sample is predicted as positive;
step 402: the balanced pixel cross entropy loss is similar to the pixel cross entropy loss, but assigns weights to the positive and negative samples, with the weights summing to 1, as shown in equation (7) below:

$$BCE = -\frac{1}{n\times n}\sum_{i=1}^{n\times n}\left[\beta\,p_i\log\hat{p}_i + (1-\beta)(1-p_i)\log(1-\hat{p}_i)\right] \tag{7}$$

where BCE is the balanced pixel cross entropy loss, n × n is the size of the output image, β is the balance coefficient, $p_i$ is the true value of the sample (1 for the positive class, 0 for the negative class), and $\hat{p}_i$ is the probability that the sample is predicted as positive;
step 403: the Dice loss is designed from the perspective of the cross-over ratio IoU, and is shown in equation (8):
$$Dice = 1 - \frac{2\,TP}{2\,TP + FP + FN} \tag{8}$$

in equation (8), TP is the number of pixel true positives, FP the number of pixel false positives, and FN the number of pixel false negatives;
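As a concrete illustration, the three losses can be written in PyTorch roughly as follows. This is a minimal sketch, not the patent's code: the function names, the value of β, and the averaging over the n × n output are assumptions.

```python
import torch

def ce_loss(p_hat, p, eps=1e-7):
    # Pixel cross entropy, eq. (6): mean over all n*n output pixels.
    p_hat = p_hat.clamp(eps, 1 - eps)
    return -(p * torch.log(p_hat) + (1 - p) * torch.log(1 - p_hat)).mean()

def bce_loss(p_hat, p, beta=0.9, eps=1e-7):
    # Balanced pixel cross entropy, eq. (7): the positive/negative terms
    # are weighted by beta and (1 - beta), which sum to 1.
    p_hat = p_hat.clamp(eps, 1 - eps)
    return -(beta * p * torch.log(p_hat)
             + (1 - beta) * (1 - p) * torch.log(1 - p_hat)).mean()

def dice_loss(p_hat, p, eps=1e-7):
    # Dice loss, eq. (8), in soft-counting form: 1 - 2TP/(2TP + FP + FN).
    tp = (p_hat * p).sum()
    fp = (p_hat * (1 - p)).sum()
    fn = ((1 - p_hat) * p).sum()
    return 1 - (2 * tp + eps) / (2 * tp + fp + fn + eps)
```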
step five: determining an optimizer, and adopting an Adam optimizer;
step six: initializing a weight matrix; for the ResNet34 pre-training model part, using the weight of the pre-training model, and for other layers except ResNet34, including an input layer, an output layer, a coding layer 5, a decoding layer 1 to a decoding layer 5, initializing a weight matrix by using normal distribution;
step seven: forward propagation; the input signal obtains the output of each layer with the help of the weight matrix, and finally reaches the predicted value of the output layer;
step eight: backward propagation; after a network prediction result calculated by any group of random parameters is obtained through forward propagation, correcting and updating by utilizing the gradient of a loss function relative to each parameter;
step nine: updating the weight matrix; updating the weight matrix according to the gradient of the parameters obtained by back propagation;
step ten: if the maximum number of training iterations has not been reached, return to step seven and continue forward propagation; otherwise, save the best-performing CrackResAttentionNet binary model;
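Steps seven to ten together form a standard training loop; a minimal PyTorch sketch is given below, in which the model, the data loader, the loss function, and the output file name are placeholders.

```python
import torch

def train(model, loader, loss_fn, max_epochs=60, lr=0.01):
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)  # step five: Adam
    best_loss = float("inf")
    for epoch in range(max_epochs):        # step ten: stop at max epochs
        epoch_loss = 0.0
        for images, masks in loader:
            preds = model(images)          # step seven: forward propagation
            loss = loss_fn(preds, masks)
            optimizer.zero_grad()
            loss.backward()                # step eight: backward propagation
            optimizer.step()               # step nine: update weight matrix
            epoch_loss += loss.item()
        if epoch_loss < best_loss:         # keep the best-performing model
            best_loss = epoch_loss
            torch.save(model.state_dict(), "crackresattentionnet_best.pth")
```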
step eleven: inputting a crack image of the asphalt pavement to be segmented; collecting the shot asphalt pavement crack images and using the collected images as the input of a system;
step twelve: preprocessing an image; the pre-processing of the picture includes scaling a large image to a uniform size;
step thirteen: loading the trained CrackResAttentionNet, comprising the following steps:
step 1301: finding out a trained model file according to the transmitted file name;
step 1302: reading the model file to a memory;
step 1303: the prediction model predicts by using parameters in the loaded model file;
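In PyTorch terms, steps 1301 to 1303 amount to reading a saved parameter file from disk into memory and switching the model to evaluation mode; the sketch below assumes the model was saved as a state dict under a placeholder file name.

```python
import torch

def load_model(model, filename="crackresattentionnet_best.pth"):
    # Steps 1301-1302: locate the model file and read it into memory.
    state = torch.load(filename, map_location="cpu")
    # Step 1303: the prediction model uses the loaded parameters.
    model.load_state_dict(state)
    model.eval()
    return model
```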
step fourteen: segmentation and output of the crack image; an asphalt pavement image with cracks is input, and the segmented asphalt pavement image is predicted by the trained CrackResAttentionNet, with crack pixels displayed in white and the remaining background in black;
step fifteen: obtaining the trained CrackResAttentionNet model file, storing it on disk, and loading the model binary file into memory.
In step one above, either step 101 or step 102 is adopted:
step 101: directly using the annotated public fracture segmentation dataset comprising fracture images and the annotated fracture shapes and positions as a fracture picture dataset;
step 102: shooting real pavement crack pictures to form a crack picture data set; each crack photograph was manually annotated with crack shape and location by Labelme software.
The manual labeling of step 102 is realized by the following 4 sub-steps:
step 1021, starting a Labelme software window, and opening a pavement crack picture;
step 1022, drawing a polygon on the outer contour of the crack by using a mouse according to the shape of the crack, so that the polygon just covers the crack;
step 1023, naming the crack as a crack mark and saving the image file;
step 1024, Labelme will automatically generate a json file containing the position and the mark of each coordinate point of the polygon.
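For training, such a Labelme JSON annotation is typically rasterized into a binary mask; the sketch below assumes the standard Labelme file layout (a "shapes" list whose entries hold the polygon "points").

```python
import json
import numpy as np
from PIL import Image, ImageDraw

def labelme_json_to_mask(json_path, height, width):
    # Rasterize every annotated crack polygon into a binary mask:
    # crack pixels become 255 (white), background stays 0 (black).
    with open(json_path) as f:
        ann = json.load(f)
    mask = Image.new("L", (width, height), 0)
    draw = ImageDraw.Draw(mask)
    for shape in ann["shapes"]:
        points = [tuple(p) for p in shape["points"]]
        draw.polygon(points, fill=255)
    return np.array(mask)
```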
In step two, the image is scaled to a uniform size of 448 × 448 pixels; if the image is rectangular, it must first be made square.
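A sketch of this preprocessing with Pillow is shown below, center-cropping to a square before resizing to 448 × 448; the exact crop policy is an assumption.

```python
from PIL import Image

def preprocess(path, size=448):
    img = Image.open(path).convert("RGB")
    w, h = img.size
    side = min(w, h)                       # e.g. 800x600 -> 600x600 center crop
    left, top = (w - side) // 2, (h - side) // 2
    img = img.crop((left, top, left + side, top + side))
    return img.resize((size, size), Image.BILINEAR)
```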
In step three, the encoder is composed of an input layer and encoding layer-1 to encoding layer-5, where encoding layer-1 to encoding layer-4 correspond respectively to the first to fourth layers of the pre-trained ResNet34 network, i.e., ResNet34-1 to ResNet34-4; the decoder is composed of decoding layer-1 to decoding layer-5 and an output layer.
In step three, the attention modules take the outputs of encoding layer-1 to encoding layer-4 and produce the corresponding attention outputs-1 to -4 through attention calculation; the output of each attention module is added to the output of the corresponding encoding layer and the output of the previous decoding layer, and the sum is fed directly into the next decoding layer as input. Encoding layer-5 has a structure different from encoding layer-1 to encoding layer-4: it performs a stride-2 convolution with a 2 × 2 kernel and padding 0, halving the size of the output matrix relative to the input; dropout, batch normalization, and an activation function follow the convolution. The output of encoding layer-5 is fed directly into decoding layer-5. Decoding layer-5 contains convolution block-1, convolution block-2, and a deconvolution block, with convolution block-3 as the last part; convolution blocks-1 to -3 use 1 × 1 kernels with stride 1, producing outputs of the same size as the input, each followed in turn by dropout, batch normalization, and an activation function. The deconvolution block first performs deconvolution via the ConvTranspose2d function with a 2 × 2 kernel and stride 2, which doubles the input size, followed immediately by batch normalization and an activation function.
In step three, the attention module comprises a position attention module and a channel attention module. The position attention module extracts a larger range of context information from the local features. Feature maps A, B, C are generated using convolutional layers, where $\{A, B, C, D\} \in \mathbb{R}^{C\times H\times W}$; A, B, C are then reshaped to $\mathbb{R}^{C\times N}$, where $N = H \times W$ is the number of pixels. B is transposed to $\mathbb{R}^{N\times C}$, a matrix multiplication is performed between the transpose of B and C, yielding an $\mathbb{R}^{N\times N}$ matrix, and a softmax layer is applied to compute the spatial attention map $S \in \mathbb{R}^{N\times N}$, as in equation (1):

$$s_{ji} = \frac{\exp(B_i \cdot C_j)}{\sum_{i=1}^{N}\exp(B_i \cdot C_j)} \tag{1}$$

In equation (1), $s_{ji}$ measures the influence of the i-th position on the j-th position; the more similar the feature representations of two positions, the greater the correlation between them. S is then transposed, a matrix multiplication is performed between A and the transpose of S, the $\mathbb{R}^{C\times N}$ result is reshaped to $\mathbb{R}^{C\times H\times W}$, and finally it is multiplied by the scale parameter α and summed element-wise with the original convolution feature D to obtain the final output $H \in \mathbb{R}^{C\times H\times W}$, as shown in equation (2):

$$H_j = \alpha\sum_{i=1}^{N}\left(s_{ji}A_i\right) + D_j \tag{2}$$

In equation (2), α is initialized to 0 and gradually acquires more weight through learning.

The channel attention module first performs convolutions to extract feature maps E, F, G, H, with $\{E, F, G, H\} \in \mathbb{R}^{C\times H\times W}$. The matrices F and G are reshaped to $\mathbb{R}^{C\times N}$, where $N = H \times W$ is the number of pixels. F is then transposed to $\mathbb{R}^{N\times C}$, a matrix multiplication between the transpose of F and E yields an $\mathbb{R}^{N\times N}$ result matrix, and a softmax layer is applied to compute the attention map $X \in \mathbb{R}^{N\times N}$, as in equation (3):

$$x_{ji} = \frac{\exp(F_i \cdot E_j)}{\sum_{i=1}^{N}\exp(F_i \cdot E_j)} \tag{3}$$

In equation (3), $x_{ji}$ measures the influence of the i-th position on the j-th position. A matrix multiplication is then performed between the softmax result X and the reshaped G, the $\mathbb{R}^{C\times N}$ result is reshaped to $\mathbb{R}^{C\times H\times W}$, and finally it is multiplied by the scale parameter β and summed element-wise with the original convolution feature H to obtain the final output $I \in \mathbb{R}^{C\times H\times W}$, as shown in equation (4):

$$I_j = \beta\sum_{i=1}^{N}\left(x_{ji}G_i\right) + H_j \tag{4}$$

In equation (4), β is initialized to 0 and gradually acquires more weight through learning. The proportional element-wise sum of the two attention outputs is calculated as shown in equation (5):

$$O = \lambda H + (1-\lambda)\, I \tag{5}$$

where O is the fused output and $\lambda$ is a hyper-parameter; setting $\lambda = 0.8$ emphasizes the position attention for crack segmentation.
The invention adopts a deep convolutional neural network architecture incorporating an attention mechanism, with the following advantages: 1. the encoder at the core mainly uses the convolutional layers of ResNet34 to extract image features, with an additional encoding layer added behind them to extract information better; 2. the decoder uses deconvolution layers to perform semantic segmentation of crack and non-crack pixels; 3. an additional position attention module and channel attention module are connected behind each encoder to capture long-range context information; 4. the outputs of the two attention modules are fused in proportion, which places more emphasis on position information. The output of each encoding layer is fused with the attention output and connected with the corresponding decoding layer, and the output of the previous decoding layer is the input to the next decoding layer. The decoding layers and their up-sampling operations can therefore make full use of spatial information and improve prediction precision.
Drawings
FIG. 1 is an overall flow chart of the asphalt pavement crack image segmentation system of the present invention;
FIG. 2 is a CrackResattentionNet network architecture diagram of the present invention;
fig. 3 is a schematic diagram of an encoding block 5 in an embodiment of the present invention;
FIG. 4 is a block diagram of a decoding block 5 according to an embodiment of the present invention;
FIG. 5 is a block diagram of decoding blocks 1, 2, 3, 4 according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of an output block in an embodiment of the present invention;
FIG. 7 is a schematic view of an attention module in an embodiment of the invention;
FIG. 8 is a diagram of the crack predictions of the various models (BCE loss) on the public data set in an embodiment of the present invention;
FIG. 9 is a diagram of the crack predictions of the various models (BCE loss) on the Yantai data set in an embodiment of the present invention.
Detailed Description
In order to facilitate an understanding of the invention, the invention is described in more detail below with reference to the accompanying drawings and specific examples. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items.
Example one
The asphalt pavement crack image segmentation method based on the deep convolutional neural network comprises the following steps:
the method comprises the following steps: preparing a picture data set of the asphalt pavement cracks; specifically, the scheme of step 101 or step 102 is adopted:
step 101: directly using the labeled public fracture segmentation data set as a fracture picture data set;
step 102: shooting real asphalt pavement crack photos to form a crack picture data set; for example, 5000 asphalt pavement crack pictures taken by a vehicle-mounted camera on a road section in Yantai; because the shape and position of the cracks are not marked in the taken pictures, the crack shape and position are manually marked in each picture using Labelme software;
preferably, step 102 comprises the following implementation:
step 1021, starting a Labelme software window, and opening a pavement crack picture;
step 1022, drawing a polygon on the outer contour of the crack by using a mouse according to the shape of the crack, so that the polygon just covers the crack;
step 1023, naming the crack as a crack mark and saving the image file;
step 1024, Labelme will automatically generate a json file containing the position and the mark of each coordinate point of the polygon.
Step two: preprocessing the picture; the preprocessing comprises scaling images with large pixel counts to a uniform size, e.g., 448 × 448 pixels; if an image is rectangular, it must first be made square, for example by center-cropping 800 × 600 to 600 × 600;
step three: setting a model structure of CrackResAttentionNet; the CrackResAttentionNet model adopts a structure based on an encoder-decoder, and comprises an encoder and a decoder, wherein an attention module is added between each encoder and each decoder, and is positioned behind each encoder and connected with the corresponding decoder;
On this basis, as shown in fig. 2, the encoder is composed of an input layer and encoding layer-1 to encoding layer-5, where encoding layer-1 to encoding layer-4 correspond respectively to the first to fourth layers of the pre-trained ResNet34 network, i.e., ResNet34-1 to ResNet34-4. As shown in fig. 2, the decoder is composed of decoding layer-1 to decoding layer-5 and an output layer.
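Using torchvision, the four pre-trained stages ResNet34-1 to ResNet34-4 can be taken directly from the library model; the sketch below shows one way such an encoder might be assembled and is an illustration, not the patent's exact code.

```python
import torch.nn as nn
from torchvision import models

resnet = models.resnet34(pretrained=True)

# Input layer: the ResNet stem (conv + BN + ReLU + max-pool).
input_layer = nn.Sequential(resnet.conv1, resnet.bn1, resnet.relu, resnet.maxpool)

# Encoding layer-1 .. encoding layer-4 = ResNet34-1 .. ResNet34-4.
encoder_layers = [resnet.layer1, resnet.layer2, resnet.layer3, resnet.layer4]
```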
On the basis of the above, as shown in fig. 2, the attention module respectively obtains the outputs of the coding layer-1 to the coding layer-4, and obtains the corresponding attention module outputs-1 to-4 through attention calculation. The output of the attention module is added with the output of the corresponding coding layer and the output of the previous decoding layer, and the sum is directly sent to the next decoding layer as input.
Step four: determining a loss function; the invention uses the following loss function for comparison: namely pixel cross entropy loss (CE), balanced pixel cross entropy loss (BCE), and Dice loss.
Preferably, the loss function adopts balanced pixel cross entropy loss (BCE), test data in the sample is implemented, and higher prediction precision can be obtained by adopting BCE loss, so that a better segmentation effect is obtained;
step five: determining an optimizer, preferably adopting an Adam optimizer, wherein the Adam optimizer has the advantages of high efficiency, small occupied memory, suitability for large-scale data and the like;
step six: initializing the weight matrix; for the ResNet34 pre-trained model part, the weights of the pre-trained model are used, and for the other layers besides ResNet34, including the input layer, the output layer, encoding layer-5, and decoding layer-1 to decoding layer-5, the weight matrices are initialized using a normal distribution. To initialize a weight matrix from a normal distribution, a truncated normal distribution is first constructed from a general normal distribution and a truncation interval; samples are then drawn from the truncated normal distribution by inverse-distribution sampling, and the resulting truncated normal samples serve as the initial values of the weight matrix.
Preferably, the weight initialization value is obtained by sampling from a truncated normal distribution with a variance of 0.01, so that the model can be converged more quickly in the following training process.
Step seven: forward propagation, wherein the input signal obtains the output of each layer with the help of the weight matrix, and finally reaches the predicted value of the output layer;
step eight: backward propagation; after a network prediction result calculated by any group of random parameters is obtained through forward propagation, the parameters are corrected and updated by utilizing the gradient of a loss function relative to each parameter;
step nine: updating the weight matrix, and updating the weight matrix according to the gradient of the parameters obtained by back propagation to achieve the effect of reducing the loss function;
step ten: if the maximum number of training iterations has not been reached, return to step seven and continue forward propagation; otherwise, save the best-performing CrackResAttentionNet binary model.
Step eleven: inputting an asphalt pavement crack image to be segmented, collecting the shot pavement crack image and taking the collected image as the input of a system;
step twelve: preprocessing an image; the pre-processing of the picture includes scaling a large image to a uniform size;
step thirteen: loading the trained CrackResAttentionNet, comprising the following steps:
step 1301: finding out a trained model file according to the transmitted file name;
step 1302: reading the model file to a memory;
step 1303: the prediction model predicts by using parameters in the loaded model file;
preferably, reading the trained CrackResAttentionNet model from disk into memory and then using the trained parameters directly for prediction accelerates segmentation.
Fourteen steps: segmenting and outputting the asphalt pavement crack image;
step fifteen: obtaining the trained CrackResAttentionNet model file, storing it on disk, and loading the model binary file into memory.
Example two
On the basis of the above embodiment, with reference to fig. 1 to 7, it is further explained that the asphalt pavement crack image segmentation method based on the deep convolutional neural network of the present invention includes the following steps:
the method comprises the following steps: preparing a picture data set of the asphalt pavement cracks; adopting the scheme of step 101 or step 102:
step 101: the labeled public asphalt pavement crack segmentation data set is directly used; this embodiment adopts https://www.irit.fr/~Sylvie.Chambon/Crack_Detection_Database.html, where the data set comprises asphalt pavement crack images and the marked crack shapes and positions, and serves as the asphalt pavement crack image data set;
step 102: shooting real pavement crack pictures with a vehicle-mounted camera to form a crack picture data set; for example, 5000 asphalt pavement crack pictures taken on a road section in Yantai; since the taken crack pictures are not labeled with the shape and position of the cracks, each crack picture must be manually labeled with the crack shape and position using Labelme software.
Step two: preprocessing a picture; the preprocessing of the picture comprises the steps of scaling a large image to a uniform size of 448 multiplied by 448 pixels, if the image is rectangular, the image also needs to be uniformly square in size, for example, the 800 multiplied by 600 center is changed to 600 multiplied by 600;
step three: setting the model structure of CrackResAttentionNet; as shown in fig. 2, the CrackResAttentionNet model adopts an encoder-decoder structure, comprising an encoder and a decoder, with an attention module added between each encoder and decoder, positioned behind the encoder and connected with the corresponding decoder.
Specifically, as shown in fig. 2, the encoder comprises an input layer, an encoding layer-1 to an encoding layer-5, wherein the encoding layer-1 to the encoding layer-4 respectively correspond to the first layer to the fourth layer of the resenet 34 network which is pre-trained, and are respectively ResNet34-1 to ResNet 34-4.
Specifically, as shown in fig. 2, the decoder is composed of a decoding layer-1 to a decoding layer-5 and an output layer.
Specifically, as shown in fig. 2, the attention module obtains the outputs from the coding layer-1 to the coding layer-4, and obtains the corresponding attention module outputs-1 to-4 through attention calculation. The output of the attention module is added with the output of the corresponding coding layer and the output of the previous decoding layer, and the sum is directly sent to the next decoding layer as input.
Further, as shown in fig. 3, the coding layer-5 is a coding layer having a structure different from the coding layer-1 to the coding layer-4, and it uses a convolution kernel with a size of 2 × 2 to perform convolution with a step size of 2, and the padding is 0, and the size of the output matrix is divided by 2; discarding, batch normalization processing and activating functions are connected after convolution operation; the output of the coding layer-5 is directly input into the decoding layer-5;
Further, as shown in FIG. 4, decoding layer-5 contains convolution block-1, convolution block-2, and a deconvolution block, with convolution block-3 as the last part. Convolution blocks-1 to -3 use 1 × 1 kernels with stride 1 and zero padding, producing outputs of the same size as the input. After each convolution, dropout, batch normalization, and an activation function follow in turn. The deconvolution block first performs deconvolution via the ConvTranspose2d function with a 2 × 2 kernel and stride 2, which doubles the input size, followed immediately by batch normalization and an activation function.
Further, as shown in fig. 5, the decoding layers-1, -2, -3 to-4 have the same structure, which includes a convolution block-1, an inverse convolution block, and a convolution block-2. Both convolution block-1, convolution block-2 will perform the convolution using a convolution kernel of size 1x1 and step size 1 and fill in 0, which will result in the same size as the input, followed by discard, batch normalization and activation. The deconvolution block will perform deconvolution by the ConvTranspose2d function with a kernel size of 3 x 3 and a step size of 2, which will multiply the input size by 2, immediately followed by the batch normalization process and the activation function.
Further, as shown in fig. 6, the output layer includes a deconvolution block-1, a convolution block-2, and a deconvolution block-2; the deconvolution block-1 will be deconvoluted by the function ConvTranspose2d with a convolution kernel size of 3 x 3 with a step size of 2, which will multiply the input size by 2. The batch normalization process and activation function then follows. The convolution block-1 and convolution block-2 have the same structure, it will perform convolution with step size 1 with a convolution kernel of size 3 x 3 and fill 0, the output matrix is the same size as the input. Discard, batch normalization, and activate function concatenate. Deconvolution block-2 only needs to perform deconvolution by the ConvTranspose2d function, with a kernel size of 2 x 2 and a step size of 2, which will multiply the input size by 2. The output will be the final predicted image, which is the same size as the input image.
On the basis of the above, the attention module is positioned behind each encoder and connected with the corresponding decoder. Cracks to be segmented differ in scale, illumination, and viewpoint, and since the convolution operation introduces more local receptive fields, features corresponding to pixels with the same label may differ, which can lead to intra-class inconsistency and affect accuracy. Therefore, context information is extracted by establishing a correlation mechanism among the global features; this strengthens the crack segmentation capability, effectively captures long-range context information, and improves the feature representation capability.
As shown in fig. 7, two types of attention modules are added to obtain a global context from local features in the network. For the output of each ResNet34 encoding layer, a convolutional layer is first applied to obtain the features of the different layers, which does not change the input size.
The first attention module is the position attention module, which extracts a larger range of context information from the local features. Feature maps A, B, C are generated using convolutional layers, where $\{A, B, C, D\} \in \mathbb{R}^{C\times H\times W}$. A, B, and C are then reshaped to $\mathbb{R}^{C\times N}$, where $N = H \times W$ is the number of pixels.

B is transposed to $\mathbb{R}^{N\times C}$, so that a matrix multiplication can be performed between the transpose of B and C, yielding an $\mathbb{R}^{N\times N}$ matrix; a softmax layer is then applied to compute the spatial attention map $S \in \mathbb{R}^{N\times N}$, as in equation (1):

$$s_{ji} = \frac{\exp(B_i \cdot C_j)}{\sum_{i=1}^{N}\exp(B_i \cdot C_j)} \tag{1}$$

where $s_{ji}$ measures the influence of the i-th position on the j-th position; the more similar the feature representations of two positions, the greater the correlation between them. S is then transposed, which keeps the shape $\mathbb{R}^{N\times N}$. The invention performs a matrix multiplication between A and the transpose of S, reshapes the $\mathbb{R}^{C\times N}$ result to $\mathbb{R}^{C\times H\times W}$, and finally multiplies it by the scale parameter α and sums it element-wise with the original convolution feature D to obtain the final output $H \in \mathbb{R}^{C\times H\times W}$, as shown in equation (2):

$$H_j = \alpha\sum_{i=1}^{N}\left(s_{ji}A_i\right) + D_j \tag{2}$$

where α is initialized to 0 and gradually acquires more weight through learning.

Likewise, the channel attention module, shown as B in FIG. 7, can emphasize interdependent feature maps and improve the semantics-specific feature representation. Convolutions are first performed to extract feature maps E, F, G, H, with $\{E, F, G, H\} \in \mathbb{R}^{C\times H\times W}$. The matrices F and G are reshaped to $\mathbb{R}^{C\times N}$, where $N = H \times W$ is the number of pixels. F is then transposed to $\mathbb{R}^{N\times C}$, so that a matrix multiplication between the transpose of F and E yields an $\mathbb{R}^{N\times N}$ result matrix, to which a softmax layer is applied to compute the attention map $X \in \mathbb{R}^{N\times N}$, as in equation (3):

$$x_{ji} = \frac{\exp(F_i \cdot E_j)}{\sum_{i=1}^{N}\exp(F_i \cdot E_j)} \tag{3}$$

where $x_{ji}$ measures the influence of the i-th position on the j-th position.

A matrix multiplication is then performed between the softmax result X and the reshaped G, the $\mathbb{R}^{C\times N}$ result is reshaped to $\mathbb{R}^{C\times H\times W}$, and finally it is multiplied by the scale parameter β and summed element-wise with the original convolution feature H to obtain the final output $I \in \mathbb{R}^{C\times H\times W}$, as shown in equation (4):

$$I_j = \beta\sum_{i=1}^{N}\left(x_{ji}G_i\right) + H_j \tag{4}$$

where β is initialized to 0 and gradually acquires more weight through learning.

Thus, the final feature of each channel is a weighted sum of the features of all channels and the original features, which models long-range semantic dependencies between feature maps well. With the position attention output H and the channel attention output I both of size C × H × W, the two attention results are fused: the weight given to position attention is λ and, accordingly, the weight of channel attention is 1 − λ. The proportional element-wise sum is calculated as shown in equation (5):

$$O = \lambda H + (1-\lambda)\, I \tag{5}$$

where O is the fused output and λ is a hyper-parameter; λ = 0.8 emphasizes the position attention for crack segmentation.
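A compact PyTorch sketch of the two attention modules and the λ-weighted fusion of equation (5) is given below, following the dimensions described above; the 1 × 1 convolutions generating A, B, C and E, F, G, and the exact softmax axis, are assumptions.

```python
import torch
import torch.nn as nn

class PositionAttention(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv_a = nn.Conv2d(channels, channels, 1)
        self.conv_b = nn.Conv2d(channels, channels, 1)
        self.conv_c = nn.Conv2d(channels, channels, 1)
        self.alpha = nn.Parameter(torch.zeros(1))      # initialized to 0, learned

    def forward(self, d):
        nb, c, h, w = d.shape
        a = self.conv_a(d).view(nb, c, -1)             # C x N
        b = self.conv_b(d).view(nb, c, -1)             # C x N
        cc = self.conv_c(d).view(nb, c, -1)            # C x N
        s = torch.softmax(b.transpose(1, 2) @ cc, 1)   # N x N map, eq. (1)
        out = (a @ s).view(nb, c, h, w)                # apply S to A
        return self.alpha * out + d                    # eq. (2)

class ChannelAttention(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv_e = nn.Conv2d(channels, channels, 1)
        self.conv_f = nn.Conv2d(channels, channels, 1)
        self.conv_g = nn.Conv2d(channels, channels, 1)
        self.beta = nn.Parameter(torch.zeros(1))       # initialized to 0

    def forward(self, feat):
        nb, c, h, w = feat.shape
        e = self.conv_e(feat).view(nb, c, -1)
        f = self.conv_f(feat).view(nb, c, -1)
        g = self.conv_g(feat).view(nb, c, -1)
        x = torch.softmax(f.transpose(1, 2) @ e, 1)    # N x N map, eq. (3)
        out = (g @ x).view(nb, c, h, w)                # apply X to G
        return self.beta * out + feat                  # eq. (4)

def fuse_attention(h_pos, i_chan, lam=0.8):
    # Eq. (5): proportional fusion, position attention weighted lam = 0.8.
    return lam * h_pos + (1 - lam) * i_chan
```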
The encoder and decoder are connected by bridges; as shown in fig. 2, a bridge connector links each encoding layer to the decoder and is implemented by merging the output of each encoding layer with the output of the attention module and of the previous decoding layer. By feeding this fused output to the decoding layer, the encoding-layer and corresponding attention information can be captured.
Step four: determining a loss function; the invention uses three loss functions for comparison, namely pixel cross entropy loss (CE), balanced pixel cross entropy loss (BCE) and Dice loss:
wherein the pixel cross entropy loss CE is shown in the following formula (6):
$$CE = -\frac{1}{n\times n}\sum_{i=1}^{n\times n}\left[p_i\log\hat{p}_i + (1-p_i)\log(1-\hat{p}_i)\right] \tag{6}$$

where i is the pixel index, n × n is the size of the output image, $p_i$ is the true value of the sample (1 for the positive class, 0 for the negative class), and $\hat{p}_i$ is the probability that the sample is predicted as positive.
The balanced pixel cross entropy loss is similar to the pixel cross entropy loss, but it assigns weights to the positive and negative samples, with the weights summing to 1. The formula is shown in equation (7) below:

$$BCE = -\frac{1}{n\times n}\sum_{i=1}^{n\times n}\left[\beta\,p_i\log\hat{p}_i + (1-\beta)(1-p_i)\log(1-\hat{p}_i)\right] \tag{7}$$

where BCE is the balanced pixel cross entropy loss, n × n is the size of the output image, β is the balance coefficient, $p_i$ is the true value of the sample (1 for the positive class, 0 for the negative class), and $\hat{p}_i$ is the probability that the sample is predicted as positive. The Dice loss is designed from the perspective of the intersection-over-union (IoU), as shown in equation (8):

$$Dice = 1 - \frac{2\,TP}{2\,TP + FP + FN} \tag{8}$$

where TP is the number of pixel true positives, FP the number of pixel false positives, and FN the number of pixel false negatives;
step five: determining an optimizer; by using the Adam optimizer, the Adam optimizer has the advantages of high efficiency, small occupied memory, suitability for large-scale data and the like.
Step six: initializing the weight matrix; for the ResNet34 pre-trained model part, the weights of the pre-trained model are used, and for the other layers besides ResNet34, including the input layer, the output layer, encoding layer-5, and decoding layer-1 to decoding layer-5, the weight matrices are initialized using a normal distribution.
Step seven: forward propagation; the input signal obtains the output of each layer with the help of the weight matrix, and finally reaches the predicted value of the output layer.
Step eight: backward propagation; after the network prediction result calculated with an arbitrary set of random parameters is obtained through forward propagation, the parameters are corrected and updated using the gradient of the loss function with respect to each parameter.
Step nine: updating the weight matrix; and updating the weight matrix according to the gradient of the parameters obtained by back propagation.
Step ten: if the maximum number of training iterations has not been reached, return to step seven and continue forward propagation; otherwise, save the best-performing CrackResAttentionNet binary model.
Step eleven: inputting a crack image of the asphalt pavement to be segmented; and collecting road surface crack images shot by the vehicle-mounted camera as the input of the system.
Step twelve: preprocessing the image; the preprocessing includes scaling large images to a uniform size of 448 × 448 pixels, and if the image is rectangular, it must first be made square (e.g., center-crop 800 × 600 to 600 × 600).
Step thirteen: loading the trained CrackResAttentionNet, comprising the following steps:
step 1301: finding out a trained model file according to the transmitted file name;
step 1302: reading the model file to a memory;
step 1303: the prediction model predicts by using parameters in the loaded model file;
Step fourteen: segmentation and output of the crack image; a pavement image with cracks is input, and the segmented pavement image is predicted by the trained CrackResAttentionNet, with crack pixels displayed in white and the remaining background in black.
Step fifteen: the trained CrackResAttentionNet model file is stored on disk, and the model binary file is loaded into memory.
EXAMPLE III
Based on the above embodiments, the performance of the CrackResAttentionNet model is evaluated on two sets of test data, one based on the public pavement crack data set and the other based on the Yantai data set.
All tests were performed on a computer of the following specifications:
the software environment is based on ubuntu16.04, python being the primary programming language. The experiments were performed on a Pytorch 1.5 deep learning framework.
The invention adopts mini-batch stochastic gradient descent as the optimization algorithm for training. The values of the hyper-parameters are as follows: the weight decay factor is 0.0002, the momentum is 0.9, the learning rate is 0.01, the mini-batch size is 4, and the number of epochs is 60.
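With these values, the optimizer configuration would look roughly like the sketch below; note that step five of the method mentions Adam, while this example follows the SGD settings quoted here, and the model is a placeholder.

```python
import torch

def make_optimizer(model):
    # Hyper-parameters quoted above: learning rate 0.01, momentum 0.9,
    # weight decay factor 0.0002.
    return torch.optim.SGD(model.parameters(), lr=0.01,
                           momentum=0.9, weight_decay=0.0002)

BATCH_SIZE = 4  # mini-batch size
EPOCHS = 60     # number of epochs
```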
For each type of experiment, the invention runs typical image segmentation models, including ENet, ExFuse, FCN, LinkNet, SegNet, and UNet, in addition to CrackResAttentionNet.
The ENet (Efficient Neural Network) segmentation network is particularly good at low-latency operation because it has fewer parameters; the ExFuse (Enhancing Feature Fusion for Semantic Segmentation) network effectively combines low-level and high-level features, greatly improving segmentation accuracy; FCN (Fully Convolutional Networks) was the first segmentation model to make a major breakthrough by using full convolution instead of fully connected layers; the LinkNet segmentation model is also based on an encoder-decoder framework and achieves better accuracy with fewer parameters; the SegNet segmentation model is specially designed for efficient semantic segmentation; UNet is a symmetric encoder-decoder architecture, shaped like the letter U, that was initially used for medical image segmentation.
For each of the above typical models, the invention trains with three different loss functions, namely pixel cross entropy loss (CE), balanced pixel cross entropy loss (BCE), and Dice loss (Dice).
For the crack segmentation task in the present invention, the following evaluation indices are used: precision (P), average IoU, recall (R), and F1. The F1 score is the harmonic mean of precision and recall. Crack pixels (white pixels in the image) are defined as positive samples, and pixels are classified into four types according to the combination of labeled and predicted results: true positive (TP), false positive (FP), true negative (TN), and false negative (FN).
Precision is defined as the ratio of correctly predicted crack pixels to all predicted crack pixels; recall is defined as the ratio of correctly predicted crack pixels to all real crack pixels; the F1 score is the harmonic mean of precision and recall. Precision is given by equation (9), recall by equation (10), and the F1 score by equation (11).
$$Precision = \frac{TP}{TP + FP} \tag{9}$$

$$Recall = \frac{TP}{TP + FN} \tag{10}$$

$$F1 = \frac{2 \times Precision \times Recall}{Precision + Recall} \tag{11}$$
The intersection over union (IoU) reflects the degree of overlap between two objects. In the invention, the IoU is evaluated on the "crack" category to provide a measure of the overlap between the actual asphalt pavement cracks and the predicted cracks, as shown in equation (12).
$$IoU = \frac{TP}{TP + FP + FN} \tag{12}$$
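A sketch of these pixel-level metrics, computed from binary prediction and ground-truth masks, is shown below; the small epsilon guarding against empty denominators is an addition.

```python
import numpy as np

def crack_metrics(pred, truth, eps=1e-7):
    # pred, truth: boolean arrays where True marks crack (positive) pixels.
    tp = np.logical_and(pred, truth).sum()
    fp = np.logical_and(pred, ~truth).sum()
    fn = np.logical_and(~pred, truth).sum()
    precision = tp / (tp + fp + eps)                          # eq. (9)
    recall = tp / (tp + fn + eps)                             # eq. (10)
    f1 = 2 * precision * recall / (precision + recall + eps)  # eq. (11)
    iou = tp / (tp + fp + fn + eps)                           # eq. (12)
    return precision, recall, f1, iou
```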
The test results are as follows:
1. public data set
TABLE 1 Public crack data set (CE loss)

| Segmentation model | Precision/% | Average IoU | Recall/% | F1/% |
| --- | --- | --- | --- | --- |
| ENet | 80.03 | 0.7222 | 83.94 | 81.94 |
| ExFuse | 82.22 | 0.7170 | 81.17 | 81.69 |
| FCN | 81.87 | 0.7102 | 77.72 | 79.74 |
| LinkNet | 81.15 | 0.7097 | 82.62 | 81.88 |
| SegNet | 78.00 | 0.6632 | 75.18 | 76.56 |
| UNet | 80.19 | 0.7042 | 82.88 | 81.51 |
| CrackResAttentionNet | 82.58 | 0.7283 | 85.13 | 83.84 |
TABLE 2 Public crack data set (BCE loss)

[Table 2 was rendered as an image in the original and its per-model data is not recoverable; per the results summary below, CrackResAttentionNet achieved precision 89.40%, average IoU 0.7151, recall 81.09%, and F1 85.04% under BCE loss.]
TABLE 3 Public crack data set (Dice loss)

| Segmentation model | Precision/% | Average IoU | Recall/% | F1/% |
| --- | --- | --- | --- | --- |
| ENet | 76.18 | 0.5545 | 56.68 | 65.00 |
| ExFuse | 48.92 | 0.4888 | 50.00 | 49.45 |
| FCN | 80.17 | 0.6783 | 79.11 | 79.64 |
| LinkNet | 86.97 | 0.7076 | 83.00 | 84.94 |
| SegNet | 82.92 | 0.6696 | 84.35 | 83.63 |
| UNet | 80.76 | 0.7002 | 85.20 | 82.92 |
| CrackResAttentionNet | 90.72 | 0.7169 | 81.93 | 86.10 |
2. Yantai data set
TABLE 4 Yantai crack data set (CE loss)

[Table 4 was rendered as an image in the original; its data is not recoverable.]
TABLE 5 Yantai crack data set (BCE loss)

| Segmentation model | Precision/% | Average IoU | Recall/% | F1/% |
| --- | --- | --- | --- | --- |
| ENet | 94.67 | 0.8120 | 92.34 | 93.49 |
| ExFuse | 95.18 | 0.8203 | 91.85 | 93.48 |
| FCN | 93.05 | 0.8295 | 90.64 | 91.83 |
| LinkNet | 95.07 | 0.8253 | 92.08 | 93.55 |
| SegNet | 91.24 | 0.7806 | 83.04 | 86.95 |
| UNet | 94.28 | 0.8161 | 90.26 | 92.23 |
| CrackResAttentionNet | 96.17 | 0.8369 | 93.44 | 94.79 |
TABLE 6 Yantai crack data set (Dice loss)

| Segmentation model | Precision/% | Average IoU | Recall/% | F1/% |
| --- | --- | --- | --- | --- |
| ENet | 94.80 | 0.8217 | 92.10 | 93.43 |
| ExFuse | 92.10 | 0.7412 | 87.66 | 89.87 |
| FCN | 90.23 | 0.7765 | 89.16 | 89.69 |
| LinkNet | 94.45 | 0.8076 | 91.62 | 93.01 |
| SegNet | 91.80 | 0.7486 | 90.23 | 91.01 |
| UNet | 93.76 | 0.8011 | 91.10 | 92.41 |
| CrackResAttentionNet | 95.43 | 0.8275 | 94.20 | 94.81 |
From the test results shown in the tables above, it can be seen that the CrackResAttentionNet proposed by the present invention performs better than the existing typical methods, especially in terms of precision and recall, which directly reflect the location and severity of the cracks.
For the same method, comparing the three different loss functions (CE, BCE, Dice) shows that the balanced pixel cross entropy loss (BCE) performs better than the other two. The BCE-loss segmentation outputs of each model on sample images are shown in fig. 8 and fig. 9, from which it can be seen that the segmentation by CrackResAttentionNet is very close to the ground truth, while typical models such as SegNet, FCN, and ExFuse are visibly misled by noise, segmenting white non-crack regions.
Using CrackResAttentionNet and the typical models (ENet, ExFuse, FCN, LinkNet, SegNet, UNet) under three different loss functions (CE, BCE, Dice), the test results on the public crack data set and the Yantai data set show that CrackResAttentionNet with the BCE loss function achieves precision (89.40%), average IoU (71.51%), recall (81.09%), and F1 (85.04%) on the public data set, and precision (96.17%), average IoU (83.69%), recall (93.44%), and F1 (94.79%) on the Yantai data set.
The invention proposes an encoder-decoder network structure for pixel-level asphalt pavement crack detection and image segmentation, together with its concrete application method. The encoder at the core mainly uses the convolutional layers of ResNet34 to extract image features, with an additional encoding layer added behind them to extract information better. The decoder uses deconvolution layers to perform semantic segmentation of crack and non-crack pixels, and an additional position attention module and channel attention module are connected behind each encoder to capture long-range context information. The outputs of the two attention modules are fused in proportion, which places more emphasis on position information; the output of each encoding layer is fused with the attention output and connected with the corresponding decoding layer, and the output of the previous decoding layer is the input to the next decoding layer. The decoding layers and their up-sampling operations can therefore make full use of spatial information and improve prediction precision. By implementing the technical scheme of the invention, asphalt pavement cracks can be identified accurately and intelligently, improving the level and efficiency of intelligent crack detection, management, and maintenance decision-making.
The technical features mentioned above are combined with each other to form various embodiments which are not listed above, and all of them are regarded as the scope of the present invention described in the specification; also, modifications and variations may be suggested to those skilled in the art in light of the above teachings, and it is intended to cover all such modifications and variations as fall within the true spirit and scope of the invention as defined by the appended claims.

Claims (6)

1. A method for segmenting an asphalt pavement crack image based on a deep convolutional neural network is characterized by comprising the following steps:
the method comprises the following steps: preparing a picture data set of the asphalt pavement cracks;
step two: preprocessing a picture; the pre-processing of the picture includes scaling a large image to a uniform size;
step three: setting a model structure of CrackResAttentionNet; the CrackResAttentionNet model adopts a structure based on an encoder-decoder, and comprises an encoder and a decoder, wherein an attention module is added between each encoder and each decoder, and is positioned behind each encoder and connected with the corresponding decoder;
step four: determining a loss function; the comparison was made using pixel cross entropy loss (CE), balanced pixel cross entropy loss (BCE), and Dice loss:
step 401: the pixel cross entropy loss CE is shown in equation (6) below:
Figure FDA0002885593570000011
i represents the index of the pixel, n x n represents the size of the output image, p is the true value of the sample, the positive class is 1, the negative class is 0,
Figure FDA0002885593570000012
a probability of predicting a sample as positive;
step 402: the balanced pixel cross-entropy penalty is similar to the pixel cross-entropy penalty, with a sum of weights of 1, as shown in equation (7) below:
Figure FDA0002885593570000013
wherein BCE is balance pixel cross entropy loss, n x n represents the size of an output pixel, beta is a balance coefficient, p is a real value of a sample, a positive class is 1, a negative class is 0,
Figure FDA0002885593570000014
a probability of predicting a sample as positive;
step 403: the Dice loss is designed from the perspective of the cross-over ratio IoU, and is shown in equation (8):
Figure FDA0002885593570000021
in the formula (8), TP is pixel true positive, FN is pixel false negative;
step five: determining an optimizer, and adopting an Adam optimizer;
step six: initializing a weight matrix; for the ResNet34 pre-training model part, using the weight of the pre-training model, and for other layers except ResNet34, including an input layer, an output layer, a coding layer 5, a decoding layer 1 to a decoding layer 5, initializing a weight matrix by using normal distribution;
step seven: forward propagation; the input signal obtains the output of each layer with the help of the weight matrix, and finally reaches the predicted value of the output layer;
step eight: backward propagation; after a network prediction result calculated by any group of random parameters is obtained through forward propagation, correcting and updating by utilizing the gradient of a loss function relative to each parameter;
step nine: updating the weight matrix; updating the weight matrix according to the gradient of the parameters obtained by back propagation;
step ten: if the maximum number of training iterations has not been reached, return to step seven and continue forward propagation; otherwise, save the best-performing CrackResAttentionNet binary model;
step eleven: inputting a crack image of the asphalt pavement to be segmented; collecting the shot asphalt pavement crack images and using the collected images as the input of a system;
step twelve: preprocessing an image; the pre-processing of the picture includes scaling a large image to a uniform size;
step thirteen: loading the trained CrackResAttentionNet, comprising the following steps:
step 1301: finding out a trained model file according to the transmitted file name;
step 1302: reading the model file to a memory;
step 1303: the prediction model predicts by using parameters in the loaded model file;
step fourteen: segmentation and output of the crack image; an asphalt pavement image with cracks is input, and the segmented asphalt pavement image is predicted by the trained CrackResAttentionNet, with crack pixels displayed in white and the remaining background in black;
step fifteen: obtaining the trained CrackResAttentionNet model file, storing it on disk, and loading the model binary file into memory.
2. The method according to claim 1, wherein in the first step, the scheme of step 101 or step 102 is specifically adopted:
step 101: directly using a marked public crack segmentation data set, wherein the data set comprises an asphalt pavement crack image and a marked crack shape and position as a crack image data set;
step 102: shooting real asphalt pavement crack photos to form a crack picture data set; manually marking the shape and the position of each crack photo by Labelme software;
102, manually marking the label by adopting the following 4 sub-steps:
step 1021, starting a Labelme software window, and opening a picture of the asphalt pavement crack;
step 1022, drawing a polygon on the outer contour of the crack by using a mouse according to the shape of the crack, so that the polygon just covers the crack;
step 1023, naming the crack as a crack mark and saving the image file;
step 1024, Labelme will automatically generate a json file containing the position and the mark of each coordinate point of the polygon.
3. The method of claim 2, wherein in step two, the image is scaled to a uniform size of 448 × 448 pixels, and if the image is rectangular, it must first be made square.
4. The method of claim 3, wherein in the third step, the encoder comprises an input layer, an encoding layer-1 to an encoding layer-5, wherein the encoding layer-1 to the encoding layer-4 respectively correspond to the first layer to the fourth layer of the ResNet34 network which is pre-trained, and are ResNet34-1 to ResNet34-4 respectively; the decoder consists of a decoding layer-1 to a decoding layer-5 and an output layer.
5. The method according to claim 4, wherein in step three, the attention modules take the outputs of encoding layer-1 to encoding layer-4 and produce the corresponding attention module outputs-1 to -4 through attention calculation; the output of each attention module is added to the output of the corresponding encoding layer and the output of the previous decoding layer, and the sum is fed directly into the next decoding layer as input; encoding layer-5 has a structure different from encoding layer-1 to encoding layer-4, performing a stride-2 convolution with a 2 × 2 kernel and padding 0, halving the size of the output matrix relative to the input; dropout, batch normalization, and an activation function follow the convolution; the output of encoding layer-5 is fed directly into decoding layer-5; decoding layer-5 contains convolution block-1, convolution block-2, and a deconvolution block, with convolution block-3 as the last part; convolution blocks-1 to -3 use 1 × 1 kernels with stride 1, producing outputs of the same size as the input, each followed in turn by dropout, batch normalization, and an activation function; the deconvolution block first performs deconvolution via the ConvTranspose2d function with a 2 × 2 kernel and stride 2, which doubles the input size, followed immediately by batch normalization and an activation function.
6. The method of claim 5, wherein in step three, the attention module comprises a position attention module and a channel attention module; the position attention module extracts a wider range of context information from the local features; feature maps A, B, C are generated using convolution layers, where {A, B, C, D} ∈ R^{C×H×W} and D is the original convolution feature; A, B, C are then reshaped to R^{C×N}, where N = H × W is the number of pixels; B is transposed to R^{N×C}, and matrix multiplication between the transpose of B and C gives an R^{N×N} result matrix; a softmax layer is then applied to calculate the spatial attention feature map S ∈ R^{N×N}, as in equation (1):

s_{ji} = \frac{\exp(B_i \cdot C_j)}{\sum_{i=1}^{N} \exp(B_i \cdot C_j)}    (1)

In equation (1), s_{ji} measures the effect of the i-th position on the j-th position; the more similar the feature representations of two positions, the greater the correlation between them. S is then transposed, and matrix multiplication between A and the transpose of S gives an R^{C×N} result, which is reshaped back to R^{C×H×W}; finally, this result is multiplied by the scale parameter α and summed element-wise with the original convolution feature D to obtain the final output H ∈ R^{C×H×W}, as in equation (2):

H_j = \alpha \sum_{i=1}^{N} (s_{ji} A_i) + D_j    (2)

In equation (2), α is initialized to 0 and gradually acquires more weight through learning.

The channel attention module first performs convolutions to extract feature maps E, F, G, H, where {E, F, G, H} ∈ R^{C×H×W}; E, F, G are reshaped to R^{C×N}, where N = H × W is the number of pixels; F is then transposed to R^{N×C}, and matrix multiplication between the transpose of F and E gives an R^{N×N} result matrix; a softmax layer is then applied to calculate the attention map X ∈ R^{N×N}, as in equation (3):

x_{ji} = \frac{\exp(F_i \cdot E_j)}{\sum_{i=1}^{N} \exp(F_i \cdot E_j)}    (3)

In equation (3), x_{ji} measures the effect of the i-th position on the j-th position. Matrix multiplication is then performed between the reshaped G and the softmax result X, giving an R^{C×N} result that is reshaped back to R^{C×H×W}; finally, this result is multiplied by the scale parameter β and summed element-wise with the original convolution feature H to obtain the final output I ∈ R^{C×H×W}, as in equation (4):

I_j = \beta \sum_{i=1}^{N} (x_{ji} G_i) + H_j    (4)
In equation (4), β is initialized to 0 and gradually acquires more weight through learning. The scaled element-wise sum that fuses the two attention outputs is calculated as in equation (5):

F_{out} = \lambda H + (1 - \lambda) I    (5)

In equation (5), λ is a hyper-parameter; it is set to 0.8 so that the position attention is emphasized for crack segmentation.
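For illustration, a compact sketch of the position attention computation of equations (1) and (2) is given below; the 1 x 1 projection convolutions and the channel reduction factor of 8 are assumptions borrowed from common dual-attention implementations, and the channel branch of equations (3) and (4) follows the same pattern.

```python
# Minimal sketch of the position attention module, equations (1)-(2).
import torch
import torch.nn as nn

class PositionAttention(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.to_b = nn.Conv2d(ch, ch // 8, 1)      # projection producing B
        self.to_c = nn.Conv2d(ch, ch // 8, 1)      # projection producing C
        self.to_a = nn.Conv2d(ch, ch, 1)           # projection producing A
        self.alpha = nn.Parameter(torch.zeros(1))  # scale parameter, initialized to 0

    def forward(self, d):                          # d = original feature D, shape (n, C, H, W)
        n, c, h, w = d.shape
        b = self.to_b(d).view(n, -1, h * w)            # (n, C', N)
        c_map = self.to_c(d).view(n, -1, h * w)        # (n, C', N)
        energy = torch.bmm(b.transpose(1, 2), c_map)   # (n, N, N), entry [i, j] = B_i . C_j
        s = torch.softmax(energy, dim=1)               # eq. (1): normalize over i
        a = self.to_a(d).view(n, c, h * w)             # (n, C, N)
        out = torch.bmm(a, s).view(n, c, h, w)         # sum_i s_ji * A_i
        return self.alpha * out + d                    # eq. (2): scaled sum with D
```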
CN202110012193.2A 2021-01-06 2021-01-06 Asphalt pavement crack image segmentation method based on deep convolutional neural network Active CN112634292B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110012193.2A CN112634292B (en) 2021-01-06 2021-01-06 Asphalt pavement crack image segmentation method based on deep convolutional neural network

Publications (2)

Publication Number Publication Date
CN112634292A (en) 2021-04-09
CN112634292B (en) 2021-08-24

Family

ID=75290771

Country Status (1)

Country Link
CN (1) CN112634292B (en)

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190130273A1 (en) * 2017-10-27 2019-05-02 Salesforce.Com, Inc. Sequence-to-sequence prediction using a neural network model
CN110009641A (en) * 2019-03-08 2019-07-12 广州视源电子科技股份有限公司 Crystalline lens dividing method, device and storage medium
CN111222580A (en) * 2020-01-13 2020-06-02 西南科技大学 High-precision crack detection method
CN111402259A (en) * 2020-03-23 2020-07-10 杭州健培科技有限公司 Brain tumor segmentation method based on multi-level structure relation learning network
CN111986204A (en) * 2020-07-23 2020-11-24 中山大学 Polyp segmentation method and device and storage medium
CN111915592A (en) * 2020-08-04 2020-11-10 西安电子科技大学 Remote sensing image cloud detection method based on deep learning
CN112233105A (en) * 2020-10-27 2021-01-15 江苏科博空间信息科技有限公司 Road crack detection method based on improved FCN
CN112183507A (en) * 2020-11-30 2021-01-05 北京沃东天骏信息技术有限公司 Image segmentation method, device, equipment and storage medium

Cited By (26)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113129281A (en) * 2021-04-13 2021-07-16 广西大学 Wheat stem section parameter detection method based on deep learning
CN113129281B (en) * 2021-04-13 2022-06-21 广西大学 Wheat stem section parameter detection method based on deep learning
CN113313669A (en) * 2021-04-23 2021-08-27 石家庄铁道大学 Method for enhancing semantic features of top layer of surface disease image of subway tunnel
CN113284093A (en) * 2021-04-29 2021-08-20 安徽省皖北煤电集团有限责任公司 Satellite image cloud detection method based on improved D-LinkNet
CN113421276A (en) * 2021-07-02 2021-09-21 深圳大学 Image processing method, device and storage medium
CN113421276B (en) * 2021-07-02 2023-07-21 深圳大学 Image processing method, device and storage medium
CN114170232A (en) * 2021-12-02 2022-03-11 匀熵教育科技(无锡)有限公司 Transformer-based X-ray chest radiography automatic diagnosis and COVID-19 infected area segmentation method
CN114170232B (en) * 2021-12-02 2024-01-26 匀熵智能科技(无锡)有限公司 Transformer-based X-ray chest radiography automatic diagnosis and COVID-19 infected area distinguishing method
CN114494868B (en) * 2022-01-19 2022-11-22 安徽大学 Unmanned aerial vehicle remote sensing building extraction method based on multi-feature fusion deep learning
CN114494868A (en) * 2022-01-19 2022-05-13 安徽大学 Unmanned aerial vehicle remote sensing building extraction method based on multi-feature fusion deep learning
CN114596266A (en) * 2022-02-25 2022-06-07 烟台大学 Concrete crack detection method based on ConcreteCrackSegNet model
CN114596266B (en) * 2022-02-25 2023-04-07 烟台大学 Concrete crack detection method based on ConcreteCrackSegNet model
CN114724133B (en) * 2022-04-18 2024-02-02 北京百度网讯科技有限公司 Text detection and model training method, device, equipment and storage medium
CN114724133A (en) * 2022-04-18 2022-07-08 北京百度网讯科技有限公司 Character detection and model training method, device, equipment and storage medium
CN114782405A (en) * 2022-05-20 2022-07-22 盐城工学院 Bridge crack detection method and device based on image recognition and machine vision
CN115147381A (en) * 2022-07-08 2022-10-04 烟台大学 Pavement crack detection method based on image segmentation
CN115147439B (en) * 2022-07-11 2023-12-29 南京工业大学 Concrete crack segmentation method and system based on deep learning and attention mechanism
CN115147439A (en) * 2022-07-11 2022-10-04 南京工业大学 Concrete crack segmentation method and system based on deep learning and attention mechanism
CN114897909A (en) * 2022-07-15 2022-08-12 四川大学 Crankshaft surface crack monitoring method and system based on unsupervised learning
CN115571656B (en) * 2022-09-28 2023-06-02 华能伊敏煤电有限责任公司 Automatic soil discharging control method and system based on material level detection
CN115571656A (en) * 2022-09-28 2023-01-06 华能伊敏煤电有限责任公司 Automatic dumping control method and system based on material level detection
CN115731243A (en) * 2022-11-29 2023-03-03 北京长木谷医疗科技有限公司 Spine image segmentation method and device based on artificial intelligence and attention mechanism
CN115731243B (en) * 2022-11-29 2024-02-09 北京长木谷医疗科技股份有限公司 Spine image segmentation method and device based on artificial intelligence and attention mechanism
CN116993730B (en) * 2023-09-26 2023-12-15 四川新视创伟超高清科技有限公司 Crack detection method based on 8K image
CN116993730A (en) * 2023-09-26 2023-11-03 四川新视创伟超高清科技有限公司 Crack detection method based on 8K image
CN117455813A (en) * 2023-11-15 2024-01-26 齐鲁工业大学(山东省科学院) Method for restoring Chinese character images of shielding handwritten medical records based on gating convolution and SCPAM attention module

Similar Documents

Publication Publication Date Title
CN112634292B (en) Asphalt pavement crack image segmentation method based on deep convolutional neural network
CN110188765B (en) Image semantic segmentation model generation method, device, equipment and storage medium
CN109886066B (en) Rapid target detection method based on multi-scale and multi-layer feature fusion
CN114092832B (en) High-resolution remote sensing image classification method based on parallel hybrid convolutional network
CN109523013B (en) Air particulate matter pollution degree estimation method based on shallow convolutional neural network
CN113780296A (en) Remote sensing image semantic segmentation method and system based on multi-scale information fusion
CN113642390B (en) Street view image semantic segmentation method based on local attention network
CN112862774B (en) Accurate segmentation method for remote sensing image building
CN113850824A (en) Remote sensing image road network extraction method based on multi-scale feature fusion
CN112418212B (en) YOLOv3 algorithm based on EIoU improvement
CN113034444A (en) Pavement crack detection method based on MobileNet-PSPNet neural network model
CN116310339A (en) Remote sensing image segmentation method based on matrix decomposition enhanced global features
Nakhaee et al. DeepRadiation: An intelligent augmented reality platform for predicting urban energy performance just through 360 panoramic streetscape images utilizing various deep learning models
CN114529552A (en) Remote sensing image building segmentation method based on geometric contour vertex prediction
CN114170446A (en) Temperature and brightness characteristic extraction method based on deep fusion neural network
CN112634174B (en) Image representation learning method and system
CN114187530A (en) Remote sensing image change detection method based on neural network structure search
CN112347531B (en) Method and system for predicting three-dimensional crack propagation paths in brittle marble
CN112651314A (en) Automatic landslide disaster-bearing body identification method based on semantic gate and double-temporal LSTM
CN115601759A (en) End-to-end text recognition method, device, equipment and storage medium
CN115330703A (en) Remote sensing image cloud and cloud shadow detection method based on context information fusion
CN114241470A (en) Natural scene character detection method based on attention mechanism
CN113870341A (en) Blast furnace sintering ore particle size detection method and system based on RGB and laser feature fusion
CN112508441B (en) Urban high-density outdoor thermal comfort evaluation method based on deep learning three-dimensional reconstruction
CN116030347B (en) High-resolution remote sensing image building extraction method based on attention network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant