CN113610778A - Bridge surface crack detection method and system based on semantic segmentation - Google Patents

Bridge surface crack detection method and system based on semantic segmentation

Info

Publication number
CN113610778A
CN113610778A
Authority
CN
China
Prior art keywords
crack
image
feature
bridge
images
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110817766.9A
Other languages
Chinese (zh)
Other versions
CN113610778B (en)
Inventor
卢涛
饶茜雅
吴志豪
张彦铎
吴云韬
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan Institute of Technology
Original Assignee
Wuhan Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan Institute of Technology
Priority to CN202110817766.9A
Publication of CN113610778A
Application granted
Publication of CN113610778B
Legal status: Active
Anticipated expiration

Classifications

    • G06T 7/0004 — Image analysis; inspection of images, e.g. flaw detection; industrial image inspection
    • G06F 18/2415 — Pattern recognition; classification techniques based on parametric or probabilistic models
    • G06N 3/045 — Neural networks; combinations of networks
    • G06N 3/08 — Neural networks; learning methods
    • G06T 7/11 — Segmentation; edge detection; region-based segmentation
    • G06T 2207/20021 — Dividing image into blocks, subimages or windows
    • G06T 2207/20081 — Training; learning
    • G06T 2207/30132 — Subject of image: masonry; concrete (industrial image inspection)
    • Y02P 90/30 — Computing systems specially adapted for manufacturing

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Quality & Reliability (AREA)
  • Probability & Statistics with Applications (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a bridge surface crack detection method and system based on semantic segmentation, comprising the following steps: collecting bridge crack images and carrying out pixel-level semantic annotation on them to construct a bridge crack segmentation data set; constructing a semantic segmentation network based on feature encoding and decoding that uses skip connections based on cross attention to combine high-level semantics with low-level fine-grained surface information; on the basis of a general classification loss, using a class consistency loss so that the network maps each pixel in an image to an n-dimensional vector in feature space, making the feature vectors of pixels of the same class close and the feature vectors of pixels of different classes far apart; after obtaining the segmented crack image, extracting the morphological skeleton of the crack and eliminating short branches to calculate the crack length; and calculating the crack width in pixels along the skeleton direction. The method achieves rapid and accurate segmentation of bridge cracks and yields a more structurally complete segmented crack image.

Description

Bridge surface crack detection method and system based on semantic segmentation
Technical Field
The invention belongs to the field of bridge crack detection, and particularly relates to a bridge surface crack detection method and system based on semantic segmentation.
Background
Bridges are important traffic infrastructure and need to be inspected and maintained regularly; the appearance of cracks on the bridge surface is a major safety hazard and the focus of inspection. Traditional bridge surface crack detection is based on manual measurement or digital image processing. Crack length is generally measured on site with tape measures and similar tools, while the width of concrete bridge cracks is generally measured with crack width gauges; this consumes a large amount of labour, and traditional manual measurement cannot achieve real-time detection. With the popularization of video monitoring equipment, digital image processing has been widely applied in production and daily life, and in recent years many scholars at home and abroad have adopted it for target detection and target segmentation in place of traditional manual inspection, with good results. However, traditional digital image processing methods are strongly affected by the environment, their accuracy is limited, and further improvement is difficult. Researchers at home and abroad have therefore applied deep learning to problems in the field of computer vision, and it has been shown that detection accuracy can be improved remarkably. Unlike general target detection, however, bridge surface crack detection requires segmenting the cracks and calculating their actual size for use in real scenes. Research that, based on deep learning, performs crack segmentation while preserving the structural integrity of the crack and calculates the actual crack size as accurately as possible therefore has both theoretical and practical significance.
Disclosure of Invention
The invention aims to provide a bridge surface crack detection method and system based on semantic segmentation, which detect bridge cracks by image semantic segmentation while improving the structural integrity of the segmented cracks.
In order to solve the above technical problems, the technical scheme of the invention is as follows. The bridge surface crack detection method based on semantic segmentation comprises the following steps:
(1) collecting bridge crack images and carrying out pixel-level semantic annotation on the collected images to construct a bridge surface crack data set;
(2) constructing a feature encoder module that extracts semantic features from the original picture through convolution operations and downsampling;
(3) constructing a feature decoder module based on cross attention, which gradually recovers position information from the high-level semantic features through upsampling and convolution, forms skip connections with each corresponding encoder layer, and builds two consecutive cross attention modules at each skip connection to extract context associations, combining high-level semantics and low-level fine-grained surface information more effectively;
(4) introducing a class consistency loss on top of the classification loss and training the network, so that the network maps each pixel in the image to an n-dimensional vector in feature space in which the feature vectors of pixels of the same class are close and those of different classes are far apart, yielding a pixel-level classification result;
(5) extracting the morphological skeleton of the crack segmentation result output by the network, eliminating short branches to calculate the crack length, calculating the crack width from the image crack along the skeleton direction, and estimating the actual crack size according to a proportional relation.
In some alternative embodiments, in step (1), a bridge crack data set is created that reflects the crack characteristics prone to missed detection. The data set contains 600 original crack images from 10 bridges. Nine 300 × 300 pixel sub-images are randomly cropped from each original image and divided into positive and negative samples, representing cracked and crack-free images respectively. The positive samples include web cracks, fine short cracks on the lower tension edge (with jitter blur), vertical cracks on the web (including low-contrast images), vertical cracks at the diaphragm (with complex background texture), and pier foundation concrete cracks (with water-stain interference). The negative samples include honeycomb pitted surfaces, spalled corners (with high similarity to crack edges), cavities and holes, rebar corrosion (with lighting shadows), sky, trees, water stains and shadows. As the positive and negative samples are unbalanced (the number of crack-free images is far greater than that of cracked images), 2,400 cracked images and 3,000 crack-free images are screened from the sub-images to make up the data set, of which 800 images are used as the test set (with a positive-to-negative sample ratio of 2:3).
In some optional embodiments, in step (2), a feature encoder is constructed. The encoding path of the encoder comprises three steps, denoted s1, s2 and s3; the input of each step is denoted ei0, ei1 and ei2 respectively, where ei0 is the original image. Each step performs the same operations: two convolutional layers are used, each followed by a ReLU activation layer; a max pooling with stride 2 then downsamples, and the number of channels is doubled at each downsampling. Each encoder step extracts semantic features at a different image scale, and its output is denoted eo0, eo1 and eo2. The output of each step is the input of the next, so eok = ei(k+1), k ∈ {0, 1}. Finally, eo2 is fed through two further convolutional layers, each followed by a ReLU activation layer, and the output serves as the decoder input.
in some alternative embodiments, in step (3):
the decoding path of the feature decoder consists of three steps s4,s5And s6Each step comprising: one convolution layer that upsamples and halves the number of channels, and then two convolution layers are used, each convolution followed by a ReLU activation function. The output of each step is defined as do0,do1,do2
In particular, it is established and combined in the corresponding steps of encoder and decoderTwo consecutive crosses attention the jump connection of the modules to combine the dense predictions of different granularities. The hopping connection is established at siAnd s7-iI ∈ 1,2,3, the output of each step of the decoder is first connected with the corresponding feature map extracted at multiple scales of the encoder: encoder output eo2Feature map and do0Splicing in channel dimension, denoted as T0In the same way, eo1And do1Splicing T1,eo0And do2Splicing to T2Input to the decoder as input to the cross attention groupi0,di1,di2Is defined as: CroA (CroA (T)0)),CroA(CroA(T1)),CroA(CroA(T2) CroA represents a cross attention module, i.e., T' from the output of the first module is used as the input to the second module. And acquiring the association relation between each feature point and the whole feature map through a cross attention module, and performing associated feature enhancement.
The cross attention module extracts context information in the horizontal and vertical directions to enhance the pixel-level feature representation. Each cross attention module operates as follows.
A feature map T is input. Two low-dimensional features Q and K are generated by 1 × 1 convolutional layers. Q and K further generate an attention map A through an affinity operation. At each position u in the spatial dimension of Q, a vector Qu is obtained. Meanwhile, the set Ωu is obtained by extracting from K the feature vectors in the same row or column as position u, with Ωi,u denoting the i-th element of Ωu. The affinity operation is

di,u = Qu Ωi,uᵀ, di,u ∈ D,

where di,u is the degree of correlation between the features Qu and Ωi,u, i = 1, …, H + W − 1. Softmax is then applied to D along the channel dimension to obtain the attention map A.
Another 1 × 1 convolutional layer is applied on T to generate V for feature adaptation. At each position u in the spatial dimension of V, a vector Vu and a set Φu are obtained; Φu is the set of feature vectors of V that lie in the same row or column as position u. The aggregation

T′u = Σ_{i=0}^{H+W−1} Ai,u Φi,u + Tu

adds context information to the local feature T to enhance its representation in a pixel-wise manner, where T′u is the feature vector at position u in T′ and Ai,u is a scalar value at channel i and position u of A. The module therefore has a wide contextual view and selectively extracts context according to spatial attention.
The feature vectors are mapped to the required number of classes using a 1 x 1 convolution at the last layer of the decoder.
In some alternative embodiments, in step (4), a class consistency loss l_class is introduced on the basis of the classification loss l_seg. For the semantic segmentation task, pixels belonging to the same class should have similar features, while pixels from different classes should have more distinct features; this property is named class consistency. The class consistency loss causes the network to map each pixel in the image to an n-dimensional vector in feature space such that the feature vectors of pixels of the same class are close and the feature vectors of pixels of different classes are far apart. The final loss function is defined as

l = l_seg + m·l_class

where l_class = α·l_var + β·l_dist + γ·l_reg is the class consistency loss, and m, α, β and γ are weight coefficients.
Same-class features with large distances are penalized by l_var:

l_var = (1/|C|) Σ_{c∈C} (1/N_c) Σ_{i=1}^{N_c} f_v(d_v), d_v = ‖μ_c − h_i‖

Different-class features with small distances are penalized by l_dist:

l_dist = (1/(|C|(|C|−1))) Σ_{c_a∈C} Σ_{c_b∈C, c_a≠c_b} f_d(d_d), d_d = ‖μ_{c_a} − μ_{c_b}‖

and l_reg pulls the mean feature of every class toward the origin:

l_reg = (1/|C|) Σ_{c∈C} ‖μ_c‖

where C is the set of classes present in the mini-batch of images, N_c is the number of valid elements belonging to class c, c ∈ C, h_i ∈ H is the feature vector at spatial position i, and μ_c = (1/N_c) Σ_{i=1}^{N_c} h_i is the mean feature of class c.
Here f_v is a piecewise distance function of d_v = ‖μ_c − h_i‖: when d_v > δ_d, f_v is quadratic; when d_v lies in (δ_v, δ_d], f_v is linear; and when d_v ≤ δ_v, f_v is zero:

f_v(d_v) = { (d_v − δ_v)², d_v > δ_d;  d_v − δ_v, δ_v < d_v ≤ δ_d;  0, d_v ≤ δ_v }

Likewise, f_d is a piecewise distance function of d_d = ‖μ_{c_a} − μ_{c_b}‖: when d_d is less than 2δ_d, f_d is quadratic, and when d_d is greater than or equal to 2δ_d, f_d is zero:

f_d(d_d) = { (2δ_d − d_d)², d_d < 2δ_d;  0, d_d ≥ 2δ_d }

δ_v and δ_d are the set margins.
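The three loss terms can be sketched numerically on raw per-pixel feature vectors as follows; the margin values, weights and the tiny two-class example are illustrative assumptions, not values fixed by the patent, and the piecewise same-class penalty follows the description in the text:

```python
import numpy as np

def f_var(d, delta_v, delta_d):
    """Piecewise same-class penalty: zero below delta_v, linear up to
    delta_d, quadratic beyond delta_d."""
    if d <= delta_v:
        return 0.0
    if d <= delta_d:
        return d - delta_v
    return (d - delta_v) ** 2

def class_consistency_loss(feats, labels, delta_v=0.5, delta_d=1.5,
                           alpha=1.0, beta=1.0, gamma=0.001):
    """Sketch of l_class = alpha*l_var + beta*l_dist + gamma*l_reg on
    feature vectors `feats` (N, n) with integer class ids `labels` (N,)."""
    classes = np.unique(labels)
    mus = {c: feats[labels == c].mean(axis=0) for c in classes}
    # l_var: pull features toward their class mean
    l_var = np.mean([np.mean([f_var(np.linalg.norm(mus[c] - h), delta_v, delta_d)
                              for h in feats[labels == c]]) for c in classes])
    # l_dist: push class means apart up to the 2*delta_d margin
    l_dist = 0.0
    if len(classes) > 1:
        l_dist = np.mean([max(2 * delta_d - np.linalg.norm(mus[a] - mus[b]), 0.0) ** 2
                          for a in classes for b in classes if a != b])
    # l_reg: pull class means toward the origin
    l_reg = np.mean([np.linalg.norm(mus[c]) for c in classes])
    return alpha * l_var + beta * l_dist + gamma * l_reg

# Two tight, well-separated clusters incur (almost) no var/dist penalty.
feats = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
labels = np.array([0, 0, 1, 1])
loss = class_consistency_loss(feats, labels)
```

With both clusters inside the δ_v margin and their means farther apart than 2δ_d, only the small γ-weighted regularizer contributes.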
To reduce the computational cost, a convolutional layer is first used at the output of the cross attention module to reduce the feature size, and the three losses described above are then applied to the feature map with fewer channels.
In some alternative embodiments, step (5) comprises: performing morphological skeletonization of the cracks, eliminating short branches, calculating the crack length, and calculating the crack width.
The skeleton is defined using the maximal disk method: the target skeleton consists of the centres of all maximal inscribed disks of the target. The morphological skeleton is expressed as:

S(A) = ∪_{j=0}^{J} S_j(A)

where

S_j(A) = (A ⊖ jB) − (A ⊖ jB) ∘ B

in which B is a structuring element, A ⊖ jB denotes j successive erosions of A by B, ∘ denotes the morphological opening, and

J = max{ j | A ⊖ jB ≠ ∅ }

is the last erosion step before A is eroded to the empty set.
for the short branches formed in the skeletonization process, firstly traversing the crack picture, and iteratively deleting the boundary point of a region, wherein the definition of the boundary point is as follows: the pixel value is 1, at least one pixel point with the adjacent pixel value of 0 represents a point of a background area by 0, a point of a crack area by 1 is represented by 8 fields, and whether the boundary point is a deletion point is judged by adopting two steps. In step I, a contour point p is identified if the following condition is met1Marking as a deletion point:
Figure BDA0003170781120000061
in the formula, N (p)1) Representing the number of non-zero adjacent pixels, T (p)1) Represents p2,p3,…p9,Sorting the number of grab switches of the sequence from 0 to 1;
in step II, the conditions of the above formulas (a) and (b) are kept unchanged, and the conditions of (c) and (d) are changed to:
Figure BDA0003170781120000062
the judgment method is as follows: the first step is to apply the condition of step I, if at least one of the conditions (a) to (d) in step I is violated, then p1If all conditions are met, marking the boundary points as deletion points, and after the step I is applied to all boundary points, modifying the values of the deletion points to 0 and then deleting the boundary points; step two, applying the conditions in the step II to the results of the step one, and the rule is the same as that of the step I; and finally obtaining an image formed by the points, namely the trimmed crack skeleton.
Because interrupted cracks are connected during processing, the problem of crack discontinuity does not occur, so the crack length can be obtained directly from the skeleton.
On the basis of the skeleton method, a method for calculating the crack width in image pixels is applied. The skeleton image is a single-pixel-wide image, so the tangential direction of each point on the skeleton line can be calculated from it. The tangent at each point of the crack is calculated, then the normal perpendicular to the tangent on the skeleton line; the distance between the intersection points of this normal with the crack boundary is the crack width.
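The normal-direction width measurement can be sketched as follows; the marching helper and the synthetic crack band are illustrative assumptions, not part of the patent:

```python
import numpy as np

def crack_width_at(mask, y, x, normal):
    """Walk from skeleton point (y, x) along +/- the unit normal until the
    crack mask is left; the traversed extent is the width in pixels."""
    ny, nx = normal
    width = 1  # count the skeleton pixel itself
    for sign in (1, -1):
        step = 1
        while True:
            yy = int(round(y + sign * step * ny))
            xx = int(round(x + sign * step * nx))
            inside = 0 <= yy < mask.shape[0] and 0 <= xx < mask.shape[1]
            if not inside or mask[yy, xx] == 0:
                break
            width += 1
            step += 1
    return width

# Horizontal crack band 4 px thick: skeleton tangent is (0, 1), normal (1, 0).
mask = np.zeros((10, 20), dtype=int)
mask[3:7, 2:18] = 1
w = crack_width_at(mask, 4, 10, normal=(1.0, 0.0))
```

For the 4-pixel-thick band the walk crosses exactly four crack pixels along the normal.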
The crack size is then estimated according to the proportional relation between the actual bridge crack and the crack in the picture: the measurements from the segmented bridge crack image are multiplied by the scale factor between the actual scene and the photograph to estimate the actual length and width of the crack in the image.
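Since the scene-to-image proportion is kept uniform, the conversion is a single multiplication; the scale factor and pixel measurements below are purely illustrative:

```python
# Convert pixel measurements to physical size with one fixed scale factor.
# 0.5 mm per pixel is an assumed, illustrative calibration value.
MM_PER_PIXEL = 0.5
length_px, width_px = 312, 4            # example skeleton length and width
length_mm = length_px * MM_PER_PIXEL    # estimated physical crack length
width_mm = width_px * MM_PER_PIXEL      # estimated physical crack width
```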
According to another aspect of the present invention, there is provided a bridge surface crack detection system based on semantic segmentation, including:
the data acquisition module is used for manufacturing a bridge crack semantic segmentation data set;
the encoder module is used for performing downsampling through convolution and pooling to extract semantic features from the original picture;
the cross attention-based feature decoder module, used for gradually recovering position information from the high-level semantic features through upsampling and convolution, forming skip connections with the features acquired from each corresponding encoder layer, and constructing two consecutive cross attention modules at each skip connection to extract context associations and combine high-level semantics and low-level fine-grained surface information more effectively;
the loss calculation module is used for introducing a category consistency loss on the basis of the classification loss and training a network to obtain a pixel-level classification result;
and the crack calculation module extracts a crack morphological framework, eliminates a crack short branch to calculate the crack length, and calculates the image crack width based on the morphological framework.
Compared with the prior art, the invention has the beneficial effects that:
the invention provides a bridge surface crack detection method and system based on semantic segmentation, which more effectively combines high-level semantics and low-level fine-grained surface layer information by using two cross attention at a jump connection of a network, and uses a category consistency loss on the basis of classification loss so as to enable feature vectors of pixels belonging to the category to be close and feature vectors of pixels of different categories to be far away. Global attention is realized, and therefore the feature expression capability of the network is enhanced.
Drawings
FIG. 1 is a schematic flow chart of a semantic segmentation bridge crack detection method based on cross attention and category consistency loss according to an embodiment of the present invention;
FIG. 2 is a diagram of a bridge surface crack detection network structure based on semantic segmentation according to an embodiment of the present invention;
FIG. 3 is a block diagram of a cross attention module provided by an embodiment of the present invention;
FIG. 4 is a schematic diagram of a bridge surface crack detection system based on semantic segmentation according to an embodiment of the present invention;
FIG. 5 is a graph comparing test results provided by embodiments of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. In addition, the technical features involved in the embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.
The semantic segmentation bridge crack detection method based on the cross attention and the category consistency loss, disclosed by the embodiment of the invention, as shown in FIG. 1, comprises the following steps:
S1: manufacturing a data set of cracks on the bridge surface;
in the embodiment of the invention, a bridge surface crack data set is manufactured according to the structural characteristics of bridge surface cracks prone to missed detection. The data set comprises 600 crack images of 10 bridges in Wuhan, Hubei. To improve the generalization capability of the model, the selected images come from different scenes; to facilitate calculation of the actual crack size, the proportion between the scene range captured by the camera and the image size is kept uniform across scenes.
Several small pictures of equal size (300 × 300 pixels) are randomly cropped from each original picture and divided into positive and negative samples, representing cracked and crack-free images respectively. The positive samples include web cracks, fine short cracks on the lower tension edge (with jitter blur), vertical cracks on the web (including low-contrast images), vertical cracks at the diaphragm (with complex background texture), and pier foundation concrete cracks (with water-stain interference). The negative samples include honeycomb pitted surfaces, spalled corners (with high similarity to crack edges), cavities and holes, rebar corrosion (with lighting shadows), sky, trees, water stains and shadows. As the positive and negative samples are unbalanced (the number of crack-free images is far greater than that of cracked images), 2,400 cracked images and 3,000 crack-free images are screened from the small pictures to make up the data set, of which 800 images are used as the test set (with a positive-to-negative sample ratio of 2:3).
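The patch extraction described above can be sketched as follows; the photo resolution and the helper name are illustrative assumptions, not taken from the patent:

```python
import numpy as np

def crop_patches(image, n_patches=9, size=300, seed=0):
    """Randomly crop n_patches square sub-images of `size` pixels
    from a full-resolution crack photograph (H x W x 3 array)."""
    rng = np.random.default_rng(seed)
    h, w = image.shape[:2]
    patches = []
    for _ in range(n_patches):
        top = rng.integers(0, h - size + 1)
        left = rng.integers(0, w - size + 1)
        patches.append(image[top:top + size, left:left + size])
    return patches

# One original photo (dimensions illustrative) yields nine 300 x 300 patches,
# which are then hand-sorted into positive (cracked) and negative samples.
photo = np.zeros((1024, 1536, 3), dtype=np.uint8)
patches = crop_patches(photo)
```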
S2: constructing a feature encoder module, and extracting semantic features from an original picture through convolution operation and downsampling;
in the embodiment of the invention, a feature encoder is constructed. The encoding path of the encoder comprises three steps, denoted s1, s2 and s3; the input of each step is denoted ei0, ei1 and ei2 respectively, where ei0 is the original image. Each step performs the same operations: two 3 × 3 convolutions are used, each convolutional layer followed by a ReLU activation layer; a 2 × 2 max pooling with stride 2 then downsamples, and the number of channels is doubled at each downsampling. Each encoder step extracts semantic features at a different image scale, and its output is denoted eo0, eo1 and eo2. The output of each step is the input of the next, so eok = ei(k+1), k ∈ {0, 1}. Finally, eo2 is fed through two 3 × 3 convolutional layers, each followed by a ReLU activation layer, and the output serves as the decoder input;
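A minimal sketch of the encoder's shape behaviour, assuming an initial width of 64 channels (the patent does not state channel counts); the 3 × 3 convolutions are replaced by a channel-doubling stand-in, since only the pooling and channel arithmetic are illustrated here:

```python
import numpy as np

def max_pool_2x2(x):
    """2x2 max pooling with stride 2 on a (C, H, W) feature map."""
    c, h, w = x.shape
    return x.reshape(c, h // 2, 2, w // 2, 2).max(axis=(2, 4))

# Trace the encoder shapes: each step ends with a stride-2 pooling that
# halves H and W, while the convolutions double the channel count.
x = np.random.rand(64, 300, 300)         # features after the first conv pair
trace = [x.shape]
for _ in range(2):                       # two further encoder steps
    pooled = max_pool_2x2(x)             # spatial halving
    x = np.concatenate([pooled, pooled]) # stand-in for channel-doubling convs
    trace.append(x.shape)
```

The spatial resolution drops 300 → 150 → 75 while the channel count grows 64 → 128 → 256.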
S3: gradually recovering position information from the high-level semantic features through upsampling and convolution, forming skip connections with each corresponding encoder layer, and constructing two consecutive cross attention modules at each skip connection to extract context associations, combining high-level semantics and low-level fine-grained surface information more effectively;
in the embodiment of the present invention, as shown in FIG. 2, the decoding path of the feature decoder consists of three steps s4, s5 and s6, each comprising one upsampling and 2 × 2 convolution that halves the number of channels, followed by two 3 × 3 convolutions, each followed by a ReLU activation function. The outputs of the three steps are denoted do0, do1 and do2.
In particular, skip connections incorporating two consecutive cross attention modules are established between corresponding steps of the encoder and decoder to combine dense predictions of different granularities. A skip connection is established between si and s(7−i), i ∈ {1, 2, 3}, and the output of each decoder step is first concatenated with the corresponding feature map extracted at the matching encoder scale: the encoder output eo2 is spliced with do0 in the channel dimension, denoted T0; similarly, eo1 and do1 are spliced into T1, and eo0 and do2 into T2. The outputs of the cross attention groups, di0, di1 and di2, are defined as CroA(CroA(T0)), CroA(CroA(T1)) and CroA(CroA(T2)), where CroA denotes a cross attention module, i.e. the output T′ of the first module serves as the input of the second. Through the cross attention modules, the association between each feature point and the whole feature map is acquired and associated feature enhancement is performed.
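The channel-dimension splicing at a skip connection can be illustrated as follows; the channel counts and map size are assumptions for illustration only:

```python
import numpy as np

# Skip connection: the encoder map e_o and the upsampled decoder map d_o at
# the same scale are concatenated along the channel dimension to form T,
# the input of the first cross attention module at that skip connection.
e_o = np.random.rand(128, 75, 75)        # encoder feature at this scale
d_o = np.random.rand(128, 75, 75)        # decoder feature after upsampling
T = np.concatenate([e_o, d_o], axis=0)   # spliced feature map
```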
In an embodiment of the invention, a local feature map T ∈ R^(C×H×W) is input to the cross attention module, where H, W and C denote the height, width and number of channels of the feature map respectively. First, two 1 × 1 convolutions are applied to T to generate two low-dimensional features Q, K ∈ R^(C′×H×W), where C′ < C. Q and K further generate an attention map A through an affinity operation. At each position u in the spatial dimension of Q, a vector Qu ∈ R^(C′) is obtained. Meanwhile, the set Ωu ∈ R^((H+W−1)×C′) is obtained by extracting from K the feature vectors in the same row or column as position u, with Ωi,u denoting the i-th element of Ωu. The affinity operation is

di,u = Qu Ωi,uᵀ, di,u ∈ D, D ∈ R^((H+W−1)×H×W),

where di,u is the degree of correlation between the features Qu and Ωi,u, i = 1, …, H + W − 1. Softmax is then applied to D along the channel dimension to obtain the attention map A. Another 1 × 1 convolutional layer is applied on T to generate V ∈ R^(C×H×W) for feature adaptation. At each position u in the spatial dimension of V, a vector Vu ∈ R^C and a set Φu ∈ R^((H+W−1)×C) are obtained; Φu is the set of feature vectors of V that lie in the same row or column as position u. The aggregation

T′u = Σ_{i=0}^{H+W−1} Ai,u Φi,u + Tu

adds context information to the local feature T to enhance its representation in a pixel-wise manner, where T′u is the feature vector at position u in T′ ∈ R^(C×H×W) and Ai,u is a scalar value at channel i and position u of A. The module therefore has a wide contextual view and selectively extracts context according to spatial attention. The cross attention module structure is shown in fig. 3.
A single cross attention module captures context information only in the horizontal and vertical directions; there is no connection between a pixel and the pixels that do not lie on its criss-cross path. Therefore, two cross attention modules are used in succession to establish associations between arbitrary positions, so that full-image context information can be obtained from all pixels to generate new features with dense and rich context.
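The aggregation above can be sketched as follows. This is an illustrative NumPy re-implementation under stated assumptions: fixed random projections stand in for the learned 1×1 convolutions, and C_red stands in for C'; it is not the patented network code.

```python
import numpy as np

def criss_cross_attention(T, C_red=8, seed=0):
    """One cross attention pass over a feature map T of shape (H, W, C).

    For each position u, attention is computed over the H + W - 1 positions
    sharing u's row or column, then used to aggregate V with a residual.
    """
    H, W, C = T.shape
    rng = np.random.default_rng(seed)
    Wq = rng.standard_normal((C, C_red)) / np.sqrt(C)   # stand-in for 1x1 conv
    Wk = rng.standard_normal((C, C_red)) / np.sqrt(C)
    Wv = rng.standard_normal((C, C)) / np.sqrt(C)
    Q, K, V = T @ Wq, T @ Wk, T @ Wv

    out = np.empty_like(T)
    for y in range(H):
        for x in range(W):
            rows = [i for i in range(H) if i != y]
            # Omega_u / Phi_u: u's column (excluding u) plus u's entire row,
            # i.e. H + W - 1 feature vectors in total.
            Ku = np.vstack([K[rows, x], K[y, :]])        # (H+W-1, C_red)
            Vu = np.vstack([V[rows, x], V[y, :]])        # (H+W-1, C)
            d = Ku @ Q[y, x]                             # affinity d_{i,u}
            a = np.exp(d - d.max())
            a /= a.sum()                                 # softmax -> A_{.,u}
            out[y, x] = a @ Vu + T[y, x]                 # aggregation + residual
    return out
```

Two consecutive passes, as in the text, let information propagate between any pair of positions: `criss_cross_attention(criss_cross_attention(T))`.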
S4: introducing a category consistency loss on the basis of the classification loss and training a network to obtain a pixel-level classification result;
On the basis of the classification loss l_seg, a class consistency loss l_class is introduced. For semantic segmentation tasks, pixels belonging to the same class should have similar features, while pixels from different classes should have more distinct features; this property is named class consistency. The class consistency loss causes the network to map each pixel in the image to an n-dimensional vector in feature space, such that the feature vectors of pixels of the same class are close and the feature vectors of pixels of different classes are far apart. The final loss function is defined as l = l_seg + m·l_class, with l_class = α·l_var + β·l_dis + γ·l_reg, where m, α, β and γ are weight coefficients. In a specific implementation, m is 1, α is 1, and γ is 0.001.
By means of l_var, same-class features with larger distances are penalized; by means of l_dis, different-class features with smaller distances are penalized, with c_a ≠ c_b; by means of l_reg, all class features are pushed toward the mean point of the class in which they are located. Here C is the set of classes present in the mini-batch of images, N_c is the number of active elements belonging to class c, c ∈ C, h_i ∈ H is the feature vector at spatial position i, and μ_c is the mean feature of class c.

In l_var, a segmented (piecewise) distance function is applied to the distance d_v = ‖μ_c − h_i‖ between a feature h_i and its class mean μ_c: the penalty is quadratic when d_v is greater than δ_d, linear when d_v lies in (δ_v, δ_d], and zero when d_v ≤ δ_v.

In l_dis, a segmented distance function is applied to the distance between the class means μ_{c_a} and μ_{c_b}: the penalty is quadratic when this distance is less than 2δ_d and zero when it is greater than 2δ_d.

δ_v and δ_d are the set margins; in particular, δ_v = 0.5 and δ_d = 1.5 are set.
To reduce the computational effort, a convolutional layer is first used to reduce the channel dimension at the output of the cross attention module (in a specific implementation, 16 is set as the number of channels after dimension reduction), and the above three losses are then applied to the feature map with fewer channels.
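A minimal NumPy sketch of a class-consistency loss with the margin behaviour described above. The exact piecewise coefficients are not given in the text, so the continuous forms below are assumptions; likewise, the regularization term here uses the common discriminative-loss form (norm of the class means), which may differ from the exact term intended in the text.

```python
import numpy as np

def class_consistency_loss(features, labels, delta_v=0.5, delta_d=1.5,
                           alpha=1.0, beta=1.0, gamma=0.001):
    """Class-consistency loss sketch (illustrative, not the patented code).

    features: (N, D) pixel feature vectors; labels: (N,) class ids.
    """
    classes = np.unique(labels)
    mus = {c: features[labels == c].mean(axis=0) for c in classes}

    def phi_var(d):
        # zero below delta_v, linear on (delta_v, delta_d], quadratic above:
        # coefficients chosen (assumed) so the pieces join continuously.
        if d <= delta_v:
            return 0.0
        if d <= delta_d:
            return d - delta_v
        return (delta_d - delta_v) + (d - delta_d) ** 2

    def phi_dis(d):
        # quadratic below 2*delta_d, zero above.
        return (2 * delta_d - d) ** 2 if d < 2 * delta_d else 0.0

    # l_var: pull same-class features toward their class mean.
    l_var = np.mean([np.mean([phi_var(np.linalg.norm(mus[c] - h))
                              for h in features[labels == c]])
                     for c in classes])
    # l_dis: push different class means apart.
    pairs = [(a, b) for a in classes for b in classes if a != b]
    l_dis = np.mean([phi_dis(np.linalg.norm(mus[a] - mus[b]))
                     for a, b in pairs]) if pairs else 0.0
    # l_reg: keep class means small (assumed regularizer).
    l_reg = np.mean([np.linalg.norm(mus[c]) for c in classes])
    return alpha * l_var + beta * l_dis + gamma * l_reg
```

With two tight clusters far apart, only the small regularizer contributes; moving the clusters within 2δ_d of each other activates the inter-class penalty.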
S5: and carrying out morphological framework extraction on the crack segmentation result output by the network, eliminating crack short branches to calculate the length of the crack, calculating the width of the crack based on the image crack in the framework direction, and further estimating the actual size of the crack according to a proportional relation.
In the embodiment of the present invention, the skeleton is defined using the maximum disk method, wherein the target skeleton is composed of the centers of all inscribed disks in the target. The maximum disk method is expressed as

S(A) = ∪_{j=0}^{J} S_j(A),

wherein

S_j(A) = (A ⊖ jB) − (A ⊖ jB) ∘ B,

B is a structuring element, A ⊖ jB denotes j successive erosions of A by B, ∘ denotes morphological opening, and J = max{ j | A ⊖ jB ≠ ∅ } is the last iteration before A is eroded to the empty set.
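A direct (unoptimized) NumPy sketch of this morphological skeleton, with erosion and opening implemented from their definitions for a binary image A and a 3×3 structuring element B; this is an illustration of the formula, not the patent's implementation:

```python
import numpy as np

def erode(A, B):
    """Binary erosion of 0/1 image A by structuring element B."""
    H, W = A.shape
    h, w = B.shape
    pad = np.pad(A, ((h // 2,), (w // 2,)), constant_values=0)
    out = np.zeros_like(A)
    for y in range(H):
        for x in range(W):
            win = pad[y:y + h, x:x + w]
            out[y, x] = int(np.all(win[B == 1] == 1))
    return out

def dilate(A, B):
    """Binary dilation of 0/1 image A by (symmetric) structuring element B."""
    H, W = A.shape
    h, w = B.shape
    pad = np.pad(A, ((h // 2,), (w // 2,)), constant_values=0)
    out = np.zeros_like(A)
    for y in range(H):
        for x in range(W):
            out[y, x] = int(np.any(pad[y:y + h, x:x + w][B == 1] == 1))
    return out

def skeleton(A, B=np.ones((3, 3), dtype=int)):
    """Union over j of (A eroded j times) minus its morphological opening."""
    S = np.zeros_like(A)
    eroded = A.copy()
    while eroded.any():
        opened = dilate(erode(eroded, B), B)   # opening of the eroded image
        S |= eroded & (1 - opened)             # S_j(A) contribution
        eroded = erode(eroded, B)              # one more erosion step
    return S
```

For a 3-pixel-thick horizontal bar this reduces to the 1-pixel center line, as expected of a skeleton.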
For the short branches formed during skeletonization, the crack picture is first traversed and the boundary points of a region are iteratively deleted. A boundary point is defined as a pixel whose value is 1 and which has at least one pixel with value 0 in its 8-neighbourhood, where 0 represents a point of the background region and 1 represents a point of the crack region. Two steps are used to judge whether a boundary point is a deletion point. In step I, a contour point p_1 is marked as a deletion point if the following conditions are all met:

(a) 2 ≤ N(p_1) ≤ 6;
(b) T(p_1) = 1;
(c) p_2 · p_4 · p_6 = 0;
(d) p_4 · p_6 · p_8 = 0;

where N(p_1) is the number of non-zero neighbours of p_1, and T(p_1) is the number of 0-to-1 transitions in the ordered sequence p_2, p_3, …, p_9, p_2 of the eight neighbours.

In step II, conditions (a) and (b) above are kept unchanged, and conditions (c) and (d) are changed to:

(c') p_2 · p_4 · p_8 = 0;
(d') p_2 · p_6 · p_8 = 0.

The judgment method is as follows: the first pass applies the conditions of step I; if at least one of conditions (a) to (d) is violated, p_1 is retained, and if all conditions are met, the boundary point is marked as a deletion point; after step I has been applied to all boundary points, the values of the deletion points are set to 0 and the points are deleted. The second pass applies the conditions of step II to the result of the first pass, with the same rule as in step I. The image formed by the remaining points is the trimmed crack skeleton.
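The two-pass boundary-point deletion described above matches the classical Zhang–Suen thinning scheme; a compact NumPy sketch (illustrative, not the patent's implementation):

```python
import numpy as np

def zhang_suen_thin(img):
    """Iterative two-step boundary-point deletion on a 2-D 0/1 array.

    Neighbours p2..p9 are enumerated clockwise starting from the pixel
    directly above p1.
    """
    img = img.copy().astype(int)

    def neighbours(y, x):
        return [img[y-1, x], img[y-1, x+1], img[y, x+1], img[y+1, x+1],
                img[y+1, x], img[y+1, x-1], img[y, x-1], img[y-1, x-1]]

    changed = True
    while changed:
        changed = False
        for step in (0, 1):
            to_delete = []
            for y in range(1, img.shape[0] - 1):
                for x in range(1, img.shape[1] - 1):
                    if img[y, x] != 1:
                        continue
                    p = neighbours(y, x)                   # p2..p9
                    N = sum(p)                             # condition (a)
                    T = sum((p[i] == 0 and p[(i + 1) % 8] == 1)
                            for i in range(8))             # condition (b)
                    if step == 0:
                        c = p[0] * p[2] * p[4] == 0        # (c)  p2*p4*p6 = 0
                        d = p[2] * p[4] * p[6] == 0        # (d)  p4*p6*p8 = 0
                    else:
                        c = p[0] * p[2] * p[6] == 0        # (c') p2*p4*p8 = 0
                        d = p[0] * p[4] * p[6] == 0        # (d') p2*p6*p8 = 0
                    if 2 <= N <= 6 and T == 1 and c and d:
                        to_delete.append((y, x))
            for y, x in to_delete:
                img[y, x] = 0
            changed = changed or bool(to_delete)
    return img
```

A thick crack region is reduced toward a connected one-pixel-wide skeleton while endpoints (N = 1) and true line pixels (T ≠ 1) are preserved.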
Because crack discontinuity no longer occurs once the interrupted cracks have been connected, the crack length can be obtained directly from the skeleton.
On the basis of the skeleton method, a method of calculating the crack width in pixels is used. The skeleton image is a single-pixel-wide image, and the tangential direction of each point on the skeleton line can be obtained from it. Let M be any point on the skeleton picture; then there are two points with value 1 in the 8-neighbourhood of M. Fig. 4-3 shows the tangent in the eight-neighbourhood of point M, where N_1 and N_2 are these two points; the tangential direction at M is equal to the average of the directions of MN_1 and MN_2, i.e.

θ_M = (θ_{MN_1} + θ_{MN_2}) / 2.
After calculating the tangent at each point of the crack skeleton, the normal perpendicular to the tangent on the skeleton line is calculated; the distance between the intersection points of this normal with the crack boundary is the crack width. Considering that the crack width changes slowly in actual scenes, in order to reduce the computation and improve efficiency, points at a certain interval along the crack skeleton may be selected (the interval size is determined comprehensively according to factors such as image resolution and crack type), and the quantized width is obtained only at these points. Finally, the point with the largest quantized width is selected in the crack region, the quantized width of each point on the skeleton segments before and after this point is solved, and the maximum of these values is the maximum width of the crack region of that segment.
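The tangent-and-normal measurement can be sketched as follows in NumPy. The angle averaging is the naive form given above (it does not handle the wrap-around of undirected angles near π), and the march along the normal is a simple assumed discretization; both are illustrative choices, not the patent's exact procedure.

```python
import numpy as np

def tangent_angle(skel, y, x):
    """Average undirected direction from (y, x) to its two skeleton
    neighbours; angles are taken modulo pi (naive averaging, see note)."""
    nbrs = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)
            if (dy, dx) != (0, 0) and skel[y + dy, x + dx]]
    angles = [np.arctan2(dy, dx) % np.pi for dy, dx in nbrs]
    return float(np.mean(angles))

def width_at(mask, y, x, theta, max_r=50):
    """Crack width at a skeleton point: step along the normal to the
    tangent direction theta in both senses until leaving the crack mask."""
    phi = theta + np.pi / 2                 # normal direction
    ny, nx = np.sin(phi), np.cos(phi)       # unit step (dy, dx)
    w = 1                                   # the skeleton pixel itself
    for sign in (1, -1):
        r = 1
        while r <= max_r:
            yy = int(np.rint(y + sign * r * ny))
            xx = int(np.rint(x + sign * r * nx))
            inside = (0 <= yy < mask.shape[0] and 0 <= xx < mask.shape[1])
            if not inside or not mask[yy, xx]:
                break
            w += 1
            r += 1
    return w
```

For a horizontal 3-pixel-thick crack, the tangent at a mid-skeleton point is 0 and the measured width is 3 pixels.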
Finally, the actual length and width of the crack in the image can be estimated by multiplying the measurements taken on the segmented bridge crack image by the proportional relation between image pixels and real-world size.
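The proportional conversion is a single scale factor; a small worked example with hypothetical calibration values (the reference length, pixel counts and resulting sizes below are illustrative, not measurements from the patent):

```python
# Hypothetical calibration: a reference object of known physical length
# spans ref_px pixels in the image, giving a cm-per-pixel scale factor.
ref_cm, ref_px = 10.0, 250
cm_per_px = ref_cm / ref_px               # 0.04 cm per pixel

# Illustrative pixel measurements from the segmented crack.
length_px, max_width_px = 900, 25
length_cm = length_px * cm_per_px         # 36.0 cm
max_width_cm = max_width_px * cm_per_px   # 1.0 cm
```

The same factor converts both the skeleton length and the normal-direction width from pixels to physical units.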
The invention also provides a semantic segmentation bridge crack detection system based on cross attention and category consistency loss, as shown in fig. 4, comprising:
the data acquisition module is used for manufacturing a bridge crack semantic segmentation data set;
the encoder module is used for performing downsampling through convolution and pooling to extract semantic features from the original picture;
the cross attention-based feature decoder module is used for extracting context correlation, more effectively combining high-level semantics and low-level fine-grained surface layer information, performing jump connection on features acquired from each corresponding layer of the encoder and the decoder, and constructing two continuous cross attention modules at the jump connection position to gradually recover position information from the high-level semantic features through up-sampling and convolution;
and the loss calculation module is used for introducing a category consistency loss on the basis of the classification loss and training a network to obtain a pixel-level classification result.
And the crack calculation module is used for extracting the morphological crack skeleton, eliminating short crack branches to calculate the crack length, and calculating the image crack width based on the morphological skeleton.
The specific implementation of each module may refer to the description of the above method embodiment, and the embodiment of the present invention will not be repeated.
In the test example of the invention, several classical segmentation algorithms, including the FCN algorithm, the RAU-Net algorithm and the Mask R-CNN algorithm, are selected to verify the effectiveness of the experiment, and are evaluated on a self-made bridge data set. The comparison between the method proposed by the present invention and each classical algorithm is shown in table 1.
TABLE 1 comparison of accuracy and elapsed time for each algorithm in a bridge fracture data set
Method mIoU Time/ms
FCN 0.432 123
Mask R-CNN 0.443 183
Ours 0.458 128
From the above table, it can be seen that the present invention achieves better accuracy while consuming less time: compared with FCN, the method has higher precision, and compared with RAU-Net, it is faster. The proposed algorithm therefore achieves a good balance between precision and speed. The comparison of the segmentation effect of each algorithm on the bridge crack data set is shown in fig. 5, from which it can be seen that, compared with the other algorithms, the method provided by the invention obtains a more complete crack structure.
TABLE 2 Comparison of actual and calculated values for the two captured crack images
Experiment | Calculated length/cm | Calculated width/cm | Length error/% | Width error/%
First image | 48.04 | 1.01 | 3.66 | 3.78
Second image | 114.88 | 0.3754 | 4.57 | 4.28
In order to detect the length and width of bridge cracks in an actual scene, and to prevent changes in natural illumination from affecting the experiment, natural light was shielded and an independent uniform light source was adopted, and about 30 pictures were taken to verify the accuracy of the method. The results show that the average length-measurement error over the 30 crack pictures is 2.76% and the average width-measurement error is 4.34%. These results indicate that the skeleton-based crack length calculation is highly accurate and meets the system design requirements, and that the maximum-width calculation of the crack width measurement model also meets the system requirements. Regarding the width characteristics of cracks, the average crack width reflects the overall damage degree of the crack, while the maximum crack width reflects the local damage degree; considering both together reflects the actual damage condition of the crack.
It should be noted that, according to the implementation requirement, each step/component described in the present application can be divided into more steps/components, and two or more steps/components or partial operations of the steps/components can be combined into new steps/components to achieve the purpose of the present invention.
It will be understood by those skilled in the art that the foregoing is only a preferred embodiment of the present invention, and is not intended to limit the invention, and that any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (10)

1. A bridge surface crack detection method based on semantic segmentation is characterized by comprising the following steps:
step 1, collecting bridge crack images and carrying out pixel-level semantic annotation on the collected images to manufacture a bridge crack segmentation data set;
step 2, constructing a feature encoder module, and extracting multilevel semantic features from the original picture through multiple convolution operations and downsampling;
step 3, constructing a feature decoder module based on cross attention, gradually recovering position information from high-level semantic features in an original picture through multiple times of upsampling and convolution operations, performing jump connection on each decoder layer corresponding to the encoder module, constructing two continuous cross attention modules at the jump connection position to extract context association, and combining high-level semantic and low-level fine-grained surface layer information;
step 4, introducing a category consistency loss on the basis of the classification loss and training a network, so that the network maps each pixel in the image to an n-dimensional vector in a feature space, the feature vectors of pixels belonging to the same category are close, the feature vectors of pixels of different categories are far apart, and the pixel-level classification result is obtained;
and step 5, performing morphological skeleton extraction on the crack segmentation result output by the network, eliminating short skeleton branches to calculate the crack length, calculating the crack width based on the skeleton direction of the image crack, and estimating the actual crack size according to a proportional relation.
2. The bridge surface crack detection method based on semantic segmentation according to claim 1, wherein the step 1 specifically comprises: the method comprises the steps of collecting a plurality of different original images of crack images of a plurality of bridges, randomly cutting a plurality of square images with the same pixel size from each original image, dividing the square images into positive samples and negative samples, wherein the positive samples and the negative samples respectively represent crack images and crack-free images, screening the crack images and the crack-free images from the square images to manufacture a bridge crack segmentation data set, and taking the images in the bridge crack segmentation data set as a test set.
3. The bridge surface crack detection method based on semantic segmentation as claimed in claim 2, wherein the positive sample comprises a mesh crack image, a crack image with dither blur, a crack image with low contrast, a crack image with complex background texture and a crack image with water stain interference; the negative sample comprises a honeycomb pitted surface, peeled off corners, hollow holes, steel bar corrosion, sky, trees, water stains and shadows.
4. The bridge surface crack detection method based on semantic segmentation according to claim 1, wherein the step 2 specifically comprises: constructing a feature encoder module, wherein the encoding path of the encoder module comprises three sequential steps, respectively denoted s_1, s_2, s_3, with the input of each step denoted e_i0, e_i1 and e_i2, where e_i0 is the original picture; each step comprises: using two convolutional layers, calculating a ReLU activation function after each convolutional layer, then performing max-pooling down-sampling with stride 2, the number of channels being doubled at each down-sampling; the encoder module extracts image semantic features of different scales at each step, the output of each step being denoted e_o0, e_o1 and e_o2, and the output of each step is the input of the next step, i.e. e_ok = e_i(k+1), k ∈ {0, 1}.
5. The bridge surface crack detection method based on semantic segmentation according to claim 4, wherein the process of constructing the cross attention-based feature decoder module in the step 3 is as follows: the decoding path of the feature decoder module consists of three steps s_4, s_5 and s_6, wherein each step comprises: one up-sampling, with the number of channels halved at each up-sampling, followed by two convolutional layers, a ReLU activation function being calculated after each convolutional layer; the output of each step is defined as d_o0, d_o1, d_o2, and the output of each step serves as the input to the next step.
6. The bridge surface crack detection method based on semantic segmentation according to claim 5, wherein the jump connections are established between s_i and s_{7-i}, i ∈ {1, 2, 3}, i.e. the corresponding feature maps extracted by the encoder module at multiple scales are connected with the output of each step of the feature decoder module: the encoder module output e_o2 feature map is spliced with d_o0 in the channel dimension, denoted T_0; in the same way, e_o1 and d_o1 are spliced into T_1, and e_o0 and d_o2 are spliced into T_2; said T_0, T_1 and T_2 respectively serve as the inputs of the cross attention modules at the three corresponding layers.
7. The bridge surface crack detection method based on semantic segmentation according to claim 6, wherein the cross attention module works as follows: a feature map T is input, two low-dimensional features Q and K are generated through 1×1 convolutional layer calculation, and Q and K further generate an attention map A through a "similarity" operation; at each position u in the spatial dimension of Q a vector Q_u is obtained, and meanwhile the set Ω_u is obtained by extracting from K the feature vectors in the same row or column as position u, with Ω_{i,u} denoting the i-th element of Ω_u; the "similarity" operation is defined as:

d_{i,u} = Q_u Ω_{i,u}^T,  d_{i,u} ∈ D,  i = [1, …, H+W−1],

wherein d_{i,u} is the degree of correlation between the features Q_u and Ω_{i,u}; softmax is then applied to D along the channel dimension to obtain the attention map A;

another 1×1 convolutional layer calculation is applied on T to generate V for feature adaptation; at each position u in the spatial dimension of V a vector V_u and a set Φ_u are obtained, said set Φ_u being the set of feature vectors in V located in the same row or column as position u; through the aggregation

T'_u = Σ_{i=1}^{H+W−1} A_{i,u} Φ_{i,u} + T_u,

context information is added to the local feature T to enhance the representation in a pixel-wise manner, wherein said T'_u is the feature vector at position u in T', and said A_{i,u} is a scalar value at channel i and position u in A;
the feature vectors are mapped to the required number of classes using 1 x 1 convolutional layer computation at the last layer of the feature decoder module.
8. The bridge surface crack detection method based on semantic segmentation according to claim 1, wherein the step 4 specifically comprises: on the basis of the classification loss l_seg, introducing a class consistency loss l_class; for semantic segmentation tasks, pixels belonging to the same class should have similar features while pixels from different classes have more distinct features, and this property is named class consistency; the class consistency loss is used for the network to map each pixel in the image to an n-dimensional vector in the feature space, so that the feature vectors of pixels of the same class are close and the feature vectors of pixels of different classes are far apart; and the final loss function l is defined as: l = l_seg + m·l_class, where m is a weight coefficient.
9. The bridge surface crack detection method based on semantic segmentation according to claim 1, wherein the step 5 specifically comprises: defining the skeleton using the maximum disk method, wherein the target skeleton consists of the centers of all inscribed disks in the target, the maximum disk method being expressed as:

S(A) = ∪_{j=0}^{J} S_j(A),

wherein

S_j(A) = (A ⊖ jB) − (A ⊖ jB) ∘ B,

B is a structuring element, A ⊖ jB denotes j successive erosions of A by B, ∘ denotes morphological opening, and J = max{ j | A ⊖ jB ≠ ∅ };
for the short branches formed during skeletonization, the crack picture is first traversed and the boundary points of a region are iteratively deleted, a boundary point being defined as a pixel whose value is 1 and which has at least one pixel with value 0 in its 8-neighbourhood, where 0 represents a point of the background region and 1 represents a point of the crack region; two steps are used to judge whether a boundary point is a deletion point; in step I, a contour point p_1 is marked as a deletion point if the following conditions are all met:

(a) 2 ≤ N(p_1) ≤ 6;
(b) T(p_1) = 1;
(c) p_2 · p_4 · p_6 = 0;
(d) p_4 · p_6 · p_8 = 0;

wherein N(p_1) is the number of non-zero neighbours of p_1, and T(p_1) is the number of 0-to-1 transitions in the ordered sequence p_2, p_3, …, p_9, p_2 of the eight neighbours;

in step II, conditions (a) and (b) above are kept unchanged, and conditions (c) and (d) are changed to:

(c') p_2 · p_4 · p_8 = 0;
(d') p_2 · p_6 · p_8 = 0;

the judgment method is as follows: the first pass applies the conditions of step I; if at least one of conditions (a) to (d) is violated, p_1 is retained, and if all conditions are met, the boundary point is marked as a deletion point; after step I has been applied to all boundary points, the values of the deletion points are set to 0 and the points are deleted; the second pass applies the conditions of step II to the result of the first pass, with the same rule as in step I; the image formed by the remaining points is the trimmed crack skeleton;
crack discontinuity no longer occurs once the interrupted cracks have been connected, so the crack length can be obtained directly;
on the basis of a skeleton method, a method for calculating width pixels of image cracks is applied: because the image on the skeleton is a single-pixel image, the tangential direction of each point on the skeleton line can be calculated according to the skeleton image, the tangent of each point in the crack is calculated, then the normal line perpendicular to the tangent line on the skeleton line is calculated, and the intersection point distance between the normal line and the crack boundary is the width of the crack;
estimating the crack size according to the actual proportional relation between the bridge cracks and the picture cracks: and finally, according to the proportional relation between the actual scene and the picture, multiplying the segmented bridge crack image by the proportional relation to estimate the actual length and width of the crack in the image.
10. A system using the method for detecting cracks on the surface of a bridge based on semantic segmentation as claimed in claim 1, comprising:
the data acquisition module is used for acquiring bridge crack images and performing pixel-level semantic annotation on the acquired images to manufacture a bridge crack segmentation data set;
the encoder module is used for extracting multilevel semantic features from an original picture through multiple convolution operations and downsampling;
the cross attention-based feature decoder module is used for gradually recovering position information from high-level semantic features in an original picture through multiple times of upsampling and convolution, performing jump connection with each corresponding layer decoder of the encoder module, constructing two continuous cross attention modules at the jump connection position to extract context association, and combining high-level semantics and low-level fine-grained surface layer information;
the loss calculation module is used for introducing a category consistency loss on the basis of the classification loss and training a network to enable the network to map each pixel in the image to an n-dimensional vector in a feature space, so that the feature vectors of the pixels belonging to the category are close to each other, the feature vectors of the pixels belonging to different categories are far away from each other, and the classification result of the pixel level is obtained;
and the crack calculation module is used for carrying out morphological framework extraction on the crack segmentation result output by the network, eliminating crack short branches to calculate the crack length, calculating the width based on the image crack in the framework direction, and estimating the actual size of the crack according to the proportional relation.
CN202110817766.9A 2021-07-20 2021-07-20 Bridge surface crack detection method and system based on semantic segmentation Active CN113610778B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110817766.9A CN113610778B (en) 2021-07-20 2021-07-20 Bridge surface crack detection method and system based on semantic segmentation


Publications (2)

Publication Number Publication Date
CN113610778A true CN113610778A (en) 2021-11-05
CN113610778B CN113610778B (en) 2024-03-26

Family

ID=78337969

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110817766.9A Active CN113610778B (en) 2021-07-20 2021-07-20 Bridge surface crack detection method and system based on semantic segmentation

Country Status (1)

Country Link
CN (1) CN113610778B (en)


Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108520516A (en) * 2018-04-09 2018-09-11 陕西师范大学 A kind of bridge pavement Crack Detection and dividing method based on semantic segmentation
CN112348770A (en) * 2020-09-09 2021-02-09 陕西师范大学 Bridge crack detection method based on multi-resolution convolution network
CN112560895A (en) * 2020-11-20 2021-03-26 陕西师范大学 Bridge crack detection method based on improved PSPNet network
CN112884747A (en) * 2021-02-28 2021-06-01 长安大学 Automatic bridge crack detection system integrating cyclic residual convolution and context extractor network


Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114581764A (en) * 2021-12-24 2022-06-03 中交基础设施养护集团有限公司 Underground structure crack disease distinguishing method based on deep learning algorithm
CN114581764B (en) * 2021-12-24 2024-04-26 中交基础设施养护集团有限公司 Underground structure crack disease discriminating method based on deep learning algorithm
CN114240945A (en) * 2022-02-28 2022-03-25 科大天工智能装备技术(天津)有限公司 Bridge steel cable fracture detection method and system based on target segmentation
CN114240945B (en) * 2022-02-28 2022-05-10 科大天工智能装备技术(天津)有限公司 Bridge steel cable fracture detection method and system based on target segmentation
CN115239733A (en) * 2022-09-23 2022-10-25 深圳大学 Crack detection method, crack detection device, terminal equipment and storage medium
CN115239733B (en) * 2022-09-23 2023-01-03 深圳大学 Crack detection method and apparatus, terminal device and storage medium
CN115393725A (en) * 2022-10-26 2022-11-25 西南科技大学 Bridge crack identification method based on feature enhancement and semantic segmentation
CN117635478A (en) * 2024-01-23 2024-03-01 中国科学技术大学 Low-light image enhancement method based on spatial channel attention
CN117635478B (en) * 2024-01-23 2024-05-17 中国科学技术大学 Low-light image enhancement method based on spatial channel attention
CN117649154A (en) * 2024-01-29 2024-03-05 新疆三联工程建设有限责任公司 Concrete test block manufacturing whole process management system and method based on digitization
CN117649154B (en) * 2024-01-29 2024-04-19 新疆三联工程建设有限责任公司 Concrete test block manufacturing whole process management system and method based on digitization

Also Published As

Publication number Publication date
CN113610778B (en) 2024-03-26

Similar Documents

Publication Publication Date Title
CN113610778B (en) Bridge surface crack detection method and system based on semantic segmentation
CN111460984B (en) Global lane line detection method based on key points and gradient equalization loss
CN113516135B (en) Remote sensing image building extraction and contour optimization method based on deep learning
CN115049936B (en) High-resolution remote sensing image-oriented boundary enhanced semantic segmentation method
CN110796009A (en) Method and system for detecting marine vessel based on multi-scale convolution neural network model
CN111222519B (en) Construction method, method and device of hierarchical colored drawing manuscript line extraction model
CN111951284B (en) Optical remote sensing satellite image refined cloud detection method based on deep learning
Alshawabkeh Linear feature extraction from point cloud using color information
CN112926556B (en) Semantic segmentation-based aerial photography transmission line broken strand identification method and system
CN115984850A (en) Lightweight remote sensing image semantic segmentation method based on improved Deeplabv3+
CN113610070A (en) Landslide disaster identification method based on multi-source data fusion
CN111353396A (en) Concrete crack segmentation method based on SCSEOCUnet
CN116630802A (en) SwinT and size self-adaptive convolution-based power equipment rust defect image detection method
CN111104850A (en) Remote sensing image building automatic extraction method and system based on residual error network
CN112700476A (en) Infrared ship video tracking method based on convolutional neural network
Kumar et al. Detection of concrete cracks using dual-channel deep convolutional network
Meng Concrete crack detection algorithm based on deep residual neural networks
Liu et al. Semantic segmentation and photogrammetry of crowdsourced images to monitor historic facades
CN114387446A (en) Automatic water body extraction method for high-resolution remote sensing image
CN115456957B (en) Method for detecting change of remote sensing image by full-scale feature aggregation
Meng et al. A modified fully convolutional network for crack damage identification compared with conventional methods
CN114332211B (en) Part pose calculation method based on edge reconstruction and dense fusion network
CN114120129B (en) Three-dimensional identification method for landslide slip surface based on unmanned aerial vehicle image and deep learning
CN113591740B (en) Deep learning-based sediment particle identification method and device in complex river environment
CN112686204B (en) Video flow measurement method and device based on sparse pixel point tracking

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant