CN113610778B - Bridge surface crack detection method and system based on semantic segmentation - Google Patents

Bridge surface crack detection method and system based on semantic segmentation

Info

Publication number
CN113610778B
CN113610778B
Authority
CN
China
Prior art keywords
crack
image
feature
bridge
pixel
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110817766.9A
Other languages
Chinese (zh)
Other versions
CN113610778A (en)
Inventor
卢涛
饶茜雅
吴志豪
张彦铎
吴云韬
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan Institute of Technology
Original Assignee
Wuhan Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan Institute of Technology filed Critical Wuhan Institute of Technology
Priority to CN202110817766.9A priority Critical patent/CN113610778B/en
Publication of CN113610778A publication Critical patent/CN113610778A/en
Application granted granted Critical
Publication of CN113610778B publication Critical patent/CN113610778B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/0002 Inspection of images, e.g. flaw detection
    • G06T 7/0004 Industrial image inspection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/10 Segmentation; Edge detection
    • G06T 7/11 Region-based segmentation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20021 Dividing image into blocks, subimages or windows
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20081 Training; Learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/30 Subject of image; Context of image processing
    • G06T 2207/30108 Industrial image inspection
    • G06T 2207/30132 Masonry; Concrete
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02P CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P 90/00 Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P 90/30 Computing systems specially adapted for manufacturing

Abstract

The invention discloses a bridge surface crack detection method and system based on semantic segmentation, comprising the following steps: collecting bridge crack images, performing pixel-level semantic annotation on the images, and building a bridge crack segmentation data set; constructing a semantic segmentation network based on feature encoding and decoding, with jump connections based on crisscross attention used to combine high-level semantics with low-level fine-grained surface information; on the basis of the general classification loss, applying a class consistency loss so that the network maps each pixel in the image to an n-dimensional vector in the feature space, making the feature vectors of pixels of the same class close together and the feature vectors of pixels of different classes far apart; after obtaining the segmented crack image, extracting the crack morphological skeleton and eliminating short crack branches to calculate the crack length; and calculating the crack width in pixels along the skeleton direction. The method can segment bridge cracks quickly and accurately and makes the segmented crack structure more complete.

Description

Bridge surface crack detection method and system based on semantic segmentation
Technical Field
The invention belongs to the field of bridge crack detection, and particularly relates to a method and a system for detecting a bridge surface crack based on semantic segmentation.
Background
As important traffic infrastructure, bridges need to be inspected and maintained regularly, and cracks on the bridge surface are an important potential safety hazard, so crack detection is a key point of inspection. Traditional bridge surface crack detection is based on manual measurement or digital image processing. Manual crack length measurement is generally carried out in the field with a tape measure, while the width of concrete bridge cracks is usually measured with a crack width gauge; this requires a great deal of labor and cannot provide real-time detection. With the popularization of video monitoring equipment, digital image processing has been widely applied in production and daily life. In recent years many scholars at home and abroad have used digital image processing for target detection and target segmentation in place of traditional manual inspection and obtained good results. However, traditional digital image processing methods are strongly affected by the environment, their accuracy is limited, and further improvement is difficult. Researchers at home and abroad have therefore studied deep learning methods to solve this problem in the field of computer vision, and it has been shown that detection accuracy can be improved remarkably. Unlike general target detection, however, bridge surface crack detection requires the crack to be segmented and its actual size to be calculated for the real scene. Therefore, studying deep-learning-based methods that segment cracks while preserving their structural integrity and calculate the actual crack size as accurately as possible has both theoretical and practical significance.
Disclosure of Invention
The invention aims to provide a method and a system for detecting cracks on the surface of a bridge based on semantic segmentation, which use image semantic segmentation to detect bridge cracks and to improve the structural integrity of the segmented cracks.
In order to solve the technical problems, the technical scheme of the invention is as follows: a bridge surface crack detection method and system based on semantic segmentation comprises the following steps:
(1) Collecting bridge crack images, performing pixel-level semantic annotation on the collected images, and building a bridge surface crack data set;
(2) Constructing a feature encoder module, and extracting semantic features from the original picture through convolution operations and downsampling;
(3) Constructing a crisscross-attention-based feature decoder module, gradually recovering position information from the high-level semantic features through up-sampling and convolution, making jump connections with each corresponding layer of the encoder, and constructing two consecutive crisscross attention modules at the jump connections to extract contextual associations, so as to combine high-level semantic and low-level fine-grained surface information more effectively;
(4) Introducing a class consistency loss on the basis of the classification loss and training the network, so that the network maps each pixel in the image to an n-dimensional vector in the feature space, the feature vectors of pixels of the same class are close, the feature vectors of pixels of different classes are far apart, and a pixel-level classification result is obtained;
(5) Carrying out morphological skeleton extraction on the crack segmentation result output by the network, eliminating short crack branches to calculate the crack length, calculating the width of the image crack along the skeleton direction, and estimating the actual size of the crack according to the proportional relation.
In some alternative embodiments, in step (1), a bridge crack data set is created based on the characteristics of bridge surface cracks that are easily missed. The data set contains 600 original crack images from 10 bridges. Nine 300×300-pixel sub-images are randomly cropped from each original image, and the sub-images are then divided into positive and negative samples, representing crack images and crack-free images respectively. Positive samples include reticulated cracks, thin short cracks in the lower-edge tensile zone (with jitter blur), vertical cracks on the web (including low-contrast images), vertical cracks at the diaphragm (with complex background texture), and pier foundation concrete cracks (with water-stain interference). Negative samples include honeycomb pitting, spalled corners (highly similar to edges and corners), voids and holes, steel bar rust (with lighting shadows), sky, trees, water stains and shadows. Because the positive and negative samples are unbalanced (crack-free images far outnumber crack images), 2400 crack images and 3000 crack-free images are screened from the sub-images to produce the data set, and 800 images are taken as the test set (positive to negative sample ratio 2:3).
In some alternative embodiments, in step (2), a feature encoder is constructed. The encoding path of the encoder comprises three steps, denoted s1, s2 and s3, whose inputs are denoted e_i0, e_i1 and e_i2 respectively, where e_i0 is the original image. Each step uses the same operations: two convolution layers, each followed by a ReLU activation layer, and then a max pooling with stride 2 for downsampling, with the number of channels doubled at each downsampling. Each step of the encoder extracts image semantic features at a different scale, and the outputs of the steps are denoted e_o0, e_o1 and e_o2. The output of each step is the input of the next step, i.e. e_ok = e_i(k+1), k ∈ {0, 1}. Then e_o2 is fed into two convolution layers, each followed by a ReLU activation layer, and the output is taken as the decoder input;
in some alternative embodiments, in step (3):
the decoding path of the feature decoder is composed of three steps s 4 ,s 5 Sum s 6 The method comprises the following steps: an up-sampling and halving the number of channels convolutional layers, then two convolutional layers are used, each followed by a ReLU activation function. The output of each step is defined as d o0 ,d o1 ,d o2
In particular, a jump connection incorporating two consecutive crisscross attention modules is established at corresponding steps of the encoder and decoder to incorporate dense predictions of different granularity. The jump connection is established at s i And s 7-i Between i e 1,2,3, the corresponding feature maps extracted over multiple scales by the encoder are first connected to the output of each step of the decoder: encoder output e o2 Feature map and d o0 Splicing in the channel dimension, denoted as T 0 E, in the same way o1 And d o1 Splice T 1 ,e o0 And d o2 Spliced into T 2 As input to the cross attention group, input d to the decoder i0 ,d i1 ,d i2 The definition is as follows: croA (CroA (T) 0 )),CroA(CroA(T 1 )),CroA(CroA(T 2 ) CroA represents the crisscross attention module, i.e. the first moduleThe T' of the block output is taken as input to the second module. And acquiring the association relation from each feature point to the whole feature map through the crisscross attention module, and carrying out associated feature enhancement.
The crisscross attention module extracts context information in both the horizontal and vertical directions to enhance the pixel-wise feature representation. Each crisscross attention module operates as follows:
The feature map T is input, and two low-dimensional features Q and K are generated through 1×1 convolution layers. Q and K then generate the attention map A through a "similarity" operation. At each position u in the spatial dimension of Q, a vector Q_u is obtained; at the same time, the set Ω_u is obtained by extracting from K the feature vectors in the same row or the same column as position u, with Ω_{i,u} denoting the i-th element of Ω_u. The similarity operation is d_{i,u} = Q_u · Ω_{i,u}^T, d_{i,u} ∈ D, where d_{i,u} is the degree of correlation between the features Q_u and Ω_{i,u}, i = 1, …, H+W−1. Softmax is then applied to D along the channel dimension to obtain the attention map A;
Another 1×1 convolution layer is applied to T to generate V for feature adaptation. At each position u in the spatial dimension of V, a vector V_u and a set Φ_u are obtained, where Φ_u is the set of feature vectors in V located in the same row or column as position u. The context information is added to the local feature T through T'_u = Σ_i A_{i,u} Φ_{i,u} + T_u (the sum running over the H+W−1 positions in the same row and column as u) to enhance the pixel-wise representation, where the summation term extracts the context information, T'_u is the feature vector at position u in T', and A_{i,u} is the scalar value at channel i and position u of A. The module therefore has a broad contextual view and selectively aggregates context according to the spatial attention map.
The feature vectors are mapped to the required number of classes at the final layer of the decoder using a 1 x 1 convolution.
In some alternative embodiments, in step (4), a class consistency loss l_class is introduced on the basis of the classification loss l_seg. For a semantic segmentation task, pixels belonging to the same class should have similar features, while the features of pixels from different classes should differ more; this property is named class consistency. The class consistency loss makes the network map each pixel in the image to an n-dimensional vector in the feature space so that the feature vectors of pixels of the same class are close and the feature vectors of pixels of different classes are far apart. The final loss function is defined as l = l_seg + m·l_class, where l_class is the class consistency loss, l_class = α·l_var + β·l_dis + γ·l_reg, and m, α, β and γ are weight coefficients.
l_var is used to penalize features of the same class whose distances are large, l_dis is used to penalize features of different classes (c_a ≠ c_b) whose distances are small, and l_reg pulls all class features towards the mean point of the class to which they belong. Here C is the set of classes present in the mini-batch image, N_c is the number of valid elements belonging to class c ∈ C, h_i ∈ H is the feature vector at spatial location i, and μ_c is the mean feature of class c ∈ C.
Both l_var and l_dis are built on a piecewise distance function. For l_var, the distance between a feature h_i and its class mean μ_c is d_v = ‖μ_c − h_i‖; the distance term is a quadratic function when d_v is greater than δ_d, a linear function when d_v lies in (δ_v, δ_d], and zero when d_v ≤ δ_v. For l_dis, the distance between the class means μ_ca and μ_cb is d_d = ‖μ_ca − μ_cb‖; the distance term is a quadratic function when d_d is less than 2δ_d and zero when d_d is greater than 2δ_d.
δ_v and δ_d are the respective preset margins.
To reduce the computational effort, the size is first reduced using convolutional layers on the output of the crisscross attention module, and then the three above-mentioned losses are applied on feature maps with fewer channels.
In some alternative embodiments, step (5) comprises: skeletonizing the crack morphology, eliminating short crack branches, calculating the crack length and calculating the crack width.
The skeleton is defined by the maximum disk method, in which the target skeleton consists of the centers of all maximal inscribed disks in the target. The maximum disk method is expressed as S(A) = ∪_j S_j(A), where S_j(A) = (A ⊖ jB) − (A ⊖ jB) ∘ B, B is a structuring element, and (A ⊖ jB) represents j successive erosions of A by B.
For the short branches formed in the skeletonization process, the crack image is first traversed and the boundary points of a region are deleted iteratively. A boundary point is defined as a pixel whose value is 1 and which has at least one 8-neighbour with value 0, where 0 represents a point of the background region and 1 represents a point of the crack region; the 8-neighbourhood is used, and two steps are used to judge whether a boundary point is a deletion point. In step I, a contour point p1 is marked as a deletion point if the following conditions are met:
(a) 2 ≤ N(p1) ≤ 6; (b) T(p1) = 1; (c) p2 · p4 · p6 = 0; (d) p4 · p6 · p8 = 0,
where N(p1) denotes the number of non-zero neighbouring pixels of p1 and T(p1) denotes the number of 0-to-1 transitions in the ordered sequence p2, p3, …, p9;
In step II, conditions (a) and (b) above are kept, and conditions (c) and (d) are changed to: (c) p2 · p4 · p8 = 0; (d) p2 · p6 · p8 = 0.
The judging procedure is: in the first pass, the conditions of step I are applied; if any of conditions (a)–(d) of step I is violated, the value of p1 is left unchanged, otherwise p1 is marked as a deletion point, and after step I has been applied to all boundary points the values of the marked points are set to 0 and the points are deleted. In the second pass, the conditions of step II are applied to the result of the first pass with the same rule. The image formed by the remaining points is the trimmed crack skeleton.
Since the crack discontinuity problem no longer occurs after discontinuous cracks have been connected, the crack length can be calculated directly.
Based on the skeleton, a method for calculating the crack width in image pixels is applied. The skeleton is a single-pixel-wide image, so the tangential direction at each point on the skeleton line can be calculated from the skeleton image; after the tangent at each point of the crack is calculated, the normal perpendicular to the tangent on the skeleton line is computed, and the distance between the intersection points of the normal with the crack boundary is the width of the crack.
The crack size is estimated according to the proportional relation between the actual bridge crack and the crack in the picture: the segmented bridge crack measurements are finally multiplied by this proportional relation to obtain the actual length and width of the crack in the image.
According to another aspect of the present invention, there is provided a bridge surface crack detection system based on semantic segmentation, comprising:
the data acquisition module is used for manufacturing a bridge crack semantic segmentation data set;
the encoder module is used for extracting semantic features from the original picture through downsampling by convolution and pooling operations;
the feature decoder module based on cross attention is used for extracting context association, combining high-level semantic and low-level fine granularity surface layer information more effectively, performing jump connection on the features acquired from each corresponding layer of the encoder and the decoder, constructing two continuous cross attention modules at jump connection positions, and gradually recovering position information from the high-level semantic features through up-sampling and convolution;
the loss calculation module is used for introducing a type of consistency loss based on the classification loss and training a network to obtain a classification result of the pixel level;
and the crack calculation module is used for extracting a crack morphology framework, carrying out crack short branch elimination to calculate the crack length, and calculating the image crack width based on the morphology framework.
Compared with the prior art, the invention has the beneficial effects that:
the invention provides a bridge surface crack detection method and system based on semantic segmentation, which utilizes two crisscross attentions to combine high-level semantics and low-level fine granularity surface layer information more effectively at a jump joint of a network, and utilizes a type of consistency loss on the basis of classification loss so as to enable feature vectors of pixels belonging to the type to be close and feature vectors of pixels of different types to be far away. Global attention is achieved, thereby enhancing the feature expression capabilities of the network.
Drawings
FIG. 1 is a schematic flow diagram of a semantic segmentation bridge crack detection method based on crisscross attention and class consistency loss provided by an embodiment of the present invention;
FIG. 2 is a diagram of a network structure for detecting cracks on a bridge surface based on semantic segmentation according to an embodiment of the present invention;
FIG. 3 is a block diagram of a crisscross attention module provided by embodiments of the present invention;
FIG. 4 is a schematic diagram of a bridge surface crack detection system based on semantic segmentation according to an embodiment of the present invention;
fig. 5 is a graph showing a comparison of test results according to an embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention. In addition, the technical features of the embodiments of the present invention described below may be combined with each other as long as they do not collide with each other.
The semantic segmentation bridge crack detection method based on the crisscross attention and the class consistency loss, which is disclosed by the embodiment of the invention, is shown in fig. 1, and comprises the following steps of:
s1: manufacturing a bridge surface crack data set;
in the embodiment of the invention, a bridge surface crack data set is manufactured according to the structural characteristics of the existence of the bridge surface crack which is easy to miss. The data set comprises 600 crack images of 10 bridges from the Wuhan in Hubei province, the selected images are from different scenes in order to improve the generalization capability of the model, and the scene range shot by the camera and the image size proportion under each scene are kept uniform for facilitating the calculation of the actual crack size.
Several equi-sized (300 x 300 pixels) minigrams were randomly cropped from each original. The image is divided into a positive sample and a negative sample, the positive and negative samples representing a cracked image and a non-cracked image, respectively. Positive samples included reticulated cracks, lower edge tensile zone thin short cracks (with jitter blur), vertical cracks on the web (including low contrast images), vertical cracks at the diaphragm (with complex background texture), pier foundation concrete cracks (with water stain interference). Negative examples include honeycomb pitting, flaking off corners (high similarity to corner edges), void holes, steel bar rust (with lighting shadows), sky, trees, water stains and shadows. Because the positive and negative samples are unbalanced (there are no crack images far more than there are crack images), 2400 crack images and 3000 crack-free images are screened from the small image to produce a data set, and 800 images are taken as a test set (positive and negative sample ratio 2:3).
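The cropping and labelling procedure above can be sketched as follows. This is a minimal illustration only: the folder layout, file naming and the label_fn() helper that decides whether a crop contains a crack (in practice derived from the pixel-level annotations) are assumptions, not details stated in the patent.

```python
import random
from pathlib import Path
from PIL import Image

CROP_SIZE = 300          # 300x300-pixel sub-images
CROPS_PER_IMAGE = 9      # nine random crops per original photograph

def crop_originals(src_dir: str, dst_dir: str, label_fn) -> None:
    """Randomly crop sub-images from every original and sort them into
    crack (positive) and no_crack (negative) folders."""
    dst = Path(dst_dir)
    (dst / "crack").mkdir(parents=True, exist_ok=True)
    (dst / "no_crack").mkdir(parents=True, exist_ok=True)
    for img_path in sorted(Path(src_dir).glob("*.jpg")):
        img = Image.open(img_path)
        w, h = img.size
        for k in range(CROPS_PER_IMAGE):
            x = random.randint(0, w - CROP_SIZE)
            y = random.randint(0, h - CROP_SIZE)
            patch = img.crop((x, y, x + CROP_SIZE, y + CROP_SIZE))
            sub = "crack" if label_fn(img_path, (x, y)) else "no_crack"
            patch.save(dst / sub / f"{img_path.stem}_{k}.png")
```

The screened 2400 crack and 3000 crack-free sub-images would then be split into training and test sets as described above.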
S2: constructing a feature encoder module, and extracting semantic features from an original picture through convolution operation and downsampling;
In the embodiment of the invention, a feature encoder is constructed. The encoding path of the encoder comprises three steps, denoted s1, s2 and s3, whose inputs are denoted e_i0, e_i1 and e_i2 respectively, where e_i0 is the original image. Each step uses the same operations: two 3×3 convolutions, each convolution layer followed by a ReLU activation layer, and then a max pooling with stride 2 for downsampling, with the number of channels doubled at each downsampling. Each step of the encoder extracts image semantic features at a different scale, and the outputs of the steps are denoted e_o0, e_o1 and e_o2. The output of each step is the input of the next step, i.e. e_ok = e_i(k+1), k ∈ {0, 1}. Then e_o2 is fed into two 3×3 convolution layers, each followed by a ReLU activation layer, and the output is taken as the decoder input;
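A minimal PyTorch sketch of this three-step encoder is given below. The starting channel width (64) and the placement of the skip features before pooling follow common U-Net practice and are assumptions, since the patent does not state them.

```python
import torch
import torch.nn as nn

def double_conv(in_ch: int, out_ch: int) -> nn.Sequential:
    # two 3x3 convolutions, each followed by a ReLU activation layer
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
    )

class Encoder(nn.Module):
    def __init__(self, in_ch: int = 3, base: int = 64):
        super().__init__()
        self.steps = nn.ModuleList([
            double_conv(in_ch, base),         # s1 -> e_o0
            double_conv(base, base * 2),      # s2 -> e_o1  (channels doubled)
            double_conv(base * 2, base * 4),  # s3 -> e_o2
        ])
        self.pool = nn.MaxPool2d(2, stride=2)       # stride-2 downsampling
        self.bottleneck = double_conv(base * 4, base * 8)

    def forward(self, x):
        feats = []            # e_o0, e_o1, e_o2 kept for the jump connections
        for step in self.steps:
            x = step(x)
            feats.append(x)
            x = self.pool(x)
        return feats, self.bottleneck(x)   # bottleneck output is the decoder input
```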
S3: Gradually recovering position information from the high-level semantic features through up-sampling and convolution, making jump connections with each corresponding layer of the encoder, constructing two consecutive crisscross attention modules at the jump connections to extract contextual associations, and combining high-level semantic and low-level fine-grained surface information more effectively;
In the embodiment of the present invention, as shown in fig. 2, the decoding path of the feature decoder consists of three steps, s4, s5 and s6, each comprising: an up-sampling with a 2×2 convolution that halves the number of channels, followed by two 3×3 convolutions, each followed by a ReLU activation function. The outputs of the steps are denoted d_o0, d_o1 and d_o2 respectively.
In particular, a jump connection incorporating two consecutive crisscross attention modules is established at the corresponding steps of the encoder and decoder to combine dense predictions of different granularity. The jump connection is established between s_i and s_(7-i), i ∈ {1, 2, 3}: the corresponding feature maps extracted by the encoder at multiple scales are first concatenated with the output of each step of the decoder. The encoder output feature map e_o2 is concatenated with d_o0 along the channel dimension, denoted T0; in the same way e_o1 and d_o1 are concatenated into T1, and e_o0 and d_o2 into T2. These serve as the inputs of the crisscross attention groups, and the decoder inputs d_i0, d_i1 and d_i2 are defined as CroA(CroA(T0)), CroA(CroA(T1)) and CroA(CroA(T2)), where CroA denotes the crisscross attention module, i.e. the T' output by the first module is taken as the input of the second module. Through the crisscross attention modules, the association from each feature point to the whole feature map is obtained and the associated features are enhanced.
In the embodiment of the invention, the local feature map T ∈ R^(C×W×H) is input to the crisscross attention module, where H, W and C are the height, width and number of channels of the feature map respectively. First, T is passed through two 1×1 convolutions to generate two low-dimensional features Q and K, Q, K ∈ R^(C'×W×H), where C' is less than C.
Q and K then generate the attention map A through a "similarity" operation. At each position u in the spatial dimension of Q, a vector Q_u is obtained; at the same time, the set Ω_u is obtained by extracting from K the feature vectors in the same row or the same column as position u, with Ω_{i,u} denoting the i-th element of Ω_u. The similarity operation is d_{i,u} = Q_u · Ω_{i,u}^T, d_{i,u} ∈ D, where d_{i,u} is the degree of correlation between the features Q_u and Ω_{i,u}, i = 1, …, H+W−1. Softmax is then applied to D along the channel dimension to obtain the attention map A. Another 1×1 convolution layer is applied to T to generate V ∈ R^(C×W×H) for feature adaptation. At each position u in the spatial dimension of V, a vector V_u and a set Φ_u are obtained, where Φ_u is the set of feature vectors in V located in the same row or column as position u. The context information is added to the local feature T through T'_u = Σ_i A_{i,u} Φ_{i,u} + T_u (the sum running over the H+W−1 positions in the same row and column as u) to enhance the pixel-wise representation, where the summation term extracts the context information, T'_u is the feature vector at position u in T', and A_{i,u} is the scalar value at channel i and position u of A. The module therefore has a broad contextual view and selectively aggregates context according to the spatial attention map. The structure of the crisscross attention module is shown in fig. 3.
A single crisscross attention captures context information in the horizontal and vertical directions, but a pixel has no connection to the pixels that do not lie on its crisscross path. Therefore two crisscross attention modules are used in succession to establish associations between arbitrary positions, so that full-image context information can be obtained from all pixels and new features with dense and rich context information are generated.
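A hedged PyTorch sketch of one CroA module is shown below, following the affinity and aggregation operations described above. The channel reduction to C/8 for Q and K, the learnable scale gamma on the aggregated context, and the negative-infinity diagonal (which stops a position being counted twice on its own row and column) are taken from the common public CCNet-style reference implementation and are assumptions rather than details stated in the patent.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def _neg_inf_diag(b: int, h: int, w: int, device) -> torch.Tensor:
    # -inf on the diagonal of the vertical affinity so a position does not
    # attend to itself in both the row branch and the column branch.
    return torch.diag(torch.full((h,), float("-inf"), device=device)) \
                .unsqueeze(0).expand(b * w, h, h)

class CrissCrossAttention(nn.Module):
    """One CroA module: context along the row and column of every position."""
    def __init__(self, channels: int):
        super().__init__()
        self.q = nn.Conv2d(channels, channels // 8, 1)   # 1x1 conv -> Q
        self.k = nn.Conv2d(channels, channels // 8, 1)   # 1x1 conv -> K
        self.v = nn.Conv2d(channels, channels, 1)        # 1x1 conv -> V
        self.gamma = nn.Parameter(torch.zeros(1))        # learnable context scale

    def forward(self, t: torch.Tensor) -> torch.Tensor:
        b, _, h, w = t.shape
        q, k, v = self.q(t), self.k(t), self.v(t)
        # vertical (same-column) affinities: (b*w, h, h)
        q_h = q.permute(0, 3, 2, 1).reshape(b * w, h, -1)
        k_h = k.permute(0, 3, 1, 2).reshape(b * w, -1, h)
        e_h = torch.bmm(q_h, k_h) + _neg_inf_diag(b, h, w, t.device)
        # horizontal (same-row) affinities: (b*h, w, w)
        q_w = q.permute(0, 2, 3, 1).reshape(b * h, w, -1)
        k_w = k.permute(0, 2, 1, 3).reshape(b * h, -1, w)
        e_w = torch.bmm(q_w, k_w)
        # softmax over the H+W-1 candidates of each position -> attention map A
        att = F.softmax(torch.cat([
            e_h.view(b, w, h, h).permute(0, 2, 1, 3),   # (b, h, w, h)
            e_w.view(b, h, w, w),                       # (b, h, w, w)
        ], dim=3), dim=3)
        att_h = att[..., :h].permute(0, 2, 1, 3).reshape(b * w, h, h)
        att_w = att[..., h:].reshape(b * h, w, w)
        # aggregate V along the column and row, then add back to T
        v_h = v.permute(0, 3, 1, 2).reshape(b * w, -1, h)
        v_w = v.permute(0, 2, 1, 3).reshape(b * h, -1, w)
        out_h = torch.bmm(v_h, att_h.transpose(1, 2)) \
                     .view(b, w, -1, h).permute(0, 2, 3, 1)
        out_w = torch.bmm(v_w, att_w.transpose(1, 2)) \
                     .view(b, h, -1, w).permute(0, 2, 1, 3)
        return self.gamma * (out_h + out_w) + t          # T' = context + T
```

At each jump connection the module would be applied twice to the channel-wise concatenation of the encoder and decoder features, e.g. d_i = croa(croa(torch.cat([e_o, d_o], dim=1))).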
S4: introducing a class consistency loss on the basis of the classification loss and training a network to obtain a pixel-level classification result;
at a classification loss l seg Introduces class consistency loss l on the basis of class For semantic segmentation tasks, pixels belonging to the same class should have similar features, while pixel features from different classes differ more, and this feature is named class consistency. This class consistency penalty may cause the network to map each pixel in the image to an n-dimensional vector in the feature space such that feature vectors for pixels belonging to that class are close, feature vectors for pixels of different classes are far, and the final penalty function is defined as: l=l seg +ml class 。l class For class consistency loss, l class =αl var +βl dis +γl reg Where m, α, β and l are weight coefficients. In a specific implementation, m=1, α=β=1, γ=0.001 is set.
By using l var Penalizing the same class of features with larger distances,by using l dis Punishment of different class features with smaller distances, < +.>c a ≠c b The method comprises the steps of carrying out a first treatment on the surface of the By using l reg Pushing all class features to the mean point of the class in which they are located,/->Where C is a group of classes present in the small lot image, N c The number of active elements belonging to class c, c.epsilon. C, h i E H is the eigenvector of spatial location i. Mu (mu) c Is the average of class C e CFeatures.
Wherein the method comprises the steps ofIs a function of the distance between the segments,
feature h i And mu c Distance d of (2) v =‖μ c -h i II, when d v Greater than delta d Time of dayAs a quadratic function, when d v At (delta) vd ]Is a linear function when d v ≤δ v Time->Zero.
Wherein the method comprises the steps ofIs a function of the distance between the segments,
features (e.g. a character)And->Distance of->When d v Less than 2 delta d Time->As a quadratic function, when d v Greater than 2 delta d Time->Zero.
δ v And delta d Respectively, the margins are set, in a specific implementation, delta is set v =0.5,δ d =1.5。
To reduce the computational effort, the output of the crisscross attention module is first reduced in dimension using a convolution layer (in a specific implementation, the number of channels after dimension reduction is set to 16), and the three losses described above are then applied on the feature map with fewer channels.
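A sketch of the class consistency loss is given below. The exact piecewise distance functions are given in the original only as figures, so this sketch falls back on the commonly used hinged (clamped) forms of a discriminative embedding loss together with the margins and weights quoted in the text (δ_v = 0.5, δ_d = 1.5, α = β = 1, γ = 0.001); the regularisation term likewise uses the usual mean-norm form. Treat it as an approximation of the patented loss, not a faithful reproduction.

```python
import torch

def class_consistency_loss(feats: torch.Tensor, labels: torch.Tensor,
                           delta_v: float = 0.5, delta_d: float = 1.5,
                           alpha: float = 1.0, beta: float = 1.0,
                           gamma: float = 0.001) -> torch.Tensor:
    """feats: (N, D) pixel embeddings h_i, labels: (N,) class index per pixel."""
    classes = labels.unique()
    means, l_var = [], feats.new_zeros(())
    for c in classes:
        h_c = feats[labels == c]            # features of class c
        mu_c = h_c.mean(dim=0)              # class mean mu_c
        means.append(mu_c)
        d_v = (h_c - mu_c).norm(dim=1)      # ||mu_c - h_i||
        l_var = l_var + torch.clamp(d_v - delta_v, min=0).pow(2).mean()
    l_var = l_var / len(classes)
    mu = torch.stack(means)                 # (|C|, D)
    if len(classes) > 1:
        dist = torch.cdist(mu, mu)          # pairwise ||mu_ca - mu_cb||
        off_diag = ~torch.eye(len(classes), dtype=torch.bool, device=mu.device)
        l_dis = torch.clamp(2 * delta_d - dist[off_diag], min=0).pow(2).mean()
    else:
        l_dis = feats.new_zeros(())
    l_reg = mu.norm(dim=1).mean()
    return alpha * l_var + beta * l_dis + gamma * l_reg

# total loss as described above: l = l_seg + m * l_class, with m = 1
# l = cross_entropy + 1.0 * class_consistency_loss(feats, labels)
```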
S5: and (3) carrying out morphological skeleton extraction on the crack segmentation result output by the network, carrying out crack short branch elimination to calculate the crack length, calculating the width based on the image crack in the skeleton direction, and further estimating the actual size of the crack according to the proportional relation.
In the embodiment of the invention, the skeleton is defined by the maximum disk method: the target skeleton consists of the centers of all maximal inscribed disks in the target, and the maximum disk method is expressed as S(A) = ∪_j S_j(A), where S_j(A) = (A ⊖ jB) − (A ⊖ jB) ∘ B, B is a structuring element, and (A ⊖ jB) represents j successive erosions of A by B.
For the short branches formed in the skeletonization process, the crack image is first traversed and the boundary points of a region are deleted iteratively. A boundary point is defined as a pixel whose value is 1 and which has at least one 8-neighbour with value 0, where 0 represents a point of the background region and 1 represents a point of the crack region; the 8-neighbourhood is used, and two steps are used to judge whether a boundary point is a deletion point. In step I, a contour point p1 is marked as a deletion point if the following conditions are met:
(a) 2 ≤ N(p1) ≤ 6; (b) T(p1) = 1; (c) p2 · p4 · p6 = 0; (d) p4 · p6 · p8 = 0,
where N(p1) denotes the number of non-zero neighbouring pixels of p1 and T(p1) denotes the number of 0-to-1 transitions in the ordered sequence p2, p3, …, p9;
In step II, conditions (a) and (b) above are kept, and conditions (c) and (d) are changed to: (c) p2 · p4 · p8 = 0; (d) p2 · p6 · p8 = 0.
The judging procedure is: in the first pass, the conditions of step I are applied; if any of conditions (a)–(d) of step I is violated, the value of p1 is left unchanged, otherwise p1 is marked as a deletion point, and after step I has been applied to all boundary points the values of the marked points are set to 0 and the points are deleted. In the second pass, the conditions of step II are applied to the result of the first pass with the same rule. The image formed by the remaining points is the trimmed crack skeleton.
Because the crack discontinuity problem no longer occurs after discontinuous cracks have been connected, the crack length can be obtained directly.
Based on the skeleton, a method for calculating the crack width in image pixels is applied. The skeleton is a single-pixel-wide image, so the tangential direction of each point on the skeleton line can be calculated from the skeleton image. If M is any point on the skeleton image, there are two points with value 1 in the 8-neighbourhood of M; denoting these two neighbours N1 and N2, the tangential direction at M is taken as the average of the directions along MN1 and MN2. After the tangent of each point in the crack has been calculated, the normal perpendicular to the tangent on the skeleton line is computed, and the distance between the intersection points of the normal with the crack boundary is the width of the crack. Considering that the crack width changes slowly in an actual scene, in order to reduce the amount of computation and improve efficiency, points at a certain interval along the crack skeleton can be selected (the interval is determined comprehensively according to factors such as image resolution and crack type), and the quantized width is obtained at these points. Finally, the point with the largest quantized width is selected among them, the quantized width of each point on the skeleton segments before and after this point is computed, and the maximum of these quantized widths is the maximum width of the crack region in that segment.
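A sketch of this width measurement at a single skeleton point is given below: the tangent at M is derived from its two skeleton neighbours N1 and N2 (one reading of "the average of the MN1 and MN2 directions"), and the width is found by stepping along the normal in both directions until the crack mask is left. The helper name, the border guard and the max_half_width limit are assumptions for the example, and sub-pixel refinement is omitted.

```python
import numpy as np

def width_at(mask: np.ndarray, skel: np.ndarray, y: int, x: int,
             max_half_width: int = 100) -> float:
    """mask: binary crack image, skel: pruned skeleton, (y, x): skeleton point M."""
    h, w = skel.shape
    if not (1 <= y < h - 1 and 1 <= x < w - 1):
        return float("nan")                       # skip border points
    ys, xs = np.nonzero(skel[y - 1:y + 2, x - 1:x + 2])
    nbrs = [(dy - 1, dx - 1) for dy, dx in zip(ys, xs) if (dy, dx) != (1, 1)]
    if len(nbrs) != 2:
        return float("nan")                       # only simple path points handled
    (ay, ax), (by, bx) = nbrs
    ty, tx = (ay - by) / 2.0, (ax - bx) / 2.0     # tangent from N2 towards N1
    norm = np.hypot(ty, tx)
    ny, nx = -tx / norm, ty / norm                # unit normal, perpendicular to tangent
    width = 0.0
    for sign in (+1, -1):                         # walk both ways along the normal
        for step in range(1, max_half_width):
            py = int(round(y + sign * step * ny))
            px = int(round(x + sign * step * nx))
            inside = 0 <= py < h and 0 <= px < w and mask[py, px] > 0
            if not inside:
                width += step - 1                 # last step still inside the crack
                break
        else:
            width += max_half_width
    return width + 1.0                            # include the skeleton pixel itself
```

In the described procedure this measurement would be taken at sampled skeleton points, and the result converted to centimetres using the scene-to-image ratio.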
The proportional relation between the actual scene and the picture is calculated according to the measured data and the picture, and the segmented bridge crack measurements are finally multiplied by this proportional relation to estimate the actual length and width of the crack in the image.
The invention also provides a semantic segmentation bridge crack detection system based on crisscross attention and class consistency loss, as shown in fig. 4, comprising:
the data acquisition module is used for manufacturing a bridge crack semantic segmentation data set;
the encoder module is used for extracting semantic features from the original picture through downsampling by convolution and pooling operations;
the feature decoder module based on cross attention is used for extracting context association, combining high-level semantic and low-level fine granularity surface layer information more effectively, performing jump connection on the features acquired from each corresponding layer of the encoder and the decoder, constructing two continuous cross attention modules at jump connection positions, and gradually recovering position information from the high-level semantic features through up-sampling and convolution;
and the loss calculation module is used for introducing a type of consistency loss based on the classification loss and training the network to obtain a pixel-level classification result.
And the crack calculation module is used for extracting a crack morphology framework, carrying out crack short branch elimination to calculate the crack length, and calculating the image crack width based on the morphology framework.
The specific implementation of each module may refer to the description of the method embodiment, and the embodiment of the present invention will not be repeated.
In the test example of the invention, several classical segmentation algorithms, including the FCN algorithm, the RAU-Net algorithm and the Mask R-CNN algorithm, are selected to verify the effectiveness of the experiment, and verification is performed on the self-made bridge data set. The comparison between the method proposed by the invention and each classical algorithm is shown in Table 1.
Table 1 comparison of individual algorithm accuracy and time spent in bridge fracture dataset
Method mIoU Time/ms
FCN 0.432 123
Mask R-CNN 0.443 183
Ours 0.458 128
From the above table it can be seen that the invention achieves better accuracy while consuming less time. Compared with FCN the method has higher accuracy, and compared with RAU-Net it has higher speed. Therefore, the proposed algorithm achieves a better effect in both precision and speed. The segmentation results of each algorithm on the bridge crack data set are compared in fig. 5; as can be seen from fig. 5, the method provided by the invention obtains a more complete crack structure than the other algorithms.
Table 2 comparison of measured actual values and calculated values for two segmented pictures
Experiment Calculated length/cm Calculated width/cm Length error/% Width error/%
First image 48.04 1.01 3.66 3.78
Second image 114.88 0.3754 4.57 4.28
In order to detect the length and width of bridge cracks in an actual scene and to prevent changes in natural lighting conditions from influencing the experiment, an independent uniform light source is adopted and natural light is shielded, and about 30 photos are taken to verify the accuracy of the method. The practical results show that the average length measurement error of the 30 crack pictures is 2.76% and the average width measurement error is 4.34%. The experimental results show that the accuracy of calculating the crack length by skeletonization is high and meets the system design requirements, and the accuracy of calculating the maximum crack width by the crack width quantization model also meets the system requirements. For the width characteristics of cracks, the average width reflects the overall damage degree of the crack while the maximum width reflects the local damage degree, and considering both together better reflects the real damage condition of the crack.
It should be noted that each step/component described in the present application may be split into more steps/components, or two or more steps/components or part of the operations of the steps/components may be combined into new steps/components, as needed for implementation, to achieve the object of the present invention.
It will be readily appreciated by those skilled in the art that the foregoing description is merely a preferred embodiment of the invention and is not intended to limit the invention, but any modifications, equivalents, improvements or alternatives falling within the spirit and principles of the invention are intended to be included within the scope of the invention.

Claims (7)

1. The bridge surface crack detection method based on semantic segmentation is characterized by comprising the following steps of:
step 1, acquiring a bridge crack image, performing pixel-level semantic annotation on the acquired image, and manufacturing a bridge crack segmentation data set;
step 2, constructing a feature encoder module, and extracting multi-level semantic features from an original picture through multiple convolution operations and downsampling;
step 3, constructing a characteristic decoder module based on cross attention, gradually recovering position information from high-level semantic features in an original picture through up-sampling and convolution operation for a plurality of times, performing jump connection on each decoder layer corresponding to the encoder module, constructing two continuous cross attention modules at jump connection positions, extracting context association, and combining high-level semantic and low-level fine-grained surface layer information;
step 4, introducing a class consistency loss based on the classification loss, and training a network to map each pixel in the image to an n-dimensional vector in a feature space by the network so that the feature vectors of the pixels belonging to the class are close and the feature vectors of the pixels of different classes are far, thereby obtaining a classification result of the pixel level;
step 5, morphological skeleton extraction is carried out on the crack segmentation result output by the network, crack short branches are eliminated to calculate crack length, width is calculated based on image cracks in the skeleton direction, and the actual size of the cracks is estimated according to the proportional relation;
the process of constructing the feature decoder module based on the crisscross attention in the step 3 is as follows: the decoding path of the feature decoder module is composed of three steps s 4 ,s 5 Sum s 6 The method comprises the following steps: one up-sample and halving the number of channels in each up-sample, then using two convolution layers, each calculated to perform a ReLU activation function calculation, the output of each step is defined as d, respectively o0 ,d o1 ,d o2 The output of each step serves as the input for the next step;
the jump connection is established at s i And s 7-i Between i e 1,2,3, i.e. the corresponding feature map extracted by the encoder module over multiple scales is connected to the output of each step of the feature decoder module: encoder module output e o2 Feature map and d o0 Splicing in the channel dimension, denoted as T 0 E, in the same way o1 And d o1 Splice is marked as T 1 ,e o0 And d o2 Splice is marked as T 2 The T is 0 、T 1 And T 2 Respectively used as the input of three corresponding layers of the crisscross attention module;
the s is 4 The method comprises the following steps: at a classification loss l seg Is introduced on the basis of (a)Class consistency loss l class For semantic segmentation tasks, pixels belonging to the same class should have similar features, while the features of pixels from different classes differ more, this feature is named class consistency, which is a loss of class consistency for the network to map each pixel in the image to an n-dimensional vector in feature space to bring the feature vector of the pixel belonging to that class closer, the feature vector of the pixel of the different class farther away, and the final loss function/is defined as: l=l seg +ml class Where m is a weight coefficient.
2. The method for detecting the surface cracks of the bridge based on semantic segmentation according to claim 1, wherein the step 1 is specifically: collecting a plurality of different crack image original pictures of a plurality of bridges, randomly cutting out a plurality of square images with the same pixel size from each original picture, dividing the square images into a positive sample and a negative sample, wherein the positive sample and the negative sample respectively represent crack images and crack-free images, screening out a plurality of crack images and a plurality of crack-free images from the square images to manufacture a bridge crack segmentation data set, and taking a plurality of images in the bridge crack segmentation data set as a test set.
3. The method for detecting cracks on the surface of a bridge based on semantic segmentation according to claim 2, wherein the positive sample comprises a netlike crack image, a crack image with shake blur, a crack image with low contrast, a crack image with complex background texture and a crack image with water stain interference; the negative samples include honeycomb pitting, peeling off corners, cavity holes, steel bar rust, sky, trees, water stains and shadows.
4. The method for detecting bridge surface cracks based on semantic segmentation according to claim 1, wherein the step 2 is specifically: a feature encoder module is constructed, the encoding path of the encoder module comprising three consecutive steps denoted s1, s2 and s3, whose inputs are denoted e_i0, e_i1 and e_i2 respectively, where e_i0 is the original image; each step comprises: using two convolution layers, computing a ReLU activation function after each convolution layer, then performing a max pooling downsampling with stride 2, and doubling the number of channels at each downsampling; each step of the encoder module extracts image semantic features at a different scale, the outputs of the steps are denoted e_o0, e_o1 and e_o2, and the output of each step serves as the input of the next step, i.e. e_ok = e_i(k+1), k ∈ {0, 1}.
5. The bridge surface crack detection method based on semantic segmentation according to claim 1, wherein the working process of the crisscross attention module is as follows: a feature map T is input, two low-dimensional features Q and K are generated through 1×1 convolution layer calculations, and the attention map A is further generated through a "similarity" operation; at each position u in the spatial dimension of Q a vector Q_u is obtained, and at the same time the set Ω_u is obtained by extracting from K the feature vectors in the same row or column as position u, with Ω_{i,u} denoting the i-th element of Ω_u; the "similarity" operation is defined as d_{i,u} = Q_u · Ω_{i,u}^T, where d_{i,u} is the degree of correlation between the features Q_u and Ω_{i,u}, i = 1, …, H+W−1, and softmax is then applied to D along the channel dimension to obtain the attention map A;
another 1×1 convolution layer calculation is applied to T to generate V for feature adaptation; at each position u in the spatial dimension of V a vector V_u and a set Φ_u are obtained, the set Φ_u being the set of feature vectors in V located in the same row or column as position u; the context information is added to the local feature T through T'_u = Σ_i A_{i,u} Φ_{i,u} + T_u (the sum running over the H+W−1 positions in the same row and column as u) to enhance the pixel-wise representation, wherein the summation term is used for extracting the context information, T'_u is the feature vector at position u in T', and A_{i,u} is the scalar value at channel i and position u of A;
the feature vectors are mapped to the required number of classes at the final layer of the feature decoder module using a 1×1 convolution layer calculation.
6. The method for detecting bridge surface cracks based on semantic segmentation according to claim 1, wherein the step 5 is specifically: the skeleton is defined by the maximum disk method, in which the target skeleton consists of the centers of all maximal inscribed disks in the target, and the maximum disk method is expressed as S(A) = ∪_j S_j(A), where S_j(A) = (A ⊖ jB) − (A ⊖ jB) ∘ B, B is a structuring element, and (A ⊖ jB) represents j successive erosions of A by B;
for the short branches formed in the skeletonization process, the crack image is first traversed and the boundary points of a region are deleted iteratively, a boundary point being defined as a pixel with value 1 having at least one 8-neighbour with value 0, where 0 represents a point of the background region and 1 represents a point of the crack region; the 8-neighbourhood is used, and two steps are used to judge whether a boundary point is a deletion point; in step I, a contour point p1 is marked as a deletion point if the following conditions are met: (a) 2 ≤ N(p1) ≤ 6; (b) T(p1) = 1; (c) p2 · p4 · p6 = 0; (d) p4 · p6 · p8 = 0,
wherein N(p1) denotes the number of non-zero neighbouring pixels of p1 and T(p1) denotes the number of 0-to-1 transitions in the ordered sequence p2, p3, …, p9;
in step II, conditions (a) and (b) above are kept, and conditions (c) and (d) are changed to: (c) p2 · p4 · p8 = 0; (d) p2 · p6 · p8 = 0;
the judging method is: in the first pass the conditions of step I are applied; if any of conditions (a)–(d) of step I is violated, the value of p1 is left unchanged, otherwise p1 is marked as a deletion point, and after step I has been applied to all boundary points the values of the marked points are set to 0 and the points are deleted; in the second pass the conditions of step II are applied to the result of the first pass with the same rule; the image formed by the finally remaining points is the trimmed crack skeleton;
the crack length is directly calculated because the problem of crack discontinuity can not occur after the intermittent crack connection process;
based on the skeleton method, a method for calculating width pixels of an image crack is applied: because the image on the skeleton is a single-pixel image, the tangential direction of each point on the skeleton line is calculated according to the skeleton image, the tangential line of each point in the crack is calculated, and then the normal line perpendicular to the tangential line on the skeleton line is calculated, wherein the intersection point distance between the normal line and the crack boundary is the width of the crack;
estimating the crack size according to the proportion relation between actual bridge cracks and picture cracks: and according to the proportional relation between the actual scene and the photo, finally, multiplying the proportional relation by the segmented bridge crack image to estimate the actual length and width of the crack in the graph.
7. A system for utilizing a semantic segmentation based bridge surface crack detection method as set forth in claim 1, comprising:
the data acquisition module is used for acquiring bridge crack images and carrying out pixel-level semantic annotation on the acquired images to manufacture a bridge crack segmentation data set;
the encoder module is used for extracting multi-level semantic features from the original picture through multiple convolution operations and downsampling;
the characteristic decoder module based on the crisscross attention is used for gradually recovering position information from high-level semantic characteristics in an original picture through up-sampling and convolution for a plurality of times, performing jump connection with each corresponding layer decoder of the encoder module, constructing two continuous crisscross attention modules at jump connection positions, extracting context association, and combining high-level semantic and low-level fine-granularity surface layer information;
the loss calculation module is used for introducing a type of consistency loss based on the classification loss and training a network, so that each pixel in the image is mapped to an n-dimensional vector in a feature space by the network, the feature vectors of the pixels belonging to the type are close, the feature vectors of the pixels of different types are far away, and a classification result of the pixel level is obtained;
and the crack calculation module is used for extracting a morphological skeleton from the crack segmentation result output by the network, eliminating crack short branches to calculate the crack length, calculating the width based on the image crack in the skeleton direction, and estimating the actual size of the crack according to the proportional relation.
CN202110817766.9A 2021-07-20 2021-07-20 Bridge surface crack detection method and system based on semantic segmentation Active CN113610778B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110817766.9A CN113610778B (en) 2021-07-20 2021-07-20 Bridge surface crack detection method and system based on semantic segmentation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110817766.9A CN113610778B (en) 2021-07-20 2021-07-20 Bridge surface crack detection method and system based on semantic segmentation

Publications (2)

Publication Number Publication Date
CN113610778A CN113610778A (en) 2021-11-05
CN113610778B true CN113610778B (en) 2024-03-26

Family

ID=78337969

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110817766.9A Active CN113610778B (en) 2021-07-20 2021-07-20 Bridge surface crack detection method and system based on semantic segmentation

Country Status (1)

Country Link
CN (1) CN113610778B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114581764B (en) * 2021-12-24 2024-04-26 中交基础设施养护集团有限公司 Underground structure crack disease discriminating method based on deep learning algorithm
CN114240945B (en) * 2022-02-28 2022-05-10 科大天工智能装备技术(天津)有限公司 Bridge steel cable fracture detection method and system based on target segmentation
CN115239733B (en) * 2022-09-23 2023-01-03 深圳大学 Crack detection method and apparatus, terminal device and storage medium
CN115393725B (en) * 2022-10-26 2023-03-07 西南科技大学 Bridge crack identification method based on feature enhancement and semantic segmentation
CN117649154B (en) * 2024-01-29 2024-04-19 新疆三联工程建设有限责任公司 Concrete test block manufacturing whole process management system and method based on digitization

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108520516A (en) * 2018-04-09 2018-09-11 陕西师范大学 A kind of bridge pavement Crack Detection and dividing method based on semantic segmentation
CN112348770A (en) * 2020-09-09 2021-02-09 陕西师范大学 Bridge crack detection method based on multi-resolution convolution network
CN112560895A (en) * 2020-11-20 2021-03-26 陕西师范大学 Bridge crack detection method based on improved PSPNet network
CN112884747A (en) * 2021-02-28 2021-06-01 长安大学 Automatic bridge crack detection system integrating cyclic residual convolution and context extractor network


Also Published As

Publication number Publication date
CN113610778A (en) 2021-11-05

Similar Documents

Publication Publication Date Title
CN113610778B (en) Bridge surface crack detection method and system based on semantic segmentation
Ali et al. Attention-based generative adversarial network with internal damage segmentation using thermography
Ali et al. Structural crack detection using deep convolutional neural networks
CN110570371B (en) Image defogging method based on multi-scale residual error learning
CN113516135B (en) Remote sensing image building extraction and contour optimization method based on deep learning
CN115049936B (en) High-resolution remote sensing image-oriented boundary enhanced semantic segmentation method
CN111985552B (en) Method for detecting diseases of thin strip-shaped structure of airport pavement under complex background
CN115984850A (en) Lightweight remote sensing image semantic segmentation method based on improved Deeplabv3+
CN111524117A (en) Tunnel surface defect detection method based on characteristic pyramid network
CN110717886A (en) Pavement pool detection method based on machine vision in complex environment
CN111353396A (en) Concrete crack segmentation method based on SCSEOCUnet
CN114155474A (en) Damage identification technology based on video semantic segmentation algorithm
Kumar et al. Detection of concrete cracks using dual-channel deep convolutional network
CN115761563A (en) River surface flow velocity calculation method and system based on optical flow measurement and calculation
CN115170520A (en) Metal mesh defect detection method based on structure contrast information lamination
CN111104850A (en) Remote sensing image building automatic extraction method and system based on residual error network
Meng et al. A modified fully convolutional network for crack damage identification compared with conventional methods
CN114092467A (en) Scratch detection method and system based on lightweight convolutional neural network
Jiang et al. Automatic pixel-level detection and measurement of corrosion-related damages in dim steel box girders using Fusion-Attention-U-net
CN111951289B (en) Underwater sonar image data segmentation method based on BA-Unet
CN113989653A (en) Method for extracting geometric indexes and topological structures of river planes
CN113326846A (en) Rapid bridge apparent disease detection method based on machine vision
CN114612450B (en) Image detection segmentation method and system based on data augmentation machine vision and electronic equipment
CN112686204B (en) Video flow measurement method and device based on sparse pixel point tracking
Li et al. Classification of the qilou (arcade building) using a robust image processing framework based on the Faster R-CNN with ResNet50

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant