CN114140390A - Crack detection method and device based on semi-supervised semantic segmentation - Google Patents


Info

Publication number
CN114140390A
CN114140390A
Authority
CN
China
Prior art keywords
image
crack
loss
model
semi
Prior art date
Legal status
Pending
Application number
CN202111291405.1A
Other languages
Chinese (zh)
Inventor
蔡长青 (Cai Changqing)
刘爽 (Liu Shuang)
Current Assignee
Guangzhou University
Original Assignee
Guangzhou University
Priority date
Filing date
Publication date
Application filed by Guangzhou University filed Critical Guangzhou University
Priority to CN202111291405.1A priority Critical patent/CN114140390A/en
Publication of CN114140390A publication Critical patent/CN114140390A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/0002Inspection of images, e.g. flaw detection
    • G06T7/0004Industrial image inspection
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/30Subject of image; Context of image processing
    • G06T2207/30108Industrial image inspection
    • G06T2207/30132Masonry; Concrete

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Quality & Reliability (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a crack detection method and device based on semi-supervised semantic segmentation. The method mainly comprises the following steps: acquiring an image of a crack to be detected; inputting the crack image into a student model and a teacher model for training, and updating the weights of the student model through gradient descent of the loss function obtained in training; updating the weights of the teacher model through an exponential moving average of the student-model weights; and evaluating the accuracy of the trained model. The student model and the teacher model share the same network structure, with EfficientNet as the encoder and UNet as the decoder, so multi-scale crack feature information can be extracted efficiently and the loss of image information is reduced. The invention also adopts semi-supervised learning, thereby reducing the annotation workload. Experiments show that the invention reduces the workload of data annotation while maintaining high detection precision.

Description

Crack detection method and device based on semi-supervised semantic segmentation
Technical Field
The invention relates to the field of image detection, in particular to a surface crack detection method and device based on a semi-supervised semantic segmentation network.
Background
With the development of China's economy and society, much infrastructure has deteriorated to varying degrees due to overloaded use, and surface cracks are an obvious sign of this deterioration. The detection of surface cracks is therefore crucial to ensuring the safety and serviceability of civil infrastructure.
In recent years, automatic detection methods have gradually replaced traditional manual inspection because of their high efficiency and objective results. Among automatic detection methods, semantic segmentation algorithms based on deep learning perform well in crack detection. However, the currently common fully supervised segmentation methods require a large amount of manually labeled data, which is time-consuming and hinders their popularization and application in the field of image detection.
Disclosure of Invention
In view of this, the present invention provides a crack detection method and apparatus based on semi-supervised semantic segmentation.
The invention provides a crack detection method based on semi-supervised semantic segmentation, which comprises the following steps:
acquiring a crack image, wherein the crack image comprises a first crack image with an annotation and a second crack image without the annotation;
inputting the first crack image into a student model, and inputting the second crack image into both the student model and a teacher model to perform crack feature extraction and feature segmentation; the student model completes semi-supervised training through a supervised loss and an unsupervised loss, wherein the supervised loss is obtained by calculating the dice loss and cross entropy loss of the annotated first crack image, and the unsupervised loss is obtained by performing consistency regularization on the second crack image;
updating the weights of the student model through gradient descent of the supervised and unsupervised losses;
updating the weights of the teacher model through an exponential moving average of the student-model weights;
and adjusting the outputs of the student model and the teacher model according to their respective weights so that the two outputs are consistent, thereby obtaining a crack feature prediction result.
Further, the structures of the student model and the teacher model are codec (encoder-decoder) structures, each comprising an encoder and a decoder; the encoder comprises a CNN-based image feature extraction network; the decoder comprises a U-Net-based image feature segmentation network; the first crack image or the second crack image is input into the encoder as an input image, and feature extraction and segmentation are performed by the codec to obtain an output image.
Further, the CNN-based image feature extraction network comprises an EfficientNet network;
the EfficientNet network comprises 1 convolution layer and 23 mobile inverted bottleneck convolution (MBConv) modules;
each mobile inverted bottleneck convolution module comprises 4 convolution layers of 1×1, 1 depthwise separable convolution layer and 1 global average pooling layer;
the working stage of the EfficientNet network comprises the following steps:
performing convolution processing on an input image to obtain a first-stage image;
performing mobile inverted bottleneck convolution processing twice on the first-stage image to obtain a second-stage image;
performing mobile inverted bottleneck convolution processing three times on the second-stage image to obtain a third-stage image;
performing mobile inverted bottleneck convolution processing three times on the third-stage image to obtain a fourth-stage image;
performing mobile inverted bottleneck convolution processing four times on the fourth-stage image to obtain a fifth-stage image;
performing mobile inverted bottleneck convolution processing four times on the fifth-stage image to obtain a sixth-stage image;
performing mobile inverted bottleneck convolution processing five times on the sixth-stage image to obtain a seventh-stage image;
performing mobile inverted bottleneck convolution processing twice on the seventh-stage image to obtain an eighth-stage image;
and outputting the second-stage image, the third-stage image, the fifth-stage image, the sixth-stage image and the eighth-stage image to the decoder.
Further, the U-Net-based image feature segmentation network comprises 5 upsampling layers and 1 convolutional layer; the feature maps received from the encoder are upsampled layer by layer and then passed through the convolutional layer to obtain a feature image, which is output as the output image; the feature image has the same size as the input image.
Further, the supervised loss formula is
L_sup = L_dice + L_cross-entropy
wherein L_dice is the dice loss, L_cross-entropy is the cross entropy loss, p_i is the prediction of the input data, y is the true value, and N is the number of data.
Further, the unsupervised loss formula is L_unsup = E_{x,η,η′}[‖f(x, θ′, η′) − f(x, θ, η)‖²], where f(x, θ′, η′) is the prediction of the teacher model, f(x, θ, η) is the prediction of the student model, x represents the input data, θ′ and θ represent the teacher and student weights respectively, η′ and η represent noise, and E represents expectation.
Further, the total loss function of the student model and the teacher model is L_total = L_sup + ω(t)·L_unsup, wherein L_sup is the supervised loss, L_unsup is the unsupervised loss, and ω(t) is a Gaussian ramp-up function with the formula
ω(t) = 0.1·exp(−5·(1 − t/t_max)²)
where t is the current training step and t_max is the total number of training steps.
Further, the crack detection method based on semi-supervised semantic segmentation further comprises the following steps: the accuracy of the model was evaluated using F1-Score.
Further, the training step of the crack detection method based on semi-supervised semantic segmentation comprises the following steps:
acquiring annotated data X1 and unannotated data X2, wherein an annotation label is Y1;
inputting X1 and X2 into a student model to obtain an input prediction result P1 of X1 and an input prediction result P2 of X2;
calculating the supervised loss function Lsup(Y1, P1) for X1;
inputting X2 into the teacher model to obtain an input prediction result P3 of X2;
calculating the unsupervised loss function Lunsup(P2, P3) for the unannotated data X2;
calculating the gradient descent of the total loss function Lsup(Y1, P1) + Lunsup(P2, P3);
updating the weight of the student model according to the gradient descent result;
the weights of the teacher model are updated by an exponential moving average.
The invention also discloses a computer device, which comprises a processor and a memory;
the memory is used for storing programs;
the processor executes the program to realize a crack detection method based on semi-supervised semantic segmentation.
The invention has the following beneficial effects: the invention comprises a student model and a teacher model with identical network structures, so that multi-scale crack feature information can be extracted efficiently and the loss of image information is reduced. Meanwhile, the invention adopts semi-supervised learning, thereby reducing the annotation workload. Experiments show that the invention reduces the workload of data annotation while maintaining high detection precision.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present application, and those skilled in the art can obtain other drawings based on them without creative effort.
FIG. 1 is a network structure of a student model and a teacher model in a crack detection method and device based on semi-supervised semantic segmentation;
FIG. 2 is a network structure of a codec in a crack detection method and apparatus based on semi-supervised semantic segmentation;
FIG. 3 is the network structure of the mobile inverted bottleneck convolution (MBConv) module in the crack detection method and device based on semi-supervised semantic segmentation.
Reference numerals: in fig. 2, Conv, 3×3 represents a convolution module with a convolution kernel size of 3×3; MBConv1, 3×3 represents a mobile inverted bottleneck convolution module with an expansion ratio of 1 and a convolution kernel size of 3×3; MBConv6, 3×3 represents a mobile inverted bottleneck convolution module with an expansion ratio of 6 and a convolution kernel size of 3×3; MBConv6, 5×5 represents a mobile inverted bottleneck convolution module with an expansion ratio of 6 and a convolution kernel size of 5×5; Up-Conv2D represents an upsampling module; Concat represents the concatenation operation; Conv2D represents a two-dimensional convolution module; and the numbers in (H, W, C) represent the height, width and number of channels of the image at that step.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
The embodiment provides a crack detection method and device based on semi-supervised semantic segmentation, and the overall architecture is shown in fig. 1 and comprises two parts, namely a student model and a teacher model. The student model and the teacher model have the same structure, and the specific structure thereof will be described later.
Inputting the image with the crack annotation into a student model, and inputting the image without the crack annotation into the student model and a teacher model for training;
the loss function of the student model comprises two parts of supervised loss and unsupervised loss, wherein the supervised loss comprises the combination of dice loss and cross entropy loss; unsupervised loss is similar to regularized loss, and a specific loss value is obtained after multiplication by a weight. The loss function of the teacher model is an unsupervised loss, and a specific formula of the loss function is provided later.
The weights of the student model are updated through gradient descent of the loss function, the weights of the teacher model are updated as an exponential moving average of the student-model weights, and finally F1-score is used to evaluate the accuracy of model prediction.
In this embodiment, noise is added to the input data; the noise has the same matrix form as the input data, with values between -0.2 and 0.2. Through such training, the model of this embodiment acquires a certain anti-noise capability, improving its robustness.
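The perturbation step described above can be sketched in plain Python. The 2-D list representation and the fixed seed are illustrative choices, and since the text does not specify the noise distribution, uniform noise over [-0.2, 0.2] is assumed here:

```python
import random

def add_input_noise(image, low=-0.2, high=0.2, seed=0):
    """Add element-wise noise in [low, high] to a 2-D image (list of rows),
    producing a perturbed image with the same matrix form as the input."""
    rng = random.Random(seed)
    return [[pixel + rng.uniform(low, high) for pixel in row] for row in image]

image = [[0.5, 0.7], [0.1, 0.9]]
noisy = add_input_noise(image)  # same shape; each entry shifted by at most 0.2
```

In training, the student and teacher would each receive an independently perturbed copy of the same image, which is what makes the consistency loss meaningful.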
The weights of the teacher model are obtained by Exponential Moving Average (EMA) of the student models. Thus, the teacher model collects information after each training step and gives the model a better intermediate representation. The parameter update formula is as follows:
θ′_t = α·θ′_{t−1} + (1 − α)·θ_t
where α is a smoothing-coefficient hyperparameter (α = 0.99 during the ramp-up stage and α = 0.999 in the remaining stages), θ_t is the weight of the student model at training step t, and θ′_t is the weight of the teacher model at training step t.
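A minimal sketch of this EMA update in plain Python, with weights flattened into a list; the ramp-up boundary of 1000 steps is an assumed hyperparameter, not specified in the text:

```python
def ema_update(teacher_w, student_w, step, ramp_up_steps=1000):
    """theta'_t = alpha * theta'_{t-1} + (1 - alpha) * theta_t,
    with alpha = 0.99 during ramp-up and 0.999 afterwards."""
    alpha = 0.99 if step < ramp_up_steps else 0.999
    return [alpha * tw + (1.0 - alpha) * sw
            for tw, sw in zip(teacher_w, student_w)]

teacher = [0.0, 1.0]
student = [1.0, 1.0]
teacher = ema_update(teacher, student, step=10)  # ramp-up stage, alpha = 0.99
```

Note that no gradient flows through this update: the teacher only averages the student's trajectory, which is why it collects information after every training step.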
The embodiment completes the training process by the following method:
acquiring annotated data X1 and unannotated data X2, wherein an annotation label is Y1;
inputting X1 and X2 into a student model to obtain an input prediction result P1 of X1 and an input prediction result P2 of X2;
calculating the supervised loss function Lsup(Y1, P1) for X1;
inputting X2 into the teacher model to obtain an input prediction result P3 of X2;
calculating the unsupervised loss function Lunsup(P2, P3) for the unannotated data X2;
calculating the gradient descent of the total loss function Lsup(Y1, P1) + Lunsup(P2, P3);
updating the weight of the student model according to the gradient descent result;
the weights of the teacher model are updated by an exponential moving average.
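The training procedure listed above can be condensed into a single step. In this sketch the networks are stand-in callables and a squared-error placeholder replaces the real dice-plus-cross-entropy supervised term, so only the data flow is faithful to the described method:

```python
def training_step(x1, y1, x2, student, teacher, omega):
    """One semi-supervised step: supervised loss on annotated X1 with labels Y1,
    consistency (MSE) loss between student and teacher on unannotated X2."""
    p1 = student(x1)   # prediction P1 for annotated data X1
    p2 = student(x2)   # student prediction P2 for unannotated data X2
    p3 = teacher(x2)   # teacher prediction P3 for unannotated data X2
    l_sup = sum((p - y) ** 2 for p, y in zip(p1, y1)) / len(y1)    # placeholder for dice + CE
    l_unsup = sum((a - b) ** 2 for a, b in zip(p2, p3)) / len(p2)  # MSE consistency
    return l_sup + omega * l_unsup  # total loss whose gradient updates the student

loss = training_step(
    x1=[0.2, 0.4], y1=[0.0, 1.0], x2=[0.6, 0.8],
    student=lambda xs: [0.5 * x for x in xs],
    teacher=lambda xs: [0.5 * x + 0.1 for x in xs],
    omega=0.1,
)
```

Only the student is updated from this loss; the teacher weights are then refreshed by the exponential moving average.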
This embodiment introduces the codec structure used by the student model and the teacher model. The function of the encoder is to progressively downsample the input image through a convolutional neural network to extract image features. Most CNNs used as encoders come from classification networks such as VGG and ResNet, so weights pre-trained on large datasets can be borrowed and better results obtained through transfer learning. In order to jointly adjust the height, width and number of channels of the image, this embodiment adopts EfficientNet as its encoder. The function of the decoder is to restore the spatial resolution of the extracted features; this embodiment uses UNet as the decoder to perform upsampling and skip-concatenation operations. The structure of the codec is shown in fig. 2.

In this embodiment, EfficientNet uses the Mobile Inverted Bottleneck Convolution (MBConv) module from MobileNetV2 as the basic building block of the model. The module also incorporates the squeeze-and-excitation method of SENet to optimize the network structure. The structure of MBConv is shown in fig. 3. MBConv first increases the number of feature-map channels through a 1×1 convolutional layer, then applies a 3×3 or 5×5 convolution to the expanded channels, followed by the squeeze-and-excitation operation, so that the network can better model channel dependencies and acquire global information. Finally, a second 1×1 convolutional layer projects the number of channels back to the initial number. The specific working process is as follows:
performing convolution processing on an input image to obtain a first-stage image;
performing mobile inverted bottleneck convolution processing twice on the first-stage image to obtain a second-stage image;
performing mobile inverted bottleneck convolution processing three times on the second-stage image to obtain a third-stage image;
performing mobile inverted bottleneck convolution processing three times on the third-stage image to obtain a fourth-stage image;
performing mobile inverted bottleneck convolution processing four times on the fourth-stage image to obtain a fifth-stage image;
performing mobile inverted bottleneck convolution processing four times on the fifth-stage image to obtain a sixth-stage image;
performing mobile inverted bottleneck convolution processing five times on the sixth-stage image to obtain a seventh-stage image;
performing mobile inverted bottleneck convolution processing twice on the seventh-stage image to obtain an eighth-stage image;
and outputting the second-stage image, the third-stage image, the fifth-stage image, the sixth-stage image and the eighth-stage image to a decoder.
The decoder part of this embodiment fuses the feature map obtained by each upsampling operation with the encoder feature map of the same scale, so that the final feature map is restored to the size of the input image. At the first level of the decoder, the skip connection is extended with two 3×3 convolutional layers, each followed by batch normalization and ReLU activation. In the remaining levels, the decoder blocks are residual blocks. In the feature extraction section, five scales of feature maps from EfficientNet are selected in total. In the upsampling part, the last feature map of the encoder is 2× upsampled and then fused with the feature map of the same scale from the feature extraction part. This process is repeated for the feature maps of all five scales in turn, and the size of the final feature map is restored to that of the input image. Compared with the original UNet model, this model has a deeper compression path and contains richer feature information.
This embodiment describes a crack detection method and device based on semi-supervised semantic segmentation. The semi-supervised learning method mainly adds a term related to the unannotated data to the loss function, using the unannotated data to enhance the model's generalization to unknown data. The model employs a consistency-regularization method, whose main idea is that for a given input, even a slightly perturbed version should yield a prediction consistent with that of the original data.
For annotated data, a supervised loss is used, which is a combination of dice loss and cross entropy loss.
The formula of the supervised loss is
L_sup = L_dice + L_cross-entropy
wherein L_dice is the dice loss, L_cross-entropy is the cross entropy loss, p_i is the prediction of the input data, y_i is the true value, and N is the number of data.
The dice loss formula is
L_dice = 1 − (2·Σ_i p_i·y_i) / (Σ_i p_i + Σ_i y_i)
The cross entropy loss formula is
L_cross-entropy = −(1/N)·Σ_i [ y_i·log(p_i) + (1 − y_i)·log(1 − p_i) ]
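The dice and cross entropy losses can be sketched in plain Python over flattened per-pixel probabilities. Combining them by simple addition is an assumption; the text only states that the supervised loss combines the two:

```python
import math

def dice_loss(p, y):
    """Dice loss: 1 - 2 * sum(p_i * y_i) / (sum(p_i) + sum(y_i))."""
    intersection = sum(pi * yi for pi, yi in zip(p, y))
    return 1.0 - 2.0 * intersection / (sum(p) + sum(y))

def cross_entropy_loss(p, y):
    """Binary cross entropy: -(1/N) * sum(y_i*log p_i + (1-y_i)*log(1-p_i))."""
    n = len(p)
    return -sum(yi * math.log(pi) + (1 - yi) * math.log(1 - pi)
                for pi, yi in zip(p, y)) / n

def supervised_loss(p, y):
    """Supervised loss taken as the sum of the two terms (assumed combination)."""
    return dice_loss(p, y) + cross_entropy_loss(p, y)

p = [0.9, 0.1, 0.8]  # predicted crack probability per pixel
y = [1.0, 0.0, 1.0]  # annotated ground truth
```

The dice term handles the strong class imbalance of thin cracks against large backgrounds, while the cross entropy term provides stable per-pixel gradients.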
For unannotated data, the MSE (Mean Squared Error) loss function is used to define the expected distance between the teacher-model and student-model predictions. The unsupervised loss formula is L_unsup = E_{x,η,η′}[‖f(x, θ′, η′) − f(x, θ, η)‖²], where f(x, θ′, η′) is the prediction of the teacher model, f(x, θ, η) is the prediction of the student model, x represents the input data, θ′ and θ represent the teacher and student weights respectively, η′ and η represent noise, and E represents expectation.
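A minimal sketch of this consistency term, treating both predictions as flattened lists; approximating the expectation by a per-batch mean is an assumption:

```python
def consistency_loss(student_pred, teacher_pred):
    """Mean squared error between teacher and student predictions,
    approximating E[||f(x, theta', eta') - f(x, theta, eta)||^2]."""
    n = len(student_pred)
    return sum((t - s) ** 2 for s, t in zip(student_pred, teacher_pred)) / n

same = consistency_loss([0.2, 0.4], [0.2, 0.4])  # identical predictions -> 0
far = consistency_loss([0.0, 0.0], [1.0, 1.0])   # maximally different -> 1
```

Because no labels appear in this term, it can be evaluated on the unannotated second crack images alone.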
Since the student model is trained with only limited annotated data during the initial training phase, its performance is poor and its predictions are unreliable. Therefore, the unsupervised cost is multiplied by a weight ω(t). ω(t) is a widely used time-dependent Gaussian ramp-up function that controls the balance between the supervised loss and the unsupervised consistency loss; it is defined as follows:
ω(t) = 0.1·exp(−5·(1 − t/t_max)²)
where 0.1 is the regularization weight, t is the current training step, and t_max is the total number of training steps.
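This Gaussian ramp-up can be written directly from its definition (the step counts in the example are illustrative):

```python
import math

def ramp_up_weight(t, t_max, w_max=0.1):
    """Gaussian ramp-up: omega(t) = w_max * exp(-5 * (1 - t / t_max)**2)."""
    return w_max * math.exp(-5.0 * (1.0 - t / t_max) ** 2)

start = ramp_up_weight(0, 100)   # ~0.00067: consistency term barely counts early on
end = ramp_up_weight(100, 100)   # 0.1: full regularization weight at the end
```

Early in training the unreliable teacher therefore contributes almost nothing, and its influence grows smoothly toward the regularization weight.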
The proposed semi-supervised semantic segmentation model for crack images learns from annotated and unannotated data through the following total loss function: L_total = L_sup + ω(t)·L_unsup, wherein L_sup is the supervised loss, L_unsup is the unsupervised loss, and ω(t) is the Gaussian ramp-up function.
The experimental data for this embodiment are as follows: when only 60% of the annotated data was used, this embodiment achieved an F1-score of 0.6540 on the Concrete Crack dataset and an F1-score of 0.8321 on the Crack500 dataset. The results show that this embodiment greatly reduces the labeling workload while maintaining high precision.
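The F1-score used for evaluation can be sketched as follows for per-pixel crack masks; the 0.5 binarization threshold is an assumption, as the embodiment does not state one:

```python
def f1_score(pred, truth, threshold=0.5):
    """Pixel-wise F1 = 2 * precision * recall / (precision + recall)."""
    binary = [1 if p >= threshold else 0 for p in pred]
    tp = sum(1 for b, t in zip(binary, truth) if b == 1 and t == 1)
    fp = sum(1 for b, t in zip(binary, truth) if b == 1 and t == 0)
    fn = sum(1 for b, t in zip(binary, truth) if b == 0 and t == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

score = f1_score([0.9, 0.2, 0.7, 0.4], [1, 0, 0, 1])  # tp=1, fp=1, fn=1
```

F1 is preferred over plain accuracy here because crack pixels are a tiny minority class, so accuracy would be dominated by the background.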
The embodiment of the invention also discloses a computer program product or a computer program, which comprises computer instructions, and the computer instructions are stored in a computer readable storage medium. The computer instructions may be read by a processor of a computer device from a computer-readable storage medium, and executed by the processor to cause the computer device to perform the method illustrated in fig. 1.
In alternative embodiments, the functions/acts noted in the block diagrams may occur out of the order noted in the operational illustrations. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality/acts involved. Furthermore, the embodiments presented and described in the flow charts of the present invention are provided by way of example in order to provide a more thorough understanding of the technology. The disclosed methods are not limited to the operations and logic flows presented herein. Alternative embodiments are contemplated in which the order of various operations is changed and in which sub-operations described as part of larger operations are performed independently.
Furthermore, although the present invention is described in the context of functional modules, it should be understood that, unless otherwise stated to the contrary, one or more of the described functions and/or features may be integrated in a single physical device and/or software module, or one or more functions and/or features may be implemented in a separate physical device or software module. It will also be appreciated that a detailed discussion of the actual implementation of each module is not necessary for an understanding of the present invention. Rather, the actual implementation of the various functional modules in the apparatus disclosed herein will be understood within the ordinary skill of an engineer, given the nature, function, and internal relationship of the modules. Accordingly, those skilled in the art can, using ordinary skill, practice the invention as set forth in the claims without undue experimentation. It is also to be understood that the specific concepts disclosed are merely illustrative of and not intended to limit the scope of the invention, which is defined by the appended claims and their full scope of equivalents.
The logic and/or steps represented in the flowcharts or otherwise described herein, e.g., an ordered listing of executable instructions that can be considered to implement logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions. For the purposes of this description, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
In the description herein, references to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
While embodiments of the invention have been shown and described, it will be understood by those of ordinary skill in the art that: various changes, modifications, substitutions and alterations can be made to the embodiments without departing from the principles and spirit of the invention, the scope of which is defined by the claims and their equivalents.
While the preferred embodiments of the present invention have been illustrated and described, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (10)

1. A crack detection method based on semi-supervised semantic segmentation, characterized by comprising the following steps:
acquiring a crack image, wherein the crack image comprises a first crack image with an annotation and a second crack image without the annotation;
inputting the first crack image into a student model, and inputting the second crack image into both the student model and a teacher model to perform crack feature extraction and feature segmentation; the student model completes semi-supervised training through a supervised loss and an unsupervised loss, wherein the supervised loss is obtained by calculating the dice loss and cross entropy loss of the annotated first crack image, and the unsupervised loss is obtained by performing consistency regularization on the second crack image;
updating the weights of the student model through gradient descent of the supervised and unsupervised losses;
updating the weights of the teacher model through an exponential moving average of the student-model weights;
and adjusting the outputs of the student model and the teacher model according to their respective weights so that the two outputs are consistent, thereby obtaining a crack feature prediction result.
2. The crack detection method based on semi-supervised semantic segmentation as recited in claim 1, wherein the structures of the student model and the teacher model are codec (encoder-decoder) structures, each comprising an encoder and a decoder; the encoder comprises a CNN-based image feature extraction network; the decoder comprises a U-Net-based image feature segmentation network; and the first crack image or the second crack image is input into the encoder as an input image, with feature extraction and segmentation performed by the codec to obtain an output image.
3. The crack detection method based on semi-supervised semantic segmentation as recited in claim 2, wherein the CNN-based image feature extraction network comprises an EfficientNet network;
the EfficientNet network comprises 1 convolution layer and 23 mobile inverted bottleneck convolution (MBConv) modules;
each mobile inverted bottleneck convolution module comprises 4 convolution layers of 1×1, 1 depthwise separable convolution layer and 1 global average pooling layer;
the working stage of the EfficientNet network comprises the following steps:
performing convolution processing on an input image to obtain a first-stage image;
performing mobile inverted bottleneck convolution processing twice on the first-stage image to obtain a second-stage image;
performing mobile inverted bottleneck convolution processing three times on the second-stage image to obtain a third-stage image;
performing mobile inverted bottleneck convolution processing three times on the third-stage image to obtain a fourth-stage image;
performing mobile inverted bottleneck convolution processing four times on the fourth-stage image to obtain a fifth-stage image;
performing mobile inverted bottleneck convolution processing four times on the fifth-stage image to obtain a sixth-stage image;
performing mobile inverted bottleneck convolution processing five times on the sixth-stage image to obtain a seventh-stage image;
performing mobile inverted bottleneck convolution processing twice on the seventh-stage image to obtain an eighth-stage image;
and outputting the second-stage image, the third-stage image, the fifth-stage image, the sixth-stage image and the eighth-stage image to a decoder.
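The stage schedule above can be summarized as a repetition table; note that the per-stage MBConv repetitions (2+3+3+4+4+5+2) sum exactly to the 23 modules stated in the claim. The stage labels below are illustrative names, not from the patent:

```python
# MBConv repetitions per stage as listed in the claim; stage 1 is the
# initial plain convolution and contains no MBConv module.
MBCONV_REPEATS = {"stage2": 2, "stage3": 3, "stage4": 3, "stage5": 4,
                  "stage6": 4, "stage7": 5, "stage8": 2}
total_mbconv = sum(MBCONV_REPEATS.values())  # 23, matching the claim

# Stages whose outputs are forwarded to the decoder, per the claim:
SKIP_STAGES = ["stage2", "stage3", "stage5", "stage6", "stage8"]
```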
4. The crack detection method based on semi-supervised semantic segmentation according to claim 2, wherein the U-Net-based image feature segmentation network comprises 5 upsampling layers and 1 convolutional layer; the feature maps output by the encoder are upsampled layer by layer and then passed to the convolutional layer to produce a feature image, which is output as the output image; and the feature image has the same size as the input image.
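Assuming each of the five upsampling layers doubles spatial resolution (the claim states only the layer counts, so the factor of 2 is an assumption), the decoder restores an encoder feature map downsampled by 2⁵ back to the input size:

```python
def decoder_output_size(encoded_size, n_upsample=5):
    # each upsampling layer doubles the spatial resolution (assumed factor)
    size = encoded_size
    for _ in range(n_upsample):
        size *= 2
    return size

# a 7x7 encoder feature map is restored to 224x224, matching a 224x224 input
```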
5. The crack detection method based on semi-supervised semantic segmentation according to claim 1, wherein the supervised loss is formulated as

L_sup = L_dice + L_cross-entropy,

wherein L_dice is the Dice loss,

L_dice = 1 − (2 Σᵢ pᵢyᵢ + ε) / (Σᵢ pᵢ + Σᵢ yᵢ + ε),

L_cross-entropy is the cross-entropy loss,

L_cross-entropy = −(1/N) Σᵢ [yᵢ log pᵢ + (1 − yᵢ) log(1 − pᵢ)],

ε is a smoothing parameter in the loss function, pᵢ is the prediction for input i, yᵢ is the true value, and N is the number of data samples.
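A minimal numpy sketch of the supervised loss, assuming L_sup is the sum of the Dice loss and the cross-entropy loss and assuming illustrative values for the smoothing parameter:

```python
import numpy as np

def dice_loss(p, y, eps=1.0):
    # Dice loss; eps is an assumed smoothing parameter
    inter = np.sum(p * y)
    return 1.0 - (2.0 * inter + eps) / (np.sum(p) + np.sum(y) + eps)

def cross_entropy_loss(p, y, clip=1e-7):
    p = np.clip(p, clip, 1.0 - clip)  # avoid log(0)
    return -np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

def supervised_loss(p, y):
    return dice_loss(p, y) + cross_entropy_loss(p, y)

p = np.array([1.0, 1.0, 0.0, 0.0])  # pixel-wise crack probabilities
y = np.array([1.0, 1.0, 0.0, 0.0])  # ground-truth mask
# a perfect prediction gives a supervised loss near zero
```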
6. The crack detection method based on semi-supervised semantic segmentation according to claim 1, wherein the unsupervised loss is formulated as L_unsup = E_{x,η′,η}[||f(x, θ′, η′) − f(x, θ, η)||²], where f(x, θ′, η′) is the prediction of the teacher model, f(x, θ, η) is the prediction of the student model, x denotes the input data, θ and θ′ denote the weights of the student model and the teacher model respectively, η and η′ denote noise, and E denotes the expectation.
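For a batch of predictions, the consistency term of claim 6 reduces to a mean squared error between teacher and student outputs; a minimal sketch with illustrative values:

```python
import numpy as np

def consistency_loss(teacher_pred, student_pred):
    # empirical estimate of E[||f(x, theta', eta') - f(x, theta, eta)||^2]
    return np.mean((teacher_pred - student_pred) ** 2)

t_pred = np.array([0.8, 0.2])  # teacher prediction (illustrative values)
s_pred = np.array([0.6, 0.4])  # student prediction
# loss = mean of (0.2^2, 0.2^2) = 0.04
```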
7. The crack detection method based on semi-supervised semantic segmentation according to claim 1, wherein the overall loss function of the student model and the teacher model is L_total = L_sup + ω(t)·L_unsup, where L_sup is the supervised loss, L_unsup is the unsupervised loss, and ω(t) is a Gaussian warm-up function of the form

ω(t) = exp(−5(1 − t/t_max)²),

where t is the current training step and t_max is the total number of training steps.
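A sketch of the Gaussian warm-up ω(t), assuming the standard mean-teacher ramp-up form exp(−5(1 − t/t_max)²) since the original formula appears only as an image in the patent:

```python
import math

def gaussian_warmup(t, t_max):
    # omega(t) = exp(-5 * (1 - t/t_max)^2); t is clamped so the weight
    # stays at 1.0 once t reaches t_max
    return math.exp(-5.0 * (1.0 - min(t, t_max) / t_max) ** 2)

# the unsupervised weight grows from exp(-5) ~ 0.007 at t=0 to 1.0 at t=t_max
```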
8. The crack detection method based on semi-supervised semantic segmentation according to claim 1, further comprising: evaluating the accuracy of the model using the F1-score.
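The F1-score of claim 8 is the harmonic mean of precision and recall over pixel-level predictions; a minimal sketch with illustrative counts:

```python
def f1_score(tp, fp, fn):
    # tp/fp/fn: pixel-level true positives, false positives, false negatives
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2.0 * precision * recall / (precision + recall)

# e.g. tp=8, fp=2, fn=2 gives precision = recall = 0.8, so F1 = 0.8
```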
9. The semi-supervised semantic segmentation based crack detection method according to claim 1, wherein the training of the semi-supervised based crack detection semantic segmentation network comprises:
acquiring annotated data X1 and unannotated data X2, wherein an annotation label is Y1;
inputting X1 and X2 into a student model to obtain an input prediction result P1 of X1 and an input prediction result P2 of X2;
calculating the supervised loss function Lsup(Y1, P1) for X1;
inputting X2 into the teacher model to obtain an input prediction result P3 of X2;
calculating the unsupervised loss function Lunsup(P2, P3) for the unannotated data X2;
calculating the gradient of the total loss function Lsup(Y1, P1) + Lunsup(P2, P3) and performing gradient descent;
updating the weights of the student model according to the gradient descent result;
and updating the weights of the teacher model by an exponential moving average.
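The training procedure of claim 9 can be illustrated end-to-end with scalar stand-ins for the two networks; the learning rate, EMA decay, consistency weight, and data values below are all illustrative assumptions, and real models would be the encoder-decoder networks of claim 2:

```python
# Toy one-parameter "models": prediction = theta * x.
theta_s, theta_t = 0.0, 0.0       # student / teacher weights
lr, alpha, omega = 0.2, 0.9, 0.1  # learning rate, EMA decay, ramp weight (assumed)

x1, y1 = 1.0, 2.0  # annotated sample X1 with label Y1
x2 = 3.0           # unannotated sample X2

for step in range(200):
    p1 = theta_s * x1  # student prediction on X1
    p2 = theta_s * x2  # student prediction on X2
    p3 = theta_t * x2  # teacher prediction on X2
    # gradient of L_total = (p1 - y1)^2 + omega * (p2 - p3)^2 w.r.t. theta_s
    grad = 2.0 * (p1 - y1) * x1 + omega * 2.0 * (p2 - p3) * x2
    theta_s -= lr * grad                                # gradient descent
    theta_t = alpha * theta_t + (1 - alpha) * theta_s   # EMA teacher update

# both weights converge toward y1 / x1 = 2.0
```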
10. A computer device comprising a processor and a memory;
the memory is used for storing programs;
the processor executes the program to implement the method according to any one of claims 1 to 9.
CN202111291405.1A 2021-11-02 2021-11-02 Crack detection method and device based on semi-supervised semantic segmentation Pending CN114140390A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111291405.1A CN114140390A (en) 2021-11-02 2021-11-02 Crack detection method and device based on semi-supervised semantic segmentation


Publications (1)

Publication Number Publication Date
CN114140390A true CN114140390A (en) 2022-03-04

Family

ID=80392175


Country Status (1)

Country Link
CN (1) CN114140390A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114708436A (en) * 2022-06-02 2022-07-05 深圳比特微电子科技有限公司 Training method of semantic segmentation model, semantic segmentation method, semantic segmentation device and semantic segmentation medium
CN114742917A (en) * 2022-04-25 2022-07-12 桂林电子科技大学 CT image segmentation method based on convolutional neural network
CN114897909A (en) * 2022-07-15 2022-08-12 四川大学 Crankshaft surface crack monitoring method and system based on unsupervised learning
CN116091773A (en) * 2023-02-02 2023-05-09 北京百度网讯科技有限公司 Training method of image segmentation model, image segmentation method and device
CN117830638A (en) * 2024-03-04 2024-04-05 厦门大学 Omnidirectional supervision semantic segmentation method based on prompt text

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210319547A1 (en) * 2020-04-08 2021-10-14 Zhejiang University Method and apparatus for identifying concrete crack based on video semantic segmentation technology


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
WENJUN WANG 等: "Semi-supervised semantic segmentation network for surface crack detection", AUTOMATION IN CONSTRUCTION, no. 2021, pages 1 - 10 *


Similar Documents

Publication Publication Date Title
CN114140390A (en) Crack detection method and device based on semi-supervised semantic segmentation
CN107679618B (en) Static strategy fixed-point training method and device
CN111832501B (en) Remote sensing image text intelligent description method for satellite on-orbit application
CN111465946A (en) Neural network architecture search using hierarchical representations
CN108765506B (en) Layer-by-layer network binarization-based compression method
CN112905801B (en) Stroke prediction method, system, equipment and storage medium based on event map
CA3108752A1 (en) Neural network circuit device, neural network processing method, and neural network execution program
CN117475038B (en) Image generation method, device, equipment and computer readable storage medium
CN109523016B (en) Multi-valued quantization depth neural network compression method and system for embedded system
CN111382759A (en) Pixel level classification method, device, equipment and storage medium
Wang et al. Real-time topology optimization based on deep learning for moving morphable components
WO2023087597A1 (en) Image processing method and system, device, and medium
CN114676837A (en) Evolutionary quantum neural network architecture searching method based on quantum simulator
Rashid et al. Revealing the predictive power of neural operators for strain evolution in digital composites
KR20200023695A (en) Learning system to reduce computation volume
CN111753995A (en) Local interpretable method based on gradient lifting tree
Vemparala et al. L2pf-learning to prune faster
CN115376195A (en) Method for training multi-scale network model and method for detecting key points of human face
CN115544307A (en) Directed graph data feature extraction and expression method and system based on incidence matrix
US20210256389A1 (en) Method and system for training a neural network
CN114595641A (en) Method and system for solving combined optimization problem
WO2022198233A1 (en) Efficient compression of activation functions
CN114202669A (en) Neural network searching method for medical image segmentation
CN112990289A (en) Data processing method, device, equipment and medium based on multi-task prediction model
Zhang et al. Vision transformer with convolutions architecture search

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination